
September 20, 2008

Another statistical method for IR

Miles Efron published a paper entitled "An Approach to Information Retrieval Based on Statistical Model Selection" this August.  He proposes to use statistical model selection for information retrieval.  

"The proposed approach offers two main contributions. First, we posit the notion of a document's "null model," a language model that conditions our assessment of the document model's significance with respect to the query. Second, we introduce an information-theoretic model complexity penalty into document ranking. We rank documents on a penalized log-likelihood ratio comparing the probability that each document model generated the query versus the likelihood that a corresponding "null" model generated it. Each model is assessed by the Akaike information criterion (AIC), the expected Kullback-Leibler divergence between the observed model (null or non-null) and the underlying model that generated the data. We report experimental results where the model selection approach offers improvement over traditional LM retrieval."

In short, he chooses a single model from a pool of candidate models, favouring models that fit the data well, and uses Occam's razor to avoid overfitting. He ranks documents on a statistic related to AIC, namely the difference between the document model and its corresponding null model. Given a document d_i, he derives a statistic corresponding to a test of the null hypothesis H0. And...

"Specically, we rank documents on the di erence in the Akaike information criterion (AIC) between the non-null and null models. AIC is the expected Kullback- Leibler divergence between a given model and the unknown model that generated the data. Thus ranking documents by AIC di erence o ers a theoretically sound method of conducting IR."

He also states:

"We argue that we can improve retrieval performance by mitigating the role of query- document term coordination. Instead of rewarding documents that match many query terms, we argue, we should reward documents that match the best query terms. Using AIC di erences a ords a natural means of operationalizing this intuition."

He finds that his method rarely does worse than LM retrieval, and that it performs significantly better when a small amount of smoothing is applied to the language models.
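
The paper's exact smoothing setup isn't given here, but for readers unfamiliar with smoothing in LM retrieval, the standard Jelinek-Mercer formula below (a generic illustration, not necessarily Efron's choice) shows what "a small amount of smoothing" means: mix the document's maximum-likelihood estimate with a collection model, using a small mixing weight.

```python
def jm_smoothed_prob(term, doc_counts, doc_len, collection_model, lam=0.1):
    """Jelinek-Mercer smoothing of a unigram document model:
    p(t|d) = (1 - lambda) * p_ml(t|d) + lambda * p(t|collection).
    A small lambda corresponds to a small amount of smoothing."""
    p_ml = doc_counts.get(term, 0) / doc_len if doc_len else 0.0
    return (1 - lam) * p_ml + lam * collection_model.get(term, 1e-12)
```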

It's interesting to see another statistical method, and I'd love to see more evaluation and progress; it looks promising. Personally, I am of the opinion that a mixture of linguistic models and statistical models is necessary; using one or the other alone is limiting. I've talked about the Lemur project before, and that is statistical as well. N-grams, Markov models, tf-idf, the query likelihood model, multivariate Bernoulli: there are a lot of different techniques, and very prominent and much-respected people have worked on them and are working on them right now. But let's not forget to involve the linguists now and again.

Also, let's remember one of the well-known computer science sayings:

"If enough data is collected, anything can be proved by statistical methods" ( I might not include this quote in my thesis)

