August 07, 2008

Google stopwords

There's been a lot of talk around a Google patent called "Locating meaningful stopwords in keyword-based retrieval systems". I won't explain all the intricacies of it, seeing as there have already been many blog articles on this, but I recommend reading SEO by the Sea's version of the facts.

Stopwords are basically words that carry little useful information for information retrieval (IR) work, such as "a" and "the". The general rule so far has been that it's far more important to extract the named entities instead. This patent proposes something quite different: evaluating the stopwords to figure out which ones are actually useful to the search. As far as I know, this hasn't been looked at before.

I think it's really useful to read the references listed in the patent to get a good understanding of how the method came to be.

Stopword removal is really useful because in the past it has generally improved IR performance, and it decreases the index size. It has however been observed that removing too many stopwords can harm retrieval effectiveness. "To be or not to be" is a common example of a query that causes problems with stopword removal, because it is made up almost entirely of words found on typical stopword lists. Stopword lists are usually constructed by taking the n most frequent terms in a corpus. Alternatively, you can start from a general stopword list and remove the useful stopwords from it. You can, for example, get the system to delete stopwords unless they are preceded by the + operator.
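To make those two ideas concrete, here is a minimal Python sketch of building a stopword list from the n most frequent terms and of keeping a stopword when the user marks it with +. The function names (build_stopword_list, filter_query), the simple regex tokenizer and the whitespace query splitting are my own assumptions for illustration; this is not how the patent or any particular search engine does it.

```python
import re
from collections import Counter


def build_stopword_list(documents, n):
    """Return the n most frequent terms in the corpus as a set (hypothetical helper)."""
    counts = Counter()
    for doc in documents:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return {term for term, _ in counts.most_common(n)}


def filter_query(query, stopwords):
    """Drop stopwords from a query, unless the term is prefixed with '+'."""
    kept = []
    for token in query.lower().split():
        if token.startswith("+"):
            term = token[1:]
            if term:
                # The '+' operator forces the term to be kept.
                kept.append(term)
        elif token not in stopwords:
            kept.append(token)
    return kept


stopwords = {"to", "be", "or", "not", "the"}
print(filter_query("to be or not to be", stopwords))      # nothing survives
print(filter_query("+to +be or not +to +be", stopwords))  # forced terms survive
```

With this toy list the plain query ends up empty, which is exactly the "to be or not to be" problem, while the + version keeps the terms the user insisted on.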

Many systems use n-grams, and these can yield really useless bigrams such as "and the". However, it's important to be cautious when getting rid of stopwords that accompany nouns, as it's possible to discard valuable information.
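As a rough illustration of that trade-off, the sketch below drops a bigram only when both of its words are stopwords, so "and the" goes but a stopword sitting next to a content word stays. The helper names and the tiny stopword set are assumptions of mine, and the band-name query is just an example I picked; real systems will be more subtle than this.

```python
def bigrams(tokens):
    """Return adjacent token pairs, e.g. ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]."""
    return list(zip(tokens, tokens[1:]))


def useful_bigrams(tokens, stopwords):
    """Keep a bigram unless BOTH of its words are stopwords (hypothetical rule)."""
    return [
        (first, second)
        for first, second in bigrams(tokens)
        if not (first in stopwords and second in stopwords)
    ]


# Assumed toy stopword list; "who" is deliberately left out of it here.
stopwords = {"the", "and"}
tokens = "the who and the rolling stones".split()
print(useful_bigrams(tokens, stopwords))
```

Here ("and", "the") is thrown away, but ("the", "who") survives, so the band name is not lost the way it would be if every stopword were stripped before building the bigrams.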

Gregory Marton from MIT tested stopword retention rather than removal and concluded:

"Removing stopwords significantly hurt precision among description-only runs because many of the descriptions were now so short that recall became more coarse-grained, and thus more difficult to threshold".

It's interesting because it means that a more contextual approach is being taken.

