Science for SEO: Mining Search Engine Query Logs via Suggestion Sampling

This paper, by Googler Ziv Bar-Yossef and his colleague Maxim Gurevich, is about sampling and mining the suggestions that appear as we type into a search box. The methods can be used to estimate the popularity of keywords in the search engine's query log, query volume, and the success rate of the suggestions themselves.

First off, some definitions:

"Monte Carlo" method = "The use of randomly generated or sampled data and computer simulations to obtain approximate solutions to complex mathematical and statistical problems". (Nature)

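To make the Monte Carlo definition concrete, here is a minimal sketch of the textbook example: estimating pi by throwing random points at a unit square. It is not taken from the paper, and the function name estimate_pi and its parameters are illustrative only.

```python
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the share of random points that land inside a quarter circle."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / n_points

print(estimate_pi(100_000))
```

The more points we sample, the closer the estimate tends to get, which is the same trade-off the researchers face when sampling suggestions.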
"Suggestion service" = tries to anticipate what the user is looking for by attempting to auto-complete the query.

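As a rough illustration of what a suggestion service does, the toy below returns the most popular stored queries that start with whatever has been typed so far. The query counts, the suggest function and the cut-off k are all made up for this sketch. Real services are far more sophisticated and do not expose their popularity figures.

```python
# Made-up query log counts, for illustration only.
QUERY_COUNTS = {
    "search engine": 300,
    "seo tools": 120,
    "seo audit": 90,
    "second hand cars": 75,
    "sample size": 40,
}

def suggest(prefix, k=3):
    """Return up to k stored queries that start with prefix, most popular first."""
    matches = [q for q in QUERY_COUNTS if q.startswith(prefix.lower())]
    matches.sort(key=lambda q: QUERY_COUNTS[q], reverse=True)
    return matches[:k]

print(suggest("se"))   # ['search engine', 'seo tools', 'seo audit']
print(suggest("seo"))  # ['seo tools', 'seo audit']
```

Note that "second hand cars" also starts with "se" but is cut off because only the top k suggestions are returned. That cut-off is why the suggestion database cannot simply be read out prefix by prefix.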
The researchers used only data that was freely available to them. Personal user data would obviously be more valuable, but privacy constraints prevent its use. They state that their algorithms do not compromise privacy because:

"(1) they use only publicly available data provided by search engines; (2) they produce only aggregate statistical information about the suggestion database, which cannot be traced to a particular user."

These methods build on two existing applications:

Online advertising and keyword popularity estimation.
Search engine evaluation and ImpressionRank sampling.

"We present two sampling/mining algorithms:

(1) an algorithm that is suitable for uniform target measures (i.e., all suggestions have equal weights); and
(2) an algorithm that is suitable for popularity-induced distributions (i.e., each suggestion is weighted proportionally to its popularity).

Our algorithm for uniform measures is provably unbiased: we are guaranteed to obtain truly uniform samples from the suggestion database. The algorithm for popularity-induced distributions has some bias incurred by the fact suggestion services do not provide suggestion popularities explicitly."
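To give a feel for how uniform sampling from a suggestion service can work at all, here is a simplified sketch based on a random walk down the prefix tree, with importance weights to correct for the fact that some suggestions are easier to reach than others. It is not the authors' algorithm. The toy database, the alphabet, the cap K, the END marker and the assumption that an exact match is always listed first are all hypothetical choices made to keep the example small.

```python
import random

ALPHABET = "abc"  # toy alphabet (assumption)
K = 3             # hypothetical cap on suggestions returned per prefix
END = "$"         # marker meaning "stop here and take the prefix itself"

# Made-up suggestion database: query -> popularity.
DATABASE = {
    "a": 50, "ab": 40, "abc": 10, "ac": 5, "b": 30,
    "ba": 20, "bab": 8, "bb": 3, "ca": 12, "cab": 7,
}

def suggest(prefix):
    """Toy service: up to K completions, exact match first, then by popularity."""
    matches = [q for q in DATABASE if q.startswith(prefix)]
    matches.sort(key=lambda q: (q != prefix, -DATABASE[q]))
    return matches[:K]

def sample_with_weight(rng):
    """Walk down the prefix tree; return (suggestion, weight) or None on a dead end."""
    prefix, prob = "", 1.0
    while True:
        results = suggest(prefix)
        if len(results) < K:
            # Every completion of this prefix is visible: pick one uniformly.
            if not results:
                return None
            choice = rng.choice(results)
            # Weight is 1 / (probability of reaching exactly this suggestion).
            return choice, len(results) / prob
        # A full list may hide further completions: extend the prefix or stop here.
        step = rng.choice(ALPHABET + END)
        prob /= len(ALPHABET) + 1
        if step == END:
            return (prefix, 1.0 / prob) if results[0] == prefix else None
        prefix += step

def estimate_database_size(n_walks, seed=0):
    """Average the weights over many walks (dead ends count as zero)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        outcome = sample_with_weight(rng)
        if outcome is not None:
            total += outcome[1]
    return total / n_walks

print(estimate_database_size(20_000))  # should land close to len(DATABASE)
```

Each suggestion is reachable by exactly one path, so weighting every sample by one over its path probability cancels the bias of the walk. Averaging the weights then estimates the size of the database using nothing but calls to suggest, and the same weights can be used to estimate other uniform statistics.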

Through thorough testing, they found that the uniform sampler is both unbiased and efficient, while the score-induced one (the sampler for popularity-based distributions) is less effective.

The limitations include having to send thousands of queries to the suggestion server (though this is not a big problem, as the effect is marginal) and the fact that the results reflect the suggestion database more than the underlying query log.

For search engine users, it means that good research is being carried out to help us obtain better results, which is always a good thing. For SEO people, it means that as engines help users get ever more focused results, keyword analysis and user behaviour data become even more important.