August 13, 2008

Mining query logs: a Google paper

Ziv Bar-Yossef (Haifa and Google) and Maxim Gurevich (Dept. of Electrical Engineering, Technion, Haifa) have written a paper, presented at VLDB, entitled "Mining Search Engine Query Logs via Suggestion Sampling".

A suggestion service is what you see when you start typing a query and an algorithm returns the k best suggestions to you. It helps you refine your query. The suggestions come from a hidden database built from past queries or from dictionaries (for example, lists of place names). Suggestion sampling means drawing random samples from that hidden database.
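To make the idea concrete, here is a toy sketch of a suggestion service in Python. It is not the search engine's real implementation: the class name `ToySuggestionService`, the in-memory query log and the "rank by raw count" rule are all assumptions made for illustration.

```python
from collections import Counter


class ToySuggestionService:
    """Hypothetical stand-in for a public suggestion interface."""

    def __init__(self, query_log, k=3):
        # The "hidden database": how often each past query was issued.
        self.counts = Counter(query_log)
        self.k = k

    def suggest(self, prefix):
        # Keep every stored query that completes the typed prefix.
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        # Most popular first; ties broken alphabetically (an assumption).
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        # Only the k best suggestions are ever shown to the user.
        return [q for q, _ in matches[: self.k]]


service = ToySuggestionService(
    ["paris", "paris", "paris", "paris hotels", "paris hotels",
     "paris weather", "parking"],
    k=2,
)
print(service.suggest("par"))
```

The point to notice is that the caller only ever sees k results per prefix, so the rest of the database stays hidden unless you probe it with many different prefixes.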

"In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine’s query log, estimation of the volume of commercially oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content."

Their methods do not compromise privacy because they only use information provided by the search engine and aggregate statistical information that can't be traced to an individual user. So no panic.

They use Monte Carlo methods to get "unbiased samples from the suggestion database", an approach also used in search engine sampling and measurement. They need these methods because they can't sample suggestions directly from the target distribution; the public suggestion interface is their only way in.
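Importance sampling is one standard Monte Carlo technique for this kind of situation, and the sketch below shows the general idea only; I am not claiming it reproduces the paper's procedure. You draw from a biased "trial" distribution you can actually sample, then reweight each draw by target probability over trial probability. The function name, the four items and their probabilities are all made up for the example.

```python
import random


def estimate_under_target(items, trial_probs, target_probs, f, n_samples, seed=0):
    """Estimate the target-distribution average of f using biased draws."""
    rng = random.Random(seed)
    # Draw from the trial distribution, the only one we can sample directly.
    draws = rng.choices(items, weights=[trial_probs[x] for x in items], k=n_samples)
    # Importance weight: how under- or over-represented each draw is.
    weights = [target_probs[x] / trial_probs[x] for x in draws]
    weighted_sum = sum(w * f(x) for w, x in zip(weights, draws))
    # Self-normalised estimator: divide by the total weight.
    return weighted_sum / sum(weights)


items = ["a", "b", "c", "d"]
trial = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # biased sampler (assumed)
target = {x: 0.25 for x in items}                   # uniform target
is_flagged = lambda x: 1.0 if x in {"a", "d"} else 0.0

estimate = estimate_under_target(items, trial, target, is_flagged, n_samples=20000)
print(estimate)  # should land close to the true uniform average of 0.5
```

Without the weights, the naive average would drift towards whatever the trial distribution over-samples, which is exactly the bias the reweighting removes.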

They state that their technique could be very useful for online advertising, for assessing "the quality of search engines, and for user behavior studies". You can basically estimate the popularity of given keywords. You can compare alternative keywords and see which ones draw search engine traffic. They say that this way you can track the popularity of your keywords over time.

Because it's not possible to estimate the quality of a search engine's index directly, they use a measure called "ImpressionRank" to evaluate it. Whenever a query comes in to the search engine, each of the top-ranked results receives an impression. "The ImpressionRank of a page x is the (normalized) amount of impressions it receives from user queries in a certain time frame."
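As a rough illustration of that definition, the sketch below counts impressions over a tiny made-up query log. The function name `impression_rank`, the page names and the choice to normalise so that all scores sum to 1 are my assumptions; the quote only says "normalized" without spelling out how.

```python
from collections import Counter


def impression_rank(query_log, top_results):
    """Share of all impressions that each page receives.

    query_log   -- queries issued in the time frame (repeats allowed)
    top_results -- hypothetical mapping: query -> pages shown at the top
    """
    impressions = Counter()
    for query in query_log:
        # Every page shown for this query instance gets one impression.
        for page in top_results.get(query, []):
            impressions[page] += 1
    total = sum(impressions.values())
    if total == 0:
        return {}
    # Normalise so that the scores sum to 1 (an assumption).
    return {page: n / total for page, n in impressions.items()}


log = ["seo tips", "seo tips", "query logs"]
shown = {"seo tips": ["page1", "page2"], "query logs": ["page1"]}
ranks = impression_rank(log, shown)
print(ranks)
```

Here page1 is shown for all three queries and page2 for two of them, so page1 ends up with the larger share.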

Their 2 algos are:

(1) an algorithm that is suitable for uniform target measures (i.e., all suggestions have equal weights); and (2) an algorithm that is suitable for popularity-induced distributions (i.e., each suggestion is weighted proportionally to its popularity).
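The difference between the two target measures is easy to see on a toy database. The sketch below only builds the two target distributions; it is not either of the paper's samplers, and the function name, the three suggestions and their counts are invented for the example.

```python
def target_distribution(popularity, mode):
    """Turn suggestion popularity counts into a target distribution.

    mode == "uniform"    -> every suggestion has equal weight
    mode == "popularity" -> weight proportional to the suggestion's count
    """
    if mode == "uniform":
        weights = {s: 1.0 for s in popularity}
    elif mode == "popularity":
        weights = {s: float(c) for s, c in popularity.items()}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical hidden database: suggestion -> number of times queried.
popularity = {"cheap flights": 6, "weather": 3, "haifa": 1}

uniform = target_distribution(popularity, "uniform")
by_popularity = target_distribution(popularity, "popularity")
print(uniform)
print(by_popularity)
```

Under the uniform measure "cheap flights" is one suggestion in three; under the popularity measure it carries 6 of the 10 units of weight. Which one you want depends on whether you are counting distinct suggestions or the traffic behind them.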

In conclusion, they found that their uniform sampler is unbiased and efficient, and that the score-induced sampler (the popularity-weighted one) doesn't work as well.

If you'd like to read more details and look at the long equations, please read this very well written paper (I might get more publications if I write like this).

This paper matters for SEO people right now because of all the discussion going on at the moment about rankings and how important they are. These scientists show you that there are other methods for search engines to introduce some kind of personalisation, and it is out of reach for SEO because you can't control the suggestion service. I think it is time to work on something other than ranking analysis.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at