
August 13, 2008

Mining query logs: a Google paper

Ziv Bar-Yossef (Technion, Haifa, and Google) and Maxim Gurevich (Dept. of Electrical Engineering, Technion, Haifa) have written a paper, presented at VLDB, entitled "Mining Search Engine Query Logs via Suggestion Sampling".

A suggestion service is what you see when you type in a query and an algorithm returns the k best suggestions to help you refine it. These suggestions come from a database of past queries, and sometimes from dictionaries, lists of place names and so on. Suggestion sampling means drawing suggestions from that hidden database through the public interface.
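To make the setup concrete, here is a toy sketch of what such a suggestion service might look like on the inside. Everything here (the query log, the frequencies, the function names) is invented for illustration; real engines use large trie- or FST-based structures, not a dictionary scan.

```python
from collections import Counter

# A toy "suggestion database": past queries with their frequencies.
query_log = Counter({
    "cake making": 120,
    "cake recipes": 95,
    "cake band": 30,
    "car insurance": 200,
    "car hire": 80,
})

def suggest(prefix, k=3):
    """Return the k most popular logged queries starting with prefix."""
    matches = [(q, n) for q, n in query_log.items() if q.startswith(prefix)]
    matches.sort(key=lambda t: -t[1])
    return [q for q, _ in matches[:k]]

print(suggest("cake"))   # the k best suggestions, most popular first
```

The point of the paper is that this popularity information is hidden: an outsider only ever sees the top-k list, never the counts behind it.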

"In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine’s query log, estimation of the volume of commercially oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content."

Their methods do not compromise privacy because they only use information provided by the search engine and aggregate statistical information that can't be traced to an individual user. So no panic.

They use Monte Carlo methods to get "unbiased samples from the suggestion database", a technique also used in search engine sampling and measurement. They need it because they can't sample suggestions directly from the target distribution; they can only draw them through the public interface.
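The core Monte Carlo idea can be sketched with rejection sampling: draw from a distribution you can sample, then accept or reject draws so that what survives follows the distribution you actually want. This is a toy illustration only, with a made-up trial distribution; the paper's real samplers work through the suggestion interface with their own bias corrections.

```python
import random

# Trial distribution we CAN sample from (biased), invented for the demo.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
items, weights = zip(*p.items())

def trial_sample():
    """Draw one item from the biased trial distribution p."""
    return random.choices(items, weights=weights)[0]

def uniform_sample():
    """Turn biased draws into uniform draws: accept a draw x with
    probability min(p) / p(x), so acceptance cancels the bias."""
    min_p = min(p.values())
    while True:
        x = trial_sample()
        if random.random() < min_p / p[x]:
            return x
```

Accepting x with probability proportional to 1/p(x) makes the accepted samples uniform, at the cost of some rejected (wasted) draws, which is why the paper cares so much about sampler efficiency.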

They state that their invention could be very useful for online advertising, for assessing "the quality of search engines, and for user behavior studies". You can basically estimate the popularity of given keywords, compare alternative keywords by the search traffic they attract, and, they say, track the popularity of your keywords over time.

Because it's not possible to estimate the quality of a search engine's index directly, they use a measure called "ImpressionRank" to evaluate it. Whenever a query comes in to the search engine, the top-ranked results receive an impression. "The ImpressionRank of a page x is the (normalized) amount of impressions it receives from user queries in a certain time frame."
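Going by the quoted definition, a minimal sketch of ImpressionRank might look like this. The query log, volumes, and page names below are all invented, and I'm assuming the simplest possible model: every query occurrence gives one impression to each page shown in its top results, normalized over all pages.

```python
from collections import defaultdict

# Toy query log: (query, query volume, top-ranked pages for that query).
log = [
    ("cake making",   100, ["recipes.example", "bake.example"]),
    ("cake recipes",   60, ["recipes.example", "cook.example"]),
    ("car insurance", 200, ["insure.example"]),
]

def impression_rank(log):
    """Normalized share of all impressions each page receives."""
    impressions = defaultdict(int)
    for _query, volume, pages in log:
        for page in pages:
            impressions[page] += volume   # one impression per query occurrence
    total = sum(impressions.values())
    return {page: n / total for page, n in impressions.items()}
```

Notice that a page ranking for one high-volume query can out-score a page ranking for several low-volume ones, which is exactly why this is a user-exposure measure rather than an index-size measure.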

Their two algorithms are:

1. an algorithm that is suitable for uniform target measures (i.e., all suggestions have equal weights); and
2. an algorithm that is suitable for popularity-induced distributions (i.e., each suggestion is weighted proportionally to its popularity).

In conclusion they found that their uniform sampler is unbiased and efficient and that the score-induced sampler doesn't work as well.

If you'd like to read more details and look at the long equations, please read this very well written paper (I might get more publications if I write like this).

The importance of this paper right now for SEO people is all the discussion going on at the moment about rankings and their importance. These scientists show that there are other methods for search engines to introduce a kind of personalisation that is out of reach for SEO, because you can't control the suggestion service. I think it is time to work on something other than ranking analysis.

A funny A.I. quote by Searle

Searle is a really important figure in A.I. because of his wonderful philosophical perspective on things. The "Chinese Room" was a good point to make about the Turing test. He also did a lot of work on language, like his speech acts research.

Anyway...here's the quote:

"Because we do not understand the brain very well we are constantly tempted to use the latest technology as a model for trying to understand it. In my childhood we were always assured that the brain was a telephone switchboard. ('What else could it be?') I was amused to see that Sherrington, the great British neuroscientist, thought that the brain worked like a telegraph system. Freud often compared the brain to hydraulic and electro-magnetic systems. Leibniz compared it to a mill, and I am told some of the ancient Greeks thought the brain functions like a catapult. At present, obviously, the metaphor is the digital computer." John R Searle

August 11, 2008

Multi-document summarization patent

Kathleen McKeown and Regina Barzilay (also a Microsoft faculty fellow) were granted a patent on the 29th April 2008 (assignee: Columbia University in New York). It is entitled "Multi-document summarization system and method".

Basically the idea is to create relevant summaries from a number of documents containing the correct type of information. These summaries are then presented to the user, ideally containing all the information s/he asked for.

They extract phrases from the documents, analyse them using phrase intersection analysis (which identifies the relevant phrases), remove ambiguous time references, and then generate a sentence to include in the summary. The system comprises a storage device for storing the documents in the collection, a lexical database, and a processing subsystem. The lexical database is described as something like WordNet. WordNet is like the Google of lexical databases: it pops up all over the place in systems and research papers (and yes, I use it too. It works, although I find it restricted for certain domains and have had to extend it).
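The phrase-intersection step can be sketched roughly like this. This is a toy version, not the patented method: I'm standing in "phrases" with word bigrams, and the example sentences are invented. The idea is simply that phrases recurring across several documents are likely to be the shared themes worth summarizing.

```python
import re
from collections import Counter

docs = [
    "The factory fire started late on Monday night.",
    "Officials said the factory fire was under control by Tuesday.",
    "The fire at the factory injured three workers on Monday.",
]

def bigrams(text):
    """Word bigrams of a document, as a set (each doc counts once)."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(zip(words, words[1:]))

def shared_phrases(docs, min_docs=2):
    """Bigrams appearing in at least min_docs documents."""
    counts = Counter()
    for d in docs:
        counts.update(bigrams(d))
    return {" ".join(bg) for bg, n in counts.items() if n >= min_docs}
```

A real system would then feed phrases like these into sentence generation; here they would pick out "factory fire" as the common thread across the collection.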

The whole point of a system like this is to help the user who often faces information overload and does not have the time to scan all the documents presented. This system extracts the right stuff from the right documents and presents a summary of all that information.

They state "For individual documents, domain-dependent template based systems and domain-independent sentence extraction methods are known. Such known systems can provide a reasonable summary of a single document. However, these systems are not able to compare and contrast related documents in a document set to provide a summary of the collection".

This is why their method is really quite cool. It nicely supports my theory of not having to go to the website at all in the future, and just using a few good systems to do all the searching and site visiting for you.

It presents some interesting challenges for SEO, because here the main task would be providing incredible content and really relevant images. You'd still have to prove to the system, as you do to the search engine, that you're relevant to the topic. But I think natural language queries will be used more and more, which means that optimisation would have to become more like contextual search, rather than keyword-based optimisation and that kind of thing. So it all gets more and more complicated as we advance and refine, both for computer scientists and SEO experts. I seem to cheer on both sides :)

August 08, 2008

Cuil vs Powerset

There's been an awful lot of talk around Cuil, the alternative search engine to Google that xooglers launched recently. There hasn't been much noise around Powerset (founded in 2005), recently acquired by Microsoft. It's important to note that the engine is running on Wikipedia for now.

The guys from Powerset get to join the core Search Relevance team. Powerset is a "semantic" search engine, meaning that they move more towards intent-based search. The engine does not base the importance of documents on links but rather on content, using dictionaries, thesauri, syntax, sentence structure, and a whole host of other NLP tools to extract meaning.

Powerset people state: "Powerset is first applying its natural language processing to search, aiming to improve the way we find information by unlocking the meaning encoded in ordinary human language."

Cuil owners state that it is a "contextual" engine. They also state: "When we find a page with your keywords, we stay on that page and analyze the rest of its content, its concepts, their inter-relationships and the page’s coherency."

Sounds pretty similar doesn't it?

What's the difference?

Well Powerset allows you to use natural language queries (normal conversation expression), and also aggregates information across multiple articles.

The interface is pretty cool, and very very different to Google or Cuil for that matter. You type in a query (cake making) and you get a list of wikipedia results. There is a little drop down button on the left of the result, which when pressed displays beneath the result the actual text (and images if you want) of that result. I don't have to go to the actual website to see the info. I can also click on a display of the article to go to the part of the document that I want.

There are also links at the bottom suggesting related searches, and lo and behold, they are all relevant.

It doesn't do so well with natural language queries, which I am not surprised about. No one, as far as I know, has managed to do a properly good job on this problem yet. It requires natural language understanding, which we haven't found a good solution for. It isn't rubbish though: I tried "How do I make cake" and I did get some instructions as a first result. Of course "cake" is also a band, and there was no disambiguation. The very relevant links at the bottom do help a great deal though.

Cuil is a lot more traditional in that it gives you a number of individual documents to look at. They're not ranked in any particular order, and they are in columns. "How do I make cake" returns nothing, because they don't support natural language querying, which I find a bit concerning because this is something that should eventually become the norm; there is loads of research in this area. So I enter "cake making", and for this I get a mixed bag of results. A few are relevant, one is perfect, but the others are a bit of a mess. I don't get any related searches suggested, which limits my options.

I do see Powerset working in a way that indicates they really are working hard to "unlock meaning": it suggests a load of things, and the results are varied enough for me, but still very relevant for the most part. Cuil, however, doesn't seem to be doing anything apart from throwing out a bag of results ranging from relevant to totally not, and it doesn't help me get better results.

Powerset is a superior engine imho. Cuil, to me, is in need of a trip back to the drawing board. One very important thing to note, however, is that Powerset is working in a "closed domain", which is Wikipedia. Cuil apparently has three times more documents than Google. That's a really big index. In IR people mostly try to keep the index small, because of cost but also performance. With an index that big, I think you'd need to be pretty confident that your super tough and clever algorithm can handle all the intricacies of language and all of the analysis of individual documents, as well as clusters of documents.

What Powerset does that I really like is not actually take you to the webpage unless you request it. You can read the info from the results page. How does this affect SEO? Content, content, content... I think there are going to be more and more openings for writers and the like in the future. The optimisation becomes based not around individual pages but rather around your entire site, which is checked for relevancy. If only one of your pages is about "cake" and the others are about "candles", I don't think you'd return for "cake" even if that particular page was really optimised. You might return for "birthdays", though, if you see what I mean.

August 07, 2008

Google stopwords

There's been a lot of talk around a Google patent called "Locating meaningful stopwords in keyword-based retrieval systems". I won't explain all the intricacies of it, seeing as there have already been many blog articles on this, but I recommend reading SEO by the Sea's version of the facts.

Stopwords are basically words that don't carry any useful information when performing IR work, such as "a", "the", etc. It's far more important to extract the named entity instead. Well, this is the general rule so far. This patent proposes something quite different: evaluating the stopwords to figure out which ones are actually useful to the search. This hasn't been looked at so far, as far as I know that is.

I think it's really useful to read the references listed in the paper to get a good understanding of how the method came to be.

Stopword removal is really useful because in the past it has always improved IR performance, and it decreases the index size. It has however been observed that removing too many can harm retrieval effectiveness. "To be or not to be" is a common example which causes problems with stopword removal. Stopword lists are usually constructed by using the n most frequent terms in a corpus. A general stopword list can be issued and the useful stopwords removed from it. You can, for example, get the system to delete stopwords unless they are preceded by the + operator.
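Both the failure mode and the + operator fix are easy to show in a few lines. The stopword list here is a tiny made-up one (real lists come from corpus frequencies, as above), but the behaviour is the same:

```python
# A tiny illustrative stopword list, not a real one.
STOPWORDS = {"to", "be", "or", "not", "a", "the", "and", "of"}

def strip_stopwords(query):
    """Naive removal: drop stopwords, unless the user marked a term
    with a leading '+' to force it to be kept."""
    kept = []
    for term in query.lower().split():
        if term.startswith("+"):
            kept.append(term[1:])      # '+' operator: always keep the term
        elif term not in STOPWORDS:
            kept.append(term)
    return kept

print(strip_stopwords("to be or not to be"))      # → [] : the whole query vanishes
print(strip_stopwords("+to +be or not +to +be"))  # the forced terms survive
```

This is exactly the kind of query the patent's "meaningful stopwords" evaluation is meant to rescue automatically, without making the user type + signs.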

Many systems use n-grams and these can yield really useless bigrams such as "and the" for example. However it's important to be cautious when getting rid of stopwords accompanying nouns, as it's possible to discard valuable information.

Gregory Marton from MIT tested stopword retention rather than removal and concluded:

"Removing stopwords significantly hurt precision among description-only runs because many of the descriptions were now so short that recall became more coarse-grained, and thus more difficult to threshold".

It's interesting because it means that a more contextual approach is being taken.

Google gadgets hacked

A quick post on those cool little gadgets for iGoogle. Two security consultants from SecTheory demonstrated an attack on them at the Black Hat hackers conference in Las Vegas. They broke into a web browser and read all of the user's personal searches in real time. Gadgets that store personal info are the most at risk, of course.

Google says that it regularly scans all the gadgets for malicious code; ones containing malicious code are immediately blacklisted.

They also say that since November 2007 no new "inline" gadgets have been accepted, because they store personal information.

Companies are always making gadgets to promote themselves, like adding a route planner application for users to include on their page. This means that users are more likely to go to that site for services such as buying car insurance. If users start to mistrust the gadgets, then these will be useless.

I guess not allowing personal information to be stored is a good way of averting these particular hacks but there are plenty of other malicious hacks that will surface I'm sure. It's not just Google who is faced with this problem but also sites like Facebook for example.

Google's Music Onebox

Google just launched legal music search in China, called Music Onebox. Music can be streamed or downloaded for free. iTunes and other services aren't available in China. Maybe it's worth rolling out some tunes then to get visitors to your site through this service... depending on what industry you work in of course :)

"This legal music service will help users avoid dead links, slow downloads, inaccurate search results, and poor quality or incomplete songs," Google said in a statement.

Read more here.

August 06, 2008

SIGIR 08 coverage

SIGIR is pretty much THE IR conference of the year. I didn't go this year, but Greg Linden did and has extensive coverage of the most important papers presented this year over at his blog.

You'll find:

- "BrowseRank"
- "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines"
- "To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent"
- "A User Browsing Model to Predict Search Engine Click Data from Past Observations"

Well worth a read, seeing as these ideas form the future of information retrieval.

Mozilla Aurora browser

Mozilla labs just released a video on their Aurora browser project. It's in Mozilla labs and it's open source as well which is cool.

The footage shows really innovative ways to collaborate online, interact with data, look at data differently, and share stuff easily with your friends, work colleagues, or anyone you want. It looks cool.


Aurora (Part 1) from Adaptive Path on Vimeo.
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.