
July 30, 2008

Lemur toolkit

The Lemur toolkit is a natural language processing and information retrieval toolkit. Having a go with it is a nice way of seeing some IR technologies functioning first hand, rather than probing a major search engine and guessing at what's going on underneath.

It supports all major languages, performs stemming using the Porter and Krovetz stemmers, indexes loads of file formats, offers part-of-speech tagging and named entity recognition, and of course has an API (C++, C# and Java).

For retrieval:
  • Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
  • Relevance- and pseudo-relevance feedback
  • Wildcard term expansion (using Indri)
  • Passage and XML element retrieval
  • Cross-lingual retrieval
  • Smoothing via Dirichlet priors and Markov chains
  • Supports arbitrary document priors (e.g., Page Rank, URL depth)
Best of all, it's free! There's a set of tutorials to get you started here.
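To give a flavour of the language-modelling retrieval mentioned above, here's a toy sketch of query-likelihood scoring with Dirichlet-prior smoothing. The formula is the standard one; the token-list documents are my own invention, and this is not Lemur's actual code:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Score a document against a query using the Dirichlet-smoothed
    language model p(w|d) = (c(w,d) + mu*p(w|C)) / (|d| + mu)."""
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_col = col_tf[term] / len(collection)   # background model p(w|C)
        if p_col == 0:
            continue                             # term unseen anywhere
        p = (doc_tf[term] + mu * p_col) / (len(doc) + mu)
        score += math.log(p)                     # accumulate log p(q|d)
    return score
```

Ranking a collection is then just sorting documents by this score for a given query.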

There is a new engine from the Lemur project called Indri which uses inference networks.

July 29, 2008

Cuil and contextual search

I wasn't going to blog about the newly released search engine Cuil, because there's already so much talk about it all over the place, and the collection of resources available will give you good insight and an awful lot of info, which I don't need to repeat here. However, it would be a bit rude of me not to acknowledge it and give my couple of pence on it.

In short, Cuil is a new search engine launched by Anna Patterson, Russell Power and Louis Monier. Ex-Google employees, imho, are top-class engineers and scientists, which means they have the skills to make a good engine.

People have tested it and report bad results: images associated with the wrong sites, porn showing up in safe mode, a mix-up of links internal to the engine, robots.txt not being respected, well-known important websites omitted, and so on. Basically not a good start. I've also noticed they don't make great use of stemming, but seeing as they use a form of contextual search, that might make sense, depending on what they're trying to do.

Contextual search: a method based on searching the text of a page in any part of the file rather than in pre-defined fields. It's a similarity measure: context, inter-relationships and coherence are measured and analysed in order to give good information about the page. Google uses largely statistical methods, and to be honest most IR is based on those, but contextual search is more language-oriented.

A good example of this concept is given by Miley Watts and Anthony Coats:

"For example, a user searching for components with both “Automotive” and “North America” contexts should receive results that include business components tagged with “Automotive Engines” and “Detroit” (but which do not have “Automotive” nor “North America” as direct contexts). Equally, if the contexts “United States of America”, “USA”, and “America” all refer to the same country, then a search for any one of those contexts should return results from all three equivalent contexts."
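The idea in that example can be sketched with a toy context index. Everything below (the equivalence sets, the narrower-than links) is invented for illustration; a real contextual engine would derive these relationships rather than hard-code them:

```python
# All context names in these tables are made up for illustration.
_EQUIV_SETS = [{"usa", "united states of america", "america"}]
EQUIVALENT = {name: s for s in _EQUIV_SETS for name in s}

NARROWER = {                       # child context -> parent context
    "automotive engines": "automotive",
    "detroit": "north america",
}

def expand(context):
    """All context labels that should satisfy a search for `context`."""
    context = context.lower()
    terms = set(EQUIVALENT.get(context, {context}))
    for child, parent in NARROWER.items():
        if parent in terms:        # narrower contexts qualify too
            terms.add(child)
    return terms

def matches(doc_contexts, query_contexts):
    """True if every query context is covered by some tag on the document."""
    tags = {c.lower() for c in doc_contexts}
    return all(expand(q) & tags for q in query_contexts)
```

With this, a document tagged "Automotive Engines" and "Detroit" satisfies a query for "Automotive" plus "North America" even though neither appears as a direct tag, which is exactly the behaviour the quote describes.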

Let's try this in Cuil:

I typed in "Automotive" and "North America" and got results about:
  • Automotive designLine
  • Automotive modules
  • Unusual automotive solutions
  • Automotive accessories (ipod)
  • Automotive testing
  • Linux automotive
  • BMW (north america)
  • AERA engine builders association (north america)
  • Automotive testing (x2 results)
  • buy/sell cars (canada)
The "USA" and "America" example doesn't work either.

This doesn't live up to the academic example, does it? It doesn't look very contextual at all. Some of those results have nothing to do with North America. Some do: BMW is probably an OK result, as is AERA. "Buy/sell cars" is totally irrelevant.

Something is clearly wrong. Everyone else is right: the results aren't really relevant, and there isn't much evidence of good contextual search either. Yes, it's not a simple keyword search or plain text-based search, but where is the contextual stuff?

Why? There are good people with exceptional skills and expertise working on this. I can't understand what has gone so wrong.

But I like the interface personally, and the right-hand-side categories, and think it looks fine.

This SIGIR paper will show you the use of contextual analysis for email, which is interesting.

July 28, 2008

BrowseRank explained

Check out the cool article at Seobook for BrowseRank info. It's easy to read, and he's done a better job than I would of explaining it simply.

New stuff in IR

The BCS Informer newsletter is out with a whole load of easy-to-digest info on new stuff emerging in IR right now, such as automatic text analysis and multimedia retrieval.

A few highlights for you:

The annual ECIR (European Conference on Information Retrieval) took place recently in Glasgow. Amit Singhal of Google, Nick Belkin of Rutgers University, and Bettina Berendt of K.U. Leuven all gave interesting talks. They covered the main challenges in IR at the moment, namely user interaction, information seeking and IR, intentions, scalability, ranking, and privacy issues... Amit Singhal said that the main thing for users right now is “This is what I said, give me what I want”.

The conference also covered cross-language IR, IR models, evaluation, web IR, and new topics such as social media.

The view was that language technology has so far done little to further IR. It was argued, however, that search will turn gradually more and more to conversational systems (thankfully, or all my research would be in vain!).

Theo Huibers said something I really like and find very appropriate: “Forget about the structure. Deal with the chaos!”, referring to the problems in search today.

Book reviews: “Automatic Text Analysis” By Alexander Mehler & Reinhard Köhler and “Multimedia Retrieval” By H.M. Blanken et al.

You can join IRSG (Specialist group on information retrieval) for free.

July 24, 2008

Google Knol

Google just made Knol available to us all!

It's a big repository of expert articles, written by... well... experts. Just like Wikipedia, but written by experts. A knol is a unit of knowledge.

"With Knol, we are introducing a new method for authors to work together that we call "moderated collaboration." With this feature, any reader can make suggested edits to a knol which the author may then choose to accept, reject, or modify before these contributions become visible to the public. This allows authors to accept suggestions from everyone in the world while remaining in control of their content. After all, their name is associated with it!"

The hardest thing for Google is going to be ranking knols accurately in Google search results: a knol page is supposed to be among the first things a user sees when searching for its topic. "Knol" refers both to the project and to an individual article. Authors can include ads and earn money from them. Another money spinner, or a genuine attempt at generating quality data?

July 23, 2008

The evolution of web search

Yihong Ding from Brigham Young University wrote an article on the evolution of web search (search 3.0). There's an interesting challenge from Hakia's Riza, countered by Yihong. Hakia is a good engine but has yet to evolve, which is a good thing: there's improvement ahead.

Yihong proposes a criterion by which to evaluate search engine performance. He explains: "every evolutionary stage of Web search can be uniformly determined by the quality of its produced link resources. In specific, Search 1.0 means the produced link resources are 1.0-level quality, Search 2.0 means the produced link resources are 2.0-level quality, and so on."

In his opinion, almost all search engines belong to search 1.0. Hakia claims to be doing search 4.0; however, Yihong responds: "The criterion to measure stages of evolution is not about how well the search results are but about in which quality of productivity the search results are produced."

Food for thought and debate clearly.

July 21, 2008

Natural language querying

A natural language query is expressed in conversational syntax, for example "What is a tornado?". In keyword search you would enter the term "tornado", and some engines have commands such as "define: tornado".

What makes natural language querying different is that not only can the user simply ask for information, but the search engine also has a better chance of returning the right information.

Google recognises some natural language queries, like our example, because of the question-type pattern. We call these wh-words: who, what, where, when, and how slips in there too. Because we can tell the system to recognise the pattern these types of sentence have, the translation is pretty much "define:tornado".
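That pattern idea is easy to sketch. The patterns and translations below are illustrative only, not how Google actually does it:

```python
import re

# Illustrative wh-word patterns: each maps a question shape to a
# keyword-style query template.
PATTERNS = [
    (re.compile(r"^what (?:is|are) (?:a |an |the )?(.+?)\??$", re.I), "define:{}"),
    (re.compile(r"^who (?:is|was) (.+?)\??$", re.I), "{}"),
    (re.compile(r"^where (?:is|are) (.+?)\??$", re.I), "{} location"),
]

def translate(query):
    """Turn a recognised natural language question into a keyword query."""
    for pattern, template in PATTERNS:
        m = pattern.match(query.strip())
        if m:
            return template.format(m.group(1))
    return query  # unrecognised: fall back to plain keyword search
```

So "What is a tornado?" becomes "define:tornado", while anything the patterns don't cover passes through unchanged.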

Things get trickier if you enter, for example, "if I exercise will I lose weight?". What we'd ideally want is a list of documents explaining the benefits of exercise for weight loss, etc. Google doesn't get it completely wrong, but the results are quite clumsy: centred on the right topic, just without my answer. That's because Google isn't a natural language querying system right now.

These systems are also called Q&A (question-answering) systems. They are expected not only to retrieve the correct information from the index or knowledge base, but also to formulate a natural language answer. In our example, "If I exercise will I lose weight?", a good answer from the system would be "Yes, exercising increases metabolic rate and helps burn fat". The user can then continue the conversation to learn more, with access to the relevant documents along the way.

How do these systems work? They use natural language processing techniques, question classifiers, information retrieval techniques, and natural language generation techniques such as grammars, taxonomies of constructions, named entity recognition, tagging and parsing... quite an arsenal of tools.

The big question is whether we actually want a natural language answer, and a daily conversation with a machine to get our information. Google researchers are, I am sure, working on natural language querying, but I don't think they would be working on generating a natural language answer, simply a better collection of results.

An interesting thing to consider is the use of anaphora of sorts: how the engine keeps track of the questions you've asked, to get a feel for the area you're working on. If you've asked about strawberries, then jars, then whatever... the engine might think: right, this is about jam and cooking. It should also know when you change the subject. This creates a kind of continuity in the querying; it means that context is retained by the engine.
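A crude sketch of that continuity idea: keep a decaying bag of terms from earlier queries in the session and use the strongest survivors to expand the current one. Detecting a genuine change of subject would need semantic similarity between topics, which this toy version doesn't attempt:

```python
from collections import Counter

class SessionContext:
    """Toy session tracker: terms from earlier queries fade over time
    but keep influencing later ones."""
    def __init__(self, decay=0.5):
        self.weights = Counter()
        self.decay = decay

    def ask(self, query_terms):
        terms = [t.lower() for t in query_terms]
        for t in list(self.weights):
            self.weights[t] *= self.decay      # older context fades
        for t in terms:
            self.weights[t] += 1.0
        # expand the query with the strongest surviving context terms
        context = [t for t, w in self.weights.most_common() if t not in terms][:3]
        return terms + context
```

Ask about "strawberries" and then "jars", and the second query still carries "strawberries" along as context.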

But it's really, really hard to do. I haven't used a single system that works properly in an open domain. Some questions are interrogative and some are assertive, and the system needs to recognise this. Understanding the syntax and semantics of a question, despite the tools available, isn't easy. The system needs to understand not only what is in the index, but also what exists in the world of the user. It might need to know that water is wet, for example, in order to understand what we're talking about around a particular topic.

Bottom line? It's hard. Do we need it? Yes, I think so. It's one of the best ways for engines to work with us, and for them to deal with our information more accurately.

How does it impact SEO work? Seeing as the machines still index and retrieve documents in the same way, it isn't much of an issue in that sense. On the other hand, having well-written and coherent content becomes more and more important.

Want to try some? Try Qualim, Start and OpenEphyra.

July 14, 2008

A.I for marketing

The major strength of A.I for SEO is, in my opinion, its ability to trawl through large amounts of data quickly and intuitively. Obviously data mining techniques come into play: you can automatically examine, visualise, and find patterns in your data. This would be a huge help in any business.

Customer analytics is an area that could benefit from these techniques: trawling through data to reveal customer behaviour, buying patterns, site preferences and product preferences over time, and to show patterns in these, amongst a whole host of other things including cross-selling.
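A tiny example of the kind of pattern mining involved: counting item pairs that co-occur in purchase baskets, the frequent-itemset idea behind cross-selling suggestions. The data and the support threshold here are invented:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_support=2):
    """Count item pairs bought together; keep those seen at least
    `min_support` times — candidates for cross-selling."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

Real customer analytics systems go much further (association rules, confidence, lift), but the co-occurrence counting above is the starting point.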

We've been using this type of technology for quite some time in bioinformatics, medical engines, pharmaceuticals, and a lot of other scientific areas. A.I has also been included in CRM systems.

Artificial intelligence marketing (AIM) is a form of direct marketing and basically helps you find out more about your customer. That's the first step at least. Privacy is a problem, because user data has to be gathered and users monitored.

If you have a subscription to the IEEE, there's a good paper on A.I for marketing from Bowen called "Marketing and artificial intelligence: with neural net market segmentation example".

New search engine using A.I


ZeBAze is a new type of search engine that treats querying and information retrieval in a very different way. It's supposed to approach the task the way a human would (cognitively), in that its flexibility allows users to be a bit looser in their searches. As well as A.I techniques, it uses a lot of data mining functions.

The user can assign a scaled value to any field, and ZeBAze retrieves the rows in the database closest to these preferences. "Flexible understanding" is used to rank the results. You can also use regular expressions in your query, as well as the classic form of querying. You can display the results on screen or write them to a database to be used later, either with ZeBAze or another application. ZeBAze adapts to any database structure.
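ZeBAze's internals aren't published, but that "closest rows to the user's preferences" behaviour might be sketched like this. The field names, weights and distance function are entirely my own invention:

```python
def closeness(row, preferences):
    """Score a row by weighted closeness to the user's ideal values.
    preferences maps field -> (ideal_value, weight)."""
    score = 0.0
    for field, (ideal, weight) in preferences.items():
        value = row.get(field)
        if value is None:
            continue                              # missing field: no contribution
        score += weight / (1.0 + abs(value - ideal))  # closer means higher
    return score

def retrieve(rows, preferences, n=5):
    """Return the n rows closest to the stated preferences."""
    return sorted(rows, key=lambda r: closeness(r, preferences), reverse=True)[:n]
```

The point is the ranking: rather than rows either matching or not, every row gets a graded score, so "looser" queries still return something useful.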

It's Windows-only sadly, but you can use any database you want, spreadsheet table or structured text file. It costs USD 99, but there's also a free edition.

July 11, 2008

Google talk about ranking

Google posted an article on their blog about their ranking algorithms.

Amit Singhal (in charge of the ranking team at Google) writes about the "no query left behind" principle, where all queries are dealt with efficiently; how they work very hard to keep the system as simple as possible without compromising the quality of their results (10 ranking changes are made every week); and how there is no manual intervention. Human editing is too subjective, and additionally "often a broken query is just a symptom of a potential improvement to be made to our ranking algorithm."

There will be a follow-up article as well, which will be well worth reading. It's very accessible and quite interesting. It should sort out a few of those rumors.

Google made 450 changes to the algorithm last year (per Udi Manber, search quality at Google). Improvements are being made constantly, and usually for the better. Some changes are subtle and go undetected by the general public; others are evident as fluctuations in the results. Recently I noticed very competitive keyword rankings fluctuating significantly on a daily basis: new sites appearing, dropping out the next day, different sites at positions 1-5, and so on.

Why, you ask? Oh, I have loads of theories, but they're only theories, speculation and a bit of fun. I mostly believe the ranking is determined by the number of letters in the URL. Only kidding :)

July 10, 2008

Where are the expert documents?

Two notable computer scientists, Krishna Bharat and George Mihaila, filed a patent describing a "Method for ranking hypertext search results by analysis of hyperlinks from expert documents and keyword scope". The patent was published on 18/03/2008, but filed in 1999.

In short: "A computer-implemented method and system for determining search results for a search query for hypertext documents. The hypertext documents are reviewed to determine expert documents. When a query is received, the expert documents are ranked in accordance with the query. Then the target documents of the ranked expert documents are ranked to determine the search result set."

An expert document is a document that is about a certain topic and has links to many “non-affiliated” documents on that topic.

It's very difficult to assess how authoritative a page is; analysing its content alone is not enough. Human editors have been used in the past, but that method is way too slow. Collecting usage information has also been looked into, but you'd need huge amounts of it to be accurate. The method described in the patent proposes expert lookup followed by target ranking.

A summary of the process:

  • The expert document list is created in pre-processing, and these are indexed in a special inverted index called an “expert reverse index.”
  • A query is raised and the "expert reverse index" is used to find and rank documents matching the query. The best expert pages are found and ranked according to match information.
  • Outgoing links on expert pages are analysed in the target ranking; by combining relevant outgoing links from many experts on the query topic, the best pages can be found: "This is the basis of the high relevance that the described embodiment of the invention delivers."
The system determines which hypertext documents are experts, ranks the expert documents according to the query, ranks the target documents pointed to by the ranked expert documents, and then bases the results on the ranked target documents.
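The two phases can be sketched roughly as follows. The scoring is illustrative only; the patent's actual expert matching uses key phrases (titles, headings, anchor text) rather than this simple term overlap:

```python
from collections import defaultdict

def rank_targets(experts, query_terms, top_experts=10):
    """experts: list of (doc_terms, outgoing_links) pairs built in
    pre-processing. Phase 1 ranks experts by query match; phase 2
    aggregates their outgoing links to score target pages."""
    # Phase 1: score each expert by how many query terms it covers
    scored = []
    for terms, links in experts:
        match = len(set(query_terms) & set(terms))
        if match:
            scored.append((match, links))
    scored.sort(key=lambda x: x[0], reverse=True)

    # Phase 2: combine links from the best experts; pages endorsed by
    # several independent experts accumulate the highest scores
    target_scores = defaultdict(float)
    for match, links in scored[:top_experts]:
        for url in links:
            target_scores[url] += match
    return sorted(target_scores.items(), key=lambda x: x[1], reverse=True)
```

The key property is in phase 2: a target page only rises to the top when multiple non-affiliated experts on the topic point to it.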

Hilltop deals with expert documents and was acquired by Google in 2003. It's useful for identifying strong cross-linking, and defines how one site is related to another. You can also find info in the poster proceedings of WWW9, pages 72-73, 2000.

July 03, 2008

Web 3.0 - an intelligent web

Web 2.0 basically brought us a better way to interact with web resources and made it possible for millions of people to connect with each other via social networking sites, wikis, blogs, folksonomies, podcasts, RSS... In fact, the key idea of web 2.0 is collaboration between users. Web 2.0 is not a new platform, but rather a "business revolution", as Tim O'Reilly put it. It's changed the way we interact, the way we work, the way we live. Businesses have started sharing their information, products and services through the web 2.0 medium, and we are starting to interact with these businesses as customers in a different way.

What is web 3.0 then? Well, to be honest we don't really have a definition as such yet. It's early days. Web 3.0 is about moving forwards with the Internet, about improving it, and yes, it's about data, what we do with it... and what the machine does with it. The important part of web 3.0 for me is the machine's ability to manipulate data rather than just display it. I work in natural language generation and machine understanding, and the overall path seems to be towards allowing machines to understand web content and create their own. You could ask it to buy stuff for you on Amazon, or book you a flight or a hotel. An early example of this, in my opinion, is the work being done in music recommendation.

Tim Berners-Lee's definition is "an overlay of scalable vector graphics - everything rippling and folding and looking misty - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource."

My definition of web 3.0: artificial intelligence allowing for a web with reasoning capabilities built in.

The technologies used for web 2.0 include REST/XML, CSS, microformats, Ajax, RSS and folksonomies. The technologies used for web 3.0 are description logics and intelligent agents. I'm working in this area: using web 2.0 technologies and getting intelligent agents to interact with them, for example.
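As a flavour of what "reasoning built in" might mean, here's a toy forward-chaining inferencer over subject-predicate-object facts. The facts and the single rule are invented for illustration; real description-logic reasoners are vastly more sophisticated:

```python
# Invented facts and one invented rule, purely for illustration.
facts = {("water", "is", "wet"), ("rain", "is_a", "water")}

rules = [
    # if X is_a Y and Y is Z, then X is Z
    lambda fs: {(x, p2, z)
                for (x, p1, y) in fs if p1 == "is_a"
                for (y2, p2, z) in fs if y2 == y and p2 == "is"},
]

def infer(facts, rules):
    """Forward-chain: apply every rule until no new facts appear."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new
```

From "rain is_a water" and "water is wet", the machine concludes "rain is wet" without being told; that kind of derived knowledge is what a reasoning web would trade in.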

You might also enjoy SCIgen, a system that generates computer science research papers all on its own. Sadly it's just for fun, so I won't be able to use it to avoid writing papers :(

What does it mean for SEO? Well, I don't think it's going to die out any time soon, but I think the way it happens will change. Already with web 2.0 we're using social media; most companies have a blog, even podcasts, and most provide newsfeeds... we've embraced web 2.0 in our SEO efforts. Web 3.0 (when we finally have the technology to make it happen) will be no different. It'll become more important than ever to create well-focused sites that a machine can break down and draw information from for use in another medium. We'll also need to cater to the user as well as possible in order to encourage them to choose our site's information over another's. I think this is going to have more to do with the information in the site than with the site itself. In my research, for example, the user doesn't even go to the site, just interacts with the information.

An interesting thing to note is that my system works with the user and also makes up its own mind. Many systems in research use this method. Do we finally have a partnership between man and machine? The user fine-tunes, and the machine learns and creates... The machine has a long way to go yet, and so does the user.

Some resources for you: a primer on the "semantic web", and Eric Schmidt defines web 3.0.

July 01, 2008

Mining Search Engine Query Logs via Suggestion

This paper, from Googler Ziv Bar-Yossef and colleague Maxim Gurevich, is about mining the suggestions generated as we type in search boxes. These methods can be used to estimate the popularity of keywords in the search engine's query log, query volumes, and the suggestion success rate.

First off, some definitions:

"Monte Carlo" method = "The use of randomly generated or sampled data and computer simulations to obtain approximate solutions to complex mathematical and statistical problems". (Nature)

"Suggestion service" = tries to anticipate what the user is looking for by attempting to auto-complete the query.
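The classic illustration of the Monte Carlo idea is estimating pi by sampling random points in the unit square and counting how many fall inside the quarter circle:

```python
import random

def estimate_pi(samples=100_000, seed=0):
    """The fraction of random points in the unit square that land
    inside the quarter circle approaches pi/4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples
```

No geometry is solved exactly; the answer simply emerges from enough random samples, which is exactly the spirit of the sampling algorithms in the paper.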

The researchers used data freely available to them; personal user data would obviously be more valuable, but privacy constraints prevent its use. They state that their algorithms do not compromise privacy because:

"(1) they use only publicly available data provided by search engines; (2) they produce only aggregate statistical information about the suggestion database, which cannot be traced to a particular user."

These methods build on two existing applications:
  • Online advertising and keyword popularity estimation.
  • Search engine evaluation and ImpressionRank sampling.
"We present two sampling/mining algorithms:

(1) an algorithm that is suitable for uniform target measures (i.e., all suggestions have equal weights); and
(2) an algorithm that is suitable for popularity-induced distributions (i.e., each suggestion is weighted proportionally to its popularity).

Our algorithm for uniform measures is provably unbiased: we are guaranteed to obtain truly uniform samples from the suggestion database. The algorithm for popularity-induced distributions has some bias incurred by the fact suggestion services do not provide suggestion popularities explicitly."
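A much-simplified toy of the rejection idea: fire random prefixes at the suggestion service and keep only those whose result page is not full, since a non-full page exposes the complete suggestion set for that prefix. The database and alphabet below are invented, and the paper's actual estimators apply bias corrections this sketch omits:

```python
import random

# Simulated suggestion service: at most K completions per prefix.
K = 4
DB = ["cat", "car", "card", "care", "cart", "dog", "door", "dot", "dove"]

def suggest(prefix):
    return sorted(w for w in DB if w.startswith(prefix))[:K]

def sample_suggestion(rng, tries=10_000):
    """Draw a suggestion by rejection: full result pages (len == K) are
    rejected because they hide part of the suggestion set."""
    for _ in range(tries):
        length = rng.randint(1, 4)
        prefix = "".join(rng.choice("acdegorstv") for _ in range(length))
        hits = suggest(prefix)
        if 0 < len(hits) < K:
            return rng.choice(hits)
    return None
```

Repeating this many times yields samples from the suggestion database without ever seeing the underlying query log, which is the privacy point the authors make.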

Through thorough testing, they found that the uniform sampler is both unbiased and efficient, and that the score-induced one was less effective.

Some of the limitations include sending thousands of queries to the suggestion server (not a big problem, as the effect is marginal), and the fact that the method reflects the suggestion database more than the query log itself.

For search engine users, it means that there is good research being carried out in order to help us obtain better results, which is always a good thing. For the SEO people, it means that with engines helping users get even more focused results, keyword analysis, and user behaviour data becomes even more important.

Hooray!

Welcome to "Science for SEO". Here I will be sharing with you my usual everyday reading on computer science developments which may or do affect the SEO industry. I will try to provide a layman's explanation and delve into only as much detail as is necessary; there's no need to confuse everyone.

I'm finishing (endlessly it seems) a PhD in Natural language generation and machine understanding. I have many years of experience with search engines, having built some, and also having studied their mechanics in depth as I use some of these techniques for my own area of research. I also do some SEO work which helps with the rent, the bills, and funding some of the more fun things in life.

I hope you enjoy reading and that you get something out of it. Happy trails.

cj
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.