Science for SEO: Multi-document summarization patent

Kathleen McKeown and Regina Barzilay (also a Microsoft faculty fellow) published a patent on 29 April 2008 (assignee: Columbia University in New York). It is entitled "Multi-document summarization system and method".

Basically, the idea is to create a relevant summary from a set of documents that contain the right kind of information. The summary is then presented to the user, and ideally it contains all the information s/he asked for.

They extract phrases from the documents, analyse them using phrase intersection analysis to identify the relevant ones, remove ambiguous time references, and then generate a sentence to include in the summary. The system comprises a storage device for storing the documents in the collection, a lexical database, and a processing subsystem. The lexical database is described as something like WordNet. WordNet is like the Google of lexical databases: it pops up all over the place in systems and research papers (and yes, I use it too; it works, although I find it restricted for certain domains and have had to extend it).
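To make those steps a bit more concrete, here is a toy sketch in Python. It is not the patent's method: I'm assuming word trigrams as a crude stand-in for "phrases", a simple count of how many documents share a phrase in place of real phrase intersection analysis, and a tiny hand-made word list (AMBIGUOUS_TIME) for the time references. All function names are made up for illustration, and the final "generation" step just lists the surviving phrases rather than building a sentence.

```python
import re
from collections import defaultdict

# Hypothetical word list: time references that are ambiguous out of context.
AMBIGUOUS_TIME = {"today", "yesterday", "tomorrow", "tonight"}


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_phrases(text, n=3):
    """Return the set of word n-grams in a text (a crude stand-in for phrases)."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_phrases(documents, n=3, min_docs=2):
    """Keep the phrases that occur in at least min_docs of the documents."""
    counts = defaultdict(int)
    for doc in documents:
        for phrase in extract_phrases(doc, n):
            counts[phrase] += 1
    return {phrase for phrase, count in counts.items() if count >= min_docs}


def drop_ambiguous_time(phrases):
    """Simplification: discard any phrase that contains an ambiguous time word."""
    return {p for p in phrases if not AMBIGUOUS_TIME & set(p)}


def summarize(documents, n=3, min_docs=2):
    """Return the shared, time-unambiguous phrases as a sorted list of strings."""
    phrases = drop_ambiguous_time(shared_phrases(documents, n, min_docs))
    return sorted(" ".join(p) for p in phrases)


docs = [
    "The bridge collapsed yesterday after heavy rain.",
    "Officials said the bridge collapsed yesterday morning.",
    "Heavy rain fell before the bridge collapsed.",
]
print(summarize(docs))  # ['the bridge collapsed']
```

Here "bridge collapsed yesterday" is shared by two documents but gets dropped because of "yesterday", while "the bridge collapsed" appears in all three and survives. A real system would presumably compare phrases by meaning (which is where a lexical database like WordNet would come in) rather than by exact word match.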

The whole point of a system like this is to help the user who often faces information overload and does not have the time to scan all the documents presented. This system extracts the right stuff from the right documents and presents a summary of all that information.

They state: "For individual documents, domain-dependent template based systems and domain-independent sentence extraction methods are known. Such known systems can provide a reasonable summary of a single document. However, these systems are not able to compare and contrast related documents in a document set to provide a summary of the collection".

This is why their method is really quite cool. It nicely supports my theory that in the future you won't have to visit websites at all, and will just use a few good systems to do all the searching and site visiting for you.

It presents some interesting challenges for SEO, because here the main task would be providing incredible content and really relevant images. You'd still have to prove to the system, as you do to the search engine, that you're relevant to the topic. But I think natural language queries will be used more and more, which means that optimisation would have to be geared towards contextual search rather than keyword-based optimisation and that kind of thing. So it all gets more and more complicated as we advance and refine, both for computer scientists and SEO experts. I seem to cheer on both sides :)