We spend an awful lot of time running searches on Google and then going through the list of results to find information related to our query, or rather the exact information we are looking for. Sometimes we don't know exactly what we're looking for until we reach a resource that addresses our query in a different way. Then our search deviates and the process continues. It's time-consuming because you need to at least scan each resource you think might be relevant. The ones that do appear relevant then need to be read in more depth. This is not an efficient way of collecting useful information, which is why technology such as document summarization is important.
Document summarization involves automatically creating a summary of a document. Lots of things have to be taken into consideration such as the type of language used (this needs to be successfully recognised), the style of writing, and the document syntax.
Several approaches have been discussed and evaluated recently, and I will introduce some of them. The basic ideas, though, are extraction (pulling out useful information as it stands) and abstraction (paraphrasing sections of the document to summarise it). A search engine needs a slightly different type of summarization than other areas because the summary has to be relevant to the query, or rather "query biased".
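To make "query biased" extraction concrete, here is a minimal Python sketch. It is only an illustration under simple assumptions: sentences are split naively on punctuation, and a sentence's relevance is just the number of query terms it contains. The function names are made up for this example and do not come from any of the papers below.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def query_biased_summary(document, query, max_sentences=2):
    """Return the sentences that share the most terms with the query."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_terms = set(tokenize(query))
    scored = []
    for position, sentence in enumerate(sentences):
        overlap = len(query_terms & set(tokenize(sentence)))
        scored.append((overlap, position, sentence))
    # Highest overlap first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Restore document order so the extract reads naturally.
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

A real system would weight terms (for example by rarity) rather than count them, but the shape is the same: score every sentence against the query, keep the best few, and present them in their original order.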
The most efficient first step in summarization for a search engine (imho) is multi-document summarization. It produces a single summary of all of the results returned for your query, so you are much closer to getting an answer than you are with a list of documents that might be useful to you. This hugely speeds up your interaction with the data and addresses the issue of data overload.
So that multi-document summarization can happen, the documents have to be clustered. This is easier in a search engine because the list of results is indeed a cluster. The summarization stage however can offer further opportunities for a more focused clustering.
The various methods used for summarization in the past aren't really what I want to look at in this post. I want to focus on recent research which gives us valuable insight into how this might work in a fully working search engine. I'm going to introduce a number of papers, each with a very short low-down of the method presented, because without the how we can't really start to understand the why.
"Comments-Oriented Document Summarization: Understanding Documents with Readers’ Feedback" by Hu, Sun and Lim from Nanyang Technological University, Singapore
(SIGIR 08)
Interestingly, they looked at improving the performance of their summarization system by using comments left by readers on the web documents. This is described as "comments-oriented document summarization". Comments are linked to one another by 3 relations: topic, quotation and mention. These produce 3 graphs, and the importance of each comment is then scored either by merging the graphs into a multi-relation graph or by constructing a 3rd-order tensor from them. Sentences are extracted from the document using a feature-biased approach (scores sentences with a bias towards the keywords derived from the comments) or a uniform-document approach (scores sentences uniformly with the comments). They found that using the comments significantly improved the performance of their system. This does however only work if there are comments to use, and these are most likely to occur on blog posts.
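As a rough illustration of the feature-biased idea, the Python sketch below weights words by how many comments mention them and then scores each sentence by the average weight of its words. This is a simplification and not the authors' method: the paper scores comments with graph or tensor analysis first, whereas here every comment counts equally. The stopword list and function names are assumptions made up for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "it", "in", "this", "that", "i"}

def words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def comment_keywords(comments):
    """Weight each word by the number of comments that use it."""
    counts = Counter()
    for comment in comments:
        counts.update(set(words(comment)))
    return counts

def score_sentences(sentences, comments):
    """Score each sentence by the average comment weight of its words."""
    weights = comment_keywords(comments)
    scores = []
    for sentence in sentences:
        tokens = words(sentence)
        total = sum(weights[t] for t in tokens)  # words absent from comments weigh 0
        scores.append(total / len(tokens) if tokens else 0.0)
    return scores
```

Sentences about whatever the readers kept talking about float to the top, which is the intuition behind biasing the summary towards the comments.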
"Multi-Document Summarization Using Cluster-Based Link Analysis" by Wan and Yang from Peking University (SIGIR 08)
They built on the Markov Random Walk model, which has been used for multi-document summarization by analysing the link relationships between sentences in the document set. They isolate topic clusters within the documents and form sentence clusters. They propose two methods that make use of these clusters: the "Cluster-based Conditional Markov Random Walk Model" (ClusterCMRW) and the cluster-based HITS model (ClusterHITS). The former proved more robust than ClusterHITS across different cluster numbers.
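For readers who want to see what a random walk over sentences looks like, here is a minimal Python sketch of the plain (non-cluster) version: sentences are nodes, cosine similarity between their word counts gives the edge weights, and a PageRank-style iteration scores them. It does not implement the cluster-level conditioning that ClusterCMRW adds, and the damping factor and iteration count are assumed values chosen for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Count the lowercase word tokens in a sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def random_walk_scores(sentences, damping=0.85, iterations=50):
    """Score sentences with a damped random walk over their similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [bag_of_words(s) for s in sentences]
    # Edge weight = cosine similarity; a sentence does not link to itself.
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Follow an edge with probability proportional to its weight.
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    # A sentence with no similar neighbours spreads its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

Sentences that are similar to many other well-connected sentences end up with the highest scores, and those are the ones an extractive summary would pick first.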
"MANYASPECTS: A System for Highlighting Diverse Concepts in Documents" by Liu, Terzi and Grandison from IBM Almaden Research (PVLDB 08)
Their system takes a document and then highlights a small set of sentences that are likely to cover different aspects of that document. They use "simple coverage" and "orthogonality criteria". The cool thing about this system is that it can handle both plain text and RSS/ATOM feeds. They quite rightly say that it can also be integrated in web 2.0 forums so that you can easily find different opinions on things and discussions. They also used the standard methods for clustering and summarization such as k-median and SVD.
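The idea of picking a few sentences that cover different aspects can be sketched with a simple greedy rule in Python: keep choosing the sentence that adds the most words not yet covered by the sentences already chosen. This is only a stand-in for illustration; it is not the coverage or orthogonality criterion from the paper, and the function names are invented for the example.

```python
import re

def content_words(sentence):
    """Return the set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def pick_diverse_sentences(sentences, k=2):
    """Greedily pick up to k sentences, each adding the most unseen words."""
    word_sets = [content_words(s) for s in sentences]
    covered = set()
    chosen = []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < k:
        # The best candidate is the one contributing the most new words.
        best = max(remaining, key=lambda i: len(word_sets[i] - covered))
        if not word_sets[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= word_sets[best]
        remaining.remove(best)
    # Return the picks in document order.
    return [sentences[i] for i in sorted(chosen)]
```

Because a sentence that repeats already-covered words gains nothing, near-duplicates are skipped and the highlighted set tends to touch on different topics.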
There's talk of integrating this into Firefox too and to allow for spam control which is quite exciting.
"Web Content Summarization Using Social Bookmarks: A New Approach for Social Summarization" by Park and Fukuhara from Seoul National University (WIDM 08)
Their approach, which they call "social summarization", is to exploit user feedback (comments and tags) in social bookmarking services like Del.icio.us, Digg, YouTube and Amazon. They used a prototype system called SSNote which analyses tags and user comments and extracts summaries. The approach shows promise: they report text summaries that are just as good as human-produced ones.
"Latent Dirichlet Allocation Based Multi-Document Summarization" by Arora and Ravindran from the Indian Institute of Technology Madras (AND 08)
As the title says, they used Latent Dirichlet Allocation for their system. This method allows them to capture the events covered in the documents and to produce a summary which respects these different events. It also means that they don't need to pay attention to any of the details concerning grammar and structure. Their method was very efficient. Basically, the central theme and events in the documents are identified, as well as the sub-topics and themes, and these are then represented in the summary. They extract entire sentences and do not modify anything.
Why should you care?
Do you remember all that talk about how there was no point in checking search engine rankings anymore? Everyone was very divided on this issue, and it isn't an easy thing to explain to clients either. Well I think that this research clearly highlights that there are very definite moves to break away from the standard list of documents. As these techniques become more refined and as they become implemented successfully, they will no doubt change the way that users find information and products.
What should you do then?
The same as you should already be doing: produce well-structured, rich content that is grammatically and syntactically sound. Not only do you need to show up in the initial results, as you do anyway in a cluster like the ranking list in a search engine, but you are also going to have to provide very focused and relevant information, because the summarization stage can act as a further filter on the initial clustering.
More papers that are freely accessible should you be tickled by the subject:
Multi-document summarization system and method patent by McKeown and Barzilay
Document summarization based on topicality and specificity patent by Rie Ando et al
2 comments:
Very insightful post. The science behind the work we do is usually far removed from observation - it's great fun to pull the covers and take a peak! Thanks for the research.
At the same time - it's telling that the staple action is usually the same in this industry: do what makes sense (create well organized, meaningful content).
Hi Lyal,
Thank you. I'm glad you liked the post and found it helpful!
cj