Science for SEO: Google patent: identifying similar passages in text

The patent entitled "Identifying and Linking Similar Passages in a Digital Text Corpus" was filed on the 20th of July 2007 and published on the 22nd of January.

It's a really interesting one, not just because it covers a topic I'm particularly interested in, but because it describes a method that is especially useful for digital libraries. Documents in digital libraries are different from web documents because they don't have loads of functional links in them. They mention that the references and citations listed in the documents aren't much use for this because they aren't used outside of academia or related activities.

Basically they're saying that it's hard to browse a load of documents in a digital library efficiently. You can't navigate the corpus like you would navigate the web because of the nature of the structure.

"As a result, browsing the documents in the corpus can be less stimulating than traditional web browsing because one can not browse by related concept or by other characteristics."

They're saying that finding papers in a digital library is boring because everything is classified either by the keywords the conferences ask for in that particular section of the paper, or by author, title, year, subject... It would be far more useful to browse by related concept, for example. And I agree.

The claim:

"A computer-implemented method of identifying similar passages in a plurality of documents stored in a corpus, comprising: building a shingle table describing shingles found in the corpus, the one or more documents in which the shingles appear, and locations in the documents where the shingles occur; identifying a sequence of multiple contiguous shingles that appears in a source document in the corpus and in at least one other document in the corpus; generating a similar passage in the source document based at least in part on the sequence of multiple contiguous shingles; and storing data describing the similar passage." ("shingles" are simply fragments)

Documents are processed and similar passages amongst them are identified. Data describing the similarities is stored and the "passage mining engine" then groups similar passages into further groups which are based on the degree of similarity amongst other things, so we have a ranking algorithm too. They also describe an interface which shows the user the hyperlinks that are associated with these passages so they can easily navigate them.
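
To make the steps in the claim a bit more concrete, here is a minimal sketch in Python. It is only my illustration, not the patent's implementation: I'm assuming a shingle is a run of k consecutive words, and the function names, the value of k and the minimum sequence length are all made up for the example.

```python
def shingle_list(text, k=3):
    """Split text into overlapping shingles of k consecutive words (assumed definition)."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def build_shingle_table(corpus, k=3):
    """Map every shingle to the (doc_id, position) pairs where it occurs."""
    doc_shingles = {doc_id: shingle_list(text, k) for doc_id, text in corpus.items()}
    table = {}
    for doc_id, shingles in doc_shingles.items():
        for pos, shingle in enumerate(shingles):
            table.setdefault(shingle, []).append((doc_id, pos))
    return table, doc_shingles


def shared_sequences(source_id, table, doc_shingles, min_len=2):
    """Find runs of at least min_len contiguous shingles that the source
    shares with one other document.

    Returns (source_pos, run_length, other_doc_id, other_pos) tuples.
    """
    src = doc_shingles[source_id]
    found, i = [], 0
    while i < len(src):
        best_len, best_hit = 0, None
        for doc_id, pos in table[src[i]]:
            if doc_id == source_id:
                continue
            other = doc_shingles[doc_id]
            n = 0
            # Extend the match while both documents keep agreeing.
            while (i + n < len(src) and pos + n < len(other)
                   and src[i + n] == other[pos + n]):
                n += 1
            if n > best_len:
                best_len, best_hit = n, (doc_id, pos)
        if best_len >= min_len:
            found.append((i, best_len, best_hit[0], best_hit[1]))
            i += best_len  # skip past the run we just recorded
        else:
            i += 1
    return found


def passage_text(text, start, length, k=3):
    """Turn a run of `length` shingles starting at word `start` back into words."""
    words = text.split()
    return " ".join(words[start:start + length + k - 1])


if __name__ == "__main__":
    # Toy corpus, purely illustrative.
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog today",
        "b": "yesterday the quick brown fox jumps over a fence",
    }
    table, doc_shingles = build_shingle_table(corpus)
    for start, length, other_doc, _ in shared_sequences("a", table, doc_shingles):
        print(passage_text(corpus["a"], start, length), "| also in", other_doc)
```

The tuples that come back are the sort of "data describing the similar passage" you would then store. A real system would need something sturdier than splitting on whitespace, but the shape of the idea is the same.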

Their method basically identifies all shingles, gathers as much data as is available on them (location, documents they appear in, etc...) and then groups them together into clusters based on similarity.
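
The grouping step could look something like the sketch below. Again, this is my own guess at a simple version rather than what the patent specifies: I'm using Jaccard overlap between sets of word-pair shingles as the similarity measure and a greedy pass with an arbitrary threshold, and all the names are hypothetical.

```python
def shingle_set(text, k=2):
    """Set of k-word shingles in a passage (assumed definition of a shingle)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_passages(passages, threshold=0.5, k=2):
    """Greedy grouping: a passage joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster.

    Returns a list of clusters, each a list of indices into `passages`.
    """
    sets = [shingle_set(p, k) for p in passages]
    clusters = []
    for i, current in enumerate(sets):
        for cluster in clusters:
            if jaccard(current, sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    passages = [
        "the cat sat on the mat",
        "the cat sat on the mat today",
        "stock prices fell sharply today",
    ]
    print(cluster_passages(passages))
```

Because the similarity score is a number, the same function could be used to rank passages within a group, which is roughly where the ranking side of things would come in.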

Users could navigate straight to the passages that are relevant to them rather than to the whole document, which may not be relevant in its entirety. Being able to browse all this data by related features like that would help us find far more relevant papers for our information needs.

This is a different approach to the one where an entire document is analysed (like in LSA) and classified and defined in terms of its overall features. Using passages instead means that the entire exercise is far more granular. Here we take into account that a document may be about a topic in a broad sense but actually about several particular subtopics. We can also tell that perhaps part of a document is useful to a user in response to a query but not the whole thing.

Search engines for digital libraries containing scientific papers, for example, do not perform half as well as the search engines we're used to using on the web. Google Scholar can sometimes yield much better results than CiteSeer, for example, but then they work very differently. The documents are usually in PDF format or something similar so, as Google note, you need to be able to make them machine readable for starters.

This conveniently, as far as I'm concerned, brings us to the elusive and wonderful exercise of summarization. I say this because if you have a number of fragments from different documents and you can identify how similar they are, you can discard any duplicate information and create a complete summary from the data retrieved for your user, while also offering access to each individual document if the user wants to read the whole thing or the original passages. This is not groundbreaking in summarization but the model described in the patent fits.
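
As a rough illustration of that summarization idea (my own toy version, not something the patent describes), the sketch below drops near-duplicate fragments and keeps a pointer back to every document each surviving fragment came from. Word-set overlap and the 0.6 threshold are assumptions I picked for the example.

```python
def word_set(text):
    """Lower-cased set of words in a fragment."""
    return set(text.lower().split())


def build_summary(fragments, threshold=0.6):
    """Collapse near-duplicate fragments into one entry each.

    fragments: list of (doc_id, text) pairs.
    Returns a list of {"text": ..., "sources": [...]} entries, where
    "sources" lists every document that contained a near-duplicate.
    """
    kept = []
    for doc_id, text in fragments:
        words = word_set(text)
        for entry in kept:
            other = word_set(entry["text"])
            union = words | other
            if union and len(words & other) / len(union) >= threshold:
                # Duplicate information: keep only the link to its source.
                if doc_id not in entry["sources"]:
                    entry["sources"].append(doc_id)
                break
        else:
            kept.append({"text": text, "sources": [doc_id]})
    return kept


if __name__ == "__main__":
    fragments = [
        ("paper1", "shingles are fragments of text"),
        ("paper2", "shingles are small fragments of text"),
        ("paper3", "clusters group similar passages"),
    ]
    for entry in build_summary(fragments):
        print(entry["text"], "| sources:", ", ".join(entry["sources"]))
```

The "sources" list is what would let the user jump from the summary to the original passages or the full documents.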

I really like that idea.