
Showing posts with label Google patent. Show all posts

January 23, 2009

G patent: identifying similar passages in text

The patent entitled "Identifying and Linking Similar Passages in a Digital Text Corpus" was published on the 22nd of January and filed on the 20th of July 2007.

It's a really interesting one, not just because it covers a topic I'm particularly interested in, but because it describes a very useful method for digital libraries in particular.  Digital libraries are different from web documents because they don't have loads of functional links in them.  The authors mention that using the references and citations listed in the documents isn't useful because they aren't used outside of academia or related activities.

Basically they're saying that it's hard to browse a load of documents in a digital library efficiently.  You can't navigate the corpus like you would navigate the web because of the nature of the structure.  

"As a result, browsing the documents in the corpus can be less stimulating than traditional web browsing because one can not browse by related concept or by other characteristics."

They're saying that finding papers in a digital library is boring because everything is classified either by the keywords the conferences ask for in that particular section of the paper, or by author, title, year, subject... It would be far more useful to browse by related concept, for example.  And I agree.

The claim:

"A computer-implemented method of identifying similar passages in a plurality of documents stored in a corpus, comprising:building a shingle table describing shingles found in the corpus, the one or more documents in which the shingles appear, and locations in the documents where the shingles occur; identifying a sequence of multiple contiguous shingles that appears in a source document in the corpus and in at least one other document in the corpus; generating a similar passage in the source document based at least in part on the sequence of multiple contiguous shingles; and storing data describing the similar passage. " ("shingles" are simply fragments)

Documents are processed and similar passages amongst them are identified.  Data describing the similarities is stored and the "passage mining engine"  then groups similar passages into further groups which are based on the degree of similarity amongst other things, so we have a ranking algorithm too.  They also describe an interface which shows the user the hyperlinks that are associated with these passages so they can easily navigate them.

Their method basically identifies all shingles, gathers as much data as is available on them (location, documents they appear in, etc...) and then groups them together into clusters based on similarity.
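To make that concrete, here's a toy Python sketch of the shingle-table idea (my own illustration — the function names, the corpus format and the shingle size are assumptions, not Google's implementation):

```python
from collections import defaultdict

def shingles(words, k=4):
    """Yield each k-word shingle with its starting position."""
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k]), i

def build_shingle_table(corpus, k=4):
    """Map each shingle to the (document, position) pairs where it occurs."""
    table = defaultdict(list)
    for doc_id, text in corpus.items():
        for sh, pos in shingles(text.split(), k):
            table[sh].append((doc_id, pos))
    return table

def similar_passages(corpus, source_id, k=4):
    """Merge runs of contiguous shared shingles in the source into passages."""
    table = build_shingle_table(corpus, k)
    words = corpus[source_id].split()
    # Positions of source shingles that also occur in some other document.
    shared = sorted(pos for sh, pos in shingles(words, k)
                    if any(d != source_id for d, _ in table[sh]))
    passages, run = [], []
    for pos in shared:
        if run and pos != run[-1] + 1:   # gap: close the current passage
            passages.append(" ".join(words[run[0]:run[-1] + k]))
            run = []
        run.append(pos)
    if run:
        passages.append(" ".join(words[run[0]:run[-1] + k]))
    return passages
```

A run of overlapping shared shingles collapses into a single passage, which is roughly the "sequence of multiple contiguous shingles" the claim talks about.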

Users could navigate the passages that are relevant to them rather than whole documents, which may not be relevant in their entirety.  Being able to browse all this data by related features like that would help us find far more relevant papers for our information needs.

This is a different approach to the one where an entire document is analysed (like in LSA) and classified and defined in terms of its overall features.  Using passages instead means that the entire exercise is far more granular.  Here we take into account that a document may be about a topic in a broad sense but actually about several particular subtopics.  We can also tell that perhaps part of a document is useful to a user in response to a query but not the whole thing.  

Search engines for digital libraries containing scientific papers, for example, do not perform half as well as the search engines we're used to using on the web.  Google Scholar can sometimes yield much better results than Citeseer, for example, but then they work very differently.  The documents are usually in PDF format or something similar, so as Google note, you need to be able to make them machine-readable for starters.

This conveniently, as far as I'm concerned, brings us to the elusive and wonderful exercise of summarization.  I say this because if you have a number of fragments from different documents and can identify how similar they are, you can discard duplicate information and create a complete summary from the retrieved data for your user, while also offering access to each individual document if the user wants to read the whole thing or the original passages.  This is not groundbreaking in summarization, but the model described in the patent fits.
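As a back-of-the-envelope illustration of that discard-duplicates-then-summarise idea (entirely my own sketch, with a crude Jaccard word overlap standing in for the patent's similarity grouping):

```python
def jaccard(a, b):
    """Word-overlap similarity between two text fragments."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def summarise(fragments, threshold=0.5):
    """Keep a fragment only if it isn't a near-duplicate of one already kept."""
    kept = []
    for frag in fragments:
        if all(jaccard(frag, k) < threshold for k in kept):
            kept.append(frag)
    return " ".join(kept)
```

Near-duplicate fragments from different documents get dropped, and what survives is stitched into the summary.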

I really like that idea.

January 05, 2009

Query expansion using MT

The Google patent entitled "Machine Translation for Query Expansion" (25/12/08) is a really interesting read.  It describes an exciting new method for query expansion.

"Query expansion" is when a users query is modified before the search is performed.  This is done to improve the search results.  To do this techniques such as stemming, spelling correction, and the adding of synonyms are used.The method described deals with query expansion using synonyms.  Usually this is done using thesauri or lexical ontologies but here it is proposed that machine translation be used - ingenious.  

Synonym selection is really not that easy at all.  WordNet and such resources have helped us a lot, but there's room for improvement.  Sometimes a word can have several different meanings and choosing the wrong one would completely change the query.  

"The method includes receiving a search query and selecting a synonym of a term in the search query based on a context of occurrence of the term in the received search query, the synonym having been derived from statistical machine translation of the term. The method also includes expanding the received search query with the synonym and using the expanded search query to search a collection of documents."

Google uses statistical machine translation (as opposed to the rule-based approach).  This type of system includes a language model, which estimates how likely a given string is to occur in the target language, and a translation model, which uses probabilities to determine likely translations.  The translation model looks at the likelihood of a particular string being the translation of another, and the language model tells the system which proposed translation coming out of the translation model is likely to be right.
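In noisy-channel terms, the decoder just picks the candidate that maximises the language-model score times the translation-model score.  A minimal sketch (the probability tables here are made up purely for illustration):

```python
def decode(candidates, lm, tm, source):
    """Pick the candidate t maximising P(t) * P(source | t)."""
    return max(candidates, key=lambda t: lm[t] * tm[(source, t)])

# Toy models: the language model scores fluency of the target string,
# the translation model scores how likely it translates the source term.
lm = {"transport": 0.02, "vessel": 0.01}
tm = {("ship", "transport"): 0.6, ("ship", "vessel"): 0.7}
```

Even though "vessel" has the higher translation probability here, the language model tips the balance towards "transport".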

"In general, in another aspect, a method is provided. The method includes receiving a request to search a corpus of documents, the request specifying a search query, using statistical machine translation to translate the specified search query into an expanded search query, the specified search query and the expanded search query being in the same natural language, and in response to the request, using the expanded search query to search a collection of documents."

The end result is that there is an increased likelihood that the search results are more accurate.  It also limits the expansion of the query with erroneous words (which is nice technically speaking).

"Statistical correlations between the occurrences of words in the source language and words in the target language are expressed as alignments between particular words or phrases. When the target language and source language are the same natural language, the principal meaning of an aligned pair is the same. The aligned word or phrase pair is presumed to have similar meaning, i.e., they are presumed to be synonymous. For example, the word "ship" can be aligned under certain circumstances (e.g., in a particular context) with the word "transport". Thus, for those circumstances, "ship" is synonymous with "transport". 

The Google translate system is far from accurate though, using their example:

User query: "How to ship a box"
Google translate French: "Comment une boîte de livraison"
Google translate German: "Wie Sie,ein Feld"

French:
Comment = How
une boîte = a box
livraison = delivery

German:
Wie = how
Sie = you
ein Feld = a field

The synonyms and the context are pretty hard to get from these translations - obviously this is a really simplistic and short test.  It just gives an idea of the thing.  It works better with "Achilles heel running injury" by the way, but... evaluation isn't really done like this; it's a bit more complex.

Here basically the idea is to add more context awareness to the search system.  I like it, it's very clever indeed.  The Google translate engine is therefore capable of being put to use in other ways than just translation.  

Why should you care?

Well this method shows that queries are being expanded to include far more words than are actually present in the query.  This means that going after particular keywords may be useful as a basis, but is a very limited approach.  As an SEO expert, you should be seeking to create content rich with not only your top target keywords but also terms and concepts that belong to that topic.  It's time to look at things in more dimensions than one.

November 13, 2008

New Google patent - more personalization

Google published (another) patent on the 11th of November.  This one is interesting because it deals with serving up results in a preferred language.  It means that your search is no longer limited to English sources exclusively if you're in the UK or US or somewhere, because you can specify a language you'd prefer results to be in.  This doesn't mean all your results are in your chosen language, but Google will serve those as well as your English results.  You could also use dialects if you wanted, and also dead languages (Latin, Greek...) AND...Klingon.

You might want Italian results although you are French, because you can read both, so why not? It makes things much easier for people who study or who speak several different languages.  If you speak 3 languages, you could be missing out on great information available in German or Lao for example.  I'd be interested to know how translators feel about this.

The invention dynamically determines the preferred languages and ranks the search results.  The system can determine what your preferred and least preferred languages are by evaluating queries, user interface and search result characteristics.

Query terms are not a good way of determining the language preference because for example, proper nouns are for the most part language independent, so "Marlena Shaw" is always going to be the same.  It gives no clue as to what language you want your results in.  

Keyword searches are also not complete enough to determine a language preference, because there's no context, and individual words can be language-independent or language-misleading.  The example used in the patent is the "Waldorf Astoria".

Rankings....

These results have to be ranked to favour the results in the preferred language whilst still allowing the other results to appear.  It's done by using a predetermined shifting factor, or by adjusting the numerical score assigned to each search result by a weighting factor and re-sorting the search results.
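That weighting-and-resorting step could be as simple as this sketch (the boost value and the result format are my own assumptions, just to show the mechanics):

```python
def rerank(results, preferred, boost=1.5):
    """results: (url, score, language) triples.  Multiply the preferred
    language's scores by a weighting factor, then re-sort by score."""
    weighted = [(url, score * boost if lang == preferred else score, lang)
                for url, score, lang in results]
    return sorted(weighted, key=lambda r: r[1], reverse=True)
```

A preferred-language result can overtake a slightly higher-scoring English one, while everything else stays in the list.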

I hope this happens soon, it would be really interesting to get multiple language results.  This is once again, another example of how personalisation is charging towards us at full throttle.  Cool.

November 07, 2008

Patent for SEO software (2008)

I came across a patent for SEO software.  The inventors are Ray Grieselhuber, Brian Bartell, Dema Zlotin, and Russ Man.  It's called "Centralized web-based software solution for search engine optimization" and it was published on the 12th of June 2008.

They have patented a piece of software for SEO:

"In one aspect, the invention provides a system and method for modifying one or more features of a website in order to optimize the website in accordance with an organic listing of the website at one or more search engines. The inventive systems and methods include using scored representations to represent different portions of data associated with a website. Such data may include, for example, data related to the construction of the website and/or data related to the traffic of one or more visitors to the website. The scored representations may be combined with each other (e.g., by way of mathematical operations, such as addition, subtraction, multiplication, division, weighting and averaging) to achieve a result that indicates a feature of the website that may be modified to optimize a ranking of the website with respect to the organic listing of the website at one or more search engines."

"... The solution 290 may make recommendations regarding improvements with respect to the site's construction. For example, the solution 290 may make recommendations based on the size of one or more webpages ("pages") belonging to a site. Alternative recommendations may pertain to whether keywords are embedded in a page's title, meta content and/or headers. The solution 290 may also make recommendations based on traffic referrals from search engines or traffic-related data from directories and media outlets with respect to the organic ranking of a site. Media outlets may include data feeds, results from an API call and imports of files received as reports offline (i.e., not over the Internet) that pertain to Internet traffic patterns and the like. One of skill in the art will appreciate alternative recommendations ."

One of the claims is:

"...acquiring data associated with the website; generating a plurality of scored representations based upon the data; and combining the plurality of scored representations to achieve a result; recommending, based on the result, a modification to a parameter of the website in order to improve an organic ranking of the website with respect to one or more search engines."

How many of us use statistical methods for SEO?  I know I collect a lot of data, but not in the same format as this.  Can this be reliable?  Every site is very different and has different needs.  A human is able to discuss this with the client and adapt the strategy accordingly.  Can this system take those parameters into account too?  It is a recommendation system, so I would think that you could adjust the weightings depending on the site you're analysing.  I would be interested to try this out in a free beta, but don't see myself handing over a handful of cash just yet.

I'm all for applying data mining techniques to SEO, I've looked at this before and it is useful.


October 21, 2008

New Google patent about ads

William Slawski over at SEO by the Sea brings our attention to a patent about Google's possible intentions for advertising in podcasts, television and radio.

Snippet from the patent:

"Systems and methods for delivering audio content to listeners. In general, one aspect can be a method that includes receiving a request to download a podcast, and determining a targeted advertisement to be inserted into the podcast. The method also includes inserting the targeted advertisement into the podcast dynamically at a predetermined time. Other implementations of this aspect include corresponding systems, apparatus, and computer program products." 

For loads more in-depth information, trek over to the original post.
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.