October 07, 2008

What is semantic search?

There's been a lot of talk recently about "semantic search", which is sometimes also referred to as the "readwrite web".  Powerset, Cognition, Ask, Hakia, and many others are "semantic search engines".  It's not a new concept: research has been available in academia for at least 10 years.  In fact, a lot of the people who were involved in that research in the early years are involved in the newly released semantic search engines today.  Not a big surprise!

So, what is semantics, for a start?

Semantics refers to meaning in language (or code, or anything else).  It uses syntax and pragmatics as well as contextual information to establish the meaning of a text, or even of an audio stream if you want to use that.  It's not just about finding similarities or context between two words, but rather about taking the entire text or query into account to establish meaning.

What is the semantic web?

It's a common framework allowing information to be shared and reused.  Information is stored in machine readable formats.  

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." - Tim Berners-Lee

That's probably the best definition, seeing as he invented it.  The semantic web is relevant to semantic search: it uses many of the same techniques and is based on the same idea.  It's not a new version of the web, but rather an extension.  There are a lot of conferences about this worldwide, such as "The semantic web technology conference".

The technologies used include ontologies (which are like big storage boxes full of information on how words and concepts link to each other), mostly built in OWL; natural language processing tools, for named entity extraction for example; data interchange formats (like RDF/XML or Turtle); schemas like RDF Schema; XML, to provide syntax for content structure; and SPARQL, which is a query language for semantic resources.
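
To make the idea of machine-readable facts a bit more concrete, here is a minimal sketch in Python of what a store of subject/predicate/object statements and a pattern query over it could look like.  It is a toy, not RDF or SPARQL themselves: the TRIPLES data, the predicate names and the match and birthplace helpers are all made up for illustration, and a real system would use a proper RDF library and query engine.

```python
from typing import List, Optional, Tuple

# One statement = (subject, predicate, object), loosely in the spirit of an RDF triple.
Triple = Tuple[str, str, str]

# Hypothetical hand-written facts; a real store would be loaded from RDF/XML or Turtle.
TRIPLES: List[Triple] = [
    ("Marilyn_Monroe", "type", "Person"),
    ("Marilyn_Monroe", "bornIn", "Los_Angeles"),
    ("Los_Angeles", "type", "City"),
    ("Los_Angeles", "locatedIn", "California"),
]


def match(subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def birthplace(person: str) -> Optional[str]:
    """Answer 'where was X born?' by looking for a bornIn statement."""
    hits = match(subject=person, predicate="bornIn")
    return hits[0][2] if hits else None


if __name__ == "__main__":
    print(birthplace("Marilyn_Monroe"))  # Los_Angeles
    print(match(obj="City"))             # [('Los_Angeles', 'type', 'City')]
```

The point of the sketch is that the question is answered from an explicit statement about the entity rather than from pages that happen to contain the same words.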

And what is semantic search?

Google uses PageRank to identify relevancy, whereas semantic search engines use meaning to return highly relevant results.  Google returns keyword/keyphrase results, and the semantic solution returns information.  

The data has to be well structured in ontologies, just like in the semantic web.  A semantic network is created which links all of the concepts and words together.  Word sense disambiguation (WSD) is used to decipher what a word may relate to.  WordNet, which you can download for free, is a machine-readable dictionary that a lot of scientists have used for this task, although it's far from foolproof.  Here is a very comprehensive list of which semantic search engines use what kind of procedure, in pretty plain English.
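
As a rough illustration of word sense disambiguation, here is a minimal Python sketch of a simplified Lesk-style approach: pick the sense whose dictionary gloss shares the most words with the context.  I'm not claiming that any of the engines mentioned here work this way.  The SENSES inventory, the glosses and the stopword list are hand-written assumptions standing in for a resource like WordNet.

```python
import re

# Hypothetical, hand-written sense inventory (a stand-in for a real dictionary).
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits and lends money to customers",
        "river_side": "sloping land beside a river or lake",
    }
}

# Very small stopword list, just enough for the examples below.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "that", "in", "on", "at", "is", "i", "my"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense of `word` whose gloss overlaps most with the context.

    Ties go to the sense listed first in SENSES.
    """
    context_words = tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


if __name__ == "__main__":
    print(disambiguate("bank", "I need to deposit money at the bank"))  # financial_institution
    print(disambiguate("bank", "We sat on the bank of the river"))      # river_side
```

Even this toy shows why the task is far from foolproof: "deposit" in the query does not match "deposits" in the gloss, so the first example is decided by a single shared word.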

Google does respond to natural language queries, such as "Where was Marilyn Monroe born?".  Hakia doesn't understand the query and tells me what Marilyn's real name was.  Powerset (which only searches Wikipedia and Freebase) comes up with the goods: "LA".  Working in a "closed domain", Powerset has an easier job than Hakia, which searches the whole web, just like Google.  Google however delivered where Hakia didn't this time round.

Then I tried "Is chili bad for you?" in all 3.  Hakia came up with book reviews for a book called "Bad chili", Google came up with a forum article with that exact question in it, and Powerset delivered an in-depth article on the effects of chili on humans.  The rest of Powerset's results are all off though.  Hakia's results continue with the book, but Google gives me loads of results all about the effect of chili on the body.

Have a go yourself and see what happens.  

This little test definitely shows that Google can come up with the goods, whereas the semantic engines struggle.  More work needed there.  I would be very surprised if Google were shunning semantic web technology and natural language queries.  I would leave that open for discussion actually.

The future?  Natural language queries, and natural language generation for a straight answer to a question, and a summary of all of the most relevant resources in one text, and the option to read the individual documents.  That's not an easy feat!

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.