Science for SEO: Advances in IE for the Web

This article was published in Communications of the ACM and was written by Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. It's freely available and you can read the whole issue here.

Google usually gives you way too many documents when you're searching with a very simple query and, as the authors say, it doesn't let you make very advanced searches, like finding all the people who published at a conference and listing them by geographical location. In fact the "Advanced search" function only allows very basic operations. They say that the time has come for systems to sift through all the information for you and deliver an answer to your query. I obviously agree since this is the area I work in :)

They discuss a range of information extraction (IE) methods that are "open", meaning that the identities of the relations to be extracted are not known in advance and that the mountains of web documents need highly scalable processing. (Open domains are exceptionally hard to test on, so usually you test on a "closed domain", which is far more structured and easier to get good results from. The second step is extending the method to work in an "open" domain.)

What's an IE system composed of?

The extractor finds entities and the relationships between them, and you can represent these in RDF (semantic web) or another formal language. You need an enormous amount of knowledge to do this, and this can be obtained from a ready-made knowledge base built through supervised or unsupervised machine learning methods.
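To make that a bit more concrete, here is a rough sketch of what an extractor's output could look like if you hold it as plain triples rather than RDF. The `Triple` class, its field names and the example facts are my own illustration, not something taken from the article.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact: (entity, relation, entity). Hypothetical structure."""
    subject: str
    relation: str
    obj: str


def index_by_relation(triples):
    """Group extracted triples by relation so they can be looked up later."""
    index = defaultdict(list)
    for t in triples:
        index[t.relation].append((t.subject, t.obj))
    return dict(index)


# Hand-written example extractions, for illustration only.
facts = [
    Triple("Edison", "invented", "the phonograph"),
    Triple("Bell", "invented", "the telephone"),
    Triple("Seattle", "is located in", "Washington"),
]
by_relation = index_by_relation(facts)
print(by_relation["invented"])
```

Grouping by relation is just one possible design. A real system would more likely push the triples into a proper store or index.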

IE methods:

- Knowledge-based methods:

These rely on pattern-matching with human-made rules constructed for each domain. Semantic classes are applied and relationships are identified between concepts. However, this is obviously not scalable (I can guarantee you that, as my own system is KB-based).
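Here is a minimal sketch of why hand-made rules don't scale. The two regular expressions are rules I made up for a toy "company acquisitions" domain. They work for that one relation and nothing else, so every new relation or phrasing means writing another rule by hand.

```python
import re

# Hand-written, domain-specific rules (hypothetical, for illustration).
# Each rule pairs a regex with the relation name it is meant to capture.
RULES = [
    (re.compile(r"(?P<a>[A-Z]\w+) acquired (?P<b>[A-Z]\w+)"), "acquired"),
    (re.compile(r"(?P<b>[A-Z]\w+) was bought by (?P<a>[A-Z]\w+)"), "acquired"),
]


def extract_with_rules(sentence):
    """Apply every hand-made rule and return (entity, relation, entity) tuples."""
    found = []
    for pattern, relation in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group("a"), relation, match.group("b")))
    return found


print(extract_with_rules("Google acquired YouTube in 2006."))
print(extract_with_rules("YouTube was bought by Google."))
# No rule covers this relation, so nothing is extracted:
print(extract_with_rules("Edison invented the phonograph."))
```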

- Supervised methods:

These learn an extractor from a training set which has been tagged by humans. The system uses a domain-independent architecture and sentence analyser. Patterns are learned automatically this way and the machine can find facts in texts. Getting training data is the problem. Snowball and similar systems addressed this issue by reducing the manual labour needed for relation-specific extraction. More recent work uses Markov models, for example.

- Unsupervised methods:

These label their own training data using a small set of domain-independent extraction patterns. KnowItAll was the first system to do this, extracting from web pages in an unsupervised, large-scale and domain-independent way. It bootstraps its learning process. Very, very basically (see the article for loads more detail), the rules were applied to web pages found via search engine queries and the extractions were assigned a probability. Later, frequency statistics were added. It uses the labelled data to build classifiers.
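The self-labelling idea above can be sketched very loosely like this. The "such as" template, the toy pages and the scoring formula are all my own assumptions for illustration. In particular, the score (pattern matches divided by total mentions) is only a crude stand-in for the probability the real system assigned, and the toy pages stand in for pages fetched through search engine queries.

```python
import re


def label_candidates(class_plural, pages):
    """Instantiate a generic pattern for a class and collect candidate instances.

    The "<class> such as <Name>" template is an assumed example of a
    domain-independent pattern; no human-tagged data is involved.
    """
    pattern = re.compile(rf"{re.escape(class_plural)} such as ([A-Z]\w+)")
    counts = {}
    for page in pages:
        for name in pattern.findall(page):
            counts[name] = counts.get(name, 0) + 1
    return counts


def score_candidates(counts, pages):
    """Score each candidate: matches with the pattern / all mentions of the name."""
    scores = {}
    for name, with_pattern in counts.items():
        total = sum(page.count(name) for page in pages)
        scores[name] = with_pattern / total
    return scores


# Toy "web pages", written by hand for this sketch.
pages = [
    "We visited cities such as Paris last year.",
    "Large cities such as Paris and cities such as Berlin are expensive.",
    "One page said cities such as Monday by mistake. Monday was busy. "
    "See you Monday. Monday again.",
]
counts = label_candidates("cities", pages)
scores = score_candidates(counts, pages)
print(scores)  # "Monday" scores much lower than "Paris" or "Berlin"
```

The point is only that a noisy candidate which mostly appears outside the pattern ends up with a low score, so it can be filtered out without anyone labelling anything.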

Yep - next is Wikipedia, which I think we all take a bit for granted. The Intelligence in Wikipedia Project (IWP) also uses unsupervised training for its extractors, bootstrapping from the Wikipedia corpus. The cool thing about using Wikipedia as a corpus, as many have figured out, is that it's nicely structured. IWP is used to complement Wikipedia with additional content.

Open IE (web extraction):

The problem is that the web is huge and very unstructured. I think it's the hardest corpus ever to be tackled. These systems can, for example, learn a model of how relations are expressed, based on features like part-of-speech tags, domain-independent regular expressions and so on.

The new method for IE:

The authors analysed 500 randomly selected sentences from a training corpus.

They found that most relationships could be characterized by a set of relation-independent patterns.
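Here is a small sketch of what "relation-independent" means in practice. The two tag patterns and the tag names (`ENT`, `VERB`, `PREP`) are my own simplified assumptions, and the tags are supplied by hand, where a real system would get them from a part-of-speech tagger. Notice that the same patterns pull out different relations without any rule mentioning a specific relation.

```python
# Each sentence is a list of (text, tag) pairs. "ENT" marks a chunk that is
# already recognised as an entity. Tag names are assumptions for this sketch.
PATTERNS = {
    "E1 Verb E2": ["ENT", "VERB", "ENT"],
    "E1 Verb Prep E2": ["ENT", "VERB", "PREP", "ENT"],
}


def match_patterns(tagged):
    """Slide each tag pattern over the sentence and return matching triples."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    triples = []
    for pattern in PATTERNS.values():
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == pattern:
                # Everything between the two entities is the relation phrase.
                relation = " ".join(words[i + 1:i + n - 1])
                triples.append((words[i], relation, words[i + n - 1]))
    return triples


s1 = [("Edison", "ENT"), ("invented", "VERB"), ("the phonograph", "ENT")]
s2 = [("Einstein", "ENT"), ("moved", "VERB"), ("to", "PREP"), ("Princeton", "ENT")]
print(match_patterns(s1))  # [('Edison', 'invented', 'the phonograph')]
print(match_patterns(s2))  # [('Einstein', 'moved to', 'Princeton')]
```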

TextRunner extracts high-quality information from sentences and learns the relations, classes and entities from the corpus using its relation-independent extraction model (you can actually test it). You will find more references to Markov models here, and also find out how it trains a conditional random field. It works through the sentences linearly and extracts the triples that it thinks are important. The language on the web is very ambiguous though, which makes it notoriously difficult to deal with. I think it's important to say that TextRunner uses Lucene (a very good open source search engine - many of us owe a lot to it).
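To give a feel for the last step of a sequence-labelling extractor, here is a sketch of turning per-token labels into a triple. This is not TextRunner's code. The label names (`ENT`, `REL`, `O`) are my assumption, and the labels are written by hand here, where in a real system a trained model such as a conditional random field would predict them.

```python
def decode_triple(tokens, labels):
    """Turn a per-token label sequence into one (entity, relation, entity) triple.

    Assumed labels: "ENT" for entity tokens, "REL" for tokens belonging to the
    relation phrase, "O" for everything else. Returns None unless a relation
    phrase sits between two entities.
    """
    first_entity, relation, second_entity = [], [], []
    for token, label in zip(tokens, labels):
        if label == "ENT" and not relation:
            first_entity.append(token)
        elif label == "REL" and first_entity and not second_entity:
            relation.append(token)
        elif label == "ENT" and relation:
            second_entity.append(token)
    if first_entity and relation and second_entity:
        return (" ".join(first_entity), " ".join(relation), " ".join(second_entity))
    return None


tokens = ["Kafka", "was", "born", "in", "Prague", "."]
labels = ["ENT", "O", "REL", "REL", "ENT", "O"]
print(decode_triple(tokens, labels))  # ('Kafka', 'born in', 'Prague')
```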

They tested Open IE in collaboration with Google and found that it greatly increased precision and recall.

It can be used for IR tasks of course, but also opinion mining, product feature extraction, Q&A, fact checking, and loads of others.

Further research is already being carried out in which the system is able to reason based on facts and generalizations. They will use ontologies like WordNet (good ol' WN), Cyc, Freebase and OpenMind.

See? Web 3.0, the web of machine reasoning and information extraction, is very, very real.