
Showing posts with label microsoft research.

January 14, 2009

Microsoft's Game-Powered Search Engine

Someone dropped me this patent and I instantly loved it, because it describes a completely different solution to the problem of IR, and does so in a very entertaining way...well, obviously.  The patent was filed in 2005 and published on the 13th of January 2009.  The authors are all brilliant and renowned computer scientists from slightly varied fields.

Anyway, it's called "Game-powered search engine".

The idea is that:
- The user types in a query
- The game participants receive this query
- Responses are collected from the game participants - these can be anything from images, text, audio etc...
- The game rewards the participants with the most suitable responses

The suitability is calculated by analyzing the degree of agreement between the responses.  Agreement depends on the level of similarity.  
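The agreement idea can be sketched in a few lines. This is my own minimal illustration, assuming a simple Jaccard similarity over word sets - the patent doesn't specify which similarity measure is used:

```python
def jaccard(a, b):
    """Word-set similarity between two responses (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def agreement_scores(responses):
    """Score each response by its average similarity to all the others."""
    scores = {}
    for r in responses:
        others = [x for x in responses if x is not r]
        scores[r] = sum(jaccard(r, x) for x in others) / max(len(others), 1)
    return scores

# Two agreeing answers and one off-topic one
responses = [
    "the eiffel tower is in paris",
    "eiffel tower paris france",
    "buy cheap watches online",
]
scores = agreement_scores(responses)
best = max(scores, key=scores.get)  # the response the game would reward
```

The off-topic answer gets zero agreement from the others, so it would earn no reward - that's the filtering effect described above.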

You could think "who on earth is going to bother playing that game?".  Probably the same kind of people who answer questions in forums, on Google Answers or similar places, I imagine.  The advantage of the game system is that a machine actually checks all responses and filters them first, so you're more likely to get a correct answer.

You know, the more I think about it, the less I find it quirky and funny and the more I think it could work.  It's a bit like super-users (people who are experts at using search engines) helping out less savvy users. 

It would have to be really well marketed and introduced because there have been some human edited engines before like ChaCha that haven't won the majority over.  It would also need to be really swanky looking with a top level of usability.  Then it would need to actually give the people answering a motivation for doing so.  What do you get, points?  For what? 

January 13, 2009

Clickstream spam detected

Clickstream analysis is a basic form of metric used to determine how much traffic comes to a site, and some analysts also use it to look at the quality of that traffic.  More research is being done into clickstream analysis because the data is littered with noise, has very high dimensionality, and is warped by 3rd party systems, amongst other things.  Furthermore, this data can be used more effectively when users' sessions are split into categories.

Here I look at one paper from AIRWeb by Microsoft Research people.  It's interesting because it highlights the issues that search engines have with automated ranking systems and other automated bots, and it shows how these can be filtered out of the engine's click-stream analysis, which the engine can then use for ranking documents.

In "A Large-scale Study of Automated Web Search Traffic" Buehrer, Stokes and Chellapilla found that 3rd party systems which interact with search engines are a nuisance because they make it hard to pick out human queries.  3rd party systems (like rank checking software for example) access the search engines to check ranks, augment online games or maliciously alter click-through rates.  They have devised the basis for a query-stream analyser.  I'm sure we can all see how useful this type of system would be.  

Interestingly: "One study suggested that 85% of all email spam, which constitutes well more than half of all email, is generated by only 6 botnets"

They say the problem with web spam is that "a high number of automatically generated web pages can be employed to redirect static rank to a small set of paid sites".

Some rank checkers perform about 4,500 queries per day - far more than a human would.  This means that there is search result latency for the user and that the engine can't improve its quality of service.  Some engines treat clickthrough rate as implicit feedback on the relevance of a URL, so this bad data is a real hindrance for them.  This is why I think this type of variable in ranking is not useful: it's too easily manipulated.  As they say, "an SEO could easily generate a bot to click on his clients' URLs".  This is click-fraud.

They note that Clickforensics found that search engine ads experience a fraud rate of 28.3%. This paper however focuses on organic results only.  

The paper walks through six example bots:

- The 1st bot analysed "rarely clicks, often has many queries, and most words have high correlation with typical spam".
- The 2nd bot had similar characteristics to the 1st but searched for financial terms (it could have searched for any topic really).  Its queries revolve around the keywords any SEO would have pinpointed, to be honest.
- The 3rd bot tried to boost search engine rank, as it looks for various URLs.
- The 4th bot has an unnatural query pattern because it looks for single words rather than the 3-4 terms usually entered by users.  This bot searched for financial news related to specific companies (clearly online reputation management).
- Bot 5 sends queries from loads of cities within a short period of time, never clicks on anything, and uses NEXT a lot - they did take mobile devices into consideration though.
- Lastly, bot 6 searches for the same terms over and over during the course of the day.  This is typically done to boost rankings.

They say that a possible motive for a high click rate is:

"For example, if a user queries the index for “best flowers in San Francisco” and then scrapes the html of the top 1,000 impressions, he can find the most common keywords in those pages, their titles, etc. and incorporate them into his own site."

There are basically 3 main types of bots: those that don't click on links, those that click on every link and those that click on targeted links.

The things they added to the click through data analysis were:

- Actual clicks & the number of queries issued in a day
- Alphabetical searches
- Spam terms (viagra)
- Blacklisted IPs, particular country codes and blacklisted user-agents
- Rare queries used often
- Low probability query pairs
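As a rough illustration of how features like these might be computed per session, here's a sketch - the feature names, thresholds and spam word list are my own, not the paper's exact definitions:

```python
SPAM_TERMS = {"viagra", "casino"}  # illustrative word list, not the paper's

def session_features(queries, clicks):
    """Turn one day's session (list of query strings, number of clicks)
    into a feature dict for a bot/human classifier."""
    n_queries = len(queries)
    words = [w for q in queries for w in q.lower().split()]
    return {
        "n_queries": n_queries,            # some bots issue thousands per day
        "n_clicks": clicks,                # some bots never click at all
        "click_rate": clicks / max(n_queries, 1),
        "spam_term_rate": sum(w in SPAM_TERMS for w in words) / max(len(words), 1),
        # bots often query single words rather than the usual 3-4 terms
        "single_word_rate": sum(len(q.split()) == 1 for q in queries) / max(n_queries, 1),
    }

human = session_features(["best flowers in san francisco", "flower delivery"], clicks=2)
bot = session_features(["msft", "goog", "aapl"] * 100, clicks=0)
```

The bot session above scores zero click rate and 100% single-word queries - exactly the kind of signature bots 1 and 4 in the paper exhibit.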

They used Weka (a great open source machine learning tool) and achieved high accuracy.  The classifiers used were Bayes Net, Naive Bayes, AdaBoost, Bagging, ADTree and PART.  All produced results higher than 90%.  Now they're furthering their research and working on new data sets.
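Weka is a Java toolkit, but just to show what training one of these session classifiers involves, here's a from-scratch Gaussian Naive Bayes sketch on toy data - my own illustration, not the authors' setup or data:

```python
import math
from collections import defaultdict

def train(samples):
    """samples: list of (feature_tuple, label).  Learns per-label
    mean/variance for each feature, plus label priors."""
    by_label = defaultdict(list)
    for x, y in samples:
        by_label[y].append(x)
    model = {}
    for y, rows in by_label.items():
        n = len(rows)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n + 1e-6  # smoothed
            stats.append((mean, var))
        model[y] = (n / len(samples), stats)
    return model

def predict(model, x):
    """Pick the label with the highest log posterior."""
    def log_gauss(v, mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
    best, best_lp = None, -math.inf
    for y, (prior, stats) in model.items():
        lp = math.log(prior) + sum(log_gauss(v, m, s) for v, (m, s) in zip(x, stats))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# features: (queries per day, click rate) - toy numbers
data = [
    ((40, 0.5), "human"), ((25, 0.7), "human"), ((60, 0.4), "human"),
    ((4500, 0.0), "bot"), ((3000, 0.01), "bot"), ((5000, 0.0), "bot"),
]
model = train(data)
```

A session with 4,000 queries and no clicks classifies as a bot, while a modest query count with a healthy click rate classifies as human.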

Why should you care?

This is interesting because Google has banned some automated ranking tools in the past, and this research does kinda suggest that the spam these programs produce could simply not be counted in the analysis.  The thing is that hitting the servers so often does affect the search engine's performance, and this is bad for users.  I think we can expect to see these kinds of systems suffer further in the future, but as I (and others) have previously said, rankings aren't the be all and end all.  There's a lot else to consider when measuring site performance.

Yes, I've used rank checking software like everyone else in the past, but when I wear my computer scientist hat I see them as evil because of the damage they do to systems, and I want to eradicate them.  This goes for all the other bots too.

November 11, 2008

About machine translation

Machine translation (MT) is all about translating text (or even speech) from one language to another.  It's part of computational linguistics, and uses a lot of NLP methods as well as statistical methods, rule-based methods, corpus techniques, and some AI, amongst other things.  The idea apparently dates back to the 17th century, and in the 1950s the Georgetown experiment went on, but it didn't really work, so funding was heavily reduced and a lot of research in this area was terminated.  In the 1980s it made a comeback.

It's important today, in the age of the Internet, because a lot of data is in different languages, and when we can't understand another language, we are deprived of what may be the most relevant content for our query.

First you have to pull apart the source text to make sense of it, and then you have to re-engineer it into the target language so it makes perfect sense to a target language reader.  Not only do you have to understand all the grammatical elements, the syntax, the idioms, the semantics, and so on, you also have to have a good grasp of the culture associated with the target language.  

Different systems use different approaches; here is a brief description of each:

Rule-based systems:
A rule-based system is basically made up of a load of rules relating to translation between the two languages.  It can use a dictionary and map to that.  You can use a parallel corpus to find those rules, which means that you map between ready-made translations and pick out the common patterns, then feed these into a machine.  I did this and it wasn't very precise.  Google used SYSTRAN, a rule-based system, for many years.
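To make the idea concrete, here's a toy rule-based translator: a tiny bilingual dictionary plus one reordering rule (French adjectives usually follow the noun).  The lexicon and rule are illustrative only - real systems like SYSTRAN encode thousands of rules:

```python
# Toy English-to-French lexicon and adjective list (illustrative only)
LEXICON = {"the": "la", "red": "rouge", "car": "voiture"}
ADJECTIVES = {"red"}

def translate(sentence):
    """Dictionary lookup plus one reordering rule: in the target
    language, adjectives follow the noun they modify."""
    words = sentence.lower().split()
    reordered = []
    i = 0
    while i < len(words):
        # rule: swap adjective-noun pairs into noun-adjective order
        if i + 1 < len(words) and words[i] in ADJECTIVES:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    return " ".join(LEXICON.get(w, w) for w in reordered)
```

So "the red car" comes out as "la voiture rouge".  The brittleness is obvious: every new word and every new grammatical pattern needs another dictionary entry or rule, which is exactly why I found this approach imprecise.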

Statistical methods:
Google Translate now works with these.  It involves generating a load of statistics derived from a large corpus.  The problem is finding a very large corpus.  This isn't too much of a problem for Google, though not very many corpora exist; even Google used the United Nations corpus to add 200 billion words to its system.  These are used to train the system.
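Here's a sketch of where those statistics come from: co-occurrence counts over a tiny toy parallel corpus.  Real systems train proper alignment models (the IBM models) with EM over millions of sentence pairs; this only shows the raw counting step:

```python
from collections import defaultdict

# Tiny toy parallel corpus (illustrative only)
parallel = [
    ("red house", "maison rouge"),
    ("red car", "voiture rouge"),
    ("big house", "grande maison"),
]

# Count how often each source word co-occurs with each target word
counts = defaultdict(lambda: defaultdict(int))
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            counts[s][t] += 1

def best_translation(word):
    """Most frequently co-occurring target word; unknown words pass through."""
    if word not in counts:
        return word
    return max(counts[word], key=counts[word].get)
```

Even on three sentence pairs, "house" pairs most often with "maison" and "red" with "rouge".  Raw co-occurrence breaks down quickly though (frequent function words swamp everything), which is why real systems learn word alignments rather than using bare counts.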

The main issues:
Word disambiguation is very difficult.  This is when words can have more than one meaning.  Google doesn't do so well in this area.  There are 2 methods that are known to deal with this: the shallow approach (looking at surrounding words and drawing statistical information from them), and the deep approach (providing a comprehensive definition to the system).  The deep approach takes a lot of time and isn't so precise, so statistical methods tend to do better.
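The shallow approach can be sketched like this - the ambiguous word and its sense cue lists are my own illustrative choices:

```python
# Each sense of an ambiguous word gets a set of cue words that tend
# to appear around it (illustrative lists, not from a real lexicon)
SENSES = {
    "bank": {
        "finance": {"money", "account", "loan", "deposit"},
        "river": {"water", "fishing", "shore", "boat"},
    }
}

def disambiguate(word, context):
    """Shallow WSD: pick the sense whose cue words overlap most
    with the words surrounding the ambiguous word."""
    ctx = set(context.lower().split())
    scores = {sense: len(ctx & cues) for sense, cues in SENSES[word].items()}
    return max(scores, key=scores.get)
```

"He opened an account at the bank to deposit money" resolves to the finance sense, while "we went fishing on the bank of the river" resolves to the river sense - no definitions needed, just the surrounding words.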

Consider this for example: "Cleaning fluids can be dangerous" - does it mean that the act of cleaning fluids IS dangerous, or that fluids used for cleaning ARE dangerous?

There are so many difficult issues in handling language anyway, seeing as it requires natural language understanding, which is far from solved right now.  There is a lot of research going on though, and eventually machine translation will work, but I'm not so sure how soon that will be.

What does it mean for SEO?
Well, your keywords and your content are going to look a lot different in other languages, and the text may also be modified and re-written in places.  This means that you have a lot less control over how these pages rank in other languages.  The solution?  Maybe it would be worth having multi-lingual staff :)


Read more here, from the University of Essex.  
There's also good information at Microsoft research (MT labs).
John Hutchins is a great source of information.
And check out Carnegie Mellon University MT labs too.
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.