January 13, 2009

Clickstream spam detected

Clickstream analysis is a basic metric used to determine how much traffic comes to a site, and some analysts also use it to gauge the quality of that traffic.  More research is being done into clickstream analysis because the data is littered with noise, has very high dimensionality, and is warped by 3rd party systems, amongst other things.  Furthermore, this data can be used more effectively when users' sessions are split into categories.

Here I look at one paper from AIRWeb by Microsoft Research people.  It's interesting because it highlights the issues that search engines have with automated ranking systems and other automated bots, and it shows how these can be filtered out of the engine's clickstream analysis, which the engine can then use for ranking documents.

In "A Large-scale Study of Automated Web Search Traffic" Buehrer, Stokes and Chellapilla found that 3rd party systems which interact with search engines are a nuisance because they make it hard to pick out human queries.  3rd party systems (like rank checking software for example) access the search engines to check ranks, augment online games or maliciously alter click-through rates.  They have devised the basis for a query-stream analyser.  I'm sure we can all see how useful this type of system would be.  

Interestingly: "One study suggested that 85% of all email spam, which constitutes well more than half of all email, is generated by only 6 botnets"

They say the problem with web spam is that "a high number of automatically generated web pages can be employed to redirect static rank to a small set of paid sites".

Some rank checkers perform about 4,500 queries per day - far more than a human would.  This means there is search result latency for the user and that the engine can't improve its quality of service.  Some engines treat clickthrough rate as implicit feedback for the relevance of a URL, so this bad data is a real hindrance for them.  This is why I think this type of variable is not useful in ranking: it's too easily manipulated.  As they say, "an SEO could easily generate a bot to click on his clients' URLs".  This is click-fraud.
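To make that concrete, here's a minimal sketch of the simplest possible volume check: counting queries per source per day and flagging anything far above a human rate. The threshold and the log format are my own assumptions for illustration, not anything from the paper.

```python
from collections import Counter

# Hypothetical threshold: humans rarely issue more than a few hundred
# queries a day; 4,500/day (the rate quoted above) is clearly automated.
DAILY_QUERY_THRESHOLD = 500  # assumption, not from the paper

def flag_heavy_queriers(log):
    """log: iterable of (ip, date, query) tuples from a query log."""
    counts = Counter((ip, date) for ip, date, _query in log)
    return {ip for (ip, date), n in counts.items()
            if n > DAILY_QUERY_THRESHOLD}

# Example: one source issuing thousands of queries in a day gets flagged.
log = [("10.0.0.1", "2009-01-13", f"flowers {i}") for i in range(4500)]
log += [("10.0.0.2", "2009-01-13", "best flowers in san francisco")]
print(flag_heavy_queriers(log))  # {'10.0.0.1'}
```

Of course a real system needs far more than a raw count (shared proxies put many humans behind one IP), which is exactly why the paper looks at richer features below.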

They note that Click Forensics found that search engine ads experience a fraud rate of 28.3%.  This paper, however, focuses on organic results only.

The paper walks through six example bots:

- Bot 1 "rarely clicks, often has many queries, and most words have high correlation with typical spam".
- Bot 2 had similar characteristics to the 1st but searched for financial terms (it could search for any topic really).  The queries for this bot revolve around the keywords any SEO would have pinpointed, to be honest.
- Bot 3 tried to boost search engine rank; it searches for various URLs.
- Bot 4 has an unnatural query pattern because it looks for single words rather than the 3-4 terms usually entered by users.  This bot searched for financial news related to specific companies (clearly online reputation management).
- Bot 5 sends queries from loads of cities within a short period of time, never clicks on anything, and uses NEXT a lot - they did take mobile devices into consideration though.
- Bot 6 searches for the same terms over and over again across the day, typically to boost rankings.

They say that a possible motive for a high click rate is:

"For example, if a user queries the index for “best flowers in San Francisco” and then scrapes the html of the top 1,000 impressions, he can find the most common keywords in those pages, their titles, etc. and incorporate them into his own site."

There are basically 3 main types of bots: those that never click on links, those that click on every link, and those that click only on targeted links.
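As a rough illustration (my own sketch, not the paper's method), you could bucket a session into those three types from its click ratio alone:

```python
def bot_click_profile(num_queries, num_clicks):
    """Crude bucketing of a session by click behaviour.
    Thresholds are illustrative assumptions, not from the paper."""
    if num_queries == 0:
        return "empty session"
    ratio = num_clicks / num_queries
    if ratio == 0.0:
        return "no-click bot"    # e.g. rank scrapers just reading the SERP
    if ratio >= 1.0:
        return "click-all bot"   # clicks every result it is shown
    return "targeted clicker"    # clicks a chosen subset, e.g. client URLs

print(bot_click_profile(4500, 0))    # no-click bot
print(bot_click_profile(100, 100))   # click-all bot
print(bot_click_profile(100, 7))     # targeted clicker
```

The catch is that ordinary humans also land in the "targeted clicker" bucket, so click ratio alone can't separate them from bots - hence the extra signals below.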

The signals they added to the click-through data analysis were (a feature-extraction sketch follows the list):

- Actual clicks & the number of queries issued in a day
- Alphabetical searches
- Spam terms (e.g. viagra)
- Blacklisted IPs, particular country codes and blacklisted user-agents
- Rare queries used often
- Low-probability query pairs
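Here's a minimal sketch of what extracting a few of those signals from a session might look like. The session format, the spam list, and the blacklists are all placeholder assumptions for illustration:

```python
# Placeholder lists for illustration; a real system would use curated data.
SPAM_TERMS = {"viagra", "casino", "pills"}
BLACKLISTED_IPS = {"10.0.0.1"}
BLACKLISTED_AGENTS = {"libwww-perl"}

def session_features(session):
    """session: dict with 'ip', 'user_agent', and 'queries'
    (a list of (query_string, clicked) pairs for one day)."""
    queries = [q for q, _ in session["queries"]]
    words = [w for q in queries for w in q.lower().split()]
    return {
        "num_queries": len(queries),
        "num_clicks": sum(1 for _, clicked in session["queries"] if clicked),
        # Queries issued in alphabetical order hint at a list being replayed.
        "alphabetical": len(queries) > 2 and queries == sorted(queries),
        "spam_term_ratio": (sum(w in SPAM_TERMS for w in words) / len(words)
                            if words else 0.0),
        "blacklisted_ip": session["ip"] in BLACKLISTED_IPS,
        "blacklisted_agent": session["user_agent"] in BLACKLISTED_AGENTS,
    }
```

Each session then becomes a feature vector that a standard classifier can consume.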

They used Weka (a great open source machine learning toolkit) and achieved high accuracy.  The classifiers used were Bayes Net, Naive Bayes, AdaBoost, Bagging, ADTree and PART.  All produced accuracy higher than 90%.  They're now furthering their research and working on new data sets.
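The paper's experiments ran in Weka (a Java toolkit); purely as an illustration of the same idea, here's a tiny Naive Bayes sketch in Python with scikit-learn, trained on feature vectors like the ones above. The toy data and labels are entirely invented:

```python
from sklearn.naive_bayes import GaussianNB

# Toy feature vectors: [num_queries, click_ratio, spam_term_ratio]
# Labels: 1 = bot, 0 = human. Invented for illustration only.
X = [
    [4500, 0.00, 0.40],  # heavy no-click querier with spammy terms
    [3000, 0.00, 0.10],
    [12,   0.50, 0.00],  # light session with normal clicking
    [8,    0.40, 0.00],
]
y = [1, 1, 0, 0]

clf = GaussianNB().fit(X, y)
print(clf.predict([[2000, 0.0, 0.2]]))  # -> [1], i.e. classified as a bot
```

In practice the labelled training data is the hard part; the paper's contribution is as much the feature set as the choice of classifier.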

Why should you care?

This is interesting because Google has banned some automated ranking tools in the past, and this research does kinda suggest that the spam these programs produce could simply not be counted in the analysis.  The thing is that hitting the servers so often does affect the search engine's performance, and this is bad for users.  I think we can expect to see these kinds of systems suffer further in the future, but as I have previously said, and others have too, rankings aren't the be all and end all.  There's a lot else to consider when measuring site performance.

Yes, I've used rank checking software in the past like everyone else, but when I wear my computer scientist hat I see them as evil because of the damage they do to systems, and I want to eradicate them.  This goes for all the other bots too.

2 comments:

Greg Linden said...

Hi, CJ, I think you might have the papers mixed up here. The paper you referred to, "Characterizing Typical and Atypical User Sessions in Clickstreams", is by some folks at Yahoo Research.

For Buehrer et al., I think you probably mean "A Large-scale Study of Automated Web Search Traffic". That is available at

http://research.microsoft.com/apps/pubs/default.aspx?id=69505

I agree, it is a fun and very interesting paper (but, full disclosure, Greg and Kumar sit one door away from me at Microsoft Live Labs, so I may be biased).

CJ said...

Duh! You are so completely right! I was writing 2 at the same time, thank you so much for pointing that one out! What a big mistake!

It is a great paper, be sure to tell them :)

CJ

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.