
January 30, 2009

TGIF - wicked!

Hi all, welcome to another TGIF post.  I hope that you have had a chilled out week and that you have enjoyed working and got time to have a bit of fun, if your work isn't also your source of fun.  This week I have been experiencing Sydney for the 1st time and mostly working unfortunately but I also made it to the beach and swam in the pool enough times to balance all that out :)

It was pointed out to me that this post is basically "what I wasted time on this week" - fair point.

Without further ado...

Stuff I liked this week:

The list of top 10 complaint letters courtesy of The Telegraph.

Levitated have some super cool pictures of abstract things which I really like (fractals and stuff).

The 100m long picture depicting people walking on Warschauer strasse railroad bridge - shot over 20 days.

Quotes:

"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind". (Lord Kelvin) [but I don't agree :)]

"Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration". (S.Kelly-Bootle)

"It's hard enough to find an error in your code when you're looking for it; it's even harder when you've assumed your code is error-free". (S. McConnell)

"As a rule, software systems do not work well until they have been used, and have failed repeatedly, in real applications". (D. Parnas)

Facts:

Every second around 100 lightning bolts strike the Earth.

Every year lightning kills 1000 people.

If you could drive your car straight up you would arrive in space in just over an hour.

On the day that Alexander Graham Bell was buried the entire US telephone system was shut down for 1 minute in tribute.

The only letter not appearing on the Periodic Table is the letter “J”.

The microwave was invented after a researcher walked by a radar tube and a chocolate bar melted in his pocket.

Presenting the "Round orbita mouse" - it's all round, it's wireless, it spins around...


Effective Query Log Anonymization

Check out this very good Google tech talk about using query logs:

"User search query logs have proven to be very useful, but have vast potential for misuse. Several incidents have shown that simple removal of identifiers is insufficient to protect the identity of users. Publishing such inadequately anonymized data can cause severe breach of privacy. While significant effort has been expended on coming up with anonymity models and techniques for microdata/relational data, there is little corresponding work for query log data -- which is different in several important aspects. In this work, we take a first cut at tackling this problem. Our main contribution is to define effective anonymization models for query log data, along with techniques to achieve such anonymization. "

Heat diffusion for Social net marketing

The paper we look at here is called "Mining Social Networks Using Heat Diffusion Processes for Marketing Candidates Selection" and is by Yang, Liu and King from The Chinese University of Hong Kong.

Companies have started using social networks more and more for word-of-mouth (WOM) promotion, to increase brand awareness, attract potential clients and so on.  This is openly apparent in Facebook pages, Twitter, collaborative filtering, blogs and many many others.  This paper presents a model enabling marketers to make the best use of these networks using the "Heat Diffusion Process" (which is an idea borrowed from the field of physics).  They present 3 models and 3 algorithms that allow marketing candidates to be selected.

In physics the heat diffusion model states that heat flows from a position of high temperature to one of low temperature.  

They show that these methods allow us to select the best marketing candidates using the clustering properties of social networks, the planning of a marketing strategy sequentially in time, and they construct a model that can diffuse negative and positive comments about products and brands to simulate the complex discussions within social networks.  They want to use their work to help marketeers and companies to defend themselves against negative comments.  The idea is to get a subset of individuals to adopt a new product or service based on a potential network of customers. 

The heat diffusion model has a time-dependent property which means that it can simulate product adoptions step by step.  The selection algorithms can represent the clustering coefficient of real social networks.  All users of social networks can diffuse comments that can influence other users. Based on this they say that nodes 1 and 2 represent adopters and the heat reaches nodes 3, 4 and 5 as time elapses.  The users in the trust circle of other users have a greater influence.  Not all of the people in the trust circle however will be contacted about the heat source.  Also some users are more active than others in diffusing information.  They observe that bad news or negative comments diffuse much faster than other news.

Individuals are selected as seeds for heat diffusion.  The influence of individuals is based on the number of individuals they influence. The heat vector they construct decides on the amount of heat needed for each source.  In order for them to diffuse properly they need a lot of heat.  Thermal conductivity is then calculated.  It sets the heat diffusion rate.  The adoption threshold is then set: if a consumer's heat value is higher than the threshold, they are likely to adopt the product.
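To make the diffusion step a bit more concrete, here is a minimal sketch of heat spreading over a small network.  It is not the authors' algorithm, just a plain discrete-time diffusion under my own assumptions: the graph, the `alpha` value (standing in for thermal conductivity), the number of steps and the adoption threshold are all made up for illustration.

```python
def diffuse(graph, heat, alpha=0.1, steps=3):
    """Discrete heat diffusion: at each step heat flows along every edge
    from the hotter node to the cooler one, at a rate set by alpha."""
    heat = dict(heat)
    for _ in range(steps):
        updated = {}
        for node, neighbours in graph.items():
            # Net heat received from (or lost to) all neighbours.
            flow = sum(heat[n] - heat[node] for n in neighbours)
            updated[node] = heat[node] + alpha * flow
        heat = updated
    return heat


# Hypothetical undirected network: nodes 1 and 2 are the seed adopters,
# nodes 3, 4 and 5 receive heat as time elapses.
graph = {1: [3, 4], 2: [4, 5], 3: [1, 4], 4: [1, 2, 3], 5: [2]}
initial = {1: 10.0, 2: 10.0, 3: 0.0, 4: 0.0, 5: 0.0}

final = diffuse(graph, initial, alpha=0.1, steps=3)

threshold = 3.0  # assumed adoption threshold
adopters = sorted(node for node, h in final.items() if h > threshold)
```

With these made-up numbers, node 4 (which is connected to both seeds) crosses the threshold after three steps while nodes 3 and 5 do not yet.  The total amount of heat stays the same throughout, it just spreads out.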

If a user doesn't like the product, s/he is allocated negative heat as they will diffuse negative comments.  At some point someone in the network will provide different information which might be positive.  If that user adopts the product anyway, they will diffuse positive comments.  Two defense candidates are then selected and then the negative impact is alleviated.  

They conclude:

"So far, our work considers social network as a static network only, and ignores newcomers, new relationships between existing members and the growth of the network’s size. In the future, we plan to consider the evolution property of social networks, and permit our social network to grow at a certain rate"   

Why should you care?

This paper shows a new and very different way of analyzing social networks.  It gives you a nice opening to discussions concerning this in a different light.  The current methods used in business are not foolproof and such research shows us how simplistic they are and how they can actually be misleading.

The RankMass crawler

The paper entitled "RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee" (Cho, Schonfeld, University of California) deals with the important topic of how many pages should be collected to cover most of the web, and how to ensure that important documents are not left out when the crawl is halted.  They answered these questions by showing that crawling only the most important part of the web limits the amount that needs to be downloaded.  This is important because a large corpus is expensive to house and the computing cost is high.

They state that comparing search engines by the number of pages they index is misleading because of the huge amount of data available on the web.  They say that for example calendar pages generated by dynamic sites have links to "next day" and so on meaning that potentially useless information is collected. Additionally no search engine is capable of downloading the entire web; we don't even know how big it is.  So when should they stop the crawl?  At 8 billion pages like Google?

Their method, RankMass, is a metric for comparing the quality of search engine indexes and is a variant of the Personalised PageRank metric which assumes that users go to only important pages.  They change this a little bit because they look at how many of the important documents the set covers.  The coverage measurement involves looking at pages in the subset which are not known.  Their crawler is focused on the pages users go to.  It can prioritize the crawl by downloading pages with high personalised PageRank first, and so the highest RankMass is achieved when the crawl is over.

How are pages deemed important?

This could as they say be based on the relevance to the queries but this would mean having a set of queries to start with which isn't ideal.  They say that PageRank with its random surfer model is very effective but as we have seen it can be easily spammed (by webmasters and SEO people I presume!).  Personalised PageRank assumes that a user eventually jumps to a trusted site rather than to any page with equal probability.

The RankMass metric is based on the link structure of the whole web rather than just the graph structure in the subset, meaning that they don't have to download a huge amount.  They download all the pages that are reachable from neighbouring sites of a page and calculate RankMass, however they note that users are unlikely to go only to one trusted page.  This method however is greedy so they adapted it to form the "Windowed-RankMass algorithm".

"The Windowed-RankMass algorithm is an adaptation of the RankMass algorithm and is designed to allow us to reduce the overhead by batching together sets of probability calculations and downloading sets of pages at a time."  

The starting point is referred to as a set of seeds which are a number of documents which form the starting point of the crawl:

"Deciding on the number of seeds is influenced by many factors such as: the connectivity of the web, spammability of the search engine, efficiency of the crawl, and many other factors".

The result of their evaluation and experiments showed that their RankMass metric, that maximizes the PageRank of every page, is very effective.  It allows search engines to specify the end of the crawl based on specific conditions.  This means that the crawl runs until the required percent of the web's PageRank is collected.
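Here is a rough sketch of the idea of stopping a crawl once enough PageRank has been collected.  Big caveat: the real crawler does not know the whole link graph in advance and has to work with bounds computed while it crawls; this toy version assumes the full graph is known, which is my simplification.  The six-page web, the damping factor and the 0.8 target are all invented for illustration.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=100):
    """Power iteration where the random jump always lands on a trusted
    page instead of on a uniformly chosen page."""
    pages = list(links)
    jump = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: send its rank back to the trusted set.
                for p in pages:
                    new_rank[p] += damping * rank[page] * jump[p]
        rank = new_rank
    return rank


def crawl_until(rank, target_mass):
    """Take pages in decreasing rank order until the share of rank
    collected so far (the mass of the set) reaches target_mass."""
    collected, mass = [], 0.0
    for page in sorted(rank, key=rank.get, reverse=True):
        if mass >= target_mass:
            break
        collected.append(page)
        mass += rank[page]
    return collected, mass


# Hypothetical six-page web; "a" is the only trusted seed.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
    "e": ["d"],
    "f": [],
}
rank = personalized_pagerank(links, trusted={"a"})
pages, mass = crawl_until(rank, target_mass=0.8)
```

In this toy web two of the six pages already cover more than 80% of the personalised PageRank, and pages "d", "e" and "f" (which a surfer starting from the trusted page never reaches) contribute nothing, so there is no point downloading them.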

Why should you care?

This interesting paper shows that search engines can have smaller sized indices and still prove very effective.  This should improve both precision and recall, due to the fact that there are fewer unimportant documents that are considered in the computation stages.  The constant talk and wow-factor associated with the huge size of indices are shown to be rather irrelevant really when you consider the actual quality of these indices.  The bigger the index the harder it is to manage.

You should care because the quality of your sites becomes crucial.  Not only the way that they are built but also their ability to attract users based on how important they are seen to be.

January 29, 2009

"I won't adopt the semantic web!"

I've heard variants of this for quite a long time, in fact since the semantic web thing became mainstream.  It's never easy to introduce something new, as novel applications designers will know, and users don't want to learn something new, however easy it is to pick up.  The semantic web is suffering from this too, in fact many webmasters don't want to adopt it saying it's too much work or they can't see why it's important etc...

Well lo and behold (you know me by now) I found a paper which addresses this issue in a very intelligent way and answers quite a few questions.  It's called "What is an analogue for the semantic web and why is having one important?" by mc schraefel from the University of Southampton (which regularly produces cool papers).

First off, an "analogue" in this sense is something that has a direct resemblance to something else.  For example print (books and things) is analogous to the web because a web page is like a page from a book or newspaper, manual and so on.  The central topic of this paper is about finding something analogous to help people understand the semantic web so it's not so foreign and scary or whatever.

The author says that the web has been represented by Pages + Links and proposes that the semantic web be represented by Notebook + Memex.

The Memex was invented by Vannevar Bush in 1945.  It's the concept of an online library, an interconnected knowledge-base.

He points out that the semantic web offers a very powerful way to interact with information, and to build new interactions for that information.   He does, as I do, believe that the entire issue concerning the semantic web and its acceptance is due to the research community not communicating it properly.

"It is important to note that the motivation for this question of analogue is not a marketing/packaging question to help sell the Semantic Web, but is simply a matter of fundamental importance in any research space: it is critical to have both a shared and sharable understanding of a (potentially new) paradigm. If we do not have such a shared understanding, we cannot interrogate the paradigm for either its technical or, perhaps especially, its social goals."

He notes that all web 2.0 stuff has been based on highly familiar models, as RSS for example is still text and the idea of tag clouds and tagging is still displayed just like a catalogue.  The understanding of the web when it was 5 years old is very different to the understanding of the semantic web which has just turned 5.  He also very nicely points out that the Wikipedia entry for the semantic web is a bit rubbish: "All that description tells anyone about the semantic Web is that it is for Machines."   The emphasis as far as semantic web researchers are concerned is the end user.  It is all about creating powerful links in information so that question-answering is made possible and a whole host of other knowledge discovery methods.

"But how do we describe this potential? For a community steeped in rich link models, Hypertext is an obvious conceptualization. But beyond this community, Hypertext equals “a page with links” – it equals the current Web, not the rich possibility of what we might call Real Hypertext, which was modeled in Note Cards and Microcosm."

The hyperlink in the semantic web can be thought of as "meaning" which is one of the hard to grasp concepts.  "That is, the way meaning is communicated that is not via the explicit prose page or catalogue page, but is via the exposure of the ways in which data is associated, and can be discovered, by direct semantic association, for the reader/interactor/explorer to make meaning."

The author says that the notepad is a good way of introducing the semantic web because it is a page but is unstructured.  Calendars and things can be shared, and things like Twitter allow users to post snippets.  "One item can act as a way of redefining another".  Google docs and various snippet keepers on the web (like the now defunct Google notebook) mean that data is generated and shared and linked in too.  The memex is good too because it is designed to retrieve data and not "denature it"; by this we mean that it isn't taken out of context.

The semantic web languages like RDF mean that everything can be connected and retrieved intelligently and effectively.  It is "automatic structure extraction" which means that you can look at all of your information in the context it was created or saved in.  The data can also be associated to other relevant information too.  For example you could find other people talking about the same topic, or working on the same kind of project, events you might be interested in...
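If you have never seen what "everything can be connected" looks like in practice, here is a tiny sketch.  RDF describes data as subject-predicate-object triples; the code below just uses plain Python tuples rather than a real RDF library, and every identifier in it (`note:42`, `about`, `createdBy` and so on) is invented for illustration.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All names below are made up for this example.
triples = [
    ("note:42", "about", "topic:summarization"),
    ("note:42", "createdBy", "person:alice"),
    ("paper:7", "about", "topic:summarization"),
    ("paper:7", "createdBy", "person:bob"),
    ("event:3", "about", "topic:summarization"),
    ("note:99", "about", "topic:crawling"),
]


def match(pattern, store):
    """Return every triple matching the pattern; None is a wildcard."""
    return [
        triple for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def related_to(item, store):
    """Find other items that share an "about" topic with the given item."""
    topics = {obj for _, _, obj in match((item, "about", None), store)}
    return sorted(
        subj for subj, _, obj in match((None, "about", None), store)
        if obj in topics and subj != item
    )


others = related_to("note:42", triples)
```

Starting from one of my own notes, the query walks the shared topic and comes back with a paper and an event on the same subject, which is the "find other people talking about the same topic" scenario in miniature.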

The author observes that the notepad model has limitations like the structure and also "viewing page 6 next to page 36" isn't easy.  He suggests the note card model which is a stack of cards with ideas, people, data about stuff written on them and interlinked to external data.  "The relevance of the note card model to the concept of the Semantic Web as personal work space with associated public data is in the integration of personal ideas with external sources: the idea cards are backed up with/informed by the quotations from external sources."

He concludes asking if we are ready for a system to support such creativity.  This means that computers are no longer simply used for productivity.  He does believe that we are.  I also believe that we are.  More importantly I believe that the semantic web is a foundation for far more intelligent systems.  We need to experiment with this model and develop it properly in order to advance.

Obviously this kind of system can't be implemented by humans completely and needs to be automated, which it is really.  There are all sorts of programs out there that will create stuff for you.

Why should you care?

If you don't embrace the semantic web it is likely that you will be left behind because it is a necessary step in web evolution.  It will become widespread and it is important to be prepped for it.  It should have your support because it is an ingenious and very powerful way of dealing with the ever growing mass of data available.

January 28, 2009

Information credibility analysis

I wanted to draw a little attention to a Japanese project called the "Information Credibility Criteria Project".  The NICT (National Institute of Information and Communications Technology) started it in 2006. 

This project is all about looking at how information sources are not all equal in that they are written by different people who...are also not equal!  If you write a post about banana skins and their use in cancer treatment, unless you're a researcher in this area, your post isn't 100% credible.  If you are a researcher in that area, then it is more credible because you have the proven expertise to write about such a thing.  They don't exclusively look at writers and their authority but also at other criteria, that are not easy to determine automatically:

Credibility of information contents
They use predicate argument structures rather than words for this analysis and use "automatic synonymous expression acquisition" to deal with synonymous expressions.  The sentences in the documents are classified into opinions, events and facts.  Opinions are classified into positive and negative ones.  An ontology is produced dynamically for each given topic which helps the user interface with the data.

There are a lot of different variables that come into play when we look at the credibility of a document.  The grammar, syntax and accurateness of the data presented are all strong variables when I generally look at a blog post or a website.   

Credibility of information sender
They classify writers into individuals or organisations, celebrities or intellectuals, real name or alias and many more groupings. This information is gleaned from meta-information but they also use NLP techniques for this.  The credibility evaluation is based on the quantity and quality of the information the sender has produced so far.

Credibility estimated from document style and superficial characteristics
They take into account whether a document is written in a formal or informal way, what kind of language is being used, how sophisticated the layout is and other such criteria.

Credibility based on social evaluation of information contents/sender
This is based on how the sender is viewed by others.  They use opinion mining from the web based on NLP or using existing rankings or evaluations available.

The research can be applied to all areas of electronic information access, such as email, web docs, desktop docs...  The idea is not to replace the human but to support the human in his/her judgment of an information source.

Document credibility is an area that I believe is very important for the future of the web.  We can rank documents in a sequence, as Google does for example, based on their relevance to the initial user query.  Google looks at authority as well, and also at the content, and other factors too.  The problem though is that without a thorough analysis like the one being devised by NICT there are documents that are perhaps not as important finding themselves at the top of the rankings for example.  

Looking at things like author authority rather than simply document authority is useful obviously but if this isn't flexible enough then good relevant documents could be omitted.  Someone who has never written anything before will not I assume be considered to be very authoritative, and someone who has written a lot of bad content shoots themselves in the foot for all their future work!  It therefore becomes important to have a certain standing on the web or rather in the information community.  If you are not considered very influential, then your work might not be considered influential also.

Obviously there is a lot more research to be done here and I really look forward to reading a lot more about it.  You can check the publications page if you want to read more about the work that NICT has been doing since 2007.  

Why should you care?

If this type of method works perfectly, you will need to not only provide good content but also have a good reputation.  This is applicable both to companies and individuals.  By finding out about the author in particular and taking that into account for overall document scoring an engine could wipe a good deal of spam but also the standard for "good content" would be set.  It all reminds me of FOAF and the other methods which exist for tagging up individuals and their connections online.  This is a fundamental part of the semantic web after all and it could be easily exploited in this way.

January 26, 2009

Question everything

When I'm in a situation where I'm talking about the stuff I really know, I'm an expert.  I'm confident I can answer and help.  If I am surrounded by people who want and need to know about what I'm good at, it's great! I obviously enjoy the topics and get a buzz from sharing.

Some of these areas are:

 - Translation 
 - NLP
 - IR
 - NLG/U
 - Yoga
 - Research skills
 - SEO
 - Running
 - blah, probably some others

When I'm in a situation where I either don't really know about the topic or am not as proficient as the other people, I'm a learner.  I question, see what I can get from this.  

Some of these areas are:

 - Programming (Gasp! - yes, your programming geeks will get the job done much faster than me)
 - Computer graphics
 - Speech technology
 - Robotics
 - Windsurfing
 - Skydiving
 - Fixing computers
 - Cooking (I know, and I'm half French)
 - and endless others
 
What I know is that both situations are fun and exciting for the most part.  I am actually always learning, and sometimes a non-expert can really shed light on something for me.  I'm open to learning anything and everything (even dreaded cooking).  All of the topics that touch my life converge somewhere and bits from yoga get used in computing and bits from running get used in writing.

I'm not always the expert and that's fine, in fact that's a relief.  What a privilege to be a n00b and be allowed to make all sorts of mistakes and ask all those silly questions unabashed, no expectations.  How cool to be average at something and learn from someone who inspires you and get better at it.  

The message is I guess to not be afraid of those situations.  Not be afraid of looking more vulnerable than usual sometimes and also not being afraid of being the expert too.  I've found out that I am not judged by my questions, level of ability, or education but by my attitude.  In fact I was once told that it's not the questions that are stupid, just the idiots who think they're above them.  And that applies in both situations.

This blog I realise isn't always easy for everyone to digest so I want to encourage you to get in touch with either me or other people you have questions for.  You'll probably be surprised at how much I/they learn from you too.  None of us would want to pass up that opportunity.  Doing this opens up the discussion.

Don't be afraid of asking a well known scientist why they hadn't thought of doing x, y or z or why their theory doesn't work in a particular case.  It isn't an attack on their work and won't be perceived that way (as long as your attitude doesn't suck).  They either have the answers or you both have an interesting conversation coming up :)

“Socrates, you will remember, asked all the important questions - but he never answered any of them” (Dickinson Richards)

January 23, 2009

TGIF - weeeeeee

Welcome to yet another installment of TGIF, and I do hope that this post finds you well.  It has been a long week for a lot of people I have spoken to, some of the adjectives used were "crappy", "boring", "stressful", "@!*%£^&!" and so on.  Something must be up with the universe. I think me having a particularly relaxing time on idyllic islands in southern Thailand has put everything out of synch.  Fear not, the balance will return very soon.

Without further ado...

Stuff I liked this week:

This cool site shows you how the brain works from top to bottom.  

A list of 10 debunked scientific beliefs of the past.

I like Design 21, the Social Design network in partnership with UNESCO - excellent stuff.

My favourite thing this week was Steve Spalding's article called "Why I love the scientific method and so should you" on the "How to split an atom" site. 

The scientific method is:

1 - Ask a Question
2 - Do Background Research
3 - Construct a Hypothesis
4 - Test Your Hypothesis by Doing an Experiment
5 - Analyze Your Data and Draw a Conclusion
6 - Communicate Your Results 

Oh look, there's my thesis plan all written up and ready (apart from the "further work" section) :)

Quotes:

Computer Science is a science of abstraction -creating the right model for a problem and devising the appropriate mechanizable techniques to solve it. (A.Aho and J. Ullman)

The Analytical Engine weaves Algebraical patterns just as the Jacquard loom weaves flowers and leaves. (The Countess of Lovelace on Babbage's Analytical Engine)

I, myself, have had many failures and I've learned that if you are not failing a lot, you are probably not being as creative as you could be -you aren't stretching your imagination. (J. Backus)

Optimism is an occupational hazard of programming: testing is the treatment. (K.Beck)

Walking on water and developing software from a specification are easy if both are frozen. (E. Berard)

I particularly like:

Rules of Optimization:
  Rule 1: Don't do it.
  Rule 2 (for experts only): Don't do it yet.
(M.A Jackson)

Facts:

Macintosh invented the start menu in 1982 and the Recycle bin in 1984

Xerox invented desktop icons in 1981

John Atanasoff & Clifford Berry founded ABC Computer in 1942 thus becoming the 1st computer business.

John Presper Eckert & John W. Mauchly invented the UNIVAC computer which was able to pick presidential winners

Dan Bricklin & Bob Frankston invented the 1st Spreadsheet Software called VisiCalc in 1978 (it paid for itself within 2 weeks of its release)

This is a really cool video made in stop motion with paper - it's simply brilliant.





G patent: identifying similar passages in text

The patent entitled "Identifying and Linking Similar Passages in a Digital Text Corpus" was published on the 22nd of January and filed on the 20th July 2007.

It's a really interesting one, not just because it covers a topic I'm particularly interested in but because it describes a very useful method for digital libraries in particular.  Digital libraries are different to web documents because they don't have loads of functional links in them.  They mention that using references and citations listed in the documents isn't useful because they aren't used outside of academia or such related activities.  

Basically they're saying that it's hard to browse a load of documents in a digital library efficiently.  You can't navigate the corpus like you would navigate the web because of the nature of the structure.  

"As a result, browsing the documents in the corpus can be less stimulating than traditional web browsing because one can not browse by related concept or by other characteristics."

They're saying that finding papers in a digital library is boring because everything is classified either by the keywords the conferences ask for in that particular section of the paper or by author, title, year, subject...  It would be far more useful to browse by related concept for example.  And I agree.

The claim:

"A computer-implemented method of identifying similar passages in a plurality of documents stored in a corpus, comprising: building a shingle table describing shingles found in the corpus, the one or more documents in which the shingles appear, and locations in the documents where the shingles occur; identifying a sequence of multiple contiguous shingles that appears in a source document in the corpus and in at least one other document in the corpus; generating a similar passage in the source document based at least in part on the sequence of multiple contiguous shingles; and storing data describing the similar passage." ("shingles" are simply fragments)

Documents are processed and similar passages amongst them are identified.  Data describing the similarities is stored and the "passage mining engine"  then groups similar passages into further groups which are based on the degree of similarity amongst other things, so we have a ranking algorithm too.  They also describe an interface which shows the user the hyperlinks that are associated with these passages so they can easily navigate them.

Their method basically identifies all shingles, gathers as much data as is available on them (location, documents they appear in, etc...) and then groups them together into clusters based on similarity.
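To give a feel for the shingle table described in the claim, here is a minimal sketch.  It is my own toy reading, not the patent's implementation: I assume a shingle is a run of three consecutive words, and the function names, the three-document corpus and the `min_run` cut-off are all made up.

```python
from collections import defaultdict


def shingles(text, size=3):
    """Split text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_shingle_table(corpus, size=3):
    """Map each shingle to the (document, position) pairs where it occurs."""
    table = defaultdict(list)
    for doc_id, text in corpus.items():
        for position, shingle in enumerate(shingles(text, size)):
            table[shingle].append((doc_id, position))
    return table


def shared_passages(source_id, corpus, size=3, min_run=2):
    """Find runs of at least min_run contiguous shingles of the source
    document that also appear contiguously in another document."""
    docs = {doc_id: shingles(text, size) for doc_id, text in corpus.items()}
    table = build_shingle_table(corpus, size)
    source = docs[source_id]
    words = corpus[source_id].lower().split()
    passages = []
    for start, shingle in enumerate(source):
        for doc_id, pos in table[shingle]:
            if doc_id == source_id:
                continue
            other = docs[doc_id]
            # Only start counting at the first shingle of a run.
            if start > 0 and pos > 0 and source[start - 1] == other[pos - 1]:
                continue
            length = 0
            while (start + length < len(source)
                   and pos + length < len(other)
                   and source[start + length] == other[pos + length]):
                length += 1
            if length >= min_run:
                passage = " ".join(words[start:start + length + size - 1])
                passages.append((doc_id, passage))
    return passages


# Hypothetical three-document corpus.
corpus = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river",
    "doc2": "yesterday a quick brown fox jumps over the fence",
    "doc3": "nothing in common here at all",
}
links = shared_passages("doc1", corpus)
```

Here `links` ends up holding one entry pointing from "doc1" to "doc2" with the shared passage "quick brown fox jumps over the", which is the kind of record you would then store and turn into a hyperlink between the two documents.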

Users could navigate the passages in the text that are relevant to them rather than the whole document, which may not be relevant in its entirety.  Being able to browse all this data by related features like that would help us find far more relevant papers for our information needs.

This is a different approach to the one where an entire document is analysed (like in LSA) and classified and defined in terms of its overall features.  Using passages instead means that the entire exercise is far more granular.  Here we take into account that a document may be about a topic in a broad sense but actually about several particular subtopics.  We can also tell that perhaps part of a document is useful to a user in response to a query but not the whole thing.  

Search engines for digital libraries containing scientific papers for example do not perform half as well as the search engines we're used to using on the web.  Google scholar can sometimes yield much better results than Citeseer for example, but then they work very differently.  The documents are usually in PDF format or something similar so as Google note you need to be able to make that machine readable for starters.  

This conveniently, as far as I'm concerned, brings us to the elusive and wonderful exercise of summarization.  I say this because if you have a number of fragments from different documents and you can identify how similar they are, you can discard any duplicate information and create a complete summary from the data retrieved for your user, also offering up access to each individual document if the user wants to read the whole thing or the original passages.  This is not ground breaking in summarization but the model described in the patent fits.

I really like that idea.

January 22, 2009

Sentiment analysis in text

Sentiment analysis (also opinion retrieval/mining) is a very useful area of research as once fully functional it would enable us to determine the overall sentiment in text.  We could for example determine automatically if product reviews are negative or positive, if a blog post is in agreement or disagreement with a particular topic or debate, whether news is favourable or not towards a story line and many more possibilities open up once we start thinking along these lines.

For a really nice overview of this topic check out Cornell's freely available resource on it.

As far as blogs go (an area which interests us bloggers greatly) it would enable further clustering possibilities for search engines.  This means that you could look for contradicting views in response to one of your posts, or for like-minded people.  There are of course further things you could do with this kind of technology and I'll leave those open for you to debate amongst friends.

Some of the problems we encounter in making this whole thing possible aren't the easiest to deal with.  One is finding methods for extracting the opinion or sentiment based sections of a text, which means analysing the content at a depth that current systems don't attempt.  You then need to be able to rank these documents in order of sentiment intensity.  How do we determine this?  Also, how do you pick out affective or emotive words as opposed to generic ones?

Let's look at some of the research undertaken recently in this area.  I've picked 3 papers where the researchers looked at different approaches:

"A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval" by Zhang and Ye from Tsinghua University (Beijing) - SIGIR 08

They focused on the issue of combining a document's opinion score and topic relevance score.  They used a lexicon-based opinion retrieval method which unifies "topic-relevance" and "opinion generation" by a quadratic combination.  They used TREC blog data sets and observed an improvement over techniques that combine the two scores linearly.  They were able to show that a Bayesian approach to combining multiple ranking functions is better than using the linear combination.  Their "relevance-based ranking criterion" is used as the weighting factor for the "lexicon-based sentiment ranking function".
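To make the difference between the two ways of combining scores a bit more concrete, here is a minimal sketch.  It is not the authors' model: the function names, the `alpha` weight and the toy scores are all my own assumptions, and the "quadratic" combination is simplified to a plain product where the relevance score weights the opinion score.

```python
def linear_score(relevance, opinion, alpha=0.5):
    """Linear combination: a weighted sum of the two scores (alpha is an assumed weight)."""
    return alpha * relevance + (1 - alpha) * opinion


def quadratic_score(relevance, opinion):
    """Simplified quadratic combination: relevance acts as a weight on the opinion score."""
    return relevance * opinion


def rank(docs, scorer):
    """Return document ids ordered from best to worst for the given scoring function."""
    return sorted(docs, key=lambda doc_id: scorer(*docs[doc_id]), reverse=True)


# Toy data (made up): doc_id -> (topic relevance, opinion score)
docs = {
    "A": (1.0, 0.1),  # very relevant, barely opinionated
    "B": (0.5, 0.5),  # moderately relevant and moderately opinionated
}

print(rank(docs, linear_score))
print(rank(docs, quadratic_score))
```

With these made-up numbers the weighted sum still puts the barely opinionated document A first, whereas the product favours B, which is reasonably strong on both counts.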

They figured out that there was a need not only to identify the sentiment in the text but also to identify and meet the user's needs.  I found the way that they pick sentiment out of the text interesting.  They used WordNet to identify a subset of seed sentiment terms and then enlarged that list with synonyms and antonyms.  The interesting part is that they also used HowNet, which is a Chinese database where some of the words are tagged as positive or negative.
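The seed-and-expand idea can be sketched in a few lines.  This is only an illustration under my own assumptions: the tiny `SYNONYMS` and `ANTONYMS` dictionaries are hypothetical stand-ins for a real resource like WordNet, and the rule used (synonyms keep the polarity of the seed, antonyms get the opposite one) is the simplest one I could think of, not necessarily what the authors did.

```python
# Hypothetical toy thesaurus standing in for WordNet (assumption, not real data).
SYNONYMS = {
    "good": {"great", "fine"},
    "bad": {"poor", "awful"},
}
ANTONYMS = {
    "good": {"bad", "evil"},
    "bad": {"good"},
}


def expand_lexicon(positive_seeds, negative_seeds, synonyms=SYNONYMS, antonyms=ANTONYMS):
    """One expansion step: synonyms inherit the seed's polarity, antonyms get the opposite."""
    positive, negative = set(positive_seeds), set(negative_seeds)
    for word in positive_seeds:
        positive.update(synonyms.get(word, ()))
        negative.update(antonyms.get(word, ()))
    for word in negative_seeds:
        negative.update(synonyms.get(word, ()))
        positive.update(antonyms.get(word, ()))
    return positive, negative


positive, negative = expand_lexicon({"good"}, {"bad"})
print(sorted(positive))
print(sorted(negative))
```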

Their method can be adapted to all sorts of documents, not just blogs, because of its generalised nature.  They're looking at constructing a collection-based sentiment lexicon which I would be highly interested in having access to!


"A Holistic Lexicon-Based Approach to Opinion Mining" by Ding, Liu and Yu from the University of Illinois (Chicago) - WSDM 08

These guys particularly looked at product reviews and also wanted to find a solution to establishing what was negative or positive.  They used a "holistic lexicon-based approach" which means that the system looks at opinion words that are context dependent (they use the example of "small").  Their system "Opinion Observer" can also deal with particular constructs, phrases and words which typically have an impact on opinion as far as the language is concerned.  More importantly, their system can deal with conflicting opinion words in a sentence.  Here we can say that the approach is semantic based.

They describe the shortcomings of the lexicon based approach.  They state: "To complete the proposed approach, a set of linguistic patterns are devised to handle special words, phrases and constructs based on their underlying meanings or usage patterns, which have not been handled satisfactorily so far by existing methods." - They outperformed all "state-of-the-art existing models".

"Learning to Identify Emotions in Text" by Strapparava (FBK-Irst, Italy) and Mihalcea (University of North Texas) - SAC 08

They approached the problem from another angle.  They annotated a big dataset with 6 basic emotions: Anger, Disgust, Fear, Joy, Sadness and Surprise.  They worked on automatically identifying these in text.  They used news headlines for this and looked at lexicon approaches, the latent semantic space approach, naive Bayes classifiers and others.  They also looked at the co-occurrence of affective words in text.  They followed the classification found in WordNet Affect and collected words related to their 6 groups.  

You see here that variants of LSA are used in current systems because here for example they used it to identify generic terms and affective lexical concepts. Their method also takes into account a tf-idf weighting schema. They explain that "In the LSA space, an emotion can be represented at least in three ways: (i) the vector of the specific word denoting the emotion (e.g. “anger”), (ii) the vector representing the synset of the emotion (e.g. {anger, choler, ire}), and (iii) the vector of all the words in the synsets labeled with the emotion."

They specifically evaluated 5 systems: WN-Affect presence (annotates emotions in the text using WordNet Affect), LSA single word (similarity between the given text and each emotion), LSA emotion synset (words denoting emotion), LSA all emotion words (adds all words in the synsets labelled with a given emotion), NB trained on blogs (Naive Bayes classifier trained on blog data annotated for emotions).

The WN-Affect system has the highest precision and the lowest recall.  LSA using all emotion words has the largest recall but its precision is a bit lower.  The NB method worked best on Joy and Anger because these were prevalent in the training set.  All other emotions were best identified by the LSA models.
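It's easy to see why a pure "presence" approach ends up with high precision and low recall if you sketch it out.  The code below is my own toy version, not the authors' system: `EMOTION_LEXICON` is a tiny hand-made stand-in for WordNet Affect and `annotate_emotions` is a hypothetical name.

```python
import re

# Tiny hand-made stand-in for WordNet Affect (assumption, for illustration only).
EMOTION_LEXICON = {
    "anger": {"anger", "furious", "outrage"},
    "fear": {"fear", "terror", "panic"},
    "joy": {"joy", "delight", "celebrate"},
}


def annotate_emotions(headline, lexicon=EMOTION_LEXICON):
    """Label a headline with every emotion that has at least one lexicon word in it."""
    tokens = set(re.findall(r"[a-z]+", headline.lower()))
    return sorted(emotion for emotion, words in lexicon.items() if tokens & words)


print(annotate_emotions("Fans celebrate as panic spreads"))
print(annotate_emotions("Team wins the cup at last"))
```

The first headline gets labelled because it contains words from the list.  The second is clearly a happy one but gets no label at all because none of its words are in the lexicon, which is exactly the low recall problem.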

Why should you care?

This kind of system when working efficiently would mean that reputation management suddenly becomes very important as negative and positive comments could easily be retrieved by users.  An overall bad reputation as far as a company, product or individual is concerned could be very damaging.
 

January 16, 2009

Off topic: Heads-up

Hey all,

I am taking a few days off from the webnet and going to some lovely tropical islands off the coast of Southern Thailand.  I will mostly be snorkelling, reading my book by the pool and in the hammock, and sampling delicious food.

If you don't see a blog post from me on Thursday I've probably been had by one of these and might not be back for a while.

Ciao for now!

TGIF - yippie

Welcome to another installment of TGIF.  I am assuming that those of you who have had a very cold week will not want to hear about how hot and sunny it is here in Thailand so I won't expand.  I hope that you have all had a great week and that you picked up a few exciting projects along the way.

Without further ado...

I love Larissa Meek's 28 cakes for Geeks.  Feel free to make me one and send it through. I love cake.

Check out some very cool videos called "Accurate Scientific Visualizations of the T4 bacteriophage infection process and replication" - you'll have to go and look.

Quotes:

A little inaccuracy sometimes saves a ton of explanation. (H. H. Munro)

The biggest difference between time and space is that you can't reuse time. (Merrick Furst)

Beware of bugs in the above code; I have only proved it correct, not tried it. (Donald Knuth)

Documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Dick Brandon)

Any sufficiently advanced technology is indistinguishable from magic. (A. Clarke)

Facts:

A byte, in computer terms, means 8 bits. A nibble is half that: 4 bits. (Two nibbles make a byte)

While it took the radio 38 years, and the television a short 13 years, it took the World Wide Web only 4 years to reach 50 million users.

The Afghan capital Kabul has a cyber cafe

Wikipedia has a page devoted to toilet roll holders

The oldest surviving computer in the world is called CSIRAC and is located in Melbourne

Microsoft writes the code for autopilot systems in all major airplanes

And thank you to Steve for this nice little YouTube gem :)


Metcalfe's Law & web stuff

Metcalfe's law says that "the value of a network increases proportionately with the square of the number of its users".  This law works for the Internet, social networking, the WWW and any other type of network like that.  The idea is to be able to give a value to the network.
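A quick way to get a feel for the numbers is to count the possible links.  The sketch below is just arithmetic: with n users there are n(n-1)/2 possible pairwise links, which grows roughly with the square of n.  The function names and the constant `k` are my own, and what one link is actually "worth" is left completely open.

```python
def potential_links(n):
    """Number of distinct pairwise links possible between n users."""
    return n * (n - 1) // 2


def metcalfe_value(n, k=1.0):
    """Value proportional to the square of the number of users (k is an arbitrary constant)."""
    return k * n ** 2


for users in (10, 100, 1000):
    print(users, potential_links(users), metcalfe_value(users))
```

Going from 10 to 11 users adds 10 new potential links, one to each existing user, and doubling the number of users multiplies the value by four under this law.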

Metcalfe's law has been used in the context of web 2.0, or rather all sorts of people are trying to see if it can help us understand it all a bit better.  There are a number of non-believers in Metcalfe's law, for example Bob Briscoe, Andrew Odlyzko and Benjamin Tilly don't like it at all.  Simeon Simeonov, who works with Bob Metcalfe, addresses this himself and it's a nicely put argument.

The Sun Babelfish blog has a really nice short overview of what this is all about.

Whether it is right or wrong is a very long debate to have, well beyond the scope of this post or even this blog, so we will simply take Metcalfe's law at face value and see if it can work in a web 2.0 context.  In this sense it would be something like: the value of a service is given by the number of users it has.

The more users there are, the more links there are, and the number of potential links increases every time a user joins (the debate is around how the value of each link is not equal).  I really liked the paper by Hendler (Rensselaer Polytechnic Institute) and Golbeck (University of Maryland).  It's called "Metcalfe's law, web 2.0 and the semantic web".

The problem is nicely summarized in the abstract:

"The power of the Web is enhanced through the network effect produced as resources link to each other with the value determined by Metcalfe's law. In Web 2.0 applications, much of that effect is delivered through social linkages realized via social networks online. Unfortunately, the associated semantics for Web 2.0 applications, delivered through tagging, is generally minimally hierarchical and sparsely linked. The Semantic Web suffers from the opposite problem. Semantic information, delivered through ontologies of varying amounts of expressivity, is linked to other terms (within or between resources) creating a link space in the semantic realm. However, the use of the Semantic Web has yet to fully realize the social schemes that provide the network of users."

Interestingly they mention Tim O'Reilly, who said that the importance of web 2.0 is centered around content creation but that the critical thing about it is RSS, permalinks and other kinds of linking technology.  They say that the network effect comes from the social constructs within the sites, and that the value of the network can be deduced through the links between the people who interact in them.

They rightly point out also that a short-coming of web 2.0 is that tags don't create much of a link space.  Tags are always more sparse than links. This is why there are problems at the moment with clustering efforts because there's not so much to go on.  More work is being done to automate tags and such things.  There are ontologies and taxonomies and so many more structures being tried and tested. 

RDF, OWL, RDFS and so on are all about assigning URIs in order to represent relationships.  The authors are right when they say that the most important thing about these languages is that they provide "common referents".  The latent value of the semantic web is in the vocabularies because we can assess the value, and the other characteristics, of words.  This is based on how they link to each other and the relationships they have and share.

They say that Metcalfe's law comes into play here again because "the more terms to link to, and the more links created, the more the value in creating more terms and linking them in".

The drawbacks they describe mostly revolve around the fact that our early attempts at semantic web evolution have failed because of the amount of tagging needed.  The whole tagging and folksonomy movement hasn't worked because it's flat and doesn't exploit the links between the elements properly.

They mention FOAF as the most successful semantic web effort to date.  I have to agree with this, and I have blogged about it before.

The main problem is that tagging (like for Del.icio.us) isn't very useful because it's not expressive enough and it isn't structured.  OWL, for example, is both, and once the teething problems are ironed out it will be much easier to extract important and useful data from the web.  We just need to learn how to use all of this new technology effectively.

They state:

"Metcalfe's law makes it clear that the value of these systems, viewed as networks of communicating agents (whether human or machine), arises from the many connections available between online resources. To exploit this space, however, there must be explicit linkages between the resources: when it comes to the network effect, if you don't have links, you don't get it."

So basically Metcalfe's law allows us to see the enormous amount of possible linkages and current ones too.  It is truly staggering. Imho Metcalfe's law is good enough here to help us form an idea of the vastness we are dealing with.  Obviously not all links are equal and there is the issue with valuing them adequately, but surely this is something we can busy ourselves with once we have a full picture.

Why should you care?

Metcalfe's law is simply a stab at a metric for evaluating the landscape.  What it shows is that it is huge, and we don't really need to be told much more.  It's important to get into ontologies and things, play with FOAF, see how you can include your site in this model.  

January 15, 2009

Search Engine Result Evaluation

Search engines are often evaluated using information retrieval techniques such as precision and recall.  These are very effective metrics for classic information retrieval systems but less so for web search engines.  The reason for this is that high precision isn't necessarily a good measure of user satisfaction.  The quality of the resources is of course a factor but what users class as authoritative may vary.
This really does show that results are personal to each user: we're not looking for the same things every time, and if we are, maybe not for the same reasons.  This is why personalisation is a good solution, but that's a topic for another day.
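For anyone who hasn't met them before, precision and recall are simple to compute once you know which documents are relevant.  Here is a minimal sketch; the function name and the toy document ids are mine.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 results returned, 3 documents are actually relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(precision, recall)
```

Note that neither number says anything about the order of the results, or about whether the user was actually happy with them.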

Usually you can classify queries into navigational ones or information motivated ones.  This also affects the evaluation of the search engine.  Information ones are hardest because you're looking for a bunch of relevant documents but the query isn't usually rich enough to establish what exactly is needed.  Navigational queries such as looking for the Sofitel in Bangkok are much easier because they're more exact.

You can use human evaluators or automated methods to check how good the results are.  Human evaluators are of course biased towards their own motivations, and it has been shown in the past that their results vary widely.  Automated testing isn't biased, the machine doesn't care, but it isn't always very representative of human search, if you like.  Google use human evaluators and also live traffic experiments.

Here I'll introduce a few papers you might find interesting on the subject.  I've chosen a bit of a mixture but of course there are many more ways to do this.

"Search Engine Ranking Efficiency Evaluation Tool" by Alhalabi, Kubat and Tapia from the University of Miami.

They also note that "precision" and "recall" don't take into consideration ranking quality.  They propose using SEREET (Search Engine Ranking Efficiency Evaluation Tool).
 
They compare a known correctly ordered list to a search engine's one.  The method is to start at 100 points and then deduct from those each time a relevant document isn't present in the search engine rankings and also if an irrelevant document is returned.  It's basically (the number of misses/RankLength) x 100.  RankLength is the number of links in the rank list.  They found it was more sensitive to change and efficient in space and time.
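Here is how I picture that scoring in code.  Treat it as a rough sketch under my own assumptions rather than the tool itself: I count a "miss" for every relevant document that is absent and for every irrelevant document that is returned, I take the final score to be 100 minus (misses/RankLength) x 100, I clamp it at zero, and I ignore the ordering of the list, which the real tool takes into account.  The function name is made up.

```python
def sereet_like_score(reference, engine_results):
    """Start at 100 and deduct points for misses, in the spirit of the description above."""
    reference_set = set(reference)
    returned = set(engine_results)
    missing_relevant = len(reference_set - returned)      # relevant docs the engine left out
    irrelevant_returned = len(returned - reference_set)   # docs returned that shouldn't be there
    misses = missing_relevant + irrelevant_returned
    rank_length = len(engine_results)                      # number of links in the rank list
    if rank_length == 0:
        return 0.0
    deduction = 100 * misses / rank_length
    return max(0.0, 100 - deduction)                       # clamping at 0 is my assumption


reference = ["a", "b", "c", "d"]
print(sereet_like_score(reference, ["a", "b", "x", "c", "y"]))
print(sereet_like_score(reference, ["a", "b", "c", "d"]))
```

In the first call one relevant document is missing and two irrelevant ones are returned, so 3 misses over a list of 5 links costs 60 points.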

"Automatic Search Engine Performance Evaluation with Click-through Data Analysis" by Liu, Fu, Zhang, Ru from Tsinghua University.

They note that human evaluation is too time consuming to be an efficient method of evaluation.  Their click-through data analysis method allows them to evaluate automatically.  Navigational type queries, query topics and answers are made by the system based on user query and click behaviour.  They found that they got similar results to those of human evaluators.


They looked at "user-effort-sensitive evaluation measures", namely search length, rank correlation and first 20 full precision.  They say this is better because it focuses on the quality of the ranking.  They found overall that the 3 measures were consistent.  "Search length" is the number of non-relevant documents the user has to sift through, "Rank correlation" is comparing the user ranking to the search engine ranking, and "First 20 Full Precision" is the ratio of relevant documents within the total set of documents returned.
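Two of these measures are easy to sketch.  The definitions below are my own simplified readings and not necessarily the exact ones used in the paper: search length is counted up to the first relevant document, and for rank correlation I've used Spearman's coefficient, which is one common choice when two rankings contain the same items with no ties.

```python
def search_length(ranking, relevant):
    """Number of non-relevant documents the user sifts through before the first relevant one."""
    relevant = set(relevant)
    for skipped, doc in enumerate(ranking):
        if doc in relevant:
            return skipped
    return len(ranking)  # no relevant document found at all


def spearman(user_ranking, engine_ranking):
    """Spearman rank correlation between two rankings of the same items (no ties)."""
    n = len(user_ranking)
    if n < 2:
        return 1.0
    engine_pos = {doc: i for i, doc in enumerate(engine_ranking)}
    d_squared = sum((i - engine_pos[doc]) ** 2 for i, doc in enumerate(user_ranking))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


print(search_length(["x", "y", "a", "b"], {"a", "b"}))
print(spearman(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
```

A correlation of 1 means the engine orders things exactly as the user would, and -1 means it has them completely back to front.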

More reading if you fancy it:



and there are many more...

Why should you care?

Well obviously if search engine results are not showing the best results to the user, your very content rich, useful and perfect website will always have difficulty in ranking well.  If the results were very credible and accurate, spam in the results and rubbish sites ranking higher wouldn't ever happen.  It's in your interest as a user, a webmaster, a site owner or an SEO to evaluate these results for yourself too.  Knowing about some of the methods gives you some insight into this.

January 14, 2009

Microsoft's Game-Powered Search Engine

Someone dropped me this patent and I instantly loved it because it describes a completely different solution to the problem of IR and does so in a very entertaining way...well obviously.  The patent was filed in 2005 and published on the 13th of January 2009.  The authors are all brilliant and renowned computer scientists from slightly varied fields.

Anyway, it's called "Game-powered search engine".

The idea is that: 
The user types in a query
The game participants receive this query
Responses are collected from the game participants - these can be anything from images, text, audio etc...
The game rewards participants with the most suitable responses

The suitability is calculated by analyzing the degree of agreement between the responses.  Agreement depends on the level of similarity.  
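To give an idea of how agreement between responses could be measured, here is a small sketch for text responses.  The patent doesn't tell us to do it this way: using Jaccard word overlap as the similarity measure, averaging it over the other players, and the names `jaccard` and `agreement_scores` are all my own assumptions.

```python
def jaccard(text_a, text_b):
    """Word-overlap similarity between two responses (0 = nothing shared, 1 = same words)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def agreement_scores(responses):
    """Score each participant by the average similarity of their response to everyone else's."""
    scores = {}
    for player, text in responses.items():
        others = [t for p, t in responses.items() if p != player]
        scores[player] = sum(jaccard(text, o) for o in others) / len(others) if others else 0.0
    return scores


responses = {
    "alice": "eiffel tower paris",
    "bob": "the eiffel tower in paris",
    "carol": "big ben london",
}
scores = agreement_scores(responses)
print(scores)
print(max(scores, key=scores.get))
```

Alice and Bob back each other up and so score the same, while Carol's outlier response gets no reward.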

You could think "who on earth is going to bother playing that game?".  Probably the same kind of people who answer questions in forums, on Google Answers or such places I imagine.  The advantage of the game system is that a machine actually checks all responses and filters them first so you're more likely to get a correct answer.

You know, the more I think about it, the less I find it quirky and funny and the more I think it could work.  It's a bit like super-users (people who are experts at using search engines) helping out less savvy users. 

It would have to be really well marketed and introduced because there have been some human edited engines before like ChaCha that haven't won the majority over.  It would also need to be really swanky looking with a top level of usability.  Then it would need to actually give the people answering a motivation for doing so.  What do you get, points?  For what? 

January 13, 2009

Clickstream spam detected

Clickstream analysis is a basic form of metric used to determine how much traffic comes to a site, and some analysts also use it to look at the quality of the traffic.  There is more research being done into clickstream analysis because the data is littered with noise, has a very high dimensionality, and is warped by 3rd party systems, amongst other things.  Furthermore this data can be used more effectively when the user sessions are split into categories.

Here I look at one paper from AIRWeb by Microsoft Research people.  It's interesting because it highlights issues that search engines have with automated rank checking systems for one, and with other automated bots.  It shows how these can be filtered out of the engine's click-stream analysis, which it may well use for ranking documents.

In "A Large-scale Study of Automated Web Search Traffic" Buehrer, Stokes and Chellapilla found that 3rd party systems which interact with search engines are a nuisance because they make it hard to pick out human queries.  3rd party systems (like rank checking software for example) access the search engines to check ranks, augment online games or maliciously alter click-through rates.  They have devised the basis for a query-stream analyser.  I'm sure we can all see how useful this type of system would be.  

Interestingly: "One study suggested that 85% of all email spam, which constitutes well more than half of all email, is generated by only 6 botnets"

They say the problem with web spam is that "a high number of automatically generated web pages can be employed to redirect static rank to a small set of paid sites".

Some checkers perform about 4,500 queries per day (far more than a human would).  This means that there is search result latency for the user and that the engine can't improve quality of service.  Some engines see clickthrough rate as implicit feedback for the relevance of a URL, so this bad data is a real hindrance for them.  This is why I think this type of variable in ranking is not useful.  It's too easily manipulated.  As they say "an SEO could easily generate a bot to click on his clients’ URLs".  This is click-fraud.

They note that Clickforensics found that search engine ads experience a fraud rate of 28.3%. This paper however focuses on organic results only.  

The 1st bot analysed is one that "rarely clicks, often has many queries, and most words have high correlation with typical spam".  The 2nd bot had similar characteristics to the 1st but searched for financial stuff (you could search for any topic really).  The queries for this bot revolve around the keywords any SEO would have pinpointed, to be honest.  The 3rd bot tried to boost search engine rank, as it looks for various URLs.  The 4th bot has an unnatural query pattern because it looks for single words rather than the 3-4 terms usually entered by users.  This bot searched for financial news related to specific companies (clearly online reputation management).  Bot 5 sends queries from loads of cities within a short period of time, never clicks on anything and uses NEXT a lot - they did take into consideration mobile devices though.  Lastly, example bot 6 searches for the same terms over and over again during the course of the day.  This is typically to boost rankings.  They say that a possible motive for high click rate is:

"For example, if a user queries the index for “best flowers in San Francisco” and then scrapes the html of the top 1,000 impressions, he can find the most common keywords in those pages, their titles, etc. and incorporate them into his own site."

There are basically 3 main types of bots: those that don't click on links, those that click on every link and those that click on targeted links.

The things they added to the click through data analysis were:

- Actual clicks & the number of queries issued in a day
- Alphabetical searches
- Spam terms (viagra)
- Blacklisted IPs, particular country codes and blacklisted user-agents
- Rare queries used often
- Low probability query pairs

They used Weka (great open source machine learning tool) and achieved a high accuracy.  The classifiers used were Bayes Net, Naive Bayes, AdaBoost, Bagging, ADTree and PART.  All produced results higher than 90%.  Now they're furthering their research and working on new data sets.
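To show what the kind of features listed above might look like in practice, here is a small sketch that computes a few of them for one client's queries in a day.  It is only an illustration: the authors trained proper classifiers in Weka, whereas `looks_automated` is a hand-written rule of my own, and the feature names, the spam word list and the thresholds are all assumptions.

```python
SPAM_TERMS = {"viagra"}  # toy list, taken from the example term above


def session_features(queries, clicks):
    """Compute a few simple features for one client's queries and click count in a day."""
    lowered = [q.lower() for q in queries]
    words = [w for q in lowered for w in q.split()]
    n = len(lowered)
    return {
        "num_queries": n,
        "clicks_per_query": clicks / n if n else 0.0,
        "spam_word_ratio": sum(w in SPAM_TERMS for w in words) / len(words) if words else 0.0,
        "alphabetical": n > 2 and lowered == sorted(lowered),
    }


def looks_automated(features, max_queries_per_day=1000):
    """Hand-written rule with made-up thresholds, standing in for a trained classifier."""
    return (
        features["num_queries"] > max_queries_per_day
        or features["alphabetical"]
        or features["spam_word_ratio"] > 0.5
    )


bot = session_features(["apple", "banana", "cherry", "date"], clicks=0)
human = session_features(["weather sydney", "cheap flights bangkok", "sofitel bangkok"], clicks=2)
print(looks_automated(bot), looks_automated(human))
```

In a real system you would feed features like these into a classifier rather than hard-coding the rule.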

Why should you care?

This is interesting because Google banned some automated ranking tools in the past, and this research does kinda suggest that the spam that these programs produce could simply not be counted in the analysis.  The thing is that hitting the servers so often does affect the search engine's performance and this is bad for users.  I think that we can expect to see these kinds of systems suffer further in the future, but as I have previously said, and others have too, the rankings aren't the be all and end all.  There's a lot more to consider when measuring site performance.

Yes I've used rank checking software like everyone else in the past but when I wear my computer scientist hat I see them as evil because of the damage that they do to systems and I want to eradicate them.  This goes for all the other bots too.

January 12, 2009

Summarization and rankings

We do spend an awful lot of time doing searches on Google and then going through the list of results to find information related to our query or rather the exact information we are looking for.  Sometimes we don't actually know exactly what we're looking for until we get to a resource which tells us that by addressing our query in a different way.  Then our search deviates and we continue this process.  It's time consuming because you need to scan-read at least each of the resources you think might be relevant.  The ones that do indeed appear to be relevant then need to be read in more depth.  This is not an efficient way of collecting useful information.  This is why technology such as document summarization is important.

Document summarization involves automatically creating a summary of a document.  Lots of things have to be taken into consideration such as the type of language used (this needs to be successfully recognised), the style of writing, and the document syntax.  

There are different approaches which have been discussed and evaluated recently which I will introduce.  The basic ideas though are extraction (pulling out useful information) and abstraction (paraphrasing sections of the document to summarise it).  In a search engine you need to have a slightly different type of summarization approach than in other areas because it needs to be relevant to the query, or rather "query biased".
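A very naive "query biased" extraction step can be sketched like this.  It's a toy under my own assumptions, not a method from any of the papers below: sentences are scored purely by how many query words they contain, and the best ones are returned in their original order.  The function name is mine.

```python
import re


def query_biased_summary(text, query, k=2):
    """Pick the k sentences sharing the most words with the query, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = [
        (len(query_words & set(re.findall(r"\w+", sentence.lower()))), index)
        for index, sentence in enumerate(sentences)
    ]
    # Highest overlap first; earlier sentences win ties.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [sentences[index] for _, index in sorted(best, key=lambda pair: pair[1])]


text = (
    "Cats sleep a lot. Dogs need daily walks. "
    "The weather is nice. Dogs also enjoy long walks in the park."
)
print(query_biased_summary(text, "dogs walks"))
```

Abstraction, where the system paraphrases rather than copies, is a much harder problem and isn't something you can do in a dozen lines.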

The most efficient 1st step in summarization for a search engine (imho) is multi-document summarization.  This means that it produces a summary of all of the results returned in relation to your query.  This means that you are much closer to getting an answer to your query rather than a list of documents that might be useful to you.  This hugely speeds up your interaction with the data and addresses the issue of data overload.

So that multi-document summarization can happen, the documents have to be clustered.  This is easier in a search engine because the list of results is indeed a cluster.  The summarization stage however can offer further opportunities for a more focused clustering.

The various methods for summarization in the past aren't really what I want to look at in this post.  I actually want to focus on recent research which gives us valuable insight into how this might work in a fully working search engine.  I'm going to introduce a number of papers and a very short low-down of the method presented because without the how, we can't really start to understand the why fully.

"Comments-Oriented Document Summarization: Understanding Documents with Readers’ Feedback" by Hu, Sun and Lim from Nanyang Technological University, Singapore (SIGIR 08)

Interestingly they looked at improving the performance of their summarization system by using comments left by readers on the web documents.  This is described as "comments-based document summarization".  Comments are linked to one another by 3 relations: topic, quotation and mention, producing 3 graphs which are merged into a multi-relation graph.  A second method used is to construct a 3rd-order tensor with the 3 graphs.  Sentences are extracted using a feature-biased (scores sentences with a bias to the keywords derived from the comments) or uniform-document approach (scores sentences uniformly without comments).  They found that the former significantly improved the performance of their system.  This does however only work if there are any comments and these are most likely to occur in blog posts.

"Multi-Document Summarization Using Cluster-Based Link Analysis" by Wan and Yang from Peking University (SIGIR 08) 

They used the Markov Random Walk model for their system which deals with multi-document summarization.  Link relationships between sentences in the document set are analysed.  They isolate topic clusters within the documents and form sentence clusters.  Their methods are the "Cluster-based Conditional Markov Random Walk Model" (ClusterCMRW) and the cluster-based HITS model (ClusterHITS), both of which make use of the clusters.  The former approach worked better than ClusterHITS as far as different cluster numbers went.

"MANYASPECTS: A System for Highlighting Diverse Concepts in Documents" by Liu, Terzi and Grandison from IBM Almaden Research (PVLDB 08)

Their system takes a document and then highlights a small set of sentences that are likely to cover different aspects of that document.  They use "simple coverage" and "orthogonality criteria".  The cool thing about this system is that it can handle both plain text and RSS/ATOM feeds.  They quite rightly say that it can also be integrated in web 2.0 forums so that you can easily find different opinions on things and discussions.  They also used the standard methods for clustering and summarization such as k-median and SVD.  
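The idea of picking sentences that cover different aspects can be illustrated with a simple greedy loop.  This is not the ManyAspects algorithm: I'm assuming that "coverage" means covering as many new words as possible and that avoiding overlap with what has already been picked is a fair stand-in for "orthogonality".  The function name is hypothetical.

```python
def pick_diverse_sentences(sentences, k):
    """Greedily pick up to k sentences, each adding the most words not yet covered."""
    bags = [set(sentence.lower().split()) for sentence in sentences]
    covered, chosen = set(), []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < k:
        # The sentence contributing the most unseen words wins; earlier sentences win ties.
        best = max(remaining, key=lambda i: len(bags[i] - covered))
        if not bags[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= bags[best]
        remaining.remove(best)
    return [sentences[i] for i in sorted(chosen)]


reviews = [
    "the battery life is great",
    "battery life is really great",
    "the screen is too dim",
]
print(pick_diverse_sentences(reviews, 2))
```

The second review says almost the same thing as the first, so the loop skips it in favour of the one about the screen.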

There's talk of integrating this into Firefox too, and of allowing for spam control, which is quite exciting.

"Web Content Summarization Using Social Bookmarks: A New Approach for Social Summarization" by Park and Fukuhara from Seoul National University (WIDM 08)

Their approach is to exploit user feedback (comments and tags) in social bookmarking services like Del.icio.us, Digg, YouTube and Amazon. They used a prototype system called SSNote which analyses tags and user comments and also extracts summaries.  Their approach shows promise.  Their method is "Social summarization" which allows them to produce text summaries that are just as good as human produced ones.  
        
"Latent Dirichlet Allocation Based Multi-Document Summarization" by Arora and Ravindran from the Indian Institute of Technology Madras (AND 08)

As the title says, they used Latent Dirichlet Allocation for their system.  This method allows them to capture events covered in the documents and to produce a summary which respects these different events.  This method means that they don't need to pay attention to any of the details concerning grammar and structure.  Their method was very efficient.  Basically the central theme and events in the documents are identified as well as the sub-topics and themes.  Then these are represented in the summary.  They extract entire sentences and do not modify anything.

Why should you care?

Do you remember all that talk about how there was no point in checking search engine rankings anymore?  Everyone was very divided on this issue, and it isn't an easy thing to explain to clients either.  Well I think that this research clearly highlights that there are very definite moves to break away from the standard list of documents.  As these techniques become more refined and as they become implemented successfully, they will no doubt change the way that users find information and products.  

What should you do then?

Same as you should already be doing: produce well structured, rich content that is grammatically and syntactically sound.  Not only do you need to show up in the initial results, as you do anyway in a cluster like the ranking list in a search engine, but you are also going to have to provide very focused and relevant information, because the summarization stage can act as a further filter to the initial clustering.

More papers that are freely accessible should you be tickled by the subject:

Multi-document summarization system and method patent by McKeown and Barzilay



January 09, 2009

TGIF - 2009 starts

Welcome to the 1st TGIF of 2009.  I'm glad many of you have enjoyed this series during 2008 and I hope that you enjoy the 2009 collection.  Hopefully the festive season was enormous fun for everyone and you are not feeling the January Blues too much.  If you are I hope that this installment lightens your mood.  

Heads-up: I'm off on a 2 week holiday to Thailand and Cambodia and then I will be in Sydney for a while.  If you're in any of those places drop me a line and we can meet up for some geek talk.  I will still be blogging and available to you though!

Without further ado...

Quotes:

"The first time you do something it’s science.  The second time is engineering.  The third time is … just being a technician.  I’m a scientist: once I do something, I want to do something else." (Clifford Stoll) - thanks to Alex.

"The person who says it cannot be done should not interrupt the person doing it." (Anon)

"He who is ashamed of asking is ashamed of learning." (Anon)

"The most exciting phrase to hear in science -the one that heralds new discoveries- is not "Eureka!" but "That's funny...". (I. Asimov)

Facts:

From the smallest microprocessor to the biggest mainframe, the average American depends on over 264 computers per day. 

In the 1980s, an IBM computer wasn't considered 100 percent compatible unless it could run Microsoft Flight Simulator*

The average computer user blinks 7 times a minute, less than half the normal rate of 20.

By the year 2012 there will be approximately 17 billion devices connected to the Internet.

Cool links:

Computer parts art - pretty cool and imaginative!

The game "Window cleaner" looks super boring - why on earth did someone bother to invent and code it (this part of the film is not a joke)?  And who would want a talking John McEnroe robot for 5.5 million dollars? - Very very funny stuff!

Affective Feedback

When Nicholas Belkin gave the 2008 Grand Challenges lecture for Information Retrieval, he stated that there needs to be far more research into affective computing.  This means taking user emotions into consideration.

"This could help us understand what subsequent actions the user is likely to take for example, and of course understand where negative feelings arise and allow us to reduce them."

This is an area of research which has been left behind and is still in its infancy.  Explicit and implicit feedback have been used and researched; however, they have very real limitations which affective computing may be able to address.

There's a paper that was presented at SIGIR 2008 called "Affective Feedback: An Investigation into the Role of Emotions in the Information Seeking Process" by Ioannis Arapakis, Joemon M Jose and Philip D Gray.  Here is a summary of the main points:

The current techniques which are used for relevance feedback (namely explicit and implicit) can determine content relevance according to the cognitive and situational levels of interaction between the user and the retrieval system.  The problem with this is that they don't take into consideration the user's intentions, motivations, feelings and so on, which can affect their information retrieval behaviour.

The emotional responses observed vary widely from user to user and from situation to situation.  This means that there needs to be a method which is capable of dealing with this.

Implicit and explicit feedback can determine whether a document is relevant or irrelevant.  For explicit feedback there's a trade-off between getting the documents that the system sees as important and those that the user is genuinely interested in.  "Eventually, as the task complexity increases the cognitive resources of the users stretch even thinner, turning the process of relevance assessment into a non-trivial task."

Additionally, implicit feedback, whilst collecting information about the user's search behaviour, suffers from reliability issues.  What can be observed does not necessarily match the user's intention.  Belkin and Kelly showed that implicit feedback is "unreliable, difficult to measure and interpret".

Kuhlthau found that the information seeking process has 3 dimensions: affective, cognitive, and physical.  The authors measured the physical dimension using a range of biometric measurements (GSR, skin temperature, etc.).  They also used a facial expression recognition system and hidden recording (because they wanted to be as invisible as possible).

They used the Indri open source search engine from the Lemur project because it can parse TREC newswire and web collections and return results in the TREC standard format, and it's also very reliable.
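
For anyone who hasn't seen it, the TREC standard format for results is one line per retrieved document with six whitespace-separated fields: query id, the literal Q0, document id, rank, score and a run tag.  Here is a minimal sketch of reading one such line in Python.  The function name and the sample line (document id, score and tag) are made up for illustration and are not taken from the paper.

```python
from collections import namedtuple

# One row of a TREC-style results file.
RunEntry = namedtuple("RunEntry", "query_id doc_id rank score run_tag")


def parse_run_line(line):
    """Split a 'qid Q0 docno rank score tag' line into typed fields."""
    query_id, _q0, doc_id, rank, score, run_tag = line.split()
    return RunEntry(query_id, doc_id, int(rank), float(score), run_tag)


# Hypothetical sample line, invented for this example.
sample = "301 Q0 FBIS3-10082 1 -4.52 indri"
entry = parse_run_line(sample)
print(entry.doc_id, entry.rank, entry.score)
```

Because every engine in an evaluation writes the same six columns, the same scoring tools can be run over all of them.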

The results were that happiness and irritation were the most intense emotions - followed by sadness, pleasure and surprise.  

"Task difficulty and complexity have a significant effect on the distribution of emotions across the three tasks. As the former increase, so do the negative emotions intensify and progressively overcast the positive ones. We hypothesize that this progression is the result of an underlying analogy between the aforementioned search factors and emotional valence, and,  furthermore, that it is indicative of the role of affective information as a feedback measure, on a cognitive, affective and interactional level".

They do believe, however, that low-frequency scores may be more important than higher ones.  They also found that "affective feedback should be treated differently as the task difficulty increases".

This is encouraging research because it does begin to address the issues in explicit and implicit relevance feedback.  It also stays in line with Nicholas Belkin's request for more affective computing research!  

Why should you care?

Well, if you imagine this system widely implemented across many search engines, you would need to take psychology and cognition into consideration when designing, structuring and optimizing your website.  Knowing where users are going to go next in a search for documents affects your keyword research a great deal.  This does throw a whole load of new variables into that task.



January 07, 2009

SEO = Adversarial IR

SEO is more often than not classified as an "Adversarial information retrieval" technique in the computing world.  I say this because AIRWeb, for example, considers "malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection" as spam, and there is a fine line between all that black hat stuff and the white hat stuff if you define it like that.  The fact that "algorithm reverse engineering" is viewed as a spam issue does directly affect the SEO industry.

The relationship between the SEO and the search engine can be described as adversarial because any undeserved gain in ranking for the SEO means a loss in accuracy for the search engine.

Becchetti, Baeza-Yates, Castillo, Donato and Leonardi say that "This relationship is however extremely complex in nature, both because it is mediated by the non univocal attitudes of customers towards spam, and because more than one form of Web spam exists which involves search engines" ("Link Analysis for Web Spam Detection", Feb 08).

"Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, considering that a good ranking on them is strongly correlated with more traffic, which often translates to more revenue". (AIRWeb)

It's a tricky one because, the way I see it, the computer scientists would quite like to protect their work and systems and stop them being tampered with.  SEO, or rather the malicious techniques used to change the outcome of those systems, is a right pain.  It means more work and it's annoying.  But the spammers and the SEO pros have created jobs in the search industry, let's not forget :)

The SEO wants to get top rankings for his/her clients.  It's necessary to figure out how the search engines work in order to be able to make sites stand out to the search engines and be ranked higher than other competitor sites.  There is a fair bit of rubbish going on where dud sites are ranking higher than content rich, more relevant ones, but overall in Google the results are good.  In order for a website to be well optimised, it needs to be highly relevant, useful, and basically be the best for what it does.  Back in the day this wasn't the case but now things have changed for the better.  

It's good for the whitehat SEO to have engineers penalising and banning blackhat sites.  Whitehats support it and cheer when "Justice" is done.  So here we can say they are on the same side, right?

Now if I also decide as an engineer to no longer use ranking to deliver my results to my users, the whole game changes.  The relationship between SEO specialist and engineer changes, and the SEO professional can become an ally.  The ranked list has long been deemed overly simplistic and too flat.  There is a next stage in this story, and everything about web 3.0 and beyond points to a change.  The rankings list is just an example; the whole web is undergoing a lot of change right now, as we all know.

The thing is that SEO shouldn't be malicious in any way.  Why would a search engine engineer be upset about people trying to make their sites more compliant, higher quality and in the right format amongst other things?  If my vision is the semantic web for example, then I'd be pretty pleased to have these website specialists available to me, helping tag up the whole web properly and expertly.  People creating super useful, highly relevant sites in a way that works for my technology is really good news.

AIRWeb has issued a call for papers, and I think some SEO people should submit something, because they need to start explaining what it is they do and showing how expert they are at using the technology.  I see some SEO experts as being part of the computing community as well.  Why would you want to be part of it as an SEO?  Thinking about your work and skills in terms of tools for building the web of tomorrow is a really exciting thing.  Your questions and input would be valuable, I think.

If you feel like you have something you would like to contribute, check out the site.  The paper will have to be of high quality, and it's a good idea to read past papers from the collection to see what kind of format they're after.  I am sure they would welcome people from the SEO community participating.  What can you do to help them?  The organisers are Dennis Fetterly from Microsoft Research and Zoltan Gyongyi from Google Research.

Best of luck!

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.