Science for SEO: Information credibility analysis

I wanted to draw a little attention to a Japanese project called the "Information Credibility Criteria Project". The NICT (National Institute of Information and Communications Technology) started it in 2006.

This project starts from the observation that information sources are not all equal, because they are written by different people who...are also not equal! If you write a post about banana skins and their use in cancer treatment, your post isn't 100% credible unless you're a researcher in this area. If you are a researcher in that area, it is more credible because you have the proven expertise to write about such a thing. They don't look exclusively at writers and their authority, but also at other criteria that are not easy to determine automatically:

Credibility of information contents

They use predicate-argument structures rather than words for this analysis, and use "automatic synonymous expression acquisition" to deal with synonymous expressions. The sentences in the documents are classified into opinions, events and facts. Opinions are further classified into positive and negative ones. An ontology is produced dynamically for each given topic, which helps the user interface with the data.

Many different variables come into play when we look at the credibility of a document. Grammar, syntax and the accuracy of the data presented are all strong signals for me when I look at a blog post or a website.

Credibility of information sender

They classify writers into individuals or organisations, celebrities or intellectuals, real name or alias, and many more groupings. This information is gleaned from meta-information, but they also use NLP techniques. The credibility evaluation is based on the quantity and quality of the information the sender has produced so far.
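To make the "quantity and quality" idea concrete, here is a minimal sketch of how such a sender score could be computed. This is my own illustration, not NICT's method: the `Sender` class, the per-document quality ratings and the `saturation` parameter are all hypothetical.

```python
from dataclasses import dataclass
from math import log1p
from typing import List


@dataclass
class Sender:
    name: str
    # Hypothetical quality ratings in [0, 1], one per document published so far.
    past_quality: List[float]


def sender_credibility(sender: Sender, saturation: int = 50) -> float:
    """Return a score in [0, 1] from the quantity and quality of past output."""
    n = len(sender.past_quality)
    if n == 0:
        return 0.0  # no track record yet
    # Quality term: average rating of everything published so far.
    quality = sum(sender.past_quality) / n
    # Quantity term: grows with the log of the document count, capped at 1.0
    # once the sender reaches `saturation` documents (an assumed cut-off).
    quantity = min(1.0, log1p(n) / log1p(saturation))
    return quality * quantity
```

With this toy formula a prolific writer of good content scores higher than an equally prolific writer of bad content, and a brand-new writer scores zero.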

Credibility estimated from document style and superficial characteristics

They take into account whether a document is written in a formal or informal way, what kind of language is being used, how sophisticated the layout is and other such criteria.

Credibility based on social evaluation of information contents/sender

This is based on how the sender is viewed by others. They use NLP-based opinion mining from the web, or existing rankings and evaluations where these are available.

The research can be applied to all areas of electronic information access, such as email, web docs and desktop docs. The idea is not to replace the human but to support the human in his/her judgment of an information source.

Document credibility is an area that I believe is very important for the future of the web. We can rank documents in a sequence, as Google does for example, based on their relevance to the initial user query. Google looks at authority as well, and also at the content, and other factors too. The problem, though, is that without a thorough analysis like the one being devised by NICT, documents that are perhaps not as important can find themselves at the top of the rankings.
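As a rough sketch of what taking credibility into account at ranking time might look like, the snippet below blends a relevance score with a credibility score. Everything in it is an assumption for illustration: the `Doc` fields, the `alpha` weight and the linear blend are not taken from NICT or from any search engine.

```python
from typing import List, NamedTuple


class Doc(NamedTuple):
    url: str
    relevance: float    # hypothetical query-relevance score in [0, 1]
    credibility: float  # hypothetical credibility score in [0, 1]


def rerank(docs: List[Doc], alpha: float = 0.7) -> List[Doc]:
    """Sort documents by alpha * relevance + (1 - alpha) * credibility."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")

    def blended(doc: Doc) -> float:
        return alpha * doc.relevance + (1 - alpha) * doc.credibility

    # Highest blended score first.
    return sorted(docs, key=blended, reverse=True)


docs = [
    Doc("example.org/a", relevance=0.9, credibility=0.1),
    Doc("example.org/b", relevance=0.8, credibility=0.9),
]
by_relevance_only = rerank(docs, alpha=1.0)
with_credibility = rerank(docs, alpha=0.7)
```

With `alpha=1.0` the slightly more relevant but barely credible page comes first; with `alpha=0.7` the credible page overtakes it.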

Looking at things like author authority rather than simply document authority is obviously useful, but if this isn't flexible enough then good relevant documents could be omitted. Someone who has never written anything before will not, I assume, be considered very authoritative, and someone who has written a lot of bad content shoots themselves in the foot for all their future work! It therefore becomes important to have a certain standing on the web, or rather in the information community. If you are not considered very influential, then your work might not be considered influential either.
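One way to get that flexibility, sketched here purely as an assumption on my part, is to pull every author's score towards a neutral prior so that newcomers start in the middle rather than at the bottom, and a bad track record can be repaired over time. The function name, `prior` and `prior_weight` are hypothetical.

```python
from typing import Sequence


def smoothed_author_score(
    ratings: Sequence[float],
    prior: float = 0.5,
    prior_weight: float = 5.0,
) -> float:
    """Average the ratings together with `prior_weight` imaginary neutral ones."""
    # With no ratings the score equals the prior; with many ratings the
    # prior's influence fades and the real average dominates.
    return (prior * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


newcomer = smoothed_author_score([])                       # starts at the prior
bad_history = smoothed_author_score([0.0] * 20)            # dragged well below it
recovering = smoothed_author_score([0.0] * 20 + [1.0] * 20)  # good work lifts it again
```

The newcomer is neither trusted nor written off, and the author with a bad history is penalised without being locked out for good.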

Obviously there is a lot more research to be done here and I really look forward to reading a lot more about it. You can check the publications page if you want to read more about the work that NICT has been doing since 2007.

Why should you care?

If this type of method works perfectly, you will need not only to provide good content but also to have a good reputation. This applies to both companies and individuals. By finding out about the author in particular and taking that into account for overall document scoring, an engine could wipe out a good deal of spam, and the standard for "good content" would also be set. It all reminds me of FOAF and the other methods which exist for tagging up individuals and their connections online. This is a fundamental part of the semantic web after all, and it could easily be exploited in this way.

1 comment:

Anonymous said...: I read something about the topic from a colleague's site; the post is here:
http://www.webdesignfromscratch.com/blog/web30-concept-notes.php

The big problem as I see it is trying to identify people in a trustworthy way...like with social media - it's so easily spammed with multiple pseudonyms, etc...

It's good to see that there is research going into this field, but I do doubt its reliability for now. People would have to entrust a lot more information to such a technology in order for them to be accurately identified, and of course avoid the situation of multiple pseudonyms.

...This is only a potentially negative repercussion, but for those that follow the rules, it would be massively valuable I expect.

Really interesting post CJ - thanks for that.

Ben; 28 January 2009 at 12:16

January 28, 2009