I wanted to draw a little attention to a Japanese project called the "Information Credibility Criteria Project". The NICT (National Institute of Information and Communications Technology) started it in 2006.
This project is all about looking at how information sources are not all equal, because they are written by different people who... are also not equal! If you write a post about banana skins and their use in cancer treatment, then unless you're a researcher in that area, your post isn't 100% credible. If you are a researcher in that area, the post is more credible because you have proven expertise in the subject. They don't look exclusively at writers and their authority, but also at other criteria that are not easy to determine automatically:
Credibility of information contents
They use predicate-argument structures rather than bare words for this analysis, and use "automatic synonymous expression acquisition" to deal with synonymous expressions. The sentences in the documents are classified into opinions, events and facts, and opinions are further classified into positive and negative ones. An ontology is produced dynamically for each given topic, which helps the user navigate the data.
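The real system relies on predicate-argument structures and learned models, but the classification step it describes can be sketched with toy keyword heuristics. Everything below (the cue lists, the past-tense test for events) is an invented illustration, not the NICT method:

```python
# Toy sketch of classifying sentences into opinion / event / fact,
# with polarity for opinions. The cue words are purely illustrative.

OPINION_CUES = {"think", "believe", "feel", "should", "great", "terrible"}
POSITIVE_CUES = {"great", "good", "excellent", "helpful"}
NEGATIVE_CUES = {"terrible", "bad", "harmful", "useless"}

def classify_sentence(sentence: str) -> dict:
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    if words & OPINION_CUES:
        if words & POSITIVE_CUES:
            polarity = "positive"
        elif words & NEGATIVE_CUES:
            polarity = "negative"
        else:
            polarity = "neutral"
        return {"type": "opinion", "polarity": polarity}
    # Crude stand-in for event detection: any past-tense-looking word.
    if any(w.endswith("ed") for w in words):
        return {"type": "event"}
    return {"type": "fact"}

print(classify_sentence("I think banana skins are great."))
# {'type': 'opinion', 'polarity': 'positive'}
```

A real pipeline would replace each heuristic with a trained classifier, but the three-way split plus polarity is the shape of the output the project describes.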
There are a lot of different variables that come into play when we look at the credibility of a document. Grammar, syntax and the accuracy of the data presented are all things I weigh when I look at a blog post or a website.
Credibility of information sender
They classify writers into individuals or organisations, celebrities or intellectuals, real names or aliases, and many more groupings. This information is gleaned from meta-information, but they use NLP techniques for it as well. The credibility evaluation is based on the quantity and quality of the information the sender has produced so far.
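The "quantity and quality of past output" idea can be made concrete with a small scoring sketch. The combination rule and the logarithmic quantity bonus here are my own assumptions for illustration, not the project's actual formula:

```python
# Illustrative sender-credibility score: average quality of past posts,
# scaled by a diminishing bonus for track-record size.

import math

def sender_credibility(quality_scores: list[float]) -> float:
    """Combine per-post quality (each in 0..1) with a quantity bonus."""
    if not quality_scores:
        return 0.0  # an unknown sender starts with no credibility
    avg_quality = sum(quality_scores) / len(quality_scores)
    # log1p gives diminishing returns; saturates around 100 posts.
    quantity_bonus = min(math.log1p(len(quality_scores)) / math.log1p(100), 1.0)
    return avg_quality * quantity_bonus

prolific = sender_credibility([0.8] * 50)
newcomer = sender_credibility([0.8])
print(prolific > newcomer)  # True: same quality, longer track record
```

Note how this encodes the trade-off discussed later in the post: a newcomer with one good post scores far below a prolific writer of equal quality.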
Credibility estimated from document style and superficial characteristics
They take into account whether a document is written in a formal or informal way, what kind of language is being used, how sophisticated the layout is and other such criteria.
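Superficial signals like these are easy to compute mechanically. A minimal sketch of such a feature extractor, with cue choices (word length, exclamation marks, paragraph breaks, shouting) that are my assumptions rather than NICT's:

```python
# Rough surface-level style features: formality, vocabulary, layout.

def style_features(text: str) -> dict:
    words = text.split()
    return {
        # Longer words loosely correlate with formal register.
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        # Heavy exclamation use suggests informality.
        "exclamation_ratio": text.count("!") / max(len(text), 1),
        # Paragraph breaks hint at deliberate layout.
        "has_paragraphs": "\n\n" in text,
        # ALL-CAPS words are a classic low-credibility signal.
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
    }
```

A downstream classifier would consume a vector like this alongside the content and sender features.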
Credibility based on social evaluation of information contents/sender
This is based on how the sender is viewed by others. They mine opinions from the web using NLP, or use existing rankings and evaluations where available.
The research can be applied to all areas of electronic information access, such as email, web documents, desktop documents and so on. The idea is not to replace the human but to support the human in his/her judgment of an information source.
Document credibility is an area that I believe is very important for the future of the web. We can rank documents in a sequence, as Google does for example, based on their relevance to the initial user query. Google also looks at authority, at the content, and at other factors. The problem, though, is that without a thorough analysis like the one being devised by NICT, less important documents can find themselves at the top of the rankings.
Looking at author authority rather than simply document authority is obviously useful, but if it isn't flexible enough then good, relevant documents could be omitted. Someone who has never written anything before will not, I assume, be considered very authoritative, and someone who has written a lot of bad content shoots themselves in the foot for all their future work! It therefore becomes important to have a certain standing on the web, or rather in the information community. If you are not considered very influential, then your work might not be considered influential either.
Obviously there is a lot more research to be done here and I really look forward to reading a lot more about it. You can check the publications page if you want to read more about the work that NICT has been doing since 2007.
Why should you care?
If this type of method works perfectly, you will need not only to provide good content but also to have a good reputation. This applies both to companies and to individuals. By finding out about the author in particular and taking that into account in overall document scoring, an engine could wipe out a good deal of spam, and a standard for "good content" would be set. It all reminds me of FOAF and the other methods that exist for tagging up individuals and their connections online. This is a fundamental part of the semantic web, after all, and it could easily be exploited in this way.