
September 30, 2008

SEO and CLIR

CLIR, "Cross Language Information Retrieval" (also referred to many names as well as "translingual") has been in research since at least 1996, when the first conference on the topic was held as part of SIGIR.  It involves retrieving information from a user query which is in a different language. The user may ask a question in Dutch and require results in German for exmaple.  

Google Translate offers such a service, but research continues in this field.  It provides results in both English and French if you choose those languages, and I think they're of pretty good quality as well.

Researchers from UMass have published a paper entitled "Simultaneous Multilingual Search for Translingual Information Retrieval".  They describe a method which integrates document translation and query translation into the retrieval model.

Basically, each indexed document contains both its original text and its translation into the query language.  Each query term and its translations are treated as a synonym set.  So they run one search instead of two separate searches, over one index instead of two.
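
To make the synonym-set idea concrete, here's a minimal sketch of my own (a toy illustration, not the authors' system): documents sit in one joint index with terms from both languages, and each query term is expanded with its dictionary translations so that a single search matches documents in either language.  The little dictionary and documents below are invented for the example.

from collections import Counter

# English -> Dutch dictionary, purely illustrative
translations = {"house": ["huis"], "garden": ["tuin"]}

docs = {
    "doc_en": "the house has a large garden".split(),
    "doc_nl": "het huis heeft een grote tuin".split(),
}

def score(doc_terms, query):
    # A query term and its translations are treated as one synonym set,
    # so one search covers documents in either language.
    tf = Counter(doc_terms)
    total = 0
    for term in query:
        synset = [term] + translations.get(term, [])
        total += sum(tf[t] for t in synset)
    return total

query = ["house", "garden"]
print({name: score(terms, query) for name, terms in docs.items()})
# both the English and the Dutch document match the single query

A real system would of course weight the terms within a proper retrieval model rather than just counting matches, which is what the paper does.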

As a result, they state:

"This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone."

They made use of Wikipedia for names, but because they were dealing with languages that don't share a script, such as English and Chinese, all of the names had to be translated.  There were occurrences of misspellings, so they had to build sets of name variants.  They were restricted by the limitations of machine translation technology, particularly in the area of named entities.

Their evaluation showed that:

"Our experimental results show that this approach significantly outperforms a previous hybrid approach, which merges the results of separate queries issued over separately indexed source and English documents. Our experiments evaluated results for English queries and Chinese documents, but our implementation of SMLIR currently includes three languages, English, Chinese and Arabic, demonstrating the ability to seamlessly integrate multiple languages into one framework."

They continue to work on their project to improve the performance of the system.  Please read their paper for more precise information and a better understanding of their interesting work.

CLIR is an important area of research for the future of search engines because the greatest number of Internet users are in Asian countries, meaning that their searches and the data they produce are likely to be in their own languages and not English.  It is important for us to be able to access those documents and understand their information; otherwise we'd be missing out quite considerably.  The same goes the other way, and for countries with fewer users who need to access information that is mostly in a foreign language.

How does SEO fit in?  I guess you still make your site relevant and content-rich, and maybe you'll be able to translate it to see how it reads in Chinese or whatever other language.  The method of search is the same though; I mean the search results are the same anyway.  You can check in Google Translate.

New developments may change things a little at least, though.  Watch this space.

Paul Baran

Paul Baran was one of the three inventors, in the 1960s, of packet-switched networking.  This is a method where data traffic gets split into chunks, which we call packets, and routed over shared networks.

Without this invention, the Internet would really have struggled to happen!   He wrote a lot of papers which described his ideas, in particular an architecture for "a large-scale, distributed, survivable communications network."  
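
Just to illustrate the idea (a toy sketch in Python, nothing to do with Baran's actual designs): split a message into numbered packets that could each travel independently, then reassemble them at the other end even if they arrive out of order.

def packetize(data: bytes, mtu: int = 8):
    # Split a byte stream into numbered packets that could each be routed
    # independently over a shared network (a toy illustration, not a protocol).
    total = (len(data) + mtu - 1) // mtu
    return [{"seq": i, "total": total, "payload": data[off:off + mtu]}
            for i, off in enumerate(range(0, len(data), mtu))]

def reassemble(packets):
    # Packets may arrive out of order; sort by sequence number and rejoin.
    return b"".join(p["payload"] for p in sorted(packets, key=lambda p: p["seq"]))

pkts = packetize(b"Without packet switching, no Internet.")
assert reassemble(list(reversed(pkts))) == b"Without packet switching, no Internet."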

He is soon to visit the White House to be honoured for his contribution to the Internet.

He says: "When it comes to things like science it doesn't make a damn bit of difference where the idea comes from, whether it comes from a person in India or here, as long as we all share it."

Spoken like a true scientist.

Read more at Physorg

Technorati report - part 5

This section is entitled "Brands Enter The Blogosphere".

Here is a summary of Technorati's findings:

They found that four in five bloggers mention brands, either in a positive or a negative light.  This rings very true in my opinion; I discuss Google and other search engines, after all.  A third of bloggers have been approached by a brand.

Most believe that newspapers won't survive the next 10 years.  I think it would be a shame because I like to get the Sunday paper and read all the supplements over coffee, and sprawl out over the sofa, but I have to admit that I rarely buy a daily and I do read all my news online.

Bloggers spend twice as much time online as other users.  This I believe.  I am a little ashamed to admit that I must spend on average at least 12 hours online a day between my job, my blog, my usual net admin (email, social networks) and my news fix.  Technorati also found that bloggers are the most likely to use new applications, which is true for me as I pounce on them, and that they use social media a lot, taking part in at least five Web 2.0 activities.

"37% of bloggers have been quoted in traditional media based on a blog post." - I haven't had the honour but I have read blog mention in New Scientist even.

The report also states that bloggers don't spend as much time watching TV.  I hardly ever watch TV, and if I have it on, it's humdrum in the background.  I'd be interested to know what kind of TV programmes bloggers watch.

For more, read the report.

September 26, 2008

60% addicted to social media.


This quiz was provided by - Search & Social - Media Experts

The next internet - by Google

Google released an article yesterday from Vint Cerf about "the next Internet":

"The flexibility we have seen in the Internet is a consequence of one simple observation: the Internet is essentially a software artifact. As we have learned in the past several decades, software is an endless frontier. There is no limit to what can be programmed. If we can imagine it, there's a good chance it can be programmed. The Internet of the future will be suffused with software, information, data archives, and populated with devices, appliances, and people who are interacting with and through this rich fabric.

And Google will be there, helping to make sense of it all, helping to organize and make everything accessible and useful."

Read it, it's important, even if some of us might disagree with it.

Friday's geeky humour

A manager went to a master programmer and showed him the requirements document for a new application. The manager asked the master, "How long will it take to design this system if I assign five programmers to it?" 

"It will take one year," said the master promptly. 

"But we need this system immediately if not sooner! How long will it take if I assign ten programmers to it?" 

The master programmer frowned. "In that case, it will take two years." 

"And what if I assign a hundred programmers to it?" 

The master programmer shrugged. "Then the design will never be completed," he said. 

Have a good weekend :)

Technorati report - part 4

This section is about blogging for profit.  Here are some snippets; it's not as long as the other sections have been so far.

"Bloggers with advertising are more sophisticated in terms of their use of tools, advertising platforms and even events to build reader loyalty. They also invest more resources (both time and money) in their blogs. "

Apparently those bloggers who choose to not have advertising on their blogs have a lack of interest in it, and don't want their blog to look too crowded (that was actually one of my reasons).

One in four bloggers uses three or more means of advertising.

"...two in three have contextual ads (such as Google AdSense). One-third of bloggers have affiliate advertising on their blog. One in five negotiate directly with advertisers and one in ten sell advertising through a blog ad network."

"The average annual blogger revenue is more than $6,000. However, this is skewed by the top 1% of bloggers who earn $200k+." - wow!

But...they invest an average of $1,800 annually in their blogs. - not bad really.

Most high-earning bloggers are male, work for themselves, write corporate blogs, and earn about $19,000 a year.

I think they deserve the money, it's hard work and takes a lot of time to keep a blog.  I'm not sure I could do it without the love of the topic!

September 25, 2008

3 degrees of separation

Cow's blog brought my attention to this bit of research from O2.  They say that (within a shared ‘interest’ network) there are no longer six degrees of separation but only three, due to the rise of telecommunications and social networking.  They say email and mobile phones have had the most significant impact.

"All respondents were asked to make contact with an unknown person from destinations selected at random from across the globe using only personal connections. By using their shared interest networks the participants were able, on average, to make the connection in three person-to-person links."

Scary!


Technorati report - Part 3

This one is called "The how of blogging".

One in four bloggers spends at least 10 hours a week on their blog.  This, when you think about it, is quite a lot of time if you consider that a working week is on average 40 hours.  I spend time here and there when I have it and aim to post once a day, but sometimes it's twice a day and at times not at all.

It's worth posting often and lots though, if you want to be popular:

"Over half of the Technorati top authority bloggers post five or more times per day, and they are twice as likely to tag their blog posts compared to other bloggers."

They're also good at getting the word out about their blogs, listing them in Technorati and Google, and commenting on and linking out to other blogs.  Tagging posts also increases traffic.  Apparently "half of active blogs attract more than 1000 monthly visitors."  That's a lot as far as "Science for SEO" is concerned :)

But one in ten bloggers hire people to work on their blogs, corporate blogs take on full-time and/or part-time staff, and most have unpaid help.  I suspect the unpaid help comes in the form of employees.

Also "Bloggers with advertising invest significantly more money in their blogs than bloggers who do not accept advertising.".

This blog has not been optimised, and is not listed all over the place.  I don't participate in forums and other blogs for coverage either, but yes, the URL is listed here and there.  I have analytics enabled, but I can't remember when I last checked them.  I am therefore a bad blogger.  

This is a bit of an experiment though.  I spend a lot of my time optimising blogs, websites and forums, and working on social media.  I didn't want this blog to be all about that, but rather about the interesting bit of SEO we sometimes forget while we frantically rush to get everything done in the week.  That bit is the research and investigation into what's going on out there, in academic and corporate research, on the internet, in computing.  I don't blog about mainstream SEO news, because many people do that well enough.

I guess there weren't enough bloggers like me to be incorporated into the Technorati stats :)

Technorati report - part 2

This section is called "The what and why of blogging".

This part of the report explains the motivation of bloggers and gives some interesting stats about bloggers.

It says that most bloggers consider their style to be "sincere, conversational, humorous and expert."  Apparently snarky, confessional and gossip-oriented blogs fell to the bottom of the list.  However, both personal and professional blogs are popular.

Some bloggers want to remain anonymous, because of colleagues, work and so on, but many are happy to disclose their identity.  Those who have are better known in their industry and have been invited to speak at events, write for a publication or appear on the radio as a result of keeping their blog.

I have made friends through my blogs, especially my blog about my trip to Mysore, India.  My most successful blog was in fact under an alias, as I was commenting on papers, not always in a positive light, by people who could end up being an examiner in my viva or a reviewer for an academic publication.  It was best to keep a low profile.  I did have a lot of success from it, being quoted on sites such as SEOmoz, eConsultancy and SEO by the Sea.  I've still never owned up to it!

The report also states that:

"International bloggers tend to be less conversational and snarky. Asian bloggers tend to be more motivational and confessional, while European bloggers are more confrontational. Women tend to be more conversational in their blogging style, while men tend to be expert."

It depends on what blog you're writing, I think.  Here I'm more expert, and on my other blogs more conversational.  This is a blog about science and SEO, and my others are about trips and yoga.  I don't think those stats necessarily represent the demographics all that accurately.

People blog because:
  • They want recognition
  • To network
  • To get into traditional media
  • For career advancement
  • To make money
  • To self-promote
  • To share ideas
  • To be known as an expert
  • “to bake half-baked ideas.”
I would say that my motivation is to become more involved in the SEO community and the science community once again, after a break.  I enjoy the exchange of ideas and it's a good place to get my ideas straight.  Well, mostly :)

September 23, 2008

State of the Blogosphere / 2008

Technorati have released their "State of the Blogosphere" report for 2008.  They define the blogosphere as "The ecosystem of interconnected communities of bloggers and readers at the convergence of journalism and conversation."

It's a really interesting read, full of good facts and figures.  It'll be released in 5 consecutive daily segments so we'll look forward to those.  Today's one is full of fun facts.  They also surveyed bloggers directly as well as delving into their resources.  

I have 43 blogs in my feed reader that I read daily; it's the first thing I do in the morning, and then I'll pick up the snippets throughout the day.  They're blogs about computer science (e.g. MIT News), SEO (e.g. Search Engine Watch), physics/science (e.g. PhysOrg), technology news (e.g. TechCrunch), search engines (e.g. the Google blog), and then random ones (e.g. The Lazy Linguist, Digg, ...).  I'd say that in total I must read about 50 in all, as I get news on iGoogle too, like Reuters and so on.

Is it excessive?  I don't think so, for a computing person who lives online and is a news junkie.  Also I consider it part of my job.  I need to know what's happening out there so I can adjust my approach and also be up to date on developments.  It also allows me to prepare in advance of things that are likely to happen, like the explosion of social media for example.

ComScore figures cited in the Technorati report show that the total US audience for blogs (77.7 million), Facebook (41 million) and MySpace (75.1 million) combined comes to 188.9 million.  50% of Internet users in the US read blogs, and 12% of US Internet users are bloggers.  184 million people have started a blog and 346 million read them.  But how many of those blogs are actually kept up to date after being started?

Apparently bloggers cover about five topics per blog.  I cover SEO, computing and internet news, so I guess that's about right.  Also, four in five bloggers post product and brand reviews, which shows how important bloggers are for business.  The report says that a third of bloggers have been approached to be brand advocates.  Bloggers are also making money from having ads on their blogs (50% in Europe!).  It's not something I really want to consider, because this blog is about information.  Girls tend to write personal blogs and guys tend to write professional blogs.

It's really valuable information, as it's good for companies to have a blog and post information about their industry and about other news that is interesting and useful.  These figures can help convince the reticent, and those with a half-hearted approach.  They need to get involved, read blogs related to their industry, participate and, I hope, enjoy the experience and learn a lot from others about their industry as well.  You can collect natural links from really good sources for your site by simply getting out there and getting stuck in.

A lot of companies argue that it takes a lot of manpower to do this and it's true, it's time consuming.  I do think it's well worth it though and it might be worth outsourcing to some professionals or hiring someone experienced to do a good job. 

September 20, 2008

Howard Aiken

I usually post a Friday piece of light humour or something to celebrate the end of the week, but I've been out of action with a nasty cold that is doing the rounds.  Anyway, here it is now:

"Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats."

That's by Howard Aiken.  He had a PhD in physics and went off to work with IBM on the Harvard Mark I computer, which was his little idea.

Another statistical method for IR

Miles Efron published a paper entitled "An Approach to Information Retrieval Based on Statistical Model Selection" this August.  He proposes using statistical model selection for information retrieval.

"The proposed approach offers two main contributions. First, we posit the notion of a document's "null model," a language model that conditions our assessment of the document model's significance with respect to the query. Second, we introduce an information-theoretic model complexity penalty into document ranking. We rank documents on a penalized log-likelihood ratio comparing the probability that each document model generated the query versus the likelihood that a corresponding "null" model generated it. Each model is assessed by the Akaike information criterion (AIC), the expected Kullback-Leibler divergence between the observed model (null or non-null) and the underlying model that generated the data. We report experimental results where the model selection approach offers improvement over traditional LM retrieval."

In short, he chooses a single model from a pool of candidate models, favouring models that fit the data well, and uses an Occam's razor principle to avoid overfitting.  He ranks documents on a statistic related to AIC, which consists of the difference between the document model and its corresponding null model.  Given a document d_i, he derives a statistic corresponding to a test of the null hypothesis, H0.  And...

"Specically, we rank documents on the di erence in the Akaike information criterion (AIC) between the non-null and null models. AIC is the expected Kullback- Leibler divergence between a given model and the unknown model that generated the data. Thus ranking documents by AIC di erence o ers a theoretically sound method of conducting IR."

He also states:

"We argue that we can improve retrieval performance by mitigating the role of query- document term coordination. Instead of rewarding documents that match many query terms, we argue, we should reward documents that match the best query terms. Using AIC di erences a ords a natural means of operationalizing this intuition."

He finds that his method rarely does worse than the standard language modelling (LM) approach, and that it performs significantly better when a small amount of smoothing is applied to the language models.

It's interesting to see another statistical method, and I'd love to see more evaluation and progress; it looks promising.  Personally I am of the opinion that a mixture of linguistic models and statistical models is necessary.  I think using one or the other is limiting.  I've talked about the Lemur project before and this is statistical as well.  N-grams, Markov models, TF-IDF, the query likelihood model, multivariate Bernoulli: there are a lot of different techniques, and very prominent and much-respected people have worked on them and are working on them right now.  But let's not forget to involve the linguists now and again.

Also, let's remember one of the well-known computer science sayings:

"If enough data is collected, anything can be proved by statistical methods" ( I might not include this quote in my thesis)

September 16, 2008

GAUDI

Google have been working on GAUDI (Google Audio Indexing) for some time now.  They incorporated their speech recognition technology into YouTube (transferring speech to text and then indexing it), and now there is a dedicated Labs page for the project.

You can have a play, searching for words in the video clips. There's also a Google group for it.

From the Google labs page:

"Google Audio Indexing uses speech technology to transform spoken words into text and leverages the Google indexing technology to return the best results to the user.

The returned videos are ranked based -- among other things -- on the spoken content, the metadata, the freshness.

We periodically crawl the YouTube political channels for new content. As soon as a new video is uploaded to YouTube, it is processed by our system and made available in our index for people to search."

Audio indexing research has been around at least since the early '90s. The main problem is obviously accuracy, as the system has to recognise different accents, for example. I think that getting it to work on music would be quite a breakthrough because of all the "noise" around the words.

The kinds of measures used in this type of technology include amplitude, zero-crossing rate, bandwidth, band energy in the sub-bands, and spectrum and periodicity properties.
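
As a small illustration of two of these measures (nothing to do with GAUDI's actual pipeline, just a toy sketch with NumPy): short-time energy and zero-crossing rate computed over frames of a signal, classic features for telling tone-like frames apart from noisy ones.

import numpy as np

def frame_features(signal, sr, frame_ms=25, hop_ms=10):
    # Per-frame short-time energy and zero-crossing rate
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame]
        energy = float(np.mean(x ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(x))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

# Toy signal: a 440 Hz tone followed by white noise, at 16 kHz
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([0.5 * np.sin(2 * np.pi * 440 * t),
                         0.1 * np.random.randn(sr)])
print(frame_features(signal, sr)[:5])  # noise frames show a much higher ZCR than the tone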

Once GAUDI works well and is fully deployed, it'll be extremely useful for us all, I'm sure. This also means that all those podcasts you have about your company and its products and services will be very useful in helping users find you in Google.

Collaborative search

Jeremy Pickens, Gene Golovchinsky, Chirag Shah, Pernilla Qvarfordt, and Maribeth Back, all from FX Palo Alto Laboratory won SIGIR paper of the year 2008 with "Algorithmic Mediation for Collaborative Exploratory Search".

It's a really interesting paper. It's about searching in a completely different way than we do at the moment. The idea is that users with the same information need team up and search together at the same time. They're provided with tools to help them search and collaborate, and also "algorithmically-mediated retrieval to focus, enhance and augment the team's search and communication activities."

Recommender systems can use the user profile to reference certain characteristics, ask the user to get involved and vote on things that are useful or not, rank items, and so on.

Collaborative filtering systems have agents that collaborate to find information and filters that sift through the patterns to make sense of it. They look for users who share similarities with a profile, and use their data to make predictions. Amazon was one of the first to propose rating products, way back in 2000, to add to its variables. Other types of collaborative filtering look at the user's activity history to make predictions.
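
Here's a minimal sketch of the user-based flavour of collaborative filtering (a toy example of my own, with made-up users and items): find users whose ratings look like yours, then predict your rating for an unseen item as a similarity-weighted average of theirs.

import math

# Made-up ratings: user -> {item: rating}
ratings = {
    "ann":   {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 2, "item4": 5},
    "carol": {"item2": 5, "item3": 2, "item4": 1},
}

def cosine(u, v):
    # Similarity between two users over the items they both rated
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    num = sum(u[i] * v[i] for i in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den

def predict(user, item):
    # Similarity-weighted average of other users' ratings for the item
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None

print(predict("ann", "item4"))  # an estimate of ann's rating for an item she hasn't seen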

Sites and services that use this kind of method are already very popular. I regularly use Amazon, Digg, Stumbleupon, iTunes, iLike, Last.fm and a whole host of others.

I'm not sure however if I would like to use this kind of thing for searching the Google index for example. In my experience the recommender systems and collaborative filtering systems are very useful and I discovered all sorts of new music that I like this way, and books I've enjoyed reading, but there are also a mountain of suggestions that are not relevant to me.

I use search engines all day for work, and I don't have time to be offered something I might find more interesting. Having said that, how accurate are my results anyway? Well, I always manage to find what I'm looking for one way or another, so it can't be too bad. But how much better could it be?

Search engines have trouble with vague or ambiguous queries and currently the preferred solution seems to be to apply personalisation or query context. I don't think personalisation has been tested to its full potential just yet and I think it has the potential to improve things. Query context information is useful but the problem of "query drift" (user changing his/her intent) still remains.

September 12, 2008

What is AI?

AI is a field that overlaps with computer science rather than being a strict subfield. Different areas of AI are more closely related to psychology, philosophy, logic, linguistics, and even neurophysiology. People might want to automate human intelligence for a number of different reasons. One reason is simply to understand human intelligence better. For example, we may be able to test and refine psychological and linguistic theories by writing programs which attempt to simulate aspects of human behavior. Another reason is simply so that we have smarter programs. We may not care if the programs accurately simulate human reasoning, but by studying human reasoning we may develop useful techniques for solving difficult problems. -- Alison Cawsey (Databases and Artificial Intelligence, 1994)

September 11, 2008

Search, a 90-10 problem

Marissa Mayer posted an article on the future of search on the Google blog today.  She expands on the 90-10 problem that she believes search to be.

She explains:

"We’re all familiar with 80-20 problems, where the last 20% of the solution is 80% of the work. Search is a 90-10 problem. Today, we have a 90% solution: I could answer all of my unanswered Saturday questions, not ideally or easily, but I could get it done with today’s search tool. (If you’re curious, the answers are below.) However, that remaining 10% of the problem really represents 90% (in fact, more than 90%) of the work. Coming up with elegant, fitting and relevant solutions to meet the challenges of mobility, modes, media, personalization, location, socialization, and language will take decades."

The main advances necessary in search at the moment are listed in the article as being mobile, media, personalization and language.  That does broadly cover it.  

This made me smile though:

"So what's our straightforward definition of the ideal search engine? Your best friend with instant access to all the world’s facts and a photographic memory of everything you’ve seen and know."

I'm not sure about it being my best friend, I think if I did say my best friend was a search engine my non-geeky friends would have me down as a lost cause :) 

Chrome security - calm down.

I answered a post on SEW and thought it was useful information for the blog as well, so I've added to it here.  It was a thread about Chrome's security flaws and whether people should basically be using it at all.

Last week, the German federal office for information security advised against the use of Google Chrome.  The official apparently said that the fact that it was released as a beta was problematic, and that it was risky for one vendor to have all of this data.  It seems the German official wasn't quoted completely accurately.  It also seems strange that Chrome should be targeted because of its beta release when IE is often in beta, as are many other browsers.  As for Google having too much information, well, no one ever complained about Microsoft before.

In fact, as Matt Cutts says, people don't seem to know that Chrome doesn't send information about your surfing habits to Google, whereas Microsoft's IE8 beta 2 will send Microsoft information if the "suggested sites" feature is enabled.

Chrome is a beta, so there are always going to be some issues, and other browsers have security flaws as well.  In fact Chrome has some good security features that other browsers don't, like site blacklists; there is also a privacy mode (Incognito), a dialogue box where you can clear your data, and the rendering engine runs in a sandbox (if something bad is running in one tab, only that tab is affected and not the whole browser).

The vulnerabilities to date include the flaw inherited from Safari's WebKit, some Java bugs (OK, one of which is severe) and a DoS (denial of service) vulnerability.

Google did hire Michal Zalewski in July though, so he's probably helping out with the Chrome product.

I don't blame people for waiting for it to come out of beta but that might take quite a long time.  In the meantime, other browsers have been used happily for years along with their security flaws.   Firefox 3.0 as recently as June was found to have some problems, for example.

IBM's X-Force report was released in July 2008.  It says that 94 percent of all browser-related online exploits have happened within 24 hours of official vulnerability disclosure ("zero-day" exploits).  Browser plug-ins are the favourite hacker weapon (how many Firefox plug-ins do you have?), as 78% of all hacks during the early part of 2008 happened this way.  The report says that automated toolkits, obfuscation and unpatched browsers are the primary hacking route at the moment.  It also says "Although the most exploited Web browser vulnerabilities are one to two years old, the availability of public proof-of-concept and exploit code is speeding the integration of more contemporary exploits into toolkits."  Automated SQL injection attacks are also on the up.

This report was released before Chrome was even around, so it's safe to say that many of us have been using insecure browsers.  I understand being cautious, but it's not fair to call Chrome an unsafe browser when others are not necessarily any better.

September 09, 2008

Google and privacy

Recently there's been a lot of fuss over Google Chrome's EULA, and over Google protecting user privacy.  Google rectified the EULA earlier last week.  

At first it read that users granted "a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services."  After the change, it read that users "retain copyright and any other rights" they already hold in content they submit or display using the browser.

I do believe that it was a mistake, as they said they used their standard terms of service.  

I don't think Google has any interest in what I might be writing in "Documents", and I don't think there is any need for them to index such information.  It would annoy all the users and we'd use something else instead.  There is a fine line to tread when collecting user data to train machines: it has to be anonymised, and of course people have to agree to it, which we all do when we tick that box.  Frankly, I think it would be an enormous shame for Google not to collect user information (for behavioural analysis, for example) through Chrome.  This is how research is done; you need to learn from somewhere.

Anyway, Google have reacted to the outcry and have released a nice video explaining the Google search privacy policy, and also a blog article.  They state:  "we'll anonymize IP addresses on our server logs after 9 months. We're significantly shortening our previous 18-month retention policy to address regulatory concerns and to take another step to improve privacy for our users." 

Google News archive search

Today Google brought our attention to the fact that they have been working on digitising newspapers for online use.  Their technology can tell headings and text apart through optical character recognition.  It's built on the scanning technology used for books, but has some extra features.

You can browse these in the Google News archive, or by using the timeline feature in Google News.  Google are going to keep working on this, and they're going to serve contextual ads through the service.  They say that they're also going to try to drive print subscriptions.

I can see this service being really useful for people researching history, as essentially it replaces the microfiche method at the library.  It adds to Google's archived news data, and as we know, more data is always better.  I'm not convinced I'll be using this service though, I'm not entirely sure what I'd use it for.

September 08, 2008

Grand challenges for IR

There's an article in the LA Times where Marissa Mayer from Google is interviewed about the last 10 years of Google.  She states in the interview:

"Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%."

She's right, but there's a lot to do still.  SIGIR published a paper entitled "Some(what) Grand Challenges for Information Retrieval", written by Nicholas Belkin (Rutgers Uni).  Some of the challenges he identifies are (and I quote):

1) "the ability to characterize and differentiate among information-related goals, tasks and intentions in some principled manners that go beyond straightforward listings, that will apply across a wide variety of contexts, and from which design principles for IR systems can be inferred".

2)" we need to develop methods for inferring information-related goals, tasks and intentions from implicit sources of evidence, such as previous or concurrent behaviors".

3) "going from characterization and identification of goals, tasks and intentions, to IR techniques which actually respond effectively to them, is a challenge that has been barely noted, much less addressed, to date. One reason is, of course, that we so far don’t have the necessary characterizations, but another is that putting together the research expertise in the study of information behavior and in the development of IR techniques is a challenge in and of itself."

4) We need to identify what "context" is

5) To understand how emotions affect information search.

6) Personalisation is far from perfect.

7) Integrating IR in the search environment.

8) Better evaluation is needed.

People like Liadh Kelly (Dublin City Uni) have been looking at improving retrieval in a more human way:

"Existing retrieval techniques are good at retrieving from non-personal spaces, such as the World Wide Web. However they are not sufficient for retrieval of items from these new unstructured spaces which contain items that are personal to the individual, and of which the user has personal memories and with which has had previous interaction. We believe that there are new and exciting possibilities for retrieval from personal archives."

That kind of work is interesting because it starts to push through into areas of IR that we just didn't look at previously.

It's very interesting to look at what Bruce Croft (UMass) said we wanted from IR in 1995.  His 10 issues are relevance feedback, IE (information extraction), multimedia retrieval, effective retrieval, routing and filtering, interfaces and browsing, "magic", indexing and retrieval, distributed IR, and integrated solutions.  The "magic" issue concerns the vocabulary mismatch, and we have come a long way in this area.  In fact we have come a long way since 1995, definitely; however, the fundamental issues still remain, and we don't yet have a system that works perfectly.  Google has addressed all of Croft's issues, though.

So yes, 10% of the search problem left...arguably a bit more.  Multilingual information retrieval is still quite a challenge, a big challenge.  

September 05, 2008

Sending an internet....

"I just the other day got an internet. It was sent by my staff at 10 o'clock in the morning on Friday and I just got it yesterday. Why ? Because it got tangled up with all these things going on the internet commercially... They want to deliver vast amounts of information over the internet. And again, the internet is not something you just dump something on. It's not a truck. It's a series of tubes. And if you don't understand those tubes can be filled and if they are filled, when you put your message in, it gets in line and its going to be delayed by anyone that puts into that tube enormous amounts of material, enormous amounts of material." — Senator Ted Stevens (R-Alaska) explaining how the Internet works (2006).

Relating Documents via User Activity

Elin Pedersen (Google) and David McDonald (Uni Washington) wrote an interesting paper entitled "Relating Documents via User Activity: The Missing Link". The research was "carried out as part of a project in the Office of the CTO, Microsoft".

The Abstract:

"In this paper we describe a system for creating and exposing relationships between documents: a user’s interaction with digital objects (like documents) is interpreted as links – to be discovered and maintained by the system. Such relationships are created automatically, requiring no priming by the user. Using a very simple set of heuristics, we demonstrate the uniquely useful relationships that can be established between documents that have been touched by the user. Furthermore, this mechanism for relationship building is media agnostic, thus discovering relationships that would not be found by conventional content based approaches. We describe a proof-of-concept implementation of this basic idea and discuss a couple of natural expansions of the scope of user activity monitoring."

They use a system called Ivan, which monitors a user's activity during a task, taking note in particular of times when documents are on the screen together, when the user switches between them, or when other manipulations are performed, like cutting and pasting from one document to the other. Ivan helps the user find clusters of documents that were used at the same time and repeatedly, and it also finds relationships between documents. The inventors state that it's a mixture of a recommendation system like Amazon's and Google's page ranking. Ivan captures user activity and then builds relationships.

Activity is captured through message spying, and they focus on symmetrical relationships for pairs of documents. A relationship is established when one document is opened and another is already open. When a user performs actions between those documents, the relationship is strengthened.
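
A very stripped-down sketch of that idea (my own toy code, not FXPAL's Ivan): keep a weighted link between any two documents that are open at the same time, strengthen it whenever the user does something that involves both, and rank a document's neighbours by link weight.

from collections import defaultdict

class ActivityLinker:
    # Toy activity-based document linking: co-open documents get a link,
    # and actions such as copy/paste between two documents strengthen it.
    def __init__(self):
        self.open_docs = set()
        self.weights = defaultdict(float)   # frozenset({a, b}) -> strength

    def opened(self, doc):
        for other in self.open_docs:
            self.weights[frozenset((doc, other))] += 1.0
        self.open_docs.add(doc)

    def closed(self, doc):
        self.open_docs.discard(doc)

    def interaction(self, src, dst, boost=2.0):
        # e.g. cutting and pasting from src into dst, or switching between them
        self.weights[frozenset((src, dst))] += boost

    def related(self, doc, top=5):
        pairs = [(p, w) for p, w in self.weights.items() if doc in p]
        pairs.sort(key=lambda pw: -pw[1])
        return [(next(iter(p - {doc})), w) for p, w in pairs[:top]]

linker = ActivityLinker()
linker.opened("report.doc")
linker.opened("budget.xls")
linker.interaction("budget.xls", "report.doc")   # paste some figures into the report
print(linker.related("report.doc"))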

They had problems with matching file system events with window events and also found it a challenge to get a reliable and non-invasive stream of user interaction events. They've decided to look at web apps instead of the desktop, as they feel that the desktop will in time be redundant.

It's interesting work as it touches on another way of discovering relevant documents. User interaction hasn't been used enough I think, and this is a nice piece of work going in that direction. Not all documents are text only and so an algorithm based on activity rather than simply content may be very efficient.

September 02, 2008

Google Chrome

Watch out for the beta release of Google's new Chrome browser today. It'll be Windows-only for now; I'm sure there'll be all sorts of interesting things to play with.

"On the surface, we designed a browser window that is streamlined and simple. To most people, it isn't the browser that matters. It's only a tool to run the important stuff -- the pages, sites and applications that make up the web. Like the classic Google homepage, Google Chrome is clean and fast. It gets out of your way and gets you where you want to go.

Under the hood, we were able to build the foundation of a browser that runs today's complex web applications much better. By keeping each tab in an isolated "sandbox", we were able to prevent one tab from crashing another and provide improved protection from rogue sites. We improved speed and responsiveness across the board. We also built a more powerful JavaScript engine, V8, to power the next generation of web applications that aren't even possible in today's browsers."

Read here for more...

September 01, 2008

Sorting data like a human

Some interesting research has been funded at MIT by the James S. McDonnell Foundation Causal Learning Research Collaborative, the Air Force Office of Scientific Research, and the NTT Communication Sciences Laboratory. Charles Kemp is in charge and cool things are being discovered.

Computers can't pick up on information the way humans do because they have trouble finding where to begin, unless a specific pattern structure is given. Humans have pattern-matching skills which use information that the machine doesn't have. MIT have devised a way for the computer to figure out which type of organisational structure best fits the given data. The computer considers all the candidate pattern structures and weighs them up against each other, trying different structures on the data until it decides it has found the one that fits best.
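
As a very rough illustration of "try candidate structures and keep the one that fits best" (my own toy sketch, far simpler than the MIT work, which compares richer forms such as trees, rings and chains): score a few candidate structures for some one-dimensional data, here just one, two or three clusters, by how well each explains the data minus a penalty for extra parameters, and pick the winner.

import math
import numpy as np

def bic_for_k(data, k, iters=20):
    # Crude 1-D k-means fit, then a spherical-Gaussian likelihood and a
    # BIC-style complexity penalty (lower score = better structure).
    rng = np.random.default_rng(0)
    centres = rng.choice(data, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(data[:, None] - centres[None, :]), axis=1)
        centres = np.array([data[labels == j].mean() if np.any(labels == j)
                            else centres[j] for j in range(k)])
    labels = np.argmin(np.abs(data[:, None] - centres[None, :]), axis=1)
    resid = data - centres[labels]
    sigma2 = max(resid.var(), 1e-6)
    loglik = -0.5 * len(data) * (math.log(2 * math.pi * sigma2) + 1)
    n_params = 2 * k                       # roughly: a mean and a weight per cluster
    return n_params * math.log(len(data)) - 2 * loglik

data = np.concatenate([np.random.normal(0, 1, 50), np.random.normal(8, 1, 50)])
scores = {k: bic_for_k(data, k) for k in (1, 2, 3)}
print(min(scores, key=scores.get), scores)  # two clusters should come out best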

It might sound simple, because we humans do this all the time with no trouble at all, but that's the beauty of it really. This is hopefully going to shed some light on how humans think and how we recognise patterns in data, and that will in turn help us create a better machine.
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.