Science for SEO: search engine


January 14, 2009

Microsoft's Game-Powered Search Engine

Someone sent me this patent and I instantly loved it, because it describes a completely different solution to the problem of information retrieval (IR) and does so in a very entertaining way... well, obviously. The patent was filed in 2005 and published on the 13th of January 2009. The authors are all brilliant and renowned computer scientists from slightly different fields.

Anyway, it's called "Game-powered search engine".

The idea is that:

The user types in a query

The game participants receive this query

Responses are collected from the game participants - these can be anything: images, text, audio, etc.

The game rewards participants with the most suitable responses

The suitability is calculated by analyzing the degree of agreement between the responses. Agreement depends on the level of similarity.
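To make the agreement idea a bit more concrete, here is a minimal sketch of how responses could be scored against each other. It is not the method from the patent: it assumes text-only responses, and the word-overlap (Jaccard) similarity, the function names and the sample players are all my own inventions for illustration.

```python
def tokens(text):
    """Lower-cased set of words in a response."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two text responses (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_scores(responses):
    """Score each player by the mean similarity of their response to all others."""
    scores = {}
    for player, response in responses.items():
        others = [r for p, r in responses.items() if p != player]
        if not others:
            scores[player] = 0.0
        else:
            scores[player] = sum(jaccard(response, o) for o in others) / len(others)
    return scores


# Hypothetical responses to the query "where is the eiffel tower"
responses = {
    "alice": "the eiffel tower is in paris",
    "bob": "eiffel tower paris",
    "carol": "it is a kind of cheese",
}
scores = agreement_scores(responses)
winner = max(scores, key=scores.get)  # the player the game would reward
```

In this toy run the two players who roughly agree score higher than the odd one out, which is the behaviour the list above describes. A real system would need something much smarter than word overlap, especially for images and audio.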

You could think "who on earth is going to bother playing that game?". Probably the same kind of people who answer questions in forums, on Google Answers and similar places, I imagine. The advantage of the game system is that a machine checks all responses and filters them first, so you're more likely to get a correct answer.

You know, the more I think about it, the less I find it quirky and funny and the more I think it could work. It's a bit like super-users (people who are experts at using search engines) helping out less savvy users.

It would have to be really well marketed and introduced, because there have been some human-edited engines before, like ChaCha, that haven't won the majority over. It would also need to be really swanky looking with a top level of usability. Then it would need to actually give the people answering a motivation for doing so. What do you get, points? For what?

December 01, 2008

Why writing a search engine is hard

Anna Patterson, research associate with the Formal Reasoning Group at Stanford, ex-Googler and head lady at the Cuil search engine, explains why writing a search engine is hard in ACM Queue.

Some main points:

Building good search engines has never been done in a big group but in teams of 1 to 4.

You need a lot of disks. The indices are so big that you have to merge them and they will never fit on a single machine.

You need to design a ranking algorithm

CPU doesn't matter - you need as much bandwidth as you can afford

The bugs you write will slow you down more than the cheap CPUs

SCSI is faster, but IDE is bigger and cheaper

For indexing, use one big huge file to minimize disk seeks, which will slow you down no end. You cannot afford the time to seek to a file to process a Web page.

Use real distributed systems, not a Network file system (NFS)

Write a very simple crawler. "For instance, (dolist (y list of URLs) GET y) is essentially all you need." Use sort | uniq on Linux to find duplicates. This is of course a very simplistic way of handling the crawler and the duplicate issue, but it means that you can get up and running quickly. The other option is to use an open-source crawler.

One false step in the indexing and processing will take too long. To make it simple, just index on words. Indexing is a really complex area of information retrieval research.

Keep a disk-based index architecture - you're not getting lots of traffic right now

Don't use PageRank - "Use the source, Luke—the HTML source, that is."

"At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them. If it sounds easy, then you haven't written a search engine".

"The fastest thing to do at runtime is pre-rank and then sort according to the pre-rank part of your indexing structure."

Leave the little indices where they were deposited initially. This makes the whole thing faster - then gather these little lists into a big list and sort this list for relevancy. Or get all results for a particular word together in a big list beforehand.

Loads and loads of things can go wrong, and you have no room for error or you will be sunk.
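A few of the points above (index on words, pre-rank at indexing time, leave the little indices where they are and gather them at serve time) can be sketched in a handful of lines. This is only a toy under my own assumptions, not code from the article: the pre_rank score is a made-up number per page, the "little indices" are plain in-memory dictionaries rather than files on disk, and the URLs are hypothetical.

```python
import heapq
from collections import defaultdict


def dedupe_urls(urls):
    """Rough equivalent of `sort | uniq`: sorted list of distinct URLs."""
    return sorted(set(urls))


def build_little_index(pages):
    """Index on words only.

    pages maps url -> (pre_rank, text), where pre_rank is a hypothetical
    score computed at indexing time. Each word's postings list is stored
    already sorted by pre_rank, highest first.
    """
    index = defaultdict(list)
    for url, (pre_rank, text) in pages.items():
        for word in set(text.lower().split()):
            index[word].append((pre_rank, url))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def search(word, little_indices, limit=10):
    """Gather the little pre-sorted lists for one word into one ranked list."""
    lists = [index.get(word.lower(), []) for index in little_indices]
    merged = heapq.merge(*lists, reverse=True)  # inputs are sorted high to low
    return [url for _, url in merged][:limit]


# Two little indices, as if built by two separate indexing runs
index_a = build_little_index({
    "a.com": (0.9, "cheap flights to paris"),
    "b.com": (0.2, "paris hotels"),
})
index_b = build_little_index({
    "c.com": (0.5, "paris travel guide"),
})

results = search("paris", [index_a, index_b])
crawl_list = dedupe_urls(["b.com", "a.com", "b.com"])
```

Because every postings list is sorted once at indexing time, serving a single-word query is just a merge of already-sorted lists, with no relevance sort at query time. Multi-word queries and a disk-based layout are where it stops being this simple.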

For more information check out "Building Nutch: OpenSource search: A case study in writing an OpenSource search engine" (also in ACM Queue).

Have fun!

November 26, 2008

Google, my backend system

Right now, we're the web equivalent of the horse and cart. We have invented the wheel and domesticated animals, and this has revolutionised our existence, especially the way in which we do business, but...I don't see Ferraris, E-type Jags or anything like that right now in web world.

Are we heading that way? Yes, for sure.

Search is not supposed to be something independent of the rest of your web experience, or actually your digital experience. You are going to be able to access search from any device or environment without actually having to go to a search engine.

Imagine you're typing away at an article for your photography blog. The intelligent environment you're in is already aware that you are writing something for your blog, because it has seen patterns and features develop over time. You highlight something and summon Google. It does something pretty cool. It goes out with the highlighted words as a query, but already has the context of the query, because it knows all about your writing for your photography blog.

It rushes out and visits all the top results. These results are dependent not just on the keyword phrase but also on the other variables gathered from your intelligent environment. Then it pulls out all the key concepts and information and writes you a summary to answer your question. You can "repair" the results by typing something like "No, I meant..." or "Perfect! Tell me more about the canon". And off it goes again. Or "Let me see the top 5 documents", or "Show me related information",...

From a mobile device, you could summon Google during a conversation with someone, for example. Imagine you're trying to figure out where the closest restaurant is to you both. You summon Google and ask it, "Where's the best place for us to meet? She's vegetarian". You can request an answer to a question like "Did Angelina Jolie really bungee jump yesterday?" and get a response such as "Yes she did. She jumped off a bridge in New Zealand".

I look forward to summoning Google and saying "Remember when I was writing that paper for that conference? There was a quote by x about y in it, what was it?"..."That's right, who else said something about that?"...

Google becomes a backend system. Gasp! No but there is nothing more natural than for the engine to be in the background. I think that conversational systems removed from search are fun toys, but their real use is in information retrieval. Once you start getting used to having all your information at your fingertips as and when you ask, you are also going to get used to conversing with the system pretty quickly.

For this kind of thing to work you will need strong summarization systems, natural language generation and understanding, machine translation, personalization and machine learning, not to mention all the other supporting technologies without which it could never happen. Luckily these are all under development right now.

One of the most interesting questions I believe is "Does your behaviour change now that the search engine is conversational in nature?" - Does it become your friend, do you get attached to it because it shows human qualities, or do you treat it like a tool? Does the way that you search change now that you no longer actually go to a search engine web page? Are you more focused, more specific, more vague?

Hummmm....so how are businesses going to take advantage of the search market then? Clearly ads are still going to be served up, but how do you make sure the clever agent likes your content most of all?

Google discuss the more immediate future of search here.

November 04, 2008

Search engine index: a tutorial

I prepared a tutorial about the index of a search engine: what it is, how it works, why we need one, and what the issues are with it.

The search engine index


October 23, 2008

Think you're good with Google?

Are you a Google ninja? Can you master any query thrown at you? Would you like to win a cool book all about language and computers (it'll help you understand search engines)?

If so, take part in the Google Ninja Challenge:

You're given 8 query contexts or questions, and then you're asked to find out the answer using Google.

You keep a record of your queries, from the first to the last one you used to find the right information - it's not about how few queries it took you to find the information, it's simply about finding the information.

About the experiment:

This is part of an experiment not on Google but rather on users, so it's an HCI experiment. The KIA project (Knowledge Interaction Agent) is all about natural language generation and understanding. We can't do any of that if we don't know how users search for things or what language they use, for example. The first part of the experiment happened in 2006/7 and was based on an irritating chatbot system that helped us understand how accepting users were of such things. You can read my research on that here, it's a Springer paper from HCI International.

The winner of the book is chosen by a group of researchers at the university. The reason for that person winning will be revealed in the experiment analysis afterwards.

If you want to take part....

Start here by filling out the intro survey
Once you've done that - play with the 8 queries
Finally fill in the de-brief form with all your answers

Have fun people and thanks :)

October 22, 2008

U Rank

Microsoft has unleashed a personalisation-centered search engine called "U Rank".

They say that they want to use it to discover more about how people search, share and edit information, and how they organise their search results. You can move your search results around, delete stuff, make notes, make it all visible to your friends and also recommend sites to them. I like the idea of sharing my search results with people, because I often do a search for someone and then send them the best results for their information need, so this would just make it much easier. They also offer the possibility of mixing up photos or images with video footage results, and I can see that happening quite easily.

ReadWriteWeb have a good post about U Rank and note that you can't move results from the second page to the first page - I think this is a pretty big problem. They also noticed that the dragging and dropping doesn't work so well.

You have to have a LIVE account to use it.

October 13, 2008

Cognition - a short interview

I've been playing with the Cognition search engine for a while now and also sent the link on to some colleagues, including my friend Dan, who is a proper algorithm geek like I am. Dr Kathleen Dahlgren from Cognition answered some questions for us. Here they are:

- How does Cognition feel about personalised search?

Personalized search can be augmented when the search engine understands language and can automatically see relationships that are opaque to pattern-matchers. For example, if a person is interested in rhythm and blues, they are also interested in R&B, and probably blues as well. But not blues meaning a bad mood. These subtleties are all handled by Cognition.

- Are there plans for a multilingual solution?

There are plans. The semantic map is relevant in all languages; it is universal. But linguists need to tie concepts to the words of other languages.

- How are the ontologies constructed?

Originally they were constructed by hand. Currently Cognition adds digitized ontologies automatically.

- Cognition claims that no other NLP processing technology comes close in breadth and depth of understanding of English... how so?

The closest semantic map, WordNet, has 2.5 times fewer word stems and 20 times less semantic information.

- What exactly is meant by the "context" of the text they are processing?

The context is the other words in a sentence. So in “strike a match”, “strike” means “ignite” and “match” means “phosphorus-tipped stick”. But in “striking workers”, “strike” means “walkout”.

- What metrics are used to measure the quality of the engine?

We have many different metrics and regression tests. Our main method is to index identical content with another search engine, produce 50 typical queries, and test them for relevance using the two search engines. Recall is measured as relative recall, lacking a gold standard in which all documents have been inspected. In relative recall, the total of relevant search results by the two search engines is counted as full recall. In such tests, Cognition always performs with over 90% precision and recall. Google, for example, in 3 such tests had 20% precision and 20% recall.
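For anyone who wants to see what relative recall looks like in practice, here is a small sketch. It is my own illustration of the definition given in the answer above, not Cognition's test harness: the document IDs, the two engines' result lists and the relevance judgments are all made up.

```python
def precision(returned, relevant):
    """Fraction of the returned documents that were judged relevant."""
    returned = set(returned)
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)


def relative_recall(returned, pooled_relevant):
    """Fraction of the pooled relevant documents that this engine returned."""
    if not pooled_relevant:
        return 0.0
    return len(set(returned) & pooled_relevant) / len(pooled_relevant)


# Hypothetical results from two engines for one query over identical content
engine_a = ["d1", "d2", "d3", "d4"]
engine_b = ["d3", "d5", "d6", "d7"]

# Hypothetical human judgments: which returned documents are relevant
judged_relevant = {"d1", "d2", "d3", "d5"}

# No gold standard, so "full recall" is every relevant result found by either engine
pool = (set(engine_a) | set(engine_b)) & judged_relevant

precision_a = precision(engine_a, judged_relevant)
precision_b = precision(engine_b, judged_relevant)
recall_a = relative_recall(engine_a, pool)
recall_b = relative_recall(engine_b, pool)
```

The thing to keep in mind is that relevant documents neither engine found never enter the pool, so relative recall can only compare the two engines with each other. It says nothing about how much of the collection was missed by both.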

- What exactly is meant by a "phrase" in the stat database?

A phrase is a frequently-occurring set of terms that are always juxtaposed, such as The Bill of Rights, U.S. Congress, United Airlines, or Securities and Exchange Commission.

- Are there prebuilt macros for common phrases?

Yes – 200,000 of them.

It's really a very interesting system to use, and I reckon it'll improve in leaps and bounds in the future as well. We will be playing with this a great deal and I'll blog about it again, so watch this space!

October 03, 2008

Goodsearch

Goodsearch is a search engine powered by Yahoo that donates around a penny to a charity of your choice every time you do a search. The donations are raised from 50% of the advertising revenue. Apparently the Dance marathon chapter raised $900 for a children's hospital.

Goodshop was recently launched and it donates 37% of the sale of any goods purchased.

The results are decent too. Give it a go, especially if you run a charity.

October 02, 2008

Nexplore search engine

The Nexplore search engine has been released in beta. It's fast, it's busy, it's fun and it's pretty accurate. You can search the web, news, video, images, blogs and podcasts. It's clearly aimed at social networking, it's all very web 2.0.

They say:

“Our Web 2.0 application model uses innovative cloud-computing techniques to create a highly effective distributed search engine that easily scales to meet volume demands without compromising performance. We’ve combined this backend with a social overlay to fine tune and share results and a user interface built with advanced RIA technologies to create a compelling, highly productive user experience. NeXplore Search is poised for growth as users seek more effective and enjoyable ways to find the information they need.”

So they've made it fast using cloud computing and use RIAs (Rich Internet Applications) to enhance interactivity and expressiveness. It's quite different to the Google model, which is very centered on providing information and not necessarily on creating a rich networking environment like Nexplore.

I tried a couple of searches (to be fair, I'd have to do loads more to have a clear picture). I searched for 2 of my favourite things:

The Breeders:

Google -> wikipedia, myspace, a random blog, random blog, random blog

Nexplore -> random blog, myspace, wikipedia, and the same random blogs as Google

Ashtanga yoga:

Google -> wikipedia, wikipedia, yoga school, BBC, Amazon

Nexplore -> ashtanga.com (A very authoritative site), a yoga school, wikipedia, a yoga school, and ashtanga.com again.

So clearly in my searches Google has preferred wikipedia results, which is cool, because I might well want to know the definition of "Ashtanga yoga" or know who "The Breeders" are. Nexplore also provides that but starts with a random blog. Really the results are pretty much the same for some searches, just in a different order. Nexplore does better on "Ashtanga yoga" I think because it gives me a well known authoritative site rather than wikipedia first. Here the results weren't the same as in Google.

Google gives me 4 video options in the results for "The Breeders". Nexplore doesn't do that. You have to search under "video", so no universal search in the results.

I searched for myself and the results contained fewer instances of the person of the same name who works for NASA, so that's fine.

Nexplore has a super busy interface, so do not be duped by the very minimalistic landing search page. Under every result there's a social media sharing option, a massive preview window pops up when you hover over the results at any time, there's a wiki search box which is constantly on the right of the page...you see what I mean...it's very busy. You can view results in a list, which makes them a bit hard to navigate through because there are 25 results on a page. You can also just view the site previews, which looks a bit like a music library.

Nexplore does personalise results for you as you can bin them, preview them or save them. Google is looking into this too with the thumbs up or down thing. It also gives you a "popular" searches library which I like.

I'm not too sure about ad relevance, as I get one for puppies for sale for "The Breeders" search when the engine has already established that it's a rock band. For "Ashtanga yoga" 2 ads are relevant but one is for a hotel which has a yoga class. But that hotel site isn't really about ashtanga yoga. Maybe it was the best ad to display from the collection available.

So. I think Nexplore has good results, as good as Google in my short experiment, and way better than Cuil. I think it's too busy, there's too much going on, and I don't like the preview popup. The other two options, gallery view and list, just aren't clear enough in my view because they don't give enough info, which you get from the description. I like the personalisation. I like that it gave me authoritative sites at the top.

The social media buttons to share the site are on every result. This further crowds the interface. It's a great idea though, just maybe it can be done a bit differently to relieve the interface of clutter. I hate all this popping up stuff! Also, I don't like how the pictures display on the image search.

The verdict: I will use it for a while and see how it grows on me, after all every change takes a while to adjust to. I don't want to write it off so early on just because I don't like the clutter. I might really like it all once I use it often. The results are good.

I'm really not ready to dump my classic, simple, Google interface, and the results are fine for my purposes. I think that it's very hard for a new search engine to come along with new cool ideas, and this one has really tried hard. I'll definitely persevere and see how I go.

Social media in this kind of engine is all done for the sites because of the button under every result. It gives them more chance at visibility than in Google.

The CTO, Dion Hinchcliffe, runs a blog called "Musings and ruminations on building great systems". It's really wordy but quite interesting, so read it if you have the time. You can follow him on Twitter as well.

You can also read about Nexplore at "Beyond Search", where Stephen Arnold gives his take on it.

If we did all migrate to Nexplore, I think we'd still Google things in it anyway.