Science for SEO: Why writing a search engine is hard


December 01, 2008

Why writing a search engine is hard

Anna Patterson, a research associate with the Formal Reasoning Group at Stanford, ex-Googler, and head lady at the Cuil search engine, explains in ACM Queue why writing a search engine is hard.

Some main points:

Good search engines have never been built by big groups, only by teams of 1 to 4.

You need a lot of disks. The indices are so big that you have to merge them and they will never fit on a single machine.

You need to design a ranking algorithm

CPU doesn't matter - you need as much bandwidth as you can afford

The bugs you write will slow you down more than the cheap CPUs

SCSI is faster, but IDE is bigger and cheaper

For indexing, use one big file to minimize disk seeks, which will slow you down no end. You cannot afford the time to seek to a separate file for every Web page you process.
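As a rough illustration of the "one big file" idea, here is a minimal Python sketch. It is not from Anna's article: the function names, the offset table and the example URLs are all made up for illustration, and an in-memory buffer stands in for the real file. Pages are appended one after another and a small table remembers where each one starts, so processing them later is one sequential pass rather than one seek per page.

```python
import io

def append_pages(store, pages):
    """Append each page body to one sequential file.

    Returns a hypothetical offset table: {url: (offset, length)}.
    """
    offsets = {}
    store.seek(0, io.SEEK_END)  # always write at the end of the big file
    for url, body in pages:
        data = body.encode("utf-8")
        offsets[url] = (store.tell(), len(data))
        store.write(data)
    return offsets

def read_page(store, offsets, url):
    """Fetch one page back with a single seek and a single read."""
    offset, length = offsets[url]
    store.seek(offset)
    return store.read(length).decode("utf-8")

# io.BytesIO stands in for one large file opened in binary read/write mode.
store = io.BytesIO()
offsets = append_pages(store, [
    ("http://a.example/", "<html>alpha</html>"),
    ("http://b.example/", "<html>beta</html>"),
])
print(read_page(store, offsets, "http://b.example/"))
```

On a real disk the buffer would be a file object, and the offset table would itself need to be stored somewhere.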

Use real distributed systems, not a Network file system (NFS)

Write a very simple crawler. "For instance, (dolist (y list of URLs) GET y) is essentially all you need." Use sort | uniq on Linux to find duplicates. This is of course a very simplistic way of handling crawling and duplicates, but it means you can get up and running quickly. The other option is to use an open-source crawler.
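To show how little code that is, here is a minimal Python sketch of the same loop. It is my own illustration, not code from the article: `fetch` and `crawl` are hypothetical names, and sorting a set of URL strings plays the role of sort | uniq, so it only catches exact duplicate URLs, not duplicate pages.

```python
import urllib.request

def fetch(url, timeout=10):
    """GET one URL and return the raw bytes of the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def crawl(urls, get=fetch):
    """Visit each distinct URL once; map url -> body (None on failure)."""
    pages = {}
    for url in sorted(set(urls)):  # same effect as `sort | uniq` on the list
        try:
            pages[url] = get(url)
        except (OSError, ValueError):  # network errors or a malformed URL
            pages[url] = None
    return pages
```

The `get` parameter is only there so the loop can be tried with a stand-in function instead of real network requests, for example `crawl(["b", "a", "b"], get=lambda u: u.encode())`.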

One false step in indexing and processing will cost you too much time. To keep it simple, just index on words. Indexing is a really complex area of information retrieval research.
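"Just index on words" can be sketched as a plain inverted index: a map from each word to the list of documents containing it. The Python below is a minimal sketch under my own assumptions (lowercased alphanumeric tokens, integer document ids, AND semantics for multi-word queries); the article does not prescribe any of these details.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")  # assumed tokenizer: runs of letters and digits

def build_index(docs):
    """Build {word: sorted list of doc ids} from {doc_id: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so each document appears at most once per word
        for word in set(WORD.findall(docs[doc_id].lower())):
            index[word].append(doc_id)
    return dict(index)

def lookup(index, query):
    """Return ids of documents containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

docs = {
    1: "Search engines are hard",
    2: "Writing a search engine",
    3: "Cheap disks",
}
index = build_index(docs)
print(lookup(index, "search engine"))
```

There is no stemming here, so "engines" and "engine" count as different words.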

Keep a disk-based index architecture - you're not getting lots of traffic right now

Don't use PageRank - "Use the source, Luke—the HTML source, that is."

"At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them. If it sounds easy, then you haven't written a search engine".

"The fastest thing to do at runtime is pre-rank and then sort according to the pre-rank part of your indexing structure."

Leave the little indices where they were deposited initially, which makes the whole thing faster. Then gather these little lists into a big list and sort this list for relevancy. Or get all results for a particular word together in a big list beforehand.
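Here is a minimal Python sketch of gathering the little lists at serve time. It assumes, purely for illustration, that each little index returns (pre_rank, doc_id) pairs already sorted from highest to lowest pre-rank; the function name and the scores are made up.

```python
import heapq
from itertools import islice

def top_results(shard_lists, k):
    """Merge pre-ranked result lists and return the k best doc ids.

    Each list must already be sorted by pre_rank, highest first.
    """
    # heapq.merge expects ascending input, so merge on the negated score.
    merged = heapq.merge(*shard_lists, key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in islice(merged, k)]

shards = [
    [(0.9, "a"), (0.4, "d")],
    [(0.8, "b"), (0.1, "e")],
    [(0.7, "c")],
]
print(top_results(shards, 3))
```

Because each list is already in pre-rank order, the merge is lazy and only reads as far into each list as it needs for the top k.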

Loads and loads of things can go wrong, and you have no room for error or you will be sunk.

For more information, check out "Building Nutch: OpenSource search: A case study in writing an OpenSource search engine" (also in ACM Queue).

Have fun!

2 comments:

Anonymous said...: Been a fan of Anna's since her work on phrase based IR stuff pre-Google... Too bad about Cuil (so far) though...

Tnx for the summary... always like keeping up with the 'Cuil kids' ;0); 1 December 2008 at 16:09

CJ said...: Thanks Dave,

I'm a big fan too, I wish her book was a bit cheaper though - £135.37!

Ouch :(; 1 December 2008 at 21:45


Creative Commons License

Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.