Some main points:
- Good search engines have never been built by big groups; they have been built by teams of one to four people.
- You need a lot of disks. The indices are so big that you have to merge partial indices, and they will never fit on a single machine.
- You need to design a ranking algorithm
- CPU speed doesn't matter - you need as much bandwidth as you can afford
- The bugs you write will slow you down far more than cheap CPUs will
- SCSI is faster, but IDE is bigger and cheaper
- For indexing, use one big huge file to minimize disk seeks, which will slow you down no end. You cannot afford the time to seek to a separate file for every Web page you process
- Use real distributed systems, not a Network file system (NFS)
- Write a very simple crawler. "For instance, (dolist (y list of URLs) GET y) is essentially all you need." Use sort | uniq on Linux to find duplicates. This is of course a very simplistic way of handling crawling and duplicates, but it means you can get up and running quickly. The other option is to use an open source crawler.
- One false step in indexing and the processing will take too long. To keep it simple, just index on words. Indexing is a genuinely complex area of information retrieval research.
- Keep a disk-based index architecture - you're not getting lots of traffic right now
- Don't use PageRank - "Use the source, Luke—the HTML source, that is."
- "At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them. If it sounds easy, then you haven't written a search engine."
- "The fastest thing to do at runtime is pre-rank and then sort according to the pre-rank part of your indexing structure."
- Leave the little indices where they were deposited initially - this makes the whole thing faster. Then gather these little lists into one big list and sort that list for relevancy. Or gather all results for a particular word into one big list beforehand.
- Loads and loads of things can go wrong, and you have no room for error or you will be sunk.
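The "one big huge file" advice above can be sketched as an append-only page repository: every fetched page goes into a single file, with an in-memory catalog of (offset, length) per URL, so writes are sequential and you never open a separate file per page. The names here (`Repository`, the file path) are illustrative, not from the article:

```python
import os

class Repository:
    """Append-only store: all pages in one big file, plus a catalog
    of (offset, length) per URL. Sequential appends and offset-based
    reads avoid a per-document file open and seek."""

    def __init__(self, path):
        self.path = path
        self.catalog = {}            # url -> (offset, length)
        self.f = open(path, "ab+")   # append for writes, seekable for reads

    def add(self, url, body: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(body)
        self.catalog[url] = (offset, len(body))

    def get(self, url) -> bytes:
        offset, length = self.catalog[url]
        self.f.seek(offset)
        return self.f.read(length)
```

A real repository would also persist the catalog to disk and compress page bodies, but the shape of the idea is the same.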
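The dolist-style crawler quoted above really is just "loop over URLs and GET each one". A minimal Python sketch, assuming a made-up `seed_urls` list and output file name:

```python
import urllib.request

# Hypothetical seed list; a real crawler would read URLs from a frontier file.
seed_urls = [
    "http://example.com/",
    "http://example.org/",
]

def fetch(url, timeout=10):
    """GET one URL and return its body, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None  # a real crawler would log and retry

def crawl(urls, out_path="pages.txt"):
    """The whole crawl loop: (dolist (y urls) GET y)."""
    fetched = 0
    with open(out_path, "wb") as out:
        for url in urls:
            body = fetch(url)
            if body is not None:
                out.write(body)
                fetched += 1
    return fetched
```

Deduplicating the URL list is exactly the one-liner the notes suggest: `sort urls.txt | uniq > unique_urls.txt`.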
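"Just index on words" can be illustrated as a minimal in-memory inverted index. This is a sketch only - a real index is disk-based and far more careful about tokenization:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Crude word splitter: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: {doc_id: text}. Returns {word: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def lookup(index, word):
    """Posting list for one word, or [] if unseen."""
    return index.get(word.lower(), [])
```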
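"Use the source, Luke" - ranking signals can come straight from the HTML itself rather than from link analysis. One hedged sketch using the standard-library parser, with invented tag weights: words inside title, h1, and bold tags count for more than body text.

```python
from html.parser import HTMLParser

class SignalExtractor(HTMLParser):
    """Accumulate per-word scores from the HTML source: words in
    <title>, <h1>, and <b>/<strong> get extra weight (weights are
    made up for illustration)."""

    WEIGHTS = {"title": 5, "h1": 3, "b": 2, "strong": 2}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.scores = {}   # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = 1
        for tag in self.stack:
            weight = max(weight, self.WEIGHTS.get(tag, 1))
        for word in data.lower().split():
            self.scores[word] = self.scores.get(word, 0) + weight
```

These scores are exactly the kind of thing you would bake into the pre-rank described in the serve-time bullets.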
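The serve-time bullets - pre-rank at index time, then just gather the little posting lists and sort - might look like this sketch. The shard layout and `prerank` scores are invented for illustration; the point is that no per-document scoring happens at query time:

```python
import heapq

def serve(query_word, little_indices, top_k=10):
    """little_indices: a list of shards, each {word: [(doc_id, prerank), ...]}
    with every posting list already sorted by prerank, descending.
    Gather the shard lists for the word, merge them, and return the
    global top-k doc_ids by their precomputed rank."""
    lists = [shard.get(query_word, []) for shard in little_indices]
    # merge the descending-sorted shard lists without re-sorting everything
    merged = heapq.merge(*lists, key=lambda posting: -posting[1])
    return [doc_id for doc_id, _ in list(merged)[:top_k]]
```

Because each little list is already sorted by pre-rank, the merge touches only what it needs - which is the whole argument for leaving the little indices where they were deposited.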
For more information, check out "Building Nutch: Open Source Search - a case study in writing an open source search engine" (also in ACM Queue).