This really does show that results are personal to each user: we're not looking for the same things every time, and if we are, maybe not for the same reasons. This is why personalisation is a good solution, but that's a topic for another day.
Usually you can classify queries as navigational or informational. This also affects how the search engine is evaluated. Informational queries are the hardest because you're looking for a set of relevant documents, but the query usually isn't rich enough to establish exactly what is needed. Navigational queries, such as looking for the Sofitel in Bangkok, are much easier because they're more exact.
You can use human evaluators or automated methods to check how good the results are. Human evaluators are of course biased towards their own motivations, and in the past this has shown up as widely varying results. Automated testing isn't biased in that way, the machine doesn't care, but it isn't always very representative of how humans actually search. Google use human evaluators and also live traffic experiments.
Here I'll introduce a few papers you might find interesting on the subject. I've chosen a bit of a mixture but of course there are many more ways to do this.
"Search Engine Ranking Efficiency Evaluation Tool" by Alhalabi, Kubat and Tapia from the University of Miami.
They note that "precision" and "recall" don't take ranking quality into consideration. They propose using SEREET (Search Engine Ranking Efficiency Evaluation Tool).
They compare a known, correctly ordered list to a search engine's one. The method is to start at 100 points and then deduct points each time a relevant document is missing from the search engine's rankings, and also each time an irrelevant document is returned. It's basically (the number of misses / RankLength) x 100, where RankLength is the number of links in the rank list. They found it was more sensitive to change and efficient in space and time.
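To make the scoring a bit more concrete, here is a minimal Python sketch of a SEREET-style score. It is only my reading of the description above, not the authors' implementation: I'm assuming the formula gives the size of the deduction, that a "miss" is either a reference document the engine left out or a returned document that isn't in the reference list, and that the score is floored at zero. The function name sereet_style_score is made up for illustration.

```python
def sereet_style_score(reference, engine_results):
    """Score an engine's result list against a known-good reference list.

    Assumption: every miss costs 100 / RankLength points, where RankLength
    is the number of links the engine returned.
    """
    rank_length = len(engine_results)
    if rank_length == 0:
        return 0.0

    reference_set = set(reference)
    returned_set = set(engine_results)

    # Relevant documents the engine failed to return.
    missing_relevant = len(reference_set - returned_set)
    # Documents the engine returned that are not in the reference list.
    irrelevant_returned = len(returned_set - reference_set)
    misses = missing_relevant + irrelevant_returned

    # Start at 100 and deduct; floor at zero (my assumption).
    deduction = 100 * misses / rank_length
    return max(0.0, 100 - deduction)


reference = ["a", "b", "c", "d", "e"]
engine = ["a", "c", "x", "b", "y"]
print(sereet_style_score(reference, engine))  # 20.0
```

In this example two relevant documents are missing and two irrelevant ones are returned, so four misses over a rank list of five links cost 80 points. Note that the sketch only counts presence and absence, as the description does; it doesn't try to guess how the paper treats documents that are present but in the wrong position.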
"Automatic Search Engine Performance Evaluation with Click-through Data Analysis" by Liu, Fu, Zhang, Ru from Tsinghua University.
They note that human evaluation is too time consuming to be an efficient method of evaluation. Their click-through data analysis method allows them to evaluate automatically. For navigational type queries, the query topics and answers are generated by the system based on user query and click behaviour. They found that they got similar results to those of human evaluators.
"Evaluation of Web-Based Search Engines Using User-Effort Measures" - Tang and Sun from Rutgers University
They looked at "user-effort-sensitive evaluation measures", namely search length, rank correlation and first 20 full precision. They say this is better because it focuses on the quality of the ranking. They found overall that the 3 measures were consistent. "Search length" is the number of non-relevant documents the user has to sift through, "Rank correlation" compares the user's ranking to the search engine's ranking, and "First 20 Full Precision" is the ratio of relevant documents within the total set of documents returned.
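If you want to play with these three ideas yourself, here is a rough Python sketch. The function names and the details are my own assumptions rather than the paper's definitions: search_length assumes the user stops once they have found a given number of relevant documents, rank_correlation uses Spearman's rho (one common choice, the paper may use a different coefficient), and precision_at is plain precision over the first 20 results, which may differ from how the paper defines its "full" variant.

```python
def search_length(results, relevant, wanted=1):
    """Count non-relevant results seen before `wanted` relevant ones are found."""
    non_relevant_seen = 0
    found = 0
    for doc in results:
        if doc in relevant:
            found += 1
            if found == wanted:
                break
        else:
            non_relevant_seen += 1
    return non_relevant_seen


def rank_correlation(user_ranking, engine_ranking):
    """Spearman's rho between two rankings of the same items (no ties)."""
    n = len(user_ranking)
    if n < 2 or set(user_ranking) != set(engine_ranking):
        raise ValueError("rankings must contain the same items, at least two")
    engine_position = {doc: i for i, doc in enumerate(engine_ranking)}
    # Sum of squared differences between each item's two positions.
    d_squared = sum((i - engine_position[doc]) ** 2
                    for i, doc in enumerate(user_ranking))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def precision_at(results, relevant, k=20):
    """Fraction of the first k results that are relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


results = ["x", "a", "y", "b", "c"]
relevant = {"a", "b", "c"}
print(search_length(results, relevant, wanted=2))   # 2
print(rank_correlation(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
print(precision_at(results, relevant, k=20))        # 0.6
```

A lower search length and a rank correlation closer to 1 both mean less effort for the user, which is the point these measures are trying to capture.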
More reading if you fancy it:
Evaluating search engines by Croft
and there are many more...
Why should you care?
Well, obviously, if search engines are not showing the best results to the user, your very content-rich, useful and perfect website will always have difficulty ranking well. If the results were always credible and accurate, spam and rubbish sites would never outrank better ones. It's in your interest as a user, a webmaster, a site owner or an SEO to evaluate these results for yourself too. Knowing about some of the methods gives you some insight into this.