June 13, 2005
Search engines are "surprisingly ineffective" for many queries: New York Times
Eureka! Journalist James Fallows, in today's New York Times, finally gets it: search engines are great at answering easy, quick-fact questions, but pretty terrible at in-depth, critical ones. In his article "Enough Keyword Searches. Just Answer My Question," (free registration required to view), he reminds readers that for complex, in-depth questions, searchers try in vain to outguess the engines. Fallows describes his frustration trying to use keyword searches to find consistent state-by-state data covering the last 40 years -- and coming up completely empty after fruitless hours of searching. "We live with these imperfections by trying to outguess the engines - what if I put 'per capita spending by states' in quotation marks? - and by realizing that they're right for some jobs and wrong for others."
May 28, 2004
Coming Soon - the Death of Search Engines?
Is search weariness finally settling in? Are mass market consumers ready to look beyond search engines to other ways of web searching? In "Coming Soon: The Death of Search Engines", I ponder the issues and look for some solutions.
December 02, 2003
Link Competition on the Web
Although it is now almost 18 months old, "Winners don't take all: Characterizing the competition for links on the web" by David Pennock, Gary Flake, Steve Lawrence, Eric Glover, and C. Lee Giles remains an excellent study of how the distribution of links to web sites approximates a "power law," where a small number of sites receive the majority of links and always rise to the top of search engine results for a given keyword combination. The study, which was published in the Proceedings of the National Academy of Sciences 99(8): 5207-5211, is also available in synopsis form.
The study notes that the competition for web links is particularly fierce in publications, entertainment, and consumer electronics topics. Although the paper doesn't directly mention Google or its PageRank methodology, which ranks partially by link frequency, one can easily make the connection and conclude that link competition will continue to degrade the usefulness of PageRank, making Google less and less suitable for serious information searches in popular topics.
Could Microsoft search your computer's files?
In Microsoft Aims for Search On Its Own Terms, Michael Kanellos describes Microsoft's experiment with "different search technologies that will, among other tasks, conduct Google-like searches on an individual's hard drive or categorize query results in different ways intended to make the data easier to digest."
Using this technology, the system "retrieves links, music files, e-mails and other materials that relate to applications running in the foreground." A Microsoft spokesperson describes the technology as "being able to retrieve a bunch of things without you explicitly asking for them."
If the technology could retrieve files based on the context of what you are working on now, it isn't a big stretch to think that the same technology might also conduct a web search and deliver web links based on the same contextual considerations.
Besides enabling Microsoft to fully undermine the utility of stand-alone search engines like Google by making its own software so easy to use, the prospect of such an invasive tool being built into an operating system has the sort of big-brother overtones that will likely raise privacy concerns among those who still care about such things.
If that's the idea (and Microsoft has persistently indicated that it wants to integrate web search into its next operating system), the idea is brilliant: Microsoft stands to enrich itself tremendously by persistently delivering external contextual content through a variety of revenue-producing streams. Harried computer users should find the convenience of integrated search irresistible, so this appears to be a strategy that can't miss.
November 12, 2003
Rich-Get-Richer with Link Analysis
Google's PageRank, known generically as link analysis, has become the subject of some interesting research which leads many search professionals to conclude that search engines which rely on link analysis will favor the most popular, well-established and best-known web sites in their results.
The rich-get-richer concept of web linking -- whereby a large percentage of web links point to a relatively small number of web pages -- is described in reasonably plain language in Merrick E. Lozano's article "Rich Get Richer - Why Yahoo, DMOZ, Google and PageRank are Important." Lozano also touches on ideas like power laws and preferential attachment as they apply to web linking. A good introduction to a complex topic.
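For readers who want to see the rich-get-richer effect rather than take it on faith, here is a minimal Python sketch of preferential attachment. It is a toy model, not the method used in any of the articles mentioned here: the function name, the page count, and the "inbound links + 1" weighting are all assumptions made for illustration. Each new page links to one existing page, and already-popular pages are more likely to be chosen.

```python
import random


def simulate_preferential_attachment(num_pages, seed=0):
    """Grow a toy web one page at a time (hypothetical model).

    Each new page adds one link to an existing page, chosen with
    probability proportional to (inbound links + 1).
    Returns a list of inbound-link counts, indexed by page.
    """
    rng = random.Random(seed)
    inbound = [0]   # page 0 starts with no inbound links
    # 'tickets' holds one entry per page plus one per link received,
    # so a uniform draw from it is a popularity-weighted draw.
    tickets = [0]
    for new_page in range(1, num_pages):
        chosen = rng.choice(tickets)
        inbound[chosen] += 1
        tickets.append(chosen)      # the chosen page becomes more likely
        inbound.append(0)
        tickets.append(new_page)    # the new page gets its baseline ticket
    return inbound


inbound = simulate_preferential_attachment(10_000)
ranked = sorted(inbound, reverse=True)
top_one_percent = ranked[: len(ranked) // 100]
top_share = sum(top_one_percent) / sum(ranked)
print(f"Top 1% of pages hold {top_share:.0%} of all inbound links")
```

If links were handed out evenly, the top 1% of pages would hold about 1% of the links. Under this toy rule the share is far larger, which is the skew the power-law discussion is describing.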
April 18, 2003
Search Engine Robot Simulator
The Sim Spider Search Engine Robot Simulator is a tool that simulates what search engine robots read from your website. Readers can input a web page URL and visualize the links that will be spidered, the "word dump" that will go into the database, and a keyword density analysis for each page. This is a highly illustrative example of the difference between the page you see on the screen and the content that actually ends up in the search engine's database.
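The general idea can be sketched in a few lines of Python using only the standard library. This is not Sim Spider's actual code, which I haven't seen; the class name SpiderView, the sample page, and the simple word-splitting rule are assumptions for illustration. The sketch collects the links a robot could follow, builds a word dump from the visible text (skipping script and style content), and reports keyword density.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Collects the links and visible words of one HTML page (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(re.findall(r"[a-z0-9']+", data.lower()))


# Hypothetical sample page.
PAGE = """<html><head><title>Cheap Widgets</title>
<style>body { color: red; }</style></head>
<body><h1>Widgets</h1>
<p>Our widgets are the best widgets. <a href="/order.html">Order now</a></p>
<script>var tracking = 1;</script>
</body></html>"""

view = SpiderView()
view.feed(PAGE)
view.close()

density = Counter(view.words)
total = len(view.words)
print("links:", view.links)
for word, count in density.most_common(3):
    print(f"{word}: {count}/{total} ({count / total:.0%})")
```

Real robots differ in what they keep (stop words, title weighting, and so on), so treat the numbers as a rough picture rather than a prediction of any particular engine's index.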
April 17, 2003
All About Search Indexing Robots and Spiders
Good searchers seek to understand the nature and content of the database that they are searching. Understanding how content "happens" in databases can enable advanced searchers to tailor their searches to the content, and to know why some searches won't work well.
This principle also applies to search engines, but few of us really know how search engine database content "happens" and how search engines gather their web pages. All About Search Indexing Robots and Spiders by Avi Rappoport of SearchTools.com provides an excellent summary, with additional links, of how spiders actually find and download pages into their mega-databases. Of particular interest are the links explaining how robots.txt files work and the Robots Exclusion Protocol, which enables webmasters to steer web spiders away from selected directories or pages.
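To make the robots.txt idea concrete, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the "ExampleBot" user-agent name are all made up for illustration. It shows how a well-behaved spider would check each URL against the rules before downloading it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all robots out of one directory and one page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/index.html", "/private/notes.html", "/drafts/report.html"):
    url = "http://www.example.com" + path
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "disallowed"
    print(path, "->", verdict)
```

Keep in mind that the protocol is a request, not a lock: it only works for robots that choose to honor it.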
April 04, 2003
Google's PageRank Explained
Although it goes into more detail than most searchers will ever want, Phil Craven's excellent article is required reading for anyone interested in just how Google rank-orders its search results.
Although the algorithms of PageRank are complex, the results produced by PageRank are pretty easy to predict. Searchers should keep in mind that the PageRank algorithm is a popularity ranking tool, not a relevancy ranking tool. So if you think that Google brings the most relevant results to the top of the hit list, you're wrong: it brings the best known, most established results to the top of the hit list. Relevancy in any substantive sense would require human assessment and intervention, which doesn't happen in search engines.
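The popularity point is easy to see in a toy version of the calculation. The Python sketch below follows the simplified, normalized form of the published PageRank formula with the commonly cited damping factor of 0.85; it is not Google's production system, and the five-page "web" and its page names are invented for illustration. Notice that the function never looks at page content or at a query: the score comes from link structure alone.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by repeated averaging (illustrative sketch).

    links maps each page to the list of pages it links to.
    Every link target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank, split evenly, to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical five-page web.
toy_web = {
    "famous_portal": ["niche_expert"],
    "niche_expert": ["famous_portal"],
    "blog_a": ["famous_portal"],
    "blog_b": ["famous_portal"],
    "blog_c": ["famous_portal", "niche_expert"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web the portal comes out on top simply because more pages point to it, whatever the niche expert's page may actually say about the searcher's question.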