December 07, 2004

Evaluating New Web Search Tools

When I'm not teaching or writing, I spend a large chunk of time evaluating new web search tools for inclusion (or not!) in the Search Portfolio. I'm often struck by how many reviewers of web search tools seem to completely miss essential elements that affect search tool quality. Although much has been written on how to evaluate content-rich web sites, there is almost nothing about how to effectively evaluate new search tools.

I recently spoke about my methodology on web search tool evaluation at Internet Librarian 2004 in Monterey, and I'm working on an article which will be published in Spring 2005 on the same topic. I see web search tool evaluation as a multi-step process, which involves evaluating both functionality and features of the tool as well as the source of content that is delivered through the tool. Most evaluators of web search tools tend to restrict their discussion to issues of functionality, but I would argue that content informs the quality of the tool at least as much as the functional bells and whistles.

It's also critically important to compare new web search tools to others of the same type that currently exist on the free web. Even if you like the way a new search tool behaves, only by comparing it to existing "best of breed" tools of a similar type can you adequately determine whether it is worth adding to your search tool roster.

I recently set a group of my students (mainly librarians) to the task of evaluating a newly announced web search tool, DonBusca.com. This group was just over halfway through taking the online course Beyond Google: Searching Faster and Smarter on the Web offered through the partnership of Canadian library associations in cooperation with Workingfaster.com. They had already been exposed to many excellent search tools in the first half of the course.

Those who focused their evaluation on functionality and ignored content sources had more difficulty in objectively judging quality. Those who tested the functionality of DonBusca (which uses a form of clustering in search results) against other clustering meta-search tools of similar type (like Clusty.com, for example) found notable differences in the quality of the clustering in these different search tools, which helped lead them to more objective conclusions. And those who dug deeply into the sources of content were able to make the best decisions, because they were able to assess the capacity of the search tool to deliver quality regardless of the functional capacity of the tool.

What did they think of DonBusca.com? Well, they liked the Thumbshots previews that accompanied some of the links (and so did I - this is a handy preview feature for broadband users). They didn't like DonBusca's clustering capability as much as they liked Clusty's -- and they conducted side-by-side comparisons to examine the clustering results. (For example, here's a result in DonBusca of 360 degree feedback and the same search in Clusty.) One student looked carefully at the sources of content for DonBusca, going through each one (DonBusca passes queries to 7search, About.com, AOL, AskJeeves, Dmoz, Epilot, FindWhat, MSN, Netscape, Overture, Wisenut, and Yahoo). She determined that several of the partners are pure pay-for-placement, and that Netscape and AOL are basically Google searches, so they more or less cancelled each other out. Another student was concerned about the prominent placement of Wikipedia in search results, since as a school librarian she had met kids who deliberately put incorrect information into Wikipedia just to prove that they could.

Evaluating web search tools isn't easy, and this group of very capable students produced varying results. But those who attempted to dig down "under the hood" and go beyond the "hmm, this is cool" conclusion produced evaluations that proved more satisfying and ultimately more conclusive.

Posted by ritavine at 05:01 PM

November 15, 2004

Another tool to compare results from search engines

At this week's Internet Librarian conference in Monterey, Gary Price introduced me to another free web tool that provides information on the number of unique results in two different search engines. Jux2.com is similar to the Thumbshots ranking tool (mentioned in the May 21 issue of SiteLines).

Jux2.com compares up to 2 engines -- Google, Yahoo and AskJeeves are the only ones available at the present time. Unique items appear at the top of the results list; items duplicated in two or all three engines are displayed after that.

I conducted several searches in jux2.com today and found some pretty obvious errors in its reporting mechanism, which reported substantially more unique sites than actually existed. Although my searches in jux2.com reported low overlap, a manual comparison of the same searches in both Google and Yahoo revealed very significant overlaps in the first 10 results. Something isn't working quite right, so searchers should supplement jux2.com searches with another tool, like Thumbshots, or conduct independent searches in their search engines of choice and count the overlap manually.

One cautionary note: Many authors have used the results of these tools to prove how low the overlap is between search engines. That may not be as true as it looks at first glance. These comparison tools measure overlap of EXACT URLs, so if two URLs are very close but not an exact match, the tool will treat each one as unique. If we instead compared results at the level of the domain (the site a result comes from, rather than the full URL), the overlap between engines would be considerably higher.
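For readers who want to count overlap themselves, here is a rough sketch of the difference between the two measures. The result lists are made up for illustration, the function names are my own, and matching on host name is only a stand-in for a proper domain-level comparison; this is not how jux2.com or Thumbshots actually work.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def overlap(a, b, key=lambda u: u):
    """Count the items two result lists share after applying key to each URL."""
    return len({key(u) for u in a} & {key(u) for u in b})


# Made-up result lists, for illustration only.
engine_a = [
    "http://www.example.org/asthma/",
    "http://example.com/kids/asthma.html",
    "http://health.example.net/a",
]
engine_b = [
    "http://example.org/asthma/index.html",
    "http://www.example.com/kids/",
    "http://other.example.edu/x",
]

exact = overlap(engine_a, engine_b)                 # compares full URLs
by_host = overlap(engine_a, engine_b, key=host_of)  # compares sites
print(exact, by_host)
```

In this toy data no URL matches exactly, yet two of the three sites appear in both lists, which is the kind of gap that exact-URL counting hides.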

Posted by ritavine at 09:44 PM

November 11, 2004

Librarians vs. Technology

Freelance writer Sue Bowness interviewed Gary Price, Gwen Harris, and me for her article "Librarians vs. Technology" in this month's issue of Information Highways. The three of us comment on how librarians can bring added value to the work of amateur researchers who live in a plug-in-the-keyword world.

Posted by ritavine at 02:00 PM

November 03, 2004

Yahoo Beats Google at Link Checking

Note to searchers who rely on occasional link checking in search engines (where you check to see which web sites are linking to a particular URL):

Many web searchers wish that Google would list everything in its index that linked to the desired URL -- but Google requires a minimum PageRank to display the result. Hence, many pages that link to the requested URL do not display in the Google link: search.

Such is not the case with Yahoo, which appears to have no such requirement. As a result, the link:http://mywebsite.com results work much better. Another good reason to use Yahoo! for this type of search.

Posted by ritavine at 04:54 PM

October 27, 2004

Yahoo Does Boolean

I was reviewing Greg Notess's excellent article on key changes in the Yahoo! database and search syntax, published in the July/August issue of ONLINE Magazine, and was particularly struck by his observations on link checking and Boolean capabilities.

It seems that Yahoo can produce results for full Boolean nested searches, which Google can't do (at least not yet). I tried a fairly complex search on benchmarking of computer expenditures for the consumer goods industry (which a week earlier had baffled one of my smarter student groups), and Yahoo appeared to deliver results consistent with the Boolean search statement.
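For anyone unsure what "nested" Boolean logic means in practice, the sketch below evaluates a parenthesized AND/OR/NOT statement against the words on a page. The query and the two sample pages are hypothetical (this is not the search statement I actually ran), and the code says nothing about Yahoo's real syntax or internals.

```python
def matches(query, words):
    """Evaluate a nested Boolean query against a set of lowercase words."""
    if isinstance(query, str):
        return query in words
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical statement:
# (benchmarking OR benchmark) AND (computer OR technology) AND NOT software
query = ("AND",
         ("OR", "benchmarking", "benchmark"),
         ("OR", "computer", "technology"),
         ("NOT", "software"))

# Made-up page texts, reduced to sets of words.
doc1 = set("benchmarking computer expenditures in consumer goods".split())
doc2 = set("benchmark of software and computer spending".split())
print(matches(query, doc1), matches(query, doc2))
```

The first page satisfies every group in the statement; the second is excluded by the NOT clause even though it matches both OR groups.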

So for search situations where you have exhausted other approaches and need to try an engine with full Boolean capabilities, Yahoo! seems to be the answer, at least for now.

Posted by ritavine at 06:58 PM

July 30, 2004

Article on Search Engine Gigablast

There's a very interesting article on the independent search engine Gigablast by Canadian Internet consultant Gwen Harris, in this month's issue of Information Highways. Harris covers a bit of background on Gigablast's owner, Matt Wells (he's formerly of Infospace), and the business model (he wants to sell the technology rather than ads), plus an overview of one of Gigablast's most interesting features, Gigabits, which is used to show concepts related to a previously executed search.

Posted by ritavine at 03:40 PM

Search Engine Comparison/Relationship Charts

Librarian Diana Botluk has produced a Search Engine Comparison Chart in the latest issue of LLRX.com. Botluk covers AlltheWeb, AltaVista, Google, Lycos, MSN, Teoma, Wisenut, and Yahoo. Gigablast, an important independent search engine, is absent from the list.

Although Botluk's chart is similar in style to the one that Greg Notess has maintained for several years as part of Search Engine Showdown, Botluk has focused on the major functions of each search tool rather than the databases that actually feed into these search engines. Notess, unlike Botluk, includes Gigablast in his chart. He chose not to include AltaVista, Lycos or AlltheWeb as separate entities, probably because they are all now fed by Yahoo!'s Inktomi engine plus paid listings from other sources.

The databases that serve results to search engines are at least as important as the functionality and features of each engine. Readers are advised to consider findings from both Botluk's and Notess's charts, and also to keep aware of the ever-changing feeds that provide each engine's content. There's a good (and frequently updated) chart at Search Engine Watch: the latest version is dated July 23, 2004.

Posted by ritavine at 03:29 PM

July 19, 2004

More on Yahoo! and Google's Inclusion of WorldCat records

There has been more news this month of Yahoo!'s inclusion of WorldCat records (Google already has them) in its database.

This is interesting because it illustrates some real differences in current ranking and sorting between Yahoo! and Google.

As a test to see if the WorldCat records for a book would come up during an average search, I selected the book Your Guide to Passing the AMP Real Estate Exam by Joyce Bea Sterling (Real Estate Education Co., 2000) which is one of the WorldCat records captured by both Google and Yahoo. I chose the title because it was recent, because users looking to pass the exam could conceivably use Google to help them, and because the word selections for searching would be fairly obvious (amp real estate exam).

I typed the query amp real estate exam into Google (without any punctuation or double quotations). As I expected, Google's algorithmic ranking and sorting methods, which prefer popular web pages (as opposed to WorldCat's obscure and rarely-linked documents), delivered lots of links, including many to booksellers selling this and similar books. But in the first 10 pages of results, there was no link to the WorldCat record for this book.

I did the same in Yahoo - typed in the query amp real estate exam. The results were dramatically different. There, on the first page of Yahoo results, was the WorldCat record.

What does this mean? Well, it provides an illustration of how Yahoo's ranking and sorting algorithms are different from Google's. Neither better nor worse, just different. It may also mean that, at least for a while, Yahoo may have given preferential treatment to the worldcatlibraries.org domain. We don't know for sure; we just know that the domain seems to rank higher in Yahoo's results than for a similar search in Google.

Clearly Google hasn't given preference to the worldcatlibraries.org domain (not yet, anyhow), but that doesn't mean that its results aren't just as -- or more -- relevant. I've always had a problem with domain preference decisions by the search engines (who are they to judge quality anyhow?), so if anything, the moral of the story is to continue to use multiple search engines.

I'm puzzled at the very positive response by most information professionals to these announcements of database dumps into search engines. In Searcher, Barbara Quint recently quoted several ecstatic responses to these announcements from people who are usually a lot more measured in their opinions. On the other hand, Gary Price of Resourceshelf.com provided considerably more balanced views on the topic.

Although click-throughs are way up at OCLC through these search engine links to WorldCat records, users will often fail to find these records unless they know that they want them. Sure, if I had added the keyword library or worldcat to my search string in Google, I would have found the WorldCat record for the Sterling book on the first page of results. But who would ever think of doing that when they don't know exactly what is wanted?

(See my posting, Just Because It's Indexed Doesn't Mean You'll Find It, for another example, this one with PubMed records in Google.)

Posted by ritavine at 07:01 PM

July 14, 2004

Some Cautionary Notes on Vivisimo

In a recent issue of Resourceshelf.com, I spotted a link to a Pittsburgh Business Times article on Vivisimo, a popular meta-search engine, and on the profitability of the public meta-search "test bed" that demonstrates Vivisimo's clustering technology. Profitability, they say? Time to take another look at Vivisimo's public meta search engine.

At the heart of Vivisimo's popularity is its excellent clustering technology, which is also used to facilitate targeted search in many other online products. Raves about Vivisimo's clustering have brought many users to its public meta-search site.

But a closer look at the underlying databases used by Vivisimo shows it to be a substandard meta-search tool for serious searchers. Its default web search databases (MSN, Lycos, Looksmart, Wisenut, Open Directory, and Overture) are generally agreed to be less-than-stellar choices in their respective categories. Overture and Looksmart are almost exclusively pay-for-placement products. Lycos is now principally Yahoo's Inktomi database with added sponsored links; the Open Directory is generally agreed to be an occasionally useful directory, but crowded with commercial content because of its preferential treatment in Google's algorithms. Wisenut is owned by Looksmart, and according to Search Engine Showdown, it has one of the smallest databases of all the spidered search engines.

Conclusion? No wonder Vivisimo is boasting of profitability -- most of its source database partners are pay-for players. According to the article, Vivisimo earns 35% of its revenue from paid placement and advertising on its public web site.

Vivisimo's story isn't really new -- many search engines (including Google and the original Altavista) have in the past used their public web search utilities as test beds to promote their technology, only to soon discover that there was more money in search than in selling the technology outright.

I like Vivisimo's clustering technology a lot. But it's important for serious searchers to understand that even great technology will produce poor results if the underlying databases aren't good. In Vivisimo's case, paid content in, (clustered) paid content out.

Posted by ritavine at 06:14 PM

May 12, 2004

Just Because It's Indexed Doesn't Mean You'll Find It

Since sometime in 2002, Google has indexed a significant portion of the PubMed database. None of the other search engines I tested (Teoma, Yahoo, Gigablast) had any PubMed content.

Even though it's indexed in Google, PubMed's content may never be found.

Google's PageRank algorithm, which sorts search results based principally on how many pages link to the matching page, helps to ensure that PubMed database citations will remain at the bottom of search results. In other words, it doesn't matter if those PubMed citations are indexed, because they will never be found by a searcher looking for topical information using a typical keyword approach.

In this example, I searched the keywords asthma children in Google. The result is a large results list. The sites in the first pages of results aren't particularly bad: Google weights certain domains, like cdc.gov and medlineplus.gov, more heavily, and as a result the search results aren't completely overwhelmed by .com medical sites.

But where are the results from PubMed? A search of the first ten pages of the asthma children search above reveals no PubMed citations. Why? Because these individual PubMed citations are hardly ever linked by other web pages, and as a result they receive a low PageRank in Google. The net effect? The low-ranked PubMed results sink to the bottom of Google's search results list for practically any medical topic.
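To make the link-counting effect concrete, here is a small sketch of the simplified, textbook form of PageRank run on a made-up five-page web. It is not Google's production ranking, and the page names, the damping value of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank


# Made-up link graph: several pages link to the portal, none link to the citation.
links = {
    "portal": ["guide"],
    "guide": ["portal"],
    "blog": ["portal", "guide"],
    "news": ["portal"],
    "citation": ["portal"],
}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page:9s} {scores[page]:.3f}")
```

The citation page stands in for an individual PubMed record: it exists in the index, but because nothing links to it, it never rises above the baseline score.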

Beyond the negative effect of PageRank, Google's simple keyword-string-matching approach isn't nearly as sophisticated as PubMed's own search options. This isn't unusual. Many specialized, searchable databases on the web have unique search options which simply aren't available through one-box search engine interfaces.

For example, I conducted a search of the keywords asthma children in Google, limiting results to the PubMed domain ncbi.nlm.nih.gov (example). There were approximately 13500 results.

But the same keyword search in PubMed delivers over 22000 results. Why? The difference between the Google results and the PubMed results can likely be attributed to the sophisticated search methodology inherent in the PubMed search, which matches keywords against the MeSH Translation Table in order to create a more inclusive and accurate search strategy. Google's search methodology is rather more ordinary, searching the keyword input directly and matching the occurrence of the keywords in the PubMed record. In our asthma children example, Google would retrieve children but not child while PubMed's sophisticated preprogrammed interpretive search logic would retrieve both.
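The children/child difference can be sketched in a few lines. The expansion table below is a tiny, hypothetical stand-in for PubMed's translation table, the three record titles are invented, and neither function reflects how Google or PubMed is actually implemented.

```python
# Hypothetical expansion table; PubMed's real translation table is far larger.
EXPANSIONS = {
    "children": {"children", "child"},
    "asthma": {"asthma"},
}


def literal_match(keywords, text):
    """True if every keyword appears in the text exactly as typed."""
    words = set(text.lower().split())
    return all(k in words for k in keywords)


def expanded_match(keywords, text):
    """True if, for every keyword, the text contains it or one of its expansions."""
    words = set(text.lower().split())
    return all(words & EXPANSIONS.get(k, {k}) for k in keywords)


# Invented record titles, for illustration only.
records = [
    "Asthma management in school-age children",
    "Inhaled steroids for the child with asthma",
    "Adult asthma and exercise",
]
keywords = ["asthma", "children"]
literal = [r for r in records if literal_match(keywords, r)]
expanded = [r for r in records if expanded_match(keywords, r)]
print(len(literal), len(expanded))
```

Literal matching finds only the first record, while the expanded search also picks up the record that says child, which is the same pattern as the gap between the two result counts above.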

Sure, it's theoretically nice that search engines are indexing formerly "invisible" web content, but without corresponding tweaking of the PageRank algorithm, that content will never be found.

Posted by ritavine at 08:00 PM

March 18, 2004

Is it time to detach from our reliance on search engines?

Consider the reality of relying on your favorite search engine. You're applying a pretty dumb technology (search algorithms) against a huge, undifferentiated pile of randomly selected, unorganized content; then adding billions of dollars of keyword-matched ads to the sorted output. Moreover, the effect over time of persistent ad placement in search results is to push those web resources whose owners lack the capacity or interest to place ads further down the search results list and out of sight of most searchers.

This is no recipe for search success. And it's folly to assume that reliance on ad-supported search engines is just fine as long as searchers understand how results are derived and ranked. From long experience training business and information professionals to search the web more effectively, I can tell you that without additional information and training, almost no one fully understands just how pervasively ads influence search results.

Given the facts, I'm puzzled by the extent to which information professionals continue to believe that commercial ad-supported search tools will deliver relevant, high quality information results. And I'm troubled by how much buzz the new crop of new-to-the-marketplace ad-supported search tools is receiving among the library community, as if some "new Google" will magically produce better results applying the same ad-driven business model to giant indexes. It's time for the library community to fully understand how unsupportable the notion of finding persistently relevant information in free, ad-supported search tools really is.

The answer is better search tools which, though they may be supported by ads around the periphery of the page, have no ads placed inside of search results. It is not going to make anyone rich, but it is a model that has been used to a limited extent by individual libraries and consortia worldwide, in an effort to develop organized, selective directories of best-of-breed resources in a discipline or of a materials type.

We have a few fine models already, like Librarians' Index to the Internet (http://www.lii.org) and the Resource Discovery Network (http://rdn.ac.uk). Both receive funding from libraries or consortia and have no ads inside of search results. Using a slightly different model, Genie Tyburski's excellent business and legal search starter site, the Virtual Chase (http://www.virtualchase.com), is supported by a law firm, but has no secondary delivery of ads either inside or around the periphery of the site.

There should be hundreds of these types of high quality, carefully maintained starter sites on the web, but there aren't. There used to be more, but I suspect that many were abandoned because their creators either ran out of steam, or time, or money, or felt that search engines were doing the job of delivering relevant results so well that their personal efforts were considered unnecessary. Perhaps it is time to resuscitate these efforts; for libraries to join together to form cooperative ventures, to develop and deploy high quality, discipline- or format-specific collections of free web resources that continue to exist on the web, but that can no longer be found with commercial search engines.

Posted by ritavine at 01:15 PM

October 22, 2003

Googlization of Library Searches

In "Trumping Google? Metasearching's Promise," in the October issue of Library Journal, marketing consultant Judy Luther argues persuasively in favor of meta-searching across multiple databases of full text journals and bibliographic citation indexes. Luther suggests that library users want simple, novice approaches to information retrieval that resemble Google-style one-box searches.

Many libraries are already implementing the one-box approach to cross-database searching (also called federated searching or meta-searching), often against the objections of librarians who fear the demise of quality search strategies in favor of a dumbed-down approach.

While I have never been a fan of cross-database searching for information professionals or other serious searchers, this article compellingly argues for a simplified approach that can reach the vast majority of beginner searchers, with appropriate advanced options for deeper resource discovery.

Posted by ritavine at 05:58 PM

April 24, 2003

Phrase Your Question as the Answer

In an interview with Greg Kline of the Champaign News-Gazette, Craig Silverstein, technology director of Google, suggests that web searchers looking for answers “always phrase [the] query in the form of an answer.” So that means if you're looking for the capital of Iowa, you might search using the phrase "the capital of iowa is" and expect to retrieve pages that have that phrase -- followed by the answer.
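Here is a rough sketch of why the trick works: an exact-phrase search only matches pages that already contain the sentence you typed, so the answer tends to sit right after the match. The two page texts are invented and the function name is my own; real search engines do far more than this.

```python
import re


def answers_after(phrase, pages):
    """Return the text that follows an exact phrase, up to the next punctuation mark."""
    pattern = re.compile(re.escape(phrase) + r"\s+([^.,;!?]+)", re.IGNORECASE)
    found = []
    for text in pages:
        m = pattern.search(text)
        if m:
            found.append(m.group(1).strip())
    return found


# Invented page texts, for illustration only.
pages = [
    "Visitors learn that the capital of Iowa is Des Moines, on the Des Moines River.",
    "Iowa travel guide: capital, cities, and parks.",
]
print(answers_after("the capital of iowa is", pages))
```

The second page is about the right subject but never states the answer as a sentence, so the phrase search skips it.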

Most of us search uncritically in search engines using keywords that match the subject of our search. But search engines don't search for subjects, they just search for patterns of words on pages.

Read the full article

Posted by ritavine at 12:03 AM

March 05, 2003

Search Engines ≠ Search

In the January 2003 issue of EContent, David M. Scott's article "I Don't Google Madonna" keenly illustrates the problem with using search engines as one-stop web search tools for all types of questions.

Scott states that sites like Google exist only to answer questions, and users must already know what they want before proceeding. But people also need services that tell them something that they don't already know, or things that they did not think to ask.

Posted by ritavine at 10:03 AM
Description
SiteLines is written by Rita Vine, a professional librarian, web search trainer, and lead site evaluator of the Search Portfolio web search product.

Together with other members of the Search Portfolio selection team, Rita monitors over 50 key alerting services related to web search tools, site announcements, and the business of web search. SiteLines is intended to present a distillation of the most important trends, news, and new web search tools and directories.

SiteLines is sponsored by the Search Portfolio, a licensed web desktop of the 100 top peer-reviewed web sites for searching.
