A History of Search Engines
The grandfather of all search engines was
Archie, created in 1990 by Alan Emtage, a student at McGill
University in Montreal. The author originally wanted to call the
program "archives," but had to shorten it to comply with
the Unix world standard of assigning programs and files short,
cryptic names such as grep, cat, troff, sed, awk, perl, and so on.
For more information on where Archie is today, see:
At the early date of 1990, there was no World Wide Web. Around
this time, Tim Berners-Lee probably had a bad dream in which a
scary monster with "HTTP" etched into its hide slowly
ate up all of the Earth's resources. Nonetheless, there was still
an Internet, and many files were scattered all over the vast
network.
The primary method of storing and retrieving files was via the
File Transfer Protocol (FTP). This was (and still is) a system
that specified a common way for computers to exchange files over
the Internet. It works like this: Some administrator decides that
he wants to make files available from his computer. He sets up a
program on his computer, called an FTP server. When someone on the
Internet wants to retrieve a file from this computer, he or she
connects to it via another program called an FTP client. Any FTP
client program can connect with any FTP server program as long as
the client and server programs both fully follow the
specifications set forth in the FTP protocol.
Initially, anyone who wanted to share a file had to set up an
FTP server in order to make the file available to others. Later,
"anonymous" FTP sites became repositories for files,
allowing all users to post and retrieve them.
Even with archive sites, many important files were still
scattered on small FTP servers. Unfortunately, these files could
be located only by the Internet equivalent of word of mouth:
Somebody would post an e-mail to a message list or a discussion
forum announcing the availability of a file.
Archie changed all that. It combined a script-based data
gatherer, which fetched site listings of anonymous FTP files, with
a regular expression matcher for retrieving file names matching a
user query. (4) In other words, Archie's
gatherer scoured FTP sites across the Internet and indexed all of
the files it found. Its regular expression matcher provided users
with access to its database.
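The two halves described above, a gatherer that collects file listings and a regular expression matcher that searches them, can be sketched roughly as follows. This is only an illustration and not Archie's actual code: the site names, file paths, and function names are all invented, and the listings are hard-coded rather than fetched over FTP.

```python
import re

# Hypothetical listings, as a gatherer might have collected them from
# anonymous FTP sites. Site names and paths are invented for illustration.
LISTINGS = {
    "ftp.example.edu": ["/pub/gnu/grep-1.5.tar.Z", "/pub/docs/README"],
    "ftp.example.org": ["/mirrors/tex/troff2latex.shar", "/pub/gnu/sed-1.08.tar.Z"],
}

def build_index(listings):
    """Flatten per-site listings into (file name, site, full path) rows."""
    index = []
    for site, paths in listings.items():
        for path in paths:
            name = path.rsplit("/", 1)[-1]  # keep only the file name
            index.append((name, site, path))
    return index

def search(index, pattern):
    """Return (site, path) pairs whose file name matches the regular expression."""
    regex = re.compile(pattern)
    return [(site, path) for name, site, path in index if regex.search(name)]

index = build_index(LISTINGS)
matches = search(index, r"^(grep|sed)-.*\.tar\.Z$")
```

Here `matches` holds the grep and sed archives along with the site each one lives on, which is the answer a user of such a service was after: not the file itself, but where to go and fetch it.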
Veronica and Jughead - but where is Betty?
Gopher is like FTP, but for documents instead of files. Gopher
servers contain plain-text documents (no images, no hypertext)
that can be retrieved. Archie's popularity had grown such that in
1993, the University of Nevada System Computing Services group
developed Veronica(5) (the grandmother of
search engines). It was created as a type of searching device
similar to Archie but for Gopher files. Another Gopher search
service, called Jughead, appeared a little later, probably for the
sole purpose of rounding out the comic-strip triumvirate. Jughead
is an acronym for Jonzy's Universal Gopher Hierarchy Excavation
and Display, although, like Veronica, it is probably safe to
assume that the creator backed into the acronym. Jughead's
functionality was pretty much identical to Veronica's, although it
appears to be a little rougher around the edges.
The lone Wanderer
If Archie was the grandfather of search tools and Veronica the
grandmother, their child, and thus the mother of all search
engines, was Matthew Gray's World Wide Web Wanderer. The Wanderer
was the first robot on the web and was designed to track the web's
growth. Initially, the Wanderer counted only web servers, but
shortly after its introduction, it started to capture URLs as it
went along. The database of captured URLs became the Wandex, the
first web database.
Matthew Gray's Wanderer created quite a controversy at the
time, partially because early versions of the software ran rampant
through the Net and caused a noticeable netwide performance
degradation. This degradation occurred because the Wanderer would
access the same page hundreds of times a day. The Wanderer soon
amended its ways, but the controversy over whether robots were
good or bad for the Internet remained.
What's a Robot got to do with the Internet?
The term robot
has special significance to programmers. Their version of
the term is mostly unrelated to the metallic lumbering
creatures of Asimov lore. A synonym for robot,
"automaton," is actually more enlightening.
Computer robots are programs that automatically perform a
repetitive task at speeds that would be impossible for
humans to match, just like the tasks today's robots
perform in factories.
On the Internet, the term robot or bot
has become a bit broader. For the most part, it refers to
programs that explore the Internet for some sort of
information. Web robots search the Internet for web pages,
usually for the purpose of compiling a large, searchable
database. This category of robot is often called a spider.
The spider robot falls right into the standard definition
of performing a repetitive task.
Other types of robots on the Internet push the
interpretation of the automated task definition. The chatterbot
variety is a perfect example. These robots are designed to
communicate with humans about some topic in a human-like
manner. Some of them are fairly convincing; others are
obviously quickly written computer programs. Chatterbots
are sometimes used as an intuitive way to communicate
certain basic information to users. An example is the milk
robot, which can answer lots of questions about milk.
One could force this type of program into the definition
above by saying that it performs the repetitive task of
communicating with clueless people.
The ALIWEB Strikes Back!
In response to the Wanderer, Martijn Koster created Archie-Like
Indexing of the Web, or ALIWEB, in October 1993. As the name
implies, ALIWEB was the HTTP equivalent of Archie, and because of
this, it is still unique in many ways. ALIWEB does not have a
web-searching robot. Instead, webmasters of participating sites
post their own index information for each page they want listed.
The advantage to this method is that users get to describe their
own site, and a robot doesn't run about eating up Net bandwidth.
Unfortunately, the disadvantages of ALIWEB are more of a
problem today. The primary disadvantage is that a special indexing
file must be submitted. Most users do not understand how to create
such a file, and therefore they don't submit their pages. This
leads to a relatively small database, which means that users are
less likely to search ALIWEB than one of the large bot-based
sites. This Catch-22 has been somewhat offset by incorporating
other databases into the ALIWEB search, but it still does not have
the mass appeal of search engines such as Yahoo! or Lycos.
Invasion of the Spiders!
As the web grew, it became more and more difficult to sort through
all of the new web pages added each day. Matthew Gray’s Wanderer
inspired a number of programmers to follow up on the idea of web
robots, or spiders, as they are now called. These programs
systematically scour the web for pages by exploring all of the
links on a starter site, which is a page that contains many links
to other pages. The concept was that by definition, every page on
the web must be linked to another page. By searching through a
large number of pages and following all of the links, a spider will
discover new pages that have their own collection of links. The
hope is that most of the web can be explored through the
continuous repetition of this process.
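The link-following process can be sketched as a breadth-first traversal. This is a minimal illustration, not the code of any particular spider: the "web" here is a made-up dictionary of page names, and the function name is an assumption.

```python
from collections import deque

# A hypothetical, in-memory stand-in for the web: page -> pages it links to.
LINKS = {
    "starter": ["a", "b"],
    "a": ["b", "c"],
    "b": ["starter"],
    "c": [],
    "island": ["a"],  # nothing links here, so a crawl never finds it
}

def crawl(start, links):
    """Breadth-first traversal that visits each reachable page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("starter", LINKS)
```

The `seen` set is what keeps a spider from fetching the same page over and over. The "island" page shows why the hope is to reach most of the web rather than all of it: a page that nothing links to is invisible to this approach.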
This process caused a great deal of controversy because some
poorly written spiders were creating huge loads on the network by
repeatedly accessing the same series of pages. Most network
administrators thought they were a bad thing, so naturally
programmers created even more of them.
By December 1993, the web had a case of the creepy crawlies.
Three search engines powered by robots had made their debut:
JumpStation, the World Wide Web Worm, and the Repository-Based
Software Engineering (RBSE) spider.
JumpStation’s web bot gathered information about the title
and header from Web pages and used a very simple search and
retrieval system for its web interface. The system searched a
database linearly, matching keywords as it went. Needless to say,
as the web grew larger, JumpStation became slower and slower,
finally grinding to a halt.
The WWW Worm indexed only the titles and URLs of the pages it
visited. It used regular expressions to search the index. Results
from JumpStation and the Worm came out in the order that the
search found them, meaning that the order of the results was
completely irrelevant. The RBSE spider was the first to improve on
this process by implementing a ranking system based on relevance
to the keyword string. (5)
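The difference between results in scan order and results ranked by relevance can be shown with a small sketch. The scoring rule used here (counting keyword occurrences in the title) is purely an assumption chosen for illustration; it is not the ranking method the RBSE spider actually used, and the pages are invented.

```python
# Hypothetical index entries: (URL, title). Invented for illustration.
PAGES = [
    ("http://example.org/a", "Spiders of North America"),
    ("http://example.org/b", "Web spiders and web robots: a web primer"),
    ("http://example.org/c", "Cooking with milk"),
]

def linear_search(pages, keywords):
    """Return matches in the order the scan encounters them."""
    words = [w.lower() for w in keywords]
    return [url for url, title in pages
            if any(w in title.lower() for w in words)]

def ranked_search(pages, keywords):
    """Order matches by how often the keywords occur in the title."""
    words = [w.lower() for w in keywords]
    scored = []
    for url, title in pages:
        score = sum(title.lower().count(w) for w in words)
        if score > 0:
            scored.append((score, url))
    # Highest score first; Python's sort is stable, so ties keep scan order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for score, url in scored]

unordered = linear_search(PAGES, ["web", "spiders"])
ranked = ranked_search(PAGES, ["web", "spiders"])
```

Both functions find the same two pages, but only `ranked_search` puts the page that mentions the keywords most often at the top.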
The Easily Excitable Spider
The popular public search engine, Excite, has roots that extend
rather far back in the history of the web. Initially, the project
was called Architext; it was started by six Stanford
undergraduates in February 1993. Their idea was to use statistical
analysis of word relationships in order to provide more efficient
searches through the large amount of information on the Internet.
Their project was fully funded by mid-1993. Once funding was
secured, they released a version of their search software for
webmasters to use on their own web sites. At the time, the
software was called Architext, but it now goes by the name of
Excite for Web Servers.
Billions and billions of categorized links...
Unfortunately, these spiders all lacked the intelligence to
understand what it was that they were indexing. Therefore, if you
didn’t specifically know what it was that you were looking for,
it was unlikely that you’d find it. This deficiency prompted the
creation of EINet Galaxy, now known as the Tradewave Galaxy, which
is the oldest browsable/searchable web directory. Because it is a
directory, Galaxy links are organized into hierarchical
categories. For example, a top-level category might be called
"Computers." Within the Computers category there might
be subcategories for "IBM," "Sun
Microsystems," "Digital Equipment Corporation," and
so on. Within each of these subcategories would be further
subcategories, although these would be more or less consistent
across the various machine types. As an example, all of the
computer company categories might contain the subcategories of
"Hardware" and "Software." This method of
organization allows users to more effectively explore the contents
of the database by narrowing the field of interest.
The Galaxy went online in January 1994. It contained Gopher and
Telnet search features in addition to the web-searching features.
Interestingly enough, Gopher was vastly popular as a
document-sharing tool when the web was born. The Gopher search
capability was probably the primary reason for the creation of the
EINet Galaxy. (There weren’t really very many web pages to
search through in January 1994!) The web page search capability
was simply an additional feature.
Through the present, Tradewave (www.tradewave.com) still clings
to its directory-based roots; it uses no bots or spiders to seek
out new URLs. Therefore, the Galaxy is a true directory in the
sense that it lists only URLs that have been submitted to it, and
all categorization and review of the submitted URLs is done by
hand. This results in higher-quality pages and more relevant
searches, but far fewer pages to search through.
Yahoo! and a Yippity tai-yai-yay!
At this stage in the game, people were creating pages of links to
their favorite documents. In April 1994, two Stanford University
Ph.D. candidates, David Filo and Jerry Yang, created some pages
that became rather popular. They called the collection of pages
Yahoo! Their official explanation for the name choice was that
they considered themselves to be a pair of yahoos.
As the number of links grew and their pages began to receive
thousands of hits a day, the team created ways to better organize
the data. In order to aid in data retrieval, Yahoo! (www.yahoo.com)
became a searchable directory. The search feature was a simple
database search engine. Because Yahoo! entries were entered and
categorized manually, Yahoo! was not really classified as a search
engine. Instead, it was generally considered to be a searchable
directory. Yahoo! has since automated some aspects of the
gathering and classification process, blurring the distinction
between engine and directory.
The Wanderer captured only URLs, which made it difficult to
find things that weren’t explicitly described by their URL.
Because URLs are rather cryptic to begin with, this didn’t help
the average user. Searching Yahoo! or the Galaxy was much more
effective because they contained additional descriptive
information about the indexed sites.
Brian's WebCrawler: Some Spider!
As bots got better and better, one rose above the pack with its
unique ability to index the entire text of a web page. Other bots
were storing the title and the URL, and the first 100 or so words
of a document, but it was WebCrawler that first allowed the user
to search the full text of entire documents.
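To see what full-text indexing buys, compare an index built from titles alone with one built from titles plus body text. This is a rough sketch with invented documents and function names; it makes no claim about how WebCrawler stored its index.

```python
import re
from collections import defaultdict

# Hypothetical documents: URL -> (title, body). Invented for illustration.
DOCS = {
    "http://example.org/1": ("Home page", "Notes on wolf spiders and their habits."),
    "http://example.org/2": ("Wolf spiders", "A short gallery of photographs."),
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, full_text):
    """Map each word to the set of URLs whose indexed text contains it."""
    index = defaultdict(set)
    for url, (title, body) in docs.items():
        text = title + " " + body if full_text else title
        for word in tokenize(text):
            index[word].add(url)
    return index

title_index = build_index(DOCS, full_text=False)
full_index = build_index(DOCS, full_text=True)
```

A search for "habits" finds nothing in `title_index`, but `full_index` leads straight to the first page, whose unhelpful title ("Home page") would otherwise have hidden it.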
The history of WebCrawler is best told by those responsible:
"In early 1994, students and faculty in
the Department of Computer Science and Engineering [of the
University of Washington] gathered in an informal seminar
to discuss the early popularity of the Internet and the
World-Wide Web. Students typically try out their ideas in
small projects in these seminars, and several interesting
projects were started. The WebCrawler was Brian
Pinkerton's project, and began as a small single-user
application to find information on the Web.
Fellow students persuaded Pinkerton to build the Web
interface to the WebCrawler that became widely usable. In
that first release on April 20, 1994, the WebCrawler's
database contained documents from just over 6000 different
servers on the Web. The WebCrawler quickly became an
Internet favorite, receiving an average of 15,000 queries
per day in October, 1994 when Pinkerton delivered a paper
describing the WebCrawler."
Eventually, the demand for WebCrawler devastated the network
resources at the University of Washington. Although a number of
companies invested in server equipment to ease the load on the
WebCrawler servers, there was no solution to the bandwidth issue.
At one point, the service became entirely unusable during the
daytime hours. Finally, America Online (AOL) saved the day by
purchasing the WebCrawler system and running it on its own
network. In 1997, Excite bought out WebCrawler, and now AOL is
using an Excite derivative as the engine behind its own NetFind.
The most important point about WebCrawler is that it was the
first full-text search engine on the Internet. Until its debut, a
user could search through only URLs or descriptions. The
descriptions were sometimes created by the engines themselves or
reviewers trying to rate the sites.
A final word about WebCrawler from the company itself:
"Several competitors emerged within a year of WebCrawler’s
debut: Lycos, Infoseek, and OpenText. They all improved on
WebCrawler’s basic functionality, though they did nothing
revolutionary. WebCrawler’s early success made their entry into
the market easier, and legitimized businesses that today
constitute a small industry in Web resource
discovery."(www.webcrawler.com)
Mellon-Mania: The Birth of Lycos
Lycos was indeed the next big kid on the block, bursting out of
the labs at Carnegie Mellon University in July 1994.
The person responsible for unleashing this force onto the world is
Michael Mauldin. He is currently on leave from CMU, acting as
Chief Scientist at Lycos, Inc. In a paper describing design
decisions made while programming Lycos, he gives a very nice
history of the service.
"Work on the Lycos spider began in May
1994, using John Leavitt's LongLegs program as a starting
point. (Lycos was named for the wolf spider, Lycosidae
lycosa, which catches its prey by pursuit, rather than in
a web.) In July 1994, I added the Pursuit retrieval engine
to allow user searching of the Lycos catalog (although
Pursuit was written from scratch for the Lycos project, it
was based on experience gained from the ARPA Tipster Text
Program in dealing with retrieval and text processing in
very large text databases (9) ). On July 20, 1994, Lycos
went public with a catalog of 54,000 documents. In
addition to providing ranked relevance retrieval, Lycos
provided prefix matching and word proximity bonuses. But
Lycos' main difference was the sheer size of its catalog:
by August 1994, Lycos had identified 394,000 documents; by
January 1995, the catalog had reached 1.5 million
documents; and by November 1996, Lycos had indexed over 60
million documents -- more than any other Web search
engine. In October 1994, Lycos ranked first on Netscape's
list of search engines by finding the most hits on the
word ‘surf.’"(6)
Hide and Seek
Representatives of Infoseek, another major search engine, say that
they founded their corporation in January 1994. Although this may
be true, the search engine itself was not accessible until much
later that year.
Initially, Infoseek was just another search engine. It borrowed
conceptually from Yahoo! and Lycos, not really innovating in any
particular way. Yet the history of Infoseek and its current
critical acclaim show that being the first or most original
isn’t always that important. Infoseek’s user-friendly
interface and the numerous additional services (such as UPS
tracking, News, a directory, and the like) have garnered kudos,
but it was Infoseek’s strategic deal with Netscape in December
1995 that brought it to the forefront of the search engine line.
Infoseek convinced Netscape (with the help of quite a bit of cash)
to have its engine pop up as the default when people hit the Net
Search button on the Netscape browser. Prior to this, Yahoo! was
Netscape’s default search service.
Return of the DEC
Digital Equipment Corporation’s (DEC) AltaVista was a latecomer
to the scene; it had its online debut in December 1995.
Nonetheless, it had a number of innovative features that quickly
catapulted it to the top. The least of the features was its speed.
Run on a bunch of DEC Alphas, it had the horsepower to handle
millions of hits per day without slowing down in the slightest.
The rest of its features, all available from introduction,
changed the face of search engines forever. AltaVista was the
first to use natural language queries, meaning a user could type
in a sentence like "What is the weather like in Tokyo?"
and not get a million pages containing the word "What."
Additionally, it was the first to implement advanced searching
techniques, such as the use of Boolean operators (AND, OR, NOT,
etc.). Furthermore, a user could search newsgroup articles and
retrieve them via the web as well as specifically search for text
in image names, titles, Java applets, and ActiveX objects.
Additionally, AltaVista claims to be the first search engine to
allow users to add to and delete their own URLs from the index,
placing them online within 24 hours.
One of the most interesting new features AltaVista provided was
the ability to search for all of the sites that link to a
particular URL. This was very useful for web designers who were
trying to get some popularity for their pages; they could
frequently check to see how many other pages were referencing
them.
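Finding every page that links to a given URL amounts to inverting the link data that a spider has already collected. The sketch below shows the idea on an invented link graph; it is an assumption about the general approach, not a description of AltaVista's implementation.

```python
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
OUTLINKS = {
    "http://example.org/a": ["http://example.org/home", "http://example.org/b"],
    "http://example.org/b": ["http://example.org/home"],
    "http://example.org/home": ["http://example.org/a"],
}

def invert(outlinks):
    """Build the reverse map: page -> pages that link to it."""
    backlinks = defaultdict(list)
    for source in sorted(outlinks):  # sorted for a predictable order
        for target in outlinks[source]:
            backlinks[target].append(source)
    return backlinks

backlinks = invert(OUTLINKS)
referrers = backlinks["http://example.org/home"]
```

`len(referrers)` is the number a web designer would be checking: how many other pages point at theirs.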
On the user interface end, AltaVista made a number of
innovations. It put "tips" below the search field to
help the user better formulate a search. These tips constantly
change, so that after using the search for a few times, users see
a number of interesting features that they possibly did not know
about. This system became widely adopted by the other search
engines.
In 1997, AltaVista created LiveTopics, a graphical
representation system to help users sort through the thousands of
results that a typical AltaVista search generates. LiveTopics is
interesting as a search tool, but conceptually it is more
confusing than the standard search format. Although its innovative
qualities are uncontested, its effectiveness remains to be seen (altavista.software.digital.com/search/showcase/two/index.htm).
A Spider Named "Slurp!": The Powerful HotBot
On May 20, 1996, Inktomi Corporation was formed, and HotBot
was unleashed upon the world. This is the youngest of all of the
major search services, but even at its young age, it has already
caused quite a stir in the online community. According to the
company: "Pronounced ‘ink-to-me’, the company name is
derived from a mythological spider of the Plains Indians known for
bringing culture to the people. Inktomi was founded in January
1996 by Eric Brewer, an assistant professor of computer science at
the University of California at Berkeley, and Paul Gauthier, a
graduate student in the computer science Ph.D. program, with a
desire to commercialize the highly-effective technologies
developed during their research. (www.inktomi.com/press/icf-pr.html)"
The Inktomi search engine was quickly licensed to Wired
magazine’s web site, HotWired. This site’s popularity
accounted for much of the initial fervor over HotBot. Wired’s
reputation as the oracle of the Net made promoting the site fairly
straightforward.
So what’s the big deal? Just another search engine? Well, yes
and no. HotBot is probably the most powerful of the search
engines, with a spider that can supposedly index 10 million pages
per day. According to the Wired web site, HotBot should soon be
able to reindex its entire database on a daily basis. This will
ensure that the pages returned from a search are not out of date,
which is now common with other search engines.
Additionally, HotBot makes extensive use of cookie technology
to store personal search preference information. A cookie is a
small file that a site can store on your computer. This file can
be read only by the site that generates it. It can hold a small
amount of text or binary information. This information is often
used by sites to store customization information or to store user
demographic data.
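As a small illustration of what such a cookie looks like on the wire, the sketch below uses Python's standard `http.cookies` module to build the header a site would send and to read back the value a browser would return. The preference name `results_per_page` is invented for this example and is not taken from HotBot.

```python
from http.cookies import SimpleCookie

# Server side: build a cookie holding a (hypothetical) search preference.
cookie = SimpleCookie()
cookie["results_per_page"] = "25"
cookie["results_per_page"]["path"] = "/"
header = cookie.output()  # the Set-Cookie header line sent to the browser

# Later request: the browser sends the stored value back to the same site.
returned = SimpleCookie()
returned.load("results_per_page=25")
preference = returned["results_per_page"].value
```

The site never has to ask the user for the preference again; it simply reads it from the cookie on each visit.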
HotBot recently won the PC Computing Search Engine Challenge, a
contest between the major search engines. Representatives from
each company were asked questions that could be answered only by a
web search. The engine that most effectively led the
representative to the right answer won the question. Although this
challenge proved very little more than the searching abilities of
the various representatives, it still garnered quite a bit of
critical acclaim for HotBot, further increasing its popularity.
Information Overload: METAbolic Shutdown
What the PC Computing Challenge did show was that different
engines pull up completely different sets of materials for similar
searches. This makes it extremely frustrating to find what you
want on the web, because a query that has little effect using one
engine may turn up a gold mine of information on another.
Additionally, the little differences between the engines,
especially regarding the support of Boolean operators, have a large
impact on the type of query format that works most effectively.
The current solution to this problem is the META engine. META
engines forward search queries to all of the major web engines at
once. The first of these engines was MetaCrawler. MetaCrawler
searches Lycos, AltaVista, Yahoo!, Excite, WebCrawler, and
Infoseek simultaneously.
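The forwarding idea can be sketched in a few lines. Everything here is a stand-in: the two "engines" are local functions returning canned results, whereas a real META engine would send the query over the network to each service, and would do so in parallel rather than one after the other.

```python
# Stand-ins for remote engines: each is a hypothetical function that takes a
# query and returns an ordered list of URLs.
def engine_one(query):
    return ["http://example.org/x", "http://example.org/y"]

def engine_two(query):
    return ["http://example.org/y", "http://example.org/z"]

def meta_search(query, engines):
    """Forward the query to every engine and merge results, dropping duplicates."""
    merged = []
    seen = set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:  # keep only the first engine to report a URL
                seen.add(url)
                merged.append((url, name))
    return merged

results = meta_search("wolf spider", {"one": engine_one, "two": engine_two})
```

The merged list contains three distinct URLs even though the engines returned four results between them, which is the point of the exercise: one page of answers instead of several overlapping ones.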
MetaCrawler was developed in 1995 by Erik Selberg, a master's
student at the University of Washington (the same place where
WebCrawler was developed a few years earlier). Like WebCrawler,
MetaCrawler soon grew too large for its university britches and
had to be moved to another site. Here, Erik tells the story of how
MetaCrawler became the go2net search engine:
MetaCrawler was conceived in spring of 1995 by
myself and my advisor, Oren Etzioni, as my master's degree
project. It grew rapidly in popularity once we released it
publicly, gaining many new users after Forbes mentioned us
in a cover-page article. Use jumped after C|Net reviewed
all the major search services, ranking us No. 1, with
AltaVista No. 2 and Yahoo No. 3...
In May of 1996, I (along with most of the rest of the
AI department at UW) created NETbot. ...When I left NETbot
to return to research at UW… MetaCrawler was now under 7
× 24 monitoring service, the code was as reliable as
ever, and we had made several performance improvements.
...
There was a realization that Netbot was ill-equipped to
handle negotiations with the search services for continued
MetaCrawler use. Thus, the decision was made to license
MetaCrawler to go2net, who could provide the resources
necessary to make MetaCrawler viable as well as negotiate
with the search services toward mutually beneficial
arrangements. (www.metacrawler.com/selberg-history.html)
MetaCrawler functions by reformatting the search engine output
from the various engines that it queries onto one concise page.
Throughout MetaCrawler’s history, the search engine companies
that it worked with did not entirely approve of this procedure.
The most common complaint was that the advertising banners that
the search engines had on their sites were not appearing when a
user employed MetaCrawler. This meant that their ads were not
reaching the intended audience, reducing their ad revenues.
The move to go2net heralded MetaCrawler’s concession to these
concerns. Now MetaCrawler displays the ads from each search site
right above the results. MetaCrawler users were not thrilled by
this change because it increased the time it took for the result
page to download. However, skillful design of the result pages now
causes the text to load first, calming the restless native users.
Are You Savvy Enough to Search with Me?
Colorado State University also has a tool called SavvySearch that
searches up to 20 engines at once, including a number of
topic-specific directories such as Four11 (e-mail addresses),
FTPSearch95 (files on the Net), and DejaNews (Usenet database).
It’s faster but less reliable than MetaCrawler. SavvySearch’s
solution to the problem of differing types of search engine query
formats is to ignore them all. Users should not try to enter
complex search strings into SavvySearch. MetaCrawler at least
tries to tackle this problem by creating its own search syntax
(using + to indicate AND, - to indicate AND NOT) and by converting
this syntax into the equivalent command for each engine. However,
neither MetaCrawler nor SavvySearch lets you tap the full power of
the advanced search syntaxes offered by most engines.
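The translation step can be sketched as a tiny parser plus a renderer. Only the meaning of + and - comes from the description above; the target format (AND / AND NOT keywords), the treatment of bare words as required, and the function names are assumptions made for illustration.

```python
def parse(query):
    """Split a '+word -word' query into required and excluded terms."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        elif token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            required.append(token)  # assumption: bare words count as required
    return required, excluded

def to_boolean(required, excluded):
    """Render for a hypothetical engine that expects AND / AND NOT keywords."""
    parts = list(required) + ["NOT " + term for term in excluded]
    return " AND ".join(parts)

required, excluded = parse("+spider +wolf -web")
translated = to_boolean(required, excluded)
```

A real converter would need one renderer per engine, and anything an engine's syntax cannot express has to be dropped, which is why the common syntax ends up less powerful than each engine's native one.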
One Click, DoubleClick, Red Click, Blue Click
We’ve already briefly touched upon the relationship between
advertisers and search engines, but the area is now of such
importance that it deserves its own section. It wasn’t long
after the advent of search engines—especially when Yahoo! made
its much publicized move from the servers at Stanford to those at
Netscape—before advertisers noticed that search engine sites
were receiving orders of magnitude more hits than
any other type of site on the web. Receiving daily hits in the
millions, search engines seemed like advertising gold mines. This
realization prompted the creation of many of the other current
search engines.
"Intra"-ducing...
Netscape, severely shaken and battered by Microsoft’s free
release of a competing web browser (Internet Explorer), decided to
concentrate on the new phenomenon of the intranet. Corporations
wanted to use web technology to facilitate document sharing within
their own corporate networks. These corporations also wanted to be
able to hide these documents from the rest of the web, yet provide
their employees with the same search capabilities offered on the
web. Search engine companies now had a market for their product,
which initially capitalized on the advertising industry for
revenue. Although there were a number of freely available search
engines, corporations such as Digital Equipment and Infoseek
capitalized on the lack of programmers who understood web
administration and priced technical support and service into their
commercial search engine packages.
Soon, another reason for having a "private" search
engine became apparent. Unlike most other media, a web page is
constantly updated, and new pages are added to and removed from
sites every day. None of the major web-based search engines could
search the entire web on a daily basis. Therefore, the search
databases would often contain out-of-date references or would miss
entire sections of web sites. The larger sites began indexing
their own sites and providing search engines that would primarily
search through their own materials. Some allowed the user to
search the rest of the web as well by linking the engine into one
of the larger web databases such as AltaVista.
Many relatively small sites are now providing search engines
for their own sites. This is because search engines are becoming
easier and easier to use and incorporate within a web site, and
because the rapid growth of the web has led to an incredible
amount of "junk" in the form of out-of-date pages, pages
with misleading descriptions, pages deliberately designed to
confuse search engines, and so on. Additionally, it is often
difficult to know what to search for, and many users have a hard
time expressing what it is they wish to find in a language that
search engines can effectively understand. Using a site-specific
search engine narrows the possibilities enough that a poorly
formulated search may still return the intended result.
Summary
Now that we’ve finished our search engine history lesson, you
should be somewhat familiar with a number of the key players in
the search engine area. Additionally, you should be starting to
get a feeling for some of the issues that search engines face.
The next chapter takes a closer look at some of the engines
mentioned here as well as a few others. You’ll learn how users
interact with each engine. Ultimately, you’ll understand the
strengths and limitations of today’s search techniques and what
users have come to expect from a search engine. This knowledge is
extremely important when choosing a search engine for your own web
site. It will help you determine if a particular engine can handle
the task you need it to accomplish. You’ll also be able to
better understand how your users will interact with the engine you
choose.
References
1. Hotomi sounds better than HOinkTomi, don’t you think?
2. This fact did not thrill MIT network administrators when the
web became popular a year later. Although they made an attempt to
wrestle the URL away from SIPB, the students prevailed, and to
this day MIT’s own homepage is located at http://web.mit.edu.
There is an interesting allegory relating to this at the bottom of
SIPB’s main page at http://www.mit.edu for those that are
curious.
3. such as the document, "Inessential Refrigerator
Restocking," which is still available at:
http://www.mit.edu:8001/sipb/documents/
4. Michael Mauldin, "Lycos: Design choices in an Internet
search service" 1997
5. The name Veronica officially expands to Very Easy
Rodent-Oriented Netwide Index to Computerized Archives -- somehow
I think they worked the expansion out afterwards, but you decide.
6. Michael Mauldin, "Lycos: Design choices in an Internet
search service" 1997