There are three tools that are central to the functionality of every modern search engine. What are those three tools?
A way to discover new content automatically and continuously;
A way to index content as it is discovered;
A way to search through indexed content to find bits and pieces that a search engine user is looking for.
By that definition, a search engine is a pretty simple concept. However, in practice, putting those three bits of technology together has proven easier said than done, and early search engines only met one or two of these requirements.
Today, leading search engines are some of the most visible and valuable technology companies around. And technologies pioneered by search engines are implemented on nearly all modern websites.
However, it wasn’t always this way. Today’s search engines come from humble beginnings, and search has come a long way in the last few decades.
Search Engines Before the Web
The story of the search engine actually starts at Cornell University before the internet had even been created. In the 1960s, Gerard Salton and his colleagues at Cornell developed the SMART Information Retrieval System.
SMART either stands for System for the Mechanical Analysis and Retrieval of Text or Salton’s Magical Automatic Retriever of Text depending on who you ask.
It was an early information retrieval system that established many of the conceptual underpinnings on which search engines are based, including term weighting, relevance feedback, term dependency, and a lot more.
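To make term weighting a little more concrete, here is a minimal Python sketch of one widely used scheme, TF-IDF. It illustrates the general idea only and is not SMART’s actual formula. The sample documents and the function name tf_idf are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = term frequency in the document * log(N / document frequency).
    This is one textbook variant, not SMART's exact weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(docs)
# "the" occurs in every document, so its weight is 0; rarer terms score higher.
print(weights[0]["the"], weights[0]["mat"] > weights[0]["cat"])
```

The point of the weighting is that a word found everywhere says little about any one document, while a word found in only a few documents is a strong hint about what those documents cover.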
From SMART, we move on to the first generation of internet-based search engines. The internet is really just a system of computer networks connected over TCP/IP communication protocols. It was developed more than a decade before Tim Berners-Lee created the World Wide Web, or just the web.
Several different communication protocols were used to transmit data over internet connections before the web was born. And the earliest search engines were designed to be used over some of these older protocols.
The WHOIS protocol, which is still used to this day, debuted in 1982 and was one of the first tools used to query databases over the internet.
Initially, WHOIS searches were quite powerful and could be used to locate a great deal of information about a block of internet resources or to track down all of the resources associated with a single person or organization.
Today, WHOIS search parameters are much more limited and WHOIS is used to locate the registered owner of a single resource, or quite commonly, to locate the privacy service used to obscure the ownership of a single resource.
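For a sense of how simple a WHOIS lookup is on the wire, here is a rough Python sketch. It assumes the conventional exchange: open a TCP connection to port 43, send the query as one line ending in CRLF, and read until the server closes the connection. The function names build_request and whois_query are invented for this example.

```python
import socket


def build_request(query):
    """Encode a query as a single line terminated by CRLF."""
    return query.encode("ascii") + b"\r\n"


def whois_query(query, server, port=43, timeout=10.0):
    """Send one WHOIS query and return the server's raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(build_request(query))
        chunks = []
        # The server signals the end of the reply by closing the connection.
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With network access, something like whois_query("example.com", "whois.iana.org") should return a block of plain text. Which server to ask, and what the reply looks like, varies by registry, so treat the reply as free-form text rather than structured data.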
Public FTP servers, which are document storage and retrieval servers that anyone can access over an internet connection, were common during the late 1980s and early 1990s.
However, there was no easy way to locate information on a public FTP server unless you knew the location of the server and the name and location of the document you wanted to access. All of that changed when Archie was released in 1990.
Archie is often thought of as the first real search engine. While there were search technologies such as WHOIS that were developed earlier, Archie was noteworthy because it was the first tool that could be used to search for content rather than users.
Archie consisted of two components:
An Archie server that indexed the contents of public FTP servers.
A search tool used to query the names of the files that were indexed on the Archie server.
By modern standards, Archie was a pretty crude tool. However, at the time, Archie was a huge step forward in the use of the internet for information retrieval. Here’s how the system worked:
When a new public FTP server came online, the owner of the server would get in touch with the administrator of an Archie server and ask to have their FTP server included in the Archie index.
Once a month, more or less, each Archie server would take a snapshot of the names of the files stored on each mapped FTP server.
Archie servers were networked together and the contents of each were periodically mirrored to all the other Archie servers.
In this way, each Archie server contained a relatively complete and up-to-date index of the contents of every FTP server that was mapped by the system.
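The process above can be sketched in a few lines of Python. This is only an illustration of the idea, not Archie’s actual code: the server names, file paths, and the functions build_index and search are all hypothetical, and a plain dictionary stands in for the monthly snapshots.

```python
def build_index(listings):
    """Map each file name to the (server, path) pairs where it was seen.

    `listings` is {server: [path, ...]}, standing in for the periodic
    snapshots taken of each registered FTP server.
    """
    index = {}
    for server, paths in listings.items():
        for path in paths:
            name = path.rsplit("/", 1)[-1]  # keep only the file name
            index.setdefault(name, []).append((server, path))
    return index


def search(index, pattern):
    """Return locations of files whose *name* contains `pattern`."""
    pattern = pattern.lower()
    hits = []
    for name in sorted(index):
        if pattern in name.lower():
            hits.extend(index[name])
    return hits


listings = {
    "ftp.example.edu": ["/pub/tools/zip.tar", "/pub/docs/readme.txt"],
    "ftp.example.org": ["/mirror/zip.tar"],
}
index = build_index(listings)
print(search(index, "zip"))
```

Note that only file names are matched. A search for a word that appears inside readme.txt, but not in its name, finds nothing, which is exactly the limitation described later in this article.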
The contents of an Archie server could be searched in a few different ways. If a user had direct access to a server, they could use a search application installed directly on the server.
Command line connections could be made to search an Archie server over a Telnet internet connection. Later, queries could be made by sending a properly formatted email to the server or by using a web-based search interface.
What Archie was to FTP servers, Archie’s friend, Veronica, was to Gopher servers.
Gopher was an internet communication protocol developed in the early 1990s by Mark McCahill at the University of Minnesota. It was much more like the web than FTP. But there were also many differences.
Gopher was a fairly strict protocol compared to the web’s HTTP protocol. Enthusiasts would say it was faster and more organized than the web while critics might call it restrictive and confining.
Gopher looked more like a File Manager (think: Windows Explorer) than a web page. Each Gopher server consisted of a series of menus and submenus which were used to organize the documents stored on the server.
Initially, finding information on a Gopher server required manually navigating through a series of menus and submenus based on the titles and descriptions associated with each menu until the resource you were looking for was found.
Veronica soon offered an alternative to this manual navigation process.
Veronica was basically the application of the Archie model to the Gopher protocol. Information about Gopher servers was stored on Veronica servers, and the Veronica servers were queried to locate information about documents stored on the indexed Gopher servers.
Not long after the development of Veronica, Jughead appeared. Although it was also a Gopher tool, Jughead was a different animal entirely. Jughead could only be used to search through the menus and submenus of a very limited part of Gopher, usually just a single server.
Some advanced search operators could be used with Jughead, making it a powerful tool for sifting and locating the contents on a single Gopher server.
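To picture what searching the menus of a single Gopher server might look like, here is a small Python sketch. The menu tree is a made-up nested dictionary, and requiring every query word to appear in a title is an assumption chosen for illustration, not a description of Jughead’s real operators.

```python
def walk(menu, path=()):
    """Yield (path, title) for every entry in a nested menu tree."""
    for title, child in menu.items():
        here = path + (title,)
        yield here, title
        if isinstance(child, dict):  # a submenu; anything else is a document
            yield from walk(child, here)


def search_titles(menu, query):
    """Return paths of entries whose title contains every word in `query`."""
    words = query.lower().split()
    return [
        " / ".join(path)
        for path, title in walk(menu)
        if all(w in title.lower() for w in words)
    ]


server_menu = {
    "Campus Information": {
        "Library Hours": "doc",
        "Library Catalog Help": "doc",
    },
    "Weather": {"Campus Weather Forecast": "doc"},
}
print(search_titles(server_menu, "library help"))
# -> ['Campus Information / Library Catalog Help']
```

The search replaces the manual walk through menus and submenus, but it still only sees titles, never the documents themselves.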
What’s in a Name?
I’m sure at this point you’re wondering about the names of these three search engines: Archie, Veronica, and Jughead.
Archie came first and had nothing to do with the popular comic series. The name was created by taking the word archive and removing the letter v. The names Veronica and Jughead were a simultaneous reference to their relationship to Archie and a nod to the comics series.
To give the names Veronica and Jughead some meaning beyond a playful reference to Archie, acronyms (backronyms) were later created for them.
Veronica was said to be short for Very Easy Rodent-Oriented Net-wide Index to Computer Archives. And Jughead was Jonzy’s Universal Gopher Hierarchy Excavation and Display.
The Problem with Archie and His Friends
While Archie, Veronica, and Jughead were all useful and cutting-edge tools at the time, they all suffered from certain limitations.
First, all three failed to meet the first requirement of a modern search engine: to possess a way to discover new content automatically and continuously. While Archie and Veronica did index contents on a broad range of servers, new servers had to be added to the index manually.
There was no mechanism for automatic discovery of new servers. Jughead, on the other hand, was limited to just a single server.
Second, all three search engines were only capable of searching titles and descriptions. None of the three indexed the contents of any of the documents included in their indices.
While all three of these search engines were important steps along the way to building a modern search engine, all three of these tools were effectively manual indices with limited search functionality.
What Happened to Gopher?
Gopher expanded rapidly in the early 1990s. However, in 1993, the University of Minnesota, which owned the intellectual property rights to Gopher, decided to start charging licensing fees for every Gopher installation.
The World Wide Web, which had been launched after Gopher and was lagging behind, had been released as a completely free platform. As a result, after 1993 users began to flock to the web to avoid the licensing fees associated with Gopher.
While Gopher was eventually released as GPL software in the year 2000, and there are a few active Gopher servers today, Gopher is largely a hobby project kept alive by Gopher enthusiasts.
The Web’s First Search Engines
When the web was first created there were no search engines designed to operate over the web’s communication protocol, HTTP. Initially, Tim Berners-Lee maintained and manually updated a directory of all web servers.
However, by 1993, the web had grown to the point that keeping a comprehensive manual directory was no longer feasible and the need for good search engines was plain to see.
As was mentioned in the introduction, a web search engine needs to do three things to be genuinely useful:
Content discovery: computer programs called web crawlers must be used to automatically and systematically crawl the web seeking out new or updated content.
Content indexing: an index of the discovered content must be created and maintained.
Search: the index must be accessible with a search tool which compares search terms to the contents of the index and returns useful results.
Early information retrieval tools such as WHOIS, Archie, Veronica, and Jughead failed to meet all three requirements.
Where they all fell short was that they were manually created directories with limited search functionality that did not have a mechanism for automatically finding and indexing new content.
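The three requirements fit together as a small pipeline, sketched below in Python. Everything here is a toy: the "web" is an in-memory dictionary of hypothetical URLs, and the functions crawl, build_index, and search are simplified stand-ins for what real search engines do at a vastly larger scale.

```python
from collections import deque

# A toy "web": each hypothetical URL maps to (page text, outgoing links).
WEB = {
    "a.example/": ("history of search engines", ["b.example/", "c.example/"]),
    "b.example/": ("gopher and ftp search tools", ["a.example/"]),
    "c.example/": ("web crawlers discover pages", ["d.example/"]),
    "d.example/": ("pages nobody submitted by hand", []),
}


def crawl(seed):
    """Content discovery: follow links breadth-first from one seed URL."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)


def build_index(pages):
    """Content indexing: map each word to the set of URLs containing it."""
    index = {}
    for url, text in pages:
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Search: return URLs that contain every word of the query."""
    sets = [index.get(word, set()) for word in query.split()]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(crawl("a.example/"))
print(search(index, "search"))        # ['a.example/', 'b.example/']
print(search(index, "pages nobody"))  # ['d.example/']
```

The page at d.example/ was never submitted to anyone. The crawler found it by following a link from c.example/, which is the automatic discovery step that the earlier tools lacked.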
Searchable Manual Directories
The earliest web search engines were searchable directories similar to Archie and Veronica.
W3Catalog, the very first web search engine, was extremely similar to Archie or Veronica in concept. When it was created in 1993, there were several high-quality, curated, website indexes that each covered a limited portion of the web. What W3Catalog did was:
Use a computer program to pull the information from the various indexes;
Reformat the contents so that the listings were presented consistently regardless of the index from which they originated;
Provide a query tool which could be used to search for relevant listings.
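Those three steps can be sketched in Python as follows. The two source lists, their layouts, and the function names normalize and query are all invented for illustration. The original indexes that W3Catalog drew from had their own formats, which are not reproduced here.

```python
# Two hypothetical curated lists, each with its own layout.
LIST_A = (
    "Physics Preprints | http://a.example/physics\n"
    "Weather Maps | http://a.example/weather"
)
LIST_B = [("http://b.example/maps", "Historical maps collection")]


def normalize(list_a, list_b):
    """Convert both source formats into uniform {title, url, source} records."""
    records = []
    for line in list_a.splitlines():
        title, url = (part.strip() for part in line.split("|"))
        records.append({"title": title, "url": url, "source": "A"})
    for url, title in list_b:
        records.append({"title": title, "url": url, "source": "B"})
    return records


def query(records, term):
    """Return records whose title contains `term`, ignoring case."""
    term = term.lower()
    return [r for r in records if term in r["title"].lower()]


catalog = normalize(LIST_A, LIST_B)
for record in query(catalog, "maps"):
    print(record["source"], record["title"], record["url"])
```

Once every listing has the same shape, a single query can cover all of the source lists at once, which is the convenience a tool of this kind offered over browsing each list separately.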
Aliweb followed quickly on the heels of W3Catalog and was another index-searching tool in the same vein as Archie, Veronica, and W3Catalog.
However, while W3Catalog only pulled information in from a few curated website indices, any webmaster could submit their website for listing on Aliweb.
Indexes like W3Catalog and Aliweb, also called web directories, continued to be popular throughout the 1990s. The most successful of these web directories was Yahoo!
Yahoo! was founded in 1994. One of its biggest contributions to search was its directory service: a large collection of authoritative sites used for its search results.
Yahoo! itself started as a directory of webpages without using a web crawler. The Yahoo! Directory wasn’t the first, but it was probably the largest.
Yahoo! was — and still is — one of the most recognizable search engine names. In the early days, its search function was just a front end for results that came from other web crawlers.
It wasn’t until 2003 that Yahoo! became its own self-crawling search engine. Prior to this, Inktomi, followed by Google, powered Yahoo! Ironically, Google would later become their biggest competitor.
In addition, Yahoo! purchased several search engine companies: Inktomi, AlltheWeb, and Overture.
Yahoo! introduced, or made popular, a number of elements that many search engines still use. It allowed for vertical search results, which is a search within a specific category.
A person could run a search just for images, just for news, and so on. Yahoo! is still in operation, but just like in the past, another search company powers the search results. Today, it is Bing.
Web Crawlers Automate and Speed Up the Indexing Process
The first web crawler was created in June of 1993 and named World Wide Web Wanderer, or just Wanderer for short.
It was created by Matthew Gray to generate an index called Wandex, which was essentially a measure of the size of the web. Wanderer kept Wandex updated until late 1995 but the index was never used for information retrieval purposes.
The first application of a web crawler to create a search engine index was JumpStation.
Created in December of 1993 at the University of Stirling in Scotland by Jonathan Fletcher, the “father of modern search,” JumpStation used web crawlers to create a searchable index of webpage titles and headings.
Within less than a year, while running on a single shared server in Scotland, JumpStation’s web crawlers had indexed 275,000 entries.
However, Fletcher was unable to convince the University to invest additional resources or provide funding for the project, and when Fletcher left the University in late 1994 JumpStation was shut down.
WebCrawler, released shortly after JumpStation, was the first crawler-based search engine to crawl the entire text of every indexed web page.
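The difference between indexing only titles and headings and indexing the full text of a page can be shown with a short Python sketch. It uses the standard library html.parser module. The sample page and the class name TextExtractor are made up for this example, and this is not how either search engine was actually implemented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect words from a page, separately for title/headings and all text."""

    HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.open_headings = 0
        self.heading_words = []
        self.all_words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self.open_headings += 1

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self.open_headings:
            self.open_headings -= 1

    def handle_data(self, data):
        words = data.lower().split()
        self.all_words.extend(words)
        if self.open_headings:
            self.heading_words.extend(words)


page = (
    "<html><head><title>Gopher Guide</title></head>"
    "<body><h1>Getting Started</h1>"
    "<p>Veronica searches menu titles.</p></body></html>"
)
extractor = TextExtractor()
extractor.feed(page)
extractor.close()
print(extractor.heading_words)
print("veronica" in extractor.all_words)
```

An index built only from heading_words would never match a search for "veronica" on this page, while a full-text index built from all_words would.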
Over the ensuing two to three years many crawler-based all-text search engines such as Magellan, Northern Light, Infoseek, HotBot, MSN Search, and Inktomi were launched, bought, sold, shuttered, and merged.
Lycos started as a research project. It launched in 1994 and became the most popular Web destination by 1999.
Unlike other search engines, Lycos was a full corporate business out of the gate. It made money, and it did so quickly. The main reason for its popularity as a search engine was its huge catalog of indexed documents.
It indexed about 400,000 documents per month at launch and ramped up to index a total of 60,000,000 documents in less than two years, more indexed pages than any other search engine. Lycos went through several acquisitions and sales.
As a company, it owned many other companies and sites. As a search engine, it still exists today.
Excite started in 1995. It was the first search engine to use word relationships and statistical analysis to make search results more relevant.
Today, it is known more for what it did not do. In 1999, Excite had the opportunity to buy Google, twice! First, it was offered for a million dollars. Later, the price was reduced to just $750,000. Excite turned down both deals.
At the end of 1995, the Digital Equipment Corporation launched AltaVista. While it wasn’t the first search engine, it improved on its predecessors, eventually becoming one of the most popular search engines of its time.
AltaVista was the first to allow for natural-language search queries, meaning people could simply type what they were looking for instead of using query strings. It also indexed much more of the web than people even knew existed at the time.
Finally, it was one of the first search engines to use Boolean operators. It eventually became part of Yahoo!
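Boolean operators map neatly onto set operations over an index, as the following Python sketch shows. The three tiny documents and the helper term are hypothetical, and this says nothing about AltaVista’s actual query syntax or internals.

```python
DOCS = {
    1: "gopher server menus",
    2: "web server logs",
    3: "gopher and web history",
}

# Inverted index: word -> set of document ids containing it.
INDEX = {}
for doc_id, text in DOCS.items():
    for word in text.split():
        INDEX.setdefault(word, set()).add(doc_id)


def term(word):
    """Documents containing `word` (empty set if the word is unknown)."""
    return INDEX.get(word, set())


# Boolean operators become set operations.
both = term("gopher") & term("web")        # gopher AND web
either = term("gopher") | term("web")      # gopher OR web
without = term("server") - term("gopher")  # server AND NOT gopher

print(sorted(both), sorted(either), sorted(without))
# -> [3] [1, 2, 3] [2]
```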
Ask.com started as Ask Jeeves in 1996. The search engine operated on a question-and-answer platform, where users could ask a question using natural language and the search engine would find an answer.
One of Ask’s main contributions to search is their own page-ranking algorithm, ExpertRank. ExpertRank works with subject-specific popularity. If a website on a specific subject has backlinks from other sites on the same subject, then it’s more relevant.
Ask eventually stopped focusing on search. It still exists as a search engine, but its core product is its searchable database of questions answered by users.
Microsoft’s Bing launched in 2009, but it is not actually that new. Bing existed as MSN Search and Windows Live Search — dating back to 1998. Third parties powered its early searches.
Around 2004, Microsoft started using its own search results. This powered the eventual change from MSN Search to Windows Live Search and finally Bing. Although not nearly as popular as Google, Bing has managed to carve out a decent part of the search engine market.
The same year that Microsoft got into the search engine business (1998), Google was launched. It would soon revolutionize the world of search.
PageRank: A Revolutionary Idea
While it’s impossible to attribute Google’s success to any single factor, it’s also difficult to overstate how important PageRank was to Google’s early success. So, what is PageRank?
Google uses multiple algorithms to decide the order in which search results should be presented. PageRank was the first of these algorithms used by Google. It remains an important part of Google’s overall results ranking methodology. There are two basic ideas behind PageRank:
When lots of websites link to a webpage it suggests that the webpage is useful and trustworthy.
Links from a useful and trustworthy webpage are more valuable and trustworthy than links from an untrusted webpage.
These two ideas are combined to create a hierarchy of website trustworthiness and usefulness known as PageRank.
As you can see, these ideas feed into each other. The presence of more incoming links means a site is more trustworthy, and links from trustworthy sites are more valuable than links from sites that don’t have many incoming links.
What happens is that each link from one website to another is assigned a certain weight, which is typically called link juice in SEO circles. That weight is based on the PageRank of the website from which the link originates and the number of outbound links from the originating website.
Google adds up all of the link juice flowing from originating websites to the webpage in question and uses that information to decide the PageRank to assign to the webpage.
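The calculation described above can be approximated with a short Python sketch. This is a simplified version of the published PageRank idea, not Google’s production algorithm. The four-page link graph is made up, and the damping value of 0.85 and the fixed number of iterations are assumptions commonly used in textbook versions rather than anything stated in this article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a link graph {page: [pages it links to]}.

    Simplified sketch: every page must appear as a key. A page with no
    outbound links spreads its rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outbound in links.items():
            targets = outbound or pages
            # Each link passes on an equal share of the linking page's rank.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, C ends up on top because three pages link to it. A comes second even though it has only two incoming links, because one of them comes from the highly ranked C. D, which nobody links to, finishes last.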
PageRank proved to be a great way to identify useful websites, and users quickly realized that Google search results were more useful than those generated by any other search engine. As a result, users quickly flocked to Google and other search engines were left scrambling to catch up.
By 2002, Google had risen to prominence in the search engine market, thanks in part to its innovative PageRank technology and to the streamlined design of the Google homepage, which stood in stark contrast to the advertising- and content-heavy web portals implemented by virtually all other search engines.
Search Grows Up and Gets a Job
In the 1990s, investing in search was a speculative endeavor. Everyone knew that search was valuable, but no one was really making any money with search.
However, that wasn’t stopping investors from pumping huge sums into innovative search engines, making search investment a significant contributing factor to the dot-com bubble.
In the late 1990s, efforts began in earnest to monetize search.
Search engines realized that they had access to web users who were telling them exactly what they wanted. All that remained was for merchants to place ads that would be displayed to the users who were looking for their products and services.
Overture Monetizes Search
In 1996, Open Text was the first to attempt to commercialize search by offering paid search listings. However, the reaction to seeing paid ad placements was swift condemnation and the idea failed to take off.
Two years later, GoTo, which was later renamed Overture, took a second shot at paid search placements and the concept was accepted. This was due in large part to the fact that the web had matured significantly between 1996 and 1998 and transitioned from being primarily an academic platform to a commercially supported platform.
After launching in 1998, Google borrowed the idea of paid search placements from Overture and rapidly transformed from a struggling startup into one of the most profitable internet businesses.
As could have been predicted, Overture didn’t take too kindly to Google co-opting their idea, and Overture sued Google for infringing on their patented intellectual property in 2002.
Yahoo! got involved in the lawsuit when they purchased Overture in 2003 and then proceeded to settle the case. Google received a perpetual license to use Overture’s patents in exchange for 2.7 million shares of Google common stock.
Today, advertising in search results is the primary funding mechanism used by search engines and generates billions of dollars in annual revenue.
The Modern Search Engine Landscape
Today’s search engine market is dominated by just four competitors whose combined search volume makes up approximately 98% of the total global search engine market.
Google commands about 70% of the global search engine market.
Bing comes in second with a little more than 10% of the market.
Baidu comes in third with a little less than 10% of the market.
Yahoo! comes in tied for third with Baidu.
While other search engines, such as AOL and Ask, are still used millions of times every day, their combined market share is significantly less than 1% of the global search engine market.
One noteworthy omission from most lists of top search engines is YouTube. While YouTube is not a search engine in the traditional sense, more and more users search YouTube for how-to videos, product information, music, news, and other topics previously found primarily through search engines.
If YouTube’s search volume is compared to the list of search engines, YouTube, owned by Google, may actually be the second largest search engine on the web.
For Your Eyes Only
Private search engines such as DuckDuckGo are attractive to privacy-minded individuals who do not want their search habits tracked and sold to advertisers. While these search engines do still use an advertising-based search model, they do not collect, store, or sell identifiable user data.
While DuckDuckGo’s current average of around 10 million queries per day pales in comparison with the 3.5 billion queries processed every day by Google, it represents a 100-fold increase in total search volume between 2011 and 2016.
Search Engine Sophistication
The trend in the development of search technology in recent years has been toward greater sophistication. Examples of innovation in search since 2010 include:
Faster search performance thanks to autocomplete and instantly generated search results, an innovation called Instant Search.
The use of Schema.org markup to produce rich search results, such as product ratings based on a 5-star rating system displayed right on the search results page.
Increasingly targeted crackdowns on spam, content duplication, low-quality content, and websites that make excessive use of advertisements.
The ability of search engines to process unit conversions, currency conversions, simple mathematical calculations, term definitions, language translation, and similar tasks, and display the results in the search engine results page.
The display of public domain encyclopedic information directly in search results, a feature Google calls the Knowledge Graph.
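As an example of the Schema.org markup mentioned in the list above, here is a Python sketch that builds a JSON-LD description of a product rating. The product name and numbers are invented, and the function product_rating_jsonld is hypothetical. The property names (Product, aggregateRating, AggregateRating, ratingValue, bestRating, reviewCount) come from the Schema.org vocabulary. Whether a given search engine actually shows stars for a page is up to that search engine.

```python
import json


def product_rating_jsonld(name, rating, review_count):
    """Build a JSON-LD snippet describing a product's aggregate rating."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "bestRating": 5,
            "reviewCount": review_count,
        },
    }
    return json.dumps(data, indent=2)


snippet = product_rating_jsonld("Example Widget", 4.4, 89)
# A page would embed this inside <script type="application/ld+json"> ... </script>.
print(snippet)
```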
Clearly, leading search engines are no longer satisfied to simply tell you where you can find the information you are searching for.
They are increasingly serving up that information themselves and delivering it directly to users while they simultaneously deliver additional impressions to paying advertisers.
The Future of Search on the Web
Where search is headed is anyone’s guess. Private search, a clear pushback against the advertising and tracking practices of industry leaders such as Google, is exploding in growth, but still represents just a tiny fraction of the overall market.
Google, on the other hand, has grown into a company worth hundreds of billions of dollars and generated almost $75 billion in revenue in 2015 alone.
At the same time, the number of internet-connected devices, households, and users continues to grow and search represents the fundamental mechanism used to find information on the web.
While the future of search may be anyone’s guess, of one thing we can be sure: search isn’t going away anytime soon.