Laymen’s Guide to Understanding Search Engines
There are billions of web pages on the Internet, and it is thought that any information a person could ever need can be found somewhere online. However, searching through these web pages by hand to find the information we are seeking would be a daunting, almost impossible task. Search engines, which help us weed through these sites to find what we are looking for, play an important role in making this information accessible.
Search Engine History
During the early 1990s, the sharing of files across a network was done through the File Transfer Protocol (FTP): anyone who had a file to share would run an FTP server. A user who needed the file would connect using an FTP client, and files available for sharing were announced in mailing lists and discussion forums. As time progressed, anonymous FTP sites were developed, which allowed users to retrieve or post files without an account.
The first search engine can be traced back to the early 1990s. It was known as Archie, and it was developed by Canadian student Alan Emtage. Archie changed the face of searching for and sharing content with a script-based gatherer that searched through FTP sites to create file indexes. It also included a regular expression matcher that allowed users to query the resulting database. In 1993, the search engine Veronica was developed at the University of Nevada. This system was similar to Archie, but it instead indexed Gopher files, which consisted of plain text documents.
Later, the World Wide Web Wanderer, produced by Matthew Gray, became the first search engine to employ a robot. This robot was essentially a software program designed to access web pages using the links found in pages that had already been accessed. It captured URLs to create the Wandex web database. The Wanderer influenced future robot-based search engines, and many of those systems power the search engines of the present.
Search Engine Types
The major search engine types found today can be classified as crawler-based or directory-based. Crawler-based search engines use software called “spiders”, which crawl through the web to find links in web pages. All the pages found through those links are stored in a database that includes indexes of the text content. The index is then used to display the pages that match the search terms entered by the user.
Directory-based engines are driven entirely by humans. The site’s creator maintains a directory of page links. These links can be submitted by the webmaster or by other people who want a particular site to be listed.
Crawlers are more popular than directory-based engines due to their autonomous nature and speed. These search engines are composed of three components: the spider, the indexer, and the ranking software.
The spider component of a search engine works by crawling throughout the Internet, constantly accessing and downloading pages. Spiders rely on the principle that pages are connected via hyperlinks: they typically begin with a list of seed URLs and then move along by following the hyperlinks found within those pages.
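The traversal just described can be sketched in a few lines. The example below is a minimal illustration, not a real spider: it uses a small in-memory link graph in place of actual HTTP fetches and HTML parsing, and all of the URLs are made up.

```python
from collections import deque

# Toy "web": each URL maps to the list of hyperlinks found on that page.
# A real spider would fetch each page over HTTP and parse the links out of its HTML.
PAGES = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example", "http://d.example"],
    "http://d.example": [],
}

def crawl(seed_urls):
    """Breadth-first traversal starting from a seed list of URLs."""
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    order = []                      # the order in which pages are "downloaded"
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in PAGES.get(url, []):
            if link not in seen:    # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["http://a.example"]))
# -> ['http://a.example', 'http://b.example', 'http://c.example', 'http://d.example']
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, as `c.example` links back to `a.example` here.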
It is not possible for a spider to access every web page on the Internet, so spiders are programmed to filter pages and decide which routes to take and where to stop when searching through the web. This helps to make the search go faster. How often a spider revisits a page depends on how often the page is updated, and the behavior of the spider is established by crawling policies that are programmed into it. These policies are discussed below:
- Selection policy: This refers to which pages the spider chooses to download. The growth of the Internet has made it necessary to identify a good policy for selecting the URLs that are navigated. Studies have indicated that only about 16% of the web pages available online are actually indexed and downloaded. With such a small percentage, search engines need to prioritize the most relevant pages for users. The relevance of a page may be determined by the amount of pertinent information, the number of visits, intrinsic quality, and the total inbound links.
- Re-visit policy: This policy refers to when the pages are checked for changes. The Internet is constantly being revised with additions, modifications, and deletions. This puts added pressure on a search engine. An efficient search engine will attempt to achieve search results with low age and high freshness values. Additionally, there are two re-visiting policies that can be used:
- Uniform Policy: This policy requires that all pages are visited uniformly regardless of the page’s rate of change.
- Proportional Policy: This policy requires that the page’s update frequency determines how often the page is visited.
- Politeness Policy: This policy refers to how search engines avoid overloading web pages and sites. Spiders affect the performance of the Internet when they scour it looking for sites. Spiders use up large amounts of bandwidth due to their functioning, and this uses up a lot of network resources. They sometimes work in parallel and they can access sites too frequently, which can also increase the server load.
- Parallelization policy: A parallel crawler will create multiple processes, and it can then cover more area while searching the Internet. The goal is to increase the rate of download while minimizing the overhead caused by this process. The crawler also needs to prevent more than one process from downloading the exact same page. There are several sub-policies for assigning URLs in order to control redundant downloading:
- Dynamic assignment: In this policy, a central server takes responsibility for assigning URLs to individual crawling processes. This results in a load that is balanced uniformly across the available crawlers.
- Static assignment: This policy uses a fixed rule, set at the beginning of a crawl, that defines how new URLs are assigned to crawlers. The rule aims to satisfy the following goals:
- Each process should receive about an equal number of hosts
- If the amount of crawling processes increases, the amount of load per process should decrease.
- Crawling processes should be added or removed with minimal impact on how the system functions overall.
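One simple way to implement a static-assignment rule like the one described above is to hash each URL’s host name and take the result modulo the number of crawler processes. This is only an illustrative sketch, not any particular engine’s scheme, and the example URLs are invented.

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url, num_processes):
    """Statically assign a URL to one of num_processes crawlers by hashing its host.

    Hashing the host (rather than the full URL) keeps all pages of a site on
    the same crawler, which also makes per-host politeness easier to enforce.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_processes

# Every URL from the same host lands on the same crawler process,
# so no two processes can download the same page.
a = assign_crawler("http://example.com/page1", 4)
b = assign_crawler("http://example.com/page2", 4)
assert a == b
```

Because the rule depends only on the URL itself, no central coordinator is needed, and adding or removing a process only requires changing `num_processes`.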
The indexer provides the indexing functions of the searching process. Each page that has been found by the spider and deemed worth downloading is indexed in a way that allows users to quickly access relevant results. The indexer reads the repository that contains all the web pages downloaded by the spider and then decompresses those documents.
Each document is then converted into “hits”, or sets of word occurrences. These hits record the word, its position in a document, font size, capitalization, and other characteristics. Important information is also stored about them within an anchor file, which contains information that determines where each link directs users.
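A stripped-down version of this conversion can be shown in code. The sketch below records, for each word, a simplified “hit” of the form (document id, word position, capitalization flag); real engines store more attributes per hit, such as font size, as the text notes.

```python
from collections import defaultdict

def index_document(doc_id, text, index):
    """Record a simplified 'hit' (doc id, position, capitalized?) for each word."""
    for position, word in enumerate(text.split()):
        hit = (doc_id, position, word[0].isupper())
        index[word.lower()].append(hit)

# Build a tiny inverted index over two example documents.
index = defaultdict(list)
index_document(1, "Search engines index the web", index)
index_document(2, "The web keeps growing", index)

print(index["web"])
# -> [(1, 4, False), (2, 1, False)]
```

Looking up a query term in `index` immediately yields every document and position where it occurs, which is what makes retrieval fast at search time.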
A search engine’s success can be measured by how relevant its search results are. The page-ranking component of a search engine performs the final job of ordering and displaying the results in a web browser. People who work in SEO, or search engine optimization, try to determine how web pages can be built so that they appear within the top 10 results of a given search query. Understanding the algorithms that search engines use can ultimately help developers to create web pages with higher page rankings.
Typically, the details of search engine algorithms are not made public. However, the way that the results are listed can provide some information on the criteria that is used in the page rank process. Some of the most important criteria include the following:
- Web page titles
- Links within pages
- Word usage frequency
- Information freshness
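To make the criteria above concrete, here is a toy scoring function that combines them into a single number. The weights and the example pages are purely illustrative assumptions; real ranking algorithms are far more sophisticated and, as noted, not public.

```python
def score_page(query, page):
    """Toy relevance score combining the four criteria listed above.

    The weights (3.0, 2.0, 0.1, 1.0) are illustrative guesses,
    not any real engine's values.
    """
    terms = query.lower().split()
    words = page["text"].lower().split()
    title = page["title"].lower()

    title_hits = sum(1 for t in terms if t in title)               # web page titles
    term_freq = sum(words.count(t) for t in terms) / max(len(words), 1)  # word usage frequency
    links = page["inbound_links"]                                  # links pointing at the page
    freshness = 1.0 / (1 + page["age_days"])                       # information freshness

    return 3.0 * title_hits + 2.0 * term_freq + 0.1 * links + 1.0 * freshness

pages = [
    {"title": "Guitar lessons", "text": "learn guitar fast", "inbound_links": 5, "age_days": 30},
    {"title": "Keyboard reviews", "text": "the best keyboard for typing", "inbound_links": 12, "age_days": 2},
]
ranked = sorted(pages, key=lambda p: score_page("keyboard", p), reverse=True)
print(ranked[0]["title"])
# -> Keyboard reviews
```

Sorting candidate pages by such a score is, in miniature, what the ranking component does before the results reach the browser.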
Some people use inappropriate or unethical means to make web pages appear higher within the search results than they should. The pages may not even contain relevant information. This practice is referred to as “spamdexing,” and those who engage in it are sometimes known as spammers.
Spamdexing and search engine optimization differ subtly. SEO focuses on adding results that are rich in quality to search engine results, and it works to get a page listed based on its own merit. Spamdexing works to deceive and mislead search engines into inappropriately listing the site in search queries. However, if a search engine determines that a page is participating in spamdexing, that page can be blacklisted from the search engine, and it will not be considered within results in the future. This is to ensure that pages are listed in search results only if they contain original and high-quality content.
Search Engines in the Future
One of the problems with current search engines is that they are not able to interpret the user’s intent, even though they are highly efficient at finding and producing results. Words with dual meanings are a particular concern. Searching for “keyboards” can produce results for computer keyboards or the musical instrument, and a search engine cannot decipher the user’s intent without more information.
With the current state of search engines, users either have to enter more keywords or search through results to find relevant information. If the search engine were able to determine the user’s actual intent, this process would be much easier. This is why many search engines are beginning to move toward the process of personalization.
Search engines need a way to track users’ Internet surfing habits in order to interpret the meaning of a query. This would require the user to give up a degree of privacy in order to receive more accurate search results. If a search engine were able to tell that a user searched through a variety of results and selected one about a Casio digital keyboard, it could save that information and link it to that particular user. The next time the user entered a similar query, the search engine would be able to display more accurate results.
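The mechanism described here, remembering what a user clicked and biasing later results toward it, can be sketched very roughly. This is a hypothetical toy, not how any real engine personalizes results: the click store, the word-overlap boost, and the example result titles are all invented for illustration.

```python
# Hypothetical per-user click history: query term -> titles of clicked results.
clicks = {}

def record_click(term, result_title):
    """Remember which result the user chose for a given search term."""
    clicks.setdefault(term, []).append(result_title)

def personalize(query, results):
    """Move results that share words with previously clicked items toward the top."""
    history = " ".join(t for titles in clicks.values() for t in titles).lower().split()
    def boost(title):
        return sum(1 for word in title.lower().split() if word in history)
    return sorted(results, key=boost, reverse=True)

# The user previously clicked a Casio digital keyboard result...
record_click("keyboard", "Casio digital keyboard review")

# ...so on the next similar query, musical-instrument results rise to the top.
results = ["Mechanical keyboards for typing", "Casio keyboard with 61 keys"]
print(personalize("keyboard", results))
# -> ['Casio keyboard with 61 keys', 'Mechanical keyboards for typing']
```

Even this crude overlap measure resolves the “keyboards” ambiguity from the previous section once a single click has been observed, which is the appeal of personalization.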
Additionally, a more effective search engine of the future may be able to profile a user. It could ask the user questions regarding his or her living habits, demographics, and hobbies in order to produce more accurate and relevant search results. The possibilities of enhancing search engines are limitless, and future search engines could make those of the present look ancient and inefficient.