Where did the spider and the web come from?
How Do Search Engines Really Work?
by Suzon Walton
Publisher of Connected Now
Connected Now currently does not list every business in Sacramento,
because our listings come from a database of paid subscribers
(those who wish to pay to be listed). We are much like
Google's sponsored advertisers. We do offer tips and
tricks to find what you are looking for, while we wait
for the remaining businesses in the area that have
websites to jump on board.
Search engines for the general
web do not really search the World Wide Web directly.
Each one searches a database of the full text of web
pages selected from the billions of web pages out there
residing on servers. When you search the web using a
search engine, you are always searching a stale copy
of the real web page. When you click on links provided
in a search engine's search results, you retrieve from
the server the current version of the page.
Search engine databases are selected and built by computer
robot programs called spiders. Although it is said they
"crawl" the web in their hunt for pages to include,
in truth they stay in one place. They don't have eight
legs but millions. They find the pages for potential
inclusion by following the links in the pages they already
know about. They cannot think or type a URL or use judgment
to "decide" to go look something up to see what's there.
They are still brainless and only do what the humans
type (for now).
If a web page is never linked to in any other page,
search engine spiders can't find it. The only way a
brand new page - one that no other page has ever linked
to - can get into a search engine is for its URL to
be sent by some human to the search engine companies
as a request that the new page be included. All search
engine companies offer ways to do this. The process
of sending this info is commonly called "search engine
submission". Experts in this field of submitting your
web pages can command a high price for the job. Computer
programs are now making this easy. We will cover search
engine submission in next month's issue of Connected Now.
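The link-following behavior described above, and the reason an unlinked page needs to be submitted by hand, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny LINKS table and the example URLs are made up, and a real spider fetches pages over the network rather than reading a dictionary.

```python
from collections import deque

# Hypothetical miniature "web": each page URL maps to the links found on it.
LINKS = {
    "a.example/home": ["a.example/about", "b.example/news"],
    "a.example/about": ["a.example/home"],
    "b.example/news": ["b.example/archive"],
    "b.example/archive": [],
    "c.example/orphan": ["a.example/home"],  # links out, but nothing links to it
}

def crawl(seeds, links):
    """Return every page reachable from the seed URLs by following links."""
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found

# Starting from a page the spider already knows about.
found = crawl(["a.example/home"], LINKS)
print(sorted(found))
print("c.example/orphan" in found)

# Submitting a URL by hand amounts to adding it to the list of known pages.
found_after_submission = crawl(["a.example/home", "c.example/orphan"], LINKS)
print("c.example/orphan" in found_after_submission)
```

The orphan page links to others, but because nothing links to it, the spider only reaches it once a human adds its URL to the starting list.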
After spiders find pages, they pass them on to another
computer program for "indexing." This program identifies
the text, links, and other content in the page and stores
it in the search engine database's files so that the
database can be searched by keyword and whatever more
advanced approaches are offered, and the page will be
found if your search matches its content. Some types
of pages and links are excluded from most search engines
by policy. Others are excluded because search engine
spiders cannot access them. Pages that are excluded
are referred to as the "Invisible Web" -- what you don't
see in search engine results. The Invisible Web is estimated
to be two to three or more times bigger than the visible
web.
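The indexing step described above can be pictured as building a lookup table from words to pages. The sketch below is a simplified illustration, not how any particular search engine stores its data; the PAGES table and its text are invented for the example, and real indexers also record links, word positions, and much more.

```python
import re
from collections import defaultdict

# Hypothetical pages already fetched by a spider: URL -> page text.
PAGES = {
    "a.example/home": "Sacramento business directory and local search tips",
    "b.example/news": "Search engine news for Sacramento readers",
    "b.example/archive": "Archive of older news",
}

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every word of the query (an implicit AND)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(PAGES)
print(sorted(search(index, "sacramento search")))
print(sorted(search(index, "news")))
print(sorted(search(index, "invisible")))  # a word on no indexed page finds nothing
```

A page that was never fetched and indexed has no entries in the table, so no query can return it. That is the situation of pages on the Invisible Web.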
Google has one of the largest databases of Web pages,
including many other types of Web documents (e.g., PDFs,
Word, PowerPoint or Excel documents). Despite the presence
of many advertisements and considerable clutter, Google's
popularity ranking often makes pages worth looking at
rise near the top of search results. Google alone is
not sufficient, however. Less than half the searchable
Web is fully searchable in Google. Overlap studies show
that about half of the pages in any search engine database
exist only in that database. Getting a second opinion
is therefore often worthwhile. For a second opinion,
we recommend Teoma (www.teoma.com/),
Vivisimo (www.vivisimo.com/, a
meta-search engine that indirectly searches three huge
search engine databases), or Yahoo! Search (www.yahoo.com/).
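To make the "second opinion" idea concrete, here is a small sketch of what a meta-search tool does in principle: combine ranked lists from several engines and drop duplicates. The engine names and result URLs are hypothetical, and the interleaving rule is just one simple choice, not the method used by Vivisimo or any other product.

```python
# Hypothetical result lists from three engines for the same query.
RESULTS = {
    "engine_a": ["x.example/1", "x.example/2", "y.example/1"],
    "engine_b": ["x.example/2", "z.example/1"],
    "engine_c": ["y.example/1", "z.example/1", "w.example/1"],
}

def merge_results(results):
    """Interleave the engines' ranked lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    longest = max((len(urls) for urls in results.values()), default=0)
    for rank in range(longest):
        for engine in results:
            urls = results[engine]
            if rank < len(urls) and urls[rank] not in seen:
                seen.add(urls[rank])
                merged.append(urls[rank])
    return merged

def unique_share(results, engine):
    """Fraction of one engine's URLs that no other engine returned."""
    others = set()
    for name, urls in results.items():
        if name != engine:
            others.update(urls)
    own = set(results[engine])
    return len(own - others) / len(own) if own else 0.0

merged = merge_results(RESULTS)
print(merged)
for name in RESULTS:
    print(name, round(unique_share(RESULTS, name), 2))
```

The unique_share figure is the kind of number an overlap study reports: the higher it is for each engine, the more a second search elsewhere is likely to turn up pages the first one missed.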
YAHOO
Yahoo! is one of the best known and most popular Internet portals. Originally just a subject directory, it now is a search engine, directory, and portal. To go to the Yahoo! portal and main starting point, use www.yahoo.com. For direct access to the search engine, use search.yahoo.com and for the directory use dir.yahoo.com.
Strengths:
- A very large, new (as of Feb. 2004) search engine database
- Includes cached copies of pages
- Also includes links to the Yahoo! directory
- Supports full Boolean searching
Weaknesses:
- Lack of some advanced search features such as truncation
- Only indexes first 500 KB of a Web page (still more than Google's 101KB)
- Link searches require the inclusion of the http://
- Includes some pay-for-inclusion sites
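For readers unfamiliar with two of the terms in these lists, Boolean searching combines terms with AND, OR, and NOT, while truncation matches every word that starts with a given stem. The sketch below illustrates both with Python sets. The INDEX table, the page numbers, and the trailing "*" wildcard are assumptions made for the example, not Yahoo! syntax.

```python
# Hypothetical inverted index: word -> set of page IDs containing it.
INDEX = {
    "search": {1, 2, 3},
    "searching": {4},
    "engine": {1, 3},
    "directory": {2, 3},
    "portal": {5},
}

def lookup(term):
    """Exact term, or every word sharing the prefix when the term ends in '*'."""
    if term.endswith("*"):
        prefix = term[:-1]
        pages = set()
        for word, ids in INDEX.items():
            if word.startswith(prefix):
                pages |= ids
        return pages
    return set(INDEX.get(term, set()))

# Boolean: (search AND engine) NOT directory
boolean_hits = (lookup("search") & lookup("engine")) - lookup("directory")
# Boolean: engine OR portal
either_hits = lookup("engine") | lookup("portal")
# Truncation: search* matches both "search" and "searching"
truncated_hits = lookup("search*")

print(sorted(boolean_hits))
print(sorted(either_hits))
print(sorted(truncated_hits))
```

An engine without truncation behaves like the exact-term branch only, so a search for "search" would never return page 4.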
TEOMA
Debuting in spring 2001 and relaunching in April 2002, this new search engine has built its own database and offers some unique search features. It was bought by Ask Jeeves in Sept. 2001. It lacks full Boolean and other advanced search features, but it has more recently expanded and improved its search features and added an advanced search.
Strengths:
- Identifying metasites
- Refine feature to focus on Web communities
Weaknesses:
- Smaller database
- No free URL submission
- No ability to uncluster results to easily see more than two hits per site
- No cached copies of pages
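The clustering weakness above refers to result lists that show only a couple of hits per site. A rough sketch of that behavior follows, assuming a limit of two per host; the URLs and the cluster_by_site helper are hypothetical and do not describe Teoma's actual code.

```python
from urllib.parse import urlsplit

def cluster_by_site(urls, max_per_site=2):
    """Keep at most max_per_site results per host, preserving rank order."""
    shown, hidden, counts = [], [], {}
    for url in urls:
        host = urlsplit(url).netloc
        counts[host] = counts.get(host, 0) + 1
        if counts[host] <= max_per_site:
            shown.append(url)
        else:
            hidden.append(url)
    return shown, hidden

# Hypothetical ranked results, most relevant first.
RANKED = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
    "http://a.example/4",
]

shown, hidden = cluster_by_site(RANKED)
print(shown)
print(hidden)  # what an "uncluster" option would reveal
```

Without a way to uncluster, the pages in the hidden list stay out of sight even though they matched the query.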
GOOGLE
Google has become the favorite Web search engine for many users. Since Feb. 1999, Google has made its mark with its relevance ranking based on link analysis, cached pages, and aggressive growth. Today, it boasts over 4 billion indexed pages, unindexed URLs, and other file formats.
Strengths:
- Size and scope: It is now the largest, and includes PDF, DOC, PS, and many other file types
- Relevance based on sites' linkages and authority
- Cached archive of Web pages as they looked when indexed
- Additional databases: Google Groups, News, Directory, etc.
Weaknesses:
- Limited search features: no nesting, no truncation, does not support full Boolean
- Link searches must be exact and are incomplete
- Only indexes first 101 KB of a Web page and about 120 KB of PDFs
- May search for plural/singular, synonyms, and grammatical variants without telling you
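Ranking "based on sites' linkages" means, roughly, that a page scores higher when many pages, especially well-regarded ones, link to it. The sketch below is a simplified PageRank-style calculation offered as an illustration only. It is not Google's actual ranking, and the page names, the damping value of 0.85, and the iteration count are assumptions chosen for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: every link target also appears as a key.
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],  # nothing links to "c"
}

ranks = link_rank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "hub" comes out on top because three pages link to it, while "c", which nothing links to, keeps only the baseline score.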
MSN SEARCH
MSN Search is one of the search engines for the MSN portal site. It uses an Inktomi database. The basic search screen only shows a few options, but by choosing the Advanced Search link, the full range of search features is displayed. This review discusses the full set of options, some of which are only available in the Advanced Search.
Databases: MSN Search uses LookSmart for its directory and Inktomi for its search engine database. Its sponsored sites (ads) are from Overture. MSN Featured Sites and Directory results come first from the basic search screen. The Advanced Search only displays Inktomi results. Before Sept. 2002, it used to include a link to Direct Hit results. In the Featured Sites section, results may come from MSN destination sites, MSN Encarta, and/or MSN ad partners. Note that MSN will not retrieve adult content, and that searches on terms such as 'sex' will give no results but will link to an adult search engine.
HOTBOT
HotBot, owned by Terra/Lycos, is one of the older Web search engines. Originally it just used the Inktomi database and then added Direct Hit and the Open Directory.
Strengths:
- Advanced searching capabilities
- Quick check of three major databases
- Advanced search help
Weaknesses:
- Does not include all advanced features of each of the four databases
- No cached copies of pages
- Only displays a few hits from each domain with no access to the rest in Inktomi
- Same ads at the top push regular results below the fold
- Should have a file type limit for PDF, MS Word, PowerPoint, and Excel files
Contact Suzon Walton at
suzon@connectednow.com