Chapter 32
How Intranet Search Tools and Spiders Work
Corporate intranets can contain an almost unimaginable amount
of information. Departments, divisions, and individuals create
a wide variety of Web pages, both for internal and external consumption.
Human resource information, personnel handbooks, procedures manuals,
and newsletters are all posted internally. Databases are available
as well, both those hosted directly on the intranet and "legacy"
databases on non-TCP/IP systems. Add to that all the information
that can be reached via the Internet using the World Wide Web,
and you have a serious case of information overload.
There are several ways to help intranet users find the information
they need. One way is to create subject directories of intranet
data that present a highly structured way to find information.
They let you browse through information by categories and subcategories,
such as marketing, personnel, sales, research and development,
budget, competitors, and so on. In a Web browser, you click on
a category, and you are then presented with a series of subcategories,
such as East Coast Sales, South Sales, Midwest Sales, and West
Sales. Depending on the size of the subject directory, there may
be several such layers of subcategories. At some point, when you
get to the subcategory you're interested in, you'll be presented
with a list of relevant documents. To get those documents, you
click on links to them. On the Internet, Yahoo is the largest,
best-known, and most popular subject directory.
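A subject directory like the one described above is essentially a tree of categories. As a rough sketch, it can be modeled as a nested mapping whose leaves are lists of document links; the categories and URLs below are invented for illustration, mirroring the hypothetical sales example:

```python
# A subject directory is a tree of categories and subcategories;
# each leaf holds links to relevant documents. All categories and
# URLs here are hypothetical.
directory = {
    "Sales": {
        "East Coast Sales": ["http://intranet/sales/east/q3.html"],
        "South Sales": ["http://intranet/sales/south/q3.html"],
        "Midwest Sales": ["http://intranet/sales/midwest/q3.html"],
        "West Sales": ["http://intranet/sales/west/q3.html"],
    },
    "Personnel": {
        "Handbooks": ["http://intranet/hr/handbook.html"],
    },
}

def browse(tree, *path):
    """Descend through the named categories and return the next
    layer: either more subcategories or a list of document links."""
    node = tree
    for category in path:
        node = node[category]
    return node
```

Each `browse` call corresponds to one click in the browser: clicking "Sales" returns the regional subcategories, and clicking a region returns the list of documents.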
Another popular way of finding information-and in the long run
for intranets, probably more useful-is to use search engines,
also called search tools. Search engines operate differently from
subject directories. They are essentially massive databases that
index all the information found on the intranet-and can include
information found on the Internet as well. Search engines don't
present information in a hierarchical fashion. Instead, you search
through them as you would a database, by typing in keywords that
describe the information you want.
Intranet search engines are usually built from three components:
an agent, spider, or crawler that crawls across the intranet gathering
information; a database, which holds all the information the spiders
gather; and a search tool, which people use as an interface to
search the database. The technology is similar to that of Internet
search engines such as Alta Vista.
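The three components can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the crawler is simulated by a function fed page text directly, the database is a plain dictionary, and the search tool is a keyword-matching function. All URLs and page contents are made up.

```python
# Minimal sketch of the three components: a crawler that gathers
# pages, a database (here a plain dict) that stores what the
# crawler found, and a search function used as the interface.
database = {}  # url -> list of words found on that page

def crawl(pages):
    """Stand-in for the agent/spider: 'pages' maps URLs to text,
    as if the pages had been fetched from the intranet."""
    for url, text in pages.items():
        database[url] = text.lower().split()

def search(keyword):
    """The search tool: return every URL whose indexed words
    contain the keyword."""
    keyword = keyword.lower()
    return [url for url, words in database.items() if keyword in words]

crawl({
    "http://intranet/hr/handbook.html": "Personnel handbook and procedures",
    "http://intranet/news/weekly.html": "Weekly newsletter for all staff",
})
```

A query such as `search("handbook")` never touches the pages themselves; it consults only the database the crawler built, which is what makes search engines fast.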
Intranet search tools differ somewhat from their Internet equivalents.
The database of information they search can be built by more than
just agents and spiders crawling Web pages. Agents can be written
to go into existing corporate databases, extract data from them,
and put it into the database of searchable information. People
on an intranet can also fill out forms and submit their information
to the database. Additionally, because these tools are built for
a specific corporation and its data, both the information they
gather and the way they are searched can be customized.
Searching and cataloging tools, sometimes called search engines,
can be used to help people find the information they need. Intranet
search tools, such as agents, spiders, crawlers, and robots, are
used to gather information about the documents available on an
intranet. These search tools are programs that search Web pages,
extract the hypertext links on those pages, and automatically
index the information they find to build a database. Each search
engine has its own set of rules guiding how documents are gathered.
Some follow every link on every page that they find, and then
in turn examine every link on each of those new home pages, and
so on. Some ignore links that lead to graphics files, sound files,
and animation files; some ignore links to certain resources such
as WAIS databases; and some are instructed to look primarily for
the most popular home pages.
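The core of that gathering step, extracting the hypertext links from a page, can be sketched with Python's standard HTML parser. The skip list and the sample page below are invented for illustration; real spiders apply much richer rules, as the text describes:

```python
from html.parser import HTMLParser

# Sketch of the link-extraction step: pull every hypertext link
# out of a page, skipping links to graphics, sound, and animation
# files as some engines do. Suffix list and page are hypothetical.
SKIP_SUFFIXES = (".gif", ".jpg", ".wav", ".au", ".avi")

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the hypertext links in their href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and not value.lower().endswith(SKIP_SUFFIXES):
                    self.links.append(value)

page = """<html><body>
<a href="/sales/report.html">Sales report</a>
<a href="/images/logo.gif">Logo</a>
<a href="/hr/handbook.html">Handbook</a>
</body></html>"""

extractor = LinkExtractor()
extractor.feed(page)
```

A spider that "follows every link on every page" would simply fetch each URL in `extractor.links` and repeat the process on the pages it gets back.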
- Agents are the "smartest" of the tools. They can
do more than just search out records: They can perform transactions
on your behalf, such as eventually finding and ordering the lowest-fare
airline ticket for your vacation. Right now they can search sites
for particular recordings and return a list of five sites, sorted
by the lowest price first. Agents can cope with the context of
the content. Agents can find and index other kinds of intranet
resources, not just Web pages. They can also be programmed to
extract records from legacy databases. Whatever information the
agents index, they send back to the search engine's database.
- General searchers are commonly known as spiders. Spiders report
the content found. They index the information they find and extract
summary information. They look at headers and at some of the links
and send an index of the information to the search engine's database.
There is some overlap between the tools-spiders can be robots,
for example.
- Crawlers look at headers and report first layer links only.
Crawlers can be spiders.
- Robots can be programmed to go to various link depths, compile
the index, and even test the links. Because of their nature, they
can get stuck in loops, and they consume considerable Web resources
going through the system. There are methods available to prevent
robots from searching your site.
- Agents extract and index different kinds of information. Some,
for example, index every single word in each document, while others
index only the most important 100 words in each; some index the
size of the document and number of words in it; some index the
title, headings and subheadings, and so on. The kind of index
built will determine what kind of searching can be done with the
search engine, and how the information will be displayed.
- Agents can also go out to the Internet and find information
there to put in the search engine's database. Intranet administrators
can decide which sites or kinds of sites the agents should visit
and index-for example, competitors to the corporation or news
sources. The information is indexed and sent to the search engine's
database in the same way as is information found on the intranet.
- Individuals can put information into the index by filling
out a form about the data they want put in. That data is then
put into the database.
- When someone wants to find information available on the intranet,
they visit a Web page and fill out a form detailing the information
they're looking for. Keywords, dates, and other criteria can be
used. The criteria in the search form must match the criteria
used by the agents for indexing the information they found while
crawling the intranet.
- The database is searched, based on the information specified
in the fill-out form, and a list of matching documents is prepared
by the database. The database then applies a ranking algorithm
to determine the order in which the list of documents will be
displayed. Ideally, the documents most relevant to a user's query
will be placed highest on the list. Different search engines use
different ranking algorithms. The database then tags the ranked
list of documents with HTML and returns it to the individual requesting
it. Different search engines also choose different ways of displaying
the ranked list of documents-some just provide URLs; some show
the URL as well as the first several sentences of the document;
and some show the title of the document as well as the URL.
- When you click on a link to one of the documents you're interested
in, that document is retrieved from where it resides. The document
itself is not in the database or on the search engine site.
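The searching and ranking steps above can be sketched together. This is a deliberately simple stand-in: it ranks documents by how often the query's keywords appear, where real engines use far more elaborate ranking algorithms, and the indexed documents are invented for illustration:

```python
# Sketch of the search-and-rank steps: look up the query's keywords
# in the index, score matching documents by keyword frequency, and
# return the list with the most relevant documents first.
# URLs and contents are hypothetical.
index = {
    "http://intranet/sales/east.html": "east coast sales rose sales up".split(),
    "http://intranet/hr/handbook.html": "personnel handbook procedures".split(),
    "http://intranet/news/weekly.html": "weekly sales newsletter".split(),
}

def ranked_search(query):
    """Return matching URLs, highest-scoring first, the way a
    ranking algorithm orders the list before it is tagged with
    HTML and returned to the user."""
    scores = {}
    for keyword in query.lower().split():
        for url, words in index.items():
            hits = words.count(keyword)
            if hits:
                scores[url] = scores.get(url, 0) + hits
    return sorted(scores, key=scores.get, reverse=True)
```

Note that the function returns only URLs: as the text says, the documents themselves stay where they reside, and the search engine hands back a ranked list of pointers to them.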

Copyright 1998
EarthWeb Inc., All rights reserved.
Copyright 1998 Macmillan Computer Publishing. All rights reserved.