About the SpiderMonkey Web Crawling Robot and Search Engine

About Our Robot - SpiderMonkey Canadian Web Search: This page contains a wide range of general internet search robot information applicable to SpiderMonkey and to most other search engines.
Here, both SpiderMonkey Search users and web site promoters and builders can learn:

how web crawlers work;

how to build simple META tags to enhance page reachability and readability;

how to write robots.txt files for the robots exclusion standard;

how to view and understand the pre-index database;

what an index is;

how to search with SpiderMonkey;

how to do boolean searches with SpiderMonkey or any search engine;

search engine terminology;

and much more.

The SpiderMonkey search robot is registered among the top search engines of the world. It is a sophisticated software "engine" that runs on our servers, reaching out across the web to fetch content for our database (SQL) indices.

SpiderMonkey began its life in the late 1980s in association with a number of academic student and faculty groups as a research and development project exploring technologies such as clustering systems, information retrieval algorithms, network programming, machine learning, heuristic determinations, the Unix kernel, Perl, C, shell, PHP, Python, Ruby, SQL and more.

SpiderMonkey search technology is not for sale, nor are any of its services.

SpiderMonkey does not hold out to compete at any level with any service.

SpiderMonkey shares technology with many small to medium sized developers for the purpose of indexing local sites for consideration of research data exchange. SpiderMonkey is also used by many web site owners/webmasters to index local sites or specific database content.

The SpiderMonkey search engine is a software programme resident on a server cluster (a group of load-sharing computers) that searches through a (usually massive) database. In the context of the World Wide Web, the term "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

The SpiderMonkey robot crawl process is automated by our programmers to systematically traverse the World Wide Web's hypertext structure: the robot retrieves a document and then, subject to certain criteria and discretion, recursively retrieves the documents that are linked from within it.
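
The traversal described above can be illustrated with a minimal sketch. It is not SpiderMonkey's actual code: the breadth-first order, the `crawl` and `LinkExtractor` names, and the small in-memory `FAKE_WEB` site (used here in place of real HTTP requests) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Visit pages breadth-first from seed_url.

    fetch(url) must return the page's HTML, or None if it cannot be retrieved.
    """
    queue = deque([seed_url])
    seen = {seed_url}          # every URL ever queued, to avoid loops
    visited = []               # URLs actually retrieved, in visit order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site standing in for the real web.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.com/b.html": '<a href="/missing.html">gone</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(pages)
```

A real robot would also consult robots.txt before each request and pause between requests; both are left out here to keep the traversal itself visible.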

"Recursive" here doesn't limit the definition to any specific traversal algorithm.

Even though SpiderMonkey's crawl robot might be programmed to apply some heuristic rules to the selection and order of the documents it will visit, and to space out its requests over a long span of time, it is still a "robot" despite any so-called "smart" behaviour. By contrast, normal Web browsers are not robots because they are operated by a person and don't automatically retrieve referenced documents.

Robots are sometimes referred to as Bots, Wanderers, Crawlers, or Spiders. Although arguably apropos, these names are a little misleading for the lay person if they give the impression that the software itself moves between sites like a virus; this is not the case. The robot is software, permanently resident on its own computer, and from that computer it sends requests for website documents to the other computers (the document servers) on which the target site is resident.

robots.txt Is the Robot Exclusion Implementation (REI).

SpiderMonkey abides by the Robot Exclusion Standard. Specifically, SpiderMonkey adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard supersedes the 1994 standard, the proposed standard is followed.

SpiderMonkey will obey the first entry in the robots.txt file with a User-Agent containing "SpiderMonkey" or "Spider_Monkey". If there is no such entry, it will obey the first entry with a User-Agent of "*".
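
The selection rule just described can be sketched as follows. This is an illustration only, not the robot's real parser: the function names `parse_records` and `select_record` and the sample file are hypothetical, and only the User-agent and Disallow fields are handled.

```python
def parse_records(robots_txt):
    """Split robots.txt text into (user_agents, disallow_paths) records."""
    records, agents, paths = [], [], []
    for raw in robots_txt.splitlines():
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue                      # whole-line comments are ignored
        line = stripped.split("#", 1)[0].strip()
        if not line:                      # a blank line ends the current record
            if agents:
                records.append((agents, paths))
            agents, paths = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            paths.append(value)
    if agents:
        records.append((agents, paths))
    return records


def select_record(records, names=("spidermonkey", "spider_monkey")):
    """Return the Disallow list of the first record naming the robot,
    otherwise of the first "*" record, otherwise an empty list."""
    for agents, paths in records:
        if any(name in agent.lower() for agent in agents for name in names):
            return paths
    for agents, paths in records:
        if "*" in agents:
            return paths
    return []


EXAMPLE = """\
# sample file
User-agent: *
Disallow: /tmp/

User-agent: Spider_Monkey
Disallow: /private/
"""

print(select_record(parse_records(EXAMPLE)))
```

Note that the record naming the robot wins even though the "*" record appears first in the file.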

Before you submit your site for inclusion in any search engine's database (index), are there pages you don't want indexed? If so, put the following in the head of any web page you want excluded. Our crawler (SpiderMonkey) will obey this instruction and skip the document.

<META NAME="robots" CONTENT="noindex,nofollow">
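
As a rough illustration of how a crawler might read this tag, the sketch below uses Python's standard `html.parser` module. The class and function names are hypothetical, and treating a missing tag as permission to index and follow is an assumption of the example, not a statement about SpiderMonkey's internals.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lower-cases tag and attribute names for us.
        attrs = {name: (value or "") for name, value in attrs}
        if attrs.get("name", "").lower() == "robots":
            for item in attrs.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head></html>'
directives = robots_directives(page)
may_index = "noindex" not in directives      # False for this page
may_follow = "nofollow" not in directives    # False for this page
print(may_index, may_follow)
```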

The other way to warn a robot of sensitive material you don't want crawled is with a simple robots.txt file in the top-level (root) directory of your site (i.e. www.domainname.com/robots.txt). It is important that every web site have a robots.txt file in the root directory, both to avoid the numerous 404 errors that robots' requests for a missing file would otherwise cause and to make the site more "robot-friendly".
We offer a resource for generating your robots.txt file but suggest you read and understand the following first.

# EXAMPLE robots.txt
User-agent: * # You can enter specific user-agent (spider's name) or "*" which is best
Disallow: /cgi-bin/
Disallow: /cgi-win/
Disallow: /tmp/
Disallow: /images/
Disallow: /includes/
Disallow: /public/~specific-user/
- - - - -
# EXAMPLE robots.txt to exclude a single robot
User-agent: user_agent-of-Bad_Bot_From_Hell
Disallow: /
- - - - -
# EXAMPLE robots.txt to give a single robot full access
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /mybirthdaysuitpictures/
Disallow: /scripts/
- - - - -
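
Before publishing a robots.txt file, you can check how a standards-following robot would read it. The sketch below uses `urllib.robotparser` from Python's standard library. The file content and the www.example.com URLs are made up for the example, and the results show how that module interprets the rules, which may differ in detail from any particular robot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse text directly, no network access

# A robot with no record of its own falls back to the "*" record.
home_ok = parser.can_fetch("SpiderMonkey", "http://www.example.com/index.html")
cgi_ok = parser.can_fetch("SpiderMonkey", "http://www.example.com/cgi-bin/form.pl")

# An empty Disallow line means nothing is off limits for that robot.
googlebot_cgi_ok = parser.can_fetch("Googlebot", "http://www.example.com/cgi-bin/form.pl")

print(home_ok, cgi_ok, googlebot_cgi_ok)
```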

Do you properly use meta content tags?

Meta tags are used to help the client browser and any other technology accessing your web pages. These tags specify information about your document and how the client should read the document content. Meta tags therefore are also to help search engines like SpiderMonkey read documents on your site.

Some page header tags are extremely important for the spiders. You must have a document type declaration (DOCTYPE), and you should at least set out the title and content meta tags of the page as succinctly as possible.

If present, the description meta tag (content.description) will become the default introduction to your page in the search results that SpiderMonkey users see.

You can specify keywords (content.keywords) that occur in your document, but because of extensive abuse this tag is now widely ignored by search engines. SpiderMonkey, too, ignores keywords.

As mentioned above, you can use a meta tag to tell SpiderMonkey to exclude a page. Conversely, you can tell the robot to include the page.

<meta name="robots" CONTENT="index,follow">
An example of an HTML page header with meta tags follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>World Against Land Mines Alliance Introduction</title>
<meta name="author" content="World Against Land Mines Alliance">
<meta name="Description" content="The official web site of the World Against Land Mines Alliance">
<meta name="keywords" content="against land mines alliance">
<meta name="robots" content="follow, index">
</head>

An XHTML example follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>World Against Land Mines Alliance Introduction</title>
<meta name="author" content="World Against Land Mines Alliance" />
<meta name="Description" content="The official web site of the World Against Land Mines Alliance" />
<meta name="keywords" content="against land mines alliance" />
<meta name="robots" content="follow, index" />
</head>
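
To see what a robot can pick up from headers like the ones above, here is a minimal sketch that pulls out the title and the description meta tag, the two items most likely to appear in a search result listing. The `HeadSummary` class and `summarize` function are hypothetical names, and the sketch makes no claim about how SpiderMonkey stores this information.

```python
from html.parser import HTMLParser


class HeadSummary(HTMLParser):
    """Records the <title> text and all named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()   # "Description" -> "description"
            if name:
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, description) for a page; description may be empty."""
    parser = HeadSummary()
    parser.feed(html)
    return parser.title.strip(), parser.meta.get("description", "")


PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<title>World Against Land Mines Alliance Introduction</title>
<meta name="Description" content="The official web site of the World Against Land Mines Alliance">
<meta name="robots" content="follow, index">
</head>
</html>"""

title, description = summarize(PAGE)
print(title)
print(description)
```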

How to do a SpiderMonkey Search

Here is a sample of what you see when you visit SpiderMonkey:

[ Home | Bookmark | Help | Add URL ]

SpiderMonkey searched for george bush

© SpiderMonkey Search

In this example the words "George Bush" were searched for. Note that the phrase is wrapped in quotation marks to make sure SpiderMonkey looks for the words as a phrase and not as two separate words. You can try it either way. Try a search from here: just click "SpiderMonkey Go". Alternatively, use your mouse to select "George Bush" in the "terms" box and replace it with as many words as you like, separated by spaces.

Search Engine Help for doing Boolean + Simple Searches

Our search engine finds documents throughout the World Wide Web. Here's how it works: you tell our search engine what you're looking for by typing keywords, phrases, or questions in the search box. SpiderMonkey responds by giving you a list of all the Web pages in our crawler's index (see SpiderMonkey technical details) relating to those topics. The most relevant content will appear at the top of your results.

Most foul language is ignored by SpiderMonkey, so it is not a tool for seeking porn sites.

Type the word or phrase you seek into the text-entry box. When searching, think of a word as a combination of letters and numbers. You can tell SpiderMonkey how to distinguish words and numbers you want treated differently.
You can link words and numbers together into phrases if you want specific words or numbers to appear together in your search results pages. If you want to find an exact phrase or full name, use "quotation marks" around the phrase when you enter words in the search box. This tells SpiderMonkey to match your word pattern exactly and search for the phrase set out in quotation marks.
You can ban words using "not", or you can indicate must-have words with "and".
Excluding words. If you want to search for a word or phrase but want SpiderMonkey to exclude pages having certain words, simply type: "keyword and keyword not word". Below are some more examples.

Doing Boolean Searches With SpiderMonkey
First, click on the link that says either "Boolean Web Search" or "Site Search".

using and, or
EXAMPLE: Micheal or Michael and mike
using and, or, not
EXAMPLE: Micheal or Michael and mike not mikey
using and, or, not with "phrases"
EXAMPLE: Micheal and Michael or mike not "mike smith"
EXAMPLE: mp3 or midi and music and free not "wav files"
EXAMPLE: mp3s or mp3 or midi and "free music" not "wav files"
using and, or, not with "phrases" and *
EXAMPLE: Micheal and Michael or mike not "mike* plumb*"
(Translation: search for Michael spelled either way, or for Mike, but don't include pages where Mike, Mike's or Mikey's is followed by Plumbing or Plumber.)
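
To make the operators concrete, here is a toy evaluator for queries of this shape. It is only a sketch: it assumes operators are applied strictly left to right with no precedence, which may not be how SpiderMonkey itself combines them, and the four sample documents and all function names are invented for the example.

```python
import re
import shlex
from fnmatch import fnmatchcase


def words_of(text):
    """Lower-case a text and split it into words (apostrophes are kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_matches(term, text):
    """True if a word, a multi-word phrase, or a wildcard* term occurs in text."""
    words = words_of(text)
    parts = term.lower().split()
    n = len(parts)
    if n == 0:
        return False
    # Slide a window of n consecutive words over the text; '*' is a wildcard.
    return any(
        all(fnmatchcase(word, part) for word, part in zip(words[i:i + n], parts))
        for i in range(len(words) - n + 1)
    )


def boolean_search(query, docs):
    """Evaluate 'term op term op term ...' left to right over docs."""
    tokens = shlex.split(query)          # keeps "quoted phrases" together
    result = {doc_id for doc_id, text in docs.items() if term_matches(tokens[0], text)}
    i = 1
    while i + 1 < len(tokens):
        op, term = tokens[i].lower(), tokens[i + 1]
        hits = {doc_id for doc_id, text in docs.items() if term_matches(term, text)}
        if op == "and":
            result &= hits               # must-have word
        elif op == "or":
            result |= hits               # alternative word
        elif op == "not":
            result -= hits               # banned word
        else:
            raise ValueError("unknown operator: " + op)
        i += 2
    return result


DOCS = {
    1: "Michael Smith plays guitar",
    2: "Mike Jones sells free music",
    3: "Mike runs Mikey's Plumbing",
    4: "Micheal and Michael are common spellings",
}

found = boolean_search('Micheal and Michael or mike not "mike* plumb*"', DOCS)
print(sorted(found))
```

With these sample documents, document 1 is left out because it contains only one of the two spellings joined by "and", and document 3 is removed by the "not" phrase.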

What is a Web Crawler's "Index"?

SpiderMonkey's index is a large, growing, organized collection of data made up of Web pages of various types, their content and location, as well as discussion group pages from around the world. The "index" is stored on a chain of clustered computers comprising the database engine. The "index" becomes larger every day as people submit addresses for new Web pages and as our administrators search for new material. We own sophisticated technology that crawls the World Wide Web daily during lower server load periods looking for links to new pages. When you use the SpiderMonkey search engine, you search the entire collection using keywords or phrases, just as with other search engines such as Google, Excite, Yahoo or Alta Vista.
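
The idea of an index can be shown with a very small "inverted index", a common structure that maps each word to the pages containing it. This is a generic sketch, not a description of SpiderMonkey's SQL database; the page texts, URLs and function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


PAGES = {
    "http://example.com/mines": "World Against Land Mines Alliance",
    "http://example.com/music": "Free music and land records",
}

index = build_index(PAGES)
print(sorted(search(index, "land")))
print(sorted(search(index, "land mines")))
```

Because the words are looked up in the index rather than in the pages themselves, a query never has to re-read the documents that were crawled.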

Some Terminology Related To Search Engines

Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.

Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself.

Full-text index: An index containing every word of every document catalogued, including stop words (defined below).

Fuzzy search: A search that will find matches even when words are only partially spelled or misspelled.

Index: The searchable catalogue of documents created by search engine software. Also called "catalogue." Index is often used as a synonym for search engine.

Keyword search: A search for documents containing one or more words that are specified by a user.

Phrase search: A search for documents containing an exact sentence or phrase specified by a user.

Precision: The degree to which the documents a search engine lists actually match a query. The larger the share of listed documents that match, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%.

Proximity search: A search where users can specify that documents returned should have the words near each other.

Query-By-Example: A search where a user instructs an engine to find more documents that are similar to a particular document. Also called "find similar."

Recall: Related to precision, this is the degree to which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may only find 80 of them. It would then list these 80 and have a recall of 80%.

Relevancy: How well a document provides the information a user is looking for, as measured by the user.

Spider: The software that scans documents and adds them to an index by following links. Spider is often used as a synonym for search engine.

Stemming: The ability for a search to include the "stem" of words. For example, stemming allows a user to enter "swimming" and get back results also for the stem word "swim."

Stop words: Conjunctions, prepositions and articles and other words such as AND, TO and A that appear often in documents yet alone may contain little meaning.

Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
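
The Precision and Recall entries above each come down to a single division. The short sketch below reproduces the two worked figures from those entries; the function and parameter names are ours, chosen for the example.

```python
def precision(listed, matching_listed):
    """Share of the listed documents that really match the query."""
    return matching_listed / listed


def recall(matching_found, matching_in_collection):
    """Share of all matching documents in the collection that were found."""
    return matching_found / matching_in_collection


# Precision entry: 80 documents listed, only 20 of them contain the search words.
p = precision(listed=80, matching_listed=20)

# Recall entry: 100 matching documents exist, the engine finds 80 of them.
r = recall(matching_found=80, matching_in_collection=100)

print(f"precision = {p:.0%}, recall = {r:.0%}")
```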

Creative Technologies
Level 4 Support Provider for SpiderMonkey.ca