About the SpiderMonkey Web Crawling Robot and Search Engine
About Our Robot - SpiderMonkey Canadian Web Search: This page contains a wide range of general internet search robot information applicable to SpiderMonkey and to most other search engines.
Here SpiderMonkey Search users, web site promoters and web site builders can learn:
- how web crawlers work;
- how to build simple META tags to enhance page reachability and readability;
- how to write robots.txt files for the robots exclusion standard;
- how to view and understand the pre-index database;
- what an index is;
- how to search with SpiderMonkey;
- how to do boolean searches with SpiderMonkey or any search engine;
- what common search engine terminology means;
- and much more.
The SpiderMonkey search robot is registered among the top search engines of the world as a sophisticated software "engine" running from our servers to reach out across the web and fetch content for our database (SQL) indices.
SpiderMonkey began its life in the late 1980s in association with a number of academic student and faculty groups as a research and development project exploring technologies such as clustering systems, information retrieval algorithms, network programming, machine learning, heuristic determinations, the Unix kernel, Perl, C, shell, PHP, Python, Ruby, SQL and more.
SpiderMonkey search technology is not for sale, nor are any of its services.
SpiderMonkey does not hold out to compete at any level with any service.
SpiderMonkey shares technology with many small to medium sized developers for the purpose of indexing local sites for consideration of research data exchange. SpiderMonkey is also used by many web site owners/webmasters to index local sites or specific database content.
The SpiderMonkey search engine is a software programme resident on a server cluster (a group of load-sharing computers) that searches through a (usually massive) database. In the context of the World Wide Web, the term "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.
The SpiderMonkey robot crawl process is automated by our programmers to systematically traverse the World Wide Web's hypertext structure: it retrieves a document, then recursively retrieves the documents linked from within that initial target document, subject to certain criteria and discretion.
"Recursive" here doesn't limit the definition to any specific traversal algorithm.
Even though SpiderMonkey's crawl robot may be programmed to apply heuristic rules to the selection and order of the documents it visits, and spaces out its requests over a long span of time, it is still a "robot" despite any so-called "smart" behaviour. Conversely, normal Web browsers are not robots, because they are operated by a person and don't automatically retrieve referenced documents (other than inline content such as images).
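The traversal described above can be sketched in a few lines of Python. This is only an illustration, not SpiderMonkey's actual crawler: the `crawl` function, the `LinkExtractor` class, the `fetch` callback and the example.com pages are all hypothetical names, and the "web" is a small in-memory dictionary so the sketch runs without network access. It uses a breadth-first queue, which is one of many possible traversal orders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit start_url, then every document reachable from it, once each.

    fetch is a caller-supplied function (hypothetical) that returns the
    HTML for a URL, or None if the document cannot be retrieved.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable document: skip it
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/c.html#top">C</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/c.html": '<a href="/missing.html">gone</a>',
}

order = crawl("http://example.com/", PAGES.get)
print(order)
```

A real robot would also check robots.txt before each request and pause between requests to the same server; both are left out here to keep the sketch short.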
Robots are sometimes referred to as Bots, Wanderers, Crawlers, or Spiders. Although arguably apropos, these names are a little misleading for the lay person if they give the impression that the software itself moves between sites like a virus; this is not the case. The robot is software, permanently resident on its own computer, and from that computer it sends requests for website documents to the other computers (the document servers) on which the target site resides.
You can view the pre-index database
Like most search engine service providers, we store submitted URIs and URLs in a temporary database, for both quality and security reasons, before they are crawled and entered into the search engine's index. We allow interested visitors viewing access to this pre-index database. Other search engines are also allowed access to it, which simplifies the submission process. Use the "Pre-Index" engine here by entering key words or your site name, or leave the search field blank and press "Pre-Index" to see the entire list of recent submissions. You can see how others describe their sites and get some ideas for your own. If you have submitted your site using our Add Url form, you can check here to see how it looks. If you don't like it, remember that the final index entry will be derived from your web page, so spend your time working on your web page and its meta tags instead of resubmitting.
Our crawler (SpiderMonkey) visits and checks URLs during off-peak server load times and feeds the results to the index. All realms of the main database are refreshed at least once every 30 days. The temporary database is crawled regularly. Although a URI and its content are fetched from the actual site, each entry remains in the pre-index database for roughly 30 days as a record of when and how it was submitted.
robots.txt Is the Robot Exclusion Implementation (REI).
SpiderMonkey abides by the Robot Exclusion Standard. Specifically, SpiderMonkey adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard supersedes the 1994 standard, the proposed standard is followed.
SpiderMonkey will obey the first entry in the robots.txt file with a User-Agent containing "SpiderMonkey" or "Spider_Monkey". If there is no such entry, it will obey the first entry with a User-Agent of "*".
Before you submit your site for inclusion in our database (index), are there pages you don't want indexed? If so, put the following in the head of any web page you want excluded. Our crawler (SpiderMonkey) will obey this instruction and skip the document.
<META NAME="robots" CONTENT="noindex,nofollow">
The other way to warn a robot of sensitive material you don't want crawled is with a simple robots.txt file at the top level of your site (e.g.: www.domainname.com/robots.txt). It is important that every web site have a robots.txt file in the root directory, both to avoid the numerous 404 errors that robot requests for a missing file generate and to make the site more "robot-friendly".
We offer a resource for generating your robots.txt file but suggest you read and understand the following first.
# EXAMPLE robots.txt
User-agent: * # You can enter specific user-agent (spider's name) or "*" which is best
Disallow: /cgi-bin/
Disallow: /cgi-win/
Disallow: /tmp/
Disallow: /images/
Disallow: /includes/
Disallow: /public/~specific-user/
- - - - -
# EXAMPLE robots.txt to exclude a single robot
User-agent: user_agent-of-Bad_Bot_From_Hell
Disallow: /
- - - - -
# EXAMPLE robots.txt to allow a single robot
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /mybirthdaysuitpictures/
Disallow: /scripts/
- - - - -
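How a robot might apply the rule stated earlier (obey the first entry whose User-Agent contains the robot's name, otherwise the first "*" entry) can be sketched in Python. This is a simplified illustration and not SpiderMonkey's own code: the function names `parse_groups`, `select_rules` and `allowed` are hypothetical, and only Disallow prefix matching from the 1994 standard is handled (no Allow lines or wildcards).

```python
ROBOT_NAMES = ("spidermonkey", "spider_monkey")


def parse_groups(robots_txt):
    """Split robots.txt text into (user_agents, disallowed_prefixes) entries."""
    groups = []
    agents, rules, in_rules = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new entry
                groups.append((agents, rules))
                agents, rules, in_rules = [], [], False
            agents.append(value)
        elif field == "disallow" and agents:
            in_rules = True
            if value:  # an empty Disallow excludes nothing
                rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(groups, names=ROBOT_NAMES):
    """First entry naming this robot, else the first "*" entry, else no rules."""
    for agents, rules in groups:
        if any(name in agent.lower() for agent in agents for name in names):
            return rules
    for agents, rules in groups:
        if "*" in agents:
            return rules
    return []


def allowed(path, rules):
    """A path may be fetched unless it starts with a disallowed prefix."""
    return not any(path.startswith(prefix) for prefix in rules)


EXAMPLE = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /mybirthdaysuitpictures/
Disallow: /scripts/
"""

rules = select_rules(parse_groups(EXAMPLE))
print(rules)
print(allowed("/index.html", rules), allowed("/scripts/search.cgi", rules))
```

With the last example file above, no entry names SpiderMonkey, so the "*" entry applies and anything under /scripts/ is skipped.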
Do you properly use meta content tags?
Meta tags help the client browser and any other technology accessing your web pages. These tags specify information about your document and how the client should read the document content. Meta tags therefore also help search engines like SpiderMonkey read the documents on your site.
Some page header tags are extremely important for spiders. You must have a document type declaration, and you should at least set out the title and the content meta tags of the page as succinctly as possible.
If present, the description meta tag will become the default introduction to your page in the search results that SpiderMonkey users see.
You can list keywords that occur in your document in the keywords meta tag, but because of extensive abuse this tag is now widely disregarded; SpiderMonkey, for one, ignores keywords.
As mentioned above, you can use a meta tag to tell SpiderMonkey to exclude a page. Conversely, you can tell the robot to include the page.
<meta name="robots" CONTENT="index,follow">
An example of an HTML page header with meta tags follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>World Against Land Mines Alliance Introduction</title>
<meta name="author" content="World Against Land Mines Alliance">
<meta name="Description" content="The official web site of the World Against Land Mines Alliance">
<meta name="keywords" content="against land mines alliance">
<meta name="robots" content="follow, index">
</head>
An XHTML example follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en" >
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>World Against Land Mines Alliance Introduction</title>
<meta name="author" content="World Against Land Mines Alliance" />
<meta name="Description" content="The official web site of the World Against Land Mines Alliance" />
<meta name="keywords" content="against land mines alliance" />
<meta name="robots" content="follow, index" />
</head>
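To see what a robot can read from headers like these, here is a minimal Python sketch built on the standard library's `html.parser`. It is an illustration only: the `HeadReader` class and `robots_directives` function are hypothetical names, and treating a missing robots tag as "index,follow" is an assumption based on common convention, not a documented SpiderMonkey behaviour.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collects the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Meta names are case-insensitive ("Description" == "description").
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def robots_directives(meta):
    """Return the robots directives as a set of lowercase words."""
    content = meta.get("robots", "index,follow")  # assumed default
    return {word.strip().lower() for word in content.split(",") if word.strip()}


PAGE = """\
<html xml:lang="en" >
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>World Against Land Mines Alliance Introduction</title>
<meta name="Description" content="The official web site of the World Against Land Mines Alliance" />
<meta name="keywords" content="against land mines alliance" />
<meta name="robots" content="follow, index" />
</head>
"""

reader = HeadReader()
reader.feed(PAGE)
directives = robots_directives(reader.meta)
print(reader.title)
print(reader.meta.get("description"))
print("index" in directives, "follow" in directives)
```

A page carrying `noindex` would be left out of the index, and one carrying `nofollow` would not have its links followed.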
©2006 SpiderMonkey Search
Search Engine Help for doing Boolean + Simple Searches
Our search engine finds documents throughout the World Wide Web. Here's how it works: you tell our search engine what you're looking for by typing keywords, phrases, or questions in the search box. SpiderMonkey responds by giving you a list of all the Web pages in our crawler's index relating to those topics. The most relevant content will appear at the top of your results. Most foul language is ignored by SpiderMonkey, so it is not a tool for seeking porn sites.
Type the word or phrase you seek into the text-entry box. When searching, think of a word as a combination of letters and numbers. You can tell SpiderMonkey how to distinguish words and numbers you want treated differently.
Doing Boolean Searches With SpiderMonkey
What is a Web Crawler's "Index"?
SpiderMonkey's index is a large, growing, organized collection of data comprising Web pages of various types, their content and location, as well as discussion group pages from around the world. The "index" is stored on a chain of clustered computers that make up the database engine. The index becomes larger every day as people submit addresses for new Web pages and as our administrators search for new material. We own sophisticated technology that crawls the World Wide Web daily during lower server load periods looking for links to new pages. When you use the SpiderMonkey search engine, you search the entire collection using keywords or phrases, just as you would with other search engines such as Google, Excite, Yahoo or Alta Vista.
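As a rough illustration of how an index supports keyword and boolean searches, the following Python sketch builds a toy inverted index (a map from each word to the pages containing it). The page does not say how SpiderMonkey's index is actually organized, so this structure is an assumption, and the names `build_index`, `search_all`, `search_any` and the example.com pages are hypothetical. A "word" here is a run of letters and numbers, as described in the search help.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return index


def search_all(index, query):
    """Boolean AND: URLs that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def search_any(index, query):
    """Boolean OR: URLs that contain at least one word in the query."""
    result = set()
    for word in WORD.findall(query.lower()):
        result |= index.get(word, set())
    return result


PAGES = {
    "http://example.com/mines": "World Against Land Mines Alliance",
    "http://example.com/land": "Land for sale in Canada",
    "http://example.com/search": "Canadian web search",
}

index = build_index(PAGES)
print(search_all(index, "land mines"))
print(sorted(search_any(index, "mines search")))
```

A real index would also store where each word occurs and rank the results by relevance; this sketch only shows the lookup step.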