
Search Engine Spiders Lost Without Guidance - Post This Sign!

By: Mike Valentine

Published: August 17, 2007
The robots.txt file implements the robots exclusion standard, a voluntary convention that well-behaved web crawlers/robots consult to learn which files and directories you want them to stay OUT of on your site. Not all crawlers/bots follow the exclusion standard, and some will continue crawling your site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP exclusion, which is another story entirely.

This is a very simple overview of robots.txt basics for webmasters. For a complete and thorough lesson, visit http://www.robotstxt.org/

To see the proper format for a somewhat standard robots.txt file look directly below. That file should be at the root of the domain because that is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and ALWAYS with the name "robots.txt" in the root of your domain.

A quick review of the robots.txt file above follows. "User-agent: msnbot" names the crawler from MSN, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed are "Bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is a wild card that means "All" crawlers/spiders/bots should stay out of the group of files or directories listed under it.

The bots given the instruction "Disallow: /" should stay out entirely, and those given "Crawl-delay: 10" are the ones that crawled our site too quickly, bogging it down and overusing server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages. The wildcard asterisk * record is the default for any crawler, bot or spider that has no record of its own, which here includes Googlebot. Under the exclusion standard, a bot that does have its own record follows that record in place of the * record.
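To check how a rule set like the one above is read, here is a minimal sketch using Python's standard-library urllib.robotparser. The file is an abbreviated copy of the one shown earlier, and the page paths are hypothetical, chosen only for illustration. It shows how this one parser interprets the rules; individual crawlers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the robots.txt file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: psbot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without using the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Googlebot has no record of its own, so the * record applies to it.
print(rules.can_fetch("Googlebot", "/images/logo.gif"))       # False
print(rules.can_fetch("Googlebot", "/articles/robots.html"))  # True

# psbot is told to stay out of the whole site.
print(rules.can_fetch("psbot", "/articles/robots.html"))      # False

# msnbot matches its own record, which sets only a delay (Python 3.6+).
print(rules.crawl_delay("msnbot"))                            # 10
print(rules.can_fetch("msnbot", "/images/logo.gif"))          # True
```

The last line is worth a second look: because msnbot has its own record, this parser does not apply the * Disallow lines to it. If you want a named bot kept out of the same directories, repeat those Disallow lines inside that bot's record.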

The bots we gave that "Crawl-delay: 10" instruction were requesting as many as 7 pages every second, so we asked them to slow down. The number is in seconds, and you can change it to suit your server capacity, based on their crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.
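To put those two rates side by side, here is a small arithmetic sketch. It assumes a bot keeps up the same pace around the clock, which is an upper bound for illustration rather than something we measured; the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def max_requests_per_day(requests_per_second):
    """Upper bound on daily requests for a bot holding a steady rate."""
    return int(requests_per_second * SECONDS_PER_DAY)


def max_requests_per_day_with_delay(crawl_delay_seconds):
    """Upper bound on daily requests when a bot waits between requests."""
    return SECONDS_PER_DAY // crawl_delay_seconds


print(max_requests_per_day(7))              # 604800
print(max_requests_per_day_with_delay(10))  # 8640
```

On those assumptions, a ten second delay caps a single bot at roughly one seventieth of the load it could place on the server at 7 pages per second.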

(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show each page request with a precise timestamp, to within a hundredth of a second. They are available from your web host, or ask your web or IT person. If you have server access, the logs can be found in the root directory, and you can usually download compressed server log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain text raw server log files.)
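If you would rather not read the logs by eye, a short script can count requests for you. The sketch below assumes the Apache/NCSA "combined" log format, which records times to the whole second; your server's format may differ, in which case the pattern needs adjusting. The bot names and log lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache/NCSA "combined" log format. Other servers differ.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def peak_requests_per_second(lines):
    """Return {user_agent: most requests logged within any single second}."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # skip lines that do not fit the assumed format
            per_second[(match.group("agent"), match.group("time"))] += 1
    peaks = {}
    for (agent, _second), count in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), count)
    return peaks


# Hypothetical log lines, for illustration only.
sample = [
    '192.0.2.1 - - [17/Aug/2005:10:15:02 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "fastbot/1.0"',
    '192.0.2.1 - - [17/Aug/2005:10:15:02 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "fastbot/1.0"',
    '192.0.2.1 - - [17/Aug/2005:10:15:02 -0700] "GET /c.html HTTP/1.1" 200 290 "-" "fastbot/1.0"',
    '192.0.2.7 - - [17/Aug/2005:10:15:02 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "slowbot/2.1"',
    '192.0.2.7 - - [17/Aug/2005:10:15:12 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "slowbot/2.1"',
]

print(peak_requests_per_second(sample))  # {'fastbot/1.0': 3, 'slowbot/2.1': 1}
```

A compressed daily log can be read without expanding it first by opening it with gzip.open(path, "rt") from the standard library and passing the open file to the function.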

To see the contents of any robots.txt file, just type /robots.txt after any domain name. If the site has that file up, you will see it displayed as a text file in your web browser. Click on the link below to see that file for Amazon.com

http://www.Amazon.com/robots.txt

You can see the contents of any website robots.txt file that way.
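The same lookup can be scripted. This is a minimal sketch using only Python's standard library; the function names are made up for this example, and the page address passed in is hypothetical. The address must include the http:// or https:// prefix.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root of the domain, whatever the page path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download a site's robots.txt and return it as text (needs network access)."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(robots_url("http://www.Amazon.com/some/page.html"))
# http://www.Amazon.com/robots.txt
```

Calling fetch_robots with any page address on a site returns the text of that site's robots.txt, provided the site serves one.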

The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, just launched in May of 2005. We did an extensive case study and published a series of articles on crawler behavior and indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere about the importance of this often ignored little text file.

One thing we didn't expect to glean from our research into indexing delays was how much robots.txt files matter to quick and efficient crawling by the spiders from the major search engines. Nor did we expect the number of heavy crawls from bots that do no earthly good to the site owner, yet crawl most sites extensively, straining servers to the breaking point with requests coming as fast as 7 pages per second.

We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before they will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn't changed.

Then when you DO change it, they will stop crawling for brief periods and repeatedly ask for that robots.txt file during that time without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of and must now adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct the bots to stay out of "image" directories and the "cgi-bin" directory as well as any directories containing private or proprietary files intended only for users of an intranet or password protected sections of your site. Clearly, you should direct the bots to stay out of any private areas that you don't want indexed by the search engines.

The importance of robots.txt is rarely discussed by average webmasters. I've even had webmasters at some of my client businesses ask me what it is and how to implement it when I tell them how important it is to both site security and efficient crawling by the search engines. This should be standard knowledge for webmasters at substantial companies, and it illustrates how little attention is paid to the use of robots.txt.

The search engine spiders really do want your guidance and this tiny text file is the best way to provide crawlers and bots a clear signpost to warn off trespassers and protect private property - and to warmly welcome invited guests, such as the big three search engines while asking them nicely to stay out of private areas.

Copyright © August 17, 2005 by Mike Banks Valentine

Google Sandbox Case Study: http://publish101.com/Sandbox2

Mike Banks Valentine operates http://Publish101.com, Free Web Content Distribution for Article Marketers, and provides content aggregation, press release optimization and custom web content for Search Engine Positioning: http://www.seoptimism.com/SEO_Contact.htm






Copyright © 2012 InfoServe Media, LLC (DBA PopularArticles.com). All rights reserved.