Robots.txt Ultimate Guide
The robots.txt file is a special tool used by websites to tell any internet robots that might visit what they can and cannot do.
For example, before Googlebot visits your website, it reads the robots.txt file to see where it can and can't go and what information it can collect. Of course, it does this because it represents an established company that cares about its reputation.
If some internet scammers created ScamBot5000, it probably wouldn't even read the robots.txt file — except maybe to see where you did not want it searching.
So Why Use Robots.txt?
Given that robots don't have to abide by what is in the robots.txt file, it can seem like a waste of time. But it actually can be very important. Sure, spambots will come onto your website and post useless comments, but that's a different issue with different solutions. The robots.txt file is used to help search engines and archivers know how to navigate your site.
Under most circumstances, websites want robots to check out their entire sites. But not always. Imagine you have a site that is divided into two parts. One part contains a blog where you tell the world what you think about each new smartphone that comes on the market. And the other part has pictures of your new baby. You don't mind people looking at the pictures of your baby, because she is, after all, cute as a button.
But you don't want those pictures included in search engine databases where people who don't even know who you are might come upon them. Or maybe you just don't want your server taxed because you happen to have over 10,000 high-resolution pictures of your new baby.
Regardless of the reason, you could use a robots.txt file to tell the search engines: index my smartphone articles but leave my baby pictures alone.
How Robots.txt Works
The commands inside a robots.txt file are referred to as the Robots Exclusion Protocol. It has been around since 1994, and for most of its history it was never officially standardized (an IETF specification, RFC 9309, was only published in 2022). But it manages to work pretty well anyway.
There is a lot to it (which we will get to). But mostly, there are just two kinds of commands: (1) those that tell which robots the commands apply to; and (2) those that tell the robots what they can and cannot do.
All sections of a robots.txt file start with a User-agent command. It is of the form:
User-agent: [robot-name]
In this case, [robot-name] can be either the name of a particular robot (e.g., Googlebot) or all robots, which is indicated with an asterisk. This latter case is the most common. All commands following a User-agent line apply to that robot until the next User-agent line (if there is one).
The most common commands in a robots.txt file are those that forbid the robot from going to particular places on the website. These lines have a format similar to that of the User-agent line:
Disallow: [file or directory name]
In this case, [file or directory name] is given relative to the website root. For example, a common location for a website on a shared server is /home/websiteName/public_html. As far as robots.txt is concerned, this is just the root directory, or /.
Perhaps the simplest robots.txt file is one that tells all robots to go wherever they want:
User-agent: *
Disallow:
But if you want a website that is "off the grid" and can't be found by normal search engines, your robots.txt file might look like this:
User-agent: *
Disallow: /
A more realistic case would be one where you don't want the Google search robot going to private areas:
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /wp-admin/
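To see how a well-behaved robot might apply these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are the Googlebot example above; the page URLs and the OtherBot name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules from the Googlebot example; the URLs below are hypothetical.
rules = """\
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse from text instead of fetching a live file


def allowed(agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("Googlebot", "http://www.mysite.com/wp-admin/options.php"))  # False
print(allowed("Googlebot", "http://www.mysite.com/blog/new-phone.html"))   # True
print(allowed("OtherBot", "http://www.mysite.com/wp-admin/options.php"))   # True
```

Note that the last check passes: the rules name only Googlebot, so any other robot is unrestricted.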
What Else Does Robots.txt Do?
Since the robots exclusion standard has historically not been backed by an authoritative body like the ISO or the W3C, exactly what any given robot will pay attention to is variable. Thus, the user-agent and disallow commands we've just discussed are all you can really depend upon. But there are other nonstandard commands that you can add to your robots.txt file.
The allow command is almost standard. Most robots do understand it. But it really isn't of a great deal of use. It is generally used to carve out a small part of an otherwise disallowed site so that it can be crawled. When an allow rule and a disallow rule both match a URL, most robots give precedence to whichever rule has the longer path. This can be confusing and should be avoided.
User-agent: *
Disallow: /
Allow: /wp
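The "longer rule wins" behavior can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rules are plain path prefixes with no wildcards, and a tie in length goes to allow. The is_allowed function and the sample paths are hypothetical, and not every robot resolves conflicts this way (Python's built-in urllib.robotparser, for instance, applies rules in file order instead).

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled under longest-match precedence.

    `rules` is a list of (directive, prefix) pairs such as ("disallow", "/").
    Assumptions: prefixes are literal (no wildcards); ties favor "allow".
    """
    best_len = -1
    verdict = True  # with no matching rule, crawling is permitted
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty or non-matching prefix has no effect
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            verdict = directive == "allow"
    return verdict


# The example above: everything is disallowed except paths starting with /wp
rules = [("disallow", "/"), ("allow", "/wp")]

print(is_allowed("/wp-content/logo.png", rules))  # True: "/wp" is longer than "/"
print(is_allowed("/private/notes.html", rules))   # False: only "/" matches
```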
Crawl-delay tells the robot how often it can visit the site, usually as a number of seconds to wait between requests. The original idea was to keep a robot from dominating the web server. In other words, it was a way to avoid an inadvertent DoS attack. But most robots don't use it, and those that do use it in different ways.
User-agent: *
Crawl-delay: 10
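For a sense of how a cooperative crawler might honor this, here is a small sketch using Python's standard urllib.robotparser, which exposes the value through crawl_delay(). The MyCrawler name and the one-second fallback are assumptions made for the example, as is the choice to read the number as seconds between requests.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 10"])

# "MyCrawler" is a hypothetical robot name; it falls under the "*" rules.
delay = parser.crawl_delay("MyCrawler")


def seconds_between_requests(parser, agent, fallback=1.0):
    """Use the site's Crawl-delay if one is given, else a polite default."""
    value = parser.crawl_delay(agent)
    return fallback if value is None else float(value)


print(delay)                                          # 10
print(seconds_between_requests(parser, "MyCrawler"))  # 10.0
```

A crawler would then sleep for that many seconds between fetches from the same site.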
The host command tells the robot which host it should crawl. This may seem strange, but it is intended for mirror sites. If you had a base website called freeware.com and mirrors freeware1.com and freeware2.com, it would make sense for robots to crawl only freeware.com, given that the other two would be exactly the same.
User-agent: *
Host: freeware.com
The sitemap command tells robots where the site's XML sitemap can be found. In general, sitemaps are submitted directly to search engines.
User-agent: *
Sitemap: http://www.mysite.com/sitemap.xml
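As a rough illustration of how a robot could discover the sitemap this way, the sketch below reads the same two lines with Python's standard urllib.robotparser. It assumes Python 3.8 or later, where the site_maps() method is available.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Sitemap: http://www.mysite.com/sitemap.xml",
])

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps() or []

for url in sitemaps:
    print(url)  # http://www.mysite.com/sitemap.xml
```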
In addition to the robots.txt file, there are also robots meta tags. By using them, you can indicate what robots should do on a per-page level. As with most meta tags, they use two attributes: name and content.
The name attribute usually contains the word "robots." However, it can include the name of a specific robot — or even multiple ones separated by commas.
The content attribute contains one or more commands, separated by commas. The most common ones are "noindex" (don't index the page) and "nofollow" (don't follow the links on the page). There are many other parameters, including: index, follow, none, noarchive, nocache, and nosnippet. See the advanced resources for more information.
<meta name="robots" content="noindex,nofollow">
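To show how a robot might read such a tag, here is a minimal sketch built on Python's standard html.parser module. The RobotsMetaParser class and robots_directives function are hypothetical names for this example, and the sketch only looks at tags whose name is "robots", ignoring tags addressed to a specific robot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        if name != "robots":
            return  # robot-specific names are skipped in this sketch
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
directives = robots_directives(page)

print("noindex" in directives)   # True: the page should not be indexed
print("nofollow" in directives)  # True: its links should not be followed
```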
Below you will find an up-to-date collection of guides, tutorials, and tools for robots.txt.
- How to Create and Configure Your Robots.txt File: a great and thorough introduction to the subject.
- The Web Robots Pages: a basic introduction to the robots.txt file.
- What Is Robots.txt: the Moz page that is focused more on the SEO side of things.
- What Is a Robots.txt File: Patrick Sexton's article that provides a good introduction to all the basics.
- About the Robots <META> Tag: basic information about controlling robots with the meta tag.
- Learn About Robots.txt with Interactive Examples: a thorough introduction to robots.txt files.
- A Deeper Look at Robots.txt: a good discussion of the subject including pattern matching.
- Robots.txt Specifications: Google's specification, which explains exactly how they use the file.
- Robots Exclusion Protocol: information from Bing about how robots.txt files are used.
- Robots.txt Is a Suicide Note: an explanation from Archive Team as to why it ignores robots.txt files, which it considers "a stupid, silly idea in the modern era."
- How to Stop Search Engines From Indexing Specific Posts and Pages in WordPress: although the focus is on WordPress, this article provides a thorough introduction into robots meta tags.
- How to Block and Destroy SEO with 5K+ Directives: a case study on how one website destroyed its visibility due to an over-complicated robots.txt file.
- Robots.txt Disallow: 20 Years of Mistakes To Avoid: good advice about what not to do with your robots.txt file.
- McAnerin's Robot Control Code Generation Tool: a full-featured robots.txt generator with a number of specific robots to create rules for.
- SEO Book Tools: simple tools for creating and checking robots.txt files.
- Robots Database: a list of over 300 robots and details about each.
- Robots.txt Tester: Google's tool for checking your robots.txt file. It's critical that you know what Google thinks it can and can't do on your site.
The robots.txt file and robots meta tags can be useful tools for website owners and administrators. But you must take great care with them. If used incorrectly, they can greatly harm your website's visibility.