Robots.txt Ultimate Guide

The robots.txt file is a special tool that websites use to tell visiting internet robots what they can and cannot do.

For example, before the Googlebot visits your website, it reads the robots.txt file to see where it can and can't go, what information it can collect, and so on. It does this, of course, because it represents an established company that cares about its reputation.

If some internet scammers created ScamBot5000, it probably wouldn't even read the robots.txt file — except maybe to see where you did not want it searching.

So Why Use Robots.txt?

Given that robots don't have to abide by what is in the robots.txt file, it can seem like a waste of time. But it actually can be very important. Sure, spambots will come onto your website and post useless comments, but that's a different issue with different solutions. The robots.txt file is used to help search engines and archivers know how to navigate your site.

Under most circumstances, websites want robots to check out their entire sites. But not always. Imagine you have a site that is divided into two parts. One part contains a blog where you tell the world what you think about each new smartphone that comes on the market. And the other part has pictures of your new baby. You don't mind people looking at the pictures of your baby, because she is, after all, cute as a button.

But you don't want those pictures included in search engine databases where people who don't even know who you are might come upon them. Or maybe you just don't want your server taxed because you happen to have over 10,000 high-resolution pictures of your new baby.

Regardless of the reason, you could use a robots.txt file to tell the search engines: index my smartphone articles but leave my baby pictures alone.

How Robots.txt Works

The commands inside the robots.txt file are referred to as the Robots Exclusion Protocol. The protocol has been around since 1994, and for most of that time it was never officially standardized (the IETF finally published it as RFC 9309 in 2022). But it manages to work pretty well anyway.

There is a lot to it (which we will get to). But mostly, there are just two kinds of commands: (1) those that say which robots the commands that follow apply to; and (2) those that tell those robots what they can and cannot do.

User-Agent Command

All sections of a robots.txt file start with a User-agent command. It is of the form:

User-agent: [robot-name]

In this case, [robot-name] can be either the name of a particular robot (eg, Googlebot) or all robots, which is indicated with an asterisk (*). This latter case is the most common. All the commands that follow a User-agent line apply to that robot until the next User-agent line (if there is one).
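For example (the path here is purely illustrative, and the Disallow command is explained next), the following file gives Googlebot its own rule and then gives every other robot a different one:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:

Googlebot stays out of /drafts/, while all other robots may go anywhere.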


Disallow Command

The most common commands in a robots.txt file are those that disallow the robot from visiting different parts of the website. These lines follow the same format as the User-agent command:

Disallow: [file or directory name]

In this case, [file or directory name] is given relative to the website root. For example, a common location for a website on a shared server is /home/websiteName/public_html. As far as robots.txt is concerned, this is just the root directory, or /.
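For instance, to keep all robots out of a hypothetical /photos/ directory and away from a single file at the root, you would write:

User-agent: *
Disallow: /photos/
Disallow: /private.html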

Simple Examples

Perhaps the simplest robots.txt file is one that tells all robots to go wherever they want:

User-agent: *
Disallow:

(An empty Disallow line means that nothing is off-limits.)

But if you want a website that is "off the grid" and can't be found by normal search engines, your robots.txt file might look like this:

User-agent: *
Disallow: /

A more realistic case would be one where you don't want the Google search robot going to private areas:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /wp-admin/
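If you're curious how a well-behaved robot actually applies rules like these, Python's standard library ships a parser for the protocol. Here is a minimal sketch, assuming the placeholder site example.com serves the rules shown above:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite crawler checks every URL before fetching it
print(rp.can_fetch("Googlebot", "https://example.com/wp-admin/"))  # would be False under the rules above
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # would be True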

What Else Does Robots.txt Do?

Since the robots exclusion standard spent most of its history without the backing of an authoritative body like the ISO or the W3C, exactly what any given robot will pay attention to is variable. Thus, the User-agent and Disallow commands we've just discussed are all you can really depend upon. But there are other nonstandard commands that you can add to your robots.txt file.


Allow Command

The Allow command is almost standard, and most robots understand it. But it really isn't of a great deal of use. It is generally used to carve out a small part of an otherwise disallowed site for crawling. When an Allow rule and a Disallow rule both match a URL, most robots give precedence to whichever rule is longer (that is, more specific). It can be confusing and should be avoided:


User-agent: *
Disallow: /
Allow: /wp
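In this hypothetical file, everything is disallowed except URLs whose paths begin with /wp (such as /wp-content/), because for those URLs the Allow rule is the longer, more specific match.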


Crawl-delay Command

Crawl-delay tells the robot how long to wait between page requests. The original idea was to keep a robot from overwhelming the web server. In other words, it was a way to avoid an inadvertent DoS attack. But most robots don't use it, and those that do use it in different ways.


User-agent: *
Crawl-delay: 10
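In practice, support varies widely: Bing, for example, has treated the value as a minimum number of seconds between requests, while Google ignores the directive altogether.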


Host Command

The Host command tells the robot which host it should crawl. This may seem strange, but it is intended for mirror sites. If you had a base website (say, example.com) and mirrors (the hypothetical mirror1.example.com and mirror2.example.com), it would make sense for robots to crawl only example.com, given that the other two would be exactly the same.


User-agent: *
Host: example.com


Sitemap Command

The Sitemap command tells robots where the site's XML sitemap can be found. In general, though, sitemaps are submitted directly to search engines.


User-agent: *
Sitemap: https://example.com/sitemap.xml
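Note that, unlike Disallow, the Sitemap command takes a full URL rather than a root-relative path, and it is not tied to any particular User-agent section.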

Meta Tags

In addition to the robots.txt file, there are also robots meta tags. By using them, you can indicate what robots should do on a per-page level. As with most meta tags, the robots meta tag uses two attributes: name and content.

The name attribute usually contains the word "robots." However, it can include the name of a specific robot — or even multiple ones separated by commas.
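For example, googlebot is the meta name Google's crawler recognizes, so this tag would tell only Google not to index a page:

<meta name="googlebot" content="noindex">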

The content attribute contains one or more commands, separated by commas. The most common ones are "noindex" (don't index the page) and "nofollow" (don't follow the links on the page). There are many other parameters, including: index, follow, none, noarchive, nocache, and nosnippet. See the advanced resources for more information.


<meta name="robots" content="noindex,nofollow">
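One caveat worth knowing: a robot can only see this tag if it is allowed to fetch the page in the first place. If robots.txt blocks the page, the robot never reads the meta tag, and the URL can still end up listed in search results.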

Further Resources

Below you'll find a collection of guides, tutorials, and tools for robots.txt.

Basic Introductions

Advanced Information

Robots.txt Tools


The robots.txt file and robots meta tags can be useful tools for website owners and administrators. But you must take great care with them: if used incorrectly, they can greatly harm your website's visibility.