What is a robots.txt file and how to use it

Robots.txt - General information


Robots.txt and SEO


Hotfixes and workarounds


Robots.txt for WordPress




Robots.txt - General information

Robots.txt is a text file located in a website’s root directory that specifies which pages and files you want (or don’t want) search engine crawlers and spiders to visit. Usually, website owners want to be noticed by search engines; however, there are cases when it’s not needed, for instance, if you store sensitive data or want to save bandwidth by keeping crawlers away from heavy pages with images.

The search engines index the websites using the keywords and metadata in order to provide the most relevant results to the Internet users looking for something online. Reaching the top of the search results’ list is especially important for e-commerce shop owners. Customers rarely browse further than the first few pages of the suggested matches in the search engine.
For indexing purposes, so-called spiders or crawlers are used. These are bots that the search engine companies use to fetch and index the content of all the websites that are open to them.

When a crawler accesses a website, it first requests a file named /robots.txt. If such a file is found, the crawler checks it for the website’s indexing instructions. A bot that does not find any directives follows its own algorithm, which basically means indexing everything. Not only does this load the website with needless requests, but the indexing itself also becomes a lot less effective.

NOTE: There can be only one robots.txt file for the website. A robots.txt file for an addon domain name needs to be placed in the corresponding document root. For example, if your domain name is www.domain.com, it should be found at https://www.domain.com/robots.txt.
It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so make sure to get that right or it won’t work.

Google's official stance on the robots.txt file

A robots.txt file consists of lines which contain two fields:
  1. User-agent name (search engine crawlers). Find the list with all user-agents’ names here
  2. Line(s) starting with the Disallow: directive to block indexing.

Robots.txt has to be created in the UNIX text format. It’s possible to create such a .txt file directly in the File Manager in cPanel. More detailed instructions can be found here.


Basics of robots.txt syntax

Usually, a robots.txt file contains directives like these:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/

In this example, three directories (/cgi-bin/, /tmp/ and /~different/) are excluded from indexation.
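
To see how a crawler might read these rules, here is a minimal sketch using Python’s standard urllib.robotparser module. The agent name ExampleBot and the sample paths are made up for illustration, and real search engines may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
"""


def is_allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)


print(is_allowed("/tmp/cache.html"))  # inside a disallowed directory
print(is_allowed("/index.html"))      # not covered by any Disallow line
```

Anything that does not start with one of the three listed prefixes stays open to crawlers.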

PLEASE NOTE:
  • Every directory is written on a separate line. You should not write all the directories in one line, nor should you break up one directive into several lines. Instead, use a new line to separate directives from each other.
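  • Star (*) in the User-agent field means “any web crawler.” It is not a general-purpose wildcard there, so User-agent: Mozilla* is not supported. The original robots.txt standard does not support patterns such as Disallow: *.gif either; some major crawlers, including Googlebot and Bingbot, accept * and $ in paths as an extension, but other bots may ignore such rules. Pay attention to these logical mistakes as they are the most common ones.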
  • Another common mistake is an accidental typo - misspelled directories, user-agents, missing colons after User-agent and Disallow, etc. When your robots.txt files get more and more complicated, it’s easier for an error to slip in so there are some validation tools that come in handy.


Examples of usage

Here are some useful examples of robots.txt usage:

Example 1

Prevent the whole site from indexation by all web crawlers:

User-agent: *
Disallow: / 

Such a measure as fully blocking the crawling might be needed when the website is under a heavy load of requests or if the content is being updated and should not come up in the search results. Sometimes the settings for the SEO-campaign are too aggressive so the bots basically overload the website with the requests to its pages.

Example 2

Allow all web crawlers to index the whole site:

User-agent: *
Disallow:
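
The two examples differ by a single slash, which is easy to miss. The sketch below, based on Python’s standard urllib.robotparser, shows the effect; the helper name allowed and the agent name ExampleBot are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = "User-agent: *\nDisallow: /"  # Example 1: nothing may be crawled
ALLOW_ALL = "User-agent: *\nDisallow:"    # Example 2: empty value, nothing is blocked


def allowed(rules: str, path: str) -> bool:
    """Return True if a generic crawler may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("ExampleBot", path)


print(allowed(BLOCK_ALL, "/about.html"))
print(allowed(ALLOW_ALL, "/about.html"))
```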

There is actually no need to crawl the whole website. It’s unlikely that the visitors will be looking up terms of use or login pages via Google Search, for example. Excluding some pages or types of content from indexing would be beneficial for security, speed, and relevance in the rankings of the given website.

Below are examples of how to control what content is indexed on your website.

Example 1

Prevent a specific directory from indexation:

User-agent: *
Disallow: /cgi-bin/

Example 2

Prevent a specific page from being indexed by any web crawler:

User-agent: *
Disallow: /page_url

The rule takes the page’s path rather than its full URL, that is, the part that follows http://www.yourdomain.com. When such a rule is used, any page whose path starts with the same string is blocked from indexing. For example, both /page_url and /page_url_new will be excluded. To avoid this, end the rule with $, which major crawlers such as Googlebot and Bingbot treat as an end-of-URL anchor:

User-agent: *
Disallow: /page_url$
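
The difference between the two rules can be illustrated with a small Python sketch. The rule_matches helper below is hypothetical and only approximates how wildcard-aware crawlers interpret * and $; it is not the implementation of any particular search engine.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt path pattern.

    Rules are prefix matches; '*' stands for any run of characters and
    a trailing '$' requires the path to end where the pattern ends.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with "match anything".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(rule_matches("/page_url", "/page_url_new"))   # prefix rule also hits the longer path
print(rule_matches("/page_url$", "/page_url_new"))  # anchored rule does not
print(rule_matches("/page_url$", "/page_url"))      # anchored rule hits the exact path
```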

Example 3

Prevent the website’s indexation by a specific web crawler:

User-agent: Bot1
Disallow: /

Lists of known user-agent names exist, but some bots change their identities over time. When the load on the website is extremely high and it’s not possible to find out which bot is overusing the resources, it’s better to block all of them temporarily.
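
A per-crawler rule like the one in Example 3 can be checked with Python’s standard urllib.robotparser, as in the sketch below. Bot1 comes from the example; OtherBot is a made-up name standing in for any other crawler.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Bot1
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Only the crawler named in the User-agent line is affected.
bot1_ok = parser.can_fetch("Bot1", "/index.html")
other_ok = parser.can_fetch("OtherBot", "/index.html")
print(bot1_ok, other_ok)
```

Because there is no "User-agent: *" group here, crawlers other than Bot1 are not restricted at all.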

Example 4

Allow a specific web crawler to index the site and block all others:

User-agent: Opera 9
Disallow:

User-agent: *
Disallow: /

Example 5

Prevent all the files from indexation except a single one.

There is also the Allow: directive. However, it is not recognized by all crawlers and might get ignored by a number of them. Currently, it’s supported by Bing and Google. The following example, which allows only one file from a specific folder, should be used at your own risk:

User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/
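
The effect of combining Allow and Disallow can be sketched with Python’s standard urllib.robotparser. Note that this parser applies the first rule that matches, which is why Allow is listed before Disallow; Google documents using the most specific (longest) matching rule instead, so results for other orderings may differ between crawlers. ExampleBot and other.jpeg are made-up names.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The explicitly allowed file stays reachable; the rest of /docs/ does not.
single_file = parser.can_fetch("ExampleBot", "/docs/file.jpeg")
rest_of_dir = parser.can_fetch("ExampleBot", "/docs/other.jpeg")
print(single_file, rest_of_dir)
```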

Instead, you can move all the files except the one you want indexed to a certain subdirectory and prevent the indexation of that subdirectory:

User-agent: *
Disallow: /docs/

This setup requires a specific website structure. It’s also possible to create a separate landing page that would redirect to the actual user’s home page. This way you can block the actual directory with the website and allow the landing index page only. It’s better when such changes are performed by a website developer to avoid any issues.

You can also use an online robots.txt file generator here. Keep in mind that it performs the default setup that does not take into account the sophisticated structures of the custom-coded websites.

The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn’t occur in the latest CMS versions, but the older versions need to be checked.
This exclusion means your images will not be indexed and included in Google’s Image Search. Images appearing in search results is something you would want, as it increases your SEO rankings. However, you need to look out for an issue called “hotlinking.” When someone reposts an image uploaded to your website elsewhere, your server is what gets loaded with the requests. To prevent hotlinking, read more in our corresponding Knowledgebase article.

If you would like to change this, open your robots.txt file and remove the line that says:

Disallow: /images/

If your website has a lot of private content, or the media files are not stored permanently but uploaded and deleted daily, it’s better to exclude the images from search results. In the first case, it’s a matter of personal privacy. The second concerns the possible overload from crawler activity when bots check each new image again and again.


Adding reference to your sitemap.xml file

If you have a sitemap.xml file (and you should have one, as it increases your SEO rankings), it is a good idea to include the following line in your robots.txt file:

Sitemap: http://www.domain.com/sitemap.xml

Do not forget to replace the http://www.domain.com/sitemap.xml path with your actual information.
Guidelines on how to create a sitemap.xml file for your website can be found here.
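
To confirm that a sitemap reference is picked up, the sketch below reads it back with Python’s standard urllib.robotparser. It assumes Python 3.8 or later, where the site_maps() method is available; the Disallow line is only there to make the sample file look realistic.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /tmp/
Sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line is independent of the User-agent groups, so it can be placed anywhere in the file.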


Miscellaneous remarks

  • Don't block CSS, JavaScript and other resource files by default. Blocking them prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
  • You can also use the file to prevent specific pages from being indexed, like login or 404 pages, but this is better done using the robots meta tag.
  • Adding disallow statements to a robots.txt file does not remove content. It simply blocks spiders’ access to it. If there is content that you want removed from search results, it’s better to use a meta noindex.
  • As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, like the rel=canonical tag, which is part of the HTML head of a webpage.
  • Always keep in mind that robots.txt should be accurate so that your website can be indexed correctly by search engines.



Hotfixes and workarounds


Including URL indexing to 'noindex'

The noindex meta tag prevents the whole page from being indexed by a search engine. This might not be desirable, since you would still want the URLs on that page to be followed and indexed by bots for better results. To ensure this happens, add the following line to the <head> section of your page:

<meta name="robots" content="noindex, follow">

This line will prevent the page itself from being indexed by a search engine but due to the follow part in the code, the links posted on this page will still be retrieved. This will allow the spider to move around the website and its linked content. The benefit from this type of integration is called Link Juice - it’s the connection between different pages and the relevance of their content to each other.
If nofollow is added, the crawler will stop when it reaches this page and will not move further to the interlinked content:

<meta name="robots" content="noindex, nofollow">

From an SEO perspective, this is not recommended but it’s up to you to decide.
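
To verify which directives a page actually carries, the robots meta tag can be read back with Python’s standard html.parser module. This is a minimal sketch; RobotsMetaParser and robots_directives are hypothetical helper names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(sorted(robots_directives(page)))
```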



Some pages might be removed from the website permanently and therefore no longer have any real value. Any outdated entries should be removed from the robots.txt and .htaccess files. The latter might contain redirects for pages that are no longer relevant.
Simply blocking expired content is not effective. Instead, 301 redirects should be applied either in the .htaccess file or via plugins. If there is no adequate replacement for the removed page, it may be redirected to the homepage.



It’s better to keep pages with sensitive data out of the index. The most common examples are:
  • Login pages
  • Administration area
  • Personal accounts information
To improve website security, please keep in mind the following:
  • The fact that a URL appears in the search results does not mean that anyone without the credentials can access it. Still, you may want to have custom administrative dashboard and login URLs that are known only to you.
  • It’s recommended to not only exclude certain folders but also protect them using passwords.
  • If certain content on your website should be available to registered users only, make sure to apply these settings to the pages. Password-only access can be set up as described here. Examples are websites with premium membership, where certain pages and articles are available only after logging in.
  • The robots.txt file and its content can be viewed online by anyone. This is why it’s advised to avoid entering any names or data that might give away unwanted information about your business.
For example, if you have pages for your colleagues that each reside in separate folders and you want to exclude them from search results, they should not be named "johndoe" or "janedoe", etc. Disallowing these aforementioned folder names will basically openly publicize your colleagues’ names. Instead, you can create folder “profiles” and place all the personal accounts there. The URL in the browser would be https://yourdomain.com/profiles/johndoe and the robots.txt rule will look like this:

User-agent: *
Disallow: /profiles/





Some search engines are too eager to check for content with the slightest update. They do it too often and create a heavy load on the website. Nobody wants to see their pages loading slowly because of hungry crawlers, but blocking them completely every time might be too extreme. Instead, it’s possible to slow them down by using the following directive:

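User-agent: *
Crawl-delay: 10

In this case, bots that honor the directive wait 10 seconds between requests. Not every crawler does: Googlebot, for example, ignores Crawl-delay.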
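
From the crawler’s side, the delay can be read with Python’s standard urllib.robotparser, as sketched below. The sketch assumes the directive sits inside a User-agent group, which this parser requires; ExampleBot is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None if none is set.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

A well-behaved bot would pause for that many seconds between requests, for example with time.sleep(delay).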



Robots.txt for WordPress

WordPress creates a virtual robots.txt file once you publish your first post. However, if you already have a real robots.txt file on your server, WordPress won’t add a virtual one.

A virtual robots.txt doesn’t exist on the server, and you can only access it via the following link: http://www.yoursite.com/robots.txt

By default, it will have Google’s Mediabot allowed, a bunch of spambots disallowed and some standard WordPress folders and files disallowed.

So in case you didn’t create a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP. As best practice, you can also use one of the many offered SEO plugins. For the most updated and trustworthy plugins, check out WordPress’ official SEO guide.


Blocking main WordPress directories

There are 3 standard directories in every WordPress installation – wp-content, wp-admin, wp-includes that don’t need to be indexed.

Don’t choose to disallow the whole wp-content folder though, as it contains an 'uploads' subfolder with your site’s media files that you don’t want to be blocked. That’s why you need to proceed as follows:

Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
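
The sketch below uses Python’s standard urllib.robotparser to check that these rules leave the uploads subfolder reachable. The "User-agent: *" line is added here as an assumption, since the directives need a group to belong to, and the sample file paths are made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch("ExampleBot", path)


print(allowed("/wp-content/uploads/photo.jpg"))   # media files stay crawlable
print(allowed("/wp-content/plugins/example.php")) # plugin files are blocked
```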


Blocking on the basis of your site structure

Every blog can be structured in various ways:

 a) On the basis of categories
 b) On the basis of tags
 c) On the basis of both, or of neither
 d) On the basis of date-based archives

a) If your site is category-structured, you don’t need to have the Tag archives indexed. Find your tag base in the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply 'tag':

Disallow: /tag/

b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:

Disallow: /category/

c) If you use both categories and tags, you don’t need to use any directives. If you use neither of them, you need to block both:

Disallow: /tag/
Disallow: /category/

d) If your site is structured on the basis of date-based archives, you can block those in the following ways:

Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/

PLEASE NOTE: You can’t use Disallow: /20*/ here, as such a directive will block every single blog post or page whose URL path starts with '20'.
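
Since each year needs its own line, the list can be generated rather than typed by hand. The helper below is a hypothetical sketch; year_archive_rules is not part of WordPress or any library.

```python
def year_archive_rules(first_year: int, last_year: int) -> list:
    """Return one Disallow line per year, e.g. 'Disallow: /2010/'."""
    return [f"Disallow: /{year}/" for year in range(first_year, last_year + 1)]


# Print the lines so they can be pasted into robots.txt.
print("\n".join(year_archive_rules(2010, 2013)))
```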


Duplicate content issues in WordPress

By default, WordPress produces duplicate pages, which do no good to your SEO rankings. To fix this, we advise you not to use robots.txt, but instead go with a subtler way: the rel=canonical tag, which you place in the <head> section of your pages to point to the only correct canonical URL. This way, web crawlers will only crawl the canonical version of a page. A more detailed description from Google about what a canonical tag is and why you should be using it can be found here.


That's it!