A robots.txt file gives instructions to search engine crawlers, telling them what they can or cannot crawl on your site. The convention behind the file is known as the Robots Exclusion Protocol.
The first thing a search engine crawler does before scanning a website is look for a robots.txt file. The file can point the crawler to your sitemap or tell it not to crawl certain sections of your site. If you want search engine crawlers to scan everything (which is the most common case), creating a robots.txt file is unnecessary. However, if there are things you do not want crawled, you can say so in a robots.txt file. It is important that the file is formatted correctly; otherwise, crawlers may ignore your rules or skip pages that you do want to appear in search results.
Search Engine Crawlers & robots.txt Files
If a search engine crawler finds a robots.txt file that disallows some URLs, it will not crawl those URLs; however, it might still index them. Even though the crawler is not allowed to see the content, it can still record the anchor text and the backlinks that point to the disallowed URL. As a result, the URL can still appear in search results, but without a snippet.
See an example of a URL that has been indexed but not crawled because of robots.txt:
Note: Search engine crawlers comply with your robots.txt file, but other crawlers (malware, spambots, etc.) may not follow its instructions. Do not put confidential information online.
robots.txt & Domain Errors
If the request for your robots.txt file returns a 404 (Not Found) or 410 (Gone) error, the crawler will crawl your website without restrictions, because the search engine assumes that the robots.txt file doesn’t exist.
With other errors, such as 500 (Internal Server Error), 403 (Forbidden), a timeout, or an ‘unreachable’ response, the crawler does not assume that the file is missing. The instructions in robots.txt still count, but the crawl might be postponed until the file is accessible again.
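The error handling described above can be summed up in a few lines of Python. This is only a sketch of the behavior as this article describes it, not the logic of any particular search engine; the function name and the outcome labels are hypothetical, and real crawlers differ in the details.

```python
def robots_fetch_outcome(status):
    """Map the result of requesting /robots.txt to a crawl decision.

    `status` is the HTTP status code, or None for a timeout or an
    unreachable server. The returned labels are made up for this sketch.
    """
    if status is None:
        return "postpone"        # timeout / unreachable: try again later
    if status in (404, 410):
        return "crawl_all"       # file is treated as if it did not exist
    if 200 <= status < 300:
        return "follow_rules"    # file was fetched: parse it and obey it
    if status == 403 or status >= 500:
        return "postpone"        # wait until the file is accessible again
    return "unspecified"         # cases this article does not cover


for code in (200, 404, 410, 403, 500, None):
    print(code, robots_fetch_outcome(code))
```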
robots.txt and a Marketing SEO strategy
If your inbound marketing strategy calls for a robots.txt file, a correctly formatted one lets crawlers crawl your site the way you intend. On the other hand, an incorrectly formatted file can keep your website from being shown in the SERPs.
Locating a robots.txt file
Your robots.txt file is public information. You can see any website's robots.txt by going to its domain and appending /robots.txt to it.
Using a tool like Unamo SEO's Optimization section, you can also type in any domain and it will tell you if a robots.txt file is already in place.
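Because the file always sits at /robots.txt directly under the domain, its address can be derived from any page URL on the site. The snippet below is a small sketch using only Python's standard library; the function name robots_url and the example.com address are placeholders for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/post?id=7"))
# https://www.example.com/robots.txt
```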
Reasons for using a robots.txt file
You should create a robots.txt file if:
- you have out-of-date or sensitive content that you do not want to be crawled
- you do not want the images on your site to be included in image search results
- you want to point the crawler easily to your sitemap
- your site is not ready yet and you do not want the robot to index it before it’s fully prepared to be launched
Please bear in mind that the information you want the crawler to avoid is accessible to everyone that enters your URL. Do not use this text file to hide any confidential data.
Facebook has a lot of information that it does not want crawled by different search engines. Its robots.txt file, which you can view at facebook.com/robots.txt, is rather extensive.
Creating a robots.txt for your website
Most CMS programs, like WordPress, already have a robots.txt file in place. Check their FAQs to find out how to access it. If you are creating a robots.txt yourself, follow the tips listed in this article.
The robots.txt file should be:
- named in lowercase (robots.txt, not Robots.txt or ROBOTS.TXT)
- encoded in UTF-8
- saved as a plain text file (.txt) from a text editor
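If you prefer to generate the file with a script rather than a text editor, the three requirements above translate into something like the following. This is a minimal sketch: the rules in RULES, the function name write_robots, and the output folder are placeholders, and the resulting file still has to be uploaded to the root of your domain.

```python
from pathlib import Path

# Placeholder rules, used here only to have something to write.
RULES = """\
User-agent: *
Disallow: /private/
"""


def write_robots(directory):
    """Write RULES to <directory>/robots.txt and return the file path."""
    path = Path(directory) / "robots.txt"     # lowercase file name, .txt extension
    path.write_text(RULES, encoding="utf-8")  # plain text, UTF-8 encoded
    return path


print(write_robots("."))
```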
If you’re creating the file yourself and you’re not sure exactly where to place it, you can either:
- Contact your web hosting provider to ask how to access your domain’s root
- Go to Google Search Console and upload it there
With Google Search Console, you can also test whether your robots.txt is properly formatted and check which pages are blocked by the file. If you submit the document in Google Search Console, the updated document should be crawled almost immediately.
You can access the robots.txt Testing Tool here.
An example of a robots.txt
The basic format of the robots.txt is the following:
# You can add comments, which serve only as notes to keep you organized, by starting them with an octothorpe (#). Crawlers ignore these comments. They also ignore any line they cannot parse, so a typo in a directive means that the rule is silently skipped.
User-agent - Tells which crawler the instructions on the robots.txt file are meant for.
- Adding an asterisk (*) tells all crawlers that the instructions are meant for them
- Specifying a bot (e.g. Googlebot, Baiduspider, Applebot, etc.) tells that specific bot that the instructions are meant for it
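To see how User-agent groups separate instructions for different crawlers, you can feed a small file to the robots.txt parser in Python's standard library (urllib.robotparser). The rules and the example.com URLs below are invented for illustration, and Python's parser is only one implementation; individual search engines may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Rules for one named crawler
User-agent: Googlebot
Disallow: /

# Rules for every other crawler
User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, which blocks everything.
print(parser.can_fetch("Googlebot", "https://example.com/photos/cat.jpg"))    # False
# Baiduspider has no group of its own, so it falls back to the * group.
print(parser.can_fetch("Baiduspider", "https://example.com/photos/cat.jpg"))  # True
print(parser.can_fetch("Baiduspider", "https://example.com/drafts/post.html"))  # False
```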
Disallow - Tells the crawlers which parts of a website you don’t want crawled.
Some disallow examples:
- Disallow: /
You disallow crawling of everything
- Disallow:
You allow the crawler to crawl everything
- Disallow: /xyz/
You disallow crawling of a folder /xyz/
- Disallow: /xyz
You disallow crawling of any path that starts with the letters ‘xyz’, so it can be /xyz/, /xyzabc/, /xyzabc_photo/, etc.
- Disallow: /.xyz
You disallow crawling of paths that start with .xyz
- Disallow: /*.xyz
You disallow crawling of paths that contain .xyz
- Disallow: /*.xyz$
You disallow crawling of paths that end with .xyz
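The matching behind these patterns can be sketched in a few lines of Python. The helper below is an illustration, not any crawler's actual code: it assumes the widely used convention that a rule matches from the start of the path, that * stands for any sequence of characters, and that a trailing $ anchors the rule to the end of the URL. The function name rule_matches is made up for this example. (Python's own urllib.robotparser does not handle * and $, which is why the matching is written out by hand here.)

```python
import re


def rule_matches(pattern, path):
    """Return True if a Disallow/Allow pattern matches the given URL path."""
    if not pattern:
        return False  # an empty "Disallow:" blocks nothing
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None  # re.match anchors at the start


print(rule_matches("/xyz/", "/xyzabc/"))                # False
print(rule_matches("/xyz", "/xyzabc/"))                 # True
print(rule_matches("/*.xyz", "/files/a.xyz?page=2"))    # True
print(rule_matches("/*.xyz$", "/files/a.xyz?page=2"))   # False
print(rule_matches("/*.xyz$", "/files/a.xyz"))          # True
```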
Allow - Tells the crawlers which parts of the just-disallowed content may still be crawled.
- Allow: /xyz/abc.html
The crawler is allowed to crawl one of the files in the disallowed folder (here, the file abc.html in the folder /xyz/)
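The combination of Allow and Disallow can be checked with Python's urllib.robotparser. One assumption to be aware of: Python's parser applies the first rule that matches, so the Allow line is listed before the Disallow line in this sketch. Other crawlers, Google's among them, pick the most specific matching rule regardless of order. The example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /xyz/abc.html
Disallow: /xyz/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The single allowed file inside the blocked folder.
print(parser.can_fetch("*", "https://example.com/xyz/abc.html"))    # True
# Everything else in /xyz/ stays blocked.
print(parser.can_fetch("*", "https://example.com/xyz/other.html"))  # False
```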
Sitemap - Tells all the crawlers where your sitemap’s URL can be found. This increases the speed at which your sitemap is crawled. Adding this is optional.
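A Sitemap line takes the full URL of the sitemap. As a quick sketch, Python's urllib.robotparser (version 3.8 or later) can read it back with site_maps(); the sitemap address below is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of the Sitemap URLs, or None if there are none.
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```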
Please bear in mind that:
- Names of folders are case sensitive, i.e. /xyz/ and /XYZ/ are two different folders
- /xyz/ is more specific than /xyz; use the first one whenever possible to be as precise as possible
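Both points are easy to verify with Python's urllib.robotparser. In this sketch, a single rule for /xyz/ is tested against three made-up paths on example.com.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /xyz/"])

for path in ("/xyz/page.html", "/XYZ/page.html", "/xyzabc/page.html"):
    allowed = parser.can_fetch("*", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")

# /xyz/page.html blocked
# /XYZ/page.html allowed     (different case, so a different folder)
# /xyzabc/page.html allowed  (the trailing slash keeps the rule precise)
```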
A correctly created robots.txt file is important
A robots.txt should be used together with a robots meta tag. Remember to use both of them carefully. Otherwise, you might end up with a website that will never appear in the SERPs.