The robots.txt file is used to tell Googlebot and the other robots that crawl the web which pages and files of our website they may access and which they should leave alone. Although it is not essential, the robots.txt file is a great help to Google and other crawlers when it comes to indexing our site, so it is very important that it is configured correctly.
1. Location of the robots.txt file
The robots.txt file must be created in the root directory of our website and, as its name indicates, it is a plain text file with the .txt extension. We must make sure it has public read permissions so that it can be accessed from outside, for example 664 permissions.
If the file does not exist on our website, we must connect to our server via FTP and create it. There are plugins for the most widely used CMSs, such as Drupal or WordPress, that create and configure this file for us if it does not exist.
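As a quick sanity check once the file has been created, it should be reachable from outside. The following minimal Python sketch, assuming a hypothetical domain www.example.com, simply requests the file and prints the HTTP status and its contents:

# Minimal check that robots.txt is publicly readable (hypothetical domain).
from urllib.request import urlopen

with urlopen("https://www.example.com/robots.txt") as response:
    print(response.status)                   # 200 means the file is publicly accessible
    print(response.read().decode("utf-8"))   # show the current rules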
2. Types of robots that can visit our website
Although Googlebot is the most popular crawler, it is also worth keeping in mind Bingbot from the Bing search engine, the Russian YandexBot, Yahoo's Slurp, the Alexa bot (ia_archiver) and Baiduspider from the Chinese search engine Baidu.
There are also bots with more specific roles, such as Googlebot-Image, which crawls and indexes only the images on websites.
There are many crawler bots out there, and not all of them visit our website with good intentions: they range from bots looking for security holes to content scrapers built to duplicate our website.
3. Editing the robots.txt file
It is very important to bear in mind that, by default, every page of a website is crawlable and indexable. Through the robots.txt file we can give the different bots that visit us some guidelines about which content they may access and which they should not crawl. We can do all this with a few simple directives (a combined example follows the list):
- User-agent : Indicates the robot to which the rules defined below will apply.
Syntax: User-agent: BotName
Example: User-agent: Googlebot
- Disallow : Tells the robots not to crawl the URL or URLs that match the pattern defined next.
Syntax: Disallow: Pattern
Example: Disallow: /comments
- Allow : Tells the robots that they may crawl the URL or URLs that match the pattern defined next. Allow instructions take precedence over Disallow instructions, so if we declare a page or set of pages as crawlable with Allow, they will remain crawlable even if they are also covered by some Disallow instruction.
Syntax: Allow: Pattern
Example: Allow: /readme.html
- Sitemap : Specifies where the sitemap of our website is located.
Syntax: Sitemap: SitemapUrl
Example: Sitemap: https://www.harishamilineni.com/sitemap.xml
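Putting the four directives together, a minimal robots.txt could look like the following sketch; it simply reuses the example values shown above (Googlebot, /comments, /readme.html and the sitemap URL), so treat it as an illustration rather than a recommended configuration:

User-agent: Googlebot
Disallow: /comments
Allow: /readme.html

Sitemap: https://www.harishamilineni.com/sitemap.xml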
When specifying patterns, there are a couple of special characters. Let us first see what these characters are and then how they are used, with a short illustration after the list.
- * : The asterisk is a wildcard that matches any character or sequence of characters.
- $ : The dollar sign marks the end of the URL; by default a pattern is treated as a prefix, so more characters may follow the last one we wrote, and the $ removes that assumption.
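As the short illustration promised above, the two special characters are often combined. The following hedged example (the .pdf extension is arbitrary) would keep crawlers away from every URL that ends in ".pdf", whatever directory it lives in:

User-agent: *
Disallow: /*.pdf$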
Finally, it is important to keep in mind that the robots.txt file is case-sensitive, so "Disallow: /archive.html" is not the same as "Disallow: /Archive.html".
If this still sounds a little abstract, the following simple examples should make everything clear.
3.1. Block a page and lower level pages
User-agent: *
Disallow: /articulos/
With the asterisk in User-agent we indicate that the following instruction or instructions apply to all bots. They remain in effect until the end of the file or until another User-agent line appears, addressing a different robot or robots.
With the Disallow instruction we tell the bots not to crawl "/articulos/", always starting from our root directory. It is a frequent mistake to think that only this exact URL will be blocked: as explained before, the pattern is treated as a prefix, so more characters may follow the last character, which in this case is the final "/" of "/articulos/". For example, the URL "/articulos/ejemplo" and any other URL beginning with "/articulos/" will also be blocked. In the next section we will see how to block only the articles page itself while keeping lower-level pages such as "/articulos/julio" or "/articulos/agosto" crawlable.
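Before moving on, a rule like this can be verified offline with the basic robots.txt parser included in Python's standard library. This is only a quick sanity check under assumptions: the domain is hypothetical and urllib.robotparser implements plain prefix matching, so it may not honor the * and $ extensions discussed below.

# Check which URLs the example rule blocks (hypothetical domain).
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /articulos/"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://www.example.com/articulos/"))         # False: blocked
print(parser.can_fetch("*", "https://www.example.com/articulos/ejemplo"))  # False: blocked as well
print(parser.can_fetch("*", "https://www.example.com/contacto"))           # True: not affected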
3.2. Block a page while maintaining access to lower-level pages
User-agent: *
Disallow: /articulos$
This case is almost identical to the previous one, except that with the dollar sign we delimit the URL so that only "/articulos" is excluded, while lower-level pages such as "/articulos/enero" or "/articulos/febrero" can still be crawled.
Note that we have also dropped the slash at the end of the URL: with the $ in place, the rule matches only the exact URL "/articulos". If the same page is also served with a trailing slash as "/articulos/", an additional line such as "Disallow: /articulos/$" would be needed to cover that variant as well.
3.3. Block a page and all lower level pages except for those that we define
User-agent: *
Disallow: /articulos/
Allow: /articulos/enero
By default, bots may access every page. What we do first is block access to "/articulos/" and everything below it, and then use Allow to permit the URL "/articulos/enero" to be crawled. In this way only "/articulos/enero" will be accessible, while "/articulos/febrero", "/articulos/marzo" and the rest of the subpages will not.
3.4. Block all lower level pages but allow access to the higher level
User-agent: *
Allow: /articulos/$
Disallow: /articulos/
In this case, the Allow line grants access to the "/articulos/" page and only to that page; it says nothing about the pages at the lower level, which, at this point, would still be accessible to the bots by default.
The Disallow instruction that follows then excludes "/articulos/" and all its sub-level pages, but since we explicitly declared "/articulos/" as crawlable in the preceding line, that page itself remains accessible.
3.5. Blocking URLs using wildcards
User-agent: *
Disallow: /pagina/*/articulos/
What the Disallow instruction in this example indicates is that pages whose URL has "/pagina/" as its first element and "/articulos/" as its third element should not be crawled, regardless of what the second element is. As we can see, the asterisk stands in for any string of characters.
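To make the matching concrete, here is a hedged illustration with hypothetical URLs, written as robots.txt comments (the # syntax is valid in the file):

# With the rule  Disallow: /pagina/*/articulos/
# /pagina/2/articulos/              -> blocked (the * matches "2")
# /pagina/deportes/articulos/faq    -> blocked (the * matches "deportes")
# /pagina/deportes/                 -> not blocked (no "/articulos/" segment follows)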
3.6. Assign different instructions for different robots
User-agent: *
Disallow: /ocultar
User-agent: WebZIP
Disallow: /
In the example, we first tell all bots not to crawl the "/ocultar" page. Then we address the "WebZIP" bot and tell it not to crawl any URL of our website, indicating this with a single slash "/", which represents the root directory. Many robots can be referenced in the robots.txt file: the general rules affect all robots, while the rules defined for a specific robot affect only that robot, with the robot-specific rules taking precedence over the general ones.
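The per-robot behaviour can also be checked offline with the same standard-library parser used earlier; this is a minimal sketch with a hypothetical domain, again with the caveat that urllib.robotparser only implements basic matching:

# Different robots receive different rules (hypothetical domain).
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /ocultar

User-agent: WebZIP
Disallow: /"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("WebZIP", "https://www.example.com/blog"))       # False: WebZIP is banned everywhere
print(parser.can_fetch("Googlebot", "https://www.example.com/ocultar")) # False: the general rule applies
print(parser.can_fetch("Googlebot", "https://www.example.com/blog"))    # True: everything else is allowed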
3.7. Tell the crawler robots where the web sitemap is located
Sitemap: https://www.harishamilineni.com/sitemap.xml
With the Sitemap directive we can tell the bots where the sitemap of the website is located, which helps them find all of its URLs. It is not mandatory, but any extra help is always welcome.
4. Recommendations for the robots.txt file
It is recommended that, whenever a page can be crawled, all of its images, CSS files and JavaScript files can be crawled as well. Google needs a realistic view of the page, as close as possible to what a human visitor sees, in order to render and evaluate it properly. In other words, so that Google does not penalize us in the rankings, CSS files, JavaScript files and images should not be blocked in the robots.txt file.
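As a hedged illustration of what to avoid (the paths and extensions are hypothetical), rules like the following would hide stylesheets, scripts and images from crawlers and therefore should not appear in our robots.txt:

User-agent: *
# These rules would prevent Google from rendering the page as a visitor sees it:
Disallow: /*.css$
Disallow: /*.js$
Disallow: /imagenes/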
5. Alternative: using the robots meta tag
In addition to the robots.txt file, we can also tell the robots whether or not to index certain pages using the robots meta tag, which can take the index or noindex value to indicate whether the page should be indexed. It can also carry a second value, follow or nofollow, to tell the robots whether they should follow the links on the page.
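For example, a page that should stay out of the index while its links are still followed could include a tag like this in its <head> section (a minimal sketch):

<meta name="robots" content="noindex, follow">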
These meta tags can be used in combination with the robots.txt file, but the file gives the robots this information up front, so they do not even have to fetch a page's code to know whether or not they may crawl it.