Optimization of the Robots.txt file for Google

The Robots.txt file is used to give information to Googlebot and the other robots that crawl the web about which pages and files may be indexed on our website. Although it is not essential, the Robots.txt file is a great help to Google and other crawlers when indexing our pages, so it is very important that it is configured correctly.

The Robots.txt file must be created in the root directory of our website and, as its name indicates, it is a plain text file with the .txt extension. We must make sure that it has public read permissions so that it can be accessed from outside, for example 664 permissions.
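
For those who manage the server over SSH or with scripts rather than plain FTP, the following minimal Python sketch shows the idea; the /var/www/html document root is only an assumption and must be replaced with the real one on your server:

import os
from pathlib import Path

# Hypothetical document root; adjust to your server's real web root.
DOCROOT = Path("/var/www/html")
robots_path = DOCROOT / "robots.txt"

# Create a minimal robots.txt only if it does not exist yet
# ("Disallow:" with an empty value allows everything by default).
if not robots_path.exists():
    robots_path.write_text("User-agent: *\nDisallow:\n")

# 664: owner and group can read and write, everyone else can read.
os.chmod(robots_path, 0o664)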

If the file does not exist on our website, we must access our server via FTP and create it. There are plugins for the most widely used CMSs, such as Drupal or WordPress, that create and configure this file for us if it does not exist.

Although Googlebot is the most popular crawling robot, it is also worth considering Bingbot from the Bing search engine, the Russian YandexBot, Yahoo's Slurp, the Alexa bot (ia_archiver) and Baiduspider from the Chinese search engine Baidu.

There are also other bots with more specific functions, such as Googlebot-Image, which is in charge of crawling and indexing only the images on websites.

There are a lot of crawler robots, and many of them do not visit our website with good intentions: they range from bots looking for security holes to content-scraping programs that duplicate our website.

It is very important to bear in mind that, by default, all pages of a website are indexable. Through the Robots.txt file we can give guidelines to the different bots that visit us, telling them which content they can access and which they should not crawl. We can do all this with a few simple basic commands (a combined example follows the list):

  • User-agent: Used to indicate the robot to which the rules defined below will be applied.
    Syntax: User-agent: BotName
    Example: User-agent: Googlebot
  • Disallow: Used to indicate to the robots that they should not crawl the URL or URLs that match the pattern defined next.
    Syntax: Disallow: Pattern
    Example: Disallow: /comments
  • Allow: Used to indicate to the robots that they must crawl the URL or URLs that match the pattern defined next. When an Allow rule and a Disallow rule both match a URL, Google applies the more specific (longer) rule, and Allow wins in the event of a tie, so a page covered by a more specific Allow pattern remains crawlable even if a broader Disallow instruction also includes it.
    Syntax: Allow: Pattern
    Example: Allow: /readme.html
  • Sitemap: Used to specify where the sitemap of our website is located.
    Syntax: Sitemap: SitemapUrl
    Example: Sitemap: http://www.harishamilineni.com/sitemap.xml
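
Purely as an illustration, this is how the four directives above might be combined in a single Robots.txt file, reusing the hypothetical paths from the examples in the list:

User-agent: Googlebot
Disallow: /comments
Allow: /readme.html

Sitemap: http://www.harishamilineni.com/sitemap.xml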

When specifying patterns, there is a series of special characters. We will first see what these characters are and then explain how they are used through some examples (a quick illustration follows the list).

  • *: The asterisk is a wildcard equivalent to any character or sequence of characters.
  • $: The dollar sign marks the end of a text string; by default, if we do not indicate it, the pattern is treated as a prefix and more characters may follow the last one we wrote.
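
As a quick, purely hypothetical illustration, the following rule combines both characters to block every URL ending in .pdf for all bots: the asterisk matches any preceding path and the dollar sign requires the URL to end exactly there.

User-agent: *
Disallow: /*.pdf$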

Finally, it is important to keep in mind that the Robots.txt file is case-sensitive, so "Disallow: /file.html" is not the same as "Disallow: /File.html".

If this still seems a little abstract, the time has come to understand everything through simple examples.

User-agent: *
Disallow: /articulos/

With the asterisk in User-agent we indicate that the following instruction or instructions apply to all bots. This remains in effect until the end of the file or until another User-agent line appears, referring to another robot or robots.

By means of the Disallow instruction, we are telling the bots not to index the page /articulos/, always starting from our root directory. It is a frequent error to think that only this exact URL will be blocked, since, as we explained before, more characters may follow the last character of the pattern, which in this case is the final "/" of /articulos/. For example, the URL /articulos/julio and any other URL that begins with /articulos/ will also be blocked. Next we will see how to block only the page /articulos/ itself, leaving the pages that hang from it at a lower level, such as /articulos/julio or /articulos/agosto, indexable.

User-agent: *
Disallow: /articulos$

This case is similar to the previous one, with the difference that, by means of the dollar sign, we delimit the pattern so that only the URL /articulos is excluded, and lower-level pages such as /articulos/enero or /articulos/febrero can be indexed.

As we can see, we have left the trailing slash out of the pattern. Keep in mind, however, that with the dollar sign this matches only the exact URL /articulos; if the same page is also reachable as /articulos/ (with the slash), an additional rule such as "Disallow: /articulos/$" would be needed to cover that variant too.

User-agent: *
Disallow: /articulos/
Allow: /articulos/enero

By default, bots are allowed to access all pages. What we do first is block access to the page /articulos/ and everything below it, but with the Allow rule we let the URL /articulos/enero be indexed. In this way, only the page /articulos/enero will be indexed, but not /articulos/febrero, /articulos/marzo and the other subpages.

User-agent: *
Allow: /articulos/$
Disallow: /articulos/

In this case, we allow access to the page /articulos/ and only to it; the Allow rule says nothing about the pages that may exist at lower levels, which, at this point, would also be accessible to the bots by default.

Then, with the Disallow instruction, we exclude the page /articulos/ and all the pages below it; but since we have explicitly defined with the preceding instruction that /articulos/ can be indexed, it remains indexable.

User-agent: *
Disallow: /pagina/*/articulos/

What we indicate with the Disallow instruction in this example is that pages whose URL has /pagina/ as the first element and /articulos/ as the third element should not be indexed, regardless of what the second element is. As we can see, the asterisk serves to replace any string of characters.

User-agent: *
Disallow: /ocultar
User-agent: WebZIP
Disallow: /

In the example, we first tell all the bots not to index the page /ocultar. Then we single out the "WebZIP" bot and tell it not to index any URL of our website, indicating this with a slash "/", which represents the root directory. Many robots can be referenced in the Robots.txt file: the common rules affect all robots, while the rules specific to each robot affect only that robot, with a robot's own specific rules taking precedence over the general ones.
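
If we want to double-check how rules like these are interpreted, Python's standard urllib.robotparser module offers a quick sanity check. It only understands plain prefix rules (not Google's * and $ wildcards), so it is shown here against the last example; the URLs are hypothetical:

from urllib import robotparser

# The rules from the example above.
rules = """
User-agent: *
Disallow: /ocultar
User-agent: WebZIP
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# WebZIP is blocked from the whole site.
print(rp.can_fetch("WebZIP", "http://www.example.com/articulos/"))    # False

# Other bots are only blocked from /ocultar.
print(rp.can_fetch("Googlebot", "http://www.example.com/ocultar"))    # False
print(rp.can_fetch("Googlebot", "http://www.example.com/articulos/")) # True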

Sitemap: http://www.harishamilineni.com/sitemap.xml

Using the Sitemap command, we can tell the bots where the sitemap of our website is located, which is useful to help them find all our URLs. It is not compulsory, but all help is always welcome.

It is recommended that, when a page can be indexed, all of its images, CSS files and JavaScript files are also crawlable. This is because Google needs a realistic view of the page, as close as possible to what a human visitor will see. In other words, so that Google does not penalize us in the rankings, CSS files, JavaScript files and images should not be blocked in the Robots.txt file.
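
If broader Disallow rules could end up covering these resources, one common pattern, shown here only as a hypothetical sketch whose effect depends on how specific the competing rules are, is to allow the asset extensions explicitly:

User-agent: Googlebot
Allow: /*.css$
Allow: /*.js$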

In addition to the Robots.txt file, we can also tell the robots whether or not to index certain pages using the meta robots tag, which can have the Index or NoIndex values to indicate to the robots whether they should index the page. In addition, it can also have a second value, Follow or NoFollow, to indicate to the robots whether they should follow the links on the page.
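
For reference, this tag is placed in the <head> of the page and, combining the two pairs of values just described, looks like this:

<meta name="robots" content="noindex, nofollow">

A page without this tag is treated as "index, follow" by default.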

These meta tags can be used in combination with the Robots.txt file, but the file gives the robots information up front, so they do not even have to look at a page's code to know whether or not it can be indexed.