The Robots.txt file and everything we need to know about it
As you know, the robots.txt file tells the crawlers (also called robots or spiders) of search engines and other services which pages of a site are open for crawling and which pages they should not visit.
This file is used by most sites today and is respected by most well-behaved web crawlers. The protocol is often applied to sites that are under development or to pages that are not meant to be publicly available.
In search engine optimization, or SEO, the robots.txt file plays an important role in controlling how search engines crawl and index a site.
History of the Robots.txt file
The robots.txt protocol was originally proposed by one of the pioneers of the Internet and the creator of Aliweb, Martijn Koster.
He made the proposal in early 1994 while working for Nexor. The British writer Charlie Stross claims that he prompted Koster to propose the standard after writing a badly behaved crawler that inadvertently caused problems on Koster's server.
Thanks to the simplicity and usefulness of this protocol, most sites and early search engines adopted the file. To date, search engines such as Google, Bing and Yahoo respect this protocol and stay away from pages the site owner has restricted.
For the SEO process, the robots.txt file has become an integral part of optimization, as the community has gained more awareness of concepts such as link equity flow and crawl budget. Today, experienced SEO experts rely on this protocol to prevent bots from crawling dynamic pages, admin pages, payment pages, and other similar documents. However, not all crawlers follow this standard: spam bots, content scrapers, and hacking and malware tools all ignore the instructions in this file.
In some cases, malicious crawlers even make crawling these pages their priority. Archival sites such as Archive Team and Internet Archive ignore such standards and consider it an outdated protocol created mostly for search engines.
Archival groups often argue that they are preserving a record that documents the evolution of the Internet.
Application of Robots.txt file
The robots.txt file is usually uploaded to the root directory of the site. Most robots are programmed to look for it at an address like www.example.com/robots.txt.
For most robots, not finding a valid robots.txt file at this location means that all pages of the site are free to crawl. The same is true when the file is uploaded to any other location: crawlers will not find it there. Creating a robots.txt file is as simple as writing the directives in a plain-text editor such as Notepad and saving the result under the name robots.txt.
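As a sketch, a minimal robots.txt that leaves every crawler free to access everything could look like this:

```
User-agent: *
Disallow:
```

An empty Disallow value means nothing is blocked; omitting the file entirely has the same effect for most crawlers.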
After you have created the robots.txt file, you need to upload it to the root directory of the domain via FTP or cPanel (or any hosting and server management program). Most modern content management platforms and SEO plugins create this file automatically.
In that case, you can simply open the generated file and apply the edits you need. The following are the most common uses of the robots.txt file.
• Preventing crawling and indexing
Among all the reasons for using the robots.txt file, this is one of the most common. Webmasters usually want to prevent crawling and indexing of pages that are not relevant to the searcher's experience, for example sections under construction, internal search results, user-generated content, PDFs, pages generated by filters, and so on.
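As an illustrative sketch (the paths here are hypothetical), rules blocking such pages might look like:

```
User-agent: *
Disallow: /under-construction/
Disallow: /search/
Disallow: /*.pdf$
```

Note that the * and $ wildcards are supported by major crawlers such as Googlebot and Bingbot, but they were not part of the original standard, so some crawlers may ignore them.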
• Preserving crawl budget
Large websites with thousands of pages usually don't want all of their pages to be crawled by Google bots. They do this to increase the chances of crawling and indexing important pages.
Regular and frequent crawling of your organic-traffic landing pages means that your SEO efforts show up sooner on the search engine results page. It also means that the pages linked from them benefit more from the link equity passed along.
• Optimizing the flow of link equity
The robots.txt file can be useful in optimizing the flow of link equity between site pages. By keeping crawlers away from pages that don't matter much, internal link equity is concentrated on your organic-traffic landing pages. This means the ranking power of your site focuses on the pages that matter most, which helps those pages rank higher in search results and attract more organic traffic.
• Declaring the sitemap location
The robots.txt file can also be used for this purpose: it tells search engines where to find the sitemap. This is optional, because the sitemap can also be registered through Google Search Console with the same result, but declaring it in this file does no harm.
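A sketch of the Sitemap directive (the URL is hypothetical and must be absolute):

```
Sitemap: https://www.example.com/sitemap.xml
```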
Some pages should not be made public, such as login and admin pages. The less discoverable these pages are, the lower the risk of attacks on the site. (Note, however, that by listing these pages in the robots.txt file, anyone can discover them simply by viewing the file!)
• Setting a crawl delay
Large websites such as e-commerce sites and wikis often publish their content in batches. In such situations, bots quickly get to work and try to crawl the entire newly published content at once. This puts pressure on the server, and eventually the loading speed of the site decreases or downtime occurs. Such sites can avoid these situations with a crawl-delay instruction in the robots.txt file: new pages are then crawled gradually, giving the server enough breathing room.
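A sketch of such a rule; note that Crawl-delay is honored by crawlers such as Bingbot but ignored by Googlebot, and its value is interpreted as seconds between requests:

```
User-agent: Bingbot
Crawl-delay: 10
```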
Writing and formatting the robots.txt file
This file has a simple, basic syntax that even people with no programming knowledge can learn to write in a very short time. Writing it mostly involves specifying pages that crawlers should not access. These are the general directives you should know when writing the robots.txt file:
User-agent: This directive specifies the name of the crawler you want to address. It can be Googlebot for Google's regular crawler, Bingbot for Bing's crawler, Rogerbot for Moz's crawler, and so on. The * character can be used to target all crawlers.
Disallow: This directive is followed by a directory path such as /category to tell bots not to crawl any address inside that directory.
A single address such as /category/sample-page.html can also be kept away from robots with the same Disallow directive.
Crawl-delay: This directive tells the bots how many seconds to wait between requests. The value often varies depending on the size of the site and the capacity of its servers.
Sitemap: This directive shows the location of the sitemap.
Let's say you are the admin of a WordPress site, and you want to make sure that the admin pages and dynamic pages never show up in search engine results.
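A sketch of such a robots.txt file:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /*?
```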
The first line addresses all crawlers via the * character, while the second line specifies that no address containing /wp-admin should be crawled. The third line tells the bots not to crawl any page whose address contains a question mark; question marks and equals signs are characters that appear in dynamic URLs.
Note that you do not need to include the root domain in this file when specifying the pages and directories you want to block; the slug or file path is sufficient.
Important points when creating the robots.txt file
There are several practices that ensure the robots.txt file is configured correctly and can have a positive impact on the SEO process as well as on user experience. Below, we review some of the important ones:
- Never upload the robots.txt file anywhere other than the root directory, and do not rename it. If search bots cannot find it at www.example.com/robots.txt, they will assume that all pages of the site are free to crawl.
- The file name is case sensitive. Most crawlers treat a file named Robots.txt as different from one named robots.txt, so make sure you write the file name in lowercase letters.
- As we mentioned earlier, malicious crawlers sometimes prioritize crawling the addresses blocked in this file in order to find an entry point to the site. For security purposes, you can use the noindex meta tag instead of this file to keep such pages out of the index without advertising their location.
- Pages that are blocked from search robots do not pass link equity to the internal and external pages they link to. If you don't want a page indexed but do want it to pass link equity, use the noindex, follow meta tag instead of blocking it in robots.txt.
- Most search engines treat subdomains as separate sites. This means the robots.txt file on the root domain does not apply to its subdomains: g-ads.org, for example, should have a robots.txt file separate from that of its subdomain seotools.g-ads.org.
You can check the health of your file with the robots.txt tester tool in Google Search Console: just open the tool and test your robots.txt file there.