The digital bridge between human and machine? How robots.txt works
Robots surf the web alongside us, but have a different set of rules for how to traverse the digital world. Whether they’re indexing pages for a search engine, tracking changes for SEO, or collecting information for a structured database, bots need structure and instruction. How do we ensure they’re doing the right thing? Read on to explore the wonderful world of robots.txt.
What is robots.txt?
It's not just people roaming the web anymore, and we’ve known about the new kids on the block for a while: robots. A company or organization’s online robot—also called a bot, crawler, scraper, or spider—is tasked with exploring the internet for a specific purpose, such as indexing websites for ranking or collecting data. For most websites, being open to these bots and indexed by a search engine is an excellent way to rank higher in search results. But not every website wants each of its webpages accessed or indexed by a bot—what then? The robots.txt file is this modern problem’s modern solution.
Essentially, robots.txt is a text file that a website administrator adds to their site to tell bots and crawlers where they can or cannot go, and what they should or shouldn’t index. It's the digital bridge between human and machine, one that helps maintain a symbiotic relationship between the two online.
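In practice, the file is nothing more than a handful of plain-text directives. A minimal, illustrative example (the paths here are hypothetical) might look like this:

```
# Applies to every crawler that reads this file
User-agent: *
Disallow: /admin/
Disallow: /checkout/
```

Each User-agent line names the bot a group of rules applies to, and each Disallow (or Allow) line covers a path on the site.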
What does robots.txt do?
The bots that traverse the web and collect information are looking to index your site and its webpages, so those pages can be shown to searchers (in the case of Googlebot) or broken down into technical details (as the Dataprovider.com spider does). SEO improves when search engine bots can quickly and accurately identify the pages of a website.
This is where a robots.txt file comes in: it tells crawlers how they may index the site by setting rules and pointing them to sitemaps. The website owner explicitly allows or disallows particular robots to access or index the site, and outlines which bots can reach which parts of it. The owner can also direct bots to the site’s sitemap, which usually lists the relevant and appropriate pages to be indexed and leaves out private, irrelevant, or administrative parts of the site. In contrast to LLMs.txt, which gives additional information and specific details, robots.txt is intended to be a more ‘objective’ and straightforward resource for our robotic co-surfers.
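For instance, an owner who wants to welcome Google's crawler, keep another bot away from a private section, and point everyone to the sitemap could write something along these lines (the bot name ExampleBot, the paths, and the sitemap URL are all illustrative):

```
User-agent: Googlebot
Allow: /

# A hypothetical bot that is only welcome on public pages
User-agent: ExampleBot
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```

The Sitemap directive takes an absolute URL and tells any bot where to find the list of pages the owner actually wants indexed.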

However, all of this hinges on an important agreement: adherence to the robots exclusion protocol. The protocol was first developed by Dutch engineer Martijn Koster in 1994 and has been used across the web ever since; in 2019, Google sought to upgrade it to an official internet standard. Koster and Google engineers wrote the Proposed Standard together, and it was published by the Internet Engineering Task Force (IETF) in 2022 as RFC 9309. Like other .txt files, robots.txt is a plain text file; it is placed in a website's root directory so that it can be found at /robots.txt. Website administrators can include incredibly specific instructions to bots, as long or short as needed. Find the original guidelines here and more modern recommendations here.
Robots.txt is the practical outcome of the robots exclusion protocol, but, of course, it is still up to crawlers, bots, and the organizations behind them to adhere to the rules. Safe to say that most do: sites benefit from being indexed, and crawlers benefit from having access to those sites and their relevant webpages. Crawlers would rather not be denied access, and so they adhere to the rules. For example, Wikipedia’s robots.txt file is very explicit about its various instructions: it includes small written explanations (for a human reader’s benefit, no doubt) about which bots are in the ‘naughty corner’ for not following the rules. There are too many bots out there to have specific rules for each, but having some indication on a website of what is expected of robots seems integral to the internet today. Ultimately, however, the protocol remains advisory, a polite suggestion, despite using terms like allow and disallow.
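Such an entry typically pairs a human-readable comment with a full block on the offending user agent. A hypothetical example (the bot name and the reason are made up, not quoted from Wikipedia's file):

```
# Observed ignoring crawl delays and hammering the servers
User-agent: MisbehavingBot
Disallow: /
```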
Curious to see some examples of robots.txt on major websites? Take a look at the notable files at Nike, YouTube, Reddit, LinkedIn, and Wikipedia, or add /robots.txt after any domain name to see that website's bot rules.

The data behind robots.txt
At Dataprovider.com, we closely monitor robots.txt files as part of our web crawling process. Since our spider scans the web on a monthly basis, the robots.txt file is our first checkpoint to determine whether we are permitted to access a website: if the file explicitly disallows crawling, we respect those directives and do not access the site. This is part of our privacy-by-design protocol. While we have always recorded whether a robots.txt file is present, we did not systematically track which bots were specifically allowed or disallowed until recently. Doing so has revealed some impressive numbers: in March 2025, our crawler detected 112,327 unique user agents mentioned across all robots.txt files. While some of those mentions may be typos or refer to bots that are no longer in service, the total gives an indication of just how important it is to provide structure and instructions.
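For the curious: Python's standard library ships a parser for these files, so any crawler can do this kind of pre-flight check in a few lines. The sketch below is illustrative only; the user-agent string is hypothetical, and this is not our crawler's actual code.

```python
from urllib import robotparser

USER_AGENT = "ExampleBot"  # hypothetical user-agent string


def may_fetch(robots_url: str, page_url: str) -> bool:
    """Return True if the robots.txt at robots_url permits USER_AGENT to fetch page_url."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse the robots.txt file
    return parser.can_fetch(USER_AGENT, page_url)


if __name__ == "__main__":
    allowed = may_fetch("https://en.wikipedia.org/robots.txt",
                        "https://en.wikipedia.org/wiki/Robots.txt")
    print("Allowed to crawl:", allowed)
```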
As the web becomes more and more crowded with LLM crawlers and other automated agents, understanding who’s out there and what they’re allowed to do is increasingly relevant. Robots.txt provides an interesting glimpse into how website owners manage bot traffic: who they welcome and who they prefer to keep out.
In Figure 1, we break down the share of websites based on whether they have a robots.txt file and whether they allow or disallow crawling. The data reveals that 79% of websites include a robots.txt file, indicating that a significant majority of website owners take bot management seriously.

Next, we examined the bots most frequently mentioned in robots.txt files. One key observation is that among the 70 million robots.txt files analyzed, 24% explicitly allow all bots by using the wildcard directive (*). When looking at specific mentions, various Google-owned bots dominate the list, with one major exception: AhrefsBot. This SEO-focused crawler appears in 4% of robots.txt files, making it the most frequently mentioned bot overall (see Figure 2).
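In robots.txt terms, ‘everyone is welcome’ is usually spelled out as a wildcard group whose Disallow rule is left empty (an equivalent form uses an explicit Allow: /):

```
User-agent: *
Disallow:
```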

To take the analysis a step further, we focused on bots that are explicitly disallowed from crawling—meaning they are entirely blocked from accessing any part of the website. In Figure 3, we highlight the most frequently disallowed user agents. Among the top bots are PetalBot, Nutch, and MJ12bot.
PetalBot is operated by Petal Search, a search engine developed by Huawei, and is used to index web pages for Huawei’s search ecosystem. Nutch, on the other hand, is an open-source web crawler widely used for research purposes, custom search engines, and large-scale web scraping projects. MJ12bot is the crawler for Majestic, a search engine and SEO tool specializing in backlink analysis and website rankings.
None of these crawlers are inherently malicious; their impact largely depends on how they are configured and used. While concerns about PetalBot stem from Huawei’s ties to the Chinese government (given past allegations of surveillance and data privacy issues), there is no direct evidence that the bot engages in malicious activity. Wikipedia’s robots.txt specifically calls out MJ12bot, saying “Observed spamming large amounts of [...] and ignoring 429 ratelimit responses”, despite MJ12bot’s website saying it obeys robots.txt.

Should every website have robots.txt?
Creating a robots.txt file for your website isn’t complicated and has plenty of benefits, so why not? Depending on the nature and content of a site, it may be useful to state clearly which URLs a crawler can access, to flag the (ir)relevance of certain pages, or to manage and turn away unwanted visitors. Without it, crawlers can and will access all parts of a site. Website owners put different degrees of detail into their robots.txt instructions: Reddit, for example, disallows all crawlers with two lines of text, while LinkedIn’s robots.txt runs to over 4,000 lines of specifications and rules.
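A blanket ‘keep everyone out’ policy really can be that short; stripped of any comments, it boils down to two lines:

```
User-agent: *
Disallow: /
```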
As discussed in our recent article, the rise of generative artificial intelligence (genAI) has introduced yet another visitor to the digital world. Large Language Models (LLMs) also gather all kinds of information from the web; including an LLMs.txt file on a website specifies which relevant details should be shared when someone asks about the site via ChatGPT, Claude, or another service. However, since this development is still rather recent, many website owners opt to include instructions for LLM crawlers in their robots.txt files instead.
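Admitting or blocking these newer crawlers works exactly like it does for any other bot: the crawler's published user-agent name gets a group of its own. A site that prefers not to be crawled by OpenAI's GPTBot, for example, could add (other AI crawlers have their own user-agent strings):

```
User-agent: GPTBot
Disallow: /
```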
The internet is an ever-evolving place, as we know, and keeping up with the changes is complicated. Access, security, privacy, and responsibility must be considered when new developments arise, and protocols like robots.txt are one way to bridge the gap between humans and our machines.
Subscribe to our newsletter to stay in the loop about the latest insights and developments around web data.