As #AI scrapers and #SEO bots become increasingly common across the web, website owners are more interested than ever in controlling who accesses their content. While some bots are useful for indexing content (like #Googlebot), others may harvest data without permission, copy original material, or overload your servers. One of the simplest ways to manage bot traffic is by configuring your robots.txt file.
What Is robots.txt?
The robots.txt file is a plain text file, placed at the root of your site (e.g., https://example.com/robots.txt), that websites use to communicate with web #crawlers and other bots. It tells them which parts of the site they are allowed or disallowed to access. Although compliance is voluntary, most well-behaved bots (like those from #searchengines) will honor the rules set in robots.txt.
Blocking Bots: General Syntax
Here’s a quick breakdown of the robots.txt syntax:
User-agent: name of bot
Disallow: URL path you want to block
For example:
User-agent: Googlebot
Disallow: /private/
This blocks Googlebot from accessing anything under the /private/ directory.
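A group can also contain several Disallow lines if you want to keep the same bot out of more than one path (the paths below are just placeholders):
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/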
Blocking All Bots
If you want to block all bots from crawling your site:
User-agent: *
Disallow: /
This tells all bots to avoid your entire site.
Blocking Specific SEO and AI Bots
To block specific AI and SEO-related bots, you need to know their user-agent names. Some common ones include:
SEO Bots
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SEOkicks-Robot
Disallow: /
User-agent: DotBot
Disallow: /
AI and Data Scrapers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
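Repeating Disallow: / for every bot works, but the robots exclusion standard also allows several User-agent lines to share a single group of rules, which keeps the file shorter. For example, the AI crawlers above could be grouped like this:
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Bytespider
User-agent: ChatGPT-User
Disallow: /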
Note: These bots may change their #useragent names or ignore robots.txt, especially if they’re malicious or not affiliated with major companies. In such cases, server-side blocking (via firewalls or .htaccess rules) is more effective.
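As a rough sketch of that server-side approach – assuming an Apache server with mod_rewrite enabled, and a bot list you would adjust to your own needs – an .htaccess rule like this returns 403 Forbidden to any request whose User-Agent header contains one of the listed names:
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match any of these names anywhere in the User-Agent header, case-insensitively
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider|AhrefsBot|SemrushBot) [NC]
  # Refuse the request outright with a 403 response
  RewriteRule .* - [F,L]
</IfModule>
Unlike robots.txt, this is enforced by the server itself, so it also stops bots that ignore your crawl rules – as long as they keep sending a recognizable user-agent.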
Allowing Good Bots While Blocking Others
Let’s say you want Google and Bing to index your site but want to block AI and SEO bots. You can structure your robots.txt like this:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: GPTBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /
This allows Google and Bing but blocks everyone else. Well-behaved crawlers obey the most specific group that matches their user-agent, so Googlebot and Bingbot follow their own (empty) Disallow rules rather than the catch-all * block.
How to Test Your robots.txt
- Google Search Console – Google offers a tool to test your robots.txt rules.
- Online Tools – Online robots.txt checker sites help validate your syntax.
- Manual Testing – You can mimic a bot by using tools like curl or browser extensions that spoof user-agents, as shown below.
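For example, the following curl command (a sketch; replace example.com and the path with your own) requests a page while presenting a simplified GPTBot user-agent string, so you can see how your server responds:
# -A sets the User-Agent header; -I requests only the response headers so the status code is visible
curl -I -A "GPTBot" https://example.com/private/
A 403 or similar status means the request was blocked at the server. Note that robots.txt alone won’t change what curl gets back – it only asks bots to behave, it doesn’t block requests.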
Limitations of robots.txt
- Voluntary Compliance – Bad bots can ignore it.
- No Obfuscation – robots.txt is public; disallowed paths may expose sensitive directories.
- No Granular Controls – It can’t limit request rates or block based on behavior.
Going Beyond robots.txt
For stronger control:
- Use a #firewall or WAF (Web Application Firewall).
- Set up rate-limiting or CAPTCHA challenges.
- Monitor server logs for unwanted crawlers (see the example below).
- Block IP ranges known to host #scrapers.
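For example, to monitor your server logs for unwanted crawlers, you can rank the user-agents hitting your site. A minimal sketch, assuming the common combined log format and a log path you would adjust for your own server:
# Print the 20 most frequent user-agents in the access log
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20
Bots that show up heavily here despite your robots.txt rules are good candidates for firewall or .htaccess blocks.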
Final Thoughts
robots.txt is a simple but powerful tool to help manage how bots interact with your site. While it won’t stop determined #scrapers, it’s an important first line of defense – especially against well-known SEO and AI #crawlers. Pair it with more advanced server-side protections for a comprehensive #botblocking strategy.