As #AI scrapers and #SEO bots become increasingly common across the web, website owners are more interested than ever in controlling who accesses their content. While some bots are useful for indexing content (like #Googlebot), others may harvest data without permission, copy original material, or overload your servers. One of the simplest ways to manage bot traffic is by configuring your robots.txt file.
What Is robots.txt?
The robots.txt file is a plain text file, placed at the root of your site (e.g., https://example.com/robots.txt), that websites use to communicate with web #crawlers and other bots. It tells them which parts of the site they are allowed or disallowed to access. Although compliance is voluntary, most well-behaved bots (like those from #searchengines) will honor the rules set in robots.txt.
Blocking Bots: General Syntax
Here’s a quick breakdown of the robots.txt syntax:
User-agent: name of bot
Disallow: URL path you want to block
For example:
User-agent: Googlebot
Disallow: /private/
This blocks Googlebot from accessing anything under the /private/ directory.
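A group can also contain several Disallow lines if you want to keep the same bot out of more than one path (the paths below are just placeholders):
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/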
Blocking All Bots
If you want to block all bots from crawling your site:
User-agent: *
Disallow: /
This tells all bots to avoid your entire site.
Blocking Specific SEO and AI Bots
To block specific AI and SEO-related bots, you need to know their user-agent names. Some common ones include:
SEO Bots
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SEOkicks-Robot
Disallow: /
User-agent: DotBot
Disallow: /
AI and Data Scrapers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
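Repeating Disallow: / for every bot works, but the robots exclusion standard also allows several User-agent lines to share a single group of rules, which keeps the file shorter. For example, the AI crawlers above could be grouped like this:
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Bytespider
User-agent: ChatGPT-User
Disallow: /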
Note: These bots may change their #useragent names or ignore robots.txt, especially if they’re malicious or not affiliated with major companies. In such cases, server-side blocking (via firewalls or .htaccess rules) is more effective.
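As a rough sketch of that server-side approach – assuming an Apache server with mod_rewrite enabled, and a bot list you would adjust to your own needs – an .htaccess rule like this returns 403 Forbidden to any request whose User-Agent header contains one of the listed names:
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match any of these names anywhere in the User-Agent header, case-insensitively
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider|AhrefsBot|SemrushBot) [NC]
  # Refuse the request outright with a 403 response
  RewriteRule .* - [F,L]
</IfModule>
Unlike robots.txt, this is enforced by the server itself, so it also stops bots that ignore your crawl rules – as long as they keep sending a recognizable user-agent.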
Allowing Good Bots While Blocking Others
Let’s say you want Google and Bing to index your site but want to block AI and SEO bots. You can structure your robots.txt like this:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: GPTBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /
This allows Google and Bing but blocks everyone else. Well-behaved crawlers obey the most specific group that matches their user-agent, so Googlebot and Bingbot follow their own (empty) Disallow rules rather than the catch-all * block.
How to Test Your robots.txt
- Google Search Console – Google offers a tool to test your robots.txt rules.
- Online Tools – Online robots.txt checker sites help validate your syntax.
- Manual Testing – You can mimic a bot by using tools like curl or browser extensions that spoof user-agents, as shown below.
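For example, the following curl command (a sketch; replace example.com and the path with your own) requests a page while presenting a simplified GPTBot user-agent string, so you can see how your server responds:
# -A sets the User-Agent header; -I requests only the response headers so the status code is visible
curl -I -A "GPTBot" https://example.com/private/
A 403 or similar status means the request was blocked at the server. Note that robots.txt alone won’t change what curl gets back – it only asks bots to behave, it doesn’t block requests.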
Limitations of robots.txt
- Voluntary Compliance – Bad bots can ignore it.
- No Obfuscation – robots.txt is public; disallowed paths may expose sensitive directories.
- No Granular Controls – It can’t limit request rates or block based on behavior.
Going Beyond robots.txt
For stronger control:
- Use a #firewall or WAF (Web Application Firewall).
- Set up rate-limiting or CAPTCHA challenges.
- Monitor server logs for unwanted crawlers (see the example below).
- Block IP ranges known to host #scrapers.
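For example, to monitor your server logs for unwanted crawlers, you can rank the user-agents hitting your site. A minimal sketch, assuming the common combined log format and a log path you would adjust for your own server:
# Print the 20 most frequent user-agents in the access log
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20
Bots that show up heavily here despite your robots.txt rules are good candidates for firewall or .htaccess blocks.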
Final Thoughts
robots.txt is a simple but powerful tool to help manage how bots interact with your site. While it won’t stop determined #scrapers, it’s an important first line of defense – especially against well-known SEO and AI #crawlers. Pair it with more advanced server-side protections for a comprehensive #botblocking strategy.