Are AI Crawlers a Necessary ‘Evil’?

Current-gen AI tools crawl the internet to inform their outputs. For end-users, this connectivity is convenient, since most models can scour for and source more relevant information than what’s available in their training data. But what about the website owners providing the content they crawl?

It turns out that AI crawlers have skewed web traffic and introduced new challenges that website owners must face now and into the future. Let's talk about it.

The Problem(s) With AI Crawlers

Challenge No. 1 is that AI-driven web crawlers generate a lot of traffic. Toward the end of 2024, these bots were booming, led by OpenAI's GPTBot. According to Vercel, which analyzed traffic data across its website network, the crawler made 569 million requests in a single month. Anthropic's Claude made 370 million. Vercel put these figures into perspective:

Combined, GPTBot and Claude’s requests represent about 20% of Googlebot’s 4.5 billion requests during the same period.

That's a lot of traffic in a short time, and it takes a toll. AI crawlers consume substantial bandwidth and server resources, which can slow sites down and lead to increased operational costs.

The other main challenge here is that bot traffic can distort website analytics, making it difficult for site owners to accurately assess user engagement and performance metrics amid a swarm of AI crawlers. General invalid traffic (GIVT) surged 86% at the end of last year because of this, according to research from DoubleVerify.

With AI’s staying power, these challenges will undoubtedly persist, meaning website owners need to take matters into their own hands to remediate any consequences of increasing GIVT. Thankfully, you have options.

How To Handle Increased AI Crawler Traffic to Your Website

Website owners can employ several strategies — both technical and policy-based — to protect against increased bot traffic, particularly from AI crawlers that may strain resources, consume bandwidth or extract proprietary content.

That said, site owners likely want some of these crawlers exploring their domains so that they have a better chance of appearing in things like ChatGPT Search or Google’s AI Overviews. To this end, it’s a bit of a balancing act — but more on that later. First, tactics for mitigating the negative aspects of heightened bot traffic, including AI crawlers:

Strategies for Protecting Your Website from Excessive Bots

It’s difficult to have 100% control over bot traffic and AI crawlers, especially without impacting user experience. However, several strategies can help mitigate the negative impacts of bots on site performance and privacy with varying degrees of effectiveness — it just depends on what your goals are:

Control Access via robots.txt

robots.txt is a plain text file that tells compliant bots which parts of your site they may crawl and which to avoid. It works by blocking or restricting crawling for known user agents, like GPTBot. It's easy to implement and respected by many of today's most popular and reputable AI crawlers; however, it isn't enforceable. Malicious or non-compliant bots can and will ignore it.
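
For illustration, here's a minimal robots.txt that asks two well-known AI crawlers to stay out while leaving the rest of the site open. The user-agent names are the ones OpenAI and Anthropic publish for their crawlers; double-check the current strings in each vendor's documentation before relying on them:

  # Ask specific AI crawlers not to crawl anything
  User-agent: GPTBot
  Disallow: /

  User-agent: ClaudeBot
  Disallow: /

  # Everyone else may crawl the whole site
  User-agent: *
  Disallow:

An empty Disallow value, as in the last rule, means "no restrictions" for that user agent.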

Server-Side Firewalls and Rate Limiting

Server-side firewalls and rate-limiting mechanisms are essential infrastructure components that inspect, filter and manage incoming HTTP requests before they reach your website or web application. For firewalls, you can define access rules based on request metadata like IP address or range, region, user-agent string (e.g., GPTBot) and more.

This strategy is highly customizable: you can tailor rules per bot, endpoint or geography. However, legitimate users (e.g., those behind corporate proxies or automated tools) risk getting blocked.
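
As a sketch, here's what that could look like in an nginx configuration, assuming nginx sits in front of your site. The bot names, rate limits and backend address are placeholders to adapt, and the map and limit_req_zone directives belong in the http context:

  # Flag AI crawlers by user-agent (names are examples; adjust as needed)
  map $http_user_agent $ai_bot {
      default      0;
      ~*GPTBot     1;
      ~*ClaudeBot  1;
  }

  # Allow roughly 5 requests per second per client IP
  limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

  server {
      listen 80;

      location / {
          # Refuse flagged AI crawlers outright
          if ($ai_bot) {
              return 403;
          }
          # Rate-limit everyone else, with a small burst allowance
          limit_req zone=perip burst=20 nodelay;
          proxy_pass http://127.0.0.1:8080;  # placeholder backend
      }
  }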

Manually Monitor, Audit and Respond

A little human oversight is always a good thing. While 24/7 monitoring isn't realistic, setting up a firewall or configuring your robots.txt file isn't a set-it-and-forget-it task either. Best practice is to monitor your logs and analytics for spikes in traffic from new or suspicious bots. When you see anomalies that clearly trace back to unwanted bots or crawlers, block or rate-limit them.
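
As a starting point, here's a minimal Python sketch that tallies requests per user-agent in a standard "combined" format access log so unusual crawlers stand out. The log path is a placeholder, and the format assumption may need adjusting for your server:

  # Count requests per user-agent in a combined-format access log.
  import re
  from collections import Counter

  LOG_PATH = "/var/log/nginx/access.log"  # placeholder path

  # Combined log lines end with: "referer" "user-agent"
  UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

  counts = Counter()
  with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
      for line in log:
          match = UA_PATTERN.search(line)
          if match:
              counts[match.group("ua")] += 1

  # Print the 10 busiest user-agents so unusual bots stand out
  for user_agent, total in counts.most_common(10):
      print(f"{total:>8}  {user_agent}")

Run something like this on a schedule or after a suspicious spike, and compare the top entries against what you normally see.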

Solving Analytics Issues

The most important thing to understand when dealing with inflated and confusing analytics caused by bot traffic is your benchmarks. If you know what’s normal, you can more easily identify when something doesn’t look right and investigate further.

For example, if you normally have X amount of website traffic and one day you have 3X, it’s reasonable to assume it could be spam traffic. This is when you should head into GA4 or a similar tool and start exploring where it came from to be sure. Here are a few common abnormalities to look for that could indicate spam:

  • Traffic all came in at one time.
  • Traffic all came from the same city.
  • Traffic is all direct.

Once you have an idea of what happened or how much spam traffic you're dealing with, you can try separating the junk data from the real thing to understand how actual people interacted with your site during that time.
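
If you export daily traffic to a CSV (GA4 can do this), a small script can flag the days that blow past your benchmark automatically. This is a minimal sketch; the file name, column names and 3X threshold are assumptions to adapt:

  # Flag days where sessions exceed 3x the trailing 28-day median.
  import csv
  import statistics

  rows = []
  with open("daily_sessions.csv", newline="") as f:  # hypothetical export
      for row in csv.DictReader(f):
          rows.append((row["date"], int(row["sessions"])))

  for i, (date, sessions) in enumerate(rows):
      window = [s for _, s in rows[max(0, i - 28):i]]
      if len(window) >= 7:  # need some history before judging a day
          baseline = statistics.median(window)
          if baseline and sessions > 3 * baseline:
              print(f"{date}: {sessions} sessions vs. baseline {baseline:.0f} -- investigate")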

This becomes especially important when you’re running a campaign, like email, that’s designed to send traffic to the site. Differentiating between traffic increases from humans opening or clicking your emails and spam bots is necessary to determine success. Here’s what you can do ahead of your next campaign for more clarity on traffic:

  • Have strong tracking mechanisms and reliable reporting systems in place — such as GA4 and Looker Studio — so you can see traffic sources.
  • Use consistent UTMs and other tracking mechanisms so you can easily identify real traffic (see the sketch after this list).
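
For the UTM piece, a small helper can keep campaign links consistent so real clicks are easy to isolate in reports later. This is a minimal sketch with placeholder campaign values:

  # Build consistently UTM-tagged campaign links.
  from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

  def tag_url(url: str, source: str, medium: str, campaign: str) -> str:
      parts = urlsplit(url)
      query = dict(parse_qsl(parts.query))
      query.update({
          "utm_source": source,
          "utm_medium": medium,
          "utm_campaign": campaign,
      })
      return urlunsplit(parts._replace(query=urlencode(query)))

  print(tag_url("https://example.com/offer", "newsletter", "email", "spring_launch"))
  # https://example.com/offer?utm_source=newsletter&utm_medium=email&utm_campaign=spring_launch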

Creating Balance & Final Thoughts

All of this is a nuanced challenge of the AI era. How do you make your content available for beneficial indexing and discovery, while protecting your resources and user experience from abuse or overreach? 

AI crawlers are likely here for good. To catch up with and stay ahead of the curve, website owners and marketers should shift from reactive defense to proactive strategy. Of course, not all AI bots are threats — many website owners would be pleased to have a generative AI tool recommend their website, product or service — so we shouldn’t treat them all as such. Perhaps we start categorizing crawlers and other bots as a new type of audience, with unique behaviors and risks.

Maybe it’s about moving away from a “blocking everything” mindset to one of managed exposure — carefully deciding what content serves strategic goals when indexed by AI, and building the technical infrastructure to support that vision without compromising performance, privacy or user trust. What do you think?

Note: This article was originally published on contentmarketing.ai.


