
Web Crawler: Search Engine & AI Crawlers

Full-spectrum crawler guide: crawl vs scrape, robots.txt vs noindex, Google-Extended, search bots (Googlebot, Bingbot, Yandex, Baidu), AI bots (GPTBot, OAI-SearchBot, ClaudeBot), browser agents, third-party bots, and bot traffic control.

Updated on April 21, 2026
TL;DR

Key Takeaways

This guide explains web crawlers for site owners: terminology, protocols, major search and AI user agents, browser agents, third-party bots, and bot-traffic defenses. If you are choosing scraping stacks (proxies, hosted APIs, Playwright), follow the Web Scraping Tools link just before the Crawling, Scraping, and Site Controls section below; that guide is complementary in scope.

  • Crawling discovers and fetches URLs from the web, while scraping extracts data from pages—two distinct concepts often confused.
  • Learn how robots.txt directives, Google-Extended tokens, and AI crawler user-agents differ, plus which directives each major crawler respects.
  • Consider crawl budget allocation, JavaScript rendering requirements, and whether your CDN or server infrastructure can handle crawler traffic patterns.
  • Search crawlers (Googlebot, Bingbot, Yandex, Baidu) render JS broadly; many AI training crawlers read HTML without executing JS.
  • OpenAI and Anthropic split training bots, search index bots, and user-triggered fetchers—configure each crawler category separately for granular control.


Introduction: Web Crawlers

Web crawlers (also called spiders or bots) are automated programs that discover and fetch web content. They are foundational to search engines and AI systems: by following hyperlinks they continuously discover new pages and fetch their content, which feeds search indexes and AI model training.

Three details recur throughout this guide. Browser-style AI agents may use signed HTTP requests (RFC 9421), because User-Agent strings alone are spoofable. SEO tools, social preview fetchers, and monitors also hit your origin, so separate them from search bots in your logs. And telling good bots from bad bots (scrapers, credential stuffing) calls for defenses beyond editing robots.txt.

In practice you need more than two buckets: search indexing, AI training, AI search/retrieval, user-triggered fetches, browser agents, third-party SEO and monitoring bots, and malicious automation. Each has different goals, user-agent tokens, and risk trade-offs.

This article is not legal advice. Vendor behaviors, user agents, and percentages change—verify against each operator's latest documentation.

To learn more about how search engines work, check our complete guide on how search engines work. For generative visibility strategy, see GEO / AEO.

This page is written for site owners and operators: who is hitting your origin, how robots/WAF fit together, and how to reason about bot traffic. If you are on the data-collection side—picking proxies, hosted scraping APIs, browser automation, or RAG fetch pipelines—see Web Scraping Tools as the complementary guide (we avoid duplicating vendor comparison blocks here).

Crawling, Scraping, and Site Controls

Crawling usually means automatically discovering and fetching URLs at scale. Scraping (web scraping) emphasizes extracting structured data from pages or APIs (text, tables, prices). People use the words interchangeably; many programs both crawl links and scrape fields.

Crawling is not indexing. A search engine may fetch a URL without indexing it, or show a URL without a recent crawl. robots.txt (see Google's robots introduction and RFC 9309) asks cooperative crawlers which paths not to fetch. noindex and X-Robots-Tag steer indexing. Blocking everything with Disallow can prevent a crawler from ever seeing a noindex—a common misconfiguration.

Google-Extended is a robots.txt user-agent token that controls whether content Google has crawled may be used for certain Gemini-related training and grounding. Google's documentation states that it is not a separate switch for Google Search and does not affect ordinary Search rankings. It also has no HTTP user-agent string of its own, so requests still arrive under familiar Google user agents; read Google's common crawlers page for details. Configure training, search indexing, and AI retrieval deliberately.
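To make the per-token idea concrete, here is a minimal sketch that checks a robots.txt policy with Python's standard-library parser. The policy, the `/admin/` path, and the `example.com` URL are illustrative assumptions, not recommendations, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of Gemini-related training use and OpenAI
# training, keep everyone out of /admin/, and leave search crawling open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
page = "https://example.com/blog/post"  # placeholder URL

for token in ("Googlebot", "Google-Extended", "GPTBot", "OAI-SearchBot"):
    verdict = "may fetch" if rules.can_fetch(token, page) else "is asked not to fetch"
    print(f"{token} {verdict} {page}")
```

Note that a crawler blocked by `Disallow` never downloads the page, so it cannot see a `noindex` tag or `X-Robots-Tag` header on it; that is the misconfiguration described above.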

Search Engine Crawlers

Search engine crawlers are the automated clients that fetch pages so engines can refresh their indexes. The Crawling → Indexing → Serving pipeline, inverted indexes, and query-time assembly are explained in the companion article linked in the Introduction—this page stays on who is requesting your URLs, how they identify themselves, and how they fetch resources.

Major Search Engine Crawlers

Major search engine crawlers include Googlebot (Google's crawler for Google Search, whose infrastructure Gemini also draws on), Bingbot (Microsoft Bing's crawler), YandexBot (Yandex's crawler), and Baiduspider (Baidu's crawler). These crawlers discover and fetch content efficiently and, to varying degrees, render JavaScript so they can handle modern web apps.

Googlebot is the highest-volume crawler in Vercel's data, generating about 4.5 billion requests in a month on Vercel's network. It uses the Chrome rendering engine to process JavaScript and fully render modern web apps, and Bingbot has similar rendering capability. These crawlers typically fetch from multiple locations (for example, several US data centers), which spreads load and improves coverage.

JavaScript Rendering Capability

Modern search engine crawlers (Googlebot, Bingbot) execute JavaScript and render the resulting page, so single-page applications built with React, Vue, or Angular can be crawled and indexed. Googlebot renders with the Chrome engine and handles modern CSS and Ajax-style requests. It fetches over HTTP, however, and Google's documentation says it does not support other connection types such as WebSockets, so content that arrives only over a WebSocket should not be assumed visible.

JavaScript rendering also costs extra resources. If scripts are heavy or slow to load, crawling efficiency can suffer. Use server-side rendering (SSR) or static site generation (SSG) so key content is crawled quickly, and use JavaScript to enhance the experience. Key content (article body, product info, metadata) should be present in the initial HTML response rather than relying entirely on JavaScript rendering.
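One way to check the "initial HTML" advice is to look at the raw response the way a non-rendering crawler would. The sketch below uses only the standard library; the sample pages and phrases are made up for illustration, and in practice you would feed it the body returned by a plain HTTP GET.

```python
from html.parser import HTMLParser


class InitialTextExtractor(HTMLParser):
    """Collects text outside <script>/<style>: roughly what a crawler
    that does not execute JavaScript can read."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def missing_from_initial_html(html: str, required_phrases: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the unrendered HTML."""
    extractor = InitialTextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    return [p for p in required_phrases if p.lower() not in text]


# Illustrative pages: one server-rendered, one that fills in content client-side.
SSR_HTML = "<html><body><h1>Blue Widget</h1><p>Price: $19</p></body></html>"
CSR_HTML = (
    '<html><body><div id="root"></div>'
    '<script>render("Blue Widget", "Price: $19")</script></body></html>'
)

print(missing_from_initial_html(SSR_HTML, ["Blue Widget", "Price: $19"]))
print(missing_from_initial_html(CSR_HTML, ["Blue Widget", "Price: $19"]))
```

An empty result for a page means every required phrase is available without rendering; anything listed depends on JavaScript to appear.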

AI Crawlers

AI crawlers are used by AI companies (such as OpenAI and Anthropic) to collect web data for training LLMs and for grounding answers, helping AI tools generate accurate, human-like responses. With the rise of ChatGPT, Claude, and Perplexity, AI crawlers have become an important part of web traffic.

According to Vercel and MERJ research, AI crawler traffic is substantial: in the month studied, OpenAI's GPTBot generated 569 million requests on Vercel's network and Anthropic's Claude generated 370 million. Combined (939 million requests), that is about 20% of Googlebot's 4.5 billion requests in the same period, which makes AI crawlers a significant force in the crawler ecosystem.

Types of AI Crawlers

AI crawlers have two main purposes: model training (collecting data to train LLMs) and real-time retrieval (fetching pages around user queries to provide the latest information and citation links). Training crawlers (GPTBot, ClaudeBot) crawl continuously to gather training data. On the retrieval side, OAI-SearchBot builds the index behind ChatGPT search, while ChatGPT-User fetches pages when a user's request triggers it.

Training crawlers focus on collecting large amounts of quality content, crawling various types including HTML, images, JavaScript files. Retrieval crawlers focus on quickly getting latest info during queries, typically referencing search engine indexes (e.g., Bing index) to find relevant pages, then crawling content to generate answers.

Major AI Crawlers

Major AI crawlers include: GPTBot (OpenAI for training), ChatGPT-User (OpenAI for retrieval), OAI-SearchBot (OpenAI for search index), ClaudeBot (Anthropic's crawler), PerplexityBot (Perplexity's crawler), Bytespider (ByteDance's crawler), Amazonbot (Amazon's crawler).

OpenAI's crawler system includes multiple user agents: GPTBot for training, ChatGPT-User for retrieval, and OAI-SearchBot for building the ChatGPT search index. Anthropic's ClaudeBot is a general web crawler used for training Claude. PerplexityBot builds Perplexity's own AI-driven search index, which reduces its reliance on third-party search engines. In Vercel's data these crawlers typically operate from US data centers (Iowa, Arizona, Ohio).

AI Crawler Behavior Characteristics

AI crawler behavior differs significantly from search engine crawlers in four respects.

JavaScript rendering: Most AI crawlers (GPTBot, ClaudeBot) don't execute JavaScript and read only the initial HTML. In the Vercel and MERJ analysis, only Google's Gemini (which uses Googlebot infrastructure) and Applebot rendered JavaScript fully. Client-side rendered (CSR) web apps may therefore not be crawled correctly by AI crawlers.

Content type priority: AI crawlers prioritize different content types. ChatGPT prioritizes HTML (57.70% of requests), while Claude focuses on images (35.17%). Both spend a significant share of requests on JavaScript files (ChatGPT: 11.50%, Claude: 23.84%) despite not executing them. One possible explanation is that the models are meant to learn from many forms of web content, including JavaScript code as text.

Crawling efficiency: AI crawler efficiency is relatively low. 34.82% of ChatGPT's requests return 404, as do 34.16% of Claude's, and a further 14.36% of ChatGPT's requests follow redirects. By contrast, Googlebot sees only 8.22% 404s and 1.49% redirects. This suggests AI crawlers have room to improve in URL selection and validation.

Geographic distribution: All of the AI crawlers measured run from US data centers: ChatGPT from Des Moines, Iowa and Phoenix, Arizona, and Claude from Columbus, Ohio. Traditional search engine crawlers such as Googlebot crawl from multiple locations (including 7 in the US), which gives broader coverage.

Anthropic user agents (training, search, user fetch)

Anthropic documents separate bots such as ClaudeBot (training), Claude-SearchBot (search quality/indexing-style work), and Claude-User (user-directed retrieval). Each can be targeted independently in robots.txt—verify the latest names in Anthropic's crawler help article.
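When auditing logs, it can help to bucket requests by what the bot is for rather than by vendor. The following is a minimal sketch of that mapping using only tokens named in this article; the category labels and the sample User-Agent strings are illustrative assumptions, and because User-Agent headers can be spoofed the result is a hint, not proof of identity.

```python
# Token -> purpose, using crawler names mentioned in this article.
# Verify current names against each operator's documentation.
CRAWLER_CATEGORIES = {
    "GPTBot": "ai-training",
    "ClaudeBot": "ai-training",
    "OAI-SearchBot": "ai-search",
    "Claude-SearchBot": "ai-search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Googlebot": "search",
    "Bingbot": "search",
    "YandexBot": "search",
    "Baiduspider": "search",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the purpose category for a claimed User-Agent, or 'unknown'."""
    lowered = user_agent.lower()
    for token, category in CRAWLER_CATEGORIES.items():
        if token.lower() in lowered:
            return category
    return "unknown"


# Sample strings are simplified examples, not exact vendor formats.
for ua in (
    "Mozilla/5.0 (compatible; GPTBot/1.1)",
    "Mozilla/5.0 (compatible; Claude-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "curl/8.0",
):
    print(f"{classify_user_agent(ua):12} {ua}")
```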

Verifying real search crawlers

User-Agent strings are trivially spoofed. Google and Bing publish verification steps (often reverse DNS or allowlists). Yandex and Baidu emphasize reverse DNS patterns for their networks—use official webmaster documentation before you block an IP based on UA alone.
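The reverse-then-forward DNS check can be sketched as follows. The hostname suffixes reflect Google's and Bing's published verification guidance as best I know it, so treat them as assumptions to re-check against the official pages; the lookup functions are injectable so the logic can be exercised without network access, and the default forward lookup only covers IPv4.

```python
import socket
from typing import Callable, Sequence

# Assumed suffixes; confirm against each operator's verification docs.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def _reverse_dns(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host: str) -> Sequence[str]:
    return socket.gethostbyname_ex(host)[2]  # IPv4 addresses only


def verify_crawler_ip(
    ip: str,
    crawler: str,
    reverse_lookup: Callable[[str], str] = _reverse_dns,
    forward_lookup: Callable[[str], Sequence[str]] = _forward_dns,
) -> bool:
    """True if the IP reverse-resolves to the crawler's domain AND that
    hostname resolves back to the same IP."""
    suffixes = CRAWLER_DOMAINS[crawler]
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False
    if not host.endswith(suffixes):
        return False
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False


# Offline demonstration with fake DNS data (all values are made up).
FAKE_PTR = {"192.0.2.10": "crawl-192-0-2-10.googlebot.com",
            "203.0.113.9": "host.fakegooglebot.com"}
FAKE_A = {"crawl-192-0-2-10.googlebot.com": ["192.0.2.10"]}

for ip in FAKE_PTR:
    ok = verify_crawler_ip(ip, "googlebot", FAKE_PTR.__getitem__,
                           lambda host: FAKE_A.get(host, []))
    print(ip, "verified" if ok else "not verified")
```

The forward lookup matters: an attacker who controls reverse DNS for their own IP range can publish any hostname, but cannot make the genuine domain resolve back to their address.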

Browser Agents and Provable Identity

Some products run a browser automation flow that does not look like a classic “bot UA string.” OpenAI documents ChatGPT agent traffic that can be validated with HTTP Message Signatures (RFC 9421) and a Signature-Agent header—see ChatGPT agent allowlisting. This is a different control surface than listing GPTBot in robots.txt.

Do not equate ChatGPT-User (user-triggered fetches) with “search index bots.” OpenAI's docs explain that robots rules for fully automatic crawling may not apply to some user-initiated actions—read the current OpenAI crawler documentation when you audit compliance.

Third-Party and Non-Search Bots

Beyond search and AI vendors, your origin often sees: SEO platform crawlers (link indexes, site audits), social link preview fetchers (messaging and social apps pulling Open Graph tags), RSS/feed readers, uptime monitors, and archival projects. They may be polite and rate-limited—or noisy if misconfigured. They are not “Googlebot,” but they still consume CPU and bandwidth.

Common Crawl operates CCBot for open web datasets—another independent knob in robots.txt if you want to opt out of that ecosystem.

How to Manage Crawler Access

Website owners need to decide whether to allow crawler access based on brand goals and the balance of risk and benefit. For e-commerce sites, allowing all major AI crawlers may be beneficial: it contributes to the brand narrative and surfaces the latest content through real-time retrieval. Content publishers may need more nuanced strategies to protect content from AI search summarization, which may reduce organic traffic and engagement.

Strategies to Allow Crawler Access

If you want to be crawled, the following strategies are recommended.

Prioritize server-side rendering: ChatGPT and Claude don't execute JavaScript, so important content should be server-side rendered. This includes main content (articles, product info, docs), metadata (titles, descriptions, categories), and navigation structure. Use SSR, incremental static regeneration (ISR), or SSG so that all crawlers can access your content.

Client-side rendering still works for enhancements: For non-essential dynamic elements (view counters, interactive UI enhancements, live chat widgets, social feeds), continue using CSR. These don't affect crawler access to core content.

Efficient URL management: AI crawlers' high 404 rates highlight the importance of maintaining correct redirects, keeping sitemaps updated, and using consistent URL patterns. Make sure moved pages redirect correctly and important URLs never return 404, so crawlers spend their requests on live content.
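To see how your own site compares with the 404 and redirect figures quoted earlier, you can compute the same ratios per crawler from access logs. This is a rough sketch that assumes logs in the common "combined" format; the sample lines are fabricated for illustration, and the token list is a small subset you would extend.

```python
import re
from collections import Counter, defaultdict

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format access log line (an assumption about your log format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)
BOT_TOKENS = ("GPTBot", "ClaudeBot", "Googlebot")
REDIRECT_CODES = {301, 302, 307, 308}


def status_rates(lines):
    """Return per-bot request counts plus 404 and redirect rates."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        ua = match["ua"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in ua), None)
        if bot is None:
            continue
        status = int(match["status"])
        if status == 404:
            counts[bot]["not_found"] += 1
        elif status in REDIRECT_CODES:
            counts[bot]["redirect"] += 1
        counts[bot]["total"] += 1
    return {
        bot: {
            "requests": c["total"],
            "not_found_rate": c["not_found"] / c["total"],
            "redirect_rate": c["redirect"] / c["total"],
        }
        for bot, c in counts.items()
    }


SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET /old HTTP/1.1" 404 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:03 +0000] "GET /a HTTP/1.1" 301 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.2 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

for bot, stats in status_rates(SAMPLE_LOG).items():
    print(bot, stats)
```

URLs that a crawler repeatedly requests and receives 404 for are good candidates for a redirect or a sitemap cleanup.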

Push your latest and most important content to the index: sitemaps, the IndexNow protocol, and direct content submission to Bing are all ways to actively notify search engines that new content needs crawling. Because retrieval crawlers often lean on search engine indexes to find pages, this also helps AI crawlers focus on your priority content and make full use of their crawl budget.
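As a sketch of what an IndexNow notification looks like, the function below builds (but does not send) a bulk-submission request. The endpoint and JSON field names follow the public IndexNow protocol as I understand it; the host, key, and URLs are placeholders, and a real key must also be published on your site as the protocol requires.

```python
import json
import urllib.request
from urllib.parse import urlparse

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls, key_location=None):
    """Build a POST request announcing new or changed URLs for one host."""
    urls = list(urls)
    foreign = [u for u in urls if urlparse(u).hostname != host]
    if foreign:
        raise ValueError(f"URLs must belong to {host}: {foreign}")
    payload = {"host": host, "key": key, "urlList": urls}
    if key_location:
        payload["keyLocation"] = key_location
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_indexnow_request(
    host="example.com",
    key="hypothetical-key",
    urls=["https://example.com/new-article", "https://example.com/updated-page"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit: urllib.request.urlopen(request, timeout=10)
```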

Note: While these strategies are often called "GEO" (Generative Engine Optimization) recommendations, they are essentially SEO recommendations. Most search engine crawlers render JavaScript, but excessive or slow JavaScript still hurts efficiency. AI crawlers have their own weaknesses (high 404 rates, no JavaScript execution), and most websites have weak SEO foundations, so attempting GEO directly is "trying to run before learning to walk." There are few truly GEO-native optimization methods; most GEO work builds on SEO. Good GEO is simple: start from content that is friendly to users and search engines, make it LLM-friendly, and ensure AI crawlers and LLM web search APIs can see it.

Strategies to Block Crawler Access

If you don't want to be crawled, use these strategies.

Use robots.txt to control access: in the Vercel analysis, robots.txt was effective for all measured crawlers, although compliance is voluntary (see Bot Traffic Management below). Specify user agents or product tokens to set rules that limit AI crawler access to sensitive or unnecessary content. To find the user agents to block, check each company's documentation (for example, the Applebot and OpenAI crawler pages).

Use firewalls to block AI crawlers: Vercel's WAF provides an AI bot firewall rule that blocks AI crawlers with one click by configuring the firewall to reject their requests. Other CDNs and hosting platforms offer similar bot management.
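Where a managed firewall rule is not available, a similar policy can be approximated in the application itself. The sketch below is a generic WSGI middleware, not how Vercel's WAF or any other product works; the blocked token list is an example, and because it matches on the User-Agent header alone it only stops bots that identify themselves honestly.

```python
# Example tokens only; choose your own list from operator documentation.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")


class BlockDeclaredBots:
    """WSGI middleware that answers 403 to requests whose User-Agent
    contains one of the blocked tokens."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


app = BlockDeclaredBots(hello_app)


def status_for(user_agent: str) -> str:
    """Call the wrapped app with a minimal fake request and return its status."""
    seen = []
    app({"HTTP_USER_AGENT": user_agent},
        lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(status_for("Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"))
```

Blocking at the edge is usually preferable because rejected requests never reach your origin; this version mainly illustrates the matching logic.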

Note: If you block all AI crawlers indiscriminately, you may miss consumers who search for your products or services on non-Google platforms. An AI model's knowledge is limited by its training data; if a brand blocks all AI crawlers, models will learn about that brand from other sources (third-party sites, reviews, competitors). The only way to keep control of your brand narrative in AI search is to contribute to what models know about your brand.

If you don't want website content crawled for training, one blunt method is to add more JavaScript. Because most AI crawlers don't execute JavaScript, placing important content in JavaScript-rendered sections can keep it away from them. The trade-off is that this can also hurt SEO, so it needs balancing.

Bot Traffic Management

Bot traffic refers to website visits generated by automated programs rather than real users. According to the 2025 "Bad Bot Report" from Imperva (a Thales company), automated traffic surpassed human-generated traffic in 2024 for the first time in a decade, accounting for 51% of global internet traffic. The shift is mainly attributed to the rise of AI and LLMs, which simplify creating and deploying bots at scale.

Malicious bot growth is particularly significant: in 2024 malicious bots accounted for 37% of total internet traffic, up from 32% in 2023 and the sixth consecutive year of growth. In 2023, by comparison, bot traffic was nearly level with human traffic at 49.6% (32% malicious bots plus 17.6% good bots). The proliferation of AI tools has lowered the barrier for attackers, enabling large-scale creation and deployment of malicious bots that are increasingly sophisticated, mimic human behavior, and evade traditional security measures.

This change has profound effects on website operations. Over half of the "users" on the internet are automated programs rather than people, which means website analytics may be severely distorted, ad performance may be inflated, and content recommendation algorithms may be disrupted. Some industries are hit harder: in travel, malicious bots accounted for 41% of traffic in 2024, and the industry became the most attacked, receiving 27% of all bot attacks, up from 21% in 2023.

Bot traffic is mainly divided into two types: good bots (search engine crawlers, AI crawlers, monitoring tools) and bad bots (content scrapers, spam bots, DDoS bots). Good bots have positive effects; bad bots may cause data leaks, server overload, content theft.

Methods to identify bot traffic include: analyzing user behavior patterns (bots have repetitive, predictable behavior), checking user agent strings (bots may use outdated or suspicious agents), monitoring IP addresses and geography (malicious bots may come from unusual locations), analyzing session duration and page views (bot sessions differ from real users).

Bot traffic management strategies include: using WAF to filter bad bots, setting rate limits to prevent excessive access, using CAPTCHA to verify users, configuring robots.txt to control crawler access, regularly monitoring and analyzing logs to identify abnormal traffic. For good bots (search engine crawlers), allow normal access to ensure content is correctly indexed and discovered.
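Rate limiting is the most mechanical item on that list, so here is a minimal token-bucket sketch to show the idea. The rate, burst size, and the choice to key buckets by a client identifier such as IP address are illustrative assumptions; production setups usually rely on the limiter built into a CDN, WAF, or reverse proxy instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client identifier (for example, an IP address)."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = self.buckets[client_id] = TokenBucket(
                self.rate, self.capacity, self.clock)
        return bucket.allow()


# Demonstration with a fake clock so the outcome does not depend on timing.
fake_now = [0.0]
limiter = PerClientLimiter(rate_per_sec=1, capacity=2, clock=lambda: fake_now[0])

burst = [limiter.allow("192.0.2.7") for _ in range(3)]
fake_now[0] = 1.0  # one second later, one token has been refilled
later = limiter.allow("192.0.2.7")
print(burst, later)
```

A request that is not allowed would typically receive an HTTP 429 response; well-behaved crawlers slow down in response, while abusive ones are at least throttled.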

However, not every bot that claims to follow the rules actually complies with robots.txt, and multiple investigations indicate that some AI companies ignore it. WIRED's June 2024 investigation found that AI search company Perplexity used an undisclosed IP address (44.221.181.252) to bypass robots.txt restrictions and crawl content covertly. In the server logs of Condé Nast (WIRED's parent), this IP made at least 822 requests over the preceding three months, and the actual number is likely higher. Beyond the compliance problem, WIRED's tests also found "hallucination" problems: Perplexity's chatbot sometimes fabricated content rather than actually accessing and summarizing articles.

Bytespider, the crawler of ByteDance (TikTok's parent), raises similar concerns. According to October 2024 reporting by Fortune based on research from Kasada (a bot management company), Bytespider, released in April 2024, quickly became one of the most aggressive scraping bots. It reportedly crawls about 25x faster than OpenAI's GPTBot and about 3000x faster than Anthropic's ClaudeBot. More concerning, Bytespider reportedly does not comply with robots.txt even when website owners clearly indicate they don't want their content crawled.

To learn more about bot traffic management, check our complete guide on website traffic management.

Conclusion

Web crawlers are foundational components of search engines and AI systems, discovering and fetching the web content those systems depend on. Search engine crawlers (Googlebot, Bingbot) are highly optimized, render JavaScript fully, and build search indexes efficiently. AI crawlers (GPTBot, ClaudeBot) are smaller in scale but already a significant share of web traffic, serving both model training and real-time retrieval.

Website owners should map crawl vs index vs training vs retrieval, then choose robots tokens, rendering strategy, and security controls deliberately. For sites that want visibility, SSR/SSG for key HTML still matters—especially where AI training crawlers may not execute JavaScript. For sites that want restrictions, combine robots.txt, edge/WAF policies, and (where appropriate) authentication—do not rely on UA string blocking alone.

For indexing, sitemaps, robots rules, and internal linking, read website indexing, XML sitemaps, robots.txt, and internal links. The end-to-end search mechanics guide is linked in the Introduction; this article stays focused on who requests your URLs.

Frequently Asked Questions

What's the difference between search engine crawlers and AI crawlers?
Search engine crawlers (Googlebot, Bingbot) build search indexes to help users find pages. They're optimized with full JavaScript rendering, efficiently crawling and indexing content. AI crawlers (GPTBot, ClaudeBot) train LLMs or provide real-time retrieval, helping AI tools generate accurate answers. Most AI crawlers don't execute JavaScript, only reading initial HTML responses.
Do AI crawlers execute JavaScript?
Most AI crawlers (GPTBot, ClaudeBot, PerplexityBot) don't execute JavaScript and read only the initial HTML. Only Google's Gemini (using Googlebot infrastructure) and Applebot have full JavaScript rendering. CSR web apps may not be crawled correctly by AI crawlers. Use SSR or SSG to ensure key content is accessible to all crawlers.
How to block AI crawlers from accessing my website?
Use robots.txt to control AI crawler access. Specify user agents or product tokens to set rules, limiting access to sensitive or unnecessary content. For example, add User-agent: GPTBot and Disallow: / to block GPTBot. Firewalls like Vercel's WAF can also block AI crawlers. Note: blocking all AI crawlers may miss consumers searching on AI platforms.
How efficient are AI crawlers?
AI crawler efficiency is relatively low. According to Vercel and MERJ research, ChatGPT has 34.82% requests returning 404, Claude has 34.16%. ChatGPT has 14.36% requests following redirects. In contrast, Googlebot has only 8.22% 404s, 1.49% redirects. This indicates AI crawlers need improvement in URL selection and validation. Website owners can help by maintaining correct redirects, keeping sitemaps updated, using consistent URL patterns.
Should I allow AI crawlers to access my website?
It depends on your brand goals and the balance of risk and benefit. For e-commerce sites, allowing major AI crawlers may be beneficial: it contributes to the brand narrative and surfaces the latest content through real-time retrieval. Content publishers may need more nuanced strategies to protect content from AI search summarization, which may reduce organic traffic. If you block all AI crawlers, models will learn about your brand from other sources, and you lose control of the brand narrative.
How to optimize websites for AI crawlers?
Key strategies: 1) Prioritize SSR or SSG to ensure key content accessible to all crawlers; 2) Ensure important content (articles, product info, metadata) in initial HTML, not relying entirely on JavaScript; 3) Maintain correct redirects, avoid 404s; 4) Keep sitemaps updated, use IndexNow to notify search engines; 5) Use consistent URL patterns, avoid complex structures.
What content types do AI crawlers crawl?
AI crawlers crawl various web content including HTML, images, JavaScript files. According to Vercel research, ChatGPT prioritizes HTML (57.70% requests), Claude focuses on images (35.17%). Both spend significant time crawling JavaScript files (ChatGPT: 11.50%, Claude: 23.84%) despite not executing them. This may be because AI models need to learn various web content forms, including JavaScript code as text data.
How to see which crawlers access my website?
Use log file analysis to understand crawler access. Log data is highly reliable, providing important info including how crawlers find content, how much they find, where they encounter problems. Botify's LogAnalyzer provides automated log analysis. Use Google Search Console and Bing Webmaster Tools to monitor search engine crawler access. For AI crawlers, identify through user agent strings.
What is the difference between crawling and scraping?
Crawling usually means discovering and fetching URLs (often many pages). Scraping emphasizes extracting structured data from a response. Many automated systems do both. Neither term implies malicious intent by itself.
What is Google-Extended and does blocking it remove my site from Google Search?
Google-Extended is a robots.txt user-agent token described by Google for controlling certain Gemini-related training and grounding uses of fetched content. Google's documentation states it is not a separate Google Search ranking switch. Read Google's common crawlers page for the exact behavior statement.
How do I verify a request is really Googlebot or Bingbot?
Do not trust the User-Agent header alone. Follow the search engine's official verification method (commonly reverse DNS and forward DNS checks, or published IP ranges where provided). Spoofed crawlers are common in attack traffic.

References

  1. The rise of the AI crawler (Vercel Blog · 2026). Vercel analysis of AI crawler traffic.
  2. What Are AI Crawler Bots? (Botify · 2026). Botify overview of AI crawler bots.
  3. OpenAI Crawler Documentation (OpenAI · 2026). OpenAI crawler and user-agent documentation.
  4. Perplexity Is a Bullshit Machine (WIRED · 2026). WIRED investigation involving Perplexity crawling claims.
  5. TikTok's parent launched a web scraper that's gobbling up the world's online data 25 times faster than OpenAI (Fortune · 2024). Fortune coverage of Bytespider crawl volume claims.
  6. RFC 9309: Robots Exclusion Protocol (IETF · 2022). Formal specification for robots.txt.
  7. Google's common crawlers (incl. Google-Extended) (Google for Developers · 2026). Official notes on Google-Extended token behavior.
  8. Does Anthropic crawl the web (ClaudeBot, Claude-User, Claude-SearchBot)? (Anthropic · 2026). Anthropic crawler names and robots guidance.
  9. OAT-011 Scraping (OWASP · 2026). Automated threat taxonomy for scraping abuse.

