
SEO Tutorial

Understanding robots.txt Files: Definition, Functions, Syntax, and Interesting Content

Deep dive into the definition, functions, syntax standards, and best practices of robots.txt files. Learn how to optimize crawl budget, protect sensitive resources, manage AI crawlers, and explore creative robots.txt cases from globally renowned companies.

Kostja
March 29, 2025
Updated March 29, 2025
20 min read

Alex from StudyX in Wuhan showed me some interesting robots.txt files from various websites.

What is a robots.txt File?

robots.txt, also known as the crawler protocol file, is a plain text file placed in the website root directory (e.g., domain.com/robots.txt) and used to guide search engine crawler behavior. Through directives such as User-agent, Disallow, and Allow, it controls which pages may be crawled and which should be excluded. robots.txt is an important component of website structure optimization and works best when combined with website submission; a minimal sketch of how a compliant crawler consults the file follows the comparison table below. Note:

  • Not a mandatory constraint: Compliant crawlers will follow the rules, but malicious crawlers may ignore them
  • Difference from meta tags:
Type         | Control Scope                | Implementation       | Priority
robots.txt   | Site-wide or directory level | Root directory file  | Lower
meta robots  | Single page level            | HTML head tag        | Higher
X-Robots-Tag | Non-HTML files               | HTTP response header | Highest
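
To see how a compliant crawler interprets the file, here is a minimal sketch using Python's standard urllib.robotparser module; the domain, paths, and crawler name are placeholders:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A compliant crawler checks every URL before fetching it.
for path in ("/blog/post.html", "/wp-admin/"):
    url = f"https://example.com{path}"
    print(path, "->", "crawl" if parser.can_fetch("MyCrawler", url) else "skip")

Note that urllib.robotparser implements the original exclusion protocol and treats Google's * and $ extensions literally, so it is a conservative baseline rather than a full Googlebot emulation.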

Core Functions of robots.txt

1. Optimize Crawl Budget

Large websites (over 10,000 pages) can prioritize important pages by blocking low-value ones (such as duplicate content and test environments), so that search engines spend their crawl budget where it matters. This helps optimize website structure and improve indexing efficiency. Typical cases, translated into rules in the sketch after this list:

  • E-commerce website filter parameter pages (?color=red&size=XL)
  • Internal search result pages (/search?q=keyword)
  • User personal centers (/my-account/orders)
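
A minimal sketch of rules covering the three cases above; the paths and parameter names are illustrative and must be adapted to your site's actual URL structure:

User-agent: *
# E-commerce filter parameter pages
Disallow: /*?*color=
Disallow: /*?*size=
# Internal search result pages
Disallow: /search
# User personal centers
Disallow: /my-account/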

2. Protect Sensitive Resources

User-agent: *
Disallow: /wp-admin/
Disallow: /confidential.pdf

This can prevent backend login pages and confidential documents from being crawled. Note, however, that blocked URLs may still end up indexed if other pages link to them, so combine blocking with noindex tags, as sketched below.
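
One subtlety: a crawler can only see a noindex directive on pages it is allowed to fetch, so a URL that must vanish from the index should stay crawlable and carry the directive instead. A minimal sketch of the page-level tag and its header-level equivalent for non-HTML files such as the PDF above (the header would be set in your web server configuration):

<!-- In the HTML head: the page may be crawled but will not be indexed -->
<meta name="robots" content="noindex">

# Equivalent HTTP response header for non-HTML files (e.g., PDFs)
X-Robots-Tag: noindex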

3. Manage AI Crawlers

AI training crawlers such as GPTBot and ClaudeBot can be restricted with dedicated rules:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
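
To keep AI bots out of everything except a deliberately public section, group them and add an Allow exception. A minimal sketch, assuming a hypothetical /blog/ directory you want to remain available for training:

User-agent: GPTBot
User-agent: ClaudeBot
Allow: /blog/
Disallow: /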

robots.txt Syntax and Writing Standards

1. Basic Directives

  • User-agent: Specifies applicable crawlers (e.g., Googlebot-Image)
  • Disallow: Paths prohibited from crawling
  • Allow: Exceptions allowed in prohibited directories
  • Sitemap: Declares XML sitemap location

2. Pattern Matching

  • * wildcard: Matches any character sequence

    Example: Disallow: /tmp/* blocks all content under the /tmp directory

  • $ end marker: Anchors the match to the end of the URL

    Example: Allow: /news/*.html$ allows news pages ending in .html (see the sketch after this list)
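
Both operators map naturally onto regular expressions. A minimal, unofficial sketch of the conversion in Python:

import re

def rule_to_regex(pattern):
    # Rules match from the start of the URL path; '*' stands for any
    # character sequence and a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + re.escape(body).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

rule = rule_to_regex("/news/*.html$")
print(bool(rule.match("/news/2025/story.html")))         # True
print(bool(rule.match("/news/2025/story.html?page=2")))  # False: '$' anchors the end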

3. Priority Rules

  • More specific (longer) paths take priority: Allow: /shop/shoes/ overrides Disallow: /shop/
  • When rules are equally specific, Allow takes priority over Disallow (see the sketch below)
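
Google documents this as a longest-match rule, with Allow winning length ties. A self-contained sketch using plain prefix matching (wildcards ignored for brevity):

def decide(path, rules):
    # rules: (directive, pattern) pairs. The longest matching pattern
    # wins; on a length tie, "allow" beats "disallow" (True > False).
    hits = [(len(p), d == "allow", d) for d, p in rules if path.startswith(p)]
    return max(hits)[2] if hits else "allow"  # no match means crawlable

rules = [("disallow", "/shop/"), ("allow", "/shop/shoes/")]
print(decide("/shop/shoes/sneaker-42", rules))  # allow
print(decide("/shop/hats/fedora", rules))       # disallow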

Best Practices and Common Mistakes

✅ Correct Practices

1. Structured Writing

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Allow: /blog/*.html$

Sitemap: https://example.com/sitemap.xml

2. Multi-Subdomain Management

Each subdomain is served its own robots.txt from that host's root, so each needs independent maintenance. You can still keep a single source of truth by redirecting the file at the web server level (Googlebot follows robots.txt redirects):

# cdn.example.com/robots.txt -- the single maintained copy
User-agent: *
Disallow: /temp/

# On www.example.com, redirect at the server level (Apache syntax):
Redirect 301 /robots.txt https://cdn.example.com/robots.txt

❌ Common Mistakes

1. Blocking Critical Resources

Wrong example: Disallow: /css/ blocks crawlers from fetching CSS files, which prevents search engines from rendering pages correctly and understanding their layout

2. Incorrect Use of Absolute Paths

Wrong: Disallow: https://example.com/private/

Correct: Disallow: /private/

3. Ignoring Case Sensitivity

Rule: Disallow: /PDF/ will not match the /pdf/ directory, because paths in robots.txt are case-sensitive

Advanced Application Scenarios

1. Dynamic Parameter Control

For CMS systems like WordPress:

User-agent: *
Disallow: /*?*
Allow: /*?utm_*

This allows marketing links carrying UTM parameters while blocking other dynamic URLs: both rules match a UTM link, but the longer, more specific Allow rule wins, as the check below verifies.
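
A quick self-contained check of this rule pair, combining the wildcard conversion and longest-match logic sketched earlier:

import re

def matches(pattern, path):
    # Anchor at the start of the path; '*' stands for any character sequence.
    return re.match(re.escape(pattern).replace(r"\*", ".*"), path) is not None

rules = [("Disallow", "/*?*"), ("Allow", "/*?utm_*")]
for path in ("/shoes?color=red", "/landing?utm_source=newsletter"):
    hits = [(len(p), d == "Allow", d) for d, p in rules if matches(p, path)]
    print(path, "->", max(hits)[2] if hits else "Allow")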

2. Category-Based Control

Treat different crawlers differently:

User-agent: Googlebot-News
Allow: /press-releases/

User-agent: Googlebot-Image
Disallow: /assets/

Validation and Debugging Tools

1. Google Search Console

  • robots.txt report: Detects syntax errors and shows which version of the file Google last fetched (it replaces the retired standalone robots.txt Tester)
  • Crawl stats report: Monitors crawl frequency changes for each directory

2. Log Analysis

Identify non-compliant crawlers through server logs:

66.249.66.1 - - [15/Jul/2024:12:34:56 +0000] "GET /wp-admin/ HTTP/1.1" 200 4321 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
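
A request like this one, where a client claiming to be Googlebot fetches a disallowed path, usually indicates a spoofed user agent, since the real Googlebot honors robots.txt. At scale this check is easier to script; a minimal sketch (the log file name and blocked paths are assumptions):

import re

BLOCKED = ("/wp-admin/", "/checkout/")  # paths your robots.txt disallows
log_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

with open("access.log") as log:  # assumed log file location
    for line in log:
        m = log_re.match(line)
        if m and m.group(2).startswith(BLOCKED):
            print(f"possible non-compliant crawler: {m.group(1)} fetched {m.group(2)}")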

Interesting robots.txt Content

Below are classic case studies of creative easter eggs embedded in robots.txt files by globally renowned companies:

1. Google – Terminator Easter Egg

User-agent: T-800
User-agent: T-1000
Disallow: /+LarryPage
Disallow: /+SergeyBrin

A tribute to the Terminator film series: the T-800 (played by Arnold Schwarzenegger) and T-1000 models are forbidden from crawling the personal pages of founders Larry Page and Sergey Brin, playfully "protecting the founders from robot assassins."

Google pairs the joke with a companion humans.txt file, which reads: "Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see careers.google.com."

2. Nike – Brand Slogan Extension

# www.nike.com robots.txt -- just crawl it.

#                                                                                                    
#                 ``                                                                        ````.    
#               `+/                                                                 ``.-/+o+:-.      
#             `/mo                                                          ``.-:+syhdhs/-          
#            -hMd                                                    `..:+oyhmNNmds/-               
#          `oNMM/                                            ``.-/oyhdmMMMMNdy+:.                    
#         .hMMMM-                                     `.-/+shdmNMMMMMMNdy+:.                         
#        :mMMMMM+                             `.-:+sydmNMMMMMMMMMNmho:.                             
#       :NMMMMMMN:                    `.-:/oyhmmNMMMMMMMMMMMNmho:.                                  
#      .NMMMMMMMMNy:`          `.-/oshdmNMMMMMMMMMMMMMMMMMMMmhs/-                                       
#      hMMMMMMMMMMMMmhysooosyhdmNMMMMMMMMMMMMMMMMMMmds/-                                            
#     .MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNdy+-.                                                
#     -MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNdy+-.                                                    
#     `NMMMMMMMMMMMMMMMMMMMMMMMMMMMMMmyo:.                                                          
#      /NMMMMMMMMMMMMMMMMMMMMMMMmho:.                                                               
#       .yNMMMMMMMMMMMMMMMMmhs/.                                                                    
#         ./shdmNNmmdhyo/-                                                                         
#              ```

Rewrites the brand slogan "Just Do It" as "just crawl it" and embeds ASCII art of the swoosh logo, keeping the file technically valid while reinforcing brand recognition.

3. YouTube – Future Robot War

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

Opens with dark humor, claiming the file was created "in the distant future (the year 2000)" after a robot uprising in the mid-90s wiped out all humans, a playful nod to robot-apocalypse science fiction.

4. TripAdvisor – Hidden Recruitment Channel

# Hi there,
#
# If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
#
# Run - don't crawl - to apply to join Tripadvisor's elite SEO team
#
# Email seoRockstar@tripadvisor.com
#
# Or visit https://careers.tripadvisor.com/search-results?keywords=seo

#                    :+#%@@@@@@@@@@@%*+:                     
#               =%@@@@@@@@@@@@@@@@@@@@@@@@@#=                
#           :#@@@@@@@@@%+-:.     .:-+%@@@@@@@@@#:            
#  +@@@@@@@@@@@@@@@%*                  .*%@@@@@@@@@@@@@@@+   
#    *@@@@@@@@@@@@@@@@@%-           -%@@@@@@@@@@@@@@@@@+     
#    :%@@@@@@@%*+#@@@@@@@@+       *@@@@@@@@#+*%@@@@@@@%.     
#   %@@@@@:          .%@@@@%.   :%@@@@%           -@@@@@%    
#  %@@@%                %@@@@: -@@@@%               :@@@@%   
# #@@@%     =@@@@@@#     #@@@%.@@@@#     #@@@@@@=     %@@@+  
#.@@@@*    #@@@@@@@@%    .@@@@%@@@%     %@@@@@@@@*    *@@@@  
#:@@@@=    #@@@@@@@@@     %@@@@@@@%     @@@@@@@@@#    +@@@@. 
# %@@@%     %@@@@@@@-    +@@@@@@@@@=    +@@@@@@@%     %@@@#  
# -@@@@#      +##*      -@@@@@@@@@@@:     .*##=      %@@@@:  
#  -@@@@@:            .%@@@@@@@@@@@@@%             -@@@@@:   
#   .%@@@@@%=.    .-%@@@@@@@@@@@@@@@@@@@#:     .=%@@@@@%     
#     .%@@@@@@@@@@@@@@@@%:+@@@@@@%+-%@@@@@@@@@@@@@@@@#        
#         +%@@@@@@@@%*.     *@@@+     .*%@@@@@@@%#=          
#                             =

Extends an olive branch, via comments, to any engineer curious enough to open the file, precisely targeting talent with technical curiosity. The channel reportedly draws 200+ resumes annually (according to internal data).

5. Yelp – Three Laws of Robotics Declaration

# By accessing Yelp's website (© 2025) you agree to Yelp's Terms of Service, available at
# https://www.yelp.com/static?country=US&p=tos
#
# If you would like to inquire about crawling Yelp, please contact us at
# https://www.yelp.com/contact
#
# As always, Asimov's Three Laws are in effect:
# 1. A robot may not injure a human being or, through inaction, allow a human
#    being to come to harm.
# 2. A robot must obey orders given it by human beings except where such
#    orders would conflict with the First Law.
# 3. A robot must protect its own existence as long as such protection does
#    not conflict with the First or Second Law.

Quotes the classic setting from science fiction master Isaac Asimov, elevating the technical file to a philosophical declaration.

6. Etsy – ASCII Robot Greeting

#
# What's up?#   \
#
#    -----
#   | . . |
#    -----
#  \--|-|--/
#     | |
#  |-------|

Pairs a casual "What's up?" greeting with an ASCII robot, showcasing the playful, artistic temperament of the handmade e-commerce platform.

7. Screaming Frog – Brand Pun and Employer Image

# Screaming Frog - Search Engine Marketing

# If you're looking at our robots.txt then you might well be interested in our current SEO vacancies :-) 

# https://www.screamingfrog.co.uk/careers/

........?$$$M$$$$$$$$$$$$$$$$$$.........
..  ..$$$MM$$M$M$$$$$$$$$N$$$$$$$. . .. 
....$$$$$$7MMM.M.......MMMMM$$$$$$$.....
...$$$$$$.  MMMM   ...MMMMMMMMMM$$$$=.. 
..$$$$$..    MMM    MMMMMMMMMMM.$$$$$I..
.$$$$$...    MMM$MMMMMMMMMMMMM. .$$$$$..
$$$$$....  .$$MMMOMMMMMMMMMM .. ..$$$$$.
$$$$..... $$$$8MMMMMMMMMMMM........$$$$.
$$$$.. ..$$$$$$.MMMMMMMMMMM....... $$$$,
$$MM .M .$$$$= IMMMMMMMM$$M ...M..MM$$$7
$$$$MM7M?MMM$  MMMMMMMM$$$MM...M.M~$$$$7
$MMMMMMMIMMMM  MMMMMMMZ$$$.MMMMMMMMMMM$$
$$$$..MMMMMMMMMMMMMMMM$$$$. MMMMMMM$$$$$
$$$$. .MMMMMMMMMMMMMM.$$$$.   .  ..$$$$7
$$$$....MMMMMMMMMMMM .$$$$..   . . $$$$7
$$$$.... MMMMMMMMM  .$$$$$. . . . .$$$$$
$$$$......MMMMMMMMM$$$$$$$$$=====7$$$$$7
$$$$$   ..MODMMMMMMMM.$$$$$$$$$$$$$$$$$$$.
=$$$$$7............MMMMMMM .    . .. .   
..$$$$$$........ ..  ..MMM     .  .     
...$$$$$$$,........ ..MMMM  . .. ..   . 
.....$$$$$$$$$$$$$$$MMMM$$$$$MM.  . . . 
.......$$$$$$$$$$$$MMMMM$$M$MM$. .  ... 
....     .+$$$$$$DMMM$M8MMMMM$$.........
....     . .. . .MMMMMMMMMMMMMMM. . ..  
....         .   .   ...+. .    ... ....

The opening comment turns the file itself into a recruiting filter: anyone reading a robots.txt is likely exactly the kind of SEO professional Screaming Frog wants to hire, and the ASCII frog art reinforces the brand name. The recruitment note reportedly lifted application conversion by 37% (according to the company's 2023 recruitment report).

8. Glassdoor

# Greetings, human beings!,
#
# If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet, and help improve the way people everywhere find jobs?
#
# Run - don't crawl - to apply to join Glassdoor's SEO team here http://jobs.glassdoor.com
#

9. Cloudflare

#    .__________________________.
#    | .___________________. |==|
#    | | ................. | |  |
#    | | ::[ Dear robot ]: | |  |
#    | | ::::[ be nice ]:: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | | ,|
#    | !___________________! |(c|
#    !_______________________!__!
#   /                            \
#  /  [][][][][][][][][][][][][]  \
# /  [][][][][][][][][][][][][][]  \
#(  [][][][][____________][][][][]  )
# \ ------------------------------ /
#  \______________________________/

#       _-_
#    /~~   ~~\
# /~~         ~~\
#{               }
# \  _-     -_  /
#   ~  \\ //  ~
#_- -   | | _- _
#  _ -  | |   -_
#      // \\
# OUR TREE IS A REDWOOD

#              ________
#   __,_,     |        |
#  [_|_/      |   OK   |
#   //        |________|
# _//    __  /
#(_|)   |@@|
# \ \__ \--/ __
#  \o__|----|  |   __
#      \ }{ /\ )_ / _\
#      /\__/\ \__O (__
#     (--/\--)    \__/
#     _)(  )(_
#    `---''---`

10. Wikipedia

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

11. Merriam-Webster

##############################################################################
# This is a production robots.txt
##############################################################################

User-agent: *
Disallow: /wotd-signup-message
Disallow: /wotd-signup-result
Disallow: /word-of-the-day/manage-subscription
Disallow: /interstitial-ad
Disallow: /my-saved-words/dictionary/starclick/
Disallow: /lapi
Disallow: /assets/mw/static/old-games/
Sitemap: https://www.merriam-webster.com/sitemap-ssl/sitemap_index.xml

##############################################################################
# This is a production robots.txt
##############################################################################

Conclusion

robots.txt is a fundamental SEO tool. Used properly, it can improve crawl efficiency by 20%-30% (source: Semrush 2024 research data). Audit your rules quarterly and test them with tools such as Screaming Frog to keep crawl budget well spent. After configuring robots.txt, submit your website to search engines and check for redirect chain issues to keep the site structure complete.

Tip: After modifying robots.txt, it's recommended to submit updates in Google Search Console to accelerate search engine re-crawling.
