Alex from StudyX in Wuhan showed me some interesting robots.txt files from various websites.
What is a robots.txt File?
A robots.txt file (the crawler exclusion protocol file) is a plain text file placed in the website root directory (e.g., domain.com/robots.txt) and used to guide search engine crawler behavior. Through directives such as User-agent, Disallow, and Allow, it controls which pages may be crawled and which should be excluded. robots.txt is an important component of website structure optimization and works best when combined with submitting your website to search engines. Note:
- Not a mandatory constraint: Compliant crawlers will follow the rules, but malicious crawlers may ignore them
- Difference from meta tags:
| Type | Control Scope | Implementation | Priority |
|---|---|---|---|
| robots.txt | Site-wide or directory level | Root directory file | Lower |
| meta robots | Single page level | HTML head tag | Higher |
| X-Robots-Tag | Non-HTML files | HTTP response header | Highest |
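For comparison, here is roughly what the other two mechanisms in the table look like (the directive values are illustrative, not a recommendation for any specific page):

```
<!-- meta robots: page-level directive placed in the HTML <head> of a single page -->
<meta name="robots" content="noindex, follow">
```

```
# X-Robots-Tag: sent as an HTTP response header, useful for non-HTML files such as PDFs
X-Robots-Tag: noindex
```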
Core Functions of robots.txt
1. Optimize Crawl Budget
Large websites (10,000+ pages) can block low-value pages (such as duplicate content and test environments) so that search engines spend their crawl budget on important pages first. This helps optimize website structure and improve indexing efficiency. Typical cases (a sample rule set follows the list):
- E-commerce filter parameter pages (`?color=red&size=XL`)
- Internal search result pages (`/search?q=keyword`)
- User account pages (`/my-account/orders`)
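A sketch of how these low-value paths could be blocked (the paths and parameter names are illustrative, not taken from a real site):

```
User-agent: *
# Faceted navigation / filter parameter pages
Disallow: /*?*color=
Disallow: /*?*size=
# Internal search result pages
Disallow: /search
# User account pages
Disallow: /my-account/
```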
2. Protect Sensitive Resources
```
User-agent: *
Disallow: /wp-admin/
Disallow: /confidential.pdf
```
This can prevent backend login pages and confidential documents from being crawled. Note, however, that blocked URLs may still be indexed if other pages link to them, so combine robots.txt with noindex tags — and keep in mind that a crawler can only see a noindex tag on pages it is allowed to crawl.
3. Manage AI Crawlers
For AI training crawlers like GPTBot and ClaudeBot, you can restrict them with specific rules:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
robots.txt Syntax and Writing Standards
1. Basic Directives
- `User-agent`: specifies the crawlers a rule group applies to (e.g., Googlebot-Image)
- `Disallow`: paths prohibited from crawling
- `Allow`: exceptions permitted inside a disallowed directory
- `Sitemap`: declares the XML sitemap location
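Putting the four directives together, a minimal file might look like this (the domain and paths are placeholders):

```
User-agent: Googlebot-Image
Disallow: /assets/raw/

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html

Sitemap: https://example.com/sitemap.xml
```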
2. Pattern Matching
- `*` wildcard: matches any character sequence. Example: `Disallow: /tmp/*` blocks all content under the /tmp directory.
- `$` end marker: matches the exact end of a URL. Example: `Allow: /news/*.html$` allows news pages whose URLs end with .html.
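A short sketch combining both operators (the directory names are hypothetical):

```
User-agent: *
# Wildcard: block everything under /tmp/
Disallow: /tmp/*
# End marker: within /docs/, allow only URLs that end exactly in .pdf
Disallow: /docs/
Allow: /docs/*.pdf$
```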
3. Priority Rules
- More specific paths take priority: `Allow: /shop/shoes/` will override `Disallow: /shop/`
- For the same path, `Allow` takes priority over `Disallow`
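A minimal sketch of the first rule in action (assuming Google's longest-match behavior; paths are illustrative):

```
User-agent: *
Disallow: /shop/
Allow: /shop/shoes/

# /shop/hats/red-hat    -> blocked (only the Disallow rule matches)
# /shop/shoes/running-1 -> crawlable (the longer, more specific Allow rule wins)
```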
Best Practices and Common Mistakes
✅ Correct Practices
1. Structured Writing
```
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Allow: /blog/*.html$

Sitemap: https://example.com/sitemap.xml
```
2. Multi-Subdomain Management
Each subdomain needs its own robots.txt file; maintenance can be centralized by redirecting every subdomain's /robots.txt to a single copy:
```
# cdn.example.com/robots.txt
User-agent: *
Disallow: /temp/
```

```
# www.example.com — server configuration (e.g., Apache), not a robots.txt directive
Redirect 301 /robots.txt https://cdn.example.com/robots.txt
```
❌ Common Mistakes
1. Blocking Critical Resources
Wrong example: `Disallow: /css/` blocks crawling of CSS files, which prevents search engines from rendering pages correctly and understanding their layout. Render-critical resources (CSS, JavaScript) should stay crawlable.
2. Incorrect Use of Absolute Paths
Wrong: Disallow: https://example.com/private/
Correct: Disallow: /private/
3. Ignoring Case Sensitivity
Rule: Disallow: /PDF/ will not match /pdf/ directory
Advanced Application Scenarios
1. Dynamic Parameter Control
For CMS systems like WordPress:
```
User-agent: *
Disallow: /*?*
Allow: /*?utm_*
```
Allows marketing links with UTM parameters, blocks other dynamic pages.
2. Category-Based Control
Treat different crawlers differently:
```
User-agent: Googlebot-News
Allow: /press-releases/

User-agent: Googlebot-Image
Disallow: /assets/
```
Validation and Debugging Tools
1. Google Search Console
- Real-time testing tool: Detects syntax errors and rule conflicts
- Crawl statistics report: Monitors crawl frequency changes for each directory
2. Log Analysis
Identify non-compliant crawlers through server logs:
```
66.249.66.1 - - [15/Jul/2024:12:34:56 +0000] "GET /wp-admin/ HTTP/1.1" 200 4321 "Googlebot/2.1"
```
Interesting robots.txt Content
Below are classic case studies of creative easter eggs embedded in robots.txt files by globally renowned companies:
1. Google – Terminator Easter Egg
```
User-agent: T-800
User-agent: T-1000
Disallow: /+LarryPage
Disallow: /+SergeyBrin
```
A tribute to the Terminator film series: the T-800 (Schwarzenegger's model) and the T-1000 are forbidden from crawling the Google+ profile pages of founders Larry Page and Sergey Brin, metaphorically "protecting the founders from killer robots."
Related trivia: Google's companion file, google.com/humans.txt, reads: "Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see careers.google.com."
2. Nike – Brand Slogan Extension
```
# www.nike.com robots.txt -- just crawl it.
#
# [ASCII art of the Nike Swoosh logo, drawn in comment lines]
```
Rewrites the brand slogan "Just Do It" as "Just Crawl It" and embeds ASCII art of the company logo, maintaining technical standards while strengthening brand recognition.
3. YouTube – Future Robot War
```
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.
```
Opens with dark humor: the file was supposedly "created in the distant future (the year 2000)" after a robot uprising in the mid-90s wiped out all humans.
4. TripAdvisor – Hidden Recruitment Channel
```
# Hi there,
#
# If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
#
# Run - don't crawl - to apply to join Tripadvisor's elite SEO team
#
# Email seoRockstar@tripadvisor.com
#
# Or visit https://careers.tripadvisor.com/search-results?keywords=seo
#
# [ASCII art of the Tripadvisor owl logo, drawn in comment lines]
```
Extends an olive branch, via comments, to engineers who peek at the file, precisely targeting people with technical curiosity — and reportedly attracting a steady stream of applications through this unconventional channel.
5. Yelp – Three Laws of Robotics Declaration
```
# By accessing Yelp's website (© 2025) you agree to Yelp's Terms of Service, available at
# https://www.yelp.com/static?country=US&p=tos
#
# If you would like to inquire about crawling Yelp, please contact us at
# https://www.yelp.com/contact
#
# As always, Asimov's Three Laws are in effect:
# 1. A robot may not injure a human being or, through inaction, allow a human
#    being to come to harm.
# 2. A robot must obey orders given it by human beings except where such
#    orders would conflict with the First Law.
# 3. A robot must protect its own existence as long as such protection does
#    not conflict with the First or Second Law.
```
Quotes the classic setting from science fiction master Isaac Asimov, elevating the technical file to a philosophical declaration.
6. Etsy – Greeting and ASCII Robot
```
#
# What's up?
#      \
#     -----
#    | . . |
#     -----
#   \--|-|--/
#     |   |
#    |-------|
```
Pairs a friendly greeting with a small ASCII robot, showcasing the artistic temperament of the handmade e-commerce platform.
7. Screaming Frog – Brand Pun and Employer Image
```
# Screaming Frog - Search Engine Marketing
# If you're looking at our robots.txt then you might well be interested in our current SEO vacancies :-)
# https://www.screamingfrog.co.uk/careers/
#
# [ASCII art of the Screaming Frog logo]
```
Turns the robots.txt file itself into a recruiting channel aimed squarely at SEO practitioners — the same audience that uses the company's crawler — while the ASCII frog reinforces the brand pun in the company name.
8. Glassdoor
```
# Greetings, human beings!,
#
# If you're sniffing around this file, and you're not a robot, we're looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet, and help improve the way people everywhere find jobs?
#
# Run - don't crawl - to apply to join Glassdoor's SEO team here http://jobs.glassdoor.com
#
```
9. CloudFlare
```
# .__________________________.
# | .___________________. |==|
# | | ................. | | |
# | | ::[ Dear robot ]: | | |
# | | ::::[ be nice ]:: | | |
# | | ::::::::::::::::: | | |
# | | ::::::::::::::::: | | |
# | | ::::::::::::::::: | | |
# | | ::::::::::::::::: | | ,|
# | !___________________! |(c|
# !_______________________!__!
# / \
# / [][][][][][][][][][][][][] \
# / [][][][][][][][][][][][][][] \
#( [][][][][____________][][][][] )
# \ ------------------------------ /
# \______________________________/
# _-_
# /~~ ~~\
# /~~ ~~\
#{ }
# \ _- -_ /
# ~ \\ // ~
#_- - | | _- _
# _ - | | -_
# // \\
# OUR TREE IS A REDWOOD
# ________
# __,_, | |
# [_|_/ | OK |
# // |________|
# _// __ /
#(_|) |@@|
# \ \__ \--/ __
# \o__|----| | __
# \ }{ /\ )_ / _\
# /\__/\ \__O (__
# (--/\--) \__/
# _)( )(_
# `---''---`
```

10. Wikipedia
```
#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /
```
11. Merriam Webster
```
##############################################################################
# This is a production robots.txt
##############################################################################

User-agent: *
Disallow: /wotd-signup-message
Disallow: /wotd-signup-result
Disallow: /word-of-the-day/manage-subscription
Disallow: /interstitial-ad
Disallow: /my-saved-words/dictionary/starclick/
Disallow: /lapi
Disallow: /assets/mw/static/old-games/

Sitemap: https://www.merriam-webster.com/sitemap-ssl/sitemap_index.xml

##############################################################################
# This is a production robots.txt
##############################################################################
```
Conclusion
robots.txt is a fundamental tool for SEO. Proper use can reportedly improve crawl efficiency by 20–30% (Semrush, 2024). Audit your rules quarterly and test them thoroughly with tools like Screaming Frog to make sure crawl resources are well allocated. After configuring robots.txt, submit your website to search engines and check for redirect chain issues so that your site structure is fully crawlable.
Tip: After modifying robots.txt, it's recommended to submit updates in Google Search Console to accelerate search engine re-crawling.
