Blocking AI Scrapers Using .htaccess

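Place the following in your .htaccess to have Apache return an HTTP 403 (Forbidden) whenever the request's User-Agent matches one of the listed names. Matching is case-insensitive. Note that the list is not limited to AI scrapers: it also contains search engine crawlers (Googlebot, bingbot), link previewers, feed readers, and uptime monitors, so remove any agents you still want to allow.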

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Operator|ChatGPT-User|DuckAssistBot|Meta-ExternalFetcher|AI2Bot|Applebot-Extended|Bytespider|CCBot|ClaudeBot|cohere-training-data-crawler|Diffbot|FacebookBot|Google-Extended|GPTBot|Kangaroo Bot|Meta-ExternalAgent|omgili|PanguBot|Timpibot|Webzio-Extended|Amazonbot|Applebot|OAI-SearchBot|PerplexityBot|YouBot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "archive.org_bot|Arquivo-web-crawler|heritrix|ia_archiver|ia_archiver-web.archive.org|Nicecrawler|2ip bot|AhrefsSiteAudit|BingPreview|Chrome-Lighthouse|Dark Visitor Server|deadlinkchecker|Google-InspectionTool|rogerbot|SiteAuditBot|t3versionsBot|W3C_CSS_Validator|W3C_Validator|WellKnownBot|BazQux|bitlybot|BublupBot|Discordbot|Embedly|facebookexternalhit" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Feedly|FlipboardProxy|FreshRSS|Friendica|Google Web Preview|Google-Read-Aloud|Hatena|Iframely|inoreader|LinkedInBot|Mail.RU_Bot|Mastodon|Miniflux|NewsBlur|Nextcloud|Pinterestbot|PocketParser|redditbot|SerendeputyBot|SimplePie|SkypeUriPreview|Slackbot-LinkExpanding|Snap URL Preview Service|snapchat|startmebot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Superfeedr|SurdotlyBot|Synapse|TelegramBot|Tiny Tiny RSS|Twitterbot|Viber|vkShare|WhatsApp|Yahoo Link Preview|Dark Visitor Client|HeadlessChrome|adbeat_bot|AdsBot-Google|AdsBot-Google-Mobile|aiHitBot|AndersPinkBot|ArchiveBot|AwarioBot|AwarioSmartBot|BitSightBot|Blackboard|BrandVerity|Cincraw|ev-crawler" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Google-Safety|HubSpot|ImagesiftBot|IonCrawl|Jugendschutzprogramm-Crawler|KStandBot|LightspeedSystemsCrawler|linkfluence|LinkWalker|magpie-crawler|Mediapartners-Google|Mediatoolkitbot|MuckRack|NetcraftSurveyAgent|Netvibes|Pandalytics|panscient.com|proximic|scoop.it|SeekportBot|SMTBot|trendictionbot|TrendsmapResolver|Turnitin|TurnitinBot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "TweetmemeBot|Twingly|um-LN|VelenPublicWebCrawler|virustotal|Webzio|ZoominfoBot|008|Dataprovider.com|dcrawl|HTTrack|HTTrack 3.0|MetaInspector|newspaper|Nutch|Offline Explorer|OpenindexSpider|Scrapy|360Spider|AlexandriaOrgBot|Atom Feed Robot|Baiduspider|bingbot|coccocbot-web|Daum" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "DuckDuckBot|DuckDuckGo-Favicons-Bot|Feedfetcher-Google|Google Favicon|Googlebot|Googlebot-Image|Googlebot-Mobile|Googlebot-News|Googlebot-Video|GoogleOther|HaoSouSpider|MojeekBot|msnbot|msnbot-media|PetalBot|Qwantbot|Qwantify|SemanticScholarBot|SeznamBot|Sogou web spider|teoma|TinEye|TinEye-bot|yacybot|Yahoo! Slurp" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Yandex|YandexBot|YandexImages|YandexRenderResourcesBot|Yeti|YisouSpider|ZumBot|AhrefsBot|Barkrowler|BLEXBot|BrightEdge Crawler|Cocolyzebot|DataForSeoBot|DomainStatsBot|dotbot|hypestat|linkdexbot|MJ12bot|online-webceo-bot|Screaming Frog SEO Spider|SemrushBot|SemrushBot-BA|SemrushBot-CT|SemrushBot-SI|SemrushBot-SWA" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "SenutoBot|SeobilityBot|SEOkicks|SEOlizer|serpstatbot|SiteCheckerBotCrawler|ZoomBot|007ac9 Crawler|2ip.ru|360Spider-Image|360Spider-Video|5emeRue|5erue|A Patent Crawler|A6-Indexer|Aboundex|AcademicBotRTU|acapbot|acoonbot|Acunetix Security Scanner|Acunetix Web Vulnerability Scanner|AddSearchBot|AddThis|adequat|adequat-systems" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "AdIdxBot|ADmantX|adscanner|AdsTxtCrawler|AdvBot|AISearchBot|Alexabot|Alexibot|AlphaBot|AmiSoftware|antibot|AnyEvent|Apercite|AppInsights|Aqua_Products|arabot|Ask n read|asknread.com|AspiegelBot|asterias|Augure|auramundi|AwarioRssBot|awesomecrawler|B2B Bot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "b2w|BackDoorBot|BacklinkCrawler|Baidu-YunGuanCe|Baiduspider-image|Baiduspider-news|Baiduspider-video|BDCbot|BehloolBot|betaBot|Better Uptime Bot|bidswitchbot|BIGLOTRON|binlar|Birdcrawlerbot|BitBot|Black Hole|Blekkobot|blogmuraBot|BlowFish|BLP_bbot|bnf.fr_bot|BomboraBot|Bookmark search tool|bot-pge.chlooe.com" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Bot.AraTurka.com|BotALot|botify|BotRightHere|BoxcarBot|brainobot|BrandONbot|BTWebClient|BUbiNG|Buck|BuiltBotTough|Bullseye|BunnySlippers|buzzbot|Caliperbot|CapsuleChecker|careerbot|CC Metadata Scaper|Cegbfeieh|centurybot|changedetection|CheckMarkNetwork|CheeseBot|CherryPicker|CherryPickerElite" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "CherryPickerSE|Cision|CISPA Webcrawler|citeseerxbot|Citoid|Claritybot|Clickagy|Cliqzbot|CloudFlare-AlwaysOnline|coccoc|coccocbot|coexel|Companybook-Crawler|content crawler spider|ContextAd Bot|contxbot|convera|ConveraCrawler|Cookiebot|Copernic|CopyRightCheck|Corporama|cosmos|crawler4j|CrawlyProjectCrawler" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Crescent|Crescent Internet ToolPak HTTP OLE Control v.1.0|CriteoBot|CrunchBot|CrystalSemanticsBot|Curebot|Cutbot|cXensebot|CyberPatrol|DareBoost|Datafeedwatch|datagnionbot|Datanyze|daumoa|deepcrawl|deepnoc|DeuSu|Digg Deeper|Digimind|Digincore bot|discobot|Disqus|DittoSpyder|DnyzBot|Domain Re-Animator Bot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "DomainCrawler|Dow Jones Searchbot|Download Ninja|Dragonbot|drupact|Dubbotbot|e.ventures Investment Crawler|EasyBib AutoCite|ec2linkfinder|edisterbot|electricmonk|elisabot|ellisphere|EmailCollector|EmailSiphon|EmailWolf|epicbot|eright|EroCrawler|EtaoSpider|europarchive.org|evc-batch|EveryoneSocialBot|Exabot|Experibot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ExtLinksBot|ExtractorPro|Eyeotabot|EZID|Ezooms|Facebot|FairAd Client|FAST Enterprise Crawler|FAST-WebCrawler|FediDB|fedoraplanet|Feedbin|feedbot|FeedBurner|Feedspot|FeedValidator|FemtosearchBot|Fever|FindITAnswersbot|findlink|findthatfile|findxbot|Flaming AttackBot|Flamingo_SearchEngine|fluffy" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Foobot|fr-crawler|FreeWebMonitoring SiteChecker|FreshpingBot|fuelbot|Fyrebot|g00g1e.net|G2 Web Services|g2reader-bot|Gaisbot|GarlikCrawler|Genieo|GenomeCrawlerd|GetRight|Gigablast|Gigabot|GingerCrawler|Gluten Free Crawler|gnam gnam spider|GnowitNewsbot|Google-Adwords-Instant|Google-Certificates-Bridge|Google-PhysicalWeb|Google-Site-Verification|Google-Structured-Data-Testing-Tool" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "google-xrawler|Gowikibot|grapeshot|GrapeshotCrawler|Grobbot|GroupHigh|grub-client|grub.org|gsa-crawler|gslfbot|Gwene|Harvest|HawaiiBot|humanlinks|hyscore.io|IAS crawler|ICBot|ICC-Crawler|ichiro|imrbot|IndeedBot|INETDEX-BOT|InfoNaviRobot|infoobot|infoseek" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "integromedb|intelium_bot|InterfaxScanBot|ip-web-crawler.com|IRLbot|Iron33|iskanie|IsraBot|istellabot|it2media-domain-crawler|James BOT|JamesBOT|Jamie's Spider|JenkersBot|JennyBot|Jetbot|Jetty|JikeSpider|JobboerseBot|Jooblebot|jpg-newsbot|jyxobot|k2spider|K7MLWCBot|kbcrawl" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Kemvibot|Kenjin Spider|keys-so-bot|Keyword Density|Knowings|KomodiaBot|KosmioBot|Landau-Media-Spider|larbin|Laserlikebot|lb-spider|leadbox|Leikibot|LexiBot|libWeb|Linespider|Linguee Bot|linkapediabot|LinkArchiver|LinkCheck by Siteimprove.com|linkdex|LinkextractorPro|LinkisBot|linko|LinkpadBot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "LinkScan|lipperhey|LivelapBot|lkxscan|LNSpiderguy|lssbot|lssrocketcrawler|ltx71|Luminator-robots|lwp-trivial|MaCoCu|mappydata|Mata Hari|MauiBot|MBCrawler|MegaIndex|MegaIndex.ru|Meltawer|Meltwater|MeltwaterNews|memorybot|mention|MetaJobBot|MetaURI|MIIxpc" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "mindUpBot|minicrawler|Mister PiX|MixnodeCache|mlbot|moatbot|moget|Mojeek|MoodleBot|Moreover|MS Search 4.0 Robot|MS Search 6.0 Robot|MSIECrawler|msrbot|MTRobot|Multiviewbot|mytwip|NAVER Blog Rssbot|NaverBot|Neevabot|NerdByNature.Bot|nerdybot|NetAnts|netEstate NE Crawler|Neticle Crawler" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "NetMechanic|netresearchserver|NetSystemsResearch|newsharecounts|NewsNow|Newzbin|NextGenSearchBot|NICErsPRO|niki-bot|NimbleCrawler|Nimbostratus-Bot|NINJA bot|NIXStatsbot|NLUX_IAHarvester|Nmap Scripting Engine|NPBot|NTENTbot|Nuzzel|OdklBot|officestorebot|omgilibot|Openbot|Openfind|Openfind data gatherer|OpenGraphCheck" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "OpenHoseBot|opinion-tracker|Oracle Ultra Search|OrangeBot|Orthogaffe|outbrain|OutclicksBot|page2rss|PagePeeker|PageThing|peer39_crawler|PerMan|Pingdom|Pinterest|PiplBot|postrank|PR-CY.RU|Primalbot|PrivacyAwareBot|ProPowerBot|ProWebWalker|proxem|psbot|Pulsepoint|purebot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "QueryN Metasearch|Qwam content intelligence|Radiation Retriever 1.1|RankActiveLinkBot|RankFlex|Refindbot|RegionStuttgartBot|RepoMonkey|RepoMonkey Bait & Tackle|RetrevoPageAnalyzer|ReverseEngineeringBot|RidderBot|Riddler|Rivva|Robozilla|rssbot|RSSingBot|RukiCrawler|RuxitSynthetic|RyteBot|SafeDNSBot|SafeSearch microdata crawler|SBL-BOT|score3|ScoutJet" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "scribdbot|Scrubby|search.marginalia.nu|SearchAtlas|SearchmetricsBot|searchpreview|seekbot|Seekport Crawler|Seekr|seewithkids|semanticbot|sempi.tech|SemrushBot-BM|SemrushBot-SA|sentibot|SEOkicks-Robot|seoscanners|seostar.co|SEOstats|SimpleCrawler|SimpleScraper|Sindup|sistrix crawler|SiteBot|sitecheck.internetseer.com" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "siteexplorer.info|Siteimprove|Siteimprove.com|SiteSnagger|SiteSucker|Slack-ImgProxy|Slackbot|Slurp|SocialRankIOBot|Sogou|Sogou inst spider|Sogou spider2|Sonic|Sosospider|SpankBot|spanner|spbot|Spinn3r|spotter|SputnikBot|Storebot-Google|StorygizeBot|StractBot|Streamline3Bot|SummalyBot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "summify|SuperBot|SurveyBot|suzuran|Swiftbot|SWIMGBot|Synthesio|Sysomos|Szukacz|Taboolabot|tagoobot|Talkwater|TangibleeBot|Teleport|TeleportPro|Telesoft|The Intraformant|TheNomad|theoldreader.com|Thinklab|tigerbot|Titan|toCrawl|TombaPublicWebCrawler|toplistbot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ToutiaoSpider|Traackr.com|tracemyfile|trafilatura|trendeo|trendkite-akashic-crawler|trendybuzz|trovitBot|True_Robot|TruliaBot|turingos|tweetedtimes|twengabot|Twurly|UbiCrawler|um-IC|Updownerbot|Upflow|Uptime-Kuma|Uptimebot.org|UptimeRobot|URL Control|URL_Spider_Pro|urlappendbot|URLy Warning" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "usasearch|UsineNouvelleCrawler|UT-Dorkbot|Validator.nu|VCI|VCI WebViewer VCI WebViewer Win32|vebidoobot|vecteurplus|Veoozbot|verticalsearch|Vigil|VKRobot|voilabot|voltron|VoluumDSP-content-bot|vsw|vuhuvBot|W3C_I18n-Checker|W3C_Unicorn|W3C-checklink|W3C-mobileOK|WASALive-Bot|wbsearchbot|Web Image Collector|web-archive-net.com.bot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "WebAuto|WebBandit|WebCapture 2.0|webcompanycrawler|WebCopier|WebCopier v.2.2|WebCopier v3.2a|WebDataStats|WebEnhancer|WebmasterWorldForumBot|webmon |WebReaper|WebSauger|Website Quester|WebStripper|WebZIP|winello|WinHTTrack|WiseGuys Robot|wocbot|woobot|woorankreview|WordupInfoSearch|woriobot|wotbox" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "WWW-Collector-E|WWW-Mechanize|www.uptime.com|Xenu|Xenu Link Sleuth|Xenu's|Xenu's Link Sleuth 1.1c|xovibot|Yahoo Pipes 1.0|YaK|YandexMobileBot|YandexVideo|yanga|Yellowbrandprotectionbot|yoozBot|YoudaoBot|Youmag|Zabbix|Zao|Zealbot|zenback bot|Zeus|Zeus Link Scout|zgrab|Zite" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ZuperlistBot|ZyBORG|anthropic-ai|Claude-Web|cohere-ai" [NC]
RewriteRule .* - [F,L]
</IfModule>
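
Before deploying a rule set like this, it can help to check locally which user agents it would catch. The following is a minimal sketch, not part of Apache: `ua_is_blocked` is a hypothetical helper name, the sample user agent strings are made up, and `grep -E` only approximates the PCRE matching that mod_rewrite performs (for plain alternations like these the two should agree).

```shell
#!/bin/sh
# ua_is_blocked FILE UA  (hypothetical helper, not part of Apache)
# Succeeds if UA matches any RewriteCond %{HTTP_USER_AGENT} pattern in FILE.
ua_is_blocked() {
  # Extract the quoted pattern from each RewriteCond line, one per line.
  ua_patterns=$(sed -n 's/^RewriteCond %{HTTP_USER_AGENT} "\(.*\)" \[.*]$/\1/p' "$1")
  # An empty pattern list would match everything in grep, so bail out.
  [ -n "$ua_patterns" ] || return 1
  # grep treats each line of the pattern list as an alternative; -i mirrors [NC].
  printf '%s\n' "$2" | grep -Eiq -e "$ua_patterns"
}

# Demo against a two-line sample file (user agent strings are made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
RewriteCond %{HTTP_USER_AGENT} "GPTBot|CCBot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "anthropic-ai|Claude-Web|cohere-ai" [NC]
EOF

for ua in "claude-web/1.0" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"; do
  if ua_is_blocked "$sample" "$ua"; then
    echo "blocked: $ua"
  else
    echo "allowed: $ua"
  fi
done
rm -f "$sample"
```

Once the rules are live, a request such as `curl -I -A "GPTBot" https://example.com/` (with your own hostname in place of the placeholder) should come back with a 403 status.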

Here’s a command that generates this configuration, so you can script updates to your .htaccess. It splits the list into RewriteCond directives of 25 user agents each to avoid Apache's "/var/www/html/.htaccess at line 2: Line too long" error when it processes the .htaccess. Every directive except the last is emitted with the [OR] flag, because Apache ANDs consecutive RewriteCond directives by default and the rule would otherwise never match. The result is written to a file named HTACCESS.

echo -e "<IfModule mod_rewrite.c>\nRewriteEngine On\n$(curl -s https://darkvisitors.com/agents | grep -oP '(?<=<div class="name agent-name">).*?(?=</div>)' | awk 'function emit(v, flags) {print "RewriteCond %{HTTP_USER_AGENT} \"" v "\" [" flags "]"} {values = (count++ ? values "|" : "") $0; if (count == 25) {if (prev != "") emit(prev, "NC,OR"); prev = values; values = ""; count = 0}} END {if (values != "") {if (prev != "") emit(prev, "NC,OR"); prev = values} if (prev != "") emit(prev, "NC")}')\nRewriteRule .* - [F,L]\n</IfModule>" > HTACCESS
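
Since the command scrapes a web page, its output can come back empty or malformed if the page layout changes, and a RewriteRule with no conditions in front of it would return 403 for every visitor. A cautious script might validate the generated file before installing it. The sketch below assumes the file layout shown above; `rules_look_sane` is a hypothetical helper name.

```shell
#!/bin/sh
# rules_look_sane FILE  (hypothetical helper)
# Succeeds only if FILE has at least one RewriteCond, every RewriteCond
# except the last ends in ",OR]", and the last one does not.
rules_look_sane() {
  awk '
    /^RewriteCond / {
      n++
      # Record whether this condition ends in ",OR]", e.g. "[NC,OR]".
      has_or[n] = (substr($0, length($0) - 3) == ",OR]")
    }
    END {
      if (n == 0) exit 1           # no conditions: the rule would match every request
      for (i = 1; i < n; i++)
        if (!has_or[i]) exit 1     # a missing OR turns the chain into an AND
      if (has_or[n]) exit 1        # the last condition must not carry OR
    }
  ' "$1"
}
```

For example, `rules_look_sane HTACCESS && cp HTACCESS /var/www/html/.htaccess` would install the file only after the check passes. The destination path here is taken from the error message quoted above, and the copy replaces any existing .htaccess, so adjust it to your setup.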

Published by

Rich

Just another IT guy.
