Addons ideas and questions / Re: Updated Spider List
Last post by Spuds - There is a good overall blocker for either Apache or Nginx available here: https://github.com/mitchellkrogza That's a bit more effort than just a robots.txt list, but it really does block them early in the process.
If you just want a list of bad spiders/bots/AI agents ... here is the robots.txt file from the above: https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt I think there may be a couple of blocked ones that are arguably SEO useful, but YMMV.
I use that robots.txt file to politely ask them not to scan/scrape (which is all it really does: ask), and then the things I have in the ElkArte spiders list are what I consider "safe" or necessary/good for SEO, just so I can see who is doing what/when.
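For anyone unfamiliar with the format, a robots.txt entry of that kind looks roughly like the sketch below. The agent name ExampleScraperBot is a made-up placeholder, not an entry from the linked list, and compliance is entirely up to the bot.

```text
# Ask one specific (hypothetical) crawler to stay out of the whole site
User-agent: ExampleScraperBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```

Well-behaved crawlers read this file before fetching anything else; badly behaved ones ignore it, which is why the server-level blocker is the stronger option.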
If you have a real problem with bots, then installing the full code from above is the way to go, as it uses iptables and fail2ban, among others.
I've attempted to keep an updated list of interesting ones (a.k.a. probably SEO useful) for 2.0; here is the current install section:
Code:
return $this->db->insert('ignore',
'{db_prefix}spiders',
array('spider_name' => 'string', 'user_agent' => 'string', 'ip_info' => 'string'),
array(
array('Amazon', 'Amazonbot', ''),
array('Anthropic-AI', 'anthropic-ai', ''),
array('Anthropic-AI (Bot)', 'ClaudeBot', ''),
array('Anthropic-AI (Claude)', 'claude-Web', ''),
array('Apple', 'Applebot', ''),
array('Baidu', 'Baiduspider', ''),
array('Bing', 'bingbot', ''),
array('Bing (Preview)', 'BingPreview', ''),
array('CCBot', 'CCBot', ''),
array('Diffbot', 'Diffbot', ''),
array('DoCoMo', 'DoCoMo', ''),
array('DuckDuckGo', 'duckduckgo', ''),
array('DuckDuckGo (Assist)', 'DuckAssistBot', ''),
array('Ecosia', 'Ecosia', ''),
array('Exabot', 'Exabot', ''),
array('Google', 'Googlebot', ''),
array('Google (AdSense)', 'Mediapartners-Google', ''),
array('Google (Adwords)', 'AdsBot-Google', ''),
array('Google (Bard)', 'Google-Extended', ''),
array('Google (Image)', 'Googlebot-Image', ''),
array('Google (ImageProxy)', 'GoogleImageProxy', ''),
array('Google (Mobile)', 'Googlebot-Mobile', ''),
array('Google (News)', 'Googlebot-News', ''),
array('Google (Video)', 'Googlebot-Video', ''),
array('Gravityscan', 'Gravityscan', ''),
array('InternetArchive', 'ia_archiver-web.archive.org', ''),
array('Jakarta', 'Jakarta Commons', ''),
array('Kraken', 'Kraken', ''),
array('LinkedIn', 'LinkedInBot', ''),
array('MegaIndex', 'MegaIndex.ru', ''),
array('Meta/Facebook', 'FacebookBot', ''),
array('Meta/Facebook', 'meta-externalagent', ''),
array('Meta/Facebook (Hit)', 'facebookexternalhit', ''),
array('MSN', 'msnbot', ''),
array('MSN (Mobile)', 'MSNBOT_Mobile', ''),
array('Omgili', 'Omgili', ''),
array('Open-AI (Bot)', 'GPTBot', ''),
array('Open-AI (SearchBot)', 'OAI-SearchBot', ''),
array('Open-AI (User)', 'ChatGPT-User', ''),
array('Perplexity (User)', 'Perplexity-User', ''),
array('PerplexityBot (Bot)', 'PerplexityBot', ''),
array('Slack', 'Slackbot', ''),
array('Sogou', 'Sogou', ''),
array('Teoma', 'teoma', ''),
array('Tik-Tok', 'Bytespider', ''),
array('Timpi', 'TimpiBot', ''),
array('Twitter', 'TwitterBot', ''),
array('Yahoo!', 'slurp', ''),
array('Yahoo! (Blogs)', 'Yahoo-Blogs', ''),
array('Yahoo! (Feeds)', 'YahooFeedSeeker', ''),
array('Yahoo! (Image)', 'Yahoo-MMCrawler', ''),
array('Yahoo! (Mobile)', 'YahooSeeker/M1A1-R2D2', ''),
array('Yandex', 'YandexBot', ''),
array('Yandex (Blogs)', 'YandexBlogs', ''),
array('Yandex (Images)', 'YandexImages', ''),
array('Yandex (Media)', 'YandexMedia', ''),
array('Yandex (Video)', 'YandexVideo', '')
)
);
}
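As a rough illustration of how a list like this could be used to label visitors, here is a minimal sketch. It assumes each user_agent value is treated as a case-insensitive substring of the request's User-Agent header; the post does not say how ElkArte actually matches, and matchSpider() is a hypothetical helper, not ElkArte code.

```php
<?php
// Hypothetical helper, not ElkArte's actual detection code.
// Assumption: a spider matches when its user_agent value appears
// anywhere in the User-Agent header, ignoring case.
function matchSpider(string $userAgent, array $spiders): ?string
{
    // Try longer patterns first so 'Googlebot-Image' wins over 'Googlebot'.
    usort($spiders, fn($a, $b) => strlen($b[1]) <=> strlen($a[1]));

    foreach ($spiders as [$name, $needle]) {
        // stripos() returns false when the needle is absent, so compare strictly.
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return $name;
        }
    }

    return null; // not a known spider
}

// A few rows from the list above, as (spider_name, user_agent) pairs.
$spiders = [
    ['Google', 'Googlebot'],
    ['Google (Image)', 'Googlebot-Image'],
    ['Open-AI (Bot)', 'GPTBot'],
];

// prints "Google (Image)"
echo matchSpider('Mozilla/5.0 (compatible; Googlebot-Image/1.0)', $spiders) ?? 'no match', "\n";
```

Sorting by pattern length is a design choice in this sketch: without it, the shorter 'Googlebot' entry would also match image-crawler requests and hide the more specific label.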