CrawlerDetect
About
CrawlerDetect is a Ruby port of the PHP class CrawlerDetect (JayBizzle/CrawlerDetect).
It detects bots, crawlers, and spiders via the user agent and other HTTP headers. It currently recognizes thousands of bots, spiders, and crawlers.
Why CrawlerDetect?
Comparison with other popular bot-detection gems:
| | CrawlerDetect | Voight-Kampff | Browser |
|---|---|---|---|
| Number of bot patterns | >1000 | ~280 | ~280 |
| Number of checked HTTP headers | 10 | 1 | 1 |
| Number of bot-list updates (1st half of 2018) | 14 | 1 | 7 |
In order to remain up-to-date, this gem does not accept any crawler data updates; any PRs to edit the crawler data should be offered to the original JayBizzle/CrawlerDetect project.
Installation
Add this line to your application's Gemfile:
gem 'crawler_detect'
Basic Usage
CrawlerDetect.is_crawler?("Bot user agent")
# => true
Or, if you need the crawler name:
detector = CrawlerDetect.new("Googlebot/2.1 (http://www.google.com/bot.html)")
detector.is_crawler?
# => true
detector.crawler_name
# => "Googlebot"
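As a small illustration of the instance API above, the following sketch tallies crawler names across a list of user-agent strings. The helper `crawler_counts` is hypothetical (it is not part of the gem). It takes the detector class as an argument and assumes only the `new`, `is_crawler?`, and `crawler_name` calls shown above.

```ruby
# Hypothetical helper, not part of the gem: counts how often each crawler
# name appears in a list of user-agent strings.
#
# detector_class is expected to behave like CrawlerDetect as shown above:
# detector_class.new(user_agent) returns an object that responds to
# is_crawler? and crawler_name.
def crawler_counts(user_agents, detector_class)
  user_agents.each_with_object(Hash.new(0)) do |user_agent, counts|
    detector = detector_class.new(user_agent)
    next unless detector.is_crawler?

    counts[detector.crawler_name] += 1
  end
end
```

In an application with the gem loaded, this might be called as `crawler_counts(user_agents, CrawlerDetect)`, for example over user agents pulled from an access log.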
Rack::Request extension
Optionally, you can add extra methods to request:
request.is_crawler?
# => false
request.crawler_name
# => nil
It's more flexible to use request.is_crawler? rather than CrawlerDetect.is_crawler? because it automatically checks 10 HTTP headers, not only HTTP_USER_AGENT.
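To see why checking several headers matters, consider a client whose User-Agent looks like a regular browser while another header gives it away. The sketch below is not the gem's implementation: the header list and the single pattern are illustrative assumptions, and `crawler_in_env?` is a hypothetical name. It only shows the general idea of combining several Rack env headers before matching.

```ruby
# Illustrative assumptions: a short header list and one toy pattern.
# The gem ships its own (longer) header and crawler lists.
ILLUSTRATIVE_HEADERS = %w[
  HTTP_USER_AGENT
  HTTP_X_ORIGINAL_USER_AGENT
  HTTP_FROM
].freeze

ILLUSTRATIVE_PATTERN = /googlebot|bingbot|crawler|spider/i

# Joins the values of all listed headers present in the Rack env and
# matches the combined string against the pattern.
def crawler_in_env?(env, headers: ILLUSTRATIVE_HEADERS, pattern: ILLUSTRATIVE_PATTERN)
  combined = headers.map { |name| env[name] }.compact.join(" ")
  pattern.match?(combined)
end

env = {
  "HTTP_USER_AGENT" => "Mozilla/5.0 (X11; Linux x86_64)",
  "HTTP_FROM"       => "googlebot(at)googlebot.com"
}

crawler_in_env?(env)                               # => true
crawler_in_env?(env, headers: ["HTTP_USER_AGENT"]) # => false
```

Looking only at HTTP_USER_AGENT misses this request; including the extra headers catches it.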
The only thing you have to do is configure the Rack::CrawlerDetect middleware:
Rails
class Application < Rails::Application
# ...
config.middleware.use Rack::CrawlerDetect
end
Rack
use Rack::CrawlerDetect
Configuration
In some cases you may want to use your own whitelist, blacklist, or list of HTTP headers in which to look for the user agent.
You can do this via CrawlerDetect::Config. For example, you might have an initializer like this:
CrawlerDetect.setup! do |config|
config.raw_headers_path = File.expand_path("crawlers/MyHeaders.json", __dir__)
config.raw_crawlers_path = File.expand_path("crawlers/MyCrawlers.json", __dir__)
config.raw_exclusions_path = File.expand_path("crawlers/MyExclusions.json", __dir__)
end
Make sure that your files are valid JSON. Look at the raw files that are used by default for more information.
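Before pointing the configuration at custom files, it can help to check that they parse. The sketch below uses only Ruby's standard json library. `valid_pattern_file?` is a hypothetical helper, and the assumption that each file holds a JSON array of strings should be verified against the default raw files.

```ruby
require "json"

# Hypothetical helper, not part of the gem. Returns true if the file at
# path parses as a JSON array whose elements are all strings.
# Assumption: custom files follow the same shape as the default raw files.
def valid_pattern_file?(path)
  data = JSON.parse(File.read(path))
  data.is_a?(Array) && data.all? { |entry| entry.is_a?(String) }
rescue JSON::ParserError, Errno::ENOENT
  false
end
```

For instance, an initializer might call `valid_pattern_file?(File.expand_path("crawlers/MyCrawlers.json", __dir__))` and raise early if it returns false.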
License
MIT License