Awesome Java Web Crawling Libraries

  • updated about 2 months ago, Apache License 2.0

    A set of reusable Java components that implement functionality common to any web crawler

  • updated over 1 year ago, Apache License 2.0

    HtmlUnit is a "GUI-Less browser for Java programs".

  • nutch (2,653 stars), updated over 1 year ago, Apache License 2.0

    Apache Nutch is an extensible and scalable web crawler.

  • selenium (30,062 stars), updated 4 months ago, Apache License 2.0

    A browser automation framework and ecosystem.

  • updated 4 months ago

    A collection of reusable, loosely Selenium-related code. See https://github.com/sergueik/selenium_tests for examples of strictly Selenium test code.

  • tika (1,860 stars), updated over 1 year ago, Apache License 2.0

    The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

  • yauaa (677 stars), updated over 1 year ago, Apache License 2.0

    Yet Another UserAgent Analyzer