This repository was archived on 22 December 2022.
Repository Details

The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)

This is the primary repository for the services and MapReduce jobs used to produce the Common Crawl web corpus from 2008 to 2012.

Tree Structure

  • org.commoncrawl.async - Utility code used to build asynchronous servers.
  • org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
  • org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
  • org.commoncrawl.hadoop.template - Sample Hadoop Job.
  • org.commoncrawl.io - CommonCrawl IO library used by crawlers.
  • org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
  • org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
  • org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64-bit SimHash.
  • org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
  • org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free MapReduce pipeline used to process crawl metadata and generate new crawl lists.
  • org.commoncrawl.mapred.segmenter - Support code used to generate crawl segments (URL lists consumed by the crawlers).
  • org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
  • org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
  • org.commoncrawl.server - CommonCrawl Server base class used by various services.
  • org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this directory.
  • org.commoncrawl.service.crawler - The crawler long running process (Consumes Crawl Lists, writes content to HDFS).
  • org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
  • org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
  • org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
  • org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
  • org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
  • org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use to do on-demand link extraction.
  • org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
  • org.commoncrawl.service.statscollector - Service that receives crawl stats.
  • org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
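The deduper package above is built around SimHash near-duplicate detection: each document is reduced to a 64-bit fingerprint such that similar documents produce fingerprints with a small Hamming distance. The following is a minimal sketch of that technique, not the repository's actual implementation in org.commoncrawl.mapred.ec2.postprocess.deduper; the class name, the FNV-1a token hash, and the unweighted voting are simplifying assumptions.

```java
// Minimal 64-bit SimHash sketch (hypothetical illustration, not the
// org.commoncrawl.mapred.ec2.postprocess.deduper implementation).
public class SimHash64 {

    // Hash a single token to 64 bits (FNV-1a, chosen here for simplicity).
    static long tokenHash(String token) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < token.length(); i++) {
            h ^= token.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // SimHash of a token stream: each token hash casts a +1/-1 vote per bit
    // position; the fingerprint keeps the sign of each bit's vote total.
    public static long simhash(String[] tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = tokenHash(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) fingerprint |= 1L << bit;
        }
        return fingerprint;
    }

    // Hamming distance between two fingerprints; near-duplicate documents
    // yield a small distance (e.g. <= 3).
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

A parallel dedupe job can then bucket fingerprints by bit ranges so that candidate pairs within a small Hamming distance land on the same reducer, avoiding an all-pairs comparison.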
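Similarly, org.commoncrawl.hadoop.mergeutils provides merge-sort support outside a Hadoop job, i.e. combining several already-sorted inputs into one sorted stream. A k-way merge with a priority queue is the standard way to do this; the sketch below is a hypothetical illustration of that idea (the class and method names are invented, not the module's real API).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge sketch, not the actual
// org.commoncrawl.hadoop.mergeutils API.
public class KWayMerge {

    // Merge several individually sorted lists into one sorted list.
    public static List<String> merge(List<List<String>> sortedInputs) {
        // Each heap entry is {source list index, next element index},
        // ordered by the element it currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedInputs.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedInputs.size(); i++) {
            if (!sortedInputs.get(i).isEmpty()) {
                heap.add(new int[]{i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(sortedInputs.get(e[0]).get(e[1]));
            // Advance the cursor in the list we just consumed from.
            if (e[1] + 1 < sortedInputs.get(e[0]).size()) {
                heap.add(new int[]{e[0], e[1] + 1});
            }
        }
        return out;
    }
}
```

Using a heap keeps each step O(log k) for k input streams, which is what makes merging many pre-sorted spill files cheap compared with re-sorting the concatenated data.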

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

Ahad Rana (ahad at commoncrawl.org)