This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus from 2008 to 2012.
Tree Structure
- org.commoncrawl.async - Utility code used to build Async server.
- org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
- org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
- org.commoncrawl.hadoop.template - Sample Hadoop Job.
- org.commoncrawl.io - CommonCrawl IO library used by crawlers.
- org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
- org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
- org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64-bit SimHash.
- org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
- org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free map-reduce pipeline used to process crawl metadata and generate new crawl lists.
- org.commoncrawl.mapred.segmenter - Support code used to generate Crawl Segments (URL lists consumed by the crawlers).
- org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
- org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
- org.commoncrawl.server - CommonCrawl Server base class used by various services.
- org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this directory.
- org.commoncrawl.service.crawler - The crawler long-running process (consumes Crawl Lists, writes content to HDFS).
- org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
- org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
- org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
- org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
- org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
- org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use to do on-demand link extraction.
- org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
- org.commoncrawl.service.statscollector - Service that receives crawl stats.
- org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
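
The deduper package listed above is described as using a 64-bit SimHash. As a rough illustration of that general idea only, and not the repository's actual code, the sketch below builds a 64-bit fingerprint from a document's tokens and compares two fingerprints by Hamming distance. The class name SimHashSketch, the FNV-1a token hash, the whitespace tokenization, and the maxDistance threshold are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration; not taken from org.commoncrawl.mapred.ec2.postprocess.deduper.
final class SimHashSketch {

    // FNV-1a 64-bit hash of a single token (an assumed stand-in for the token hash).
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Each token votes +1 or -1 on every bit position; the fingerprint keeps
    // the bits whose vote total is positive.
    static long simhash64(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which the two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Treats two documents as near-duplicates when their fingerprints differ
    // in at most maxDistance bits (the threshold is a tuning assumption).
    static boolean isNearDuplicate(String a, String b, int maxDistance) {
        return hammingDistance(simhash64(a), simhash64(b)) <= maxDistance;
    }
}
```

Because every token contributes independently to the bit votes, the fingerprint does not depend on token order, and documents sharing most of their tokens tend to land within a small Hamming distance of each other. A parallel dedupe can then group or compare documents by fingerprint rather than by full content.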
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Contributors
Ahad Rana (ahad at commoncrawl.org)