  • Stars: 104
  • Rank: 330,604 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created about 2 years ago
  • Updated almost 2 years ago

Repository Details

This repository contains all the code for collecting large amounts of code from GitHub.

Code-Pile

This repository contains the processing scripts to scrape/process the code-pile dataset.

Table of Contents

  • Project Description
  • How to use the Code-Pile (todo)
  • How to Contribute
  • Additional Resources

Project Description

Check out the Code-Pile proposal.

The Code-Pile will be released in the same way as "The Pile": as a folder of .jsonl.zst files (see lm-dataformat).
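As a rough illustration of the .jsonl.zst layout, the sketch below writes and reads one shard with the third-party `zstandard` package instead of lm-dataformat itself. It assumes each line is a JSON object with `text` and `meta` keys, which is the layout lm-dataformat is understood to use; the helper names (`to_jsonl_line`, `write_shard`, `read_shard`) are hypothetical and not part of this repository.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator


def to_jsonl_line(text: str, meta: Dict[str, Any]) -> str:
    """Serialize one document as a single JSON line (assumed text/meta layout)."""
    return json.dumps({"text": text, "meta": meta}) + "\n"


def from_jsonl_line(line: str) -> Dict[str, Any]:
    """Parse one JSON line back into a document dict."""
    return json.loads(line)


def write_shard(path: str, docs: Iterable[Dict[str, Any]]) -> None:
    """Write documents to a zstd-compressed JSONL shard."""
    import zstandard  # third-party; imported lazily so the helpers above work without it

    with open(path, "wb") as fh:
        with zstandard.ZstdCompressor(level=3).stream_writer(fh) as writer:
            for doc in docs:
                line = to_jsonl_line(doc["text"], doc.get("meta", {}))
                writer.write(line.encode("utf-8"))


def read_shard(path: str) -> Iterator[Dict[str, Any]]:
    """Stream documents back out of a shard without loading it fully into memory."""
    import zstandard  # third-party

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            if line.strip():
                yield from_jsonl_line(line)
```

Streaming the decompression keeps memory use flat for large shards and also handles frames that do not record their content size.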

How to use the Code-Pile

This section is not finished yet; ask on Discord.

How to Contribute

Think about the most useful code data for the next generation of textual code models.

The most valuable dataset properties (use your own judgment) are:

  1. Open License
  2. Data quality
  3. Dataset size
  4. Data variance/variety/nicheness
  5. Ease of obtaining/processing

To add a new dataset, open an issue using the dataset-request template. Gather all the relevant information in it, and use the issue to track progress.

Check whether code already exists or someone is already working on it (see Additional Resources):

  1. EleutherAI's Pile V1 repos
  2. Ask on Carper #code-pile
  3. Ask on Eleuther
  4. Consult the linked Spreadsheets below

Then implement it through the following steps:

  1. Fork this repo
  2. Use the working branch
  3. Read the shared classes in datasets.py and codepile.py
  4. Create an MVP/example for your dataset
  5. Create a pull request
  6. Keep building the data-domain specific classes and repeat
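To give a feel for what an MVP in step 4 might look like, here is a minimal sketch that splits a dataset into a download step and a processing step. It does not use the shared classes from datasets.py or codepile.py, whose interfaces are not shown here; `Record`, `ExampleDataset`, and the method names are hypothetical, and the "downloaded" data is hard-coded for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Record:
    """One output document: raw text plus provenance metadata."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


class ExampleDataset:
    """Hypothetical MVP dataset; not the interface defined in codepile.py."""

    name = "example"

    def download(self) -> List[str]:
        # Stand-in for the scraping step; a real dataset would fetch raw files here.
        return ["def add(a, b):\n    return a + b\n", "   \n"]

    def process(self, raw: List[str]) -> Iterator[Record]:
        # Drop empty documents and attach minimal provenance metadata.
        for index, source in enumerate(raw):
            if source.strip():
                yield Record(text=source, meta={"dataset": self.name, "index": index})


dataset = ExampleDataset()
records = list(dataset.process(dataset.download()))
```

Keeping download and processing separate makes it possible to rerun the cleaning logic without scraping again; an actual contribution should follow whatever structure the shared classes prescribe.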

Citation Placeholder:

@misc{Code-Pile,
  author = {},
  doi = {},
  month = {},
  title = {},
  url = {https://github.com/CarperAI/Code-Pile},
  version = {},
  year = {2022}
}

Additional Resources

Closely related projects:

Previous work: