This repository contains the scripts used to scrape and process the Code-Pile dataset.
- Project Description
- How to use the Code-Pile (todo)
- How to Contribute
- Additional Resources
Check out the Code-Pile proposal.
The Code-Pile will be released in a format similar to The Pile: a folder of .jsonl.zst files (see lm-dataformat).
This section is not finished yet; ask on Discord if you have questions.
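As a rough illustration of the planned release format, the sketch below writes and reads JSON Lines records, one document per line. It assumes each record carries a "text" field and a "meta" field, as lm-dataformat archives generally do; the helper names (`to_jsonl`, `from_jsonl`, `read_jsonl_zst`) and the sample metadata keys are hypothetical and are not part of this repository. Reading a compressed file additionally assumes the third-party `zstandard` package is installed.

```python
import io
import json


def to_jsonl(docs):
    """Serialize (text, meta) pairs as JSON Lines, one document per line."""
    return "".join(
        json.dumps({"text": text, "meta": meta}) + "\n" for text, meta in docs
    )


def from_jsonl(stream):
    """Yield (text, meta) pairs from a text stream of JSON Lines."""
    for line in stream:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            yield record["text"], record.get("meta", {})


def read_jsonl_zst(path):
    """Stream documents from a .jsonl.zst file (assumes `pip install zstandard`)."""
    import zstandard  # third-party; imported lazily so the helpers above work without it

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from from_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


# Round-trip one hypothetical document through the uncompressed JSONL layer.
docs = [("print('hello')", {"source": "example", "license": "MIT"})]
payload = to_jsonl(docs)
roundtrip = list(from_jsonl(io.StringIO(payload)))
```

Once released files exist, something like `for text, meta in read_jsonl_zst("some_file.jsonl.zst"): ...` would stream documents without loading the whole archive into memory; the file name here is a placeholder.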
Think about the most useful code data for the next generation of textual code models.
The most valuable dataset properties (use your own judgment) are:
- Open License
- Data quality
- Dataset size
- Data variance/variety/nicheness
- Ease of obtaining/processing
To add a new dataset, open an issue using the dataset-request template. Gather all the relevant information in it, and use the issue to track progress.
Check whether code already exists or someone is already working on it (see Additional Resources):
- EleutherAI's Pile V1 repos
- Ask on Carper #code-pile
- Ask on Eleuther
- Consult the linked Spreadsheets below
Then implement it through the following steps:
- Fork this repo
- Use the working branch
- Read the shared classes in datasets.py and codepile.py
- Create an MVP/example for your dataset
- Create a pull request
- Keep building the data-domain-specific classes and repeat
Citation Placeholder:
@misc{Code-Pile,
author = {},
doi = {},
month = {},
title = {},
url = {https://github.com/CarperAI/Code-Pile},
version = {},
year = {2022}
}
Closely related projects:
Previous work: