A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in promoting LLM data development and research!

If you find Data-Juicer useful for your research or development, please kindly cite our work.


News

  • [2023-10-13] Our first data-centric LLM competition begins! Please visit the competition's official websites, FT-Data Ranker (1B Track, 7B Track), for more information.

  • [2023-10-08] We updated our paper to its second version and released the corresponding version 0.1.2 of Data-Juicer!

Features

Overview

  • Systematic & Reusable: Empowering users with a systematic library of 20+ reusable config recipes, 50+ core OPs, and feature-rich dedicated toolkits, designed to function independently of specific LLM datasets and processing pipelines.

  • Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with multi-dimensional automatic evaluation capabilities, it supports a timely feedback loop at multiple stages of the LLM development process.

  • Comprehensive Data Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, fine-tuning, English (en), Chinese (zh), and more scenarios. The recipes are validated on reference LLaMA models.

  • Enhanced Efficiency: Providing a speedy data processing pipeline that requires less memory and CPU usage, optimized for maximum productivity.

  • Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

  • User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration: simply add or remove OPs in existing configs.

Prerequisites

  • Recommended: Python==3.8
  • gcc >= 5 (at least C++14 support)

Installation

From Source

  • Run the following commands to install the latest basic data_juicer version in editable mode:
cd <path_to_data_juicer>
pip install -v -e .
  • Some OPs rely on third-party libraries that are large or have limited platform compatibility. You can install optional dependencies as needed:
cd <path_to_data_juicer>
pip install -v -e .  # install minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies

The dependency options are listed below:

Tag               Description
. or .[mini]      Install minimal dependencies for basic Data-Juicer.
.[all]            Install all optional dependencies (including minimal dependencies and all of the following).
.[sci]            Install all dependencies for all OPs.
.[dist]           Install dependencies for distributed data processing. (Experimental)
.[dev]            Install dependencies for developing the package as contributors.
.[tools]          Install dependencies for dedicated tools, such as quality classifiers.

Using pip

  • Run the following command to install the latest released data_juicer using pip:
pip install py-data-juicer
  • Note:
    • Only the basic APIs in data_juicer and two basic tools (data processing and analysis) are available this way. If you want customizable and complete functionality, we recommend installing data_juicer from source.
    • The release versions on PyPI lag behind the latest version from source. If you want to follow the latest features of data_juicer, we recommend installing from source.

Using Docker

  • You can
    • either pull our pre-built image from DockerHub:

      docker pull datajuicer/data-juicer:<version_tag>
    • or run the following command to build the docker image including the latest data-juicer with provided Dockerfile:

      docker build -t data-juicer:<version_tag> .

Installation check

import data_juicer as dj
print(dj.__version__)
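
If the import above fails, it can help to first check whether the package is installed at all. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical and not part of Data-Juicer, and the sketch assumes the distribution is registered under the name `py-data-juicer` used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the one used with pip above.
version = installed_version("py-data-juicer")
print(version if version else "py-data-juicer is not installed")
```

A `None` result suggests the package was never installed in the active environment, whereas a version string combined with a failing import points to a broken or partially installed dependency.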

Quick Start

Data Processing

  • Run the process_data.py tool or the dj-process command-line tool with your config as the argument to process your dataset.
# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml

# use command line tool
dj-process --config configs/demo/process.yaml
  • Note: Some operators involve third-party models or resources that are not stored locally on your computer. The first run of these OPs might be slow because they need to download the corresponding resources into a cache directory first. The default download cache directory is ~/.cache/data_juicer. To change the cache location, set the shell environment variable DATA_JUICER_CACHE_HOME to another directory; you can change DATA_JUICER_MODELS_CACHE or DATA_JUICER_ASSETS_CACHE in the same way:
# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
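
The precedence between these three variables can be sketched in Python. This is an illustration under stated assumptions, not Data-Juicer's actual implementation: the function name `resolve_cache_dirs` is hypothetical, and the fallback of the models and assets caches to `models` and `assets` subdirectories of the cache home is an assumption suggested by the example paths above.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def resolve_cache_dirs(env: Mapping[str, str]) -> Dict[str, Path]:
    """Resolve cache directories from environment variables with fallbacks."""
    # Default cache home stated in the note above.
    default_home = Path("~/.cache/data_juicer").expanduser()
    home = Path(env.get("DATA_JUICER_CACHE_HOME", default_home)).expanduser()
    # Assumption: models and assets default to subdirectories of the cache home.
    models = Path(env.get("DATA_JUICER_MODELS_CACHE", home / "models")).expanduser()
    assets = Path(env.get("DATA_JUICER_ASSETS_CACHE", home / "assets")).expanduser()
    return {"home": home, "models": models, "assets": assets}


# Show what the current shell environment would resolve to.
for name, path in resolve_cache_dirs(os.environ).items():
    print(f"{name}: {path}")
```

Running it before and after the `export` commands shows which directory each setting would point to.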

Data Analysis

  • Run the analyze_data.py tool or the dj-analyze command-line tool with your config as the argument to analyse your dataset.
# only for installation from source
python tools/analyze_data.py --config configs/demo/analyser.yaml

# use command line tool
dj-analyze --config configs/demo/analyser.yaml
  • Note: The Analyser only computes stats for Filter OPs, so any extra Mapper or Deduplicator OPs are ignored during analysis.
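
To preview which entries of an operator list would contribute stats, a rough sketch follows. It assumes that Filter OPs can be recognized by a `_filter` name suffix, as in `language_id_score_filter`; that naming rule, the helper `ops_seen_by_analyser`, and the two placeholder OP names are assumptions for illustration, not Data-Juicer's own selection logic.

```python
from typing import Any, Dict, List

# Each entry maps one operator name to its arguments.
OpEntry = Dict[str, Dict[str, Any]]


def ops_seen_by_analyser(process: List[OpEntry]) -> List[str]:
    """Return the names of OPs whose name marks them as Filters."""
    names = [name for entry in process for name in entry]
    # Assumption: Filter OPs are recognizable by a "_filter" name suffix.
    return [name for name in names if name.endswith("_filter")]


process = [
    {"example_mapper": {}},                        # hypothetical Mapper entry
    {"language_id_score_filter": {"lang": "en"}},  # OP name taken from this README
    {"example_deduplicator": {}},                  # hypothetical Deduplicator entry
]
print(ops_seen_by_analyser(process))  # prints: ['language_id_score_filter']
```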

Data Visualization

  • Run the app.py tool to visualize your dataset in your browser.
  • Note: This is only available when installed from source.
streamlit run app.py

Build Up Config Files

  • Config files specify some global arguments, and an operator list for the data process. You need to set:
    • Global arguments: input/output dataset path, number of workers, etc.
    • Operator list: list operators with their arguments used to process the dataset.
  • You can build up your own config files by:
    • ➖: Modify our example config file config_all.yaml, which includes all OPs and their default arguments. You only need to remove the OPs you won't use and refine some of the OP arguments.
    • ➕: Build your own config file from scratch. You can refer to our example config file config_all.yaml, the OP documents, and the advanced Build-Up Guide for developers.
    • Besides the yaml files, you can also specify individual parameters on the command line, which override the values in the yaml files.
python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
  • The basic config format and definition are shown below.

    [Figure: Basic config example of format and definition]
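
The structure described above (global arguments plus an operator list) and the command-line override can be sketched with a plain dictionary. This is a minimal illustration, not Data-Juicer's config parser: the helper `apply_override` is hypothetical, the key names `dataset_path`, `export_path`, `np`, and `process` are assumptions about the config layout, and `min_score` is an illustrative argument. The override value is kept as a string for simplicity.

```python
import copy
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Apply one "--op_name.arg=value" override to a copy of the config."""
    key, _, value = override.lstrip("-").partition("=")
    op_name, _, arg = key.partition(".")
    if not arg:
        raise ValueError(f"expected 'op_name.arg=value', got {override!r}")
    updated = copy.deepcopy(config)  # leave the loaded yaml values untouched
    for entry in updated["process"]:
        if op_name in entry:
            entry[op_name][arg] = value
            return updated
    raise KeyError(f"operator {op_name!r} is not in the config")


config = {
    "dataset_path": "/path/to/input.jsonl",   # assumed key: input dataset path
    "export_path": "/path/to/output.jsonl",   # assumed key: output dataset path
    "np": 4,                                  # assumed key: number of workers
    "process": [                              # operator list with arguments
        {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}},
    ],
}
new_config = apply_override(config, "--language_id_score_filter.lang=en")
print(new_config["process"][0]["language_id_score_filter"]["lang"])  # prints: en
```

Working on a deep copy keeps the values read from the yaml file intact, which makes it easy to compare the effective config with the file on disk.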

Preprocess Raw Data (Optional)

  • Our formatters support some common input dataset formats for now:
    • Multi-sample in one file: jsonl/json, parquet, csv/tsv, etc.
    • Single-sample in one file: txt, code, docx, pdf, etc.
  • However, data from different sources are complicated and diverse. For example:
    • Raw arXiv data downloaded from S3 includes thousands of tar files, each containing even more gzip files, and the expected tex files are embedded in those gzip files, so they are hard to obtain directly.
    • Some crawled data includes different kinds of files (pdf, html, docx, etc.), and extra information such as tables and charts is hard to extract.
  • It's impossible to handle all kinds of data in Data-Juicer; issues and PRs that add support for new data types are welcome!
  • Therefore, we provide some common preprocessing tools in tools/preprocess to help you preprocess such data.
    • You are welcome to contribute new preprocessing tools for the community.
    • We highly recommend preprocessing complicated data into jsonl or parquet files.
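
As an illustration of the recommendation to convert data to jsonl, the sketch below gathers a directory of single-sample txt files into one multi-sample jsonl file. It is not one of the tools in tools/preprocess: the function name `txt_dir_to_jsonl`, the `text` field name, and the extra `source` field are assumptions chosen for the example.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(src_dir: Path, dst_file: Path, text_key: str = "text") -> int:
    """Write one JSON object per .txt file in src_dir; return the sample count."""
    count = 0
    with dst_file.open("w", encoding="utf-8") as out:
        # Sort for a deterministic sample order.
        for path in sorted(src_dir.glob("*.txt")):
            sample = {text_key: path.read_text(encoding="utf-8"), "source": path.name}
            # ensure_ascii=False keeps non-ASCII text (e.g. zh data) readable.
            out.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `txt_dir_to_jsonl(Path("raw_txt"), Path("dataset.jsonl"))` would write one line per file found in a hypothetical `raw_txt` directory.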

For Docker Users

  • If you build or pull the docker image of data-juicer, you can run the commands or tools mentioned above using this docker image.
  • Run directly:
# run the data processing directly
#   --rm                                  remove the container after the processing
#   --name dj                             name of the container
#   -v <host_data_path>:<image_data_path> mount data or config directory into the container
#   -v ~/.cache/:/root/.cache/            mount the cache directory to reuse caches and models (recommended)
#   data-juicer:<version_tag>             image to run
#   dj-process ...                        same data processing command as above
docker run --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml
  • Or enter the running container and run commands in editable mode:
# start the container in the background (-d)
docker run -dit \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:latest /bin/bash

# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash

Documentation

Data Recipes

Demos

License

Data-Juicer is released under Apache License 2.0.

Contributing

We are in a rapidly developing field and greatly welcome contributions of new features, bug fixes, and better documentation. Please refer to the How-to Guide for Developers.

You are welcome to join our Slack channel or DingDing group for discussion.

Acknowledgement

Data-Juicer is used across various LLM products and research initiatives, including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for financial analysis and Zhiwen for reading assistance, as well as Alibaba Cloud's Platform for AI (PAI). We look forward to hearing more about your experience, suggestions, and ideas for collaboration!

Data-Juicer thanks and refers to several community projects, such as Huggingface-Datasets, Bloom, RedPajama, Pile, Alpaca-Cot, Megatron-LM, DeepSpeed, Arrow, Ray, Beam, LM-Harness, HELM, and others.

References

If you find our work useful for your research or development, please kindly cite the following paper.

@misc{chen2023datajuicer,
title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
year={2023},
eprint={2309.02033},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
