• Stars
    star
    985
  • Rank 44,835 (Top 1.0 %)
  • Language
    Python
  • License
    Other
  • Created over 1 year ago
  • Updated 11 days ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Tencent Pre-training framework in PyTorch & Pre-trained Model Zoo

English | 中文

TencentPretrain: Tencent Pre-training Framework

Pre-training has become an essential part of AI technology. TencentPretrain is a toolkit for pre-training and fine-tuning on data of different modalities (e.g. text and vision). TencentPretrain is characterized by modular design. It facilitates the use of existing pre-training models, and provides interfaces for users to further extend upon. With TencentPretrain, we build a model zoo which contains pre-trained models of different properties. TencentPretrain inherits the open source toolkit UER (https://github.com/dbiir/UER-py/) and extends it to a multimodal pre-training framework.

Full Documentation:https://github.com/Tencent/TencentPretrain/wiki

  • News 2023.04.07: Add Low-Rank Adaptation (LoRA) and Deepspeed Zero 3 support tutorial
  • News 2023.03.10: Add LLaMA model training / inference tutorial , 中文博客

Table of Contents


Features

TencentPretrain has the following features:

  • Reproducibility TencentPretrain has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, T5, CLIP.
  • Model modularity TencentPretrain is divided into the following parts: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each part. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
  • Multimodal TencentPretrain supports different modalities such as text, vision, and audio.
  • Model training TencentPretrain supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
  • Model zoo With the help of TencentPretrain, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performances of downstream tasks.
  • SOTA results TencentPretrain supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many competitions.
  • Abundant functions TencentPretrain provides abundant functions related with pre-training, such as feature extractor and text generation.

Requirements

  • Python >= 3.6
  • torch >= 1.1
  • six >= 1.12.0
  • argparse
  • packaging
  • regex
  • For the mixed precision training you will need apex from NVIDIA
  • For the pre-trained model conversion (related with TensorFlow) you will need TensorFlow
  • For the tokenization with sentencepiece model you will need SentencePiece
  • For developing a stacking model you will need LightGBM and BayesianOptimization
  • For the pre-training with whole word masking you will need word segmentation tool such as jieba
  • For the use of CRF in sequence labeling downstream task you will need pytorch-crf
  • For the gigantic model training you will need DeepSpeed
  • For the vision model training you will need torchvision
  • For the audio model training you will need torchaudio, and opencv-python is needed for some special settings of specaugment

Quickstart

This section uses several commonly-used examples to demonstrate how to use TencentPretrain. More details are discussed in Instructions section. We firstly use BERT (a text pre-training model) on book review sentiment classification dataset. The dataset is collected by this paper and is available here. We pre-train model on book review corpus and then fine-tune it on book review sentiment classification dataset. There are three input files: book review corpus, book review sentiment classification dataset, and vocabulary. All files are encoded in UTF-8 and included in this project.

The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):

doc1-sent1
doc1-sent2
doc1-sent3

doc2-sent1

doc3-sent1
doc3-sent2

The book review corpus is obtained from book review sentiment classification dataset. We remove labels and split a review into two parts from the middle to construct a document with two sentences (see book_review_bert.txt in corpora folder).

The format of the classification dataset is as follows:

label    text_a
1        instance1
0        instance2
1        instance3

Label and instance are separated by \t . The first row is a list of column names. The label ID should be an integer between (and including) 0 and n-1 for n-way classification.

We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.

We firstly pre-process the book review corpus. In the pre-processing stage, the corpus needs to be processed into the format required by the specified pre-training model (--data_processor):

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor bert

Notice that six>=1.12.0 is required.

Pre-processing is time-consuming. Using multiple processes can largely accelerate the pre-processing speed (--processes_num). BERT tokenizer is used in default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (in TencentPretrain format and the original model is from here), and put it in models folder. We load the pre-trained Chinese BERT model and further pre-train it on book review corpus. Pre-training model is usually composed of embedding, encoder, and target layers. To build a pre-training model, we should provide related information. Configuration file (--config_path) specifies the modules and hyper-parameters used by pre-training models. More details can be found in models/bert/base_config.json. Suppose we have a machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/book_review_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32

mv models/book_review_model.bin-5000 models/book_review_model.bin

Notice that the model trained by pretrain.py is attacted with the suffix which records the training step (--total_steps). We could remove the suffix for ease of use.

Then we fine-tune the pre-trained model on downstream classification dataset. We use embedding and encoder layers of book_review_model.bin, which is the output of pretrain.py:

python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/bert/base_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --epochs_num 3 --batch_size 32

The default path of the fine-tuned classifier model is models/finetuned_model.bin . It is noticeable that the actual batch size of pre-training is --batch_size times --world_size ; The actual batch size of downstream task (e.g. classification) is --batch_size . Then we do inference with the fine-tuned model.

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/bert/base_config.json \
                                          --test_path datasets/book_review/test_nolabel.tsv \
                                          --prediction_path datasets/book_review/prediction.tsv \
                                          --labels_num 2

--test_path specifies the path of the file to be predicted. The file should contain text_a column. --prediction_path specifies the path of the file with prediction results. We need to explicitly specify the number of labels by --labels_num. The above dataset is a two-way classification dataset.


The above content provides basic ways of using TencentPretrain to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in complete ➡️ quickstart ⬅️ . The complete quickstart contains abundant use cases, covering most of the pre-training related application scenarios. It is recommended that users read the complete quickstart in order to use the project reasonably.


Pre-training data

This section provides links to a range of ➡️ pre-training data ⬅️ (in other open source projects).


Downstream datasets

This section provides links to a range of ➡️ downstream datasets ⬅️ (in other open source projects). TencentPretrain can load these datasets directly.


Modelzoo

With the help of TencentPretrain, we pre-trained models of different properties (e.g. models based on different modalities, encoders, and targets). Detailed introduction of pre-trained models and their download links can be found in ➡️ modelzoo ⬅️ . All pre-trained models can be loaded by TencentPretrain directly. More pre-trained models will be released in the future.


Instructions

TencentPretrain is organized as follows:

TencentPretrain/
    |--tencentpretrain/
    |    |--embeddings/ # contains embedding modules
    |    |--encoders/ # contains encoder modules such as RNN, CNN, Transformer
    |    |--decoders/ # contains decoder modules
    |    |--targets/ # contains target modules such as language modeling, masked language modeling
    |    |--layers/ # contains frequently-used NN layers, such as normalization layer
    |    |--models/ # contains model.py, which combines modules of different parts
    |    |--utils/ # contains frequently-used utilities
    |    |--model_builder.py
    |    |--model_loader.py
    |    |--model_saver.py
    |    |--opts.py
    |    |--trainer.py
    |
    |--corpora/ # contains pre-training data
    |--datasets/ # contains downstream tasks
    |--models/ # contains pre-trained models, vocabularies, and configuration files
    |--scripts/ # contains useful scripts for pre-training models
    |--finetune/ # contains fine-tuning scripts for downstream tasks
    |--inference/ # contains inference scripts for downstream tasks
    |
    |--preprocess.py
    |--pretrain.py
    |--README.md
    |--README_ZH.md
    |--requirements.txt
    |--LICENSE

The code is well-organized. Users can use and extend upon it with little efforts.

Comprehensive examples of using TencentPretrain can be found in ➡️ instructions ⬅️ , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5, CLIP and fine-tune pre-trained models on a range of downstream tasks.


Competition solutions

TencentPretrain has been used in winning solutions of many competitions. In this section, we provide some examples of using TencentPretrain to achieve SOTA results on competitions, such as CLUE. See ➡️ competition solutions ⬅️ for more detailed information.

More Repositories

1

weui

A UI library by WeChat official design team, includes the most useful widgets/modules in mobile web applications.
Less
27,053
star
2

wepy

小程序组件化开发框架
JavaScript
22,396
star
3

ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform
C++
19,310
star
4

mars

Mars is a cross-platform network component developed by WeChat.
C++
17,137
star
5

tinker

Tinker is a hot-fix solution library for Android, it supports dex, library and resources update without reinstall apk.
Java
17,056
star
6

MMKV

An efficient, small mobile key-value storage framework developed by WeChat. Works on Android, iOS, macOS, Windows, and POSIX.
C++
16,913
star
7

APIJSON

🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.
Java
16,681
star
8

vConsole

A lightweight, extendable front-end developer tool for mobile web page.
TypeScript
16,379
star
9

weui-wxss

A UI library by WeChat official design team, includes the most useful widgets/modules.
Less
14,966
star
10

QMUI_Android

提高 Android UI 开发效率的 UI 库
Java
14,336
star
11

rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
C++
13,803
star
12

secguide

面向开发人员梳理的代码安全指南
13,093
star
13

omi

Web Components Framework - Web组件框架
TypeScript
12,926
star
14

VasSonic

VasSonic is a lightweight and high-performance Hybrid framework developed by tencent VAS team, which is intended to speed up the first screen of websites working on Android and iOS platform.
Java
11,779
star
15

matrix

Matrix is a plugin style, non-invasive APM system developed by WeChat.
Java
11,417
star
16

wcdb

WCDB is a cross-platform database framework developed by WeChat.
C
10,509
star
17

xLua

xLua is a lua programming solution for C# ( Unity, .Net, Mono) , it supports android, ios, windows, linux, osx, etc.
C
9,133
star
18

libco

libco is a coroutine library which is widely used in wechat back-end service. It has been running on tens of thousands of machines since 2013.
C++
7,998
star
19

Hippy

Hippy is designed to easily build cross-platform dynamic apps. 👏
C++
7,840
star
20

Shadow

零反射全动态Android插件框架
Java
7,316
star
21

QMUI_iOS

QMUI iOS——致力于提高项目 UI 开发效率的解决方案
Objective-C
7,030
star
22

MLeaksFinder

Find memory leaks in your iOS app at develop time.
Objective-C
5,399
star
23

lemon-cleaner

腾讯柠檬清理是针对macOS系统专属制定的清理工具。主要功能包括重复文件和相似照片的识别、软件的定制化垃圾扫描、可视化的全盘空间分析、内存释放、浏览器隐私清理以及设备实时状态的监控等。重点聚焦清理功能,对上百款软件提供定制化的清理方案,提供专业的清理建议,帮助用户轻松完成一键式清理。
Objective-C
5,188
star
24

kbone

一个致力于微信小程序和 Web 端同构的解决方案
JavaScript
4,742
star
25

libpag

The official rendering library for PAG (Portable Animated Graphics) files that renders After Effects animations natively across multiple platforms.
C++
4,729
star
26

puerts

PUER(普洱) Typescript. Let's write your game in UE or Unity with TypeScript.
C++
4,661
star
27

GT

GT (Great Tit) is a portable debugging tool for bug hunting and performance tuning on smartphones anytime and anywhere just as listening music with Walkman. GT can act as the Integrated Debug Environment by directly running on smartphones.
Java
4,385
star
28

TNN

TNN: developed by Tencent Youtu Lab and Guangying Lab, a uniform deep learning inference framework for mobile、desktop and server. TNN is distinguished by several outstanding features, including its cross-platform capability, high performance, model compression and code pruning. Based on ncnn and Rapidnet, TNN further strengthens the support and performance optimization for mobile devices, and also draws on the advantages of good extensibility and high performance from existed open source efforts. TNN has been deployed in multiple Apps from Tencent, such as Mobile QQ, Weishi, Pitu, etc. Contributions are welcome to work in collaborative with us and make TNN a better framework.
C++
4,297
star
29

westore

小程序项目分层架构
JavaScript
4,216
star
30

tmagic-editor

TypeScript
4,037
star
31

wujie

极致的微前端框架
TypeScript
3,801
star
32

vap

VAP是企鹅电竞开发,用于播放特效动画的实现方案。具有高压缩率、硬件解码等优点。同时支持 iOS,Android,Web 平台。
Objective-C
3,794
star
33

phxpaxos

The Paxos library implemented in C++ that has been used in the WeChat production environment.
C++
3,301
star
34

WeFlow

A web developer workflow tool by WeChat team based on tmt-workflow, with cross-platform supported and environment ready.
JavaScript
3,224
star
35

cherry-markdown

✨ A Markdown Editor
JavaScript
3,195
star
36

weui.js

A lightweight javascript library for WeUI.
JavaScript
3,157
star
37

spring-cloud-tencent

Spring Cloud Tencent is a Spring Cloud based Service Governance Framework provided by Tencent.
Java
3,116
star
38

tencent-ml-images

Largest multi-label image database; ResNet-101 model; 80.73% top-1 acc on ImageNet
Python
3,046
star
39

tdesign

Enterprise Design System
Vue
3,010
star
40

VasDolly

Android V1 and V2 Signature Channel Package Plugin
Java
2,999
star
41

FaceDetection-DSFD

腾讯优图高精度双分支人脸检测器
Python
2,863
star
42

PhoenixGo

Go AI program which implements the AlphaGo Zero paper
C++
2,863
star
43

Tendis

Tendis is a high-performance distributed storage system fully compatible with the Redis protocol.
C++
2,837
star
44

behaviac

behaviac is a framework of the game AI development, and it also can be used as a rapid game prototype design tool. behaviac supports the behavior tree, finite state machine and hierarchical task network(BT, FSM, HTN)
C#
2,784
star
45

PocketFlow

An Automatic Model Compression (AutoMC) framework for developing smaller and faster AI applications.
Python
2,782
star
46

MSEC

Mass Service Engine in Cluster(MSEC) is opened source by QQ team from Tencent. It is a backend DEV &OPS engine, including RPC,name finding,load balance,monitoring,release and capacity management.
Java
2,745
star
47

phxsql

A high availability MySQL cluster that guarantees data consistency between a master and slaves.
C++
2,463
star
48

OOMDetector

OOMDetector is a memory monitoring component for iOS which provides you with OOM monitoring, memory allocation monitoring, memory leak detection and other functions.
Objective-C++
2,298
star
49

tsf

coroutine and Swoole based php server framework in tencent
PHP
2,179
star
50

tmt-workflow

A web developer workflow used by WeChat team based on Gulp, with cross-platform supported and solutions prepared.
CSS
2,175
star
51

Hardcoder

Hardcoder is a solution which allows Android APP and Android System to communicate with each other directly, solving the problem that Android APP could only use system standard API rather than the hardware resource of system.
C++
2,145
star
52

LKImageKit

A high-performance image framework, including a series of capabilities such as image views, image downloader, memory caches, disk caches, image decoders and image processors.
Objective-C
2,079
star
53

UnLua

A feature-rich, easy-learning and highly optimized Lua scripting plugin for UE.
C++
2,053
star
54

TubeMQ

TubeMQ has been donated to the Apache Software Foundation and renamed to InLong, please visit the new Apache repository: https://github.com/apache/incubator-inlong
2,027
star
55

ObjectDetection-OneStageDet

单阶段通用目标检测器
Python
1,962
star
56

cloudbase-framework

腾讯云开发云原生一体化部署工具 🚀 CloudBase Framework:一键部署,不限框架语言,云端一体化开发,基于Serverless 架构。A front-end and back-end integrated deployment tool. One-click deploy to serverless architecture. https://docs.cloudbase.net/framework/index
JavaScript
1,936
star
57

InjectFix

InjectFix is a hot-fix solution library for Unity
C#
1,933
star
58

TscanCode

A static code analyzer for C++, C#, Lua
C++
1,916
star
59

phxrpc

A simple C++ based RPC framework.
C++
1,905
star
60

soter

A secure and quick biometric authentication standard and platform in Android held by Tencent.
Java
1,902
star
61

phxqueue

A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.
C++
1,891
star
62

plato

腾讯高性能分布式图计算框架Plato
C++
1,889
star
63

GameAISDK

基于图像的游戏AI自动化框架
C++
1,861
star
64

MedicalNet

Many studies have shown that the performance on deep learning is significantly affected by volume of training data. The MedicalNet project provides a series of 3D-ResNet pre-trained models and relative code.
Python
1,837
star
65

TSW

Tencent Server Web
TypeScript
1,802
star
66

NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
Python
1,781
star
67

QMUI_Web

An efficient front-end framework for developers building UI on the web.
JavaScript
1,719
star
68

Biny

Biny is a tiny, high-performance PHP framework for web applications
PHP
1,690
star
69

sluaunreal

lua dev plugin for unreal engine 4 or 5
C++
1,687
star
70

paxosstore

PaxosStore has been deployed in WeChat production for more than two years, providing storage services for the core businesses of WeChat backend. Now PaxosStore is running on thousands of machines, and is able to afford billions of peak TPS.
C++
1,658
star
71

Metis

Metis is a learnware platform in the field of AIOps.
Python
1,644
star
72

CodeAnalysis

Static Code Analysis - 静态代码分析
Python
1,585
star
73

TurboTransformers

a fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU and GPU.
C++
1,442
star
74

TencentOS-kernel

腾讯针对云的场景研发的服务器操作系统
1,401
star
75

nohost

基于 Whistle 实现的多账号多环境远程配置及抓包调试平台
JavaScript
1,392
star
76

TBase

TBase is an enterprise-level distributed HTAP database. Through a single database cluster to provide users with highly consistent distributed database services and high-performance data warehouse services, a set of integrated enterprise-level solutions is formed.
C
1,372
star
77

WeDemo

WeDemo为微信团队开源项目,用于帮助微信开发者完成微信登录、微信分享等功能的接入和开发。开发者可参考源代码完成开发,也可以直接将代码应用到自己的App开发中,安全、便捷地在App中实现微信分享、微信登录功能。
Objective-C
1,365
star
78

feflow

🚀 A command line tool aims to improve front-end engineer workflow and standard, powered by TypeScript.
TypeScript
1,354
star
79

GAutomator

Automation for mobile games
Objective-C
1,318
star
80

tdesign-vue-next

A Vue3.x UI components lib for TDesign.
TypeScript
1,316
star
81

flare

Flare是广泛投产于腾讯广告后台的现代化C++开发框架,包含了基础库、RPC、各种客户端等。主要特点为易用性强、长尾延迟低。
C++
1,264
star
82

TFace

A trusty face analysis research platform developed by Tencent Youtu Lab
Python
1,236
star
83

LuaPanda

lua debug and code tools for VS Code
Lua
1,219
star
84

FeatherCNN

FeatherCNN is a high performance inference engine for convolutional neural networks.
C++
1,209
star
85

tdesign-miniprogram

A Wechat MiniProgram UI components lib for TDesign.
HTML
1,084
star
86

tgfx

A lightweight 2D graphics library for rendering texts, geometries, and images with high-performance APIs that work across various platforms.
C++
1,011
star
87

RapidView

RapidView is an android ui and lightapp development framework
Java
977
star
88

FAutoTest

A UI automated testing framework for H5 and applets
Python
930
star
89

TencentKona-8

Tencent Kona is a no-cost, production-ready distribution of the Open Java Development Kit (OpenJDK), Long-term support(LTS) with quarterly updates. Tencent Kona serves as the default JDK internally at Tencent Cloud for cloud computing and other Java applications.
Java
909
star
90

tquic

A high-performance, lightweight, and cross-platform QUIC library
Rust
900
star
91

hel

A module federation SDK which is unrelated to tool chain for module consumer. 工具链无关的运行时模块联邦sdk.
JavaScript
888
star
92

tdesign-vue

A Vue.js UI components lib for TDesign.
TypeScript
872
star
93

Pebble

Pebble分布式开发框架
C++
861
star
94

mxflutter

使用 TypeScript/JavaScript 来开发 Flutter 应用的框架。
Dart
834
star
95

Face2FaceTranslator

面对面翻译小程序是微信团队针对面对面沟通的场景开发的流式语音翻译小程序,通过微信同声传译插件提供了语音识别,文本翻译等功能。
JavaScript
822
star
96

tdesign-react

A React UI components lib for TDesign.
TypeScript
787
star
97

LightDiffusionFlow

This extension is developed for AUTOMATIC1111's Stable Diffusion web UI that provides import/export options for parameters.
JavaScript
764
star
98

Real-SR

Real-World Super-Resolution via Kernel Estimation and Noise Injection
Python
753
star
99

DCache

A distributed in-memory NOSQL system based on TARS framework, support LRU algorithm and data persists on back-end database. Users can easily deploy, publish, and scale services on the web interface.
C++
746
star
100

PatrickStar

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.
Python
741
star