Tencent Pre-training framework in PyTorch & Pre-trained Model Zoo

TencentPretrain: Tencent Pre-training Framework

Pre-training has become an essential part of AI technology. TencentPretrain is a toolkit for pre-training and fine-tuning on data of different modalities (e.g. text and vision). TencentPretrain is characterized by modular design. It facilitates the use of existing pre-training models, and provides interfaces for users to further extend upon. With TencentPretrain, we build a model zoo which contains pre-trained models of different properties. TencentPretrain inherits the open source toolkit UER (https://github.com/dbiir/UER-py/) and extends it to a multimodal pre-training framework.

Full Documentation: https://github.com/Tencent/TencentPretrain/wiki

  • News 2023.04.07: Add Low-Rank Adaptation (LoRA) and DeepSpeed ZeRO-3 support tutorial
  • News 2023.03.10: Add LLaMA model training / inference tutorial and Chinese blog post

Features

TencentPretrain has the following features:

  • Reproducibility: TencentPretrain has been tested on many datasets and should match the performance of the original pre-training model implementations such as BERT, GPT-2, ELMo, T5, and CLIP.
  • Model modularity: TencentPretrain is divided into the following parts: embedding, encoder, target embedding (optional), decoder (optional), and target. Many modules are implemented in each part. Clear and robust interfaces allow users to combine modules to construct pre-training models with as few restrictions as possible.
  • Multimodal: TencentPretrain supports different modalities such as text, vision, and audio.
  • Model training: TencentPretrain supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
  • Model zoo: With the help of TencentPretrain, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performance of downstream tasks.
  • SOTA results: TencentPretrain supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many competitions.
  • Abundant functions: TencentPretrain provides abundant functions related to pre-training, such as feature extraction and text generation.

Requirements

  • Python >= 3.6
  • torch >= 1.1
  • six >= 1.12.0
  • argparse
  • packaging
  • regex
  • For mixed precision training, you will need apex from NVIDIA
  • For pre-trained model conversion (related to TensorFlow), you will need TensorFlow
  • For tokenization with a sentencepiece model, you will need SentencePiece
  • For developing a stacking model, you will need LightGBM and BayesianOptimization
  • For pre-training with whole word masking, you will need a word segmentation tool such as jieba
  • For using CRF in the sequence labeling downstream task, you will need pytorch-crf
  • For gigantic model training, you will need DeepSpeed
  • For vision model training, you will need torchvision
  • For audio model training, you will need torchaudio; opencv-python is needed for some special settings of specaugment

Quickstart

This section uses several commonly used examples to demonstrate how to use TencentPretrain. More details are discussed in the Instructions section. We first use BERT (a text pre-training model) on a book review sentiment classification dataset. The dataset is collected by this paper and is available here. We pre-train the model on the book review corpus and then fine-tune it on the book review sentiment classification dataset. There are three input files: the book review corpus, the book review sentiment classification dataset, and the vocabulary. All files are encoded in UTF-8 and included in this project.

The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):

doc1-sent1
doc1-sent2
doc1-sent3

doc2-sent1

doc3-sent1
doc3-sent2

The book review corpus is obtained from book review sentiment classification dataset. We remove labels and split a review into two parts from the middle to construct a document with two sentences (see book_review_bert.txt in corpora folder).
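The corpus construction described above can be sketched in a few lines of Python. This is only an illustration, not the script used by the project: the function name `reviews_to_bert_corpus` is hypothetical, and splitting at the character midpoint is an assumption about what "from the middle" means.

```python
def reviews_to_bert_corpus(tsv_lines):
    """Convert 'label<TAB>text_a' rows into BERT corpus lines.

    Each review becomes one document of two "sentences" (its two halves),
    and documents are delimited by an empty line.
    """
    corpus = []
    rows = iter(tsv_lines)
    next(rows, None)  # skip the header row (column names)
    for row in rows:
        _label, sep, text = row.rstrip("\n").partition("\t")
        if not sep:
            continue  # not a 'label<TAB>text' row
        mid = len(text) // 2  # assumption: split at the character midpoint
        first, second = text[:mid], text[mid:]
        if not first or not second:
            continue  # too short to form two sentences
        corpus.extend([first, second, ""])  # "" is the document delimiter
    return corpus


if __name__ == "__main__":
    sample = ["label\ttext_a\n", "1\tabcdef\n", "0\tgh\n"]
    print("\n".join(reviews_to_bert_corpus(sample)))
```

Writing the returned lines to a file, one per line, gives a corpus in the format shown above.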

The format of the classification dataset is as follows:

label    text_a
1        instance1
0        instance2
1        instance3

Label and instance are separated by a tab (\t). The first row is a list of column names. For n-way classification, the label ID should be an integer from 0 to n-1 inclusive.
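Before fine-tuning on your own data, it can help to check that a file follows this format. The sketch below uses only the Python standard library; the function name `read_classification_tsv` and its `labels_num` parameter are hypothetical helpers for illustration, not part of TencentPretrain.

```python
import csv
import io


def read_classification_tsv(stream, labels_num):
    """Read a tab-separated file with 'label' and 'text_a' columns.

    Raises ValueError if a label is not an integer in [0, labels_num - 1].
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = int(row["label"])
        if not 0 <= label < labels_num:
            raise ValueError(
                f"line {line_no}: label {label} is outside 0..{labels_num - 1}"
            )
        examples.append((label, row["text_a"]))
    return examples


if __name__ == "__main__":
    data = io.StringIO("label\ttext_a\n1\tinstance1\n0\tinstance2\n")
    print(read_classification_tsv(data, labels_num=2))
```

For a real file, pass an open file object instead of `io.StringIO`, for example `open(path, encoding="utf-8", newline="")`.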

We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.

We first pre-process the book review corpus. In the pre-processing stage, the corpus is converted into the format required by the specified pre-training model (--data_processor):

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor bert

Note that six>=1.12.0 is required.

Pre-processing is time-consuming. Using multiple processes (--processes_num) can greatly speed it up. The BERT tokenizer is used by default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py.

Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (in TencentPretrain format; the original model is from here) and put it in the models folder. We load the pre-trained Chinese BERT model and further pre-train it on the book review corpus. A pre-training model is usually composed of embedding, encoder, and target layers. To build a pre-training model, we should provide related information. The configuration file (--config_path) specifies the modules and hyper-parameters used by the pre-training model. More details can be found in models/bert/base_config.json. Suppose we have a machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/book_review_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32

mv models/book_review_model.bin-5000 models/book_review_model.bin

Note that the model saved by pretrain.py has a suffix attached that records the training step (here, the value of --total_steps). We can remove the suffix for ease of use.
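If several checkpoints are present, the one to rename can be picked programmatically. The following is a minimal sketch that assumes every checkpoint follows the `<output_model_path>-<step>` naming seen in the `mv` command above; the helper `latest_checkpoint` is hypothetical and not part of TencentPretrain.

```python
import re


def latest_checkpoint(filenames, base="book_review_model.bin"):
    """Return the name '<base>-<step>' with the largest step, or None."""
    pattern = re.compile(re.escape(base) + r"-(\d+)")
    best_step, best_name = -1, None
    for name in filenames:
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


if __name__ == "__main__":
    names = [
        "book_review_model.bin-1000",
        "book_review_model.bin-5000",
        "google_zh_model.bin",
    ]
    print(latest_checkpoint(names))
```

In practice `filenames` could come from `os.listdir("models")`, and the selected file could then be renamed with `os.replace`.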

Then we fine-tune the pre-trained model on the downstream classification dataset. We use the embedding and encoder layers of book_review_model.bin, which is the output of pretrain.py:

python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/bert/base_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --epochs_num 3 --batch_size 32

The default path of the fine-tuned classifier model is models/finetuned_model.bin. Note that the actual batch size of pre-training is --batch_size times --world_size, whereas the actual batch size of a downstream task (e.g. classification) is --batch_size. Then we do inference with the fine-tuned model:

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/bert/base_config.json \
                                          --test_path datasets/book_review/test_nolabel.tsv \
                                          --prediction_path datasets/book_review/prediction.tsv \
                                          --labels_num 2

--test_path specifies the path of the file to be predicted. The file should contain text_a column. --prediction_path specifies the path of the file with prediction results. We need to explicitly specify the number of labels by --labels_num. The above dataset is a two-way classification dataset.
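To run inference on your own labeled file, one option is to derive an unlabeled input that keeps only the text_a column. The sketch below is an illustration under that assumption; the helper name `drop_label_column` is hypothetical, and the project already ships datasets/book_review/test_nolabel.tsv for the book review example.

```python
def drop_label_column(tsv_lines):
    """Keep only the 'text_a' column of a tab-separated dataset.

    The first returned line is the header 'text_a'; the rest are the texts.
    """
    rows = [line.rstrip("\n").split("\t") for line in tsv_lines if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    text_idx = header.index("text_a")  # raises ValueError if the column is missing
    return ["text_a"] + [row[text_idx] for row in body]


if __name__ == "__main__":
    labeled = ["label\ttext_a\n", "1\tinstance1\n", "0\tinstance2\n"]
    print("\n".join(drop_label_column(labeled)))
```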


The above content covers the basic ways of using TencentPretrain to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in the complete ➡️ quickstart ⬅️ , which covers most pre-training-related application scenarios. Users are encouraged to read the complete quickstart to make full use of the project.


Pre-training data

This section provides links to a range of ➡️ pre-training data ⬅️ (in other open source projects).


Downstream datasets

This section provides links to a range of ➡️ downstream datasets ⬅️ (in other open source projects). TencentPretrain can load these datasets directly.


Modelzoo

With the help of TencentPretrain, we pre-trained models of different properties (e.g. models based on different modalities, encoders, and targets). Detailed introduction of pre-trained models and their download links can be found in ➡️ modelzoo ⬅️ . All pre-trained models can be loaded by TencentPretrain directly. More pre-trained models will be released in the future.


Instructions

TencentPretrain is organized as follows:

TencentPretrain/
    |--tencentpretrain/
    |    |--embeddings/ # contains embedding modules
    |    |--encoders/ # contains encoder modules such as RNN, CNN, Transformer
    |    |--decoders/ # contains decoder modules
    |    |--targets/ # contains target modules such as language modeling, masked language modeling
    |    |--layers/ # contains frequently-used NN layers, such as normalization layer
    |    |--models/ # contains model.py, which combines modules of different parts
    |    |--utils/ # contains frequently-used utilities
    |    |--model_builder.py
    |    |--model_loader.py
    |    |--model_saver.py
    |    |--opts.py
    |    |--trainer.py
    |
    |--corpora/ # contains pre-training data
    |--datasets/ # contains downstream tasks
    |--models/ # contains pre-trained models, vocabularies, and configuration files
    |--scripts/ # contains useful scripts for pre-training models
    |--finetune/ # contains fine-tuning scripts for downstream tasks
    |--inference/ # contains inference scripts for downstream tasks
    |
    |--preprocess.py
    |--pretrain.py
    |--README.md
    |--README_ZH.md
    |--requirements.txt
    |--LICENSE

The code is well organized. Users can use and extend it with little effort.

Comprehensive examples of using TencentPretrain can be found in ➡️ instructions ⬅️ , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5, CLIP and fine-tune pre-trained models on a range of downstream tasks.


Competition solutions

TencentPretrain has been used in winning solutions of many competitions. In this section, we provide some examples of using TencentPretrain to achieve SOTA results on competitions, such as CLUE. See ➡️ competition solutions ⬅️ for more detailed information.
