  • Stars: 679
  • Rank: 66,532 (Top 2%)
  • Language
  • License: MIT License
  • Created almost 6 years ago
  • Updated over 4 years ago


Repository Details

Chinese translation of the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

This resource translates the paper in full and provides web links to every reference cited in it, so that readers interested in BERT can study it further.

  1. Original paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. This is the version released in November 2018, which differs slightly from version v2 released in May 2019.
  2. The content below is only a preview; see Bidirectional_Encoder_Representations_Transformers翻译.md in this resource for the full translation.
  3. A PDF version of the translated BERT paper is available for download.
  4. Please credit the source when reposting; for commercial use, contact the translator 袁宵 at [email protected].
  5. More deep learning papers, especially NLP papers, will be translated and analyzed in the future.
  6. If you like this work, please star the repository (top right). Thank you 😃



BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language) {jacobdevlin,mingweichang,kentonl,kristout}@google.com

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only the BERT representations are jointly conditioned on both left and right context in all layers.

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings (word embeddings), the segment (sentence) embeddings, and the position embeddings.
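The sum described in Figure 2 can be stated in a few lines of code. The following is a minimal sketch, not from the paper or this repository; the sizes and token ids are illustrative assumptions:

```python
# Minimal sketch: a BERT-style input representation is the element-wise sum of
# token, segment (sentence A/B), and position embeddings.
# All sizes and ids below are illustrative assumptions, not BERT's real configuration.
import numpy as np

vocab_size, max_len, hidden = 100, 16, 8          # tiny illustrative sizes
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(vocab_size, hidden))  # one vector per token id
segment_emb  = rng.normal(size=(2, hidden))           # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(max_len, hidden))     # learned absolute positions

def input_representation(token_ids, segment_ids):
    """Sum the three embeddings at every input position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

x = input_representation(np.array([5, 23, 7, 42, 9, 6]),
                         np.array([0, 0, 0, 1, 1, 1]))
print(x.shape)  # (6, 8): one summed vector per input token
```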

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
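As a rough illustration of the fine-tuning recipe described above, here is a minimal sketch that assumes the Hugging Face transformers and PyTorch packages (neither is part of this repository): a single linear output layer is added on top of the pre-trained encoder, and the whole stack can then be fine-tuned for sentence classification.

```python
# Minimal sketch (assumes the `transformers` and `torch` packages): fine-tuning
# pre-trained BERT for sentence classification only adds one linear output layer.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)   # the extra output layer

inputs = tokenizer("BERT is easy to fine-tune.", return_tensors="pt")
with torch.no_grad():                       # forward pass only, for illustration
    outputs = encoder(**inputs)
logits = classifier(outputs.pooler_output)  # [CLS]-based sentence representation
print(logits.shape)                         # torch.Size([1, 2])
```

During actual fine-tuning, the encoder and the classifier would be trained jointly on the downstream task; the sketch only shows where the single additional layer sits.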

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven NLP tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement), and the SQuAD v1.1 question answering test F1 to 93.2 (1.5 point absolute improvement), 2 points higher than human performance.

1. Introduction

Language model pre-training has been shown to substantially improve many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where the model must produce fine-grained output at the token level.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, using unidirectional language models to learn general language representations.
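A toy sketch of the contrast between the two strategies (purely illustrative; the tiny linear modules below are hypothetical stand-ins for a real pre-trained encoder and task model):

```python
# Illustrative sketch of "feature-based" vs. "fine-tuning" use of a pre-trained encoder.
import torch

encoder = torch.nn.Linear(32, 32)    # stand-in for a pre-trained encoder
task_head = torch.nn.Linear(32, 2)   # small task-specific layer
batch = torch.randn(4, 32)

# Feature-based (ELMo-style): the encoder is frozen and its outputs are used
# as additional features for a separately trained task model.
with torch.no_grad():
    features = encoder(batch)
feature_based_logits = task_head(features)

# Fine-tuning (OpenAI GPT-style): gradients flow through the encoder, so its
# parameters are updated together with the task head.
fine_tuning_logits = task_head(encoder(batch))
print(feature_based_logits.shape, fine_tuning_logits.shape)
```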

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approach. The major limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT the authors use a left-to-right architecture, where every token can only attend to the tokens before it in the Transformer's self-attention layers (Williams et al., 2018). Such restrictions are sub-optimal (though tolerable) for sentence-level tasks, but can be harmful when applying fine-tuning-based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where incorporating context from both directions is crucial.
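The unidirectionality constraint can be pictured as an attention mask. Below is an illustrative sketch (not from the paper) contrasting a left-to-right, GPT-style causal mask with the full bidirectional visibility that BERT's self-attention uses:

```python
# Illustrative sketch: in a left-to-right model such as OpenAI GPT, position i may
# only attend to positions 0..i (a causal mask), whereas BERT's self-attention
# sees both left and right context at every layer.
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len)))    # GPT-style: lower triangle only
bidirectional_mask = np.ones((seq_len, seq_len))      # BERT-style: every position visible

print(causal_mask)          # row i has ones only up to column i
print(bidirectional_mask)   # every position can attend to every other position
```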

In this paper, we improve the fine-tuning-based approach by proposing BERT: Bidirectional Encoder Representations from Transformers. Inspired by the Cloze task, BERT addresses the aforementioned unidirectionality constraint by proposing a new pre-training objective: the "masked language model" (MLM) (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of each masked token based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also propose a "next sentence prediction" task that jointly pre-trains text-pair representations.
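A minimal sketch of the masking idea (illustrative only; the masking rate and [MASK] id below are assumptions, not the paper's exact procedure):

```python
# Illustrative sketch of masked language modeling: replace a random subset of input
# tokens with a [MASK] token; the training target is the original id at each masked position.
import random

MASK_ID = 103      # assumed id of the [MASK] token (vocabularies differ)
MASK_PROB = 0.15   # illustrative masking rate

def mask_tokens(token_ids, seed=0):
    """Return (masked inputs, labels); labels keep original ids only where masked."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tid in enumerate(token_ids):
        if rng.random() < MASK_PROB:
            inputs[i] = MASK_ID   # the model no longer sees the original token...
            labels[i] = tid       # ...but must predict its original vocabulary id
    return inputs, labels

masked, targets = mask_tokens([2026, 3899, 2003, 10140, 1012])
print(masked, targets)
```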

The contributions of this paper are as follows:

  • We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al., 2018, who use unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This also contrasts with Peters et al., 2018, who use a shallow concatenation of independently trained left-to-right and right-to-left language models.
  • We show that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning-based representation model that achieves state-of-the-art performance on a large number of sentence-level and token-level tasks, outperforming many models with task-specific architectures.
  • BERT advances the state of the art for eleven NLP tasks. We also report extensive ablation studies demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained models will be available at goo.gl/language/bert.

....

References

All references are listed in the order in which they are cited in each section of the paper; works cited more than once appear multiple times in the lists below.

References cited in the Abstract

Citation in the BERT paper | Original paper title | Notes
Peters et al., 2018 | Deep contextualized word representations | ELMo
Radford et al., 2018 | Improving Language Understanding with Unsupervised Learning | OpenAI GPT

References cited in Section 1 (Introduction)

Citation in the BERT paper | Original paper title | Notes
Peters et al., 2018 | Deep contextualized word representations | ELMo
Radford et al., 2018 | Improving Language Understanding with Unsupervised Learning | OpenAI GPT
Dai and Le, 2015 | Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087 | Andrew M Dai and Quoc V Le. 2015.
Howard and Ruder, 2018 | Universal Language Model Fine-tuning for Text Classification | ULMFiT; Jeremy Howard and Sebastian Ruder.
Bowman et al., 2015 | A large annotated corpus for learning natural language inference | Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning.
Williams et al., 2018 | A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference | Adina Williams, Nikita Nangia, and Samuel R. Bowman.
Dolan and Brockett, 2005 | Automatically constructing a corpus of sentential paraphrases | William B. Dolan and Chris Brockett. 2005.
Tjong Kim Sang and De Meulder, 2003 | Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition | Erik F. Tjong Kim Sang and Fien De Meulder. 2003.
Rajpurkar et al., 2016 | SQuAD: 100,000+ Questions for Machine Comprehension of Text | SQuAD
Taylor, 1953 | "Cloze Procedure": A New Tool For Measuring Readability | Wilson L. Taylor. 1953.

References cited in Section 2 (Related Work)

Citation in the BERT paper | Original paper title | Notes
Brown et al., 1992 | Class-based n-gram models of natural language | Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992.
Ando and Zhang, 2005 | A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data | Rie Kubota Ando and Tong Zhang. 2005.
Blitzer et al., 2006 | Domain adaptation with structural correspondence learning | John Blitzer, Ryan McDonald, and Fernando Pereira. 2006.
Collobert and Weston, 2008 | A Unified Architecture for Natural Language Processing | Ronan Collobert and Jason Weston. 2008.
Mikolov et al., 2013 | Distributed Representations of Words and Phrases and their Compositionality | CBOW model; Skip-gram model
Pennington et al., 2014 | GloVe: Global Vectors for Word Representation | GloVe
Turian et al., 2010 | Word Representations: A Simple and General Method for Semi-Supervised Learning | Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Kiros et al., 2015 | Skip-Thought Vectors | Skip-Thought Vectors
Logeswaran and Lee, 2018 | An efficient framework for learning sentence representations | Lajanugen Logeswaran and Honglak Lee. 2018.
Le and Mikolov, 2014 | Distributed Representations of Sentences and Documents | Quoc Le and Tomas Mikolov. 2014.
Peters et al., 2017 | Semi-supervised sequence tagging with bidirectional language models | Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017.
Peters et al., 2018 | Deep contextualized word representations | ELMo
Rajpurkar et al., 2016 | SQuAD: 100,000+ Questions for Machine Comprehension of Text | SQuAD
Socher et al., 2013 | Deeply Moving: Deep Learning for Sentiment Analysis | SST-2
Tjong Kim Sang and De Meulder, 2003 | Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition | Erik F. Tjong Kim Sang and Fien De Meulder. 2003.
Dai and Le, 2015 | Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087 | Andrew M Dai and Quoc V Le. 2015.
Howard and Ruder, 2018 | Universal Language Model Fine-tuning for Text Classification | ULMFiT; Jeremy Howard and Sebastian Ruder.
Radford et al., 2018 | Improving Language Understanding with Unsupervised Learning | OpenAI GPT
Wang et al., 2018 | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | GLUE
Conneau et al., 2017 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017.
McCann et al., 2017 | Learned in Translation: Contextualized Word Vectors | Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Deng et al., 2009 | ImageNet: A large-scale hierarchical image database | J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009.
Yosinski et al., 2014 | How transferable are features in deep neural networks? | Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014.

References cited in Section 3 (BERT)

Citation in the BERT paper | Original paper title | Notes
Vaswani et al., 2017 | Attention Is All You Need | Transformer
Wu et al., 2016 | Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | WordPiece
Taylor, 1953 | "Cloze Procedure": A New Tool For Measuring Readability | Wilson L. Taylor. 1953.
Vincent et al., 2008 | Extracting and composing robust features with denoising autoencoders | denoising autoencoders
Zhu et al., 2015 | Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books | BooksCorpus (800M words)
Chelba et al., 2013 | One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling | Billion Word Benchmark corpus
Hendrycks and Gimpel, 2016 | Gaussian Error Linear Units (GELUs) | GELUs

References cited in Section 4 (Experiments)

Citation in the BERT paper | Original paper title | Notes
Wang et al., 2018 | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | GLUE
Williams et al., 2018 | A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference | MNLI
Chen et al., 2018 | First Quora Dataset Release: Question Pairs | QQP
Rajpurkar et al., 2016 | SQuAD: 100,000+ Questions for Machine Comprehension of Text | QNLI
Socher et al., 2013 | Deeply Moving: Deep Learning for Sentiment Analysis | SST-2
Warstadt et al., 2018 | The Corpus of Linguistic Acceptability | CoLA
Cer et al., 2017 | SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation | STS-B
Dolan and Brockett, 2005 | Automatically constructing a corpus of sentential paraphrases | MRPC
Bentivogli et al., 2009 | The Fifth PASCAL Recognizing Textual Entailment Challenge | RTE
Levesque et al., 2011 | The Winograd Schema Challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47 | WNLI
Rajpurkar et al., 2016 | SQuAD: 100,000+ Questions for Machine Comprehension of Text | SQuAD
Joshi et al., 2017 | TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension | TriviaQA
Clark et al., 2018 | Semi-Supervised Sequence Modeling with Cross-View Training |
Zellers et al., 2018 | SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference | SWAG

References cited in Section 5 (Ablation Studies)

Citation in the BERT paper | Original paper title | Notes
Vaswani et al., 2017 | Attention Is All You Need | Transformer
Al-Rfou et al., 2018 | Character-Level Language Modeling with Deeper Self-Attention |

More Repositories

1

DeepImage-an-Image-to-Image-technology

DeepNude's algorithm, plus research on the theory and practice of general GAN (Generative Adversarial Network) image generation, including pix2pix, CycleGAN, UGATIT, DCGAN, SinGAN, ALAE, mGANprior, StarGAN-v2 and VAE models (TensorFlow 2 implementations).
Python
5,168 stars
2

Entity-Relation-Extraction

Entity and relation extraction based on TensorFlow and BERT: a pipeline-style entity and relation extraction system, and a solution to the information extraction task of the 2019 Language and Intelligence Challenge. Schema-based Knowledge Extraction (SKE 2019).
Python
1,218 stars
3

Machine-Learning-Book

A machine learning compendium covering Google's Machine Learning Crash Course, a machine learning glossary, the Rules of Machine Learning, and common-sense questions in machine learning. A reference for machine learning and deep learning researchers and enthusiasts.
Jupyter Notebook
1,031 stars
4

BERT-for-Sequence-Labeling-and-Text-Classification

Template code for using BERT for sequence labeling and text classification, making it easier to apply BERT to more tasks. It currently covers CoNLL-2003 named entity recognition and Snips slot filling and intent prediction.
Python
468 stars
5

Multiple-Relations-Extraction-Only-Look-Once

Multiple-Relations-Extraction-Only-Look-Once: look at a sentence only once and extract every pair of entities and their corresponding relations. An end-to-end joint multi-relation extraction model, usable for the http://lic2019.ccf.org.cn/kg information extraction task.
Python
346 stars
6

Schema-based-Knowledge-Extraction

Code for the http://lic2019.ccf.org.cn/kg information extraction task: an end-to-end joint model for BERT-based entity extraction and relation extraction.
Python
284 stars
7

Machine_Learning_bookshelf

A repository of books, courseware, and code related to machine learning and deep learning.
Jupyter Notebook
189 stars
8

Multimodal-short-video-dataset-and-baseline-classification-model

A dataset of 500,000 multimodal short videos and baseline classification models (TensorFlow 2.0).
Jupyter Notebook
125 stars
9

Theoretical-Proof-of-Neural-Network-Model-and-Implementation-Based-on-Numpy

Implements a deep neural network with NumPy, accompanied by easy-to-follow theoretical derivations, mainly aimed at an in-depth understanding of neural networks.
Python
77 stars
10

Find-a-Machine-Learning-Job

Finding a machine learning job (algorithm engineer) takes solid algorithmic skills, solid programming skills, and thorough preparation. I received a great deal of help from others while studying and job hunting and have now found a job I love; this write-up is dedicated to everyone chasing that dream on the job-search road. The 2020 autumn recruiting season for algorithm roles was dramatically harder, leaving no choice but to meet the challenge head-on.
66 stars
11

fan-ren-xiu-xian-zhuan

A collection of resources about the novel 凡人修仙传 (fanrenxiuxianzhuan), dedicated to its fans.
Python
52 stars
12

XLNet_Paper_Chinese_Translation

Chinese translation of the paper "XLNet: Generalized Autoregressive Pretraining for Language Understanding".
50 stars
13

Slot-Filling-and-Intention-Prediction-in-Paper-Translation

A collection and Chinese translation of papers on slot filling and intent prediction (spoken language understanding).
49 stars
14

SMP2018

SMP2018 Chinese Human-Computer Dialogue Technology Evaluation (ECDT).
Jupyter Notebook
47 stars
15

Image-Captioning

CNN encoder and RNN decoder (Bahdanau attention) for image captioning (image-to-text) on the MS-COCO dataset.
Jupyter Notebook
35 stars
16

ELMo

ELMo: Embeddings from Language Models. Using, visualizing, and understanding ELMo through examples.
Jupyter Notebook
33 stars
17

Text-generation-task-and-language-model-GPT2

Resources for solving text generation tasks with the GPT-2 language model, including papers, code, demos, and hands-on tutorials.
29 stars
18

Transformer_implementation_and_application

A complete reimplementation of the Transformer model in 300 lines of code (TensorFlow 2), applied to neural machine translation and chatbots.
Jupyter Notebook
26 stars
19

CPlusPlus-Programming-Language-Foundation

"CPlusPlus Programming Language Foundation", also known as the "C++ knowledge tree": tree-style mind maps presenting all the C++ fundamentals a C++ practitioner needs.
22 stars
20

yuanxiaosc.github.io

Personal blog: papers, machine learning, deep learning, Python, and C++.
HTML
21 stars
21

Keras_Attention_Seq2Seq

A human-readable Keras-based sequence-to-sequence framework with an attention mechanism; you may not need to write complex code, just use it directly.
Python
18 stars
22

Deep_dynamic_contextualized_word_representation

TensorFlow code and pre-trained models for "A Dynamic Word Representation Model Based on Deep Context", which combines the ideas of BERT and ELMo's deep contextualized word representations.
Python
16 stars
23

Path-Classification-Experiment

Introduction to data analysis: a path classification experiment. Using optimal path selection as an example, it explains in detail how to solve a typical classification problem, including exploring the raw data, building models, tuning them, and analyzing predictions, along with basic usage of feed-forward neural networks (Keras), machine learning models (scikit-learn), and data plotting (matplotlib).
Jupyter Notebook
12 stars
24

Deep-Convolutional-Generative-Adversarial-Network

Demonstrates how to generate images of handwritten digits (MNIST) using a Deep Convolutional Generative Adversarial Network (DCGAN) in TensorFlow 2.
Jupyter Notebook
9 stars
25

NLPCC2019-Conference-Materials

Shared materials from the NLPCC 2019 conference: a summary of paper submission information, current NLP research topics and trends, invited talks, posters, company introductions, and job postings.
7 stars
26

Image_to_Text

Template code for Image_to_Text, illustrated with the image captioning task on the MS-COCO dataset.
Jupyter Notebook
6 stars
27

Slot-Gated-Modeling-for-Joint-Slot-Filling-and-Intent-Prediction

Code walkthrough and paper analysis for "Slot-Gated Modeling for Joint Slot Filling and Intent Prediction".
Python
5 stars
28

Seq2Seq-English-French-Machine-Translation-Model

Seq2Seq English-French Machine Translation Model
Python
5 stars
29

Hands-on-chat-robots

A variety of out-of-the-box chatbot code examples.
Jupyter Notebook
3 stars