  • Language: Python
  • License: MIT License

Pretrained BERT model for analysing COVID-19 Twitter data

COVID-Twitter-BERT

COVID-Twitter-BERT (CT-BERT) is a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19. The v2 model is trained on 97M tweets (1.2B training examples).

When used on domain-specific datasets, our evaluation shows that this model achieves a marginal performance increase of 10–30% compared to the standard BERT-Large model. The largest improvements are seen on COVID-19-related and Twitter-like messages.
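To make the metric concrete, here is a minimal sketch. It assumes that "marginal performance increase" means the share of the baseline's remaining error (1 − F1) that the new model removes; please check the paper for the exact definition. The function name and the scores below are hypothetical and are not results from the evaluation.

```python
def marginal_improvement(baseline_f1: float, new_f1: float) -> float:
    """Fraction of the baseline's remaining error (1 - F1) removed by the new model.

    Assumption: this is one reading of "marginal performance increase";
    the paper is the authoritative source for the definition.
    """
    if not 0.0 <= baseline_f1 < 1.0:
        raise ValueError("baseline_f1 must be in [0, 1)")
    return (new_f1 - baseline_f1) / (1.0 - baseline_f1)


# Hypothetical scores, not results from the paper.
improvement = marginal_improvement(baseline_f1=0.80, new_f1=0.85)
print(f"{improvement:.0%}")  # prints 25%
```

Under this reading, a gain of 5 F1 points counts for more when the baseline is already strong, because less headroom is left.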

This repository contains all code and references to models and datasets used in our paper as well as notebooks to finetune CT-BERT on your own datasets. If you end up using our work, please cite it:

Martin Müller, Marcel Salathé, and Per E Kummervold. 
COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. 
arXiv preprint arXiv:2005.07503 (2020).

Colab

For a demo on how to train a classifier on top of CT-BERT, please take a look at this Colab. It finetunes a model on the SST-2 dataset. It can also easily be modified for finetuning on your own data.

Using Huggingface (on GPU)

Open In Colab

Using Tensorflow 2.2 (on TPUs) - ⚠️ Currently not working due to 2.3 incompatibility ⚠️

Open In Colab

Usage

If you are familiar with finetuning transformer models, the CT-BERT model is available as a downloadable archive, on TFHub, and as a module in Huggingface.

Version  Base model              Language  TF2             Huggingface  TFHub
v1       BERT-large-uncased-WWM  en        TF2 Checkpoint  Huggingface  TFHub
v2       BERT-large-uncased-WWM  en        TF2 Checkpoint  Huggingface  TFHub

Huggingface

You can load the pretrained model from huggingface:

from transformers import BertForPreTraining
model = BertForPreTraining.from_pretrained('digitalepidemiologylab/covid-twitter-bert-v2')

You can predict tokens using the built-in pipelines:

from transformers import pipeline
import json

pipe = pipeline(task='fill-mask', model='digitalepidemiologylab/covid-twitter-bert-v2')
out = pipe(f"In places with a lot of people, it's a good idea to wear a {pipe.tokenizer.mask_token}")
print(json.dumps(out, indent=4))

Output:

[
    {
        "sequence": "[CLS] in places with a lot of people, it's a good idea to wear a mask [SEP]",
        "score": 0.9998226761817932,
        "token": 7308,
        "token_str": "mask"
    },
    ...
]
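The pipeline returns a list of dictionaries, so the predictions can be filtered with plain Python. The sketch below keeps only the candidates above a score threshold. The helper name `confident_tokens` and the threshold are illustrative choices. The first sample entry is copied from the output above, and the second one is made up.

```python
from typing import Dict, List

# Sample shaped like the pipeline output shown above. The first entry is copied
# from it; the second entry is made up for illustration.
predictions: List[Dict] = [
    {
        "sequence": "[CLS] in places with a lot of people, it's a good idea to wear a mask [SEP]",
        "score": 0.9998226761817932,
        "token": 7308,
        "token_str": "mask",
    },
    {
        "sequence": "[CLS] in places with a lot of people, it's a good idea to wear a helmet [SEP]",
        "score": 0.00001,
        "token": 0,  # placeholder id, not a real vocabulary lookup
        "token_str": "helmet",
    },
]


def confident_tokens(preds, min_score=0.5):
    """Return (token_str, score) pairs whose score reaches min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(confident_tokens(predictions))  # [('mask', 0.9998226761817932)]
```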

TF-Hub

import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 96  # Your choice here.
input_word_ids = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_word_ids")
input_mask = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_mask")
input_type_ids = tf.keras.layers.Input(
  shape=(max_seq_length,),
  dtype=tf.int32,
  name="input_type_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/1", trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, input_type_ids])
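The three inputs above all have the fixed length max_seq_length, so each tokenized sentence has to be padded (or truncated) before it is fed to the layer. A minimal sketch of that step in plain Python follows. It assumes that 0 is the padding id and that 101 and 102 are the [CLS] and [SEP] ids of the uncased BERT vocabulary. The function name `pad_bert_inputs` is hypothetical.

```python
from typing import List, Tuple


def pad_bert_inputs(
    token_ids: List[int], max_seq_length: int = 96
) -> Tuple[List[int], List[int], List[int]]:
    """Truncate or zero-pad one tokenized sentence to max_seq_length.

    Returns (input_word_ids, input_mask, input_type_ids), matching the three
    Keras inputs defined above.
    """
    word_ids = token_ids[:max_seq_length]
    n_real = len(word_ids)
    n_pad = max_seq_length - n_real
    input_word_ids = word_ids + [0] * n_pad  # 0 is assumed to be the [PAD] id
    input_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    input_type_ids = [0] * max_seq_length    # single-sentence input: one segment
    return input_word_ids, input_mask, input_type_ids


# Hypothetical ids: 101 and 102 are assumed to be [CLS] and [SEP];
# 7308 is the id of "mask" in the pipeline output above.
ids, mask, types = pad_bert_inputs([101, 7308, 102], max_seq_length=8)
print(ids)    # [101, 7308, 102, 0, 0, 0, 0, 0]
print(mask)   # [1, 1, 1, 0, 0, 0, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Note that this naive truncation can cut off the final [SEP] of a long sentence. In practice a tokenizer with built-in padding and truncation is the safer choice.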

Finetune CT-BERT using our scripts

The script run_finetune.py can be used to train a classifier. It depends on the official tensorflow/models implementation of BERT under TensorFlow 2.2/Keras.

In order to use our code you need to set up:

  • A Google Cloud bucket
  • A Google Cloud VM running Tensorflow 2.2
  • A TPU in the same zone as the VM also running Tensorflow 2.2

If you are a researcher you may apply for access to TPUs and/or Google Cloud credits.

Install

Clone the repository recursively

git clone https://github.com/digitalepidemiologylab/covid-twitter-bert.git --recursive && cd covid-twitter-bert

Our code was developed using tf-nightly but we made it backwards compatible to run with tensorflow 2.2. We recommend using Anaconda to manage the Python version:

conda create -n covid-twitter-bert python=3.8
conda activate covid-twitter-bert

Install dependencies

pip install -r requirements.txt

Prepare the data

Split your data into a training set train.tsv and a validation set dev.tsv with the following format:

id      label   text
1224380447930683394     label_a       Example text 1
1224380447930683394     label_a       Example text 2
1220843980633661443     label_b       Example text 3

Place these files into the folder data/finetune/originals/<dataset_name>/(train|dev).tsv (using your own dataset_name).
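As a sketch of how such files could be produced, the snippet below writes a train.tsv and a dev.tsv with the header shown above. It assumes the columns are separated by tabs, as the .tsv extension suggests. The dataset name `my_dataset`, the helper `write_split`, and the rows are hypothetical, and a temporary directory stands in for the repository root.

```python
import csv
import os
import tempfile

# Hypothetical rows: (tweet id, label, text). Ids are kept as strings so that
# large tweet ids are written out unchanged.
rows = [
    ("1224380447930683394", "label_a", "Example text 1"),
    ("1220843980633661443", "label_b", "Example text 3"),
]


def write_split(path, rows):
    """Write rows as a tab-separated file with the header id, label, text."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "label", "text"])
        writer.writerows(rows)


# A temporary directory stands in for the repository root here.
root = tempfile.mkdtemp()
dataset_dir = os.path.join(root, "data", "finetune", "originals", "my_dataset")
write_split(os.path.join(dataset_dir, "train.tsv"), rows[:1])
write_split(os.path.join(dataset_dir, "dev.tsv"), rows[1:])
print(sorted(os.listdir(dataset_dir)))  # ['dev.tsv', 'train.tsv']
```

Texts that contain tabs or quotes are quoted by the csv module. Whether the preprocessing script accepts quoted fields is not verified here, so it may be simpler to strip tabs and newlines from the text beforehand.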

You can then run the preprocessing script:

cd preprocess
python create_finetune_data.py \
  --run_prefix test_run \
  --finetune_datasets <dataset_name> \
  --model_class bert_large_uncased_wwm \
  --max_seq_length 96 \
  --asciify_emojis \
  --username_filler twitteruser \
  --url_filler twitterurl \
  --replace_multiple_usernames \
  --replace_multiple_urls \
  --remove_unicode_symbols

This will generate TF record files in data/finetune/run_2020-05-19_14-14-53_517063_test_run/<dataset_name>/tfrecords.
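The flags --username_filler and --url_filler in the command above suggest that mentions and links are replaced with fixed tokens before tokenization. A rough sketch of that idea follows. The regular expressions and the function name `anonymize` are illustrative assumptions and are not taken from create_finetune_data.py, whose actual behaviour may differ.

```python
import re

# Illustrative patterns, not the ones used by the repository's scripts.
USERNAME_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def anonymize(text, username_filler="twitteruser", url_filler="twitterurl"):
    """Replace URLs and @mentions with fixed filler tokens."""
    # URLs go first so that an "@" inside a link is not treated as a mention.
    text = URL_RE.sub(url_filler, text)
    text = USERNAME_RE.sub(username_filler, text)
    return text


print(anonymize("@who says wear a mask https://example.org/advice"))
# twitteruser says wear a mask twitterurl
```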

You can now upload the data to your bucket:

cd data
gsutil -m rsync -r finetune/ gs://<bucket_name>/covid-bert/finetune/finetune_data/

Start finetuning

You can now finetune CT-BERT on this data using the following command:

RUN_PREFIX=testrun                                  # Name your run
BUCKET_NAME=                                        # Fill in your bucket's name here (without the gs:// prefix)
TPU_IP=XX.XX.XXX.X                                  # Fill in your TPU's IP here
FINETUNE_DATASET=<dataset_name>                     # Your dataset name
FINETUNE_DATA=<dataset_run>                         # Fill in dataset run name (e.g. run_2020-05-19_14-14-53_517063_test_run)
MODEL_CLASS=covid-twitter-bert
TRAIN_BATCH_SIZE=32
EVAL_BATCH_SIZE=8
LR=2e-5
NUM_EPOCHS=1

python run_finetune.py \
  --run_prefix $RUN_PREFIX \
  --bucket_name $BUCKET_NAME \
  --tpu_ip $TPU_IP \
  --model_class $MODEL_CLASS \
  --finetune_data ${FINETUNE_DATA}/${FINETUNE_DATASET} \
  --train_batch_size $TRAIN_BATCH_SIZE \
  --eval_batch_size $EVAL_BATCH_SIZE \
  --num_epochs $NUM_EPOCHS \
  --learning_rate $LR
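If you prefer to launch runs from Python, for example to loop over several learning rates, the same command can be assembled from a dictionary. This is only a sketch. The flag names are those of the command above, while the helper `build_command` and all values (bucket name, TPU IP, dataset name) are hypothetical placeholders.

```python
import shlex

# Hypothetical values; replace them with your own bucket, TPU and dataset.
config = {
    "run_prefix": "testrun",
    "bucket_name": "my-bucket",
    "tpu_ip": "10.0.0.2",
    "model_class": "covid-twitter-bert",
    "finetune_data": "run_2020-05-19_14-14-53_517063_test_run/my_dataset",
    "train_batch_size": 32,
    "eval_batch_size": 8,
    "num_epochs": 1,
    "learning_rate": 2e-5,
}


def build_command(config, script="run_finetune.py"):
    """Turn a config dict into an argument list suitable for subprocess.run."""
    args = ["python", script]
    for name, value in config.items():
        args += [f"--{name}", str(value)]
    return args


command = build_command(config)
print(shlex.join(command))
# To launch it from the repository root: subprocess.run(command, check=True)
```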

Training logs, run configs, etc. are then stored in gs://<bucket_name>/covid-bert/finetune/runs/run_2020-04-29_21-20-52_656110_<run_prefix>/. Among the TensorFlow logs you will find a file called run_logs.json containing all relevant training information:

{
    "created_at": "2020-04-29 20:58:23",
    "run_name": "run_2020-04-29_20-51-10_405727_test_run",
    "final_loss": 0.19747886061668396,
    "max_seq_length": 96,
    "num_train_steps": 210,
    "eval_steps": 103,
    "steps_per_epoch": 42,
    "training_time_min": 6.77958079179128,
    "f1_macro": 0.7216383309465823,
    "scores_by_label": {
      ...
    },
    ...
}
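Once downloaded, such a file can be read with the standard json module. The sketch below parses a trimmed copy of the example above (the elided fields are left out) and prints the headline numbers. The helper name `summarize` is an illustrative choice.

```python
import json

# Trimmed copy of the example above; the elided fields ("...") are left out.
raw = """
{
    "run_name": "run_2020-04-29_20-51-10_405727_test_run",
    "final_loss": 0.19747886061668396,
    "training_time_min": 6.77958079179128,
    "f1_macro": 0.7216383309465823
}
"""


def summarize(log):
    """Format the headline numbers of one run as a single line."""
    return (
        f"{log['run_name']}: f1_macro={log['f1_macro']:.3f}, "
        f"loss={log['final_loss']:.3f}, "
        f"time={log['training_time_min']:.1f} min"
    )


log = json.loads(raw)
print(summarize(log))
# run_2020-04-29_20-51-10_405727_test_run: f1_macro=0.722, loss=0.197, time=6.8 min
```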

Run the script sync_bucket_data.py from your local computer to download all the training logs to data/<bucket_name>/covid-bert/finetune/<run_names>:

python sync_bucket_data.py --bucket_name <bucket_name>
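The run names in the examples above appear to follow the pattern run_<timestamp>_<run_prefix>, with the timestamp written as date, time and microseconds. Assuming that pattern holds (it is inferred from the examples, not from the code), downloaded runs can be ordered by creation time as sketched below. The helper `parse_run_name` is hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the example run names.
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S_%f"


def parse_run_name(run_name):
    """Split 'run_<timestamp>_<prefix>' into (datetime, prefix)."""
    parts = run_name.split("_", 4)  # ['run', date, time, microseconds, prefix]
    if len(parts) != 5 or parts[0] != "run":
        raise ValueError(f"unexpected run name: {run_name!r}")
    created = datetime.strptime("_".join(parts[1:4]), TIMESTAMP_FORMAT)
    return created, parts[4]


runs = [
    "run_2020-05-19_14-14-53_517063_test_run",
    "run_2020-04-29_20-51-10_405727_test_run",
]
latest = max(runs, key=lambda name: parse_run_name(name)[0])
print(latest)  # run_2020-05-19_14-14-53_517063_test_run
```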

Datasets

In our preliminary study we evaluated our model on five different classification datasets:

Dataset name                           Num classes  Reference
COVID Category (CC)                    2            Read more
Vaccine Sentiment (VS)                 3            See ➡️
Maternal vaccine Sentiment (MVS)       4            [not yet public]
Stanford Sentiment Treebank 2 (SST-2)  2            See ➡️
Twitter Sentiment SemEval (SE)         3            See ➡️

If you end up using these datasets, please make sure to properly cite them.

Pretrain

Documentation of how we created CT-BERT can be found here.

How do I cite COVID-Twitter-BERT?

You can cite our preprint:

@article{muller2020covid,
  title={COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter},
  author={M{\"u}ller, Martin and Salath{\'e}, Marcel and Kummervold, Per E},
  journal={arXiv preprint arXiv:2005.07503},
  year={2020}
}

or

Martin Müller, Marcel Salathé, and Per E. Kummervold. 
COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter.
arXiv preprint arXiv:2005.07503 (2020).

Acknowledgement

  • Thanks to Aksel Kummervold for creating the COVID-Twitter-Bert logo.
  • The model has been trained using resources made available by TPU Research Cloud (TRC) and Google Cloud COVID-19 research credits.
  • The model was trained as a collaboration between Martin Müller, Marcel Salathé and Per Egil Kummervold.
  • PK received funding from the European Commission for the call H2020-MSCA-IF-2017 and the funding scheme MSCA-IF-EF-ST for the VACMA project (grant agreement ID: 797876).
  • MM and MS received funding through the Versatile Emerging infectious disease Observatory grant as a part of the European Commission's Horizon 2020 framework programme (grant agreement ID: 874735).
  • The research was supported with Cloud TPUs from Google’s TPU Research Cloud and Google Cloud Credits in the context of COVID-19-related research.
