
Covid-19 Semantic Browser: Browse Covid-19 & SARS-CoV-2 Scientific Papers with Transformers 🦠 📖

Covid-19 Semantic Browser is an interactive experimental tool leveraging state-of-the-art language models to search relevant content inside the COVID-19 Open Research Dataset (CORD-19), recently published by the White House and its research partners. The dataset contains over 44,000 scholarly articles about COVID-19, SARS-CoV-2 and related coronaviruses.

Several models already fine-tuned on Natural Language Inference are available to perform the search, including SciBERT-NLI [1], BioBERT-NLI [2] and CovidBERT-NLI.

All models are trained on SNLI [3] and MultiNLI [4] using the sentence-transformers library [5] to produce universal sentence embeddings [6]. Embeddings are subsequently used to perform semantic search on CORD-19.
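The retrieval step described above amounts to nearest-neighbor search in embedding space: encode the query, score it against precomputed abstract embeddings by cosine similarity, and return the best matches. A minimal stdlib-only sketch (the toy 2-d vectors stand in for real sentence embeddings, and the function names are illustrative, not taken from the project's scripts):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, corpus_vecs, top_k=2):
    # Rank all corpus vectors by similarity to the query, highest first.
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

# Toy "abstract embeddings": the first two point in similar directions.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(semantic_search([1.0, 0.05], corpus, top_k=2))  # -> [0, 1]
```

In the real tool the vectors come from the sentence-transformers encoder, but the ranking logic is exactly this pattern.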

Currently supported operations are:

  • Browse paper abstracts with interactive queries.

  • Reproduce SciBERT-NLI, BioBERT-NLI and CovidBERT-NLI training results.

Setup

Python 3.6 or higher is required to run the code. First, install the required libraries with pip, then download the en_core_web_sm language pack for spaCy and the punkt tokenizer data for NLTK:

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt

Using the Browser

First, download a model fine-tuned on NLI from HuggingFace's cloud repository:

python scripts/download_model.py --model scibert-nli

Second, download the data from the Kaggle challenge page and place it in the data folder.
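For reference, the Kaggle release of CORD-19 ships a metadata.csv file whose rows carry the paper abstracts; a minimal sketch of pulling out the usable ones (the inline sample stands in for the real file, whose cord_uid / title / abstract columns are assumed here):

```python
import csv
import io

# Inline stand-in for data/metadata.csv from the CORD-19 release.
sample = (
    "cord_uid,title,abstract\n"
    "ug7v899j,Example paper,Clinical features of culture-proven cases\n"
    "02tnwd4m,Paper without abstract,\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
# Keep only papers that actually have an abstract to embed.
abstracts = [r["abstract"] for r in rows if r["abstract"]]
print(len(abstracts))  # -> 1
```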

Finally, simply run:

python scripts/interactive_search.py

to enter the interactive demo. Using a GPU is suggested, since creating embeddings for the entire corpus can otherwise be time-consuming. Both the corpus and the embeddings are cached on disk after the first execution of the script, so subsequent runs are fast.
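The caching behaviour described above is a compute-once, load-thereafter pattern; a stdlib-only sketch (the toy embed() function and the cache filename are illustrative, not the script's actual code):

```python
import os
import pickle
import tempfile

def embed(texts):
    # Stand-in for the slow transformer encoding pass.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def load_or_compute(texts, cache_path):
    # Reuse cached embeddings if a previous run already produced them.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    vecs = embed(texts)  # the expensive step, done only once
    with open(cache_path, "wb") as f:
        pickle.dump(vecs, f)
    return vecs

corpus = ["a short abstract", "another abstract"]
cache = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(corpus, cache)   # computes and writes the cache
second = load_or_compute(corpus, cache)  # served from disk
```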

Use the interactive demo as follows:

Demo GIF

Reproducing Training Results for Transformers

First, download a pretrained model from HuggingFace's cloud repository:

python scripts/download_model.py --model scibert

Second, download the NLI datasets used for training and the STS dataset used for testing.

python scripts/get_finetuning_data.py

Finally, run the finetuning script by adjusting the parameters depending on the model you intend to train (default is scibert-nli).

python scripts/finetune_nli.py

The model will be evaluated against the test portion of the Semantic Text Similarity (STS) benchmark dataset at the end of training. Please refer to my model cards for additional references on parameter values.
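STS evaluation boils down to ranking agreement between the model's predicted similarities and human gold scores, typically reported as Spearman correlation. A minimal sketch of that metric (assumes no tied values; the real benchmark handles ties with averaged ranks):

```python
def ranks(xs):
    # Rank position of each value (0 = smallest); assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for r, i in enumerate(order):
        out[i] = r
    return out

def spearman(pred, gold):
    # Spearman rho via the rank-difference formula.
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(gold)))
    return 1 - 6 * d2 / (n * (n * n - 1))

pred = [0.9, 0.1, 0.5, 0.7]  # model cosine similarities
gold = [4.8, 0.4, 2.5, 3.9]  # human scores on the 0-5 STS scale
print(spearman(pred, gold))  # -> 1.0, identical rankings
```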

References

[1] Beltagy et al. 2019, "SciBERT: A Pretrained Language Model for Scientific Text"

[2] Lee et al. 2020, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining"

[3] Bowman et al. 2015, "A large annotated corpus for learning natural language inference"

[4] Williams et al. 2018, "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference"

[5] Reimers et al. 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"

[6] Conneau et al. 2017, "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data"
