  • Stars: 122
  • Rank 286,893 (Top 6 %)
  • Language: Python
  • License: MIT License
  • Created about 5 years ago
  • Updated almost 4 years ago

Repository Details

Software and data for "Using Text Embeddings for Causal Inference"

Introduction

This repository contains software and data for "Using Text Embeddings for Causal Inference" (arxiv.org/abs/1905.12741). The paper describes a method for causal inference with text documents. For example, does adding a theorem to a paper affect its chance of acceptance? The method adapts deep language models to address the causal problem.

This software builds on

  1. BERT: github.com/google-research/bert, and
  2. PeerRead: github.com/allenai/PeerRead

We include pre-processed PeerRead arXiv data for convenience.

There is also a reference implementation in PyTorch.

TensorFlow 2

For new projects, we recommend building on the reference TensorFlow 2 implementation.

Requirements and setup

  1. You'll need to download a pre-trained BERT model (following the BERT GitHub link above). We use uncased_L-12_H-768_A-12.
  2. Install TensorFlow 1.12.
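
Before training, it can help to confirm that the downloaded BERT directory is complete. The sketch below is not part of this repository; `missing_bert_files` is a hypothetical helper, and it assumes the archive was unzipped without renaming its files (a config file, a vocabulary file, and checkpoint files sharing the `bert_model.ckpt` prefix).

```python
from pathlib import Path

# Files expected in an unzipped pre-trained BERT directory.
# Assumption: the archive was extracted without renaming anything.
EXPECTED_FILES = ("bert_config.json", "vocab.txt")
CHECKPOINT_PREFIX = "bert_model.ckpt"


def missing_bert_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The checkpoint is stored as several files sharing one prefix
    # (for example .index and .meta), so any match counts as present.
    if not any(root.glob(CHECKPOINT_PREFIX + "*")):
        missing.append(CHECKPOINT_PREFIX + ".*")
    return missing
```

Calling `missing_bert_files("path/to/uncased_L-12_H-768_A-12")` returns an empty list when nothing is missing.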

Data

  1. We include a pre-processed copy of PeerRead data for convenience. This data is a collection of arXiv papers submitted to computer science conferences, the accept/reject decisions for these papers, and their abstracts. The raw PeerRead data contains significantly more information. You can get the raw data by following instructions at github.com/allenai/PeerRead. Running the included pre-processing scripts in the PeerRead folder will recreate the included tfrecord file.

  2. The Reddit data can be downloaded at archive.org/details/reddit_posts_2018. This data includes all top-level Reddit comments where the gender of the poster was annotated in some fashion. Each post has meta information (score, date, username, etc.) and includes the text for the first reply. The processed data used in the paper can be recreated by running the pre-processing scripts in the reddit folder.

You can also re-collect the data from Google BigQuery. The SQL command to do this is in reddit/data_cleaning/BigQuery_get_data. Modifying this script will allow you to change collection parameters (e.g., the year, which responses are included).

Reproducing the PeerRead experiments

The default settings for the code match the settings used in the paper. These match the default settings used by BERT, except

  1. we reduce batch size to allow training on a Titan X, and
  2. we adjust the learning rate to account for this.

You'll run the code from src as ./PeerRead/submit_scripts/run_model.sh. Before doing this, you'll need to edit run_classifier.sh to change BERT_BASE_DIR=../../bert/pre-trained/uncased_L-12_H-768_A-12 to BERT_BASE_DIR=[path to BERT_pre-trained]/uncased_L-12_H-768_A-12.
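
If you prefer to script the BERT_BASE_DIR edit rather than make it by hand, something like the following could work. This is a minimal sketch, not part of the repository: `set_bert_base_dir` is a hypothetical helper, and it assumes the script sets the variable on a single line of the form BERT_BASE_DIR=..., optionally preceded by `export`.

```python
import re

MODEL_NAME = "uncased_L-12_H-768_A-12"


def set_bert_base_dir(script_text, bert_parent_dir):
    """Return script_text with BERT_BASE_DIR pointing into bert_parent_dir."""
    new_assignment = "BERT_BASE_DIR={}/{}".format(
        bert_parent_dir.rstrip("/"), MODEL_NAME
    )
    # Match the whole assignment line, keeping any indentation or `export `.
    pattern = re.compile(
        r"^([ \t]*(?:export[ \t]+)?)BERT_BASE_DIR=.*$", flags=re.MULTILINE
    )
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_assignment, script_text
    )
    if count == 0:
        raise ValueError("no BERT_BASE_DIR assignment found")
    return updated
```

To apply it, read the text of run_classifier.sh, pass it through `set_bert_base_dir` together with the directory that holds your pre-trained model, and write the result back.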

The flag --treatment=theorem_referenced controls the experiment. The flag --simulated=real controls whether to use the real effect or one of the semi-synthetic modes.

The effect estimates can be reproduced by running python -m result_processing.compute_ate. This takes in the predictions of the BERT model (in TSV format) and passes them into downstream estimators of the causal effect.
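
To illustrate what a downstream estimator does with model predictions, here is a minimal sketch of two standard average treatment effect (ATE) estimators: a plug-in estimator and an augmented inverse-propensity-weighted (AIPW) estimator. It is not the code in result_processing.compute_ate, and the argument names are assumptions: `q0` and `q1` are predicted outcomes under control and treatment, `g` is the predicted propensity score, `t` is the observed treatment, and `y` is the observed outcome. The clipping threshold `eps` is likewise an illustrative choice.

```python
def plugin_ate(q0, q1):
    """Average difference between predicted treated and control outcomes."""
    return sum(b - a for a, b in zip(q0, q1)) / len(q0)


def aipw_ate(q0, q1, g, t, y, eps=0.03):
    """Augmented inverse-propensity-weighted (doubly robust) ATE estimate."""
    total = 0.0
    for q0_i, q1_i, g_i, t_i, y_i in zip(q0, q1, g, t, y):
        # Clip propensities away from 0 and 1 to keep the weights bounded.
        g_i = min(max(g_i, eps), 1.0 - eps)
        # Residual correction: reweight the prediction error of the arm
        # that was actually observed for this unit.
        correction = (
            t_i * (y_i - q1_i) / g_i
            - (1 - t_i) * (y_i - q0_i) / (1.0 - g_i)
        )
        total += (q1_i - q0_i) + correction
    return total / len(q0)
```

When the outcome predictions match the observed outcomes exactly, the correction term vanishes and the two estimators agree. They differ when the outcome model has residual error that the propensity weights can partly correct.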

To reproduce the baselines, you'll need to produce a TSV file for each simulated dataset you want to test on. To do this, you can run python -m PeerRead.dataset.array_from_dataset from src. The flag --beta1=1.0 controls the strength of the confounding. (The other flags control other simulation parameters not used in the paper.)
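
To give some intuition for how a confounding-strength parameter such as beta1 can enter a semi-synthetic simulation, here is a hypothetical sketch. The functional form, the `beta0` and `offset` parameters, and all function names are illustrative assumptions; the simulation in PeerRead.dataset.array_from_dataset may differ. Here the outcome probability depends on the treatment `t` and, through `beta1`, on the treatment rate within each unit's confounder group `z`.

```python
import math
import random


def group_treatment_rates(t, z):
    """Fraction of treated units within each confounder group."""
    totals, treated = {}, {}
    for t_i, z_i in zip(t, z):
        totals[z_i] = totals.get(z_i, 0) + 1
        treated[z_i] = treated.get(z_i, 0) + t_i
    return {key: treated[key] / totals[key] for key in totals}


def outcome_probabilities(t, z, beta0=0.25, beta1=1.0, offset=0.5):
    """P(y = 1) per unit; beta1 scales the effect of the confounder."""
    rates = group_treatment_rates(t, z)
    return [
        1.0 / (1.0 + math.exp(-(beta0 * t_i + beta1 * (rates[z_i] - offset))))
        for t_i, z_i in zip(t, z)
    ]


def simulate_outcomes(t, z, beta1=1.0, seed=0):
    """Draw binary outcomes from the probabilities above."""
    rng = random.Random(seed)
    probabilities = outcome_probabilities(t, z, beta1=beta1)
    return [int(rng.random() < p) for p in probabilities]
```

With beta1 set to 0, the outcome depends only on the treatment. Larger values make the outcome depend more strongly on the confounder group, which increases the bias of estimators that ignore confounding.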

Misc.

The experiments in the paper use a version of BERT that was further pre-trained on the PeerRead corpus using an unsupervised objective. This can be replicated with ./PeerRead/submit_scripts/run_classifier.sh. This takes about 24 hours on a single Titan Xp. To use a pre-trained BERT, uncomment the INIT_DIR options in run_classifier.sh.

Reproducing the Reddit experiment

  1. First, get the data following the instructions above and save it as dat/reddit/2018.json.
  2. Run data pre-processing with python -m reddit.data_cleaning.process_reddit.
  3. Once the data is processed, instructions for running the experiments are essentially the same as for PeerRead.

Maintainers

Dhanya Sridhar and Victor Veitch

More Repositories

  1. edward (Jupyter Notebook, 4,823 stars): A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
  2. onlineldavb (Python, 297 stars): Online variational Bayes for latent Dirichlet allocation (LDA).
  3. dtm (Shell, 195 stars): This implements topics that change over time (Dynamic Topic Models) and a model of how individual documents predict that change.
  4. lda-c (C, 163 stars): This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data.
  5. hdp (C++, 150 stars): Hierarchical Dirichlet processes. Topic models where the data determine the number of topics. This implements Gibbs sampling.
  6. ctr (C++, 146 stars): Collaborative modeling for recommendation. Implements variational inference for collaborative topic models. These models recommend items to users based on item content and other users' ratings.
  7. online-hdp (Python, 144 stars): Online inference for the Hierarchical Dirichlet Process. Fits hierarchical Dirichlet process topic models to massive data. The algorithm determines the number of topics.
  8. deconfounder_tutorial (Jupyter Notebook, 86 stars)
  9. hlda (JavaScript, 77 stars): This implements hierarchical latent Dirichlet allocation, a topic model that finds a hierarchy of topics. The structure of the hierarchy is determined by the data.
  10. publications (TeX, 73 stars): The pdf and LaTeX for each paper (and sometimes the code and data used to generate the figures).
  11. class-slda (C++, 63 stars): Implements supervised topic models with a categorical response.
  12. variational-smc (Python, 61 stars): Reference implementation of variational sequential Monte Carlo proposed by Naesseth et al. "Variational Sequential Monte Carlo" (2018)
  13. deep-exponential-families (C++, 56 stars): Deep exponential families (DEFs)
  14. DynamicPoissonFactorization (C++, 49 stars): Dynamic version of Poisson Factorization (dPF). dPF captures the changing interest of users and the evolution of items over time according to user-item ratings.
  15. turbotopics (Python, 46 stars): Turbo topics find significant multiword phrases in topics.
  16. ars-reparameterization (Jupyter Notebook, 38 stars): Source code for Naesseth et al. "Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms" (2017)
  17. zero-inflated-embedding (Python, 28 stars): Code for the ICML paper "zero inflated exponential family embedding"
  18. context-selection-embedding (Python, 27 stars): Context Selection for Embedding Models
  19. ctm-c (C, 21 stars): This implements variational inference for the correlated topic model.
  20. deconfounder_public (Jupyter Notebook, 18 stars)
  21. factorial-network-models (Jupyter Notebook, 9 stars): Discussion of Durante et al for JSM 2017. Includes factorial network model generalization.
  22. treeffuser (Jupyter Notebook, 9 stars): Treeffuser is an easy-to-use package for probabilistic prediction on tabular data with tree-based diffusion models.
  23. markovian-score-climbing (Python, 8 stars)
  24. diln (C, 6 stars): This implements the discrete infinite logistic normal, a Bayesian nonparametric topic model that finds correlated topics.
  25. poisson-influence-factorization (Jupyter Notebook, 4 stars)
  26. Riken_tutorial (Jupyter Notebook, 4 stars)