Plotly dataset-visualization pairs, feature extraction scripts, and model training code for VizML (CHI 2019)

VizML: Training Data, Feature Extraction, and Model Training

This repository provides access to the Plotly dataset-visualization pairs, feature extraction scripts, and model training scripts used in the VizML paper.

Data Description

We provide subsets of the Plotly corpus with 10K and 100K pairs, the full corpus with 1,066,443 pairs (205G), and features extracted from an aggressively deduplicated set of 119,815 pairs (19G). More information about the corpus schema, the extracted features, and the design choices is provided in the paper.

Dependencies

This repository uses Python 3.7.3 and depends on the packages listed in requirements.txt. Create the virtual environment with virtualenv -p python3 venv, enter it with source venv/bin/activate, and install the dependencies with pip install -r requirements.txt.

How do I use this repository?

Accessing Data

To download and unzip the Plotly dataset-visualization pairs or features, run ./retrieve_data.sh. Comment out lines in the script to control which subsets or features are downloaded. Then create a symlink so the scripts can access the data, choosing one of the suffixes listed in braces: ln -s data/plotly_plots_with_full_data_with_all_fields_and_header_{ 1k, 100k, full }.tsv data/plot_data.tsv.
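The symlink step can also be scripted. The following is a minimal sketch, not part of the repository: the function name link_plot_data is hypothetical, and the file-name prefix and suffixes are copied from the command above. It assumes the chosen file has already been downloaded into the data directory.

```python
from pathlib import Path

# File-name prefix as given in the README; suffix is one of the listed subsets.
PREFIX = "plotly_plots_with_full_data_with_all_fields_and_header_"


def link_plot_data(data_dir, subset):
    """Point data_dir/plot_data.tsv at the chosen subset file (hypothetical helper)."""
    data_dir = Path(data_dir)
    target = data_dir / f"{PREFIX}{subset}.tsv"
    if not target.exists():
        raise FileNotFoundError(f"{target} not found; run ./retrieve_data.sh first")
    link = data_dir / "plot_data.tsv"
    # Remove a stale link so switching subsets is a single call.
    if link.is_symlink() or link.exists():
        link.unlink()
    # Store the target relative to the link's own directory.
    link.symlink_to(target.name)
    return link
```

Calling link_plot_data("data", "100k") from the repository root would make data/plot_data.tsv resolve to the 100k subset. Because the target is stored relative to the data directory, the link keeps working if the repository is moved.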

Preparing Data

Within the data_cleaning directory:

  • To remove charts without all data: python remove_charts_without_all_data.py
  • To remove duplicate charts: python remove_duplicate_charts.py
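As a rough illustration of what removing duplicate charts involves, here is a minimal sketch that keeps the first occurrence of each distinct record. It is not the logic of remove_duplicate_charts.py, whose criteria are described in the paper; the function names and the dictionary representation of a chart are assumptions made for the example.

```python
import hashlib
import json


def chart_fingerprint(chart):
    """Hash a chart record so that key order does not affect the result."""
    canonical = json.dumps(chart, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicate_charts(charts):
    """Keep the first occurrence of each distinct chart, preserving input order."""
    seen = set()
    unique = []
    for chart in charts:
        fingerprint = chart_fingerprint(chart)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(chart)
    return unique


# Toy records standing in for corpus rows (hypothetical schema).
toy_charts = [
    {"type": "bar", "x": [1, 2]},
    {"x": [1, 2], "type": "bar"},  # same content, different key order
    {"type": "line", "x": [1, 2]},
]
print(len(drop_duplicate_charts(toy_charts)))
```

Exact-match hashing is the simplest possible criterion; a more aggressive deduplication would compare only selected fields or extracted features.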

Extracting and Characterizing Features

Within the feature_extraction directory, run python extract.py. Then use notebooks/Descriptive Statistics.ipynb to characterize features (e.g. the distribution of the number of columns per dataset).
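To make the example statistic concrete, the sketch below computes a distribution of column counts over a few toy datasets. It is an illustration only: the representation of a dataset as a dictionary of column name to values, and the function name, are assumptions and do not reflect the corpus schema or the notebook's code.

```python
from collections import Counter


def column_count_distribution(datasets):
    """Map number of columns -> how many datasets have that many columns."""
    return Counter(len(columns) for columns in datasets)


# Toy datasets standing in for the corpus (hypothetical representation).
toy_datasets = [
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    {"x": [1, 2], "y": [3, 4], "z": [5, 6]},
    {"t": [0, 1], "v": [2, 3]},
]
distribution = column_count_distribution(toy_datasets)
for n_columns, n_datasets in sorted(distribution.items()):
    print(f"{n_columns} columns: {n_datasets} dataset(s)")
```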

Baseline Model Training

Use notebooks/Plotly Performance.ipynb to train the random forest, K-nearest neighbors, naive Bayes, and logistic regression baseline models. Use notebooks/Model Feature Importances.ipynb to extract feature importances from the random forest baseline model.
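For readers unfamiliar with the baselines, the following from-scratch sketch shows the idea behind one of them, K-nearest neighbors: a query is labeled by majority vote among its closest training points. It is not the notebook's implementation, and the two toy features and the chart-type labels are assumptions chosen for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_features, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy training set: (number of columns, number of rows) -> chart type (hypothetical).
features = [(2, 10), (2, 12), (3, 200), (3, 220), (2, 11)]
labels = ["bar", "bar", "scatter", "scatter", "bar"]
print(knn_predict(features, labels, (3, 210)))
```

In practice the features would be scaled before computing distances, since raw row counts would otherwise dominate the column counts.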

Neural Network Training

Within the neural_network directory, run python agg.py [LOAD|TRAIN|EVAL] to load features, train models, then evaluate a particular model.

Benchmarking

Use notebooks/Benchmarking.ipynb to evaluate serialized models against the crowdsourced consensus ground truth.
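The comparison against a crowdsourced consensus can be pictured with the minimal sketch below. It assumes, for illustration only, that the consensus for each dataset is the majority vote of crowd workers and that a model is scored by how often it agrees with that vote; the function names and toy data are hypothetical, and the notebook's actual procedure may differ.

```python
from collections import Counter


def consensus_label(votes):
    """Return the most frequent label among crowd votes (first-seen label wins ties)."""
    return Counter(votes).most_common(1)[0][0]


def consensus_accuracy(predictions, crowd_votes):
    """Fraction of datasets where the prediction matches the crowd consensus."""
    matches = sum(
        predictions[dataset_id] == consensus_label(votes)
        for dataset_id, votes in crowd_votes.items()
    )
    return matches / len(crowd_votes)


# Toy votes and predictions keyed by a hypothetical dataset id.
crowd_votes = {
    "d1": ["bar", "bar", "line"],
    "d2": ["scatter", "line", "scatter"],
    "d3": ["line", "line", "bar"],
}
predictions = {"d1": "bar", "d2": "line", "d3": "line"}
print(consensus_accuracy(predictions, crowd_votes))
```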

What's in this repository

retrieve_data.sh: Shell script to download and unzip dataset-visualization pairs and features from Amazon S3 storage
requirements.txt: Python dependencies
data/: Placeholder directory for raw data
features/: Placeholder directory for extracted features
results/: Placeholder directory for intermediate results and figures
models/: Placeholder directory for trained models
feature_extraction/
    └───features/
        └───aggregate_single_field_features.py: Functions to aggregate single-column features
        └───aggregation_helpers.py: Helper functions used in aggregate_single_field_features.py
        └───dateparser.py: Functions to detect and mark dates
        └───helpers.py: Helper functions used in all feature extraction scripts
        └───single_field_features.py: Functions to extract single-column features
        └───transform.py: Functions to transform single-column features
        └───type_detection.py: Functions used to detect data types
    └───outcomes/
        └───chart_outcomes.py: Functions to extract design choices of visualizations
        └───field_encoding_outcomes.py: Functions to extract design choices of encodings
    └───extract.py: Top-level entry point to extract features and outcomes
    └───general_helpers.py: Helpers used in top-level extraction function
helpers/
    └───analysis.py: Helper functions for training baseline models
    └───processing.py: Helper functions for processing data
    └───util.py: Misc helper functions
neural_network/
    └───agg.py: Top-level entry point to load features and train neural network
    └───evaluate.py: Functions to evaluate trained neural network
    └───nets.py: Class definitions for neural network
    └───paper_ground_truth.py: Script to evaluate best network against benchmarking ground truth
    └───paper_tasks.py: Script to evaluate best network for Plotly test set
    └───save_field.py: Script to prepare training, validation, and testing splits
    └───train.py: Helper functions for model training
    └───train_field.py: Script to train network
    └───util.py: Helper functions
notebooks/
    └───Descriptive Statistics.ipynb: Notebook to generate visualizations of number of charts per user, number of rows per dataset, and number of columns per dataset
    └───Plotly Performance.ipynb: Notebook to train baseline models and assess performance on a hold-out set from the Plotly corpus
    └───Model Feature Importances.ipynb: Notebook to extract feature importances from trained models
    └───Benchmarking.ipynb: Notebook to generate predictions of trained models on benchmarking datasets, bootstrap crowdsourced consensus, and compare predictions
preprocessing/: Scripts to preprocess features before ML modeling
    └───deduplication.py: Helper functions to deduplicate charts
    └───impute.py: Helper function to impute missing values
    └───preprocess.py: Helper functions to prepare features for learning
docs/: Landing page and miscellaneous material for documentation
