  • Stars: 143
  • Rank: 255,525 (Top 6 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created about 5 years ago
  • Updated about 2 months ago

Repository Details

This repository provides data and scripts to use Sherlock, a DL-based model for semantic data type detection: https://sherlock.media.mit.edu.

Sherlock: code, data, and trained model.

Sherlock is a deep-learning approach to semantic data type detection, i.e. labeling table columns with semantic types such as name, address, etc. This is helpful for, among other things, data validation, processing and integration. This repository provides data and code to guide usage of Sherlock, retraining of the model, and replication of results. Visit https://sherlock.media.mit.edu for more background on this project.

Installation of package

  1. Install Sherlock by cloning this repository and running pip install . inside it.
  2. Install the dependencies using pip install -r requirements.txt (or requirements38.txt, depending on your Python version).

Demonstration of usage

The 00-use-sherlock-out-of-the-box.ipynb notebook demonstrates usage of the pre-trained model for a given table.

The notebooks 01-data-preprocessing.ipynb and 02-1-train-and-test-sherlock.ipynb in notebooks/ can be used to reproduce the results, and demonstrate the usage of Sherlock from data preprocessing to model training and evaluation.

Data

The raw data (corresponding to annotated table columns) can be downloaded using the download_data() function in the helpers module. This downloads roughly 500 MB of data into the data directory. Use the 01-data-preprocessing.ipynb notebook to preprocess this data. Each column is then represented by a feature vector of dimensions 1x1588. The extracted features per column are based on "paragraph" embeddings (of the full column), word embeddings (aggregated over the cells of a column), character count statistics (e.g. the average number of "." in a column's cells) and column-level statistics (e.g. column entropy).
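
To make two of these feature families concrete, the sketch below computes a character count statistic and the column entropy for a toy column in plain Python. It is an illustration only and not Sherlock's feature extraction code; the function names char_count_mean and column_entropy are hypothetical, and treating every cell as a string is an assumption.

```python
import math
from collections import Counter


def char_count_mean(cells, char):
    """Average number of occurrences of `char` per cell in a column."""
    if not cells:
        return 0.0
    return sum(str(cell).count(char) for cell in cells) / len(cells)


def column_entropy(cells):
    """Shannon entropy (in bits) of the distribution of cell values."""
    if not cells:
        return 0.0
    counts = Counter(str(cell) for cell in cells)
    total = len(cells)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy column with four cells; "f" appears twice.
column = ["a.b", "c.d.e", "f", "f"]
print(char_count_mean(column, "."))  # 0.75
print(column_entropy(column))        # 1.5
```

A column with many repeated values has low entropy, while a column of unique values has high entropy, which is one signal that can help separate, say, a category column from an identifier column.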

The Sherlock model

The SherlockModel class is specified in the sherlock.deploy.model module. The model is a multi-input neural network with a separate sub-network for each feature set (e.g. the word embedding features); the outputs of these sub-networks are concatenated and passed through a few shared layers. Interaction with the model follows the scikit-learn interface, with methods fit, predict and predict_proba.
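
The multi-input structure can be sketched as a forward pass in plain Python: one small branch per feature set, a concatenation, and shared layers ending in class probabilities. This is a toy under stated assumptions, not the SherlockModel implementation: the feature-set names, layer sizes, random weights, and the helpers dense, softmax, random_layer and forward are all hypothetical.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer with ReLU: one output per weight row."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    """Random weights and zero biases, standing in for trained parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(feature_sets, branches, shared, output):
    """Run each feature set through its own branch, concatenate, then share."""
    merged = []
    for name, features in feature_sets.items():
        merged.extend(dense(features, *branches[name]))
    hidden = dense(merged, *shared)
    w_out, b_out = output
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    return softmax(scores)


rng = random.Random(0)
# Toy inputs; the feature-set names and sizes are assumptions for illustration.
feature_sets = {"char": [0.2, 0.5, 0.1], "word": [0.7, 0.3], "stats": [1.0]}
branches = {name: random_layer(rng, len(x), 4) for name, x in feature_sets.items()}
shared = random_layer(rng, 4 * len(feature_sets), 8)
output = random_layer(rng, 8, 3)  # 3 toy classes
probs = forward(feature_sets, branches, shared, output)
print(probs)
```

Giving each feature set its own branch lets differently sized inputs be compressed separately before the shared layers combine them.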

Making predictions

The originally trained SherlockModel can be used for generating predictions for a dataset. First, extract features using the features.preprocessing module. The original weights of Sherlock are provided in the repository in the model_files directory and can be loaded using the initialize_model_from_json method of the model. The procedure for making predictions (on the data) is demonstrated in the 02-1-train-and-test-sherlock.ipynb notebook.

Retraining Sherlock

The notebook 02-1-train-and-test-sherlock.ipynb also illustrates how Sherlock can be retrained. The model infers the number of unique classes from the training labels, unless you load a model from a JSON file, in which case the number of classes is 78.
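
Inferring the number of classes from the training labels can be pictured as follows. This is a minimal sketch in plain Python, not the code Sherlock uses internally; the function encode_labels and the toy labels are hypothetical.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index, in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# Toy training labels: one semantic type per column.
y_train = ["name", "address", "name", "city", "address"]
classes, encoded = encode_labels(y_train)
n_classes = len(classes)  # could be used to size a classifier's output layer

print(n_classes)  # 3
print(encoded)    # [2, 0, 2, 1, 0]
```

With a model loaded from a JSON specification, the output size is fixed by that file instead, which is why the number of classes is 78 in that case.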

Citing this work

To cite this work, please use the BibTeX entry below:

@inproceedings{Hulsebos:2019:SDL:3292500.3330993,
 author = {Hulsebos, Madelon and Hu, Kevin and Bakker, Michiel and Zgraggen, Emanuel and Satyanarayan, Arvind and Kraska, Tim and Demiralp, {\c{C}}a{\u{g}}atay and Hidalgo, C{\'e}sar},
 title = {Sherlock: A Deep Learning Approach to Semantic Data Type Detection},
 booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
 year={2019},
 publisher = {ACM},
}

Project structure

├── data   <- Placeholder directory to download data into.

├── docs   <- Files for https://sherlock.media.mit.edu landing page.

├── model_files  <- Files with trained model weights and specification.
    ├── sherlock_model.json
    └── sherlock_weights.h5

├── notebooks   <- Notebooks demonstrating data preprocessing and train/test of Sherlock.
    ├── 00-use-sherlock-out-of-the-box.ipynb
    ├── 01-data-preprocessing.ipynb
    ├── 02-1-train-and-test-sherlock.ipynb
    ├── 02-2-train-and-test-sherlock-rf-ensemble.ipynb
    └── 03-train-paragraph-vector-features-optional.ipynb

├── sherlock  <- Package.
    ├── deploy  <- Code for (re)training Sherlock, as well as model specification.
        ├── helpers.py
        └── model.py
    ├── features     <- Files to turn raw data, storing raw data columns, into features.
        ├── feature_column_identifiers   <- Directory with feature names categorized by feature set.
        ├── bag_of_characters.py
        ├── bag_of_words.py
        ├── par_vec_trained_400.pkl
        ├── paragraph_vectors.py
        ├── preprocessing.py
        └── word_embeddings.py
    └── helpers.py     <- Supportive modules.

More Repositories

1

AI-generated-characters

AI-generated-character
Jupyter Notebook
445
star
2

Junkyard-Jumbotron

The Junkyard Jumbotron is a web tool that makes it really easy to combine a bunch of random displays into a single, large virtual display. It works with laptops, tablets, smartphones -- anything that can run a web browser. And the magic is that all you need to do to configure one is take a photograph of all the screens.
C++
199
star
3

medrec

medical records on the blockchain https://medrec.media.mit.edu/
JavaScript
156
star
4

unhangout-old

RETIRED
JavaScript
156
star
5

PersonalizedMultitaskLearning

Code for performing 3 multitask machine learning methods: deep neural networks, Multitask Multi-kernel Learning (MTMKL), and a hierarchical Bayesian model (HBLR).
Python
124
star
6

gobo

💭 Gobo: Your social media. Your rules.
JavaScript
108
star
7

vizml

Plotly dataset-visualization pairs, feature extraction scripts, and model training code for VizML (CHI 2019)
Python
100
star
8

para

JavaScript
100
star
9

django-channels-presence

"Rooms" and "Presence" for django-channels
Python
78
star
10

viznet

VizNet is a repository providing real-world datasets that enable, among other things, (re)running empirical studies with higher ecological validity
Jupyter Notebook
74
star
11

MDAgents

Python
27
star
12

Health-LLM

Python
22
star
13

prg-raise-playground

Boilerplate for playing with and deploying Scratch 3.0 modifications!
JavaScript
18
star
14

MediaCloud-Dashboard

Front-end for the MediaCloud database
JavaScript
16
star
15

storybook-photoshop-jsx

JavaScript
16
star
16

ajl.ai

A web application for crowdsourcing image annotations.
JavaScript
16
star
17

ml-certs

Media Lab Digital Certificates
HTML
15
star
18

MITLegalForum

Transforming Law and Legal Processes for the Digital Age
15
star
19

AffectiveComputingQuantifyMeAndroid

The QuantifyMe platform helps researchers conduct single-case experiments in an automated and scientific manner.
Java
15
star
20

ai-generated-media

Jupyter Notebook
14
star
21

unhangout

Python
14
star
22

2019-MIT-Computational-Law-Course

MIT IAP 2019 Computational Law Course
Go
14
star
23

HERMITS_UIST20

Python
13
star
24

nmi-ai-2023

A repository for the paper "Beliefs about AI influence human-AI interaction and can be manipulated to increase perceived trustworthiness, empathy, and effectiveness" Nature Machine Intelligence 2023.
Jupyter Notebook
12
star
25

empathic-stories

HTML
11
star
26

Evolutron

A mini-framework to build and train neural networks for molecular biology.
Jupyter Notebook
11
star
27

OpenCyberDance

Open source Cybernetic Dance System
TypeScript
10
star
28

kukaslxctrl

A small library intended for controlling KUKA robots using KRC4 over KUKA RSI (Robot Sensor Interface) from Simulink.
C
10
star
29

Terra-Incognita

Your personal media geography. Catherine's thesis project.
JavaScript
9
star
30

CityMatrixAI

CityMatrix is an urban decision support system augmented with artificial intelligence. This repo is the UI for the AI assistant of the project.
C#
9
star
31

word-tree

A Unity app designed to help children learn English letter-sound correspondence, sound blending, and sight word recognition.
C#
8
star
32

2018-MIT-IAP-ComputationalLaw

MIT IAP Computational Law Course
8
star
33

bert-slu

Python
8
star
34

Wearable-Sanitizer

Wearable Sanitizer
C++
8
star
35

DeepABM-Pandemic

Python
7
star
36

promise-tracker-builder

Web app for developing and tracking civic monitoring campaigns
JavaScript
7
star
37

MAS.S60.Fall2020

Experiments in Deepfakes : Creativity, Computation, and Criticism
Jupyter Notebook
7
star
38

Generative-Autonomous-Legal-Entities

GALE - Exploring the Potential and the Perils of Autonomous Legal Entities Powered by Generative AI
7
star
39

FutureLaw

Future Law at the MIT Media Lab
6
star
40

Vida_Modeling

User Interface and Simulation Platform for a System Dynamics Model
Python
6
star
41

Realtime-Community-Sign

Software to run LED signs to show community information like bus arrival times and event calendars
Python
6
star
42

TI_EVM_logger

Sensor data log (+stream to websocket) for evaluation modules by Texas Instrument (tested with FDC2214 and LDC1614)
Python
6
star
43

NewsPix

NewsPix is a suite of apps by Matt Carroll, Catherine D'Ignazio and Jay Vachon that drive engagement in local news through pictures and visualizations. Our first app is a browser extension for Chrome and Firefox that delivers breaking news to the new tab window of a desktop user's browser.
CSS
6
star
44

MappingPoliceViolence-Scaper

Scripts that pull together data for our investigation into police violence against un-armed people of color in the US.
HTML
6
star
45

2021-MIT-IAP-Computational-Law-Course

5
star
46

livingmemory

JavaScript
5
star
47

TrustCoreID

For Human Dynamics open collaboration on CoreID project
JavaScript
5
star
48

ml-certs-website-archive

[ARCHIVE] Webpage for the Digital Certificates Project
HTML
5
star
49

Project-Captivate

Glasses project for crowds
C
5
star
50

tidstream

Tools for generating and manipulating live Vorbis and Opus streams
C
5
star
51

thefestival.media.mit.edu

Official website for the Festival of Learning at the MIT Media Lab
HTML
4
star
52

tidzam

Python
4
star
53

doodlebot

DoodleBot guide and resources
OpenSCAD
4
star
54

MIT-CLR

Public Facing GitHub Repo of MIT Computational Law Report
4
star
55

SAR-opal-base

A generalized Unity game builder designed for use in child-robot interactions.
C#
4
star
56

omniFORM

C++
4
star
57

storyspace

A simple storytelling game built in Unity3D / Mono, designed for use with a storytelling robot.
C#
4
star
58

Society-of-Neurons

Jupyter Notebook
4
star
59

SLIC

Sovereign Legal Identity Challenge
4
star
60

OpenMediaLegalHack

#hack4music at MIT Media Lab
4
star
61

MediaCloud-Tag-Explorer

Website you can use to explore MediaCloud tag sets
Python
4
star
62

spiral

Archimedean spiral generator for embroidered speaker coils
Python
4
star
63

AffectiveComputingQuantifyMeDjango

The QuantifyMe platform helps researchers conduct single-case experiments in an automated and scientific manner.
Python
4
star
64

Nightlights-Mobility

A project seeking to link remote observation nightlights data with telecoms-based mobility data
Python
4
star
65

Community-Sign-Server

Server software to manage a network of LED signs showing community information like bus arrival times and event calendars
PHP
3
star
66

GDPR-Hack-Day

GDPR Sunrise Eve Hack Day
3
star
67

Computational-Law-IAP-Workshop-2020

3
star
68

jitsi-meet-server

Experimental Vagrant/Salt configuration for automatically deploying a Jitsi Meet video server
Shell
3
star
69

pugg

A demon of the second kind, designed to overthrow Pugg, the information pirate
Python
3
star
70

asr_google_cloud

subscribes to microphone feed and publishes ASR result over ROS
Python
3
star
71

DistributedIdentity

Collaborative Project of Sarah Schwettmann and Dazza Greenwood
3
star
72

rr_tools

Tools for analysis and processing for the relational robot project.
Python
3
star
73

MediaMeter-Coder

Code to compare historical coverage of US and World issues in US newspapers.
Ruby
3
star
74

Hack4Climate

Hack4Climate at the MIT Media Lab
3
star
75

LegalHackers

LegalHackers.org related research and development activities at law.MIT.edu, the Media Lab and MIT
3
star
76

unhangout-video-server

RETIRED
SaltStack
3
star
77

dcpctrl_v1

Code developed 2015-2016 to control the second iteration of the Digital Construction Platform.
MATLAB
3
star
78

promise-tracker-mobile

Mobile data collection client for civic monitoring campaigns
JavaScript
3
star
79

nytcorpus-ruby

A ruby parser for the New York Times Corpus
Ruby
3
star
80

opera-timeline

Interactive Timeline of Projects by the Opera of the Future
JavaScript
3
star
81

fbserver

FB Server
Ruby
3
star
82

physioHMD

The PhysioHMD platform introduces a software and hardware modular interface built for collecting affect and physiological data from users wearing a head-mounted display. The platform enables researchers and developers to aggregate and interpret signals in real-time and use them to develop novel, personalized interactions, as well as evaluate virtual experiences. Our design offers seamless integration with standard HMDs, requiring minimal setup effort for developers and those with less experience using game engines. The PhysioHMD platform is a flexible architecture that offers an interface that is not only easy to extend but also complemented by a suite of tools for testing and analysis. We hope that PhysioHMD can become a universal, publicly available testbed for VR and AR researchers.
Python
3
star
83

AttentionMapDemo

A geographical heatmap of media attention across the globe, from a variety of sources
JavaScript
2
star
84

dhm

Digital Humanitarian Marketplace
PHP
2
star
85

yourAd

ad design and replacement tool to reclaim your browser ads
JavaScript
2
star
86

tega_teleop

A python rosnode for teleoperating the Tega robot
Python
2
star
87

speech-tapgame-aamas18

This repository contains the source code and associated executables for running the tap game described in "A Social Robot System for Modeling Children's Pronunciation"
Python
2
star
88

eegreconstruction

Jupyter Notebook
2
star
89

Global-Coverage-Study

A small study designed to compare geographic coverage between various types of online news sources
HTML
2
star
90

PopBlocks

JavaScript
2
star
91

subreddit-scripts

A repository for commonly used reddit scripts
Python
2
star
92

prg-s02-system-setup

Python
2
star
93

DigitalIdentitySessions

July 24 2017 at the MIT Media Lab
HTML
2
star
94

genderinmemoriam

Gender in Memoriam
JavaScript
2
star
95

fluid_statistics

Python Statistics Pipeline
Jupyter Notebook
2
star
96

text_analyses_tools

Matching phrases between source and query text files
Python
2
star
97

HCU400

An Annotated Dataset for Exploring Aural Phenomenology through Causal Uncertainty
2
star
98

omniFORM_2021

C++
2
star
99

gee_custom_utilities

A collection of python utility functions for working with Google Earth Engine
Python
2
star
100

promise-tracker-aggregator

Datastore and API for civic monitoring surveys and responses
Ruby
2
star