• Stars
    star
    182
  • Rank 209,858 (Top 5 %)
  • Language
    Jupyter Notebook
  • License
    Other
  • Created almost 8 years ago
  • Updated 23 days ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Triage

Data Science Toolkit for Social Good and Public Policy Problems

image image image

Building data science systems requires answering many design questions, turning them into modeling choices, which in turn run machine learning models. Questions such as cohort selection, unit of analysis determination, outcome determination, feature (explanantory variables) generation, model/classifier training, evaluation, selection, and list generation are often complicated and hard to choose apriori. In addition, once these choices are made, they have to be combined in different ways throughout the course of a project.

Triage is designed to:

  • Guide users (data scientists, analysts, researchers) through these design choices by highlighting critical operational use questions.
  • Provide an integrated interface to components that are needed throughout a data science project workflow.

Quick Links

  • Tutorial on Google Colab - Are you completely new to Triage? Run through a quick tutorial hosted on google colab (no setup necessary) to see what triage can do!
  • Dirty Duck Tutorial - Want a more in-depth walk through of triage's functionality and concepts? Go through the dirty duck tutorial here with sample data
  • QuickStart Guide - Try Triage out with your own project and data
  • Triage Documentation Site - Used Triage before and want more reference documentation?
  • Development - Contribute to Triage development.

Installation

To install Triage, you need:

  • Python 3.8+
  • A PostgreSQL 9.6+ database with your source data (events, geographical data, etc) loaded.
    • NOTE: If your database is PostgreSQL 11+ you will get some speed improvements. We recommend to update to a recent version of PostgreSQL.
  • Ample space on an available disk, (or for example in Amazon Web Services's S3), to store the needed matrices and models for your experiments

We recommend starting with a new python virtual environment and pip installing triage there.

$ virtualenv triage-env
$ . triage-env/bin/activate
(triage-env) $ pip install triage

If you get an error related to pg_config executable, run the following command (make sure you have sudo access):

(triage-env) $ sudo apt-get install libpq-dev python3.9-dev

Then rerun pip install triage

(triage-env) $ pip install triage

To test if triage was installed correctly, type:

(triage-env) $ triage -h

Data

Triage needs data in a postgres database and a configuration file that has credentials for the database. The Triage CLI defaults database connection information to a file stored in 'database.yaml' (example in example/database.yaml).

If you don't want to install Postgres yourself, try triage db up to create a vanilla Postgres 12 database using docker. For more details on this command, check out Triage Database Provisioner

Configure Triage for your project

Triage is configured with a config.yaml file that has parameters defined for each component. You can see some sample configuration with explanations to see what configuration looks like.

Using Triage

  1. Via CLI:
triage experiment example/config/experiment.yaml
  1. Import as a python package:
from triage.experiments import SingleThreadedExperiment

experiment = SingleThreadedExperiment(
    config=experiment_config, # a dictionary
    db_engine=create_engine(...), # http://docs.sqlalchemy.org/en/latest/core/engines.html
    project_path='/path/to/directory/to/save/data' # could be an S3 path too: 's3://mybucket/myprefix/'
)
experiment.run()

There are a plethora of options available for experiment running, affecting things like parallelization, storage, and more. These options are detailed in the Running an Experiment page.

Development

Triag was initially developed at University of Chicago's Center For Data Science and Public Policy and is now being maintained at Carnegie Mellon University.

To build this package (without installation), its dependencies may alternatively be installed from the terminal using pip:

pip install -r requirement/main.txt

Testing

To add test (and development) dependencies, use test.txt:

pip install -r requirement/test.txt [-r requirement/dev.txt]

Then, to run tests:

pytest

Development Environment

To quickly bootstrap a development environment, having cloned the repository, invoke the executable develop script from your system shell:

./develop

A "wizard" will suggest set-up steps and optionally execute these, for example:

(install) begin

(pyenv) installed

(python-3.9.10) installed

(virtualenv) installed

(activation) installed

(libs) install?
1) yes, install {pip install -r requirement/main.txt -r requirement/test.txt -r requirement/dev.txt}
2) no, ignore
#? 1

Contributing

If you'd like to contribute to Triage development, see the CONTRIBUTING.md document.

More Repositories

1

hitchhikers-guide

The Hitchhiker's Guide to Data Science for Social Good
Jupyter Notebook
986
star
2

aequitas

Bias Auditing & Fair ML Toolkit
Python
669
star
3

bikeshare

Statistical models and webapp for predicting when bikeshare stations will be empty or full.
JavaScript
115
star
4

MLforPublicPolicy

Class resources for CAPP 30254 (Machine Learning for Public Policy)
Jupyter Notebook
104
star
5

mlforpublicpolicylab

Repo for ML for Public Policy Lab course at CMU
Jupyter Notebook
101
star
6

data-science-101

Methods, tools, tips, and tricks for anyone interested in getting started doing data science for the social good.
Jupyter Notebook
95
star
7

energywise

An energy analytics tool to make commercial building more energy efficient
Python
77
star
8

fairness_tutorial

Hands-on tutorial on ML Fairness
Jupyter Notebook
69
star
9

student-early-warning

Using machine learning to predict high school dropouts
R
67
star
10

MLinPractice

Repository for ML in Practice Course at CMU (10-718)
Jupyter Notebook
55
star
11

tweedr

A machine learning API to analyze tweets during disasters.
JavaScript
54
star
12

police-eis

DSaPP police early intervention system: using machine learning to predict adverse incidents
Python
51
star
13

wikienergy

Git repo for Wiki Energy project
Jupyter Notebook
46
star
14

411-on-311

Exploratory analysis and predictive models of how Chicago's neighborhoods interact with the City's 311 service requests.
Python
44
star
15

pgdedupe

A simple command line interface to the datamade/dedupe library.
Jupyter Notebook
42
star
16

policy_diffusion

Tracing policy ideas from think tanks and lobbyists through state legislative bills
Python
39
star
17

givinggraph

An API tool to help understand the relationships between non-profits, for-profits, and the causes they support.
Python
28
star
18

ushine-learning

An API that uses machine learning to help the Ushahidi nonprofit do smarter crisis crowdsourcing.
Python
24
star
19

repo-scraper

Search for potential passwords/data leaks in a folder or git repo
Python
23
star
20

cta-sim

Big data simulation of Chicago's public transportation to improve transit planning and reduce bus crowding
CSS
23
star
21

census-communities-usa

Mapping and analyzing local business data from the Census Bureau.
Python
21
star
22

syracuse_public

Python
21
star
23

jakarta_smart_city_traffic_safety_public

Identifying traffic-safety issues in CCTV footage
Jupyter Notebook
20
star
24

data-challenges

A repository of real-world data challenges faced by organizations used for project-based learning
19
star
25

usal_echo_public

Automate process for view classification of the Apical 4 chamber, Apical 2 chamber and Parasternal long axis. Segmentation of the Apical 4 chamber and Apical 2 chamber. Calculate measurements of the Ejection Fraction of the heart to classify it as normal, abnoral or grayzone.
Python
19
star
26

match.edu

Predictive models to identify high-achieving high school students who are likely to undermatch - attend 2-year rather than 4-year colleges, or not go to college at all.
R
18
star
27

dsapp-reading-group

Proceedings of the Center for Data Scientists arguing (about) Public Papers
17
star
28

diogenes

Searching for an honest classifier
Python
17
star
29

ohio

Python I/O extras
Python
17
star
30

cta-otp

OpenTripPlanner tool and transit mobility maps for Chicago
Java
16
star
31

dirtyduck

A Guided Tour of Triage
CSS
16
star
32

eights

Data Science template with focus on prewritten workflows
Python
14
star
33

Random_Forest_Imputer

Automatic missing value imputation using random forests
Python
14
star
34

growth-curves

Statistical models of children's growth curves that predict which kids are at risk of obesity.
Python
14
star
35

argcmdr

Thin argparse wrapper for quick, clear and easy declaration of hierarchical console command interfaces
Python
13
star
36

learning

What fellows are learning about data problems and tools
Python
13
star
37

data-portal-treemap

Chicago Data Portal (data.cityofchicago.org) tree map
Python
13
star
38

DSaPP_RA_Project

This repository includes an exercise for aspiring DSaPP volunteers and research assistants to complete
12
star
39

dssg-training-workshop-2015

Main site for DSSG Training 2015
HTML
11
star
40

acs2pgsql

Download American Community Survey data and put it into a Postgres database
Shell
11
star
41

air_pollution_estimation

Jupyter Notebook
10
star
42

UPSG

A set of tools and conventions to help data scientists share code
Python
10
star
43

streetlights-crime

Statistical models to find whether Chicago street light outages are associated with increased crime
R
10
star
44

predicting_student_enrollment_public

Statistical models and analysis of student enrollment in Chicago Public Schools
R
9
star
45

dssg-manual

This repository contains the Eric & Wendy Schmidt Data Science for Social Good Fellowship Manual
9
star
46

cincinnati

DSaPP project with the City of Cincinnati. Building upon the DSSG15 project
Python
8
star
47

memphis-public

Public repository for the DSSG Memphis project
R
8
star
48

peeps-chili

Ethics with a side of chili. And peeps.
Jupyter Notebook
8
star
49

land-bank

Analytics tool to help the Cook County Land Bank acquire vacant and abandoned properties strategically.
JavaScript
8
star
50

johnson-county-ddj-public

Python
7
star
51

matching-tool

Integrating HMIS and criminal-justice data
Python
7
star
52

dssg2017-text_analysis

Text Analysis Tutorial for DSSG 2017 Conference
Jupyter Notebook
7
star
53

randomize_your_data

Randomize the order of each column to help check for leakage
Jupyter Notebook
7
star
54

timechop

generate time splits for temporal validation
Python
5
star
55

weather2pgsql

Download NOAA weather for a user-specified US state
Shell
5
star
56

tyra

Prediction model evaluation
JavaScript
4
star
57

barefoot-winnie-public

Recommending responses to law related queries - Built during DSSG in collaboration with Barefoot Law
Python
4
star
58

hylas

Webapp for visualizing ML'd data
JavaScript
4
star
59

innovation-ecosystems

Understanding city innovation hotspots using the Census CitySDK
CSS
4
star
60

healthleads-public

The public repo for the 2014 DSSG Health Leads project
Python
4
star
61

project_template

A template for a sample DSSG project.
Shell
4
star
62

solveforgood-wri

Jupyter Notebook
4
star
63

babies-public

This is the publicly available version of the babies repo, containing code used during our project with the Illinois Department of Human Services to predict and reduce adverse births in Illinois.
Python
4
star
64

stupid-csv-tricks

Code for doing slightly atypical things with CSVs
Python
4
star
65

catwalk

Training, testing, and evaluating machine learning classifier models
Python
3
star
66

lorax

Speaks for the trees by providing individual feature importances from random forests.
Python
3
star
67

homelessness-public

Python
3
star
68

marketplace

Code for the Solve for Good platform run by the DSSG Foundation
Python
3
star
69

rws_accident_prediction_public

rws_accident_prediction
Jupyter Notebook
3
star
70

cincinnati2015-public

Predicting blight in Cincinnati
Python
3
star
71

mexico-public

Public facing Mexico repository
R
3
star
72

audition

Choosing the best classifier models
Python
3
star
73

dssg-public-hmda

R
3
star
74

install-cli

Bash library for guided installation & bootstrapping
Shell
3
star
75

sklearn_tutorial

Short tutorial on some pipeline issues
Jupyter Notebook
3
star
76

machine_learning_legislation

Automatically identify earmarks in congressional spending bills
OpenEdge ABL
3
star
77

acdhs_housing_public

Python
3
star
78

nfp

Impact evaluation of the Nurse-Family Partnership nonprofit
R
3
star
79

MS2Postgres

A tool to move data from SQL Server to PostgreSQL in an environment with limited harddrive space.
Python
3
star
80

EDF

Analysis of energy efficiency loan data for the Environmental Defense Fund.
3
star
81

check-for-secrets

Discovering Secrets analysts Possibly Pushed
Shell
2
star
82

el-salvador-mined-public

Reducing Early School Dropout Rates in El Salvador
Jupyter Notebook
2
star
83

hiv-retention-public

Jupyter Notebook
2
star
84

after-hours

R
2
star
85

education-highschool-public

DSSG 2015 project focused on using data science methods to help partner public school districts improve their respective high school graduation rates and outcomes.
HTML
2
star
86

passenv

Shell command like env to run a program in an environment modified by values read from standard input
Python
2
star
87

chile-dt-public

Improving Workplace Safety through Proactive Inspections in Chile
R
2
star
88

cincinnati_ems_public

Jupyter Notebook
2
star
89

tuscany-tourism-public

Data-Driven Planning for Sustainable Tourism in Tuscany
HTML
2
star
90

architect

Plan, design and build train and test matrices
Python
2
star
91

panopticon

The command center at the DSSG office. http://en.wikipedia.org/wiki/Panopticon
Ruby
2
star
92

pakistan_ihhn_public

Public Repository for the DSSG 2022 Pakistan IHHN Project
Python
2
star
93

donors-choose

Jupyter Notebook
2
star
94

data-sci-fellows

The sexy landing page that will make everyone want to apply for the fellowship.
JavaScript
2
star
95

obscuritext

Transform text to be unreadable but still somewhat useful
Jupyter Notebook
2
star
96

dojo_mh_public

Public Repository for DSSG 2022 Douglas and Johnson Counties, KS Mental Health Project
Jupyter Notebook
2
star
97

baltimore_roofs_public

Public Repository for DSSG 2022 Baltimore Roofs Project
Python
2
star
98

sanergy-public

Python
2
star
99

signalled-timeout

Timeout library for generic interruption of main thread by an exception after a configurable duration.
Python
2
star
100

appy-reviews

A "smart" Web application for reviewing DSSG program application submissions
Python
2
star