
A command line tool to easily add an ethics checklist to your data science projects.

Read more about deon on the project homepage


An ethics checklist for data scientists

deon is a command line tool that allows you to easily add an ethics checklist to your data science projects. We support creating a new, standalone checklist file or appending a checklist to an existing analysis in many common formats.

To help get started, deon includes a default Data Science Ethics Checklist along with a list of real-world examples connected with each item. Users can draw on the default list or develop their own.


δέον • (déon) [n.] (Ancient Greek) Wiktionary

Duty; that which is binding, needful, right, proper.


The conversation about ethics in data science, machine learning, and AI is increasingly important. The goal of deon is to push that conversation forward and provide concrete, actionable reminders to the developers who have influence over how data science gets done.

Quickstart

You only need two lines of code to get started!

First, install deon:

$ pip install deon

Then, write out the default checklist to a markdown file called ETHICS.md:

$ deon -o ETHICS.md

Dig into the checklist questions to identify and navigate the ethical considerations in your data science project.

For more configuration details, see the sections on command line options, supported output file types, and custom checklists.

Background and perspective

We have a particular perspective with this package that we will use to make decisions about contributions, issues, PRs, and other maintenance and support activities.

First and foremost, our goal is not to be arbitrators of what ethical concerns merit inclusion. We have a process for changing the default checklist, but we believe that many domain-specific concerns are not included and that teams will benefit from developing custom checklists. Not every checklist item will be relevant. We encourage teams to remove items or sections, or to mark items as N/A, as the concerns of their projects dictate.

Second, we built our initial list from a set of proposed items on multiple checklists that we referenced. This checklist was heavily inspired by an article written by Mike Loukides, Hilary Mason, and DJ Patil and published by O'Reilly: "Of Oaths and Checklists". We owe a great debt to the thinking that preceded this, and we look forward to thoughtful engagement with the ongoing discussion about checklists for data science ethics.

Third, we believe in the power of examples to bring the principles of data ethics to bear on human experience. This repository includes a list of real-world examples connected with each item in the default checklist. We encourage you to contribute relevant use cases that you believe can benefit the community by their example. In addition, if you have a topic, idea, or comment that doesn't seem right for the documentation, please add it to the wiki page for this project!

Fourth, it's not up to data scientists alone to decide what the ethical course of action is. This has always been a responsibility of organizations that are part of civil society. This checklist is designed to provoke conversations around issues where data scientists have particular responsibility and perspective. This conversation should be part of a larger organizational commitment to doing what is right.

Fifth, we believe the primary benefit of a checklist is ensuring that we don't overlook important work. Sometimes it is difficult with pressing deadlines and a demand to multitask to make sure we do the hard work to think about the big picture. This package is meant to help ensure that those discussions happen, even in fast-moving environments. Ethics is hard, and we expect some of the conversations that arise from this checklist may also be hard.

Sixth, we are working at a level of abstraction that cannot concretely recommend a specific action (e.g., "remove variable X from your model"). Nearly all of the items on the checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Because of this, most of the items are framed as prompts to discuss or consider. Teams will want to document these discussions and decisions for posterity.

Seventh, we can't exhaustively define every term that appears in the checklist. Some of these terms are open to interpretation or mean different things in different contexts. We recommend that, when relevant, users create their own glossary for reference.

Eighth, we want to avoid any items that strictly fall into the realm of statistical best practices. Instead, we want to highlight the areas where we need to pay particular attention above and beyond best practices.

Ninth, we want all the checklist items to be as simple as possible (but no simpler), and to be actionable.

Using this tool

Prerequisites

  • Python >3.6: Your project need not be Python 3, but you need Python 3 to execute this tool.

Installation

$ pip install deon

or

$ conda install deon -c conda-forge

Simple usage

We recommend adding a checklist as the first step in your data science project. After creating your project folder, you could run:

$ deon -o ETHICS.md

This will create a markdown file called ETHICS.md that you can add directly to your project.

For simple one-off analyses, you can append the checklist to a Jupyter notebook or RMarkdown file using the -o flag to indicate the output file. deon will automatically append if that file already exists.

$ jupyter notebook my-analysis.ipynb

...

$ deon -o my-analysis.ipynb  # append cells to existing output file

This checklist can be used by individuals or teams to ensure that reviewing the ethical implications of their work is part of every project. The checklist is meant as a jumping-off point, and it should spark deeper and more thorough discussions rather than replace those discussions.
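Note that deon appends to an existing output file by default; to replace the file instead, pass the -w/--overwrite flag described in the command line options below:

$ deon -o ETHICS.md -w  # overwrite ETHICS.md instead of appending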

Proudly display your Deon badge

You can add a Deon badge to your project documentation, such as the README, to encourage wider adoption of these ethical practices in the data science community.

HTML badge

<a href="http://deon.drivendata.org/">
    <img src="https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square" alt="Deon badge" />
</a>

Markdown badge

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

Supported file types

Here are the currently supported file types; a couple of usage sketches follow the list. We will accept pull requests with new file types if there is a strong case for widespread use of that file type.

  • .txt: ascii
  • .html: html
  • .ipynb: jupyter
  • .md: markdown
  • .rmd: rmarkdown
  • .rst: rst
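For example, using only the options documented below, the format can be inferred from the output file extension, or set explicitly with -f when printing to standard output (the redirect filename here is just an example):

$ deon -o ETHICS.rst        # format inferred from the .rst extension
$ deon -f rst > ETHICS.rst  # explicit format, redirected from standard output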

Command line options

Usage: deon [OPTIONS]

  Easily create an ethics checklist for your data science project.

  The checklist will be printed to standard output by default. Use the --output
  option to write to a file instead.

Options:
  -l, --checklist PATH  Override default checklist file with a path to a custom
                        checklist.yml file.
  -f, --format TEXT     Output format. Default is "markdown". Can be one of
                        [ascii, html, jupyter, markdown, rmarkdown, rst].
                        Ignored and file extension used if --output is passed.
  -o, --output PATH     Output file path. Extension can be one of [.txt, .html,
                        .ipynb, .md, .rmd, .rst]. The checklist is appended if
                        the file exists.
  -w, --overwrite       Overwrite output file if it exists. Default is False,
                        which will append to existing file.
  -m, --multicell       For use with Jupyter format only. Write checklist with
                        multiple cells, one item per cell. Default is False,
                        which will write the checklist in a single cell.
  --help                Show this message and exit.
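Combining the options above, a team could, for instance, write the checklist into a notebook with one item per cell and overwrite any earlier version (my-analysis.ipynb is a placeholder name):

$ deon -o my-analysis.ipynb -m -w  # one item per cell; overwrite if the file exists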

Default checklist


Data Science Ethics Checklist

Deon badge

A. Data Collection

  • A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
  • A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
  • A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII), for example through anonymization or not collecting information that isn't relevant for analysis?
  • A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

B. Data Storage

  • B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
  • B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
  • B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?

C. Analysis

  • C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
  • C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
  • C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
  • C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
  • C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

D. Modeling

  • D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
  • D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
  • D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
  • D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
  • D.5 Communicate bias: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

E. Deployment

  • E.1 Monitoring and evaluation: How are we planning to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
  • E.2 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
  • E.3 Roll back: Is there a way to turn off or roll back the model in production if necessary?
  • E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

Data Science Ethics Checklist generated with deon.


Custom checklists

This is not meant to be the only ethics checklist; instead, we try to capture reasonable defaults that are general enough to be widely useful. For your own projects with particular concerns, we recommend creating your own checklist.yml file, maintained by your team and passed to this tool with the -l flag.

Custom checklists must follow the same schema as checklist.yml. There must be a top-level title, which is a string, and sections, which is a list. Each section in the sections list must have a title, a section_id, and a list of lines. Each line must have a line_id, a line_summary (a 1-3 word shorthand), and a line string with the content. The format is as follows, and a filled-in example appears after the schema:

title: TITLE
sections:
  - title: SECTION TITLE
    section_id: SECTION NUMBER
    lines:
        - line_id: LINE NUMBER
          line_summary: LINE SUMMARY
          line: LINE CONTENT
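For illustration, a minimal custom checklist following this schema might look like the following (the title, IDs, and wording here are hypothetical):

title: My Team Ethics Checklist
sections:
  - title: Data Collection
    section_id: A
    lines:
        - line_id: A.1
          line_summary: Informed consent
          line: Have all human subjects affirmatively opted in to our data collection?

Saved as, say, my_checklist.yml, it can then be passed to deon with the -l flag:

$ deon -l my_checklist.yml -o ETHICS.md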

Changing the checklist

Please see the framing for an understanding of our perspective. Given this perspective, we will consider changes to the default checklist that fit with that perspective and follow this process.

Our goal is to have checklist items that are actionable as part of a review of data science work or as part of a plan. Please avoid suggesting items that are too vague (e.g., "do no harm") or too specific (e.g., "remove social security numbers from data").

Note: This process is an experiment and is subject to change based on how well it works.

A pull request to add an item should change the default checklist file and any templates generated from it. The description in the pull request must include:

  • A justification for the change
  • A consideration of related items that already exist, and why this change is different from what exists
  • A published example (academic or press article) of where neglecting the principle has led to concrete harm (articles that discuss potential or hypothetical harm will not be considered sufficient)

See detailed contributing instructions here.

Discussion and commentary

In addition to this documentation, wiki pages are enabled for the GitHub repository. This is a good place for sharing links and discussing how the checklists are used in practice.

If you have a topic, idea, or comment that doesn't seem right for the documentation, please add it to the wiki!

References, reading, and more

A robust discussion of data ethics is important for the profession. The goal of this tool is to make it easier to implement ethics review within technical projects. There are lots of great resources if you want to think about data ethics, and we encourage you to do so!

Checklist citations

We're excited to see so many articles popping up on data ethics! The short list below includes articles that directly informed the checklist content as well as a few case studies and thought-provoking pieces on the big picture.

Where things have gone wrong

To make the ideas contained in the checklist more concrete, we've compiled examples of times when things have gone wrong. They're paired with the checklist questions to help illuminate where in the process ethics discussions may have helped provide a course correction.

We welcome contributions! Follow these instructions to add an example.

Related tools

There are other groups working on data ethics and thinking about how tools can help in this space. Here are a few we've seen so far:


deon was created and is maintained by the team at DrivenData. Our mission is to bring the power of data science to social impact organizations.
