Crash Modeling
Outline:
- Project Overview
- Data Sources and Modelling
- Setting up
- Contributing
- Connect with us
- Project Organization
Project Overview
Motivation
This project began as a collaboration between Data4Democracy and the City of Boston.
On Jan 25th, 2017, 9 pedestrians were hit by vehicles in Boston. While this was a particularly dangerous day, there were 21 fatalities and over 4000 severe injuries due to crashes in 2016 alone, representing a public health issue for all those who live, work, or travel in Boston. The City of Boston would like to partner with Data For Democracy to develop a dynamic prediction system that identifies potential trouble spots, so that timely interventions can prevent crashes before they happen and make Boston a safer place for its citizens.
This is part of the City's long-term Vision Zero initiative, which is committed to the goal of zero fatal and serious traffic crashes in the city by 2030. The Vision Zero concept was first conceived in Sweden in 1997 and has been widely credited with a significant reduction in fatal and serious crashes on Sweden's roads in the decades since then. Cities across the United States are adopting bold Vision Zero initiatives that share these common principles.
Children growing up today deserve...freedom and mobility. Our seniors should be able to safely get around the communities they helped build and have access to the world around them. Driving, walking, or riding a bike on Boston's streets should not be a test of courage.
-- Mayor Martin J. Walsh
What is the goal of the project?
The goal of the project is to promote the development of safer roads by identifying areas of high risk in a city's road network. It seeks to support the decision-making of transportation departments in 3 ways:
- Identify high risk locations - which roads in the network represent the greatest risk of crashes?
- Explain the contributing factors of risk - what are the features, patterns and trends that result in a location having elevated risk?
- Assess the impact of intervention - what is the effect of a past or planned intervention on the risk of crashes?
Who are the intended users of the project?
Though originally a collaboration between Data4Democracy and the City of Boston, the project is now being developed to work for any city that wishes to use it. The intended users include city transportation departments, those responsible for managing risk on road networks and individuals interested in crash risk.
How does the project achieve its goal?
The project uses machine learning to generate predictions of risk by combining various types of data. Right now it makes use of:
- road segment data to build a map of a city's road network, presently being sourced from OpenStreetMap
- historical crash data to determine which locations have proved high risk in the past, provided by participating cities through their open data portals
- safety concerns data to understand where citizens believe their roads are unsafe and the nature of their concerns, also provided by participating cities by way of their respective VisionZero programs or SeeClickFix
Future versions of the project are likely to make use of:
- traffic volume data to understand which roads experience the highest traffic and how changing trends of usage might affect risk
- more detailed road features including speed limits, signals, bike lanes, crosswalks, parking etc.
- road construction data
Predictions are generated on a per-road-segment basis and can be explored with an interactive visualization.
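To illustrate what "per road-segment" means for the historical crash data, here is a minimal sketch that counts past crashes per segment and ranks the segments. It assumes each crash has already been matched to a segment id; the function name and the `seg_*` ids are hypothetical and are not taken from the project's code, which uses a machine learning model rather than raw counts.

```python
from collections import Counter


def rank_segments_by_crashes(crash_segment_ids, top_n=3):
    """Return the top_n (segment_id, crash_count) pairs, highest count first."""
    counts = Counter(crash_segment_ids)
    # most_common sorts by count, descending
    return counts.most_common(top_n)


# Hypothetical input: one segment id per historical crash
crashes = ["seg_12", "seg_7", "seg_12", "seg_3", "seg_12", "seg_7"]
print(rank_segments_by_crashes(crashes, top_n=2))
# [('seg_12', 3), ('seg_7', 2)]
```

Raw counts like these only describe the past; the model described above combines them with other features to estimate risk.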
What are the requirements for use?
Any city that wishes to can make use of the project. At a minimum, geo-coded historical crash data is required. Beyond this, cities that can supply safety concerns data (VisionZero or otherwise) will be able to generate more advanced predictions of risk.
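Before handing crash data to the pipeline, it can help to confirm that every record really is geo-coded. The sketch below is one possible check using only the standard library; the function name and the sample column names (`ID`, `Y`, `X`) are assumptions for illustration, so substitute whatever columns your city's file uses.

```python
import csv
import io


def find_invalid_rows(csv_text, id_col, lat_col, lon_col):
    """Return 1-based data-row numbers with a missing id or unusable coordinates."""
    bad_rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            # Column missing, row too short, or value not numeric
            bad_rows.append(row_number)
            continue
        in_range = -90 <= lat <= 90 and -180 <= lon <= 180
        if not row.get(id_col) or not in_range:
            bad_rows.append(row_number)
    return bad_rows


# Hypothetical sample: row 2 has no latitude, row 3 has an impossible one
sample = "ID,Y,X\n1,42.36,-71.06\n2,,-71.10\n3,142.0,-71.00\n"
print(find_invalid_rows(sample, "ID", "Y", "X"))
# [2, 3]
```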
What is the release schedule?
The intended roadmap of development for the project can be found at https://github.com/Data4Democracy/crash-model/projects.
How can I access the project?
This repo can be downloaded and run in its entirety using Docker, or you can see a current deployment of the project at https://insightlane.org.
Data Sources and Modelling
Data Sources
- OpenStreetMap road network and features
- Crash data must be provided (see data standards)
- Pipeline can incorporate other networks and features (see using custom data sources)
- All our processed data is in a private repository in data.world -- ping a project lead or maintainer on Slack to get access. More detailed documentation is contained there.
Data Model
- The data dictionary contains information about the default features included in the model
- As of V2.0, the model tests Logistic Regression against XGBoost and picks the better performer (based on ROC AUC)
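The selection rule can be sketched without any ML libraries. The example below is not the project's training code: it assumes two candidates have already produced scores on the same held-out labels, computes ROC AUC directly from its pairwise definition, and keeps the higher-scoring candidate. The function names, model names and sample scores are all hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pick_best_model(labels, scores_by_model):
    """Return (model_name, auc) for the candidate with the highest ROC AUC."""
    aucs = {name: roc_auc(labels, scores) for name, scores in scores_by_model.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs[best]


# Hypothetical held-out labels and predicted probabilities
y_true = [0, 0, 1, 1]
candidates = {
    "logistic_regression": [0.1, 0.4, 0.35, 0.8],
    "xgboost": [0.2, 0.3, 0.6, 0.9],
}
print(pick_best_model(y_true, candidates))
# ('xgboost', 1.0)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a library routine such as scikit-learn's `roc_auc_score` is the usual choice for real data.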
Setting up
I want to set up a local development environment and run the pipeline
- Clone the repo
- Insight Lane with Conda
- Install anaconda or miniconda
- Use the Python 3.7 version
- Navigate to the repo directory
- Create an environment for the project using the command
conda env create -f environment_<your os>.yml (<your os> will be Linux, Mac or PC)
- Activate the environment using
source activate crash-model
I want to set up a Docker development environment
- A basic [Docker](https://www.docker.com/) image has been created to run the project in a container, using the ContinuumIO miniconda3 base image (Python 3.6)
- Download or build the image
- Download from Docker Hub:
$ docker pull insightlane/crash-model:latest
- Build from the repo:
$ docker build --tag insightlane/crash-model:[tag] .
- To run the image:
$ docker run -d -p 8080:8080 --name bcm.local -v /local/path/to/project_repo:/app insightlane/crash-model:[tag]
- Once the image is running, you can get a bash prompt to run pipeline commands/etc by running the following:
$ docker exec -it bcm.local /bin/bash
I want to add a new city
- Step 1: Obtain crash data for your city
- Try looking for your city's open data portal or contacting someone from your local transportation department
- Format should be CSV
- Step 1a: My crash data has addresses instead of latitude and longitude
- See our geocoding section for how to process this into latitude and longitude
- Step 2: Set up your environment (See above)
- Step 3: Generate a configuration file
Detailed walkthrough
- Run
python initialize_city.py -city <city name> -f <folder name> -crash <crash file> --supplemental <supplemental file1>,<supplemental file2>
- City name: e.g. "Cambridge, MA, USA".
- Folder name: Name for city folder
- Crash file: The location of the crash data
- Supplemental files: Any other files that contain additional features
- Edit generated configuration file to specify columns in crash data containing id, latitude and longitude
- Step 4: Run the pipeline
- Navigate to the src directory
- Run python pipeline.py -c
- Step 5: Check results
- There should be a number of files in the data//processed directory
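For Step 5, a quick way to confirm the pipeline produced output is to list the files in the processed directory. This is a small sketch rather than part of the project: the function name is hypothetical, and the directory is passed in as an argument instead of being assumed, since it depends on your city folder. The demo uses a temporary directory and a made-up file name so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def summarize_outputs(processed_dir):
    """Return sorted (file name, size in bytes) pairs for files directly in processed_dir."""
    directory = Path(processed_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory; did the pipeline finish?")
    return sorted((p.name, p.stat().st_size) for p in directory.iterdir() if p.is_file())


# Demo against a throwaway directory; point this at your own processed directory instead
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_output.csv").write_text("id,value\n1,0.5\n")
    for name, size in summarize_outputs(tmp):
        print(f"{name}: {size} bytes")
```

An empty list, or a file of zero bytes, is a hint that an earlier pipeline step failed.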
I want to run the interactive visualization (showcase)
- Obtain a Mapbox token
- Export an environment variable called MAPBOX_TOKEN
- Export an environment variable
CONFIG_FILE=config_<folder name>.yml
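Forgetting one of these variables is an easy mistake, so a pre-flight check can save some debugging. The sketch below only verifies that `MAPBOX_TOKEN` and `CONFIG_FILE` are set and non-empty; the function name is hypothetical, and how the showcase itself reads these variables is not shown here.

```python
import os

REQUIRED_VARS = ("MAPBOX_TOKEN", "CONFIG_FILE")


def missing_showcase_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping; call with no arguments to check the real environment
print(missing_showcase_vars({"MAPBOX_TOKEN": "pk.example"}))
# ['CONFIG_FILE']
```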
Contributing
"First-timers" are welcome! Whether you're trying to learn data science, hone your coding skills, or get started collaborating over the web, we're happy to help. If you have any questions feel free to pose them on our Slack channel, or reach out to one of the team leads.
I want to know what's going on and pick up a task I like
Open tasks are available here. Issues pertaining to upcoming releases are available here.
I want to add a new city to the online showcase
Once you've successfully run the pipeline on a city, get in touch with the Insight Lane team for details on how to add it to the showcase.
Connect with us
Join our Slack channel.
Leads:
- @bpben
- @j-t-t
- @terryf82
- @andhint
- @alicefeng
Project Organization
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
Project structure based on the cookiecutter data science project template. #cookiecutterdatascience