  • Stars: 414
  • Rank 104,550 (Top 3 %)
  • Language: Jupyter Notebook
  • License: Other
  • Created over 8 years ago
  • Updated over 7 years ago

pyspark-tutorials

Code snippets and tutorials for working with social science data in PySpark. Note that each .ipynb file can be downloaded and the code blocks executed or experimented with directly using a Jupyter (formerly IPython) notebook, or each one can be displayed in your browser as markdown text just by clicking on it.

Spark Social Science Manual

The tutorials included in this repository are geared towards social scientists and policy researchers who want to undertake research using "big data" sets. A manual to accompany these tutorials is linked below. The objective of the manual is to provide social scientists with a brief overview of the distributed computing solution developed by The Urban Institute's Research Programming Team, and of the changes in how researchers manage and analyze data required by this computing environment.

Spark Social Science Manual

  1. If you're new to Python entirely, consider trying an intro tutorial first. Python is a language that stresses readability of code, so it won't be too difficult to dive right in. This is one good interactive tutorial.

  2. After that, or if you're already comfortable with Python basics, get started with pySpark with these two lessons. They will assume you are comfortable with what Python code looks like and in general how it works, and lay out some things you will need to know to understand the other lessons.

    Basics 1

    • Reading and writing data on S3
    • Handling column data types
    • Basic data exploration and describing
    • Renaming columns

    Basics 2

    • How pySpark processes commands - lazy computing
    • Persisting and unpersisting
    • Timing operations
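
The three Basics 2 topics can be previewed without a cluster. The sketch below is a plain-Python analogy, not PySpark code: a generator stands in for a lazily evaluated transformation, storing its output in a list stands in for persisting, and `time.perf_counter` times each step. The names `slow_square` and `calls` are invented for this illustration.

```python
import time

calls = []  # records every time the "expensive" step actually runs


def slow_square(x):
    """Stand-in for an expensive per-row computation."""
    calls.append(x)
    return x * x


data = [1, 2, 3, 4]

# Building the pipeline does no work yet, much like defining a
# transformation on a Spark DataFrame.
pipeline = (slow_square(x) for x in data)
work_done_before_action = len(calls)

# Consuming the pipeline is the "action" that triggers the computation.
start = time.perf_counter()
result = list(pipeline)  # keeping the list is the analogue of persisting
elapsed_first = time.perf_counter() - start

# Reusing the stored result does not recompute anything.
start = time.perf_counter()
total = sum(result)
elapsed_reuse = time.perf_counter() - start

print(f"calls before action: {work_done_before_action}")
print(f"calls after action:  {len(calls)}")
print(f"sum of squares:      {total}")
print(f"first pass took {elapsed_first:.6f}s, reuse took {elapsed_reuse:.6f}s")
```

The point of the analogy is that the timing of a lazily defined step only means something once an action forces it to run, which is why timing and persisting are taught together.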
  3. Basic data tasks are covered in the following guides. Note that these are not intended to be comprehensive! They cover many of the things that are most common, but others may require you to look them up or experiment. Hopefully this framework gives you enough to get started.

    Merging Data

    • Using unionAll to stack rows by matching columns
    • Using join to merge columns by matching specific row values

    Missing Values

    • Handling null values on loading
    • Counting null values
    • Dropping null values
    • Replacing null values

    Moving Average Imputation

    • Using pySpark window functions
    • Calculating a moving average
    • Imputing missing values
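
To make the arithmetic behind moving-average imputation concrete, here is a minimal sketch in plain Python rather than PySpark. It assumes a centered window and ignores missing neighbours when averaging, which mirrors how an average over a window skips nulls; the function name, the `half_width` parameter, and the sample series are all hypothetical and not taken from the tutorial.

```python
def impute_with_moving_average(values, half_width=1):
    """Fill None entries with the mean of non-missing neighbours.

    The window covers `half_width` positions on each side of the missing
    entry. If every neighbour in the window is missing too, the entry
    stays None.
    """
    filled = []
    for i, value in enumerate(values):
        if value is not None:
            filled.append(value)
            continue
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        # Only original observations are averaged, never imputed ones.
        neighbours = [v for v in values[lo:hi] if v is not None]
        filled.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return filled


series = [10.0, None, 14.0, 16.0, None, None, 22.0]
print(impute_with_moving_average(series))
```

A wider window (a larger `half_width`) fills longer gaps at the cost of smoothing over more of the series.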

    Pivoting/Reshaping

    • Using groupBy to organize data
    • Pivoting data with an aggregation
    • Reshaping from long to wide without aggregation
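
As a rough picture of what reshaping from long to wide without aggregation does, the following plain-Python sketch turns one row per (index, column) pair into one row per index value. It is not PySpark code; `long_to_wide`, the field names, and the numbers are invented for illustration.

```python
def long_to_wide(rows, index, column, value):
    """Turn one row per (index, column) pair into one row per index."""
    columns = []  # new column names, in order of first appearance
    wide = {}     # index value -> {new column name: cell value}
    for row in rows:
        if row[column] not in columns:
            columns.append(row[column])
        wide.setdefault(row[index], {})[row[column]] = row[value]
    # Combinations that never occur in the long data become None.
    return [
        {index: key, **{c: cells.get(c) for c in columns}}
        for key, cells in wide.items()
    ]


long_rows = [
    {"state": "DC", "quarter": "q1", "rate": 1.0},
    {"state": "DC", "quarter": "q2", "rate": 2.0},
    {"state": "MD", "quarter": "q1", "rate": 3.0},
]
for wide_row in long_to_wide(long_rows, "state", "quarter", "rate"):
    print(wide_row)
```

Because each (state, quarter) pair appears at most once here, no aggregation is needed; with duplicates, the last value seen would silently win, which is the case where a pivot with an explicit aggregation is the safer choice.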

    Resampling

    • Upsampling data based on a date column
    • Using datetime objects
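
Upsampling on a date column amounts to generating every date in the range and attaching the observed values where they exist. The sketch below shows that idea with Python's standard `datetime` module only; it is an assumption-laden illustration rather than the tutorial's PySpark code, and `upsample_daily` and the sample rows are hypothetical.

```python
from datetime import date, timedelta


def upsample_daily(rows):
    """Return one row per calendar day between the first and last date.

    `rows` is a list of (date, value) pairs. Days with no observation are
    added with a value of None so they can be imputed later.
    """
    observed = dict(rows)
    current, last = min(observed), max(observed)
    out = []
    while current <= last:
        out.append((current, observed.get(current)))
        current += timedelta(days=1)
    return out


rows = [(date(2016, 1, 1), 5), (date(2016, 1, 4), 8)]
for day, value in upsample_daily(rows):
    print(day.isoformat(), value)
```

The rows added with `None` are exactly the gaps that an imputation step, such as a moving average, would then fill.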

    Subsetting

    • Filtering data based on criteria
    • Taking a randomized sample

    Summary Statistics

    • Using describe
    • Adding additional aggregations to describe output

    Graphing

    • Aggregating to use Matplotlib and Pandas
  4. The pySpark bootstrap used by the Urban Institute to start a cluster on Amazon Web Services only installs a handful of Python modules. If you need others for your work, or specific versions, this tutorial explains how to get them. It uses only standard Python libraries, and is therefore not specific to the pySpark environment:

    Installing Python Modules

    • Using the pip module for Python packages
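
One way to install a package, or a specific version of it, from inside a running Python session is to call pip as a child process of the current interpreter. The sketch below uses only the standard library; it is not necessarily the approach the linked tutorial takes, and `pip_install_command`, `pip_install`, and the package name `examplepkg` are placeholders.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build the argument list for installing `package` with pip."""
    requirement = f"{package}=={version}" if version else package
    # sys.executable makes pip install into the interpreter that is
    # running this code, rather than whichever `pip` is first on PATH.
    return [sys.executable, "-m", "pip", "install", requirement]


def pip_install(package, version=None):
    """Run pip in a child process; raises CalledProcessError on failure."""
    subprocess.check_call(pip_install_command(package, version))


# Show the arguments without running them. "examplepkg" is a placeholder.
print(pip_install_command("examplepkg", "1.2.3")[1:])
```

Calling `pip_install("examplepkg", "1.2.3")` would run the command. On a cluster this installs only on the machine where the code runs, so worker nodes may need the same step.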
  5. And finally, now that Spark 2.0 is deployed to Amazon Web Services, development has begun on OLS and GLM tutorials, which will be uploaded when complete.

    Introduction to GLM
