• Stars
    star
    218
  • Rank 181,805 (Top 4 %)
  • Language
    Jupyter Notebook
  • Created over 4 years ago
  • Updated 2 months ago

Repository Details

This course is a rigorous, year-long introduction to computational social science. We cover topics spanning reproducibility and collaboration, machine learning, natural language processing, and causal inference. This course has a strong applied focus with emphasis placed on doing computational social science.

Computational-Social-Science-Labs

This repo contains all of the materials for Sociology 273, Computational Social Science Parts A/B, designed as part of Berkeley's Computational Social Science Training Program. This course is a rigorous, yearlong introduction to computational social science. The target audience is PhD students in their second year and beyond who have completed their home departments' introductory statistics courses. We cover topics spanning reproducibility and collaboration, machine learning, natural language processing, and causal inference. This course has a strong applied focus, with emphasis placed on doing computational social science. It makes extensive use of simulations, functional programming, and visualizations to illustrate statistical concepts and demonstrate how "computational social science" is a framework for thinking about how to analyze big data. By the end of the course, students will be well acquainted with some of the latest research and advances in computational social science, and will begin working on their own projects.

Most modules contain both a "student" version and a "solutions" version. These are substantially the same, except that the student versions leave some code lines partially blank for in-class challenges. Each project is designed for groups of 3-4 students who use GitHub to collaborate and version control code. Several popular data science libraries are used frequently, including sklearn, numpy, pandas, spaCy, gensim, tidyverse, tidymodels, and SuperLearner. For the most part, the latest versions of these libraries should work, with exceptions noted in the notebooks as necessary.
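
Since the latest library versions are expected to work in most cases, it can help to record what is actually installed before opening a notebook. The sketch below is not part of the course materials. It uses only the standard library's `importlib.metadata`, and the package list is an assumption: PyPI distribution names for the Python libraries named above (sklearn is published as `scikit-learn`). The R packages are not covered.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the Python libraries named above.
PYTHON_PACKAGES = ["scikit-learn", "numpy", "pandas", "spacy", "gensim"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PYTHON_PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this output against any version notes in a notebook is a quick first check when a cell fails to run.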

If you cannot get the materials to work locally, you may use DataHub, which runs the materials in an instance on UC Berkeley's servers. No installation is needed on your end; you only need an internet browser and a CalNet ID to log in. With DataHub, you can save your work and come back to it at any time. Note: Some users may have to click the link twice if the materials do not load initially.

D-Lab DataHub

Table of Contents

  1. Setup Anaconda Installation
  2. Installation and Reproducible Data Science
    • 1-1 Anaconda Installation
    • 1-2 Command Line Intro:
      • Introduction to using a command line interface (CLI) to interact with a computer
      • Basics of navigating file directories, editing text, and running shell/Python scripts
    • 1-3 GitHub Intro:
      • Introduction to git, version control, and GitHub.
      • Best practices for using version control to track code changes and collaborate with others without running into conflicts, and for using GitHub to showcase a portfolio and find open source software/code
    • 1-4 Statistics Refresher
    • Project 1:
      • Use command line and GitHub to create a group repo and practice with version control and branching.
      • Create a personal website using GitHub Pages.
  3. Introduction to Machine Learning
    • 2-1 Math Review:
      • Matrix multiplication
      • Derivatives
      • Integrals
      • numpy/scipy
    • 2-2 Data Splitting and Bias-Variance Tradeoff:
      • Introduction to train/validation/test splits and cross-validation for machine learning
      • Bias-variance tradeoff
      • Confusion matrices
    • 2-3 Regression:
      • Ordinary Least Squares
      • Regularization via Ridge/LASSO
      • Coefficient plots
      • Hyperparameter tuning
    • Project 2:
      • Predict county-level diabetes rates
      • Exploratory data analysis, data cleaning and preparation, hyperparameter tuning, feature selection, model validation
  4. Supervised Machine Learning
    • 3-1 Classification:
      • Imbalanced class labels
      • Logistic regression, decision tree classifier, support vector machine
      • Hyperparameter tuning
      • Metrics (accuracy, recall, precision, AUC-ROC)
    • 3-2 Trees and Ensembles:
      • Decision tree, random forest, AdaBoost
      • Variable importance plot
    • 3-3 Neural Networks:
      • Multi-layer perceptron
      • Keras/TensorFlow
      • Convolutional neural network
    • Project 3:
      • Predict health code violations in Chicago restaurants.
      • Data preprocessing, classification models, interpretable and explainable machine learning, prediction policy problems
  5. Unsupervised Machine Learning and TPOT
    • 4-1 Clustering and PCA:
      • Principal components analysis
      • Clustering (k-means, spectral, etc.)
      • Unsupervised learning outputs as inputs to supervised learning
    • 4-2 TPOT:
      • TPOT genetic programming to automatically search for machine learning pipelines covering preprocessing, unsupervised learning, and classification/regression
    • Project 4:
      • Unsupervised learning and neural network classification on National Health and Nutrition Examination Survey (NHANES)
      • Difference between dimensionality reduction and clustering
      • Combining dimensionality reduction and clustering
      • Deep learning with one hidden layer
  6. Natural Language Processing
    • 5-1 Text Preprocessing:
      • Tokenization
      • Stop words
      • Entity recognition
      • Lemmatization
      • Bag of words/term frequency-inverse document frequency
      • Naive Bayes
      • spaCy
    • 5-2 Exploratory Data Analysis and Unsupervised Methods:
      • Word clouds
      • Sentiment polarity
      • Topic modeling
    • 5-3 Text Feature Engineering and Classification:
      • N-grams
      • Word counts
      • Topic model proportions as input to classification
      • Combining text and non-text features
    • 5-4 word2vec:
      • Word embeddings
      • t-SNE
      • doc2vec
      • Document average word embeddings
      • Pre-trained embeddings using gensim
    • 5-5 Neural Nets for NLP
    • Project 5:
      • Investigate asymmetric polarization and moderation/extremism in U.S. Congress tweets.
      • Text preprocessing, exploratory data analysis, text feature engineering, classification
  7. Causal Inference
    • 6-1 R Refresher:
      • Introduction to R
      • dplyr, tidyr, ggplot2, purrr
    • 6-2 Randomized Experiments:
      • Average Treatment Effect (ATE)
      • Individual-level Treatment Effect (ITE)
      • Average Treatment Effect on the Treated (ATT)
      • Heterogeneous Treatment Effects
      • Randomization Designs (complete, cluster, block)
      • Statistical tests of difference
    • 6-3 Matching Methods:
      • Propensity score matching
      • Full/optimal/greedy matching
      • Mahalanobis distance
      • Double robust estimators
    • Project 6:
      • Replicate studies examining effect of college attendance on political participation.
      • Preprocessing, matching after a randomized study to improve covariate balance, simulations to examine how different matching configurations affect ATE estimates
    • 6-4 Regression Discontinuity:
      • Regression discontinuity
      • Running variable
      • McCrary density test
      • Sharp discontinuity
      • Bandwidth selection via Imbens-Kalyanaraman and cross-validation
    • 6-5 Instrumental Variables:
      • Directed Acyclic Graphs (DAGs)
      • Exclusion restriction
      • Colliders
      • Two-Stage Least Squares (2SLS)
    • 6-6 Diff-in-Diffs and Synthetic Control:
      • Difference-in-differences method
      • Parallel trends assumption
      • Synthetic control
      • Augmented synthetic control with Ridge regularization
      • Staggered adoption synthetic control
    • Project 7:
      • Diff-in-diffs and synthetic control to analyze the effect of Affordable Care Act (ACA) Medicaid expansion among adopting states over time.
    • 6-7 Sensitivity Analysis:
      • Manski bounds
      • Rosenbaum sensitivity analysis
      • E-values
    • 6-8 SuperLearner and Longitudinal Targeted Maximum Likelihood Estimation (LTMLE):
      • Ensemble machine learning for causal inference
      • Parallelization in R
      • Targeted learning
      • Double robust estimators
      • Time-dependent confounding
    • Project 8:
      • Effect of blood pressure medication on heart disease using SuperLearner, TMLE, and LTMLE.
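
As a small illustration of the difference-in-differences method listed under 6-6, here is a minimal sketch of the basic two-group, two-period estimator. It is not taken from the course notebooks: the function name `diff_in_diffs` is hypothetical, the outcome values are made-up toy numbers, and the estimate is only meaningful under the parallel trends assumption.

```python
from statistics import mean


def diff_in_diffs(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate.

    Each argument is a list of outcomes for one group in one period.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control group's change stands in for the trend the treated
    # group would have followed without treatment (parallel trends).
    return treated_change - control_change


# Made-up toy outcomes, for illustration only.
treated_pre = [10.0, 12.0, 11.0]   # mean 11
treated_post = [15.0, 17.0, 16.0]  # mean 16
control_pre = [9.0, 10.0, 11.0]    # mean 10
control_post = [11.0, 12.0, 13.0]  # mean 12

estimate = diff_in_diffs(treated_pre, treated_post, control_pre, control_post)
print(estimate)  # 3.0: treated change of 5 minus control change of 2
```

The same estimate is usually obtained in practice from a regression with a treatment-by-period interaction term, which also yields standard errors.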

More Repositories

1

Machine-Learning-in-R

Workshop (6 hours): preprocessing, cross-validation, lasso, decision trees, random forest, xgboost, superlearner ensembles
CSS
187
star
2

Python-Fundamentals-Legacy

D-Lab's 12 hour introduction to Python. Learn how to create variables and functions, use control flow structures, use libraries, import data, and more, using Python and Jupyter Notebooks.
Jupyter Notebook
168
star
3

R-Fundamentals-Legacy

D-Lab's 12 hour introduction to R Fundamentals. Learn how to create variables and functions, manipulate data frames, make visualizations, use control flow structures, and more, using R in RStudio.
R
139
star
4

Bash-Git

D-Lab's 3 hour introduction to basic Bash commands and using version control with Git and Github.
131
star
5

R-Deep-Learning

Workshop (6 hours): Deep learning in R using Keras. Building & training deep nets, image classification, transfer learning, text analysis, visualization
R
120
star
6

git-fundamentals

A starting point for discovering the wonderful world of Git, GitHub, and Git Annex (Assistant)
Shell
74
star
7

Stata-Fundamentals

D-Lab's 9 hour introduction to performing data analysis with Stata. Learn how to program, conduct data analysis, create visualization, and conduct statistical analyses in Stata.
Stata
72
star
8

python-for-everything

Materials for teaching the Python for Everything workshop at UC Berkeley's D-lab
Jupyter Notebook
69
star
9

Python-Machine-Learning

D-Lab's 6 hour introduction to machine learning in Python. Learn how to perform classification, regression, clustering, and do model selection using scikit-learn in Python.
Jupyter Notebook
66
star
10

MachineLearningWG

D-Lab's Machine Learning Working Group at UC Berkeley, with supervised & unsupervised learning tutorials in R and Python
HTML
65
star
11

Python-Geospatial-Fundamentals-Legacy

D-Lab's 6 hour introduction to working with geospatial data in Python. Learn how to import, visualize, and analyze geospatial data using GeoPandas in Python.
Jupyter Notebook
57
star
12

Python-Data-Visualization-Legacy

D-Lab's 3 hour introduction to data visualization with Python. Learn how to create histograms, bar plots, box plots, scatter plots, compound figures, and more, using matplotlib and seaborn.
Jupyter Notebook
56
star
13

R-Geospatial-Fundamentals-Legacy

This is the repository for D-Lab's Geospatial Fundamentals in R with sf workshop.
Jupyter Notebook
53
star
14

Python-Data-Wrangling-Legacy

D-Lab's 3 hour introduction to data wrangling in Python. Learn how to import and manipulate dataframes using pandas in Python.
Jupyter Notebook
51
star
15

R-Machine-Learning-Legacy

D-Lab's 6 hour introduction to machine learning in R. Learn the fundamentals of machine learning, regression, and classification, using tidymodels in R.
R
47
star
16

Unsupervised-Learning-in-R

Workshop (6 hours): Clustering (Hdbscan, LCA, Hopach), dimension reduction (UMAP, GLRM), and anomaly detection (isolation forests).
R
47
star
17

python-berkeley

python resources of berkeley curated at a place
Jupyter Notebook
44
star
18

Python-Text-Analysis-Fundamentals

D-Lab's 9 hour introduction to text analysis with Python. Learn how to perform bag-of-words, sentiment analysis, topic modeling, word embeddings, and more, using scikit-learn, NLTK, gensim, and spaCy in Python.
Jupyter Notebook
38
star
19

R-Data-Wrangling-Legacy

D-Lab's 6 hour introduction to data wrangling with R. Learn how to manipulate dataframes using the tidyverse in R.
R
37
star
20

python-data-from-web

API and web scraping workshops
Jupyter Notebook
35
star
21

R-Data-Visualization-Legacy

D-Lab's 3 hour introduction to data visualization with R. Learn how to create histograms, bar plots, box plots, scatter plots, compound figures, and more using ggplot2 and cowplot.
28
star
22

R-Functional-Programming

The joy and power of functional programming in R
27
star
23

python-text-analysis-legacy

Text Analysis Workshops for UC Berkeley's D-Lab
Jupyter Notebook
26
star
24

programming-fundamentals

Introduction to Programming for UC Berkeley's D-Lab
Python
23
star
25

ANN-Fundamentals

Jupyter Notebook
23
star
26

DIGHUM101-2020

Jupyter Notebook
20
star
27

Python-Text-Analysis

D-Lab's 12 hour introduction to text analysis with Python. Learn how to perform bag-of-words, sentiment analysis, topic modeling, word embeddings, and more, using scikit-learn, NLTK, Gensim, and spaCy in Python.
Jupyter Notebook
20
star
28

sql-for-r-users

SQL for R Users, Workshop
HTML
19
star
29

Python-Deep-Learning-Legacy

D-Lab's 6 hour introduction to deep learning in Python. Learn how to create and train neural networks using Tensorflow and Keras.
Jupyter Notebook
17
star
30

awesome-dlab

😎 Awesome lists about all kinds of topics and tools interesting to D-Labbers
17
star
31

advanced-data-wrangling-in-R-legacy

Advanced-data-wrangling-in-R, Workshop
HTML
15
star
32

R-Census-Data-Legacy

Workshop on fetching and mapping census data with tidycensus
HTML
14
star
33

Geospatial-Fundamentals-in-QGIS

11
star
34

regular-expressions-in-python

Jupyter Notebook
10
star
35

Qualtrics-Fundamentals

D-Lab's 3 hour introduction to Qualtrics Fundamentals. Learn how to design and manage your own surveys in Qualtrics.
10
star
36

Data-Science-Social-Justice-2022

Materials for D-Lab / UC Berkeley Graduate Division's Data Science + Social Justice summer workshop. These materials provide an introduction to Python, natural language processing, text analysis, word embeddings, and network analysis. They also include discussions on critical approaches to data science to promote social justice.
Jupyter Notebook
10
star
37

Geocoding-in-R

HTML
9
star
38

Python-Data-Wrangling

D-Lab's 3-hour workshop diving deep into Pandas. Learn how to manipulate, index, merge, group, and plot data frames using Pandas functions.
Jupyter Notebook
9
star
39

efficient-reproducible-project-management-in-R

Efficient and Reproducible Project Management in R
HTML
9
star
40

Excel-Fundamentals

D-Lab's six-hour introduction to the basics of Microsoft Excel (with support materials for Google Sheets). Learn Excel functions for handling text, math, dates, logic, and calculations; learn to create charts and pivot tables.
9
star
41

fairML

Bias and Fairness in ML workshop
Jupyter Notebook
8
star
42

Python-Web-Scraping-Legacy

D-Lab's 3 hour introduction to web scraping in Python. Learn how to use APIs and scrape data from websites using the New York Times API and BeautifulSoup in Python.
Jupyter Notebook
7
star
43

regex-intro

Shell
6
star
44

Geospatial-Fundamentals-in-R-sp

HTML
6
star
45

Leaflet-Maps-in-R

A 3-hour intensive workshop to introduce the R Leaflet package
HTML
6
star
46

javascript-viz

A D-Lab intro to JavaScript visualization using the IPython notebook.
HTML
6
star
47

DIGHUM101-2023

Practicing the Digital Humanities, UC Berkeley Summer Session 2023
Jupyter Notebook
6
star
48

LaTeX-Fundamentals

TeX
6
star
49

DIGHUM101-2021

Jupyter Notebook
5
star
50

cloud-computing-working-group

5
star
51

data-security-fundamentals

Data Security Fundamentals
HTML
5
star
52

Python-Fundamentals

D-Lab's 3-part, 6 hour introduction to Python. Learn how to create variables, distinguish data types, use methods, and work with Pandas, using Python and Jupyter.
Jupyter Notebook
4
star
53

Python-Web-APIs

D-Lab's 2 hour introduction to using web APIs in Python. Learn how to obtain data from web platforms using the New York Times API as a case study.
Jupyter Notebook
3
star
54

quick-consulting-examples

Collection of quick pandas, python, and other coding examples based on real consulting requests.
Jupyter Notebook
3
star
55

dlab-berkeley.github.io

Tech overview site showcasing D-Lab's online offerings
CSS
3
star
56

visualization-in-Excel

3
star
57

Python-Web-Scraping

D-Lab's 2 hour introduction to web scraping in Python. Learn how to scrape HTML/CSS data from websites using Requests and Beautiful Soup.
Jupyter Notebook
3
star
58

Data-Science-Social-Justice

Materials for D-Lab / UC Berkeley Graduate Division's Data Science for Social Justice summer workshop. These materials provide an introduction to Python, natural language processing, text analysis, word embeddings, and network analysis. They also include discussions on critical approaches to data science to promote social justice.
Jupyter Notebook
3
star
59

DIGHUM101-2022

Practicing the Digital Humanities, UC Berkeley Summer Session 2022
Jupyter Notebook
3
star
60

Python-Geospatial-Fundamentals

D-Lab's 4-hour introduction to working with geospatial data in Python. Learn how to import, visualize, and analyze geospatial data in Python.
Jupyter Notebook
2
star
61

Basics-of-Excel

2
star
62

intro-maxqda

2
star
63

Python-Intermediate

D-Lab's 3-part, 6 hour workshop diving deeper into Python. Learn how to create functions, use if-statements and for-loops, and work with Pandas, using Python and Jupyter.
Jupyter Notebook
2
star
64

R-Data-Visualization

D-Lab's 2-hour introduction to data visualization with R. Learn how to create histograms, bar charts, box plots, scatter plots, and more using ggplot2.
R
2
star
65

IRB-Fundamentals

D-Lab's 3 hour introduction to the fundamentals of navigating Institutional Review Boards (IRB).
2
star
66

RStudio-Project-Management

Resources to help you start managing data science projects.
HTML
1
star
67

git-for-project-management

Using Git and GitHub for Project Management
1
star
68

R-package-development

R package development workshop
HTML
1
star
69

Git-Playground

This repository is for D-Lab workshops that require practicing with Git.
1
star
70

sas-intro

Introduction to SAS
TeX
1
star
71

R-Push-Ins

D-Lab's 4.5 hour "push-in" introduction to R, providing a brief survey of foundational R concepts and operations.
R
1
star
72

DEVP229-Spring2021

HTML
1
star
73

MAXQDA-Fundamentals

D-Lab's 2 hour introduction to MAXQDA. Learn how to conduct qualitative data analysis using MAXQDA.
1
star
74

sas-analysis

Data Analysis with SAS
SAS
1
star
75

R-Research-Design

1
star
76

ArcGIS-Online-Fundamentals

1
star
77

dlab-methods

CSS
1
star
78

Computational-Text-Analysis-2017

An introduction to Computational Text Analysis in four 2hr sessions designed to help beginners build intuition, and to interact with workflows for natural language processing, supervised, and unsupervised approaches. Created for CTAWG in 2017 by Ben Gebre-Medhin
HTML
1
star
79

Python-Data-Visualization-Pilot

D-Lab's 4-hour introduction to data visualization with Python. Learn how to create histograms, bar plots, box plots, scatter plots, compound figures, and more, using matplotlib and seaborn.
Jupyter Notebook
1
star
80

HAAS-Python-Workshop

Jupyter Notebook
1
star