• Stars: 2
  • Language: Jupyter Notebook
  • Created: about 5 years ago
  • Updated: about 5 years ago

Repository Details

Course Outline

Data wrangling is a core skill that everyone who works with data should be familiar with, since so much of the world's data isn't clean. Though this course is geared towards those who use Python to analyze data, the high-level concepts can be applied in all programming languages and software applications for data analysis.

Lesson 1: The Walkthrough

In the first lesson of this course, we'll walk through an example of data wrangling so you get a feel for the full process. We'll introduce gathering data, then download a file from the web and import it into a Jupyter Notebook. We'll then introduce assessing data and assess the dataset we just downloaded, both visually and programmatically, looking for quality and structural issues. Finally, we'll introduce cleaning data and use code to clean a few of the issues we identified while assessing. The goal of this walkthrough is awareness rather than mastery, so you'll be able to start wrangling your own data even after just this first lesson.

Lessons 2-4: Gathering, Assessing, and Cleaning Data (in Detail)

In the following lessons, you'll master gathering, assessing, and cleaning data. We'll cover the full data wrangling process with real datasets too, so think of this course as a series of wrangling journeys. You'll learn by doing and leave each lesson with tangible skills.
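
A minimal sketch of that Lesson 1 flow with pandas, assuming a hypothetical CSV hosted at a placeholder URL (the URL, file name, and the duplicate-row issue are illustrative assumptions, not the course's actual dataset):

```python
import pandas as pd
import requests

# Gather: download a CSV from the web (placeholder URL) and save it locally.
url = "https://example.com/data/sample.csv"  # hypothetical dataset location
with open("sample.csv", "wb") as f:
    f.write(requests.get(url).content)

# Assess: load the file and inspect it visually and programmatically.
df = pd.read_csv("sample.csv")
print(df.head())              # visual assessment: eyeball a few rows
df.info()                     # programmatic assessment: dtypes, non-null counts
print(df.duplicated().sum())  # programmatic assessment: duplicate rows

# Clean: fix one example quality issue (duplicates), then re-check.
df_clean = df.drop_duplicates()
assert df_clean.duplicated().sum() == 0
```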

More Repositories

1

DataAnalysis_WineCaseStudy

In this first case study, you'll perform the entire data analysis process to investigate a dataset on wine quality. Along the way, you'll explore new ways of manipulating data with NumPy and Pandas, as well as powerful visualization tools in Matplotlib.
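
A minimal sketch of that kind of exploration, assuming a semicolon-delimited winequality-red.csv file with quality and alcohol columns (the file name and columns follow the common UCI wine-quality dataset and are assumptions, not confirmed by this repository):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the wine data (assumed semicolon-delimited, as in the UCI dataset).
df = pd.read_csv("winequality-red.csv", sep=";")

# Programmatic exploration: summary statistics and a grouped comparison.
print(df.describe())
print(df.groupby("quality")["alcohol"].mean())

# Visualization: distribution of quality ratings.
df["quality"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Quality rating")
plt.ylabel("Number of wines")
plt.show()
```
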
Jupyter Notebook
3
star
2

cleaning-data

Cleaning your data is the third step in data wrangling: it is where you fix the quality and tidiness issues that you identified in the assess step. In this lesson, you'll clean all of the issues you identified in Lesson 3 using Python and pandas. This lesson will be structured as follows:
• You'll get remotivated (if you aren't already) to clean the dataset for Lessons 3 and 4: Phase II clinical trial data that compares the efficacy and safety of a new oral insulin for treating diabetes against injectable insulin.
• You'll learn about the data cleaning process: defining, coding, and testing.
• You'll address the missing data first (and learn why it is usually important to address these completeness issues first).
• You'll tackle the tidiness issues next (and learn why this is usually the next logical step).
• Finally, you'll clean up the quality issues.
This lesson will consist primarily of Jupyter Notebooks, of which there will be two types: one quiz notebook that you'll work with throughout the whole lesson (i.e. your work will carry over from page to page) and three solution notebooks. I'll pop in and out to introduce the larger conceptual bits. You will leverage the most common cleaning functions and methods in the pandas library to clean the nineteen quality issues and four tidiness issues identified in Lesson 3. Given your pandas experience, and since this isn't a course on pandas, these functions and methods won't be covered in detail. Regardless, with this experience and your research and documentation skills, you can be confident that when you leave this course you'll be able to clean any form of dirty and/or messy data that comes your way in the future.
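
To make the define-code-test pattern concrete, here is a minimal, hypothetical sketch (the DataFrame, column names, and issues below are invented for illustration and are not the lesson's actual clinical-trial data):

```python
import numpy as np
import pandas as pd

# Hypothetical messy table standing in for the lesson's dataset.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight_kg": ["70", "82", None, "65"],                    # quality: strings + a missing value
    "dose_start_end": ["10-20", "15-15", "20-25", "10-10"],   # tidiness: two variables in one column
})

# Define: weight_kg should be numeric, with the missing value imputed by the mean.
# Code:
patients["weight_kg"] = pd.to_numeric(patients["weight_kg"])
patients["weight_kg"] = patients["weight_kg"].fillna(patients["weight_kg"].mean())
# Test:
assert patients["weight_kg"].isna().sum() == 0
assert patients["weight_kg"].dtype == np.float64

# Define: dose_start_end holds two variables and should be split into two columns.
# Code:
patients[["dose_start", "dose_end"]] = (
    patients["dose_start_end"].str.split("-", expand=True).astype(int)
)
patients = patients.drop(columns="dose_start_end")
# Test:
assert {"dose_start", "dose_end"}.issubset(patients.columns)
```
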
Jupyter Notebook
3
star
3

INVESTIGATE_A_DATASET

Investigated an "Education" dataset using NumPy and pandas, going through the entire data analysis process: starting by posing a question and finishing by sharing the findings.
Jupyter Notebook
2
star
4

DATA_WRANGLING_PROJECT

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.
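
As a rough sketch of gathering data from a variety of sources and formats before assessing and cleaning it (the file names, URL, and JSON layout are placeholders, not the project's actual sources):

```python
import json

import pandas as pd
import requests

# A flat file already on disk (placeholder name).
table_a = pd.read_csv("source_a.csv")

# A tab-separated file downloaded from the web (placeholder URL).
url = "https://example.com/source_b.tsv"
with open("source_b.tsv", "wb") as f:
    f.write(requests.get(url).content)
table_b = pd.read_csv("source_b.tsv", sep="\t")

# Line-delimited JSON, e.g. records previously pulled from an API.
records = []
with open("source_c.json") as f:
    for line in f:
        records.append(json.loads(line))
table_c = pd.DataFrame(records)

# With the tables gathered, assessing and cleaning proceed as usual.
print(table_a.shape, table_b.shape, table_c.shape)
```
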
HTML
2
star
5

multiple_hypothesis_testing

How to set up hypothesis tests. In this lesson:
• You learned the null hypothesis is what we assume to be true before we collect any data, and the alternative is usually what we want to try and prove to be true.
• You learned about Type I and Type II errors. Type I errors are the worst type of errors, and these are associated with choosing the alternative when the null hypothesis is actually true.
• You learned that p-values are the probability of observing your data or something more extreme in favor of the alternative, given the null hypothesis is true.
• You learned that using a confidence interval built from the bootstrap samples, you can essentially make the same decisions as in hypothesis testing (without all of the confusion of p-values).
• You learned how to make decisions based on p-values: if the p-value is less than your Type I error threshold, then you have evidence to reject the null and choose the alternative; otherwise, you fail to reject the null hypothesis.
• You learned that when sample sizes are really large, everything appears statistically significant (that is, you end up rejecting essentially every null), but these results may not be practically significant.
• You learned that when performing multiple hypothesis tests, your errors will compound, so using some sort of correction to maintain your true Type I error rate is important. A simple but very conservative approach is the Bonferroni correction, which says you should divide your α level (or Type I error threshold) by the number of tests performed.
This lesson is often the most challenging for students throughout the entire nanodegree program. In order to really have the ideas here stick, it can help to put them down in your own words. Below are some quizzes to test that you are leaving with the main ideas from this lesson, as well as a link to a great blog post, written by one of your fellow classmates, to assist with the ideas of this lesson!
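
A minimal sketch of applying the Bonferroni correction to a set of p-values (the p-values below are made up for illustration):

```python
# Bonferroni correction: compare each p-value against alpha / number_of_tests.
p_values = [0.012, 0.048, 0.003, 0.070, 0.200]   # made-up p-values from five tests
alpha = 0.05

bonferroni_threshold = alpha / len(p_values)      # 0.05 / 5 = 0.01

for i, p in enumerate(p_values, start=1):
    decision = "reject the null" if p < bonferroni_threshold else "fail to reject the null"
    print(f"Test {i}: p = {p:.3f} vs corrected threshold {bonferroni_threshold:.3f} -> {decision}")
```
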
Jupyter Notebook
2
star
6

Loops

There are several techniques you can use to repeatedly execute Python code. While loops are like repeated if statements; the for loop iterates over all kinds of data structures. Learn all about them in this chapter.
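
A short illustration of both constructs:

```python
# A while loop repeats as long as its condition holds, like a repeated if statement.
countdown = 3
while countdown > 0:
    print("countdown:", countdown)
    countdown -= 1

# A for loop iterates directly over a data structure.
scores = {"alice": 91, "bob": 84}
for name, score in scores.items():
    print(name, "scored", score)
```
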
1
star
7

MoviesAppOne

Movies Project Part 1
Java
1
star
8

Data-Engineer-Nanodegree-Assignments

Design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets.
Jupyter Notebook
1
star
9

logistic-regression

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc.; each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.

Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is the process of estimating the parameters of a logistic model (a form of binary regression).

Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, represented by an indicator variable whose two values are labeled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors"); the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. Analogous models with a different sigmoid function instead of the logistic function can also be used, such as the probit model; the defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio.
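
A small sketch of the log-odds-to-probability relationship described above, plus a logistic regression fit on made-up data (the hours/pass data are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The logistic function converts log-odds (any real number) into a probability in (0, 1).
def logistic(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

print(logistic(0.0))   # log-odds of 0 -> probability 0.5
print(logistic(2.0))   # positive log-odds -> probability above 0.5

# Made-up binary data: hours studied (predictor) vs. pass/fail (0/1 outcome).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# The fitted coefficient lives on the log-odds scale: each extra hour adds coef_
# to the log-odds, i.e. multiplies the odds by exp(coef_), as described above.
print("log-odds per hour:", model.coef_[0][0])
print("odds ratio per hour:", np.exp(model.coef_[0][0]))
print("P(pass | 2.2 hours):", model.predict_proba([[2.2]])[0, 1])
```
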
Jupyter Notebook
1
star
10

DataAnalysis_EmissionsCaseStudy

Welcome to the Data Analysis Process: Case Study 2

In this second case study, you'll be analyzing fuel economy data provided by the EPA, or Environmental Protection Agency.

What is Fuel Economy? Excerpt from the Wikipedia page on fuel economy in automobiles: "The fuel economy of an automobile is the fuel efficiency relationship between the distance traveled and the amount of fuel consumed by the vehicle. Consumption can be expressed in terms of the volume of fuel to travel a distance, or the distance travelled per unit volume of fuel consumed."
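
Since fuel economy can be expressed either as distance per unit volume (e.g. miles per gallon) or as volume per distance (e.g. litres per 100 km), a quick conversion sketch:

```python
# Convert between the two common fuel-economy conventions.
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_litres_per_100km(mpg):
    """Miles per US gallon -> litres per 100 km."""
    km_per_litre = mpg * KM_PER_MILE / LITRES_PER_US_GALLON
    return 100.0 / km_per_litre

print(round(mpg_to_litres_per_100km(30), 2))  # a 30 MPG car uses about 7.84 L/100 km
```
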
Jupyter Notebook
1
star
11

binomial_distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own boolean-valued outcome: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p). A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e. n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution remains a good approximation, and is widely used.
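
A quick sketch of these quantities with SciPy plus a NumPy simulation (the values of n and p are chosen arbitrarily for illustration):

```python
import numpy as np
from scipy import stats

n, p = 10, 0.3   # ten independent yes/no trials, each succeeding with probability 0.3

# Exact probabilities from the binomial pmf/cdf.
print(stats.binom.pmf(4, n, p))   # P(X = 4)
print(stats.binom.cdf(4, n, p))   # P(X <= 4)

# The same distribution approximated by simulating many Bernoulli processes.
rng = np.random.default_rng(42)
samples = rng.binomial(n, p, size=100_000)
print((samples == 4).mean())      # close to the exact pmf value above
```
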
Jupyter Notebook
1
star
12

Data-Engineer-Nanodegree-Projects-Udacity

Projects done in the Data Engineer Nanodegree by Udacity.com.

Course 1: Data Modeling
• Introduction to Data Modeling: understand the purpose of data modeling; identify the strengths and weaknesses of different types of databases and data storage techniques; create a table in Postgres and Apache Cassandra.
• Relational Data Models: understand when to use a relational database; understand the difference between OLAP and OLTP databases; create normalized data tables; implement denormalized schemas (e.g. STAR, Snowflake).
• NoSQL Data Models: understand when to use NoSQL databases and how they differ from relational databases; select the appropriate primary key and clustering columns for a given use case; create a NoSQL database in Apache Cassandra.
• Project: Data Modeling with Postgres and Apache Cassandra
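
As a small illustration of the "create a table in Postgres" item above, a sketch using the psycopg2 driver (the connection settings and table definition are placeholders, not taken from the actual projects):

```python
import psycopg2

# Placeholder connection settings; adjust for your own Postgres instance.
conn = psycopg2.connect("host=127.0.0.1 dbname=coursedb user=student password=student")
cur = conn.cursor()

# A simple table definition invented for illustration.
cur.execute("""
    CREATE TABLE IF NOT EXISTS song_plays (
        play_id    SERIAL PRIMARY KEY,
        start_time TIMESTAMP NOT NULL,
        user_id    INT NOT NULL,
        song_id    VARCHAR,
        artist_id  VARCHAR
    );
""")
conn.commit()

cur.close()
conn.close()
```
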
Jupyter Notebook
1
star