Repository Details

This topic explains the implementation of exploratory data analysis (EDA). A total of 21 EDA case studies have been implemented using Malaysian datasets.

🌟 Hit star button to save this repo in your profile

About Us

The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general Exploratory Data Analysis (EDA) information as well as EDA case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering), Universiti Teknologi Malaysia student.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and summarizing a dataset to understand its characteristics, identify patterns, and gain insights into the data. EDA is typically performed before more advanced statistical and machine learning techniques are applied and helps in forming hypotheses, selecting appropriate modeling approaches, and ensuring data quality. Here are some key components and techniques used in EDA:

  1. Data Summary: Begin by understanding the basic information about the dataset, such as the number of rows and columns, data types, missing values, and summary statistics (mean, median, standard deviation, etc.).

  2. Data Visualization: Visualizing data through plots and charts can provide a clearer understanding of its distribution and patterns. Common types of visualizations include histograms, box plots, scatter plots, and bar charts.

  3. Data Distribution: Analyze the distribution of variables to determine whether they follow normal, uniform, or other types of distributions. This can impact the choice of statistical tests and modeling techniques.

  4. Correlation Analysis: Explore the relationships between variables using correlation matrices, scatter plots, and other correlation measures. This helps identify potential dependencies and multicollinearity.

  5. Outlier Detection: Identify and handle outliers in the data. Outliers can significantly affect statistical measures and model performance.

  6. Categorical Variables: Examine the distribution of categorical variables through frequency tables, bar plots, and pie charts. This helps understand the composition of categorical data.

  7. Data Transformation: Apply transformations (e.g., log transformation, standardization) to make the data more suitable for analysis, especially if it doesn't meet assumptions of statistical methods.

  8. Feature Engineering: Create new variables or features that might be more informative or relevant for the analysis. This could involve aggregating, combining, or extracting information from existing variables.

  9. Missing Data Handling: Deal with missing data, either by imputing missing values or excluding incomplete records. The choice of method depends on the nature of the data and the problem at hand.

  10. Hypothesis Testing: If relevant, perform hypothesis tests to determine whether observed differences or relationships in the data are statistically significant.

  11. Encoding and Scaling: Consider scaling numeric variables and encoding categorical variables for modeling. This can include one-hot encoding, label encoding, or other techniques.

  12. Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of the data while preserving important information.

  13. Time Series Analysis: For time series data, analyze trends, seasonality, and autocorrelation patterns. Techniques like autocorrelation plots and decomposition can be helpful.

  14. Geospatial Analysis: When dealing with geographic data, use maps, geospatial plots, and spatial statistics to understand spatial patterns and relationships.

  15. Text Analysis: If the dataset contains text data, perform text mining and sentiment analysis to extract insights from the textual content.
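
Several of the steps above (data summary, missing values, correlation, outlier detection, and categorical frequencies) can be sketched with Pandas as follows. This is a minimal illustration rather than the repository's own code: the column names and values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "price": [250.0, 310.0, 295.0, np.nan, 2800.0, 275.0],
    "rooms": [2, 3, 3, 2, 3, 2],
    "state": ["Johor", "Selangor", "Johor", "Perak", "Selangor", "Johor"],
})

# Data summary: shape, data types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Missing data: count missing values per column.
missing = df.isna().sum()

# Correlation between the numeric columns (rows with NaN are skipped pairwise).
corr = df[["price", "rooms"]].corr()

# Outlier detection with the 1.5 * IQR rule (one convention among several).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# Categorical variables: frequency table.
state_counts = df["state"].value_counts()

print(missing, corr, outliers, state_counts, sep="\n\n")
```

Whether a flagged row is a data error or a genuine extreme value still needs a manual look; the rule only points at candidates.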

EDA is an iterative process, and the specific techniques and tools used can vary depending on the nature of the data and the objectives of the analysis. It plays a crucial role in gaining an initial understanding of the data, guiding subsequent analysis, and making informed decisions about the next steps in a data science or analytical project.
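
As an example of the transformation and encoding steps listed earlier, the sketch below applies a log transformation, a z-score standardization, and one-hot encoding. The `income` and `region` columns are hypothetical, and the choice of `log1p` is an assumption that suits non-negative, right-skewed values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one skewed numeric column and one categorical column.
df = pd.DataFrame({
    "income": [1200.0, 3500.0, 4800.0, 52000.0],
    "region": ["north", "south", "north", "east"],
})

# Log transformation to reduce right skew (log1p also handles zeros).
df["income_log"] = np.log1p(df["income"])

# Standardization: rescale to zero mean and unit sample standard deviation.
df["income_z"] = (df["income_log"] - df["income_log"].mean()) / df["income_log"].std()

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df, columns=["region"], prefix="region")

print(encoded)
```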

Why is EDA so important in data science?

✅️ The main purpose of EDA is to help you look at the data before making any assumptions. In addition to better understanding the patterns in the data or detecting unusual events, it also helps you find interesting relationships between variables.

✅️ Data scientists can use exploratory analysis to ensure that the results they produce are valid and relevant to desired business outcomes and goals.

✅️ EDA also helps stakeholders by verifying that they are asking the right questions.

✅️ EDA can help to answer questions about standard deviations, categorical variables, and confidence intervals.

✅️ Once the exploratory analysis is complete and insights have been drawn, its findings can be used for more complex data analysis or modeling, including machine learning.
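
To make the point about standard deviations and confidence intervals concrete, here is a small sketch that uses only the Python standard library. The sample values are hypothetical, and the interval uses a normal approximation, which is an assumption; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample, e.g. values taken from one numeric column.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

mean = statistics.mean(sample)
std = statistics.stdev(sample)            # sample standard deviation (n - 1)
std_err = std / math.sqrt(len(sample))    # standard error of the mean

# Normal-approximation 95% confidence interval for the mean.
z = statistics.NormalDist().inv_cdf(0.975)
ci_low, ci_high = mean - z * std_err, mean + z * std_err

print(f"mean={mean:.2f}, std={std:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```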

Python

👉 Python is a popular programming language for data science and has several libraries and tools that are commonly used for EDA, such as:

  1. Pandas: a library for data manipulation and analysis.
  2. NumPy: a library for numerical computing in Python.
  3. Scikit-learn: a machine learning library that also includes tools for data preprocessing, feature selection, and dimensionality reduction, which are useful during EDA.
  4. Matplotlib: a plotting library for creating visualizations.
  5. Seaborn: a library based on matplotlib for creating visualizations with a higher-level interface.
  6. Plotly: an interactive data visualization library.

In EDA, you might perform tasks such as cleaning the data, handling missing values, transforming variables, generating summary statistics, creating visualizations (e.g. histograms, scatter plots, box plots), and identifying outliers. All of these tasks can be done using the above libraries in Python.
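
A minimal sketch of two of the visualizations mentioned above, a histogram and a box plot, using Matplotlib with NumPy and Pandas. The data is randomly generated for illustration, the output file name is hypothetical, and the `Agg` backend is chosen only so the script runs without a display; in a Jupyter notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric column drawn from a normal distribution.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=200)})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shape of the distribution.
ax_hist.hist(df["value"].to_numpy(), bins=20)
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, and potential outliers.
ax_box.boxplot(df["value"].to_numpy())
ax_box.set_title("Box plot")

fig.tight_layout()
fig.savefig("eda_overview.png")  # hypothetical output file name
```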

📖 Notes

Basic Concept

Code & Practice

Videos

Kaggle: Notebook

Github

| No. | Repository Name | Description |
|---|---|---|
| 1 | PacktPublishing/Hands on Exploratory Data analysis with Python | This repository is likely associated with a book or course from Packt Publishing, focusing on hands-on exploratory data analysis with Python. It may contain code examples and materials for learning EDA. |
| 2 | code4kunal/eda-python-examples | This repository likely contains Python examples and code snippets for exploratory data analysis (EDA). It may serve as a resource for those looking to learn EDA techniques with Python. |
| 3 | SouRitra01/Exploratory-Data-Analysis-EDA-in-Banking-Using-Python | This repository appears to be focused on conducting exploratory data analysis (EDA) in the context of banking using Python. It may contain datasets and code for EDA in the banking domain. |
| 4 | sandipanpaul21/EDA-in-Python | This repository is likely dedicated to exploratory data analysis (EDA) in Python. It may contain Python scripts, Jupyter notebooks, and related materials for EDA projects. |
| 5 | vharivinay/python-eda-viz | This repository may be focused on Python-based exploratory data analysis and data visualization. It could provide code and examples for creating data visualizations during EDA. |
| 6 | demonpratapdemon/Exploratory-Data-Analysis-EDA-and-PreProcessing | This repository seems to cover both exploratory data analysis (EDA) and data preprocessing in Python. It may contain code and resources for these data preparation tasks. |
| 7 | PacktPublishing/Python-for-Data-Analysis-step-by-step-with-projects- | This repository is likely associated with a book or course from Packt Publishing, focusing on Python for data analysis with step-by-step projects. It may include code and project materials. |
| 8 | sandyy2505/Cardio Good Fitness Project | This repository may contain code and data related to a fitness project, possibly involving data analysis and visualization in the context of cardio fitness. |
| 9 | ajaymache/Data analysis of used car database | This repository is likely focused on data analysis of a used car database. It may provide Python code and data for analyzing and exploring information related to used cars. |

📖 Lab

| No. | Dataset | Colab | GitHub |
|---|---|---|---|
| 1 | Boston | Open in Colab | Open in GitHub |
| 2 | Car Features and MSRP | Open in Colab | Open in GitHub |
| 3 | Housing Dataset | Open in Colab | Open in GitHub |
| 4 | United Nations Development Corporation | Open in Colab | Open in GitHub |

🌟 Case Study: Exploratory Data Analysis

The provided list comprises a collection of case studies, each with a title and accessibility information on platforms like Colab and GitHub. These case studies likely involve data analysis and exploration. For instance, "404 Error" may involve exploring property-related data in Kuala Lumpur, while "Alrite" could be centered around the exportation of plantation products in Sarawak. "BEFE" appears to focus on COVID-19 clusters in Malaysia, "Boboiboy" on property listings in Kuala Lumpur, and "COLBY" on the results of the 14th General Election in Malaysia. "FANTOM" likely tracks daily recorded COVID-19 cases at the state level, "HAHA" pertains to foreign direct investment in Malaysia, and "HD" may involve land usage analysis in Tampin for 2021. Other case studies cover topics such as elections, healthcare, real estate, population, and more, providing a diverse range of data exploration possibilities.

Automated EDA Tools

EDA is a vital but time-consuming task in a data project. Here are 10 open-source tools that generate an EDA report in seconds.

| Library | Description |
|---|---|
| SweetViz | In-depth EDA report in two lines of code. Covers information about missing values, data statistics, etc. Creates a variety of data visualizations. Integrates with Jupyter Notebook. |
| Pandas-Profiling | Generates a high-level EDA report of your data in no time. Covers information about missing values, data statistics, correlation, etc. Produces data alerts. Plots data feature interactions. |
| DataPrep | Supports Pandas and Dask DataFrames. Interactive visualizations. 10x faster than Pandas-based tools. Covers information about missing values, data statistics, correlation, etc. Plots data feature interactions. |
| AutoViz | Supports CSV, TXT, and JSON. Interactive Bokeh charts. Covers information about missing values, data statistics, correlation, etc. Presents data cleaning suggestions. |
| D-Tale | Runs common Pandas operations without code. Exports the code of the analysis. Covers information about missing values, data statistics, correlation, etc. Highlights duplicates, outliers, etc. Integrates with Jupyter Notebook. |
| dabl | Primarily provides visualizations. Covers a wide range of plots: scatter pair plots, histograms, and target distribution. |
| QuickDA | Gives an overview report of the dataset. Covers information about missing values, data statistics, correlation, etc. Produces data alerts. Plots data feature interactions. |
| Datatile | Extends Pandas describe(). Provides column stats: column type count, missing values, column datatype. Mostly statistical information. |
| Lux | Provides visualization recommendations. Supports EDA on a subset of columns. Integrates with Jupyter Notebook. Exports the code of the analysis. |
| ExploriPy | Performs statistical testing. Column type-wise distribution: continuous, categorical. Covers information about missing values, data statistics, correlation, etc. |

Big Data: The Vital Role of Exploratory Data Analysis (EDA)

Big Data refers to the vast and complex datasets that exceed the capabilities of traditional data processing tools. It is characterized by the three V's: volume (large amounts of data), velocity (rapid data generation), and variety (different data types). Exploratory Data Analysis (EDA), on the other hand, is a data analysis approach that involves summarizing, visualizing, and understanding the key characteristics of a dataset to uncover insights and patterns. In the context of Big Data, EDA plays a crucial role in making the data more manageable by identifying relevant subsets, trends, and anomalies, enabling data scientists to extract meaningful information and inform decision-making processes.
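
One common way to explore data that does not fit comfortably in memory is to read it in chunks and accumulate summaries as you go. The sketch below illustrates the idea with Pandas; the CSV content, column names, and chunk size are hypothetical, and an in-memory string stands in for what would normally be a large file on disk.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical CSV content; in practice this would be a large file path.
csv_data = io.StringIO(
    "state,cases\n"
    "Johor,10\nSelangor,25\nJohor,7\nPerak,3\nSelangor,30\nPerak,5\n"
)

total_rows = 0
cases_by_state = Counter()

# Read the data in small chunks so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)
    # Aggregate within the chunk, then merge into the running totals.
    for state, cases in chunk.groupby("state")["cases"].sum().items():
        cases_by_state[state] += int(cases)

print(total_rows, dict(cases_by_state))
```

Sums and counts combine cleanly across chunks; statistics such as the median need a different approach, for example sampling or a library built for out-of-core data.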

Feature Engineering

Feature engineering in data science is the process of selecting, transforming, and creating relevant attributes or variables from raw data to improve the performance of machine learning models. It involves identifying patterns, relationships, and meaningful information within the data, and then designing or modifying features to enhance the model's ability to make accurate predictions or classifications. Effective feature engineering can lead to increased model accuracy, reduced overfitting, and a better understanding of the underlying data, making it a critical step in the data preprocessing pipeline for machine learning tasks.
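
The three kinds of feature creation described here (extracting, combining, and aggregating) might look like the following in Pandas. This is an illustrative sketch with hypothetical columns and values, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "date": ["2021-01-15", "2021-02-20", "2021-02-25", "2021-03-05"],
    "state": ["Johor", "Johor", "Perak", "Perak"],
    "sales": [100.0, 150.0, 80.0, 120.0],
    "units": [10, 12, 8, 10],
})

# Extract: derive a month feature from the date.
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

# Combine: ratio of two existing columns.
df["price_per_unit"] = df["sales"] / df["units"]

# Aggregate: per-state mean, broadcast back to every row of that state.
df["state_mean_sales"] = df.groupby("state")["sales"].transform("mean")

print(df)
```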

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
