  • Stars: 126
  • Rank 284,543 (Top 6%)
  • Language: Jupyter Notebook
  • Created almost 2 years ago
  • Updated 10 months ago

Repository Details

Python and Pandas are known to have issues around scalability and efficiency. You will learn how to use libraries such as Modin, Dask, Ray, and Vaex to overcome the problems faced by Pandas.

Don't forget to hit the ⭐ if you like this repo.

About Us

The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general big data information as well as big data case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering) student at Universiti Teknologi Malaysia.

📚 Big data processing

Big data processing involves the systematic handling and analysis of vast and complex datasets that exceed the capabilities of traditional data processing methods. It encompasses the storage, retrieval, and manipulation of massive volumes of information to extract valuable insights. Key steps include data ingestion, where large datasets are collected from various sources, and preprocessing, involving cleaning and transformation to ensure data quality. Advanced analytics, machine learning, and data mining techniques are then applied to uncover patterns, trends, and correlations within the data. Big data processing is integral to informed decision-making, enabling organizations to derive meaningful conclusions from their data, optimize operations, and gain a competitive edge in today's data-driven landscape.

Notes

Big Data: Pandas

Pandas is a powerful Python library for data manipulation and analysis, but it holds data in memory, so large datasets call for deliberate strategies. One is to process the data in smaller chunks using the 'chunksize' parameter of the Pandas read_csv function, which reads the file in manageable portions and prevents memory overload. Another is to reduce memory usage by choosing more memory-efficient data types where possible, for example a smaller integer type or the category type for repetitive strings. Skipping unneeded rows with the 'skiprows' parameter and loading only the required columns with the 'usecols' parameter during import can further improve performance. Together, these strategies make it possible to analyze large datasets in Pandas while keeping both computation time and memory use under control.
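The strategies above can be combined in a single read_csv call. The following is a minimal sketch, not the repository's own notebook code: the in-memory CSV, its column names (state, year, value, notes) and the chunk size are all hypothetical stand-ins for a real large file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a small in-memory CSV standing in for a large file.
rng = np.random.default_rng(0)
n_rows = 10_000
raw = pd.DataFrame({
    "state": rng.choice(["Johor", "Selangor", "Sabah"], size=n_rows),
    "year": rng.integers(2000, 2024, size=n_rows),
    "value": rng.random(n_rows),
    "notes": "unused free text",
})
csv_text = raw.to_csv(index=False)

# Load only the needed columns, with compact dtypes, 2,000 rows at a time.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["state", "year", "value"],
    dtype={"state": "category", "year": "int16", "value": "float32"},
    chunksize=2_000,
)

# Aggregate each chunk separately, then merge the partial results,
# so the full dataset is never held in memory as one DataFrame.
rows_seen = 0
totals = {}
for chunk in reader:
    rows_seen += len(chunk)
    sums = chunk.groupby("state", observed=True)["value"].sum()
    for state, partial in sums.items():
        totals[state] = totals.get(state, 0.0) + float(partial)

print(rows_seen)
print(totals)
```

The same call also accepts 'skiprows' when leading or known-irrelevant rows should be ignored. Per-chunk aggregation like this only works for results that can be combined from partial results (sums, counts, minima, maxima); a median, for instance, needs a different approach.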

Big Data: Alternatives to Pandas for Processing Large Datasets

Modin

Dask

Datatable

🎖️ Comparison between libraries

Big Data: Case study

Lab

Pandas

Modin

Dask

Comparison between libraries

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn for any other queries or feedback.

More Repositories

1

learn-github

A step-by-step guide to getting started with Git and GitHub for beginners.
HTML
737
star
2

learn-php

This course is designed to introduce students to the fundamental knowledge, technologies and components of web application development. The basic topics include standard HTML for content creation, CSS for content presentation, JavaScript for client-side logic, PHP for server-side logic and MySQL for data processing.
JavaScript
180
star
3

software-engineering

This course is designed to give students an introduction to an engineering approach to the development of high-quality software systems. It discusses the important software engineering concepts across the common software process models.
HTML
168
star
4

Python_EDA

This topic explains the implementation of exploratory data analysis (EDA). A total of 21 EDA case studies have been implemented using Malaysian datasets.
Jupyter Notebook
167
star
5

SLR-FC

Systematic Literature Review (SLR) using AI involves leveraging artificial intelligence techniques to automate and expedite the process of reviewing and synthesizing large volumes of scholarly literature.
116
star
6

special-topic-data-engineering

This course presents students with recent research and industrial issues pertaining to data engineering, database systems and technologies. Various topics of interest that directly or indirectly affect, or are influenced by, data engineering, database systems and technologies are explored and discussed.
Python
108
star
7

python-web

This topic explains how to implement web scraping and Python web development. Web scraping tools such as Scrapy, Beautiful Soup, and others will be covered, along with a case study based on a Malaysian website.
Jupyter Notebook
106
star
8

obsidian

Obsidian.md stands out as an exceptional note-taking application tailored specifically for academic writing. This repository is part of the activities for the Systematic Literature Review using AI workshop.
HTML
98
star
9

python-tutorial

Jupyter Notebook
92
star
10

dataset

84
star
11

obsidian-slr

This repository hosts an Obsidian vault tailored for conducting a Systematic Literature Review (SLR). This repository is part of the activities for the Systematic Literature Review using AI workshop.
73
star
12

undergraduate-project

The Final Year Project, commonly known as Projek Sarjana Muda (PSM), is a course that each undergraduate student must undertake and pass in order to graduate. It aims to equip students with knowledge and skills in problem solving and programming techniques through appropriate academic and research activities.
JavaScript
66
star
13

SECP3843

Python
64
star
14

learn-django

Python
62
star
15

HPDP

High performance data processing employs high performance computing (HPC) to process data, which is then translated into information and knowledge. The advent of high-performance computing and data analytics enabled real-time interrogation of extremely large data sets.
Jupyter Notebook
62
star
16

research-design

This course covers the fundamental steps for developing initial ideas into formal academic writing. Students are given mechanisms for digesting and transforming literature reviews in a way that leads to a proposed title.
Roff
59
star
17

BDM

Course covers big data fundamentals, processes, technologies, platform ecosystem, and management for practical application development.
Jupyter Notebook
50
star
18

Generative-AI-Playground

Generative-AI-Playground is a platform for experimenting with different generative models and techniques. It lets you try out advanced technologies like ChatGPT, Bing.AI and Gemini. This playground is a place where people can learn and practice using these models.
40
star
19

ai-tools

AI-powered literature review tools leverage machine learning to expedite and enhance the scholarly process of identifying, analyzing, and synthesizing relevant research.
39
star
20

research-material

Information related to research material that can be used by postgraduate students
37
star
21

learn-aspnet

This course introduces the fundamentals of web development using ASP.NET, with the aim to develop a database (SQL Server) driven website.
ASP.NET
35
star
22

drshahizan

My profile readme
29
star
23

drshahizan.github.io

CSS
27
star
24

phd

The daily life of a PhD student may differ significantly from that of an undergraduate or Masters student. There will be much more independence and very few 'taught' elements. A typical week will almost certainly include the same number of PhD study hours as a full-time job.
Python
27
star
25

courses

Course material-related information.
21
star
26

AI-Innovation

20
star
27

SLR-MIIT

16
star
28

mySMOKU

JavaScript
9
star
29

mybooks

This repository is used to store the necessary materials for writing an original book. It serves as a centralized location for authors to organize and access research materials, references, drafts, and other documents essential for the book-writing process.
8
star
30

trainee

6
star
31

SLR

5
star
32

data-analytics

5
star
33

learn-chatGPT

3
star
34

book-eda

3
star
35

myTOR

2
star
36

zenodo

1
star