A very Long never ending Learning around Data Engineering & Machine Learning
New Tech
Interesting Reads
- How to choose a Distributed Database
- Cockroach DB Architecture
- Amundsen Review
- Deep Dive - Foundation DB
- The What, Why, and When of Single-Table Design with DynamoDB
- How To Manage And Monitor Apache Spark On Kubernetes
- Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible
- 8 Practical Use Cases of Change Data Capture
- Apache Iceberg - Links
- Kubernetes Port Forwarding Manager
- Querying Parquet with Precision using DuckDB - Much faster compared to Pandas
- What is Apache Pinot - Usecases & Architecture
- Change Data Streaming Patterns in Distributed Systems
- Cuckoo Hashing - An alternative to chaining and linear probing for collision handling
- Riak Database
- Database Indexing
- Parallel Databases using Map Reduce
- REST vs GraphQL
- Linux Namespace & Control Group (cgroup)
- SQL Lexical Structure
- Everything about the Linux kernel
Weekly Digest
- How #dataengineering gets complicated over time
- What is eBPF - Sandboxing Programs inside #linux Kernel
- Absolute Basic Explanation of SSTable & Log Structured Merge Trees - Sorted String Table & Faster Random Writes
The Data Engineering
Level 0
- Getting started with #dataengineering Volume 6
- Getting started with Dataengineering Volume 5
- Getting started with Data Engineering, volume 4
- Getting started with Data Engineering, volume 3
- Getting started with Data Engineering, volume 2
- Getting started with Data Engineering, volume 1
- Getting started with #dataengineering from basics
- Apache Airflow 2.0
- Some Interesting essentials while learning Apache Airflow
- Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release
- #getdbt or Data Build Tools interface across all major Data Workflow Management Platform
- Apache Superset - An #opensource Fully Featured Business Intelligence Application
- The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration
- Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier!
- Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL
- Apache Spark is NOT a Map but an MPP/MPI Engine
- Apache Hudi - Design Principles
- OpenTelemetry specification V1.0
- Everything Around PySpark Pandas UDF
- Important skill-set of a Dataengineer - Reduce Cost
- Everything on PyFlink - Python with Apache Flink
- Delta Lake Cheat Sheet
- Dataengineering schedule breakdown, a very flexible estimate
- Parquet - Introduction & Design, An OpenSource File Format
- SQL - Avoiding Antipatterns
- Explaining Apache Kafka - In children's book format
- The Perfect #dataengineering: Top INVALID Reasons behind #datapipelines failures
- What is ETL
- What is Proxy & Reverse Proxy
Level 1
- DataEngg Skills to work with DataScience
- Data Quality, A necessity for Data Driven Projects
- Essential Cloud Skills for Data Engineering
- Open Source Technologies in Data Engineering
- Kubernetes Fundamentals Required as a Data Engineer
- Apache Superset, OSS Business Intelligence for 2021
- #apachekafka as a Database - Summary of both sides, Arguments, Trade-offs & exceptional quotes
- Processing Guarantees in #apachekafka
- The best resource - Change Data Analysis with Debezium and Apache Pinot
- Optimizing Apache Kafka Producers & Consumers
- Redpanda - A NON-JVM Streaming Platform for mission critical workloads
- Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
- Apache Iceberg - an open table format for huge analytic datasets
- Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
- ZooKeeper, a distributed, open-source coordination service for distributed applications
- Apache Iceberg - Partition Evolution, it's simple but it's so amazing
- ApacheKafka without ZooKeeper Sneak Peek
- Why Data Discovery is important for Data Engineering
- Queue vs Log - Event driven Architecture
- Database Indexing
Level 1.1
- Multiple criteria search at scale with Apache Pinot & Theta Sketches
- VM vs Containers - Similar but Different
- State of Trino aka PrestoSQL
- ETL is an extremely important component for any modern business
- Top 5 ways to complicate a #dataengineering pipeline/application
- Leader election is commonly used aka Master/Namenode/Leader/Driver
- Dagster vs Airflow - A comparison
- About Single Source of Truth in DataEngineering
- Change Data Capture for Distributed Databases
- Deep Dive on Why Apache Iceberg for Change Data Capture, using Apache Flink
- OpenMetadata is an Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right
- About Lakehouse
- etcd - A distributed, reliable key-value store for the most critical data of a distributed system
- What is Redis
- What is Hive
- What is Data Warehouse - An Introduction
- Fundamentals of Designing Data Warehouse
- Database Relational Model - A way of looking at Data
- Data Engineering Infrastructure Notes
Dataengineering Core
- A Data Engineering Story - The Beginning
- Data Engineering - More towards Data Science or Data Analytics or ...
- Data Engineering Interview Patterns
- Basic Checklists while learning Apache Spark
- #apachespark for Distributed Analytics or #businessinteligence Platform - Worth or not ?
- Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem
- Nextflow is a Workflow Manager exclusively for #bioinformatics
- #apachespark Project Zen Update - Making PySpark Better
- Design - Exactly Once Delivery & Transactional Messaging in #apachekafka
- Underrated but important skill of a Data Engineer
- Fallacies of Distributed Systems
- As a Data Engineer, some Essentials I did which really helped Data Scientists and the Team
- A very normal Data Engineering work
- What can go wrong in Distributed Data Systems
- Architect and build a #machinelearning use case end to end using Amazon SageMaker
- Around Data Discovery or Metadata Management Platforms
- Amazon S3 Object Lambda - Provide Different Views of Data to Multiple Applications
- Full Stack Data Engineer
- Data cleaning is Hard but why
- Most exciting things about #dataengineering
- The real impact of Disks on #rocksdb State Backend in Apache Flink
- Tips for Distributed System High Availability
- Interesting way of collaboration between a Dataengineer & Datascientist
- Building DistributedLog: High-performance replicated log service
- Whiz: Data Analytics Execution Framework based on Intermediate Data
- Adding unlimited Nodes in a #dataengineering platform will eventually drop
- A typical Data Engineering Pipeline
- 'Log' is a fundamental component of a Data Engineering Ecosystem
- Flink CDC
- Readings Around Databases
- Code Review Best Practice, because developers hate code reviews
- Important Performance Criteria to measure DataEngineering Systems
- Database Internals - Storage
- Data Integration for Databases & Data Warehousing - An Introduction
- What is Protocol Buffer - An excellent important data interchange format for serialization, "Zero Copy" format
- Memcached, Redis & Elasticache - To accelerate your data or databases
- What is LSM-Tree
- Tor aka Onion Router - How does it work?
Infrastructure
- SQL Database on Kubernetes - Best Practices
- Devtron - An Open Source DevOps on Kubernetes, written in Go
- Most Popular #opensource BI & Data Analytics Platforms
- datapipelines Dataframe API is now available with #apachebeam
- Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink
- Kubernetes API Structure
- Architecting a Kubernetes Infrastructure
- Exploring Kubernetes Operator Pattern
- Docker is an integral part of Data Engineering ML pipeline & that makes security extremely essential
- Rack awareness for #apachekafka Streams Proposal
- Dolt is Git for Data
- Toward Better Data Culture From First Principles by Ube
- Fast and Reliable Schema-Agnostic Log Analytics Platform by Uber
- Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
- Diving Deep on S3 Consistency - Insightful
- Ray - General Purpose ML Infrastructure
- Kubernetes Hardening Guide by National Security Agency
- Everything around Load Balancer
- Data Lakehouse - Is it really the end of Datawarehouse
- Real-Time Exactly-Once Processing with Apache Flink, Kafka, and Pinot @uber
- WTH is Kubernetes Operator? - An Introduction
- Lessons Learned from Sharding Postgres
- What is Kubernetes - An introduction
- ELK Stack - Introduction of Scalable Monitoring
- What is NGINX - An Introduction
- What is Load Balancer - An Introduction
- What is OAuth 2 - Introduction|API Based Authorization
- Kubernetes & Networks - It's hard because multiple options are available
- Kubernetes Reconciliation
- Troubleshooting Kubernetes App
- Kubernetes Best Practices - Classics
- Paper: Serverless Computing: A Survey of Opportunities, Challenges, and Applications
- Choosing the Kubernetes Local Cluster
- Monitoring Kubernetes - Fundamentals of #kubernetes Infrastructure Monitoring
- Kubernetes Controller Manager
- Kubernetes: Why the Pod is still in the Pending State?
- Kubernetes Liveness & Readiness Probe
- Kubernetes Pod/Node Affinity
SQL
- Advanced SQL - Reference CS 564 Database Management Systems
- SQL and Advanced SQL - An asset
- Database Indexing - Almost Everything
- Tuning SQL queries - Tips for writing efficient & faster Queries
- Database Schema Design - Schema Design is a Complicated Necessity
- SQL Query Processing Plan - Basics
- Revisiting SQL Basics - The beginning of Data Science & Data Engineering
- Distributed Advanced Queries - Presto/Trino
- SQL Notes For Professionals, 100+ pages
- Table Partitioning
- SQL complex Queries - Nested Queries & Aggregation
- Gossip Protocols - Designed for Data Consistency & Fault-Tolerance
- Table Partitioning - An Important Concept
- Database Concurrency Control : 2 Phase Locking
- Database Entity Relationship Model
- SQL Join Fundamentals
- Database Indexing
- Database Indexing Notes
- SQL Injection Introduction
- SQL Constraints Fundamentals
- The fundamental of writing SQL queries is different from
- Building a NoSQL Database using Git
- Against SQL - An article on What is not good with #sql
- Using `EXPLAIN` for Data Problems - Things beyond SQL - 10 SQL Queries to Blow Your Mind
- Views, Stored Procedures, Functions & Triggers - SQL
- SQL Transaction & ACID Property
- How to Solve complex SQL queries
- Apache Spark SQL - The Introduction from RDBMS till SparkSQL
- Advanced SQL & Functions
- Basic & Intermediate on Database Sharding
- Complex Database Queries with PostgreSQL
- Query Evaluation - Technical Details "when you execute SQL Query"
- What is a Materialized View & how does it work in Distributed Databases
- Breaking Down NoSQL Sharding, Replication & Consistency
- Database Query Optimization Technique
- Intermediate SQL
- SQL Stored Procedures
- OLAP & OLTP - Datawarehouse Data Mining
- Database Fundamentals
- SQL Subqueries
- NewSQL Introduction - Basic to Intermediate
- SQL Intermediate & basics Deep Dive
- SQL Basics - The Starting point
- Data Warehousing & OLAP Technology
- Snowflake Datawarehouse
- RelationalAlgebra & SQL
- Logical Schema Design: SQL Database
- Kubernetes Pod Internals - Deep Dive
- The Illustrated Children's Guide to Kubernetes
- SQL Subqueries by Example
- What is Write-Ahead-Logging (WAL)
- SQL Transactions - a sequence of database operations
- Linux Productivity Tools - This is a Data Infrastructure necessity
NoSQL
- [NoSQL & MongoDB](https://www.linkedin.com/posts/iamabhishekchoudhary_nosql-mongodb-activity-6874231633654935553-Z66u)
- CouchDB Introduction - Document Storage Database
Machine Learning
MLOPS
- Machine Learning Workflow
- Dummy Notes On Machine Learning Infrastructure
- Machine Learning Feature Store
- Deploying #machinelearning model in Production is really HARD but #MLOps can fix that.
- List of #machinelearning & #dataengineering Technologies will be following in 2021
- MLOps - ZenML #machinelearning with reproducible pipelines
- Why? Data Versioning is a complicated problem for Dataengineers
- Explainable AI Cheat Sheet
- Designing Machine Learning infrastructure
- What is Log - Foundation behind Databases & Distributed Systems
- How does the GIT version control work?
Project
- Streamlit Healthcare Machine Learning Data App
- Dstack AI - An open-source tool to develop data applications with Python
- Adversarial Robustness Toolbox - a Python library for #machinelearning Security
- Biopython is a set of freely available tools for biological computation written in #Python
Insightful
- Time to Know More about DASK
- DataEngineering vs Machine Learning
- A good #machinelearning Model is only possible with a good quality of #data.
- Statistics for #softwareengineer
- Monitoring #machinelearning Applications
- Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven
- Short Notes on - Open source #machinelearning Tracking System
- The best example of Randomness is - #machinelearning model in Production.
- Flyte is declarative, structured, and highly scalable cloud-native workflow orchestration platform for Distributed Machine Learning
- Tips for Distributed System High Availability
- Building DistributedLog: High-performance replicated log service
- How to scale Kubernetes with Assurance
- Apache Calcite - Building SQL Query Processor from Scratch over Lucene
- Database Storage
- ACID is the foundation of Database, BASE is for NoSQL Databases
- Some common elements behind many Distributed Databases
- Failure Recovery in TrinoDB
- What is LLVM
- What is Garbage Collection
- What is Canary Deployment
Paper
Distributed System
Crazy
- The Snowflake Paper - Core idea is to build an enterprise-ready #datawarehouse solution for the #cloud
- Most important points around Distributed #dataengineering Platform
- Fundamental of #distributedsystems Scaling - Avoiding Co-ordination
- Technical Debt in #dataengineering #softwareengineering
- Paper on Wander Join: Online Aggregation via Random Walks, Join problem
- The Delta Lake Paper - High-Performance ACID Table Storage
- Dynamo - AWS Highly Available Key-value Store #distributedsystem
- An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, A Single SQL for all
- Secure & Robust Machine Learning in #healthcare
- Progress in Medical Science using #deeplearning
- The Amazon Redshift Paper - A fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing #businessintelligence tools
- Advancing #drugdiscovery via Artificial Intelligence
- Apache Calcite is a dynamic data management framework
- Lakehouse - A Paper on new Generation of #datawarehouse technology
- Calvin: Fast Distributed Transactions for Partitioned Database Systems
- Presto or Trino - #SQL on Everything (The Design, Motivation & Performance) #presto
- Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
- Apache Kafka Paper : Distributed Messaging System for Log Processing
- Paper: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size
- Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
- Azure Data Lake Store(ADLS) is a fully-managed, elastic, scalable, and secure file system that supports #hadoop distributed file system (HDFS) and Cosmos semantics
- An LFU (Least Frequently Used) Cache eviction algorithm of O(1) Runtime complexity
- The Berkeley View on Cloud Computing - Paper
- The Google File System - The Paper
- Paper: Report on Distributed Deep Learning on Data Systems
- Crystal: A Unified Cache Storage System for Analytical Databases
- VoltDB
- Magnet - Apache Spark Shuffle mechanism to handle petabytes of daily shuffled data and clusters with thousands of nodes
Set 2
- Paper: Real-time Data Infrastructure @ Uber
- Paper: DBLog, A Watermark Based Change-Data-Capture Framework by Netflix
- Paper: Large Scale Distributed Systems Tracing Infrastructure
- Paper: Paxos vs Raft: Distributed Consensus
- Paper: Sorting in a #distributedsystem
- Paper: A large scale analysis of hundreds of in-memory cache clusters
- Design & Architecture of Amazon Timestream - Streaming at Scale
- Distributed System Synchronization
- Paper: Consistent hashing - Resizing cluster or Load in a #distributedsystems with a simple concept
- Deep Dive - Foundation DB (unbundled database, OLTP, strict serializability, multi-version concurrency control, optimistic concurrency control, simulation testing)
- Distributed Database - ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook Database
- BigData Metadata Management System
- Machine Learning for Database Optimizations
- SingleStore - A Distributed Database Management System. It's really more than a Database
- ArrowSAM, in-memory genomics SAM format based on Apache Arrow
- Realtime Data Processing FB - Deep Dive on #streamprocessing
- ArangoDB - Native multi-model NoSQL Distributed #database, From #sql to NoSQL
- To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
- How to bring robustness while Designing Large Scale Complex Systems
- Facebook Datawarehouse
- Building a performant OLTP system on an open-source columnar format, and supporting near-zero overhead data export to external tools
- Towards Demystifying Serverless Machine Learning Training
- Paper: Scalable Linear Algebra on top of Distributed Databases, this will simplify Machine Learning on Databases
- Paper: Are You Sure You Want to Use MMAP in Your Database Management System
- What is RBAC or Role-Based Access Control
- Vectorization vs. Compilation in Query Execution
- SQLite vs DuckDB
Advanced
- Glow is an open-source toolkit for working with genomic data at biobank-scale and beyond using #apachespark & #deltalake
- ExPASy - Databases and software tools in proteomics, #genomics, phylogeny, systems biology, evolution, population genetics, and transcriptomics
- What is Metadata - A Data Engineering necessity
- What is Distributed Database
- To Partition, or Not to Partition, That is the Join Question in a Real System
- Paper: Solana - A new architecture for a high performance blockchain, inspired by Distributed Systems
- Scaling Large Production Clusters with Partitioned Synchronization
- Paper: Volcano Operator Model is based on relational algebra
- Paper: Faster and Cheaper Serverless Computing on Harvested Resources
- DBOS: A Paper on DBMS-oriented Operating System
- SSD Storage - Scale Caching without increasing too much cost & Smart Indexing for faster data query
- Paper: Lineage Tracing for General Data Warehouse Transformations
- What Every Programmer Should Know About Memory
- Deployment Archetypes for Cloud Applications
- PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers
- Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask
- Dual use of artificial-intelligence-powered drug discovery
Discussions
- Should you pick Managed Service or build self Managed Open Source Infrastructure
- What is Sigstore
- Security Threat Model
- Kubernetes Security & Secrets
- 2 ways of Data/ML Product Development