  • Stars: 181
  • Rank: 212,110 (Top 5%)
  • Language: HTML
  • License: Apache License 2.0
  • Created over 5 years ago
  • Updated 4 months ago


Repository Details

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

Learning Hadoop and Spark

Contents

This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.

🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running Hadoop & associated libraries (e.g. Hive, Pig, Spark...) workloads

🌩ī¸ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP DataProc, AWS EMR --or--
- Databricks on AWS

⛈ī¸ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads


Development Environment Setup Information

You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it's needlessly complex for initial study. Rather, I recommend that you use a partially or fully managed cluster. For learning, I most often use a fully managed (free tier) cluster.

1. SaaS - Databricks --> MANAGED

Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure or GCP (GCP support was announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.

  • Use Databricks Community Edition (managed, hosted Apache Spark), run on AWS. Example notebook shown in screenshot above.
    • uses Databricks (Jupyter-style) notebooks to connect to one or more custom-sized, managed Spark clusters
    • creates and manages your data files stored in cloud buckets as part of the Databricks service
    • uses a distributed file system (DFS) in cluster data operations
    • use Databricks AWS community edition (simplest set up - free tier on AWS) - link --OR--
    • use Databricks Azure trial edition - Azure may require a pay-as-you-go account to get needed CPU/GPU resources
    • try Databricks on GCP beta - announced recently - link

2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

  • Set up a managed Hadoop/Spark cloud cluster via GCP Dataproc or AWS EMR
    • see the setup-hadoop folder in this repo for instructions/scripts
      • create a GCS (or AWS) bucket for input/output job data
      • see the example_datasets folder in this repo for sample data files
    • for GCP, use Dataproc, which includes a Jupyter notebook interface --OR--
    • for AWS, use EMR with EMR Studio (which includes managed Jupyter instances) - link; example screenshot shown above
    • for Azure, it is possible to use the HDInsight service. I prefer Databricks on Azure because I find it to be more feature-complete and performant.
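
As a rough illustration of the GCP path above, the following sketch builds a `gcloud dataproc clusters create` command that enables the Jupyter component. It is not one of the scripts in the setup-hadoop folder; the helper name, cluster name, region and worker count are hypothetical placeholders, and it assumes the `gcloud` CLI is installed and authenticated if you actually run the command.

```python
import shlex


def dataproc_create_command(cluster_name, region, num_workers=2):
    """Return the argv list for creating a Dataproc cluster with Jupyter.

    Hypothetical helper for illustration only; it does not contact GCP.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--num-workers={num_workers}",
        # Optional component that provides the Jupyter notebook interface.
        "--optional-components=JUPYTER",
        # Exposes the component web UIs (including Jupyter) through the console.
        "--enable-component-gateway",
    ]


if __name__ == "__main__":
    # Placeholder names; substitute your own cluster name and region.
    cmd = dataproc_create_command("learning-hadoop", "us-central1")
    print(shlex.join(cmd))
```

To create the cluster rather than just print the command, the list can be passed to `subprocess.run(cmd, check=True)`. Remember to delete the cluster afterwards to avoid charges.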

3. IaaS local or cloud --> MANUAL

  • Set up Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
    • NOT RECOMMENDED for learning - too complex to set up
    • Cloudera Learning VM - also NOT recommended, changes too often, documentation not aligned

Example Jobs or Scripts

EXAMPLES from org.apache.hadoop_or_spark.examples - link for Spark examples

  • Run a Hadoop WordCount Job with Java (jar file)
  • Run a Hadoop and/or Spark CalculatePi (digits) Script with PySpark or other libraries
  • Run using Cloudera shared demo env
    • at https://demo.gethue.com/
    • login is user:demo, pwd:demo
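
To make the two example jobs above concrete before running them on a cluster, here is a minimal pure-Python sketch of the same logic. It is not the course code: the function names are hypothetical, and it assumes the Pi job is a Monte Carlo estimate, as in Spark's bundled SparkPi example. On a cluster, the map and reduce steps would be distributed across workers rather than run in one process.

```python
import random
from collections import Counter
from functools import reduce


def word_count(lines):
    """Count words in the map/reduce style used by the WordCount job."""
    # Map phase: each line becomes its own small table of word counts.
    mapped = (Counter(line.lower().split()) for line in lines)
    # Reduce phase: merge the per-line tables by summing the counts.
    return reduce(lambda a, b: a + b, mapped, Counter())


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same value
    # A point (x, y) lies inside the quarter circle when x^2 + y^2 <= 1.
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "The lazy dog"]))
    print(estimate_pi(100_000))
```

The estimate improves slowly as the number of samples grows, which is why this job is a common way to exercise a cluster with many parallel tasks.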

Other LinkedIn Learning Courses on Hadoop or Spark

There are ~10 courses on Hadoop/Spark topics on LinkedIn Learning. See graphic below.
Learning Paths

  • Hadoop for Data Science Tips and Tricks - link
    • Set up Cloudera Environment
    • Working with Files in HDFS
    • Connecting to Hadoop Hive
    • Complex Data Structures in Hive
  • Spark courses - link
    • Various Topics - see screenshot below

LinkedInLearningSpark
