Repository Details

Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups

Data Engineering Roadmap

Below you can find a chart showing the paths you can take and the milestones you would want to reach in order to become a data engineer. We spoke to senior data engineers and data engineering managers from top tech companies in Silicon Valley, and consolidated what we learned from these conversations and from data engineering Meetups in the Bay Area. We hope this can serve as a guide to everyone interested in breaking into data engineering, especially people who do not live close to a tech hub and don't have a strong personal network in data engineering.

We are continuing to add recommended resources, example practice projects and additional tips to expand the roadmap. Contributions are welcome and highly appreciated.

If you are having difficulty committing to this entire roadmap on your own, we suggest finding someone like-minded with similar goals and starting the arduous task of learning together. Going to meetups is one way to network with people who are looking to learn. We started a community to help connect people with similar learning goals, and we plan to keep it free for users forever. If you are interested, head over to https://video.boringppl.com/

Disclaimer

The purpose of this roadmap is to give you an overview of the core skills needed in data engineering. Summary insights from conversations or Meetups do not represent any company's opinion. Data engineering roles vary from one company to another, and from one role to another. If you are interested in a specific data engineering role, we recommend reaching out to individuals within the company to talk about their roles and culture fit. If you find any insights that help in your journey, the community will greatly benefit from your contributions.

Roadmap

[Image: Roadmap chart]

Resources

Netflix Hands-on Data Engineering Workshop (7/11/2018)

  • Typical data engineering project at Netflix

    • Start with a problem statement

    • Data exploration

      • Data sources: logs, data warehouses, third-party APIs
      • Important to explore the structure, volume, granularity and frequency
    • Data modeling

      • Decide how the eventual output should look
        • Depending on the consumer (direct end user vs. intermediate input to another system)
        • Depending on the skill level and preferred consumption mode of your consumer (can they parse JSON?)
      • Dimensionality
        • Time: day? Hour? Second? Roll up to what level?
        • Geography: region? Country? What insights are the stakeholders hoping to get out?
        • Device level? (It is important because we want to optimize your experience)
      • Metrics
        • What do the stakeholders want to measure?
        • Can I add more color to help stakeholders diagnose value drivers and extract deeper insights, since I am the expert on the data?
      • Relationships
        • Datasets have relationships with other entities (e.g. visitors and devices have a many-to-many relationship)
    • Data transformation

      • Filter
      • Enrich the data to add more color
      • Standardize (e.g. naming conventions)
      • Aggregate
    • Data quality

      • Check work thoroughly: look at trends, missing data gaps and anomalies
  • Collaboration with other teams:

    • UI engineers: logging and instrumentation
    • Other data engineers
    • Data scientists and data analysts: understand their experiments and analysis to prepare data for insights
    • Data platform team: Considerations on efficiency and scalability
  • Hands-on exercise: Build a Spark pipeline following the above steps using Python
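
The transformation and data quality steps listed above can be sketched in a few lines of plain Python. This is only an illustration and not the workshop's Spark exercise: the event fields (`UserID`, `Country`, `Device`, `ts`, `ok`), the `REGION_BY_COUNTRY` lookup, and all function names are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw log rows; the field names are illustrative only.
RAW_EVENTS = [
    {"UserID": "u1", "Country": "us", "Device": "TV", "ts": "2018-07-09", "ok": True},
    {"UserID": "u2", "Country": "US", "Device": "phone", "ts": "2018-07-09", "ok": True},
    {"UserID": "u3", "Country": "br", "Device": "tv", "ts": "2018-07-09", "ok": False},
    {"UserID": "u1", "Country": "us", "Device": "TV", "ts": "2018-07-11", "ok": True},
]

# Hypothetical lookup table used for enrichment.
REGION_BY_COUNTRY = {"US": "North America", "BR": "Latin America"}


def transform(events):
    """Filter, standardize, and enrich raw events."""
    cleaned = []
    for e in events:
        if not e["ok"]:  # Filter: drop failed events
            continue
        country = e["Country"].upper()  # Standardize: consistent casing
        cleaned.append({
            "user_id": e["UserID"],  # Standardize: snake_case field names
            "country": country,
            "device": e["Device"].lower(),
            "day": date.fromisoformat(e["ts"]),
            "region": REGION_BY_COUNTRY.get(country, "Unknown"),  # Enrich
        })
    return cleaned


def aggregate(rows):
    """Aggregate: count events per (day, region, device)."""
    counts = defaultdict(int)
    for r in rows:
        counts[(r["day"], r["region"], r["device"])] += 1
    return dict(counts)


def missing_days(rows):
    """Data quality: calendar days with no events between the first and last day."""
    days = {r["day"] for r in rows}
    if not days:
        return []
    current, last = min(days), max(days)
    gaps = []
    while current <= last:
        if current not in days:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


rows = transform(RAW_EVENTS)
daily_counts = aggregate(rows)
gaps = missing_days(rows)
```

The choice of grouping key in `aggregate` is where the dimensionality questions (time, geography, device) show up in code. In a Spark pipeline the same steps would typically map onto DataFrame filters, joins, column renames, and group-by aggregations.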

Conversation with Facebook senior data engineer

  • Data engineers need strong fundamentals in relational databases and data modeling. The work is about problem solving, so the more exposure you have, the larger your solution set is, and the more likely you are to come up with a good solution that addresses the requirements and constraints you face

Presentation by Facebook data engineering manager: Role of a data engineer during a product launch

  • Facebook data engineering team is diverse: people come from different backgrounds and bring in a rich set of unique perspectives

  • Data engineers are embedded within product teams

  • A product data engineer can empower the team to find product market fit and has significant influence on key metrics that the entire team should focus on driving. This requires one to have a strong product sense

  • Example: On the product launch day, what should be on the dashboard?

    • How will you define active user events? How does the definition differ across a social media app, an e-commerce app, and a B2B app?
    • What should be the cadence of measurement - daily, weekly, monthly, yearly?
    • We don't need to be hyper-focused on user acquisition funnels
    • We don't need to look at marketing conversion stats
    • We need to focus on retention. A product data engineer needs to advocate for that and not distract the team with too many secondary metrics
    • Tracking retention is the first step. How can we improve retention? What data do I need to surface to the team to decide on the different drivers?
      • Cohort analysis
  • Example: Launch day data infrastructure

    • Validate logging, tie technical implementation to business metrics
    • Think about edge cases and use your business intuition to decide how to handle them
      • If the system crashes during the startup process, should we still log it as an active event?
    • Optimize retention query
    • Provision for scalability
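
As a rough illustration of the cohort analysis mentioned above, the sketch below groups users by their first active day and computes the share of each cohort that is active again a fixed number of days later. The sample events, the function name `cohort_retention`, and the day-1 and day-7 offsets are assumptions for this example. The events are assumed to have already passed whatever "active user event" definition the team agreed on.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (user_id, activity_day) pairs that already satisfy the
# team's chosen definition of an active user event.
ACTIVE_EVENTS = [
    ("u1", date(2024, 1, 1)), ("u1", date(2024, 1, 2)), ("u1", date(2024, 1, 8)),
    ("u2", date(2024, 1, 1)), ("u2", date(2024, 1, 2)),
    ("u3", date(2024, 1, 2)), ("u3", date(2024, 1, 9)),
]


def cohort_retention(events, offsets=(1, 7)):
    """Return {cohort_day: {offset: share of cohort active on cohort_day + offset}}."""
    days_by_user = defaultdict(set)
    for user, day in events:
        days_by_user[user].add(day)

    # A user's cohort is the first day on which they were active.
    cohorts = defaultdict(list)
    for user, days in days_by_user.items():
        cohorts[min(days)].append(user)

    result = {}
    for cohort_day, users in cohorts.items():
        result[cohort_day] = {}
        for offset in offsets:
            retained = sum(
                1 for u in users
                if any((d - cohort_day).days == offset for d in days_by_user[u])
            )
            result[cohort_day][offset] = retained / len(users)
    return result


retention = cohort_retention(ACTIVE_EVENTS)
```

Here `retention[date(2024, 1, 1)]` holds the day-1 and day-7 retention for users first seen on January 1. At production scale this logic would more likely live in a warehouse query, which is where the "optimize retention query" point comes in.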

License

License: CC BY-NC-SA 4.0
