

fake-star-detector: A Dagster tutorial

This is a simple Dagster project to analyze the number of fake GitHub stars on any GitHub repository. It is a companion to the blog post found on the Dagster blog.

This project consists of two models:

  • Simpler model: A simple model running a “low activity” heuristic. This heuristic can detect many (but not all) suspected fake accounts that starred the same set of repositories, using nothing but data from the GitHub REST API (via pygithub).

  • Complex detector: An alternative detection model which runs a sophisticated clustering algorithm as well as the heuristic, using the public GH Archive available in BigQuery. This model is written in SQL and uses dbt alongside Dagster.

    • Note: You can run this within the limits of a free-tier BigQuery account, but the analysis will be reduced in scope. By default, this model only scans data from 2023 on a small repository, so that it stays within the free-tier quota.

global-asset-lineage

Getting started

Install instructions

For this tutorial, we assume you have Git installed. Installation details can be found here: https://github.com/git-guides/install-git.

Getting GitHub Access Token for the simpler model

To run the simpler model, you will also need a GitHub personal access token to access the GitHub API. You can create one at https://github.com/settings/tokens (after logging in to GitHub). Keep the new token handy, as we will need it shortly.

Note: the token needs the gist scope, because the pipeline creates a Gist containing the final report.

Creating Google Service Account for the complex model

To run the complex model, which uses BigQuery public archive data, you will need a BigQuery account to run the dbt models. You can sign up for a free account here. Check out Create a Google Service Account to learn how to create a service account and download the JSON key file. The BigQuery API must also be enabled for the Google Cloud project that owns the service account.

Setting up your local environment

Cloning the repo

You will need to clone this repository onto your local machine and go into the top level of the cloned repository:

git clone https://github.com/dagster-io/fake-star-detector.git
cd fake-star-detector

Building a virtual environment

Create a virtual environment, then install Dagster and the other dependencies (see https://docs.dagster.io/getting-started/install):

python3 -m venv venv
source venv/bin/activate

pip install -e ".[dev]"

Note: M1 Mac users may need to use pip install dagster dagit --find-links=https://github.com/dagster-io/build-grpcio/wiki/Wheels

Running Dagster locally

Start the Dagster UI web server:

dagster dev

Open http://localhost:3000 with your browser to see the project.

Note: If you have previously installed Dagster on your system, you may encounter the error Error: No such command 'dev'. If this is the case, your system is likely trying to access a Dagster install outside of your venv. Try running the shell command rehash (in zsh; the bash equivalent is hash -r), which recomputes the internal hash table for the PATH variable, then repeat the dagster dev command.
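
If you want to confirm which dagster executable your shell resolves, a small check like the following can help. This is a minimal sketch that is not part of the repository: the helper name resolves_inside_venv is hypothetical, and it relies only on shutil.which and sys.prefix from the standard library.

```python
import os
import shutil
import sys
from typing import Optional


def resolves_inside_venv(executable_path: Optional[str], venv_prefix: str) -> bool:
    """Return True if executable_path is located under venv_prefix.

    Hypothetical helper for illustration; not part of the project.
    """
    if executable_path is None:
        # The command was not found on PATH at all.
        return False
    exe = os.path.abspath(executable_path)
    prefix = os.path.abspath(venv_prefix)
    try:
        return os.path.commonpath([exe, prefix]) == prefix
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


if __name__ == "__main__":
    found = shutil.which("dagster")
    print("dagster resolves to:", found)
    print("inside active environment:", resolves_inside_venv(found, sys.prefix))
```

Run it with the venv activated. If the second line prints False, the shell is picking up a Dagster install from somewhere else on your PATH.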

Next, you can run the detection model(s) of your own choice.

  • Simple, more complete history, and no cost: Simpler model. This model uses data from the GitHub API to detect fake stars. You may hit the GitHub API rate limit, depending on how much activity the repository you choose to analyze has.
  • Comprehensive, but a shorter period at free tier: Complex model. This model uses the public GH Archive available in BigQuery. You can specify a longer time span and a larger repository to analyze, but it may get expensive as you scan more data in BigQuery.

Running Dagster locally using Docker

You can use the steps below to bring up a dagster-dev instance using a local Docker image.

Trying the simpler model using data from the GitHub API

Setting up the environment variables

This model requires a GitHub access token to access the GitHub API. Refer to the instructions above for how to obtain your own token. (If you are using Docker, you can skip this step, as you should have already set the token in an earlier step.)

Then, create a .env file at the root of the repository you just cloned and add your GitHub access token as a variable:

GITHUB_ACCESS_TOKEN=<<GITHUB_ACCESS_TOKEN>>
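
Before starting a run, it can be useful to check that the placeholder was actually replaced. The following is a minimal sketch, assuming a simple KEY=VALUE file without inline comments; the function names parse_env_text and has_github_token are hypothetical and not part of the project.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and '#' comment lines."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_github_token(env: Dict[str, str]) -> bool:
    """True if GITHUB_ACCESS_TOKEN is set and is not the <<...>> placeholder."""
    token = env.get("GITHUB_ACCESS_TOKEN", "")
    return bool(token) and not token.startswith("<<")
```

For example, has_github_token(parse_env_text(open(".env").read())) returns False while the file still contains the placeholder shown above.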

Running the model

simpler-model

Navigate to simpler_model from the left nav in the UI. By default, it analyzes a small repository, frasermarlow/tap-bls. Click "Materialize all" to kick off the model and check out the result. At the end of the job, it outputs a Gist summarizing the analysis, such as this URL. You can find the link in the compute log:

simpler-model-gist-link

You can also specify a different repo by Shift+clicking "Materialize all" on the asset graph page to specify it in the configuration, such as:

ops:
    stargazers:
        config:
            repo: <insert name of the repo to analyze>

Shift+click "Materialize all" | Config editor
simpler-model-shift-click | simpler-model-config

Then, click "Materialize" to kick off the simple model. In the end, you'll get a GitHub Gist with the results.

Troubleshooting

If you are using the default Dagster storage backed by SQLite, you may encounter an error such as:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file

To get your pipeline successfully running, you can Shift+click "Materialize all" on the asset graph page and add the following configuration. This will turn off the default multiprocessing execution.

execution:
  config:
    in_process: null
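
If you prefer to script runs rather than paste YAML into the config editor, the two fragments shown in this section (the repo override and the in-process executor) map onto a plain Python dictionary. This is a minimal sketch; build_run_config is a hypothetical helper, and only the key names are taken from the YAML above.

```python
from typing import Any, Dict


def build_run_config(repo: str, in_process: bool = False) -> Dict[str, Any]:
    """Build a run config dict mirroring the YAML fragments in this section.

    Hypothetical helper for illustration; not part of the project.
    """
    config: Dict[str, Any] = {
        "ops": {"stargazers": {"config": {"repo": repo}}},
    }
    if in_process:
        # Equivalent to `in_process: null` in the YAML form.
        config["execution"] = {"config": {"in_process": None}}
    return config


if __name__ == "__main__":
    print(build_run_config("frasermarlow/tap-bls", in_process=True))
```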

Explanation of the model

This asset group is a Dagster project and involves 5 assets:

  1. Asset stargazers: We call the GitHub API and retrieve a list of users who have starred the repo, and then turn the response into a Pandas DataFrame.
  2. Asset stargazers_with_user_info: We look up each user in turn and pull their detailed profile from the GitHub API.
  3. Asset classified_stargazers_df: We analyze each profile and match it against our heuristic to determine if they are fake or not, and output a Pandas DataFrame.
  4. Asset real_vs_raw_stars_report: We calculate the percentage of fake stars and output a report in raw Jupyter Notebook format.
  5. Asset github_stars_notebook_gist: We convert the raw Jupyter Notebook into a Gist and output the URL to the Gist.

In addition to the above, we have a few helper functions:

  a) validate_star: Matches a profile against the heuristic.
  b) see_if_user_exists: Verifies that a user still exists before pulling the full details.
  c) handle_exception: Handles exceptions for the GitHub API call. It calls get_retry_at, which returns the x-ratelimit-reset value for the GitHub API.
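
To illustrate the rate-limit handling, here is a minimal sketch of how a wait time can be derived from the x-ratelimit-reset header, which GitHub reports as UTC epoch seconds. The function seconds_until_reset is hypothetical; the project's own get_retry_at may work differently.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Seconds to wait until the rate-limit window resets (never negative).

    Hypothetical helper; not the project's get_retry_at implementation.
    """
    now = time.time() if now is None else now
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # If the header is missing, assume no wait is needed.
    reset_at = float(lowered.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)
```

A caller could sleep for the returned number of seconds before retrying the failed request.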

Currently, the pipeline simply returns a result in the Dagster UI as an INFO event, such as "Score is 12.34% fake stars", and provides a list of usernames flagged as fake.
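
The score itself is a simple percentage. The sketch below shows one way it could be computed from classified stargazers. It uses plain dictionaries instead of a Pandas DataFrame to stay dependency-free, and the field names login and is_fake are assumptions for illustration, not the project's actual column names.

```python
from typing import Dict, List, Tuple


def summarize_fake_stars(classified: List[Dict[str, object]]) -> Tuple[str, List[str]]:
    """Return the score message and the list of flagged usernames.

    Assumes each row has a 'login' and a boolean 'is_fake' field
    (hypothetical names chosen for this sketch).
    """
    flagged = [str(row["login"]) for row in classified if row["is_fake"]]
    total = len(classified)
    # Guard against division by zero for repositories with no stargazers.
    pct = 100.0 * len(flagged) / total if total else 0.0
    return f"Score is {pct:.2f}% fake stars", flagged
```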

Running the complex model using BigQuery archive data

This model is written in SQL and uses dbt. You can find the dbt project in the dbt_project directory, and the dbt models in the dbt_project/models/complex_detector/ directory.

Setting up the environment variables

This model requires a BigQuery account and a Google Service Account with the BigQuery API enabled to access the source data. Refer to the instructions above.

Next, you will need to add the credentials to your environment. You can do this by adding the following to your .env file:

DBT_BIGQUERY_KEYFILE_PATH='path to your JSON key file'
DBT_BIGQUERY_PROJECT='name of the bigquery project the output will write to'
DBT_BIGQUERY_DATASET='dbt_github_star' # or your desired bigquery dataset name
DBT_BIGQUERY_LOCATION='US' # or your desired location
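
A quick way to catch a missing variable before dbt fails is to check the environment up front. This is a minimal sketch; missing_dbt_vars is a hypothetical helper, and only the four variable names come from the listing above.

```python
import os
from typing import List, Mapping

# The four variables listed in the .env example above.
REQUIRED_DBT_VARS = (
    "DBT_BIGQUERY_KEYFILE_PATH",
    "DBT_BIGQUERY_PROJECT",
    "DBT_BIGQUERY_DATASET",
    "DBT_BIGQUERY_LOCATION",
)


def missing_dbt_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_DBT_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    print("Missing:", missing_dbt_vars(os.environ))
```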

Running the model

Navigate to complex_detector in the UI. Click "Materialize all" to kick off the complex model. By default, it analyzes frasermarlow/tap-bls.

tables-in-bigquery

In the end, you'll get a few BigQuery tables with the final result:

Tables and views in BigQuery | Final result
tables-in-bigquery | stat-result-in-bigquery

To analyze another repository, edit target_repo in ./dbt_project/models/fake_star_detector/stg_all_actions_for_actors_who_starred_repo.sql:

{% set target_repo = 'frasermarlow/tap-bls' %}
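
If you switch repositories often, the edit can be scripted. The sketch below rewrites the Jinja assignment in the text of the SQL file. set_target_repo is a hypothetical helper, and it assumes the assignment uses single quotes exactly as in the line above.

```python
import re

# Matches: {% set target_repo = '...' %} and captures the parts around the value.
_TARGET_REPO = re.compile(r"(\{%\s*set\s+target_repo\s*=\s*')[^']*('\s*%\})")


def set_target_repo(sql_text: str, repo: str) -> str:
    """Return sql_text with the target_repo assignment pointing at repo.

    Hypothetical helper for illustration; not part of the project.
    """
    if "'" in repo:
        raise ValueError("repo must not contain a single quote")
    # A function replacement avoids backslash-escape surprises in `repo`.
    new_text, count = _TARGET_REPO.subn(
        lambda m: m.group(1) + repo + m.group(2), sql_text, count=1
    )
    if count == 0:
        raise ValueError("no target_repo assignment found")
    return new_text
```

To use it, read the model file, pass its contents through set_target_repo, and write the result back before materializing the assets again.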

Explanation of the model

This loads a dbt project which uses GitHub Archive data to identify suspicious users who starred the given repository, and estimates a FAKE STAR score for that repository using two separate heuristics to catch different types of fake accounts.

The dbt project materializes 4 BigQuery tables with estimated FAKE STAR score, and a few staging views:

  1. stg_all_actions_for_actors_who_starred_repo: all activity for users who starred the repo in the given time period.
  2. starring_actor_summary: table with 1 row per user, summarizing similarity to other users in a set.
  3. starring_actor_repo_summary: table with 1 row per repo, summarizing similarity to other repositories that actors touched.
  4. fake_star_stats: fake star summary stats (FAKE STAR score).
