  • Stars: 189
  • Rank: 204,649 (Top 5 %)
  • Language: JavaScript
  • License: MIT License
  • Created almost 11 years ago
  • Updated over 8 years ago

Repository Details

Creates a GitHub index for discovering similar repositories

Finding related projects on GitHub

This repository creates a recommendation database of "related" projects on GitHub. An interactive version is available here: http://www.yasiv.com/github/#/

How does it work?

How can we tell whether Project A is more related to Project B than it is to Project C?

It turns out that we, project followers, tend to give stars to similar projects. If I gave stars to A, B, and C, and you gave stars to A, B, C, and D, then I should probably go check out D as well. Most of the time, giving a star on GitHub is a good sign of project appreciation. So if we have starred three projects together, then we value similar things.

To turn this fact into a number, I'm using the Sorensen-Dice similarity coefficient:

                        number_of_shared_stars(A, B)
similarity(A, B) = ---------------------------------------
                   number_of_stars(A) + number_of_stars(B)
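
As a minimal sketch of the formula above, assuming each repository's stargazers are available as a Set of user logins (the `similarity` helper and the sample logins are hypothetical, not taken from the repository's code). Note that the textbook Sorensen-Dice coefficient multiplies the numerator by 2; that constant factor does not change how candidates are ordered, so the sketch follows the formula as written here.

```javascript
// Hypothetical helper: each argument is a Set of logins that starred a repository.
function similarity(stargazersA, stargazersB) {
  const total = stargazersA.size + stargazersB.size;
  if (total === 0) return 0; // neither project has stars, nothing to compare

  // Walk the smaller set and count logins that also appear in the larger one.
  const [small, large] = stargazersA.size <= stargazersB.size
    ? [stargazersA, stargazersB]
    : [stargazersB, stargazersA];
  let shared = 0;
  for (const user of small) {
    if (large.has(user)) shared += 1;
  }
  return shared / total;
}

const a = new Set(['me', 'you', 'alice']);
const b = new Set(['me', 'you', 'bob', 'carol', 'dave']);
console.log(similarity(a, b)); // 2 shared / (3 + 5) = 0.25
```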

The "Developers who gave a star to this repository also gave a star to ..." metric works decently well for projects with 150 to 2000 stars. For projects with fewer stars there is not enough overlap between watchers. For extremely popular projects the coefficient becomes higher when the other project is also extremely popular. Thus projects like Bootstrap get Angular, jQuery, and Node as the most relevant.

Data Gathering

GitHub Archive provides gigabytes of data from GitHub. We can query it using Google's BigQuery API.

For example, this query:

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent'

This gives us a list of repositories, along with the users who gave them stars:

| Row | repository_url                                     | actor_attributes_login |
| --- | -------------------------------------------------- | ---------------------- |
| 1   | https://github.com/alump/Masonry                   | markiewb               |
| 2   | https://github.com/andrewjstone/rafter             | kirsn                  |
| 3   | https://github.com/jgraph/draw.io                  | nguyennamtien          |
| 4   | https://github.com/samvermette/SVWebViewController | dlo                    |
| 5   | https://github.com/mafintosh/peerflix              | 0xPr0xy                |
| ..  | ...                                                | ...                    |

By iteratively processing each record we can calculate the number of stars for each project. We can also find how many shared stars each project has with every other project. But the dataset is huge. As of today (Nov 30, 2014) there are 25M watch events, produced by more than 1.8M unique users and given to more than 1.2M unique repositories. We need to reduce the dataset:

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent' AND actor_attributes_login IN (
  SELECT actor_attributes_login FROM [githubarchive:github.timeline]
  WHERE type='WatchEvent'
  GROUP BY actor_attributes_login HAVING (count(*) > 1) AND (count (*) < 500)
)
GROUP EACH BY repository_url, actor_attributes_login;
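
For readers less familiar with BigQuery, the same reduction can be sketched in plain JavaScript on a small in-memory sample. This is only an illustration of what the query does, assuming rows shaped as `{ repo, user }` objects (a hypothetical shape); the `reduceDataset` name is also made up for this example.

```javascript
// rows: one { repo, user } object per WatchEvent (hypothetical shape).
function reduceDataset(rows) {
  // Inner query: count watch events per user.
  const eventsPerUser = new Map();
  for (const { user } of rows) {
    eventsPerUser.set(user, (eventsPerUser.get(user) || 0) + 1);
  }

  // Outer query: keep users with more than 1 and fewer than 500 events,
  // and collapse duplicate (repo, user) pairs like the GROUP EACH BY does.
  const seen = new Set();
  const result = [];
  for (const { repo, user } of rows) {
    const count = eventsPerUser.get(user);
    if (count <= 1 || count >= 500) continue;
    const key = repo + '\n' + user; // separator assumed not to occur in either field
    if (seen.has(key)) continue;
    seen.add(key);
    result.push({ repo, user });
  }
  return result;
}

const sampleEvents = [
  { repo: 'A', user: 'x' },
  { repo: 'B', user: 'x' },
  { repo: 'A', user: 'y' }, // y has a single event and is dropped
  { repo: 'A', user: 'x' }, // duplicate pair, collapsed
];
console.log(reduceDataset(sampleEvents));
```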

Why do we limit the lower bound to at least 2 stars?

Since we are using the Sorensen-Dice similarity coefficient, users who gave only 1 star in total can be excluded from the "shared stars" metric: a single star can never be shared between two projects. In theory this will slightly skew the similarity coefficient and make two projects more similar than they should be, but in practice the results seem to be helpful enough. This also serves as a good filter against bot attacks.

Why do we limit the upper bound to at most 500?

To save CPU time. Is this bad? Only 0.7% of users gave more than 500 stars.

This query reduces the dataset from 25M to 16M records.

Data storing

We downloaded the dataset and stored it in a CSV file for further processing.

To calculate similarity we need to be able to quickly answer two questions:

  1. Who gave stars to project A?
  2. Which projects were starred by User B?

If only we could save this into a hash-like data structure, that would give us O(1) time to answer both of these questions.
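
To make the two lookups concrete, here is a minimal sketch using two in-memory maps. It assumes rows shaped as `{ repo, user }` objects and a hypothetical `buildIndex` helper, and it is only meant for a small sample, not for the full 16M-row dataset.

```javascript
// Builds two lookup tables from (repo, user) pairs.
function buildIndex(rows) {
  const usersByRepo = new Map(); // repo -> Set of users who starred it
  const reposByUser = new Map(); // user -> Set of repos the user starred
  for (const { repo, user } of rows) {
    if (!usersByRepo.has(repo)) usersByRepo.set(repo, new Set());
    if (!reposByUser.has(user)) reposByUser.set(user, new Set());
    usersByRepo.get(repo).add(user);
    reposByUser.get(user).add(repo);
  }
  return { usersByRepo, reposByUser };
}

const index = buildIndex([
  { repo: 'A', user: 'x' },
  { repo: 'A', user: 'y' },
  { repo: 'B', user: 'x' },
]);
console.log([...index.usersByRepo.get('A')]); // question 1: who starred A?
console.log([...index.reposByUser.get('x')]); // question 2: what did x star?
```

Both questions become a single map lookup, which is the O(1) behavior described above.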

The naive solution of storing everything in a hash inside one process (using either C++ or node) turned out to be extremely inefficient. My processes exceeded the 8GB RAM limit and started killing my laptop with constant swapping.

Maybe I should save it into a local database?

I tried neo4j, but it failed with an out-of-memory exception during CSV import.

The next and last stop was redis. An absolutely beautiful piece of software. It swallowed 16M rows without blinking an eye. RAM stayed within a sane 3GB range, and disk utilization was only 700MB.

Building recommendations

EDIT (Jan 2016): GitHub Archive has changed its API. Unfortunately, repository descriptions and actual star counts are no longer available.

Thus you will need to:

  1. Launch a redis server on the default port
  2. Run node indexRepoInfo.js and let it run for 15-20 days (yeah :( )

indexRepoInfo downloads repositories one by one and saves meta information about them (stargazers count, description). There are 1.7 million repositories and GitHub limits API calls to 5k per hour, hence the huge number of days.

End of edit.
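
The estimate can be checked with a quick back-of-the-envelope calculation. The sketch below assumes exactly one API call per repository and no retries or pauses, which is an assumption for illustration rather than something the script guarantees; any failed or repeated calls push the real duration toward the 15-20 days quoted above.

```javascript
const repositories = 1.7e6; // number of repositories, from the text
const callsPerHour = 5000;  // GitHub API limit quoted in the text
const callsPerRepo = 1;     // assumption: one API call per repository

const hours = (repositories * callsPerRepo) / callsPerHour;
const days = hours / 24;
console.log(hours, days.toFixed(1)); // 340 hours, roughly 14.2 days
```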

The recommendation database is created by these ~200 lines of code. There is a lot of asynchronous code in there, hidden behind promises.

In a nutshell, this is what it's doing:

1. Find all repositories with more than 150 stars.
2. For each repository, find the users who gave it a star.
     For each user who gave a star, find which other projects they starred.
     For each other project, increase the number of shared stars.
3. Produce the similarity coefficient.
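
The steps above can be sketched synchronously and in memory as follows. This is not the repository's actual code (which is asynchronous and backed by redis); the `buildRecommendations` function, its parameters, and the `{ repo, user }` row shape are assumptions made for this example, and star counts are taken from the sample rows rather than from the GitHub API.

```javascript
function buildRecommendations(rows, minStars = 150, topN = 5) {
  // Index the rows both ways: repo -> users and user -> repos.
  const usersByRepo = new Map();
  const reposByUser = new Map();
  for (const { repo, user } of rows) {
    if (!usersByRepo.has(repo)) usersByRepo.set(repo, new Set());
    if (!reposByUser.has(user)) reposByUser.set(user, new Set());
    usersByRepo.get(repo).add(user);
    reposByUser.get(user).add(repo);
  }

  const recommendations = new Map();
  for (const [repo, users] of usersByRepo) {
    if (users.size <= minStars) continue; // step 1: more than minStars stars

    // Step 2: count shared stars with every other project.
    const shared = new Map();
    for (const user of users) {
      for (const other of reposByUser.get(user)) {
        if (other === repo) continue;
        shared.set(other, (shared.get(other) || 0) + 1);
      }
    }

    // Step 3: turn shared counts into similarity coefficients.
    const scored = [];
    for (const [other, count] of shared) {
      const score = count / (users.size + usersByRepo.get(other).size);
      scored.push({ repo: other, score });
    }
    scored.sort((x, y) => y.score - x.score);
    recommendations.set(repo, scored.slice(0, topN));
  }
  return recommendations;
}

const starRows = [
  { repo: 'A', user: 'u1' }, { repo: 'A', user: 'u2' }, { repo: 'A', user: 'u3' },
  { repo: 'B', user: 'u1' }, { repo: 'B', user: 'u2' },
  { repo: 'C', user: 'u3' },
];
// A threshold of 1 is used here only so that the tiny sample produces output.
const related = buildRecommendations(starRows, 1);
console.log(related.get('A'));
```

In the sample, B shares two stargazers with A and C shares one, so B ranks above C in the list for A.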

The final results are saved to disk and then uploaded to S3, so that the frontend can get them immediately.

license

MIT
