
🕷 🕸 crawl GitHub web pages for insights we can't GET from the API ... 💡

:octocat: 🕷 🕸 GitHub Scraper

Learn how to parse the DOM of a web page by using your favourite coding community as an example.


Step one: learn JavaScript!

⚠️ Disclaimer / Warning!

This repository/project is intended for Educational Purposes ONLY.
The project and corresponding NPM module should not be used for any purpose other than learning. Please do not use it for any other reason than to learn about DOM parsing and definitely don't depend on it for anything important!

The nature of DOM parsing is that when the HTML/UI changes, the parser will inevitably fail ... GitHub have every right to change/improve their UI as they see fit. When they do change their UI the scraper will inevitably "break"! We have Travis-CI continuous integration to run our tests precisely to check that parsers for the various pages are working as expected. You can run the tests locally too, see "Run The Tests" section below.

Why?

Our initial reason for writing this set of scrapers was to satisfy the curiosity / question:

How can we discover which are the interesting people and projects on GitHub
(without manually checking dozens of GitHub profiles/repositories each day) ?

Our second reason for scraping data from GitHub is so that we can show people a "summary view" of all their issues in our Tudo project (which helps people track/manage/organise/prioritise their GitHub issues). See: dwyl/tudo#51

We needed a simple way of systematically getting data from GitHub (before people authenticate) and scraping is the only way we could think of.

We tried using the GitHub API to get records from GitHub, but sadly it has quite a few limitations (see the "Issues with GitHub API" section below), the biggest being the rate-limiting on API requests.

Thirdly, we're building this project to scratch our own itch
... scraping the pages of GitHub has given us a unique insight into the features of the platform, which has leveled-up our skills.

Don't you want to know what's "Hot" right now on GitHub...?

What (Problem are we trying to Solve)?

Having a way of extracting the essential data from GitHub is a solution to a surprisingly wide array of problems; here are a few:

  • Who are the up-and-coming people (worth following) on GitHub?
  • Which are the interesting projects (and why?!)
  • What is the average age of an issue for a project?
  • Is a project's popularity growing or has it plateaued?
  • Are there (already) any similar projects to what I'm trying to build? (reduce duplication of effort which is rampant in Open Source!!)
  • How many projects get started but never finished?
  • Will my Pull Request ever get merged or is the module maintainer too busy and did I just waste 3 hours?
  • insert your idea/problem here ...
  • Associative Lists e.g: People who starred abc also liked xyz

How?

This module fetches (public) pages from GitHub, "scrapes" the HTML to extract raw data and returns a JSON Object.

Usage

install from NPM

install from npm and save to your package.json:

npm install github-scraper --save

Use it in your script!

var gs = require('github-scraper');
var url = '/iteles' // a random username
gs(url, function(err, data) {
  console.log(data); // or what ever you want to do with the data
})

Example URLs and Output

Profile Page

A user profile has the following URL format: https://github.com/{username}
example: https://github.com/iteles

[screenshot: iteles' GitHub profile (April 2019), annotated]

var gs = require('github-scraper'); // require the module
var url = 'iteles' // a random username (of someone you should follow!)
gs(url, function(err, data) {
  console.log(data); // or what ever you want to do with the data
})

Sample output:

{
  "type": "profile",
  "url": "/iteles",
  "avatar": "https://avatars1.githubusercontent.com/u/4185328?s=400&v=4",
  "name": "Ines Teles Correia",
  "username": "iteles",
  "bio": "Co-founder @dwyl | Head cheerleader @foundersandcoders",
  "uid": 4185328,
  "worksfor": "@dwyl",
  "location": "London, UK",
  "website": "http://www.twitter.com/iteles",
  "orgs": {
    "bowlingjs": "https://avatars3.githubusercontent.com/u/8825909?s=70&v=4",
    "foundersandcoders": "https://avatars3.githubusercontent.com/u/9970257?s=70&v=4",
    "docdis": "https://avatars0.githubusercontent.com/u/10836426?s=70&v=4",
    "dwyl": "https://avatars2.githubusercontent.com/u/11708465?s=70&v=4",
    "ladiesofcode": "https://avatars0.githubusercontent.com/u/16606192?s=70&v=4",
    "TheScienceMuseum": "https://avatars0.githubusercontent.com/u/16609662?s=70&v=4",
    "SafeLives": "https://avatars2.githubusercontent.com/u/20841400?s=70&v=4"
  },
  "repos": 28,
  "projects": 0,
  "stars": 453,
  "followers": 341,
  "following": 75,
  "pinned": [
    { "url": "/dwyl/start-here" },
    { "url": "/dwyl/learn-tdd" },
    { "url": "/dwyl/learn-elm-architecture-in-javascript" },
    { "url": "/dwyl/tachyons-bootstrap" },
    { "url": "/dwyl/learn-ab-and-multivariate-testing" },
    { "url": "/dwyl/learn-elixir" }
  ],
  "contribs": 878,
  "contrib_matrix": {
    "2018-04-08": { "fill": "#c6e48b", "count": 1, "x": "13", "y": "0" },
    "2018-04-09": { "fill": "#c6e48b", "count": 2, "x": "13", "y": "12" },
    "2018-04-10": { "fill": "#7bc96f", "count": 3, "x": "13", "y": "24" },
    ...etc...
    "2019-04-11": { "fill": "#c6e48b", "count": 1, "x": "-39", "y": "48" },
    "2019-04-12": { "fill": "#7bc96f", "count": 5, "x": "-39", "y": "60"}
  }
}
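
Once you have that data back you can do whatever you like with it. For example, here's a minimal sketch (using the key names from the sample output above) that totals the daily counts in the contrib_matrix:

var gs = require('github-scraper');

gs('/iteles', function (err, data) {
  if (err) { return console.error(err); }
  // sum the daily counts in the contribution matrix shown above
  var days  = Object.keys(data.contrib_matrix || {});
  var total = days.reduce(function (sum, day) {
    return sum + data.contrib_matrix[day].count;
  }, 0);
  console.log(data.username + ' made ' + total + ' contributions over ' + days.length + ' days');
});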

Followers

How many people are following a given person on GitHub. URL format: https://github.com/{username}/followers
example: https://github.com/iteles/followers

var gs = require('github-scraper'); // require the module
var url = 'iteles/followers' // a random username (of someone you should follow!)
gs(url, function(err, data) {
  console.log(data); // or what ever you want to do with the data
})

Sample output:

{ entries:
   [ 'tunnckoCore', 'OguzhanE', 'minaorangina', 'Jasonspd', 'muntasirsyed', 'fmoliveira', 'nofootnotes',
    'SimonLab', 'Danwhy', 'kbocz', 'cusspvz', 'RabeaGleissner', 'beejhuff', 'heron2014', 'joshpitzalis',
    'rub1e', 'nikhilaravi', 'msmichellegar', 'anthonybrown', 'miglen', 'shterev', 'NataliaLKB',
    'ricardofbarros', 'boymanjor', 'asimjaved', 'amilvasishtha', 'Subhan786', 'Neats29', 'lottie-em',
    'rorysedgwick', 'izaakrogan', 'oluoluoxenfree', 'markwilliamfirth', 'bmordan', 'nodeco', 'besarthoxhaj',
    'FilWisher', 'maryams', 'sofer', 'joaquimserafim', 'vs4vijay', 'intool', 'edwardcodes', 'hyprstack',
    'nelsonic' ],
  url: 'https://github.com/iteles/followers' }
ok 1 iteles/followers count: 45

If the person has more than 51 followers they will have multiple pages of followers. The data will have a next_page key with a value such as /nelsonic/followers?page=2. If you want to keep fetching these subsequent pages of followers, simply keep running the scraper, e.g:

var url = 'alanshaw/followers' // a random username (of someone you should follow!)
gs(url, function(err, data) {
  console.log(data); // or what ever you want to do with the data
  if(data.next_page) {
    gs(data.next_page, function(err2, data2) {
      console.log(data2); // etc.
    })
  }
})

Following

Want to know the list of people this person is following? That's easy too! The URL format is: https://github.com/{username}/following e.g: https://github.com/iteles/following or https://github.com/nelsonic/following?page=2 (where the person is following more than 51 people ...)

Usage format is identical to followers (above) so here's an example of fetching page 3 of the results:

var gs = require('github-scraper'); // require the module
var url = 'nelsonic/following?page=3' // a random dude
gs(url, function(err, data) {
  console.log(data); // or what ever you want to do with the data
})

Sample output:

{
  entries:
   [ 'kytwb', 'dexda', 'arrival', 'jinnjuice', 'slattery', 'unixarcade', 'a-c-m', 'krosti',
   'simonmcmanus', 'jupiter', 'capaj', 'cowenld', 'FilWisher', 'tsop14', 'NataliaLKB',
   'izaakrogan', 'lynnaloo', 'nvcexploder', 'cwaring', 'missinglink', 'alanshaw', 'olizilla',
    'tancredi', 'Ericat', 'pgte', 'hyprstack', 'iteles' ],
  url: 'https://github.com/nelsonic/following?page=3',
  next_page: 'https://github.com/nelsonic/following?page=4'
}

Starred Repositories

The list of projects a person has starred is a fascinating source of insight. URL format: https://github.com/stars/{username} e.g: /stars/iteles

var gs = require('github-scraper'); // require the module
var url = 'stars/iteles';           // starred repos for this user
gs(url, function(err, data) {
  console.log(data);                // or what ever you want to do with the data
})

Sample output:

{
  entries:
   [ '/dwyl/repo-badges', '/nelsonic/learn-testling', '/joshpitzalis/testing', '/gmarena/gmarena.github.io',
    '/dwyl/alc', '/nikhilaravi/fac5-frontend', '/foundersandcoders/dossier', '/nelsonic/health', '/dwyl/alvo',
    '/marmelab/gremlins.js', '/docdis/learn-saucelabs', '/rogerdudler/git-guide', '/tableflip/guvnor',
    '/dwyl/learn-redis', '/foundersandcoders/playbook', '/MIJOTHY/FOR_FLUX_SAKE', '/NataliaLKB/learn-git-basics',
    '/nelsonic/liso', '/dwyl/learn-json-web-tokens', '/dwyl/hapi-auth-jwt2', '/dwyl/start-here',
    '/arvida/emoji-cheat-sheet.com', '/dwyl/time', '/docdis/learn-react', '/dwyl/esta', '/alanshaw/meteor-foam',
    '/alanshaw/stylist', '/meteor-velocity/velocity', '/0nn0/terminal-mac-cheatsheet',
    '/bowlingjs/bowlingjs.github.io' ],
  url: 'https://github.com/stars/iteles?direction=desc&page=2&sort=created',
  next_page: 'https://github.com/stars/iteles?direction=desc&page=3&sort=created'
}

Repositories

The second tab on the personal profile page is "Repositories"; this is a list of the personal projects the person is working on, e.g: https://github.com/iteles?tab=repositories

[screenshot: list of iteles' repositories]

We crawl this page and return an array containing the repo properties:

var url = 'iteles?tab=repositories';
gs(url, function(err, data) {
  console.log(data);  // or what ever you want to do with the data
})

sample output:

{
  entries: [
    { url: '/iteles/learn-ab-and-multivariate-testing',
      name: 'learn-ab-and-multivariate-testing',
      lang: '',
      desc: 'Tutorial on A/B and multivariate testing',
      info: '',
      stars: '4',
      forks: '0',
      updated: '2015-07-08T08:36:37Z' },
    { url: '/iteles/learn-tdd',
      name: 'learn-tdd',
      lang: 'JavaScript',
      desc: 'A brief introduction to Test Driven Development (TDD) in JavaScript',
      info: 'forked from dwyl/learn-tdd',
      stars: '0',
      forks: '4',
      updated: '2015-06-29T17:24:56Z' },
    { url: '/iteles/practical-full-stack-testing',
      name: 'practical-full-stack-testing',
      lang: 'HTML',
      desc: 'A fork of @nelsonic\'s repo to allow for PRs',
      info: 'forked from nelsonic/practical-js-tdd',
      stars: '0',
      forks: '36',
      updated: '2015-06-06T14:40:43Z' },
    { url: '/iteles/styling-for-accessibility',
      name: 'styling-for-accessibility',
      lang: '',
      desc: 'A collection of \'do\'s and \'don\'t\'s of CSS to ensure accessibility',
      info: '',
      stars: '0',
      forks: '0',
      updated: '2015-05-26T11:06:28Z' },
    { url: '/iteles/Ultimate-guide-to-successful-meetups',
      name: 'Ultimate-guide-to-successful-meetups',
      lang: '',
      desc: 'The ultimate guide to organizing successful meetups',
      info: '',
      stars: '3',
      forks: '0',
      updated: '2015-05-19T09:40:39Z' },
    { url: '/iteles/Javascript-the-Good-Parts-notes',
      name: 'Javascript-the-Good-Parts-notes',
      lang: '',
      desc: 'Notes on the seminal "Javascript the Good Parts" by Douglas Crockford',
      info: '',
      stars: '41',
      forks: '12',
      updated: '2015-05-17T16:39:35Z' }  
  ],
  url: 'https://github.com/iteles?tab=repositories' }

Activity feed

Every person on GitHub has an RSS feed for their recent activity; this is the 3rd and final tab of the person's profile page.

it can be viewed online by visiting:

https://github.com/{username}?tab=activity

e.g: /iteles?tab=activity

Parsing the Feed

The activity feed is published as an .atom XML string which contains a list of entries.

We use xml2js (which in turn uses the sax XML parser) to parse the XML stream. This results in an object similar to the following example:

{ '$':
   { xmlns: 'http://www.w3.org/2005/Atom',
     'xmlns:media': 'http://search.yahoo.com/mrss/',
     'xml:lang': 'en-US' },
  id: [ 'tag:github.com,2008:/iteles' ],
  link: [ { '$': [Object] }, { '$': [Object] } ],
  title: [ 'iteles’s Activity' ],
  updated: [ '2015-07-22T23:31:25Z' ],
  entry:
   [ { id: [Object],
       published: [Object],
       updated: [Object],
       link: [Object],
       title: [Object],
       author: [Object],
       'media:thumbnail': [Object],
       content: [Object] },
     { id: [Object],
       published: [Object],
       updated: [Object],
       link: [Object],
       title: [Object],
       author: [Object],
       'media:thumbnail': [Object],
       content: [Object] }
    ]
}

Each call to the atom feed returns the latest 30 entries. We're showing 2 here for illustration (so you get the idea...)
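
If you are curious what the module does under the hood, here is a minimal sketch of fetching and parsing the raw .atom feed yourself with xml2js (assuming the public feed lives at https://github.com/{username}.atom; error handling kept minimal):

var https  = require('https');
var xml2js = require('xml2js');

// fetch the raw atom feed for a user and parse it into the structure shown above
https.get('https://github.com/iteles.atom', function (res) {
  var xml = '';
  res.on('data', function (chunk) { xml += chunk; });
  res.on('end', function () {
    xml2js.parseString(xml, function (err, result) {
      if (err) { return console.error(err); }
      console.log(result.feed.entry.length + ' entries in the feed');
    });
  });
});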

From this we extract only the relevant info:

'2015-07-22T12:33:14Z alanshaw pushed to master at alanshaw/david-www',
'2015-07-22T12:33:14Z alanshaw created tag v9.4.3 at alanshaw/david-www',
'2015-07-22T09:23:28Z alanshaw closed issue tableflip/i18n-browserify#6',
'2015-07-21T17:08:19Z alanshaw commented on issue alanshaw/david#71',
'2015-07-21T08:24:13Z alanshaw pushed to master at tableflip/score-board',
'2015-07-20T17:49:59Z alanshaw deleted branch refactor-corp-events at tableflip/sow-api-client',
'2015-07-20T17:49:58Z alanshaw pushed to master at tableflip/sow-api-client',
'2015-07-20T17:49:58Z alanshaw merged pull request tableflip/sow-api-client#2',
'2015-07-20T17:49:54Z alanshaw opened pull request tableflip/sow-api-client#2',
'2015-07-18T07:30:36Z alanshaw closed issue alanshaw/md-tokenizer#1',
'2015-07-18T07:30:36Z alanshaw commented on issue alanshaw/md-tokenizer#1',

Instead of wasting (what will eventually be giga)bytes of space on key:value pairs by storing the entries as JSON, we store the activity feed entries as strings in an array. Each item in the array can be broken down into:

{date-time} {username} {action} {link}
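
A minimal sketch of splitting one of these strings back into its parts (the field names are just illustrative, not part of the module's API):

// split "{date-time} {username} {action} {link}" back into its fields
function parseActivity (entry) {
  var parts = entry.split(' ');
  return {
    datetime: parts[0],
    username: parts[1],
    action:   parts.slice(2, parts.length - 1).join(' '),
    link:     parts[parts.length - 1]
  };
}

console.log(parseActivity('2015-07-22T12:33:14Z alanshaw pushed to master at alanshaw/david-www'));
// { datetime: '2015-07-22T12:33:14Z', username: 'alanshaw',
//   action: 'pushed to master at', link: 'alanshaw/david-www' }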

As we can see from this there are several event types:

  • pushed to master at
  • created tag v9.4.3 at
  • opened issue
  • commented on issue
  • closed issue
  • deleted branch
  • opened pull request
  • merged pull request
  • starred username/repo-name

For now we are not going to parse the event types; we are simply going to store them in our list for later analysis.

We have a good pointer for when it's time to start interpreting the data: https://developer.github.com/v3/activity/events/types/

One thing worth noting is that the RSS feed is not real-time ... sadly, it only gets updated periodically, so we cannot rely on it to have the latest info.

Organization

Organization pages have the following url pattern: https://github.com/{orgname}
example: https://github.com/dwyl

var url = 'dwyl';
gs(url, function(err, data) {
  console.log(data); // or do something way more interesting with the data!
});

sample data (entries truncated for brevity):

{
  entries:
   [ { name: 'hapi-auth-jwt2',
       desc: 'Secure Hapi.js authentication plugin using JSON Web Tokens (JWT)',
       updated: '2015-08-04T19:30:50Z',
       lang: 'JavaScript',
       stars: '59',
       forks: '11' },
     { name: 'start-here',
       desc: 'A Quick-start Guide for People who want to DWYL',
       updated: '2015-08-03T11:04:14Z',
       lang: 'HTML',
       stars: '14',
       forks: '9' },
     { name: 'summer-2015',
       desc: 'Probably the best Summer Sun, Fun & Coding Experience in the World!',
       updated: '2015-07-31T11:02:29Z',
       lang: 'CSS',
       stars: '16',
       forks: '1' },
  ],
  website: 'http://dwyl.io',
  url: 'https://github.com/dwyl',
  name: 'dwyl - do what you love',
  desc: 'Start here: https://github.com/dwyl/start-here',
  location: 'Your Pocket',
  email: '[email protected]',
  pcount: 24,
  avatar: 'https://avatars3.githubusercontent.com/u/11708465?v=3&s=200',
  next_page: '/dwyl?page=2'
}

Note #1: sadly, this has an identical URL format to a Profile page; this gets handled by the switcher, which infers whether a page is an org or a profile by checking for a known element on the page.

Note #2: when an organization has multiple pages of repositories you will see a next_page key/value in the data e.g: /dwyl?page=2 (for the second page of repos)
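
If you want all of an org's repos in one go, a minimal sketch is to keep following next_page until it disappears (same idea as the followers example above):

var gs = require('github-scraper');

function getAllRepos (url, repos, callback) {
  gs(url, function (err, data) {
    if (err) { return callback(err); }
    repos = repos.concat(data.entries);   // accumulate this page of repos
    if (data.next_page) {                 // e.g: '/dwyl?page=2'
      return getAllRepos(data.next_page, repos, callback);
    }
    callback(null, repos);
  });
}

getAllRepos('dwyl', [], function (err, repos) {
  if (err) { return console.error(err); }
  console.log('dwyl has ' + repos.length + ' repos');
});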

Repository Stats

This is where things start getting interesting ...

[screenshot: GitHub repository page]

example: https://github.com/nelsonic/adoro

var url = 'nelsonic/adoro';
gs(url, function(err, data) {
  console.log(data); // or do something way more interesting with the data!
});

sample data:

{
  url: 'https://github.com/nelsonic/adoro',
  desc: 'The little publishing tool you\'ll love using. [work-in-progress]',
  website: 'http://www.dwyl.io/',
  watchers: 3,
  stars: 8,
  forks: 1,
  commits: 12,
  branches: 1,
  releases: 1,
  langs: [ 'JavaScript 90.7%', 'CSS 9.3%' ]
}

Annoyingly, the number of issues, pull requests and contributors is only rendered after the page has loaded (via XHR), so we do not get these three stats on page load.

Issues

Clicking on the issues icon/link in any repository takes us to the list of all the issues.

A project with more than a page's worth of issues has pagination at the bottom of the page:

[screenshot: tudo issues list showing pagination]

Which has a link to: https://github.com/dwyl/tudo/issues?page=2&q=is%3Aissue+is%3Aopen

[screenshot: second page of tudo issues]

List of issues for a repository:

var gs  = require('github-scraper');
var url = '/dwyl/tudo/issues';
gs(url, function (err, data) {
  console.log(data); // use the data how ever you like
});

sample output:

{ entries:
   [
     {
       url: '/dwyl/tudo/issues/46',
       title: 'discuss components',
       created: '2015-07-21T15:34:22Z',
       author: 'benjaminlees',
       comments: 3,
       assignee: 'izaakrogan',
       milestone: 'I don\'t know what I\'m doing',
       labels: [ 'enhancement', 'help wanted', 'question' ]
     },
     {
       url: '/dwyl/tudo/issues/45',
       title: 'Create riot components from HTML structure files',
       created: '2015-07-21T15:24:58Z',
       author: 'msmichellegar',
       comments: 2,
       assignee: 'msmichellegar',
       labels: [ 'question' ]
     }
  ], // truncated for brevity
  open: 30,
  closed: 20,
  next: '/dwyl/tudo/issues?page=2&q=is%3Aissue+is%3Aopen',
  url: '/dwyl/tudo/issues'
}

Each issue in the list would create an entry in the crawler (worker) queue:

2015-07-22T12:33:14Z issue /dwyl/tudo/issues/77
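
A minimal sketch of building those queue entries from the scraped issues list (using the created and url keys from the sample output above):

var gs = require('github-scraper');

gs('/dwyl/tudo/issues', function (err, data) {
  if (err) { return console.error(err); }
  // one "{date-time} issue {url}" string per issue, ready for a worker queue
  var queue = data.entries.map(function (issue) {
    return issue.created + ' issue ' + issue.url;
  });
  console.log(queue);
});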

Should we include the "all issues by this author" link?

Issue (individual)

The result of scraping dwyl/tudo#51

var gs  = require('github-scraper');
var url = '/dwyl/tudo/issues/51';
gs(url, function (err, data) {
  console.log(data); // use the data how ever you like
});

sample output:

{ entries:
   [ { id: 'issue-96442793',
       author: 'nelsonic',
       created: '2015-07-22T00:00:45Z',
       body: 'instead of waiting for people to perform the steps to authorise Tudo (to access their GitHub orgs/issues we could request their GitHub username on the login page and initiate the retrieval of their issues while they are authenticating... That way, by the time they get back to Tudo their issues dashboard is already pre-rendered and loaded! This is a wow-factor people won\'t be expecting and thus our app immediately delivers on our first promise!\n\nThoughts?' },
     { id: 'issuecomment-123807796',
       author: 'iteles',
       created: '2015-07-22T17:54:12Z',
       body: 'I\'d love to test this out, this will be an amazing selling point if we can get the performance to work like we expect!' },
     { id: 'issuecomment-124048121',
       author: 'nelsonic',
       created: '2015-07-23T10:20:15Z',
       body: '@iteles have you watched the Foundation Episode featuring Kevin Systrom (instagram) ?\n\n\nhttps://www.youtube.com/watch?v=nld8B9l1aRE\n\n\nWhat were the USPs that contributed to instagram\'s success (considering how many photo-related-apps were in the app store at the time) ?\n\ncc: @besarthoxhaj' },
     { id: 'issuecomment-124075792',
       author: 'besarthoxhaj',
       created: '2015-07-23T11:59:31Z',
       body: '@nelsonic love the idea! Let\'s do it!' } ],
  labels: [ 'enhancement', 'help wanted', 'question' ],
  participants: [ 'nelsonic', 'iteles', 'besarthoxhaj' ],
  url: '/dwyl/tudo/issues/51',
  title: 'Pre-fetch people\'s issues while they are authenticating with GitHub',
  state: 'Open',
  author: 'nelsonic',
  created: '2015-07-22T00:00:45Z',
  milestone: 'Minimal Usable Product',
  assignee: 'besarthoxhaj' }

By contrast, to fetch this issue using the GitHub API see: https://developer.github.com/v3/issues/#get-a-single-issue

format:

/repos/:owner/:repo/issues/:number
curl https://api.github.com/repos/dwyl/tudo/issues/51

Milestones

Milestones are used to group issues into logical units.

[screenshot: dwyl/tudo milestones]

var gs  = require('github-scraper');
var url = '/dwyl/tudo/milestones';
gs(url, function (err, data) {
  console.log(data); // use the data how ever you like
});

Sample output:

{ entries:
   [ { name: 'Test Milestone - Please Don\'t Close!',
       due: 'Past due by 16 days',
       updated: 'Last updated 5 days ago',
       desc: 'This Milestone in used in our e2e tests to check for an over-due milestone, so please don\'t close it!',
       progress: '0%',
       open: 1,
       closed: 0 },
     { name: 'Minimal Usable Product',
       due: 'Due by July  5, 2016',
       updated: 'Last updated 2 days ago',
       desc: 'What is the absolute minimum we can do to deliver value to people using the app?\n(and thus make them want to come back and use it!)',
       progress: '0%',
       open: 5,
       closed: 0 } ],
  url: 'https://github.com/dwyl/tudo/milestones',
  open: 2,
  closed: 1 }

Labels (for a repository)

All repositories have a set of standard labels (built-in to GitHub) e.g: https://github.com/dwyl/tudo/labels is (currently) only using the "standard" labels.

[screenshot: dwyl/tudo labels list]

Whereas RethinkDB (which uses GitHub for all their project tracking) uses several custom labels: https://github.com/rethinkdb/rethinkdb/labels

[screenshot: rethinkdb labels list]

We need to crawl these for each repo.

var gs  = require('github-scraper');
var url = '/dwyl/time/labels';
gs(url, function (err, data) {
  console.log(data); // use the data how ever you like
});

Here's the extraction of the standard labels:

[
  { name: 'bug',
    style: 'background-color: #fc2929; color: #fff;',
    link: '/dwyl/tudo/labels/bug',
    count: 3 },
  { name: 'duplicate',
    style: 'background-color: #cccccc; color: #333333;',
    link: '/dwyl/tudo/labels/duplicate',
    count: 0 },
  { name: 'enhancement',
    style: 'background-color: #84b6eb; color: #1c2733;',
    link: '/dwyl/tudo/labels/enhancement',
    count: 11 },
  { name: 'help wanted',
    style: 'background-color: #159818; color: #fff;',
    link: '/dwyl/tudo/labels/help%20wanted',
    count: 21 },
  { name: 'invalid',
    style: 'background-color: #e6e6e6; color: #333333;',
    link: '/dwyl/tudo/labels/invalid',
    count: 1 },
  { name: 'question',
    style: 'background-color: #cc317c; color: #fff;',
    link: '/dwyl/tudo/labels/question',
    count: 10 }
]

or a repo that has custom labels:

{ entries:
  [ { name: '[alpha]',
      style: 'background-color: #79CDCD; color: #1e3333;',
      link: '/dwyl/time/labels/%5Balpha%5D',
      count: 2 },
    { name: 'API',
      style: 'background-color: #006b75; color: #fff;',
      link: '/dwyl/time/labels/API',
      count: 11 },
    { name: 'bug',
      style: 'background-color: #fc2929; color: #fff;',
      link: '/dwyl/time/labels/bug',
      count: 5 },
    { name: 'chore',
      style: 'background-color: #e11d21; color: #fff;',
      link: '/dwyl/time/labels/chore',
      count: 9 },
    { name: 'discuss',
      style: 'background-color: #bfe5bf; color: #2a332a;',
      link: '/dwyl/time/labels/discuss',
      count: 43 },
    { name: 'Documentation',
      style: 'background-color: #eb6420; color: #fff;',
      link: '/dwyl/time/labels/Documentation',
      count: 2 },
    { name: 'duplicate',
      style: 'background-color: #cccccc; color: #333333;',
      link: '/dwyl/time/labels/duplicate',
      count: 0 },
    { name: 'enhancement',
      style: 'background-color: #84b6eb; color: #1c2733;',
      link: '/dwyl/time/labels/enhancement',
      count: 27 },
    { name: 'external dependency',
      style: 'background-color: #D1EEEE; color: #2c3333;',
      link: '/dwyl/time/labels/external%20dependency',
      count: 1 },
    { name: 'FrontEnd',
      style: 'background-color: #f7c6c7; color: #332829;',
      link: '/dwyl/time/labels/FrontEnd',
      count: 26 },
    { name: 'help wanted',
      style: 'background-color: #009800; color: #fff;',
      link: '/dwyl/time/labels/help%20wanted',
      count: 42 },
    { name: 'invalid',
      style: 'background-color: #e6e6e6; color: #333333;',
      link: '/dwyl/time/labels/invalid',
      count: 0 },
    { name: 'investigate',
      style: 'background-color: #fbca04; color: #332900;',
      link: '/dwyl/time/labels/investigate',
      count: 18 },
    { name: 'MVP',
      style: 'background-color: #207de5; color: #fff;',
      link: '/dwyl/time/labels/MVP',
      count: 27 },
    { name: 'NiceToHave',
      style: 'background-color: #fbca04; color: #332900;',
      link: '/dwyl/time/labels/NiceToHave',
      count: 7 },
    { name: 'Post MVP',
      style: 'background-color: #fef2c0; color: #333026;',
      link: '/dwyl/time/labels/Post%20MVP',
      count: 24 },
    { name: 'question',
      style: 'background-color: #cc317c; color: #fff;',
      link: '/dwyl/time/labels/question',
      count: 25 },
    { name: 'UI',
      style: 'background-color: #bfdadc; color: #2c3233;',
      link: '/dwyl/time/labels/UI',
      count: 13 } ],
 url: 'https://github.com/dwyl/time/labels' }

Issues > Search (Bonus Feature)

A much more effective way of collating all the issues relevant to a person is to search for them!

example: https://github.com/search?type=Issues&q=author%3Aiteles&state=open&o=desc&s=created
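
Usage is the same as for the other pages; a sketch (assuming the /search path is passed to the scraper like the other relative URLs shown above):

var gs  = require('github-scraper');
var url = '/search?type=Issues&q=author%3Aiteles&state=open&o=desc&s=created';
gs(url, function (err, data) {
  console.log(data); // or do something way more interesting with the data!
});

Sample output: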

{
  entries:
   [
     { title: 'Remove flexbox from CSS',
       url: '/dwyl/dwyl.github.io/issues/29',
       desc: 'To ensure the site works across all devices, particularly Kindle/e-readers.',
       author: 'iteles',
       created: '2015-07-25T22:57:20Z',
       comments: 2 },
     { title: 'CSS | Add indentation back into main.css (disappeared from master)',
       url: '/dwyl/tudo/issues/77',
       desc: 'All indentation has been removed from main.css in the latest commit.     \n\nThis needs to be put back in as originally written by @msmichellegar and @iteles.',
       author: 'iteles',
       created: '2015-07-25T16:27:59Z' },
     { title: 'CSS | Investigate styling of issue label colours',
       url: '/dwyl/tudo/issues/72',
       desc: 'Labels can be given any colour so there is no predictable set that we can code into the CSS file.\n\nWe need to investigate what the best way to ensure we can provide the right colour of background to the ...',
       author: 'iteles',
       created: '2015-07-23T17:49:02Z',
       comments: 4 }
  ],
  next: '/search?o=desc&p=2&q=author%3Aiteles&s=created&state=open&type=Issues'
}

Owner

For the issues created across all their personal repositories use a search query of the form:

https://github.com/search?q=user%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}

e.g: https://github.com/search?q=user%3Aiteles&state=open&type=Issues&s=updated&o=asc

Author (created by)

Or to find all the issues where the person is the author use a query of the following format:

https://github.com/search?q=author%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}

Assignee (issues assigned to this person)

Or to find all the issues assigned to the person use a query of the following format:

https://github.com/search?q=assignee%3A{username|org}
&state={state}
&type=Issues&s={relevance}
&o={order}

Mentions

We can use a mentions (search) query to discover all the issues where a given person (username) was mentioned:

https://github.com/search?q=mentions%3A{username}&type=Issues&state={state}

e.g: https://github.com/search?q=mentions%3Aiteles&type=Issues&state=open

This could be more than the issues in the person's (own) repos or the repos the person has access to (via an org). e.g: if Sally asks a clarifying question on a project she has not yet contributed to, the issue will not appear when we crawl the repos on her profile or orgs she has access to ...
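
The four query formats above only differ in the search qualifier, so a small helper can build them; a sketch (the parameter names simply mirror the placeholders above):

// build an issues-search url from the placeholders used in the formats above
function issueSearchUrl (qualifier, name, opts) {
  opts = opts || {};
  return '/search?q=' + encodeURIComponent(qualifier + ':' + name)
    + '&type=Issues'
    + (opts.state ? '&state=' + opts.state : '')
    + (opts.sort  ? '&s='     + opts.sort  : '')
    + (opts.order ? '&o='     + opts.order : '');
}

console.log(issueSearchUrl('mentions', 'iteles', { state: 'open' }));
// -> /search?q=mentions%3Aiteles&type=Issues&state=open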

Issues Filters

There are many filters we can use to find issues; here are a few:

Further Reading on Searching+Filters

For way more details on searching & filters see:

Want More Examples?

If you want even more examples of the pages you can scrape, take a look at our end-to-end tests where we test all the scrapers!


Future Features / Road Map ?

Crawl the List of commits

Would it be interesting to see/track:

  • who makes the most commits to the project
  • when (what time of day/night) people do their work
  • what did the person contribute? (docs, code improvement, tests, typo, dependency update?)

Show your interest in this feature: #17




Contributing?

Contributions are always welcome! We have a backlog of features (many pages we want to parse),
please see: https://github.com/nelsonic/github-scraper/issues
If anything interests you, please leave a comment on the issue.

Your first step to contributing to this project is to run it on your localhost.

1. Clone the Repository

In your terminal, clone the repository from GitHub:

git clone https://github.com/nelsonic/github-scraper.git && cd github-scraper

2. Install the Dependencies

Ensure you have Node.js installed, see https://nodejs.org
Then run the following command to install the project dependencies:

npm install

You should see output in your terminal similar to the following:

added 162 packages from 177 contributors and audited 265 packages in 4.121s

That tells you that the dependencies were successfully installed.

3. Run the Tests

In your terminal execute the following command:

npm test

You should see output similar to the following:

> [email protected] test /Users/n/code/github-scraper
> istanbul cover ./node_modules/tape/bin/tape ./test/*.js | node_modules/tap-spec/bin/cmd.js


  read list of followers for @jupiter (single page of followers)

      - - - GitHub Scraper >> /jupiter/followers >> followers  - - -
    โœ” jupiter/followers data.type: followers
    โœ” @jupiter/followers has 34 followers
    โœ” Nelson in jupiter/followers
    โœ” @jupiter/followers only has 1 page of followers

  read list of followers for @iteles (multi-page)

      - - - GitHub Scraper >> /iteles/followers >> followers  - - -
    โœ” "followers": 51 on page 1
    โœ” iteles/followers multi-page followers


... etc ...

=============================================================================
Writing coverage object [/Users/n/code/github-scraper/coverage/coverage.json]
Writing coverage reports at [/Users/n/code/github-scraper/coverage]
=============================================================================
    =============================== Coverage summary ===============================
    Statements   : 100% ( 192/192 )
    Branches     : 100% ( 63/63 )
    Functions    : 100% ( 22/22 )
    Lines        : 100% ( 192/192 )
    ================================================================================


  total:     102
  passing:   102
  duration:  31.6s

The tests take around 30 seconds to run on my localhost, but your test execution time will vary depending on your location (the further you are from GitHub's servers the slower the tests will run...).

Don't panic if you see some red in your terminal while the tests are running. We have to simulate 404 and 403 failure responses to ensure that we can handle them. Pages sometimes disappear, e.g: a user leaves GitHub or deletes a project, and our script needs to not freak out when that happens. This is good practice in DOM parsing; the web changes a lot!

When the tests pass on your localhost, you know everything is working as expected. Time to move on to the fun bit!

Note: This project follows Test Driven Development (TDD) because it's the only way we can maintain our sanity ... If we didn't have tests it would be chaos and everything would "break" all the time. If you are contributing to the project, please be aware that tests are required and any Pull Requests without tests will not be considered. (please don't take it personally, it's just a rule we have).

If you are new to TDD, please see: github.com/dwyl/learn-tdd
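
To give a flavour of what such a test looks like, here is a minimal sketch using tape (the test runner visible in the output above); the assertions and the require are illustrative only:

var test = require('tape');
var gs   = require('github-scraper'); // inside this repo the tests require the local lib instead

test('profile page scraper returns the expected keys', function (t) {
  gs('/iteles', function (err, data) {
    t.error(err, 'no error scraping the profile page');
    t.equal(data.type, 'profile', 'data.type is "profile"');
    t.ok(data.followers >= 0, 'followers count is present');
    t.end();
  });
});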

4. Pick an Issue and Write Some Code!

Once you have the project running on your localhost, it's time to pick a page to parse!

There are a bunch of features in the backlog. see: https://github.com/nelsonic/github-scraper/issues

Pick one that interests you and write a comment on it to show your interest in contributing.

Travis-CI?

We use Travis-CI (Continuous Integration) to ensure that our code works and all tests pass whenever a change is made to the code. This is essential in any project and even more so in a DOM-parsing one.

If you are new to Travis-CI, please see: github.com/dwyl/learn-travis

Pre-Commit Hook?

When you attempt to commit code on your localhost, the tests will run before your commit will register. This is a precaution to ensure that the code we write is always tested. There is no point writing code that is not being tested as it will "break" almost immediately and be unmaintainable.

Simply wait a few seconds for the tests to pass and then push your work to GitHub.

If you are new to pre-commit hooks, please see: github.com/dwyl/learn-pre-commit
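
For reference, with the pre-commit npm module this is just a key in package.json listing the scripts to run before each commit; a sketch (not necessarily this project's exact configuration):

{
  "scripts": {
    "test": "istanbul cover ./node_modules/tape/bin/tape ./test/*.js | node_modules/tap-spec/bin/cmd.js"
  },
  "pre-commit": [ "test" ]
}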




tl;dr

If you are the kind of person that likes to understand how something works, this is your section.

Inferring Which Scraper to use from the URL

lib/switcher.js handles the inference. We wanted to use a switch/case construct but ended up using if/else, because there are two types of checks we need to do, so if/else seemed simpler.
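
A much-simplified sketch of that kind of if/else inference (not the project's actual code, just the idea):

// simplified illustration of url-based inference (see lib/switcher.js for the real thing)
function whichScraper (url) {
  if (url.indexOf('/followers') !== -1 || url.indexOf('/following') !== -1) {
    return 'followers';        // followers + following share a parser
  } else if (url.indexOf('/issues') !== -1) {
    return 'issues';
  } else if (url.indexOf('/labels') !== -1) {
    return 'labels';
  } else if (url.split('?')[0].split('/').filter(Boolean).length === 1) {
    return 'profile-or-org';   // same url shape; decided by a known element on the page
  } else {
    return 'repo';
  }
}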

Interesting Facts

  • GitHub has 10.3 Million users (at last count)
  • yet the most-followed person, Linus Torvalds, "only" has 28k followers (so it's a highly distributed network)

Research

We must read up on http://en.wikipedia.org/wiki/Inverted_index to understand how to use: https://www.npmjs.org/package/level-inverted-index

Useful Links

GitHub Stats API

Example:

curl -v https://api.github.com/users/pgte/followers
[
  {
    "login": "methodmissing",
    "id": 379,
    "avatar_url": "https://avatars.githubusercontent.com/u/379?v=2",
    "gravatar_id": "",
    "url": "https://api.github.com/users/methodmissing",
    "html_url": "https://github.com/methodmissing",
    "followers_url": "https://api.github.com/users/methodmissing/followers",
    "following_url": "https://api.github.com/users/methodmissing/following{/other_user}",
    "gists_url": "https://api.github.com/users/methodmissing/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/methodmissing/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/methodmissing/subscriptions",
    "organizations_url": "https://api.github.com/users/methodmissing/orgs",
    "repos_url": "https://api.github.com/users/methodmissing/repos",
    "events_url": "https://api.github.com/users/methodmissing/events{/privacy}",
    "received_events_url": "https://api.github.com/users/methodmissing/received_events",
    "type": "User",
    "site_admin": false
  },

etc...]

Issues (with using the) GitHub API:

  • The API only returns 30 results per query.
  • X-RateLimit-Limit: 60 (we can only make 60 requests per hour). 1440 queries per day (60 per hour x 24 hours) sounds ample on the surface, but if we assume the average person has at least 2 pages' worth of followers (>30), a single instance/server can only track 720 people, which is not really enough to do any sort of trend analysis. 😞 If we are tracking people with hundreds of followers (and growing fast), e.g. >300 followers, the number of users we can track comes down to 1440 / 10 = 144 people (10 requests to fetch the complete list of followers); we burn through 1440 requests pretty quickly. (See the sketch after this list.)
  • There's no guarantee which order the followers will be in (e.g. most recent first?)
  • Results are cached, so they are not real-time like they are on the web (seems daft, but it's true). Ideally they would have a Streaming API, but sadly GitHub is built in Ruby-on-Rails, which is "RESTful" (not real-time).
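
To make that arithmetic concrete, here's the back-of-the-envelope calculation referenced in the list above:

// back-of-the-envelope: how many people can one server track via the API per day?
var requestsPerHour = 60;                        // X-RateLimit-Limit
var requestsPerDay  = requestsPerHour * 24;      // 1440
var pagesPerPerson  = 2;                         // ~2 pages of followers (30 per page)
console.log(requestsPerDay / pagesPerPerson);    // 720 people
var pagesForPopular = 10;                        // someone with 300+ followers
console.log(requestsPerDay / pagesForPopular);   // 144 people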

But...

Once we know who we should be following, we can use the API to check whether one person is following another,

e.g:

curl -v https://api.github.com/users/pgte/following/visionmedia




FAQ?

Is Crawling a Website Legal...?

The fact that scraping or "crawling" is Google's Business Model suggests that scraping is at least "OK" ...

Started typing this into Google and saw the autocomplete suggestions for "is it illegal to ..."

I read a few articles and was not able to locate a definitive answer ...
