  • Stars: 143
  • Rank: 257,007 (Top 6 %)
  • Language: HTML
  • Created over 6 years ago
  • Updated over 1 year ago

Repository Details

Extract a citation network from Google Scholar

Étudier in Action

étudier is a small Python program that uses Selenium, requests-html and networkx to drive a non-headless browser to collect a citation graph around a particular Google Scholar citation or set of search results. The resulting network is written out as GEXF and GraphML files as well as an HTML file that includes a D3 network visualization (pictured above).

If you are wondering why it uses a non-headless browser, it's because Google is quite protective of this data and will routinely ask you to solve a captcha (identifying street signs, cars, etc. in photos) to prove you are not a bot. étudier allows you to complete these captcha tasks when they occur and then continues on its way collecting data. You need a browser you can interact with in order to do your part.

Install

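You'll need to install ChromeDriver before doing anything else. If you use Homebrew on macOS this is as easy as the following (recent Homebrew versions replaced the older brew cask install form with the --cask flag):

brew install --cask chromedriver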

Then you'll want to install Python 3 and:

pip3 install etudier

Run

To use étudier you first need to navigate to a page on Google Scholar that you are interested in, for example the page of citations that reference Sherry Ortner's Theory in Anthropology since the Sixties. Then you start etudier up pointed at that page:

% etudier 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc='
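
The starting URL is an ordinary query string, so it can be inspected before it is handed to etudier. The sketch below uses Python's standard urllib.parse to pull out the cites parameter. Treating that value as the identifier of the cited publication is an assumption based on how the URL above is built, and cites_id is a hypothetical helper, not part of étudier.

```python
from urllib.parse import urlsplit, parse_qs


def cites_id(url):
    """Return the value of the `cites` query parameter, or None if absent."""
    params = parse_qs(urlsplit(url).query)
    values = params.get("cites")
    return values[0] if values else None


url = ("https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005"
       "&sciodt=0,21&cites=17950649785549691519&scipsc=")
print(cites_id(url))
```

A URL without a cites parameter makes the helper return None, which is one way to tell the two kinds of starting page apart.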

If you are interested in starting with keyword search results in Google Scholar you can do that too. For example, here is the URL for searching for "cscw memory", as if I were interested in papers that talk about the CSCW conference and memory:

% etudier 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG='

Note: it's important to quote the URL so that the shell doesn't interpret the ampersands as an attempt to background the process.
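
If the command line is assembled by a script rather than typed by hand, the quoting can be left to Python's standard shlex.quote. This is a minimal sketch: shell_command is a hypothetical helper name, and it assumes only that etudier is on the PATH and takes the URL as its final argument, as in the examples above.

```python
import shlex


def shell_command(url):
    """Return an etudier command line that is safe to paste into a shell."""
    # shlex.quote wraps the URL in single quotes because it contains
    # shell metacharacters such as & and ?.
    return "etudier " + shlex.quote(url)


url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG="
print(shell_command(url))
```

When the program is launched from Python directly, passing the arguments as a list to subprocess.run avoids the shell altogether, so no quoting is needed in that case.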

--pages

By default étudier will collect the 10 citations on that page and then look at the top 10 citations that reference each one. So you will end up with no more than 100 citations being collected (10 on each page * 10 citations).

If you would like to get more than one page of results, use the --pages option. For example, this would result in no more than 400 (20 * 20) results being collected:

% etudier --pages 2 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc=' 

--depth

Finally, if you would like to look at the citations of the citations, use the --depth parameter:

% etudier --depth 2 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc='

This will collect the initial set of 10 citations, the top 10 citations for each, and then the top 10 citations of each of those, so no more than 1000 citations (10 * 10 * 10). The count is an upper bound because there is certain to be some cross-citation duplication.
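
The three figures quoted above (100, 400 and 1000) all follow one pattern, which the sketch below writes out. It assumes 10 results per Google Scholar page and a default depth of 1, both inferred from the examples rather than taken from the étudier source, and max_citations is a hypothetical helper name.

```python
def max_citations(pages=1, depth=1, per_page=10):
    """Upper bound on collected citations, following the README's arithmetic."""
    per_level = per_page * pages      # citations read from each result list
    return per_level ** (depth + 1)   # starting list plus `depth` levels of citing lists


print(max_citations())            # default run
print(max_citations(pages=2))     # --pages 2
print(max_citations(depth=2))     # --depth 2
```

Combining both options grows the bound quickly: under the same assumptions, --pages 2 with --depth 2 gives 20 * 20 * 20 = 8000, although the README itself does not state a figure for that case.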

--output

By default output.gexf, output.graphml and output.html files will be written to the current working directory, but you can change this with the --output option to control the prefix that is used. The output files will contain rudimentary metadata collected from Google Scholar including:

  • id - the cluster identifier assigned by Google
  • url - the url for the publication
  • title - the title of the publication
  • authors - a comma separated list of the publication authors
  • year - the year of publication
  • cited-by - the number of other publications that cite the publication
  • cited-by-url - a Google Scholar URL for the list of citing publications
  • modularity - the modularity value obtained from community detection
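
The attributes listed above can be read back from the GraphML file with nothing more than Python's standard library. The following is a sketch under stated assumptions: the attribute names are taken from the list above and may be spelled differently in a real output file, the SAMPLE document and its titles are made up for illustration, and read_nodes and most_cited are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Made-up two-node document standing in for a real output.graphml.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="title" attr.type="string"/>
  <key id="d1" for="node" attr.name="cited-by" attr.type="long"/>
  <graph edgedefault="directed">
    <node id="a"><data key="d0">First paper</data><data key="d1">12</data></node>
    <node id="b"><data key="d0">Second paper</data><data key="d1">40</data></node>
    <edge source="a" target="b"/>
  </graph>
</graphml>"""


def read_nodes(root):
    """Return one dict of attributes per <node> element."""
    # GraphML declares attributes as <key id=... attr.name=...> and refers
    # to them by id inside each <data key=...> element.
    names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
        if key.get("for") == "node"
    }
    nodes = []
    for node in root.findall(".//g:node", GRAPHML_NS):
        attrs = {"id": node.get("id")}
        for data in node.findall("g:data", GRAPHML_NS):
            key = data.get("key")
            attrs[names.get(key, key)] = data.text
        nodes.append(attrs)
    return nodes


def most_cited(nodes, attr="cited-by", n=5):
    """Return the n nodes with the largest citation count."""
    def count(node):
        try:
            return int(node.get(attr) or 0)
        except ValueError:
            return 0  # treat missing or non-numeric counts as zero
    return sorted(nodes, key=count, reverse=True)[:n]


nodes = read_nodes(ET.fromstring(SAMPLE))
for node in most_cited(nodes):
    print(node["cited-by"], node["title"])
```

To run this against a real file, replace ET.fromstring(SAMPLE) with ET.parse("output.graphml").getroot(), adjusting the file name if the --output prefix was changed.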

Features of HTML/D3 output

  • Node color shows its citation group
  • Node size shows how many times it has been cited
  • Click node to open its source website
  • Draggable nodes
  • Zoom and pan
  • Double-click to center node
  • Resizable window
  • Text labels
  • Hover to highlight 1st-order neighborhood
  • Click and press node to fade surroundings

More Repositories

  1. anon (JavaScript, 962 stars): tweet about anonymous Wikipedia edits from particular IP address ranges
  2. pymarc (Python, 252 stars): process MARC records from Python
  3. wikistream (JavaScript, 233 stars): displays edit activity on wikipedia
  4. microdata (Python, 165 stars): python library for extracting html microdata
  5. feediverse (Python, 102 stars): Send RSS/Atom feeds to Mastodon
  6. wikichanges (JavaScript, 71 stars): a NodeJS library for monitoring changes on Wikipedia sites
  7. rdflib-microdata (Python, 53 stars): an rdflib plugin to parse html5 microdata
  8. dedoop (Python, 48 stars): recursively deduplicate a directory and write its contents to a new directory while remembering the old paths
  9. opensearch (Python, 44 stars): A python opensearch client
  10. wikipulse (JavaScript, 41 stars): a gauge widget to display wikipedia activity
  11. linkypedia (Python, 37 stars): a web based tool to monitor how your website content is used in wikipedia
  12. wikidata_suggest (Python, 29 stars): a CLI suggestion tool for Wikidata entities
  13. lod-graph (JavaScript, 26 stars): A protovis visualization of the linked open data cloud.
  14. fondz (JavaScript, 25 stars): fondz is a tool for auto-generating an "archival description" from a bag or series of bags.
  15. bagweb (Shell, 24 stars): mirror a website, put it in a bag
  16. wikigeo (CoffeeScript, 22 stars): JavaScript library for getting geojson from the Wikipedia API
  17. dflat (Python, 19 stars): an implementation of the dflat and redd specifications from CDL for versioning of digital objects
  18. www-wikipedia (Perl, 18 stars): Simple Perl client for grabbing content out of Wikipedia
  19. earls (JavaScript, 18 stars): display urls being tweeted with an event hashtag
  20. ici (JavaScript, 16 stars): Edit Wikipedia Pages Near You
  21. nytimestream (JavaScript, 14 stars): NYTimes Newswire API as a stream using node.js
  22. wikitweets (JavaScript, 14 stars): see tweets that reference wikipedia articles
  23. geonames-localsolr (Python, 12 stars): A little project to help bootstrap a local-solr instance with geonames data.
  24. empirical-cloud (JavaScript, 12 stars): a little demo visualization of owl:sameAs links in billion triple challenge data
  25. itstimeihadsometimealone (Python, 11 stars): Some code to examine and modify your experience of Twitter.
  26. warc (Go, 11 stars): warc library for golang
  27. ptree (Python, 11 stars): minimal PairTree implementation
  28. botnet-retweets (Python, 11 stars): Exploring retweets by Twitter bot-nets.
  29. xkcd2347 (Python, 11 stars): Get dependencies for a project on GitHub.
  30. skosdict (Python, 11 stars): turn a SKOS concept scheme into a simple JSON dictionary
  31. spn (Jupyter Notebook, 10 stars): Playing around with SavePageNow data.
  32. dewey-crawler (Python, 10 stars): simplistic crawler and serializer for linked data at dewey.info
  33. creepy-polaroid (10 stars): display an image for where you are using HTML, JavaScript and Google
  34. paperbot (Python, 9 stars): Twitter bot for Chronicling America
  35. wikitrends (JavaScript, 9 stars): see most viewed wikipedia articles
  36. whrss (Python, 9 stars): scrape White House Blog to generate RSS until it starts working again
  37. saa-glossary (Python, 9 stars): structured data scraped from A Glossary of Archival and Records Terminology
  38. storycorps-meta (Python, 8 stars): collect public storycorps metadata and save as json-ld
  39. ocropy (Python, 8 stars): minimalist wrapper around ocropus for generating hOCR documents from images
  40. bisac (Python, 8 stars): top level BISAC subject vocabulary
  41. multiverse (HTML, 8 stars): A JavaScript library for writing generative text in HTML.
  42. lcco (Python, 8 stars): Converts a textual representation of the Library of Congress Classification Outline into SKOS/RDF and makes it available on the Web in a hierarchical viewer.
  43. cscw-pandoc (TeX, 8 stars): Turn your Pandoc Markdown into a CSCW PDF
  44. trump-archive-wayback (Python, 7 stars): A dataset of Trump Tweet IDs and their Wayback Machine information
  45. metaweb (JavaScript, 7 stars): get metadata for a web page
  46. lcsh-subset (Python, 7 stars): create a subset view of LCSH
  47. opinions (Python, 7 stars): watch SCOTUS opinions for URLs
  48. dev8d-linked-data (Python, 7 stars): some experiments with linked data available from the dev8d conference
  49. fastcat (Python, 7 stars): navigate wikipedia categories quickly in a local redis instance
  50. libweb (Python, 7 stars): extract library homepage urls from LIBWEB
  51. wplinks (Python, 7 stars): utility to get a list of Wikipedia articles that point at a particular website
  52. mediatypes (Python, 6 stars): A project that harvests media type information from the IANA registry, and publishes information as linked data using the Google App Engine.
  53. notebooks (Jupyter Notebook, 6 stars): Some random Jupyter notebooks.
  54. lastcloud (JavaScript, 6 stars): imperfect html/javascript hack to look up musicians you like on soundcloud
  55. json2xml (Python, 6 stars): simplistic json -> xml converter
  56. europeana-crawler (Python, 6 stars): a simple crawler of the RDFa in Europeana
  57. highscores (JavaScript, 6 stars): Displays retro arcade game highscores for original cataloging performed today using OCLC's Worldcat Live API.
  58. dpla-map (JavaScript, 6 stars): a simple pure html/javascript DPLA/GoogleMap mashup
  59. anselm (JavaScript, 6 stars): A Visual Studio Code plugin for qualitative coding using Markdown
  60. alto-words (Python, 6 stars): simplistic calculation of the ratio of dictionary words to all words in a METS Alto OCR file
  61. metatweet (Python, 5 stars): A bot for monitoring the structure of JSON in tweets from the sample stream.
  62. lc-findingaids (JavaScript, 5 stars)
  63. luckysocial (Python, 5 stars): googles for a name and looks for social links in the first result
  64. summoner (Python, 5 stars): work with the Serial Solutions Summon API from Python
  65. chronam-widget (JavaScript, 5 stars): view on NDNP content using just HTML/JavaScript and the Chronicling America API
  66. wikipediarevs (Python, 5 stars): A commandline utility for downloading the revision history for one or more Wikipedia articles.
  67. google-count (Python, 5 stars): hack to count google hits
  68. google-the-poem (JavaScript, 5 stars): An epic poem generated using Google auto-complete
  69. lastweet (Python, 5 stars): Update Twitter & Mastodon with your LastFM history.
  70. jekyll-wikidata (CSS, 5 stars): A Jekyll plugin for Wikidata.
  71. data-gov-uk-harvester (Python, 5 stars): tiny little project to harvest rdfa metadata from data.gov.uk
  72. congresseditors (CoffeeScript, 5 stars): the code that runs the @congresseditors twitter bot
  73. aotycmp (Python, 5 stars): hack to see what well reviewed albums-of-the-year are available on Spotify and Rdio
  74. lochief (5 stars): A linked-data version of kochief
  75. voyage (CoffeeScript, 4 stars): display a stream of circulation activity for a Voyager ILS
  76. vine-tweets (Jupyter Notebook, 4 stars): Working with the Vine-Tweets dataset.
  77. diary (TypeScript, 4 stars): Silly GPT-n experiment
  78. skos_wikidata (Python, 4 stars): match a SKOS concept scheme to Wikidata from the command line
  79. fakepremis (Python, 4 stars): fake premis event twitter bot
  80. subjects-here (JavaScript, 4 stars): An HTML5 experiment that uses OCLC's mapFast to lookup subjects for your current location.
  81. zhang-webarchiving (4 stars): Notes for my talk about Web Archiving to Jane Zhang's Digital Curation class.
  82. muldicat (Python, 4 stars): tool to generate SKOS for the Multilingual Dictionary of Cataloging Terms and Concepts
  83. worksvenn (Python, 4 stars): generate a Venn diagram for LibraryThing, OCLC and OpenLibrary FRBRization services
  84. wikicites (CoffeeScript, 4 stars): get a stream of recent citations from wikipedia
  85. journos (Jupyter Notebook, 4 stars): simple example of looking for journalists in twitter stream
  86. ead-finder (JavaScript, 4 stars): use Google to find public EAD XML documents
  87. papvc-topicmodel (Jupyter Notebook, 4 stars)
  88. id (Python, 4 stars): LCSH SKOS webapp
  89. databib-metadata (Shell, 4 stars): example html/metadata examples for databib
  90. oai2pairtree (Python, 4 stars): command line utility to dump records in an oai-pmh repository as xml in a pairtree
  91. congressedits-archive (JavaScript, 4 stars): a snapshot of the @congressedits twitter archive
  92. jackattack (Python, 3 stars): Collecting & visualizing tweets directed at @jack and @yoyoel
  93. apostrophe (Python, 3 stars): "But I got the crystal ball", he said, and held it to the light.
  94. emailz (Python, 3 stars): turn mboxen into rdf, and visualize w/ d3
  95. warc-analyzer (JavaScript, 3 stars): A client-side web application for analyzing WARC data.
  96. imls-cdx (Jupyter Notebook, 3 stars): working files, data, notebooks for museum group at Archives Unleashed DC
  97. inst341 (Jupyter Notebook, 3 stars)
  98. webarchives (Python, 3 stars): see if a URL is available in a web archive somewhere on the web
  99. americanarchivist (Python, 3 stars): metadata extractor for The American Archivist
  100. json-intro (3 stars): Short, gentle introductions to JSON for the aspiring programmer.