  • Stars: 127
  • Rank 282,790 (Top 6%)
  • Language: Python
  • Created over 5 years ago
  • Updated about 1 year ago

Repository Details

Create a Gephi Citation Graph based on Text Analysis of PDFs from Zotero

Create a Citation Graph based on Simplistic Text Analysis

Inspired by A.R. Siders' R Script from this ResearchGate question

Based on dpapathanasiou's example script for pdfminer

Takes Zotero article collections exported as .csv and creates Gephi-compatible edge and node files for a graph based on citations

Principle:

  • Let A be a set of known articles
  • For any a in A, let title_a be its title and text_a be its text content
  • For any x in A and y in A with x != y:
    • cites(x,y) is true if title_y appears in text_x

For the above to work, titles and texts are normalized (punctuation, whitespace, and special characters are removed), and the script assumes that title_y appears in text_x only if it is listed in the references section of x.

Usage:

  1. Export the list of articles from Zotero as .csv (the articles should have file attachments)
  2. Run analyze_papers.py zotero_file.csv
  3. The script should produce two files, Edges_titles.csv and Nodes_titles.csv, in the folder "gephi"
  4. Load them into Gephi with "Load Spreadsheet"
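
To illustrate what the two output files might contain, here is a hedged sketch that writes a node table and an edge table with Python's csv module. Gephi's spreadsheet import recognizes the column names Id and Label for nodes and Source, Target, and Type for edges; whether analyze_papers.py uses exactly this layout is an assumption, and write_gephi_files is a hypothetical helper, not a function from the repository.

```python
import csv
from pathlib import Path


def write_gephi_files(titles, edges, out_dir="gephi"):
    """Write Nodes_titles.csv and Edges_titles.csv for Gephi's spreadsheet import.

    `titles` is a list of article titles; `edges` is a list of
    (citing_title, cited_title) pairs. The column layout is an assumption.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Give every article a numeric id so edges can refer to nodes compactly.
    ids = {title: index for index, title in enumerate(titles)}

    nodes_path = out / "Nodes_titles.csv"
    with nodes_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "Label"])
        for title, node_id in ids.items():
            writer.writerow([node_id, title])

    edges_path = out / "Edges_titles.csv"
    with edges_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Type"])
        for citing, cited in edges:
            # A citation points from the citing article to the cited one.
            writer.writerow([ids[citing], ids[cited], "Directed"])

    return nodes_path, edges_path
```

Calling write_gephi_files(["Paper A", "Paper B"], [("Paper A", "Paper B")]) would create the "gephi" folder if needed and write one directed edge from node 0 to node 1.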

Notes

  • Tested with Python 3
  • Uses the library pdfminer
  • You can set the number of processes the script uses to parse the PDFs with the parameter --processes (default value is 4)
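
As a rough sketch of how a --processes option can drive parallel PDF parsing, the example below combines argparse with multiprocessing.Pool. Only the flag name and its default of 4 come from the notes above; the positional argument name, the functions build_parser and parse_all, and the placeholder worker extract_text_stub (standing in for the real pdfminer-based extraction) are assumptions made for illustration.

```python
import argparse
from multiprocessing import Pool


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface: one Zotero CSV file plus an optional process count."""
    parser = argparse.ArgumentParser(
        description="Build a citation graph from a Zotero CSV export."
    )
    # The positional argument name is hypothetical.
    parser.add_argument("zotero_csv", help="path to the .csv file exported from Zotero")
    parser.add_argument(
        "--processes",
        type=int,
        default=4,
        help="number of worker processes used to parse the PDFs (default: 4)",
    )
    return parser


def extract_text_stub(pdf_path: str) -> tuple[str, str]:
    """Placeholder worker; the real script would extract text with pdfminer here."""
    return pdf_path, ""


def parse_all(pdf_paths: list[str], processes: int) -> list[tuple[str, str]]:
    """Parse all PDFs in parallel and return (path, text) pairs in input order."""
    # Call this from under `if __name__ == "__main__":` on platforms that
    # start worker processes with "spawn" (Windows, macOS).
    with Pool(processes=processes) as pool:
        return pool.map(extract_text_stub, pdf_paths)
```

With this layout, build_parser().parse_args(["zotero_file.csv", "--processes", "8"]) yields a namespace whose processes attribute is 8, which can then be passed to parse_all.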