Create a Citation Graph based on Simplistic Text Analysis
Inspired by A.R. Siders' R Script from this ResearchGate question
Based on dpapathanasiou's example script for pdfminer
Takes a Zotero article collection exported as .csv and creates Gephi-compatible files for graph edges and nodes based on citations
Principle:
- Let A be a set of known articles
- For any a in A, let title_a be its title, and text_a be its text content
- For any x and y in A with x != y, cites(x, y) is true if title_y appears in text_x
For this to work, the script normalizes the text (removing punctuation, whitespace, and special characters) and assumes that title_y appears in text_x only if it is listed in the references section of x.
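The principle above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the function names `normalize` and `find_citations` are hypothetical, and the exact normalization (lowercasing and keeping only ASCII letters and digits) is an assumption.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase and keep only ASCII letters and digits (assumed normalization)."""
    # Decompose accented characters so that "é" becomes "e" plus a combining mark,
    # which the regular expression below then removes.
    decomposed = unicodedata.normalize("NFKD", text)
    return re.sub(r"[^a-z0-9]", "", decomposed.lower())


def find_citations(articles):
    """Return (x, y) pairs where the normalized title of y occurs in the text of x.

    `articles` maps an article key to a (title, text) tuple.
    """
    normalized = {
        key: (normalize(title), normalize(text))
        for key, (title, text) in articles.items()
    }
    edges = []
    for x, (_, text_x) in normalized.items():
        for y, (title_y, _) in normalized.items():
            # Skip self-citations and empty titles (an empty string matches anything).
            if x != y and title_y and title_y in text_x:
                edges.append((x, y))
    return edges


articles = {
    "a": ("Graph Theory: An Introduction", "We study trees."),
    "b": ("Trees in Practice", "As shown in Graph theory - an\nintroduction, trees matter."),
}
print(find_citations(articles))  # [('b', 'a')]
```

Because the match is a plain substring test on normalized text, very short or generic titles can produce false positives, which is the limitation the assumption above refers to.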
Usage:
- Export the list of articles from Zotero as a .csv file (the articles should have file attachments)
- Run: analyze_papers.py zotero_file.csv
- The script should produce two files, Edges_titles.csv and Nodes_titles.csv, in the folder "gephi"
- Load them into Gephi with "Load Spreadsheet"
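For orientation, the two output files could be written roughly as follows. This is a sketch under assumptions: the helper name `write_gephi_files` is hypothetical, and the column headers (`Id`/`Label` for nodes, `Source`/`Target`/`Type` for edges) are the ones Gephi's spreadsheet importer recognizes, not necessarily the ones the script emits.

```python
import csv
import os


def write_gephi_files(titles, edges, out_dir="gephi"):
    """Write a node table and an edge table as CSV files.

    `titles` maps a node id to an article title; `edges` is a list of
    (citing_id, cited_id) pairs.
    """
    os.makedirs(out_dir, exist_ok=True)
    nodes_path = os.path.join(out_dir, "Nodes_titles.csv")
    edges_path = os.path.join(out_dir, "Edges_titles.csv")

    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])  # assumed header names
        for node_id, title in titles.items():
            writer.writerow([node_id, title])

    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type"])  # assumed header names
        for source, target in edges:
            # A citation points from the citing article to the cited one.
            writer.writerow([source, target, "Directed"])

    return nodes_path, edges_path
```

Calling `write_gephi_files({"a": "Paper A", "b": "Paper B"}, [("b", "a")])` would create the folder "gephi" with one node row per article and one directed edge from "b" to "a".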
Notes:
- Tested with Python 3
- Uses the library pdfminer
- The --processes parameter sets the number of processes the script uses to parse the PDFs (default: 4)
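The command-line interface and the parallel PDF parsing described above might look like the following sketch. The function names `build_parser`, `extract_text`, and `parse_all` are hypothetical, and `extract_text` is only a placeholder for the pdfminer-based extraction, which is not reproduced here.

```python
import argparse
from multiprocessing import Pool


def build_parser():
    """Command-line interface: one positional CSV file and an optional --processes."""
    parser = argparse.ArgumentParser(
        description="Build a citation graph from a Zotero CSV export."
    )
    parser.add_argument("zotero_csv", help="CSV file exported from Zotero")
    parser.add_argument(
        "--processes",
        type=int,
        default=4,
        help="number of worker processes used to parse the PDFs",
    )
    return parser


def extract_text(pdf_path):
    """Placeholder for the real text extraction; returns (path, text)."""
    # Assumption: the actual script would call pdfminer here.
    return pdf_path, ""


def parse_all(pdf_paths, processes):
    """Extract text from every PDF using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # pool.map preserves input order; each result is a (path, text) pair.
        return dict(pool.map(extract_text, pdf_paths))
```

With this layout, `build_parser().parse_args(["zotero_file.csv", "--processes", "8"])` would request eight workers, and omitting the option falls back to four. On platforms that start workers with "spawn", `parse_all` should be called from under an `if __name__ == "__main__":` guard.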