
Python Pushshift.io API Wrapper (for comment/submission search)


THIS REPOSITORY IS STALE - Please consider using PMAW instead, as that tool is actively maintained

Detailed documentation for PSAW is available at: https://psaw.readthedocs.io/en/latest/

Installation

pip install psaw

Description

A minimalist wrapper for searching public reddit comments/submissions via the pushshift.io API.

Pushshift is an extremely useful resource, but the API is poorly documented. As such, this API wrapper is currently designed to make it easy to pass pretty much any search parameter the user wants to try.

Although it is not necessarily reflective of the API's current status, you should familiarize yourself with the Pushshift API documentation to better understand which search arguments are likely to work.

Features

  • Handles rate limiting and exponential backoff subject to maximum retries and maximum backoff limits. A minimum rate limit of 1 request per second is used as a default per consultation with Pushshift's maintainer, /u/Stuck_in_the_matrix.
  • Handles paging of results. Returns all historical results for a given query by default.
  • Optionally integrates with praw to fetch full objects from reddit after retrieving ids from pushshift.
  • If not using praw, returns results as comment and submission objects whose API is similar to the corresponding praw objects. Result objects also carry a .d_ attribute that offers dict access to the associated data attributes.
  • Optionally adds a created attribute that converts a comment/submission's created_utc timestamp to the user's local time (may raise exceptions for users with certain timezone settings).
  • Simple interface to pass query arguments to the API. The API is sparsely documented, so it's often fruitful to just try an argument and see if it works.
  • A stop_condition argument to make it simple to stop yielding results given arbitrary user-defined criteria.
  • Command-line interface (CLI) for simplified usage outside of a Python environment.
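
The rate-limiting feature above can be sketched as capped exponential backoff. The base delay, cap, and retry count below are illustrative assumptions, not psaw's actual defaults:

```python
def backoff_delays(base=1.0, cap=16.0, max_retries=5):
    """Yield the wait time (seconds) before each retry attempt.

    Illustrative sketch of capped exponential backoff; the parameter
    values are assumptions, not psaw's actual defaults.
    """
    for attempt in range(max_retries):
        # Delay doubles each attempt, but never exceeds the cap.
        yield min(base * 2 ** attempt, cap)

print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap keeps a long outage from producing unboundedly long sleeps while still spacing out retries.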

WARNINGS

  • Using a non-default sort may result in unexpected behavior.
  • The default behavior is to hit the pushshift API continuously. If a query is taking longer than expected to return results, psaw may be pulling more data than you want, or may be caught in some kind of loop.
  • I strongly recommend prototyping queries by printing results to stdout to ensure you're getting the desired behavior.
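
One way to follow that advice is a small preview helper that prints and collects only the first few items from any generator before committing to a full pull. `preview` is a hypothetical helper for illustration, not part of psaw:

```python
def preview(gen, n=5):
    """Print and collect the first n items from a (possibly huge) generator.

    Hypothetical helper for prototyping queries; works with any iterator,
    including the generators returned by psaw's search methods.
    """
    results = []
    for i, item in enumerate(gen):
        print(i, item)
        results.append(item)
        if i + 1 >= n:
            # Stop early so a runaway query doesn't keep hitting the API.
            break
    return results

# e.g.: preview(api.search_comments(q='OP', subreddit='askreddit'), n=5)
```

If the first few results look wrong, you can adjust the query without having waited on a full historical search.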

Demo usage (python)

from psaw import PushshiftAPI

api = PushshiftAPI()

Or to use pushshift search to fetch ids and then use praw to fetch objects:

import praw
from psaw import PushshiftAPI

r = praw.Reddit(...)
api = PushshiftAPI(r)

100 most recent submissions

# The `search_comments` and `search_submissions` methods return generator objects
gen = api.search_submissions(limit=100)
results = list(gen)

First 10 submissions to /r/politics in 2017, filtering results to url/author/title/subreddit fields.

The created_utc field will be added automatically (it's used for paging).

import datetime as dt

start_epoch=int(dt.datetime(2017, 1, 1).timestamp())

list(api.search_submissions(after=start_epoch,
                            subreddit='politics',
                            filter=['url','author', 'title', 'subreddit'],
                            limit=10))
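
Note that dt.datetime(2017, 1, 1).timestamp() interprets a naive datetime in your local timezone, so the epoch above shifts with the machine running it. If you want the boundary pinned to UTC, attach a timezone explicitly:

```python
import datetime as dt

# A naive datetime is interpreted in local time; attaching timezone.utc
# pins the boundary to midnight UTC regardless of the local machine.
start_epoch = int(dt.datetime(2017, 1, 1, tzinfo=dt.timezone.utc).timestamp())
print(start_epoch)  # 1483228800
```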

Trying a search argument that doesn't actually work

According to the pushshift.io API documentation, we should be able to search submissions by url, but (at the time of this writing) this doesn't actually work in practice. The API should still respect the limit argument and possibly other supported arguments, but no guarantees. If you find that an argument you have passed is not supported by the API, the best course is to remove it from the query and restrict your call to supported arguments, to mitigate the risk of unexpected behavior.

url = 'http://www.politico.com/story/2017/02/mike-flynn-russia-ties-investigation-235272'
url_results = list(api.search_submissions(url=url, limit=500))

len(url_results), any(r.url == url for r in url_results)
# 500, False

All AskReddit comments containing the text "OP"

Use the q parameter to search text. Omitting the limit parameter performs a full historical search. Requests are performed in batches of the size specified by the max_results_per_request parameter (default=500). Omitting the max_response_cache test in the demo below will return all results; otherwise, this demo will perform two API requests returning 500 comments each. Alternatively, the generator can be queried for additional results.

gen = api.search_comments(q='OP', subreddit='askreddit')

max_response_cache = 1000
cache = []

for c in gen:
    cache.append(c)

    # Omit this test to actually return all results. Wouldn't recommend it though: could take a while, but you do you.
    if len(cache) >= max_response_cache:
        break

# If you really want to: pick up where we left off to get the rest of the results.
if False:
    for c in gen:
        cache.append(c)

Using the aggs argument to summarize search results

When an aggs parameter is provided to a search method, the first result yielded by the generator will contain the aggs result.

api = PushshiftAPI()
gen = api.search_comments(author='nasa', aggs='subreddit')
next(gen)
#  {'subreddit': [
#    {'doc_count': 300, 'key': 'IAmA'},
#    {'doc_count': 6, 'key': 'space'},
#    {'doc_count': 1, 'key': 'ExposurePorn'},
#    {'doc_count': 1, 'key': 'Mars'},
#    {'doc_count': 1, 'key': 'OldSchoolCool'},
#    {'doc_count': 1, 'key': 'news'},
#    {'doc_count': 1, 'key': 'pics'},
#    {'doc_count': 1, 'key': 'reddit.com'}]}
len(list(gen)) # 312

Using the redditor_subreddit_activity convenience method

If you want to profile a redditor's activity as in the aggs example, the redditor_subreddit_activity method provides a simple shorthand: it profiles a user by the subreddits in which they are active, counting comments and submissions separately in a single call, and returns Counter objects for commenting and posting activity, respectively.

api = PushshiftAPI()
result = api.redditor_subreddit_activity('nasa')
result
#{'comment':
#   Counter({
#      'ExposurePorn': 1,
#      'IAmA': 300,
#      'Mars': 1,
#      'OldSchoolCool': 1,
#      'news': 1,
#      'pics': 1,
#      'reddit.com': 1,
#      'space': 6}),
# 'submission':
#   Counter({
#      'IAmA': 3,
#      'ISS': 1,
#      'Mars': 1,
#      'space': 3,
#      'u_nasa': 86})}
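
Since the returned values are ordinary collections.Counter objects, ranking a redditor's most active subreddits is a single most_common call. The counts below are copied from the example output above rather than fetched live:

```python
from collections import Counter

# Comment counts copied from the example redditor_subreddit_activity output.
comment_activity = Counter({'IAmA': 300, 'space': 6, 'ExposurePorn': 1,
                            'Mars': 1, 'OldSchoolCool': 1, 'news': 1,
                            'pics': 1, 'reddit.com': 1})

# Top two subreddits by comment count.
print(comment_activity.most_common(2))  # [('IAmA', 300), ('space', 6)]
```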

Using the stop_condition argument to get the most recent submission by a bot account

gen = api.search_submissions(stop_condition=lambda x: 'bot' in x.author)

for subm in gen:
    pass

print(subm.author)

Collecting results in a pandas.DataFrame for analysis

import pandas as pd

df = pd.DataFrame([thing.d_ for thing in gen])
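
Since pushshift timestamps are unix epochs, converting the created_utc column to pandas datetimes is a one-liner. The rows below are fabricated stand-ins so the snippet runs without hitting the API; real code would build the frame from [thing.d_ for thing in gen] as above:

```python
import pandas as pd

# Fabricated stand-in rows; real code would use [thing.d_ for thing in gen].
df = pd.DataFrame([{'author': 'a', 'created_utc': 1483228800},
                   {'author': 'b', 'created_utc': 1483315200}])

# Unix epochs -> pandas datetimes, enabling time-based filtering/grouping.
df['created'] = pd.to_datetime(df['created_utc'], unit='s')
print(df['created'].dt.year.tolist())  # [2017, 2017]
```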

Special Convenience Attributes

Consider the following simple query:

gen = api.search_submissions(subreddit='pushshift')
thing = next(gen)

Special attributes:

  • thing.d_: a dict containing all of the data attributes attached to the thing (which otherwise would be accessed via dot notation). One specific convenience this enables is pushing results into a pandas dataframe (above).
  • api.metadata_: the metadata provided by pushshift (if any) from the most recent successful request. The most useful metadata attributes, IMHO, are:
    • api.metadata_.get('shards') - For checking if any shards are down, which can impact the result cardinality.
    • api.metadata_.get('total_results') - The database-side count of how many total items were found in the query and should be returned after paging through all results. Users have encountered rare edge cases that don't return all expected results, probably due to more than 500 items sharing the same timestamp in a result range. See issue #47 for progress resolving this behavior.
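
A defensive pattern is to check the shard counts after a request. The sketch below assumes the shards entry follows Elasticsearch's {'total': ..., 'successful': ..., 'failed': ...} shape; shards_ok is a hypothetical helper, not part of psaw:

```python
def shards_ok(metadata):
    """Return True if every shard responded to the query.

    Hypothetical helper; assumes Elasticsearch-style shard counts
    ({'total': ..., 'successful': ..., 'failed': ...}) in the metadata dict.
    """
    shards = (metadata or {}).get('shards')
    if not shards:
        # No shard info reported; nothing to flag.
        return True
    return shards.get('successful') == shards.get('total')

# e.g. after a request:
#   if not shards_ok(api.metadata_):
#       print('warning: some shards are down; results may be incomplete')
```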

Demo usage (CLI)

For CLI documentation, run

psaw --help

License

PSAW's source is provided under the Simplified BSD License.

  • Copyright (c), 2018, David Marx
