• This repository has been archived on 15/May/2024

Live-scraping pastebin to fight boredom.

pastebin-scraper

This is a multithreaded scraping script for Pastebin. It scrapes the main site for new pastes, downloads their raw content, and processes them according to a user-defined output format.
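To make the "multithreaded" part concrete, here is a minimal sketch of a download-worker pool built on the standard library. It is not the script's actual code: `fetch_raw`, `download_worker`, and `scrape` are hypothetical names, and the download itself is stubbed out so the example runs without network access.

```python
import queue
import threading


def fetch_raw(paste_id):
    # Hypothetical stand-in for the HTTP request that would fetch raw paste content.
    return f"raw content of {paste_id}"


def download_worker(todo, results):
    # Each worker pulls paste IDs from the shared queue until it sees the sentinel.
    while True:
        paste_id = todo.get()
        if paste_id is None:  # sentinel: no more work for this worker
            break
        results.put((paste_id, fetch_raw(paste_id)))


def scrape(paste_ids, workers=2):
    todo, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=download_worker, args=(todo, results))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for paste_id in paste_ids:
        todo.put(paste_id)
    for _ in threads:  # one sentinel per worker
        todo.put(None)
    for thread in threads:
        thread.join()
    # All workers have finished, so draining the result queue is safe here.
    collected = {}
    while not results.empty():
        paste_id, content = results.get()
        collected[paste_id] = content
    return collected
```

Calling `scrape(["a", "b"], workers=2)` returns a dictionary that maps each paste ID to its (stubbed) content.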

WHY?

Fun.

Installation

The usual dance.

pip install -r requirements.txt

Define all required settings in settings.ini. If you decide to go with a database output, make sure the respective connector is installed. At the moment, MySQL (via pymysql) and SQLite (via Python 3's built-in sqlite3 module) are supported.

Also note that the file output creates a subdirectory named output and dumps every paste into it as a separate file.
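As a rough illustration of the file output, the sketch below writes one file per paste into an output subdirectory. The function name `write_paste`, the `.txt` extension, and the use of the paste ID as the file name are assumptions made for the example, not details taken from the script.

```python
from pathlib import Path


def write_paste(paste_id, content, base_dir=".", encoding="utf-8"):
    # Create the "output" subdirectory on first use.
    out_dir = Path(base_dir) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: one file per paste, named after the paste ID.
    path = out_dir / f"{paste_id}.txt"
    path.write_text(content, encoding=encoding)
    return path
```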

Settings

ini is a highly underrated file format. Here are some definitions of what the settings parameters actually do.

GENERAL

  • PasteLimit: Stop after having scraped n pastes. Set to 0 for indefinite scraping
  • PBLink: URL to Pastebin or another equivalent site
  • DownloadWorkers: Number of workers that download the raw paste content and further process it
  • NewPasteCheckInterval: Time to wait before checking the main site for new pastes again
  • IPBlockedWaitTime: Time to wait until checking the main site again after the scraper's IP has been blocked
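The GENERAL parameters above could be read with Python's configparser roughly as follows. This is a sketch, not the script's own loader: the sample values are made up, the wait times are assumed to be in seconds, and `load_general` and `limit_reached` are hypothetical helpers.

```python
import configparser

# Illustrative values only; the units of the two wait times are an assumption.
SAMPLE_SETTINGS = """
[GENERAL]
PasteLimit = 0
PBLink = https://pastebin.com
DownloadWorkers = 2
NewPasteCheckInterval = 60
IPBlockedWaitTime = 3600
"""


def load_general(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["GENERAL"]
    return {
        "paste_limit": general.getint("PasteLimit"),
        "pb_link": general.get("PBLink"),
        "download_workers": general.getint("DownloadWorkers"),
        "new_paste_check_interval": general.getint("NewPasteCheckInterval"),
        "ip_blocked_wait_time": general.getint("IPBlockedWaitTime"),
    }


def limit_reached(scraped_count, paste_limit):
    # A PasteLimit of 0 means "scrape indefinitely", so the limit is never reached.
    return paste_limit != 0 and scraped_count >= paste_limit
```

Note that configparser matches option names case-insensitively by default, so `PasteLimit` and `pastelimit` refer to the same key.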

LOGGING

  • RotationLog: Location of log file that contains debug output
  • MaxRotationSize: Size in bytes before another log file is created
  • RotationBackupCount: Maximum number of log files to keep
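These three parameters line up naturally with the standard library's RotatingFileHandler. The sketch below assumes that mapping (RotationLog to the file name, MaxRotationSize to maxBytes, RotationBackupCount to backupCount); the logger name, the format string, and `build_logger` itself are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(rotation_log, max_rotation_size, rotation_backup_count):
    logger = logging.getLogger("pastebin-scraper-sketch")
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(
        rotation_log,
        maxBytes=max_rotation_size,         # roll over once the file reaches this size
        backupCount=rotation_backup_count,  # number of rolled-over files to keep
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With RotatingFileHandler, backupCount counts the rolled-over files kept in addition to the file currently being written.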

STDOUT / FILE

  • Enable: Enable formatted output of paste data (to stdout or to files, depending on the section)
  • ContentDisplayLimit: Maximum number of characters to show before content is cut off (0 to display all)
  • ShowName: Display the paste name
  • ShowLang: Display the paste language
  • ShowLink: Display the complete paste link
  • ShowData: Display the raw paste content
  • DataEncoding: Encoding of the raw paste data
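How these switches might combine is sketched below. The labels ("Name:", "Lang:", "Link:"), the trailing "..." on truncated content, and the `format_paste` function are all assumptions for illustration; the script's real output layout may differ.

```python
def format_paste(paste, content_display_limit=0, show_name=True, show_lang=True,
                 show_link=True, show_data=True, data_encoding="utf-8"):
    # `paste` is assumed to be a dict with "name", "lang", "link" and raw bytes in "data".
    lines = []
    if show_name:
        lines.append(f"Name: {paste['name']}")
    if show_lang:
        lines.append(f"Lang: {paste['lang']}")
    if show_link:
        lines.append(f"Link: {paste['link']}")
    if show_data:
        text = paste["data"].decode(data_encoding, errors="replace")
        # A limit of 0 means "display everything".
        if content_display_limit > 0 and len(text) > content_display_limit:
            text = text[:content_display_limit] + "..."
        lines.append(text)
    return "\n".join(lines)
```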

MYSQL

  • Enable: Enable MySQL output
  • TableName: Main table name to insert data into
  • Host: MySQL server host
  • Port: MySQL server port
  • Username: MySQL server user
  • Password: User password
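For orientation, the connection-related settings above could be turned into keyword arguments for `pymysql.connect` along these lines. This is only a sketch: `mysql_connect_kwargs` and `connect` are hypothetical helpers, and no database name is passed because none appears in the settings listed here.

```python
def mysql_connect_kwargs(section):
    # `section` is any mapping holding the MYSQL settings as strings,
    # for example a configparser section.
    return {
        "host": section["Host"],
        "port": int(section["Port"]),
        "user": section["Username"],
        "password": section["Password"],
    }


def connect(section):
    # pymysql is a third-party package; it is imported lazily so the helper
    # above stays usable even where pymysql is not installed.
    import pymysql

    return pymysql.connect(**mysql_connect_kwargs(section))
```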

SQLITE

  • Enable: Enable SQLite output
  • Filename: Filename the db should be saved as (usually ends with .db)
  • TableName: Main table name to insert data into
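A minimal sketch of what SQLite output could look like with the built-in sqlite3 module follows. The table schema (name, lang, link, data) and both function names are assumptions for the example, not the script's actual layout.

```python
import sqlite3


def open_db(filename, table_name):
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn = sqlite3.connect(filename)
    # Hypothetical schema, chosen only for this example.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" '
        "(id INTEGER PRIMARY KEY, name TEXT, lang TEXT, link TEXT, data TEXT)"
    )
    return conn


def insert_paste(conn, table_name, paste):
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn.execute(
        f'INSERT INTO "{table_name}" (name, lang, link, data) VALUES (?, ?, ?, ?)',
        (paste["name"], paste["lang"], paste["link"], paste["data"]),
    )
    conn.commit()
```

Passing ":memory:" as the filename gives a throwaway in-memory database, which is handy for trying this out.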

If you use this thing for some cool data analysis or even research, let me know if I can help!

Inspiration for this scraper was taken from here.