• Stars
    star
    305
  • Rank 136,879 (Top 3 %)
  • Language
    Python
  • License
    GNU General Publi...
  • Created almost 9 years ago
  • Updated over 1 year ago

Repository Details

Elegant and Easy Tweet Preprocessing in Python

Preprocessor

Preprocessor is a preprocessing library for tweet data written in Python. Machine learning systems built on tweets and other text data need a preprocessing step, both because the raw data is noisy and because cleaning it reduces the dimensionality of the input.

This library makes it easy to clean, parse or tokenize tweets, so you don't have to write the same helper functions over and over again.

Features

Currently supports cleaning, tokenizing and parsing:

  • URLs
  • Hashtags
  • Mentions
  • Reserved words (RT, FAV)
  • Emojis
  • Smileys
  • Numbers
  • JSON and .txt file support

Preprocessor v0.6.0 supports Python 3.4+ on Linux, macOS and Windows. Tests run on the following setups:

Linux Xenial with Python 3.4.8, 3.5.6, 3.6.7, 3.7.1, 3.8.0, 3.8.3+
macOS with Python 3.7.5, 3.8.0
Windows with Python 3.5.4, 3.6.8

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
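
The placeholder substitution shown above can be approximated with a few regular expressions. This is a minimal sketch using only the standard library, not Preprocessor's actual implementation: the patterns are deliberately simplified, and the name `tokenize_sketch` is hypothetical.

```python
import re

# Simplified, assumed patterns; the library's real ones cover many more cases.
# URLs come first so that a '#' or '@' inside a URL is not tokenized separately.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("HASHTAG", re.compile(r"#\w+")),
    ("MENTION", re.compile(r"@\w+")),
]

def tokenize_sketch(text):
    """Replace each matched entity with a $NAME$ placeholder."""
    for name, pattern in PATTERNS:
        text = pattern.sub("$" + name + "$", text)
    return text

print(tokenize_sketch("Preprocessor is #awesome https://github.com/s/preprocessor"))
# Preprocessor is $HASHTAG$ $URL$
```

Keeping placeholders instead of deleting the entities preserves the information that a tweet contained a URL or hashtag, which can be a useful feature for a downstream model.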

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
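
The start and end indices in the parse result are plain character offsets into the tweet. As an illustration of where the numbers 25 and 58 come from, here is a standard-library sketch; the `Match` tuple, `URL_PATTERN` and `find_urls` are hypothetical names, and the pattern is a simplification rather than the one the library uses.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the library's match objects.
Match = namedtuple("Match", ["start_index", "end_index", "match"])

# Simplified URL pattern, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")

def find_urls(text):
    """Return every URL in text together with its character offsets."""
    return [Match(m.start(), m.end(), m.group())
            for m in URL_PATTERN.finditer(text)]

urls = find_urls("Preprocessor is #awesome https://github.com/s/preprocessor")
print(urls[0].start_index, urls[0].end_index)  # 25 58
```

As with Python slices, the end index is exclusive, so `text[25:58]` yields the URL itself.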

Fully customizable:

>>> p.set_options(p.OPT.URL, p.OPT.EMOJI)
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'

By default, Preprocessor applies all of the available options; call set_options to restrict it to the ones you specify.
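
One way to picture the "all options unless told otherwise" behavior is a filter over a table of patterns. The sketch below is an assumption-laden illustration, not the library's code: the option names are plain strings, the patterns are simplified, and `clean_sketch` is a hypothetical function.

```python
import re

# Hypothetical option names mapped to simplified patterns.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
}

def clean_sketch(text, options=None):
    """Strip matches for the selected options; every option when none is given."""
    for name, pattern in PATTERNS.items():
        if options is None or name in options:
            text = pattern.sub("", text)
    # Collapse the whitespace left behind by removed entities.
    return " ".join(text.split())

tweet = "Preprocessor is #awesome https://github.com/s/preprocessor"
print(clean_sketch(tweet))                   # Preprocessor is
print(clean_sketch(tweet, options=["URL"]))  # Preprocessor is #awesome
```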

Processing files:

Preprocessor currently supports processing .json and .txt files. See the examples below for the correct input format.

Example JSON file

[
    "Preprocessor now supports files. https://github.com/s/preprocessor",
    "#preprocessing is a crucial part of @ML projects.",
    "@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl"
]

Example Text file

Preprocessor now supports files. https://github.com/s/preprocessor
#preprocessing is a crucial part of @ML projects.
@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl
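
The two input formats above differ only in how the tweets are delimited: a JSON array of strings versus one tweet per line. A minimal sketch of reading both with the standard library might look like this; `load_tweets` is a hypothetical helper and does not reflect how Preprocessor reads files internally.

```python
import json
from pathlib import Path

def load_tweets(path):
    """Return a list of tweet strings from a .json or .txt file."""
    path = Path(path)
    if path.suffix == ".json":
        # Expected layout: a JSON array of strings.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if path.suffix == ".txt":
        # Expected layout: one tweet per line; blank lines are skipped.
        with path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    raise ValueError("unsupported file type: " + path.suffix)
```

Either way the result is a plain list of strings, so the same cleaning step can be applied regardless of the source format.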

Preprocessing JSON file:

# JSON example
>>> input_file_name = "sample_json.json"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json

Preprocessing text file:

# Text file example
>>> input_file_name = "sample_txt.txt"
>>> p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])
Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt

Available Options:

Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER

Installation

Using pip:

$ pip install tweet-preprocessor

Using Anaconda:

$ conda install -c saidozcan tweet-preprocessor

Using manual installation:

$ python setup.py build
$ python setup.py install

Contributing

Want to contribute to Preprocessor? That's great! Please follow the steps below:

  1. Create a bug report or a feature idea using the templates on the Issues page.
  2. Fork the repository and make your changes.
  3. Open a PR and make sure your PR has tests and all the checks pass.
  4. And that's all!
