Repository Details

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern

Pattern is a web mining module for Python. It has tools for:

  • Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
  • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
  • Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
  • Network Analysis: graph centrality and visualization.

It is well documented, thoroughly tested with 350+ unit tests and comes bundled with 50+ examples. The source code is licensed under BSD.

Example

This example trains a classifier on adjectives mined from Twitter using Python 3. First, tweets that contain the hashtag #win or #fail are collected. For example: "$20 tip off a sweet little old lady today #win". Each word is then tagged with its part of speech, and only the adjectives are kept. Each tweet is transformed to a vector, a dictionary of adjective → count items, labeled WIN or FAIL. The classifier uses the vectors to learn which other tweets look more like WIN or more like FAIL.

from pattern.web import Twitter
from pattern.en import tag
from pattern.vector import KNN, count

twitter, knn = Twitter(), KNN()

for i in range(1, 3):
    for tweet in twitter.search('#win OR #fail', start=i, count=100):
        s = tweet.text.lower()
        p = 'WIN' if '#win' in s else 'FAIL'  # label taken from the hashtag
        v = tag(s)  # [(word, part-of-speech tag), ...]
        v = [word for word, pos in v if pos == 'JJ']  # JJ = adjective
        v = count(v)  # {'sweet': 1}
        if v:  # skip tweets without adjectives
            knn.train(v, type=p)

print(knn.classify('sweet potato burger'))
print(knn.classify('stupid autocorrect'))
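
The vector step can also be tried without network access. The following is a minimal standard-library sketch of the same idea: each text becomes a dictionary of word counts, and a new text gets the label of its most similar training vectors. It is not Pattern's implementation. The names to_vector, cosine and TinyKNN and the toy training data are assumptions made for illustration, and part-of-speech tagging is replaced by adjectives picked by hand.

```python
from collections import Counter
from math import sqrt


def to_vector(words):
    """Return a {word: count} dictionary, analogous to the count() step above."""
    return dict(Counter(words))


def cosine(v1, v2):
    """Cosine similarity between two sparse {word: count} vectors."""
    dot = sum(n * v2.get(w, 0) for w, n in v1.items())
    norm1 = sqrt(sum(n * n for n in v1.values()))
    norm2 = sqrt(sum(n * n for n in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


class TinyKNN:
    """Hypothetical k-nearest-neighbor classifier over sparse vectors."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (vector, label) pairs

    def train(self, vector, label):
        self.examples.append((vector, label))

    def classify(self, vector):
        # Rank training examples by similarity, then take a majority vote
        # among the k most similar ones.
        ranked = sorted(self.examples, key=lambda e: cosine(vector, e[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0] if votes else None


# Toy training data (invented): adjectives already extracted from each text.
knn = TinyKNN(k=1)
knn.train(to_vector(["sweet", "little", "old"]), "WIN")
knn.train(to_vector(["stupid", "broken"]), "FAIL")

win_label = knn.classify(to_vector(["sweet"]))
fail_label = knn.classify(to_vector(["stupid"]))
print(win_label, fail_label)
```

With k=1 the label of the single closest training vector is returned. A larger k smooths over noisy examples but needs more training data than this toy set provides.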

Installation

Pattern supports Python 2.7 and Python 3.6. To install Pattern so that it is available in all your scripts, unzip the download and from the command line do:

cd pattern-3.6
python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

pip install pattern

If none of the above works, you can make Python aware of the module in three ways:

  • Put the pattern folder in the same folder as your script.
  • Put the pattern folder in the standard location for modules so it is available to all scripts:
    • c:\python36\Lib\site-packages\ (Windows),
    • /Library/Python/3.6/site-packages/ (Mac OS X),
    • /usr/lib/python3.6/site-packages/ (Unix).
  • Add the location of the module to sys.path in your script, before importing it:
MODULE = '/users/tom/desktop/pattern'
import sys
if MODULE not in sys.path:
    sys.path.append(MODULE)
from pattern.en import parsetree
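
The third option can be wrapped in a small helper so that the path is added only once, followed by a check that the package can be located. This is a sketch using only the standard library. The function name add_module_path and the example folder are assumptions, not part of Pattern.

```python
import importlib.util
import sys


def add_module_path(folder):
    """Append folder to sys.path unless it is already listed."""
    if folder not in sys.path:
        sys.path.append(folder)
    return folder in sys.path


# Hypothetical location; adjust it to where the download was unzipped.
MODULE = '/users/tom/desktop/pattern'
add_module_path(MODULE)
add_module_path(MODULE)  # second call is a no-op

# find_spec returns None when the top-level package cannot be located.
available = importlib.util.find_spec('pattern') is not None
print('pattern importable:', available)
```

If the check prints False, the folder added to sys.path most likely does not contain the pattern package itself.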

Documentation

For documentation and examples see the user documentation.

Version

3.6

License

BSD, see LICENSE.txt for further details.

Reference

De Smedt, T., Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031–2035.

Contribute

The source code is hosted on GitHub and contributions or donations are welcomed.

Bundled dependencies

Pattern is bundled with the following data sets, algorithms and Python packages:

  • Brill tagger, Eric Brill
  • Brill tagger for Dutch, Jeroen Geertzen
  • Brill tagger for German, Gerold Schneider & Martin Volk
  • Brill tagger for Spanish, trained on Wikicorpus (Samuel Reese & Gemma Boleda et al.)
  • Brill tagger for French, trained on Lefff (Benoît Sagot & Lionel Clément et al.)
  • Brill tagger for Italian, mined from Wiktionary
  • English pluralization, Damian Conway
  • Spanish verb inflection, Fred Jehle
  • French verb inflection, Bob Salita
  • Graph JavaScript framework, Aslak Hellesoy & Dave Hoover
  • LIBSVM, Chih-Chung Chang & Chih-Jen Lin
  • LIBLINEAR, Rong-En Fan et al.
  • NetworkX centrality, Aric Hagberg, Dan Schult & Pieter Swart
  • spelling corrector, Peter Norvig

Acknowledgements

Authors:

  • Tom De Smedt
  • Walter Daelemans

Contributors (chronological):

  • Frederik De Bleser
  • Jason Wiener
  • Daniel Friesen
  • Jeroen Geertzen
  • Thomas Crombez
  • Ken Williams
  • Peteris Erins
  • Rajesh Nair
  • F. De Smedt
  • Radim Řehůřek
  • Tom Loredo
  • John DeBovis
  • Thomas Sileo
  • Gerold Schneider
  • Martin Volk
  • Samuel Joseph
  • Shubhanshu Mishra
  • Robert Elwell
  • Fred Jehle
  • Antoine Mazières + fabelier.org
  • Rémi de Zoeten + closealert.nl
  • Kenneth Koch
  • Jens Grivolla
  • Fabio Marfia
  • Steven Loria
  • Colin Molter + tevizz.com
  • Peter Bull
  • Maurizio Sambati
  • Dan Fu
  • Salvatore Di Dio
  • Vincent Van Asch
  • Frederik Elwert
