  • Language
    Python
  • License
    Apache License 2.0
  • Created almost 10 years ago
  • Updated 7 months ago

Repository Details

talon

Mailgun library to extract message quotations and signatures.

If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotation and signature parser should be like 😄

Usage

Here’s how you initialize the library and extract a reply from a text message:

import talon
from talon import quotations

talon.init()

text = """Reply

-----Original Message-----

Quote"""

# Generic entry point: pass the content type explicitly
reply = quotations.extract_from(text, 'text/plain')
# Equivalent shortcut for plain text
reply = quotations.extract_from_plain(text)
# reply == "Reply"
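
To give a feel for what extracting a reply involves, here is a minimal sketch that cuts a plain-text message at the first line that looks like a quotation header. It is not talon's implementation: the function name extract_reply_sketch and the two patterns in SPLITTER_PATTERNS are assumptions made for illustration, and real-world messages need far more rules than this.

```python
import re

# Hypothetical splitter patterns, chosen for illustration only.
# They are NOT the rules talon uses internally.
SPLITTER_PATTERNS = [
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$"),
]


def extract_reply_sketch(text):
    """Return the text above the first line that looks like quoted content."""
    reply_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        is_header = any(p.match(stripped) for p in SPLITTER_PATTERNS)
        # Stop at a quotation header or at a ">"-prefixed quoted line
        if is_header or stripped.startswith(">"):
            break
        reply_lines.append(line)
    return "\n".join(reply_lines).strip()


sample = "Reply\n\n-----Original Message-----\n\nQuote"
print(extract_reply_sketch(sample))
```

A message with no recognizable header is returned unchanged (apart from surrounding whitespace), which is the safe default for a sketch like this.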

To extract a reply from html:

html = """Reply
<blockquote>

  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;[email protected]&gt; wrote:
  </div>

  <div>
    Quote
  </div>

</blockquote>"""

reply = quotations.extract_from(html, 'text/html')
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"

Often the best way is the easiest one. Here’s how you can extract a signature from an email message without any fancy machine learning:

from talon.signature.bruteforce import extract_signature


message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
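
The idea behind a delimiter-based approach can be sketched in a few lines. The code below is an illustration under stated assumptions, not the code in talon.signature.bruteforce: the name split_signature_sketch, the SIGNATURE_DELIMITER pattern, and the MAX_SIGNATURE_LINES limit are all hypothetical.

```python
import re

# Assumed delimiter: a line made only of two or more dashes or underscores.
SIGNATURE_DELIMITER = re.compile(r"^\s*(--+|__+)\s*$")
# Hypothetical limit: only the last few lines are searched for a signature.
MAX_SIGNATURE_LINES = 10


def split_signature_sketch(message):
    """Split a message into (body, signature) at a delimiter line near the end.

    Returns (message, None) when no delimiter line is found.
    """
    lines = message.splitlines()
    start = max(0, len(lines) - MAX_SIGNATURE_LINES)
    for i in range(start, len(lines)):
        if SIGNATURE_DELIMITER.match(lines[i]):
            body = "\n".join(lines[:i]).strip()
            signature = "\n".join(lines[i:]).strip()
            return body, signature
    return message, None


body, sig = split_signature_sketch("Wow. Awesome!\n--\nBob Smith")
print(repr(body), repr(sig))
```

Restricting the search to the last few lines keeps a dashed line in the middle of a long message from being mistaken for a signature delimiter.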

The brute-force approach is quick and works like a charm 90% of the time. For the other 10%, you can use the power of machine learning algorithms:

import talon
# don't forget to init the library first
# it loads machine learning classifiers
talon.init()

from talon import signature


message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# signature == "John Doe\nvia mobile"

For machine learning, talon currently uses the scikit-learn library to build SVM classifiers. The core of the machine learning algorithm lies in the talon.signature.learning package. It defines a set of features to apply to a message (featurespace.py), how data sets are built (dataset.py), and the classifier’s interface (classifier.py).
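
As a rough illustration of what applying a set of features to a message can mean, the sketch below turns the last few non-empty lines of a message into rows of 0/1 values. The feature functions, the FEATURES list, and build_feature_rows are hypothetical and do not reproduce featurespace.py; rows of this shape are simply the kind of numeric input a classifier could be trained on.

```python
import re


# Hypothetical binary features, for illustration only.
def has_email(line):
    return bool(re.search(r"\S+@\S+", line))


def has_phone(line):
    return bool(re.search(r"\+?\d[\d\s().-]{6,}\d", line))


def has_url(line):
    return bool(re.search(r"https?://|www\.", line))


def is_short(line):
    return 0 < len(line.strip()) <= 25


def is_delimiter(line):
    return bool(re.match(r"^\s*[-_]{2,}\s*$", line))


FEATURES = [has_email, has_phone, has_url, is_short, is_delimiter]


def build_feature_rows(message, last_n=3):
    """Apply every feature to each of the last `last_n` non-empty lines."""
    lines = [line for line in message.splitlines() if line.strip()]
    return [[int(f(line)) for f in FEATURES] for line in lines[-last_n:]]


message = "Thanks\n--\nJohn Doe\njohn@example.com\n+1 555 123 4567"
for row in build_feature_rows(message):
    print(row)
```

Each row has one column per feature, so adding or removing a feature changes the shape of the data and any model trained on it has to be rebuilt.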

Currently, the training data is taken from our personal email conversations and from the ENRON dataset. By applying our set of features to this data, we produce two files, classifier and train.data, which contain no personal information but can be used to load the trained classifier. These files should be regenerated every time the feature set or data set changes.

To regenerate the model files, you can run

python train.py

or

from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
from talon.signature.learning.classifier import train, init
train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

Open-source Dataset

Recently we started a forge project to create an open-source, annotated dataset of raw emails. In the project we used a subset of ENRON data, cleansed of private, health and financial information by EDRM. At the moment, over 190 emails are annotated. Contributions and collaboration on the project are welcome. Once the dataset is ready, we plan to start using it for talon.

Training on your dataset

talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the forge project does. Then do:

from talon.signature.learning.dataset import build_extraction_dataset
from talon.signature.learning import classifier as c

build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")

Note that for signature extraction you need only the folder of positive samples with annotated signature lines (the P folder).

Research

The library is inspired by the following research papers and projects:
