• This repository has been archived on 06/Mar/2019
  • Stars
    star
    784
  • Rank 58,032 (Top 2 %)
  • Language
    Python
  • License
    Other
  • Created almost 9 years ago
  • Updated over 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Extract structured data from ingredient phrases using conditional random fields

CRF Ingredient Phrase Tagger

This repo contains scripts to extract the Quantity, Unit, Name, and Comments from unstructured ingredient phrases. We use it on Cooking to format incoming recipes. Given the following input:

1 pound carrots, young ones if possible
Kosher salt, to taste
2 tablespoons sherry vinegar
2 tablespoons honey
2 tablespoons extra-virgin olive oil
1 medium-size shallot, peeled and finely diced
1/2 teaspoon fresh thyme leaves, finely chopped
Black pepper, to taste

Our tool produces something like:

{
    "qty":     "1",
    "unit":    "pound"
    "name":    "carrots",
    "other":   ",",
    "comment": "young ones if possible",
    "input":   "1 pound carrots, young ones if possible",
    "display": "<span class='qty'>1</span><span class='unit'>pound</span><span class='name'>carrots</span><span class='other'>,</span><span class='comment'>young ones if possible</span>",
}

We use a conditional random field model (CRF) to extract tags from labelled training data, which was tagged by human news assistants. We wrote about our approach on the New York Times Open blog. More information about CRFs can be found here.

On a 2012 Macbook Pro, training the model takes roughly 30 minutes for 130k examples using the CRF++ library.

Development

On OSX:

brew install crf++
python setup.py install

Quick Start

The most common usage is to train the model with a subset of our data, test the model against a different subset, then visualize the results. We provide a shell script to do this, at:

./roundtrip.sh

You can edit this script to specify the size of your training and testing set. The default is 20k training examples and 2k test examples.

Usage

Training

To train the model, we must first convert our input data into a format which crf_learn can accept:

bin/generate_data --data-path=input.csv --count=1000 --offset=0 > tmp/train_file

The count argument specifies the number of training examples (i.e. ingredient lines) to read, and offset specifies which line to start with. There are roughly 180k examples in our snapshot of the New York Times cooking database (which we include in this repo), so it is useful to run against a subset.

The output of this step looks something like:

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME

2            I1      L8      NoCAP  NoPAREN  B-QTY
tablespoons  I2      L8      NoCAP  NoPAREN  B-UNIT
dry          I3      L8      NoCAP  NoPAREN  B-NAME
white        I4      L8      NoCAP  NoPAREN  I-NAME
wine         I5      L8      NoCAP  NoPAREN  I-NAME

Next, we pass this file to crf_learn, to generate a model file:

crf_learn template_file tmp/train_file tmp/model_file

Testing

To use the model to tag your own arbitrary ingredient lines (stored here in input.txt), you must first convert it into the CRF++ format, then run against the model file which we generated above. We provide another helper script to do this:

python bin/parse-ingredients.py input.txt > results.txt

The output is also in CRF++ format, which isn't terribly helpful to us. To convert it into JSON:

python bin/convert-to-json.py results.txt > results.json

See the top of this README for an example of the expected output.

Authors

License

Apache 2.0.

More Repositories

1

covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
6,989
star
2

objective-c-style-guide

The Objective-C Style Guide used by The New York Times
5,848
star
3

gizmo

A Microservice Toolkit from The New York Times
Go
3,753
star
4

Store

Android Library for Async Data Loading and Caching
Java
3,531
star
5

NYTPhotoViewer

A modern photo viewing experience for iOS.
Objective-C
2,847
star
6

pourover

A library for simple, fast filtering and sorting of large collections in the browser. There is a community-maintained fork that addresses a handful of post-NYT issues available via @hhsnopek's https://github.com/hhsnopek/pourover
JavaScript
2,393
star
7

kyt

Starting a new JS app? Build, test and run advanced apps with kyt ๐Ÿ”ฅ
JavaScript
1,922
star
8

react-tracking

๐ŸŽฏ Declarative tracking for React apps.
JavaScript
1,876
star
9

ice

track changes with javascript
JavaScript
1,708
star
10

backbone.stickit

Backbone data binding, model binding plugin. The real logic-less templates.
JavaScript
1,641
star
11

library

A collaborative documentation site, powered by Google Docs.
JavaScript
1,143
star
12

openapi2proto

A tool for generating Protobuf v3 schemas and gRPC service definitions from OpenAPI specifications
Go
940
star
13

gziphandler

Go middleware to gzip HTTP responses
Go
857
star
14

svg-crowbar

Extracts an SVG node and accompanying styles from an HTML document and allows you to download it all as an SVG file.
JavaScript
840
star
15

Emphasis

Dynamic Deep-Linking and Highlighting
JavaScript
576
star
16

tamper

Ruby
499
star
17

three-loader-3dtiles

This is a Three.js loader module for handling OGC 3D Tiles, created by Cesium. It currently supports the two main formats, Batched 3D Model (b3dm) - based on glTF Point cloud.
TypeScript
444
star
18

react-prosemirror

A library for safely integrating ProseMirror and React.
TypeScript
418
star
19

rd-blender-docker

A collection of Docker containers for running Blender headless or distributed โœจ
Python
415
star
20

Register

Android Library and App for testing Play Store billing
Kotlin
381
star
21

text-balancer

Eliminate typographic widows and other type crimes with this javascript module
JavaScript
373
star
22

document-viewer

The NYTimes Document Viewer
JavaScript
310
star
23

ios-360-videos

NYT360Video plays 360-degree video streamed from an AVPlayer on iOS.
Objective-C
273
star
24

three-story-controls

A three.js camera toolkit for creating interactive 3d stories
TypeScript
247
star
25

backbone.trackit

Manage unsaved changes in a Backbone Model.
JavaScript
202
star
26

aframe-loader-3dtiles-component

A-Frame component using 3D-Tiles
JavaScript
187
star
27

marvin

A go-kit HTTP server for the App Engine Standard Environment
Go
177
star
28

drone-gke

Drone plugin for deploying containers to Google Kubernetes Engine (GKE)
Go
165
star
29

Chronicler

A better way to write your release notes.
JavaScript
162
star
30

nginx-vod-module-docker

Docker image for nginx with Kaltura's VoD module used by The New York Times
Dockerfile
161
star
31

collectd-rabbitmq

A collected plugin, written in python, to collect statistics from RabbitMQ.
Python
143
star
32

public_api_specs

The API Specs (in OpenAPI/Swagger) for the APIs available from developer.nytimes.com
136
star
33

gunsales

Statistical analysis of monthly background checks of gun purchases
R
130
star
34

gcp-vault

A client for securely retrieving secrets from Vault in Google Cloud infrastructure
Go
119
star
35

rd-bundler-3d-plugins

Bundler plugins for optimizing glTF 3D models
JavaScript
119
star
36

Fech

Deprecated. Please see https://github.com/dwillis/Fech for a maintained fork.
Ruby
115
star
37

data-training

Files from the NYT data training program, available for public use.
114
star
38

drone-gae

Drone plugin for managing deployments and services on Google App Engine (GAE)
Go
97
star
39

mock-ec2-metadata

Go
95
star
40

encoding-wrapper

Collection of Go wrappers for Video encoding cloud providers (moved to @video-dev)
Go
85
star
41

redux-taxi

๐Ÿš• Component-driven asynchronous SSR in isomorphic Redux apps
JavaScript
70
star
42

video-captions-api

Agnostic API to generate captions for media assets across different transcription services.
Go
61
star
43

lifeline

A cron-based alternative to running daemons
Ruby
58
star
44

gcs-helper

Tool for proxying and mapping HTTP requests to Google Cloud Storage (GCS).
Go
54
star
45

logrotate

Go
54
star
46

httptest

A simple concurrent HTTP testing tool
Go
48
star
47

kyt-starter-universal

Deprecated, see: https://github.com/NYTimes/kyt/tree/master/packages/kyt-starter-universal
JavaScript
33
star
48

nytcampfin

A thin Python client for The New York Times Campaign Finance API
Python
27
star
49

safejson

safeJSON provides replacements for the 'load' and 'loads' methods in the standard Python 'json' module.
Python
27
star
50

thumbor-docker-image

Docker image for Thumbor smart imaging service
26
star
51

times_wire

A thin Ruby client for The New York Times Newswire API
Ruby
26
star
52

haiti-debt

Historical data on Haitiโ€™s debt payments to France collected by The New York Times.
21
star
53

hhs-child-migrant-data

Data from the U.S. Department of Human Health and Services on children who have migrated to the United States without an adult.
21
star
54

jsonlogic

Clojure
20
star
55

elemental-live-client

JS library to communicate with Elemental live API.
JavaScript
19
star
56

Open-Source-Science-Fair

The New York Times Open Source Science Fair
JavaScript
19
star
57

tweetftp

Ruby Implementation of the Tweet File Transfer Protocol (APRIL FOOLS JOKE)
Ruby
19
star
58

prosemirror-change-tracking-prototype

JavaScript
18
star
59

plumbook

Data from the Plum Book, published by the GPO every 4 years
17
star
60

libvmod-queryfilter

Simple querystring filter/sort module for Varnish Cache v3-v6
M4
16
star
61

sneeze

Python
16
star
62

querqy-clj

Search Query Rewriting for Elasticsearch and more! Built on Querqy.
Clojure
14
star
63

sqliface

handy interfaces and test implementations for Go's database/sql package
Go
14
star
64

grocery

The grocery package provides easy mechanisms for storing, loading, and updating Go structs in Redis.
Go
13
star
65

oak-byo-react-prosemirror-redux

JavaScript
13
star
66

vase.elasticsearch

Vase Bindings for Elasticsearch
Clojure
11
star
67

library-customization-example

An example repo that customizes Library behavior
SCSS
11
star
68

counter

count things, either as a one-off or aggregated over time
Ruby
11
star
69

kyt-starter

The default starter-kyt for kyt apps.
JavaScript
10
star
70

tulsa-1921-data

Data files associated with our story on the 1921 race massacre in Tulsa, Oklahoma.
10
star
71

open-blog-projects

A repository for code examples that are paired with our Open Blog posts
Swift
9
star
72

rd-mobile-pg-demos

HTML
9
star
73

pocket_change

Python
9
star
74

mentorship

7
star
75

sort_by_str

SQL-like sorts on your Enumerables
Ruby
7
star
76

drone-gdm

Drone.io plugin to facilitate the use of Google Deployment Manager in drone deploy phase.
Go
6
star
77

kyt-starter-static

Deprecated, see: https://github.com/NYTimes/kyt/tree/master/packages/kyt-starter-static
JavaScript
6
star
78

s3yum

Python
5
star
79

pocket

Python
5
star
80

drone-openapi

A Drone plugin for publishing Open API service specifications
Go
5
star
81

threeplay

Go client for the 3Play API.
Go
4
star
82

amara

Amara client for Go
Go
4
star
83

go-compare-expressions

Go
3
star
84

kaichu

Python
3
star
85

license

NYT Apache 2.0 license
3
star
86

prosemirror-tooltip

JavaScript
2
star
87

photon-dev_demo

A "Sustainable Systems, Powered By Python" Demo Repository (1 of 3)
Shell
2
star
88

std-cat

Content Aggregation Technology โ€” a standard for content aggregation on the Web
HTML
1
star