Web-Scale Open Information Extraction

ReVerb

ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.

ReVerb takes raw text as input, and outputs (argument1, relation phrase, argument2) triples. For example, given the sentence "Bananas are an excellent source of potassium," ReVerb will extract the triple (bananas, be source of, potassium).

More information is available at the ReVerb homepage: http://reverb.cs.washington.edu

Quick Start

If you want to run ReVerb on a small amount of text without modifying its source code, we provide an executable jar file that can be run from the command line. Follow these steps to get started:

  1. Download the latest ReVerb jar from http://reverb.cs.washington.edu/reverb-latest.jar

  2. Run java -Xmx512m -jar reverb-latest.jar yourfile.txt.

  3. Run java -Xmx512m -jar reverb-latest.jar -h for more options.

Building

Building ReVerb from source requires Apache Maven (http://maven.apache.org). Run this command to download the required dependencies, compile the code, and create a single executable jar file:

mvn clean compile assembly:single

The compiled class files will be put in the target/classes directory. The single executable jar file will be written to target/reverb-core-*-jar-with-dependencies.jar where * is replaced with the version number.

Command Line Interface

Once you have built ReVerb, you can run it from the command line.

The command line interface to ReVerb takes plain text or HTML as input and produces tab-separated output. Each row represents a single extracted (argument1, relation phrase, argument2) triple, plus metadata. The output has the following columns:

  1. The filename (or stdin if the source is standard input)
  2. The sentence number this extraction came from.
  3. Argument1 words, space separated
  4. Relation phrase words, space separated
  5. Argument2 words, space separated
  6. The start index of argument1 in the sentence. The index is zero-based: if the value is i, then the first word of argument1 is the (i+1)th word in the sentence.
  7. The end index of argument1 in the sentence. For example, if the value is j, then the last word of argument1 is the jth word in the sentence.
  8. The start index of relation phrase.
  9. The end index of relation phrase.
  10. The start index of argument2.
  11. The end index of argument2.
  12. The confidence that this extraction is correct. The higher the number, the more trustworthy this extraction is.
  13. The words of the sentence this extraction came from, space-separated.
  14. The part-of-speech tags for the sentence words, space-separated.
  15. The chunk tags for the sentence words, space separated. These represent a shallow parse of the sentence.
  16. A normalized version of arg1. See the BinaryExtractionNormalizer javadoc for details about how the normalization is done.
  17. A normalized version of rel.
  18. A normalized version of arg2.

For example:

$ echo "Bananas are an excellent source of potassium." | 
    ./reverb -q | tr '\t' '\n' | cat -n
 1  stdin
 2  1
 3  Bananas
 4  are an excellent source of
 5  potassium
 6  0
 7  1
 8  1
 9  6
10  6
11  7
12  0.9999999997341693
13  Bananas are an excellent source of potassium .
14  NNS VBP DT JJ NN IN NN .
15  B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16  bananas
17  be source of
18  potassium
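
The row above can also be read programmatically. The following is a minimal sketch, not part of ReVerb: the class ReVerbRow and its methods are hypothetical names. It assumes each output line has exactly 18 tab-separated columns, and it reads each start index as zero-based and each end index as exclusive, which is what the example row suggests.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one row of ReVerb's tab-separated output; not part of ReVerb. */
final class ReVerbRow {
    final String source;          // column 1: filename or "stdin"
    final String arg1, rel, arg2; // columns 3-5
    final int[] offsets;          // columns 6-11: start/end pairs for arg1, rel, arg2
    final double confidence;      // column 12
    final List<String> words;     // column 13, split on spaces

    private ReVerbRow(String[] f) {
        source = f[0];
        arg1 = f[2];
        rel = f[3];
        arg2 = f[4];
        offsets = new int[6];
        for (int i = 0; i < 6; i++) {
            offsets[i] = Integer.parseInt(f[5 + i]);
        }
        confidence = Double.parseDouble(f[11]);
        words = Arrays.asList(f[12].split(" "));
    }

    /** Splits one output line on tabs and checks that all 18 columns are present. */
    static ReVerbRow parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 18) {
            throw new IllegalArgumentException("expected 18 columns, got " + fields.length);
        }
        return new ReVerbRow(fields);
    }

    /** Sentence words from start (zero-based) up to, but not including, end. */
    String span(int start, int end) {
        return String.join(" ", words.subList(start, end));
    }

    String arg1Span() { return span(offsets[0], offsets[1]); }
    String relSpan()  { return span(offsets[2], offsets[3]); }
    String arg2Span() { return span(offsets[4], offsets[5]); }

    public static void main(String[] args) {
        // The example row from the README, rebuilt with explicit tab separators.
        String line = String.join("\t",
            "stdin", "1", "Bananas", "are an excellent source of", "potassium",
            "0", "1", "1", "6", "6", "7", "0.9999999997341693",
            "Bananas are an excellent source of potassium .",
            "NNS VBP DT JJ NN IN NN .",
            "B-NP B-VP B-NP I-NP I-NP I-NP I-NP O",
            "bananas", "be source of", "potassium");
        ReVerbRow row = parse(line);
        System.out.println(row.arg1Span() + " | " + row.relSpan() + " | " + row.arg2Span());
    }
}
```

For the example row, the spans recovered from the indices match columns 3 to 5, so the program prints Bananas | are an excellent source of | potassium. The sketch skips the sentence number, the tag columns, and the normalized columns; they can be read the same way.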

For a list of options to the command line interface to ReVerb, run reverb -h.

Examples

Running ReVerb on a small set of files

./reverb file1 file2 file3 ...

Running ReVerb on standard input

./reverb < input

Running ReVerb on HTML files

The --strip-html flag (short version: -s) removes tags from the input before running ReVerb.

./reverb --strip-html myfile.html

Running ReVerb on a list of files

You may have an entire directory structure that you want to run ReVerb on. ReVerb takes approximately 10 seconds to initialize, so it is not efficient to start a new process for each file. To pass ReVerb a list of paths, use the -f switch:

# Run ReVerb on all files under mydir/
find mydir/ -type f | ./reverb -f
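
If find is not available, the same list of paths can be produced from Java. This is a minimal sketch that roughly mirrors the find command above; the class name ListInputFiles and the method regularFilesUnder are hypothetical and not part of ReVerb. Unlike find -type f, it also reports symbolic links that point to regular files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that roughly mirrors `find mydir/ -type f`; not part of ReVerb. */
final class ListInputFiles {
    /** Returns every regular file under root, sorted for stable output. */
    static List<Path> regularFilesUnder(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "mydir");
        // One path per line, ready to pipe into `./reverb -f`.
        regularFilesUnder(root).forEach(System.out::println);
    }
}
```

On Java 11 or later, the sketch can be run as a single source file and piped into ReVerb, for example: java ListInputFiles.java mydir | ./reverb -f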

Java Interface

To include ReVerb as a library in your own project, please take a look at the example class ReVerbExample in the src/main/java/edu/washington/cs/knowitall/examples directory.

When running code that calls ReVerb, make sure to increase the Java Virtual Machine heap size by passing the argument -Xmx512m to java. ReVerb loads multiple models into memory, and will be significantly slower if the heap size is not large enough.
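
A program that embeds ReVerb can check the heap limit at startup and warn when it looks too small. The following is a minimal sketch, not part of ReVerb: the class HeapCheck is a hypothetical name, and the 20% slack is an assumption made because some JVMs report a maximum slightly below the -Xmx value.

```java
/** Hypothetical startup check; not part of ReVerb. */
final class HeapCheck {
    // 512 MB, matching the -Xmx512m advice in this README.
    static final long RECOMMENDED_BYTES = 512L * 1024 * 1024;
    // Assumed slack: accept heaps down to 80% of the recommended size.
    static final double SLACK = 0.8;

    /** True when the given maximum heap is close to or above the recommended size. */
    static boolean isLargeEnough(long maxHeapBytes) {
        return maxHeapBytes >= (long) (RECOMMENDED_BYTES * SLACK);
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        if (!isLargeEnough(max)) {
            System.err.printf(
                "Max heap is only %d MB; consider running with -Xmx512m%n",
                max / (1024 * 1024));
        }
    }
}
```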

Using Eclipse

To modify the ReVerb source code in Eclipse, use Apache Maven to create the appropriate project files:

mvn eclipse:eclipse

Then start Eclipse and navigate to File > Import. Under General, select "Existing Projects into Workspace" and point Eclipse to the main ReVerb directory.

Including ReVerb as a Dependency

If you want to start a new project that depends on ReVerb, first create a new skeleton project using Maven. The following command will ask you to fill in the details of your project name, etc.:

mvn archetype:generate

Next, add ReVerb as a dependency by placing the following XML under the <project> element. Consult Maven Central to make sure the version number is the latest release of ReVerb:

<dependencies>
  <dependency>
    <groupId>edu.washington.cs.knowitall</groupId>
    <artifactId>reverb-core</artifactId>
    <version>1.4.1</version>
  </dependency>
</dependencies>

Your final pom.xml file should look something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>mygroup</groupId>
  <artifactId>myartifact</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>myartifact</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>edu.washington.cs.knowitall</groupId>
      <artifactId>reverb-core</artifactId>
      <version>1.4.1</version>
    </dependency>
  </dependencies>
</project>

You should be able to include ReVerb in your code now. You can try this out by including import edu.washington.cs.knowitall.extractor.ReVerbExtractor in your program.

Retraining the Confidence Function

ReVerb includes a class, ReVerbClassifierTrainer, for training new confidence functions from a list of labeled examples. Example code for training a new confidence function confFunction is shown below. The non-trivial part is likely to be converting your labeled data to an Iterable<LabeledBinaryExtraction>.

Example Pseudocode:

// Provide your labeled data here
Iterable<LabeledBinaryExtraction> myLabeledData = ??? 
ReVerbClassifierTrainer trainer = 
    new ReVerbClassifierTrainer(myLabeledData);
Logistic classifier = trainer.getClassifier();
ReVerbConfFunction confFunction = new ReVerbConfFunction(classifier);
// confFunction is ready to use here.
double conf = confFunction.getConf(extraction);

If you already have a list of binary labeled ReVerb extractions, it should be easy to convert them to ChunkedBinaryExtraction objects, and then to LabeledBinaryExtraction objects (see the constructors for these classes). Also note that ReVerb includes LabeledBinaryExtractionReader and LabeledBinaryExtractionWriter classes. You may wish to (re-)serialize your data using LabeledBinaryExtractionWriter. This will put it in the same format as all previous data used to train ReVerb confidence functions, and it will be easy to read in the future with LabeledBinaryExtractionReader.

Help and Contact

For more information, please visit the ReVerb homepage at the University of Washington: http://reverb.cs.washington.edu.

FAQ

  1. How fast is ReVerb?

    You should really benchmark ReVerb yourself, but on my computer (a new computer in 2011) ReVerb processed 5000 high-quality web sentences in 21 s, or 238 sentences per second, in a single thread. ReVerb is easily parallelizable by processing different sentences concurrently.
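
    As a rough illustration of processing sentences concurrently, here is a minimal sketch. It is not ReVerb code: the class ParallelSentences and its methods are hypothetical, and the word-counting function in main is only a stand-in for a real extraction call. It assumes the function you pass in is safe to call from several threads at once; if your extractor is not, create one instance per thread instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch of sentence-level parallelism; not part of ReVerb. */
final class ParallelSentences {
    /**
     * Applies the given function to every sentence on the common fork-join pool.
     * The result list keeps the order of the input list.
     */
    static <R> List<R> processAll(List<String> sentences, Function<String, R> extractor) {
        return sentences.parallelStream()
                        .map(extractor)
                        .collect(Collectors.toList());
    }

    /** Throughput arithmetic from the FAQ: sentences divided by elapsed seconds. */
    static double sentencesPerSecond(int sentences, double seconds) {
        return sentences / seconds;
    }

    public static void main(String[] args) {
        // Stand-in for a real extraction call: it only counts whitespace-separated words.
        Function<String, Integer> wordCount = s -> s.trim().split("\\s+").length;

        List<String> sentences = Arrays.asList(
            "Bananas are an excellent source of potassium .",
            "ReVerb extracts binary relationships from English sentences .");
        System.out.println(processAll(sentences, wordCount));

        // 5000 sentences in 21 s is roughly 238 sentences per second.
        System.out.printf("%.0f sentences/s%n", sentencesPerSecond(5000, 21));
    }
}
```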

Contributors

Citing ReVerb

If you use ReVerb in your academic work, please cite ReVerb with the following BibTeX citation:

@inproceedings{ReVerb2011,
  author =   {Anthony Fader and Stephen Soderland and Oren Etzioni},
  title =    {Identifying Relations for Open Information Extraction},
  booktitle =    {Proceedings of the Conference on Empirical Methods
                  in Natural Language Processing ({EMNLP} '11)},
  year =     {2011},
  month =    {July 27-31},
  address =  {Edinburgh, Scotland, UK}
}
