Web2Text
Source code for Web2Text: Deep Structured Boilerplate Removal, full paper at ECIR '18
Introduction
This repository contains
-
Scala code to parse an (X)HTML document into a DOM tree, convert it to a CDOM tree, interpret tree leaves as a sequence of text blocks and extract features for each of these blocks.
-
Python code to train and evaluate unary and pairwise CNNs on top of these features. Inference on the hidden Markov model based on the CNN output potentials can be executed using the provided implementation of the Viterbi algorithm.
-
The CleanEval dataset under src/main/resources/cleaneval/:
- orig: raw pages
- clean: reference clean pages
- aligned: clean content aligned with the corresponding raw page on a per-character basis, using the alignment algorithm described in our paper
-
Output from various other webpage cleaners on CleanEval under other_frameworks/output:
- Body Text Extractor (Finn et al., 2001)
- Boilerpipe (Kohlschütter et al., 2010): default-extractor, article-extractor, largestcontent-extractor
- Unfluff (Geitgey, 2014)
- Victor (Spousta et al., 2008)
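The role of the Viterbi step mentioned in the list above can be illustrated with a minimal Python sketch. It assumes a linear chain over the text blocks, one score vector per block (the unary potentials) and one score matrix per pair of neighbouring blocks (the pairwise potentials), all in log space. The function name and the input layout are illustrative assumptions, not the repository's implementation.

```python
def viterbi(unary, pairwise):
    """Return the highest-scoring label sequence for a chain model.

    unary[i][a]       : log-potential of giving block i the label a
    pairwise[i][a][b] : log-potential of labels (a, b) on blocks (i, i + 1)
    Both layouts are assumptions made for this sketch.
    """
    n = len(unary)
    if n == 0:
        return []
    k = len(unary[0])

    score = list(unary[0])   # best score of any path ending in each label
    back = []                # back[i - 1][b] = best previous label for label b at block i

    for i in range(1, n):
        new_score, pointers = [], []
        for b in range(k):
            best_a = max(range(k), key=lambda a: score[a] + pairwise[i - 1][a][b])
            new_score.append(score[best_a] + pairwise[i - 1][best_a][b] + unary[i][b])
            pointers.append(best_a)
        score = new_score
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    label = max(range(k), key=lambda a: score[a])
    labels = [label]
    for pointers in reversed(back):
        label = pointers[label]
        labels.append(label)
    labels.reverse()
    return labels


# The middle block weakly prefers label 0 on its own, but the pairwise
# scores reward equal neighbours, so the decoded sequence is smoothed.
unary = [[0.0, 1.0], [0.6, 0.4], [0.0, 1.0]]
smooth = [[[0.5, 0.0], [0.0, 0.5]]] * 2
print(viterbi(unary, smooth))  # [1, 1, 1]
```

With all pairwise scores set to zero, the same unary scores decode to [1, 0, 1], which shows what the pairwise term contributes.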
Installation
-
Install Scala and SBT. The code was tested with SBT 1.3.3. Alternatively, use the Docker image hseeberger/scala-sbt:8u222_1.3.3_2.13.1.
- If you have trouble installing Scala and SBT, you can run the Scala code in Docker with a command like:
docker run -it --rm \
  --mount type=bind,source="$(pwd)",target=/root \
  hseeberger/scala-sbt:8u222_1.3.3_2.13.1 \
  sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"
-
Install Python 3.7 with TensorFlow 1.15 and NumPy.
Usage
See this blog post by Xavier Geerinck with step-by-step instructions on running this code.
Recipe: extracting text from a web page
- Run ch.ethz.dalab.web2text.ExtractPageFeatures through sbt. The arguments are:
  - the input HTML file
  - the desired output base filename (the program produces {filename_base}_edge_feature.csv and {filename_base}_block_features.csv)
- Use the Python script src/main/python/main.py with the 'classify' option. The arguments are:
  python3 main.py classify {filename_base} {labels_out_filename}
- Use ch.ethz.dalab.web2text.ApplyLabelsToPage through sbt to produce clean text. The arguments are:
  - the input HTML file
  - {labels_out_filename} from step 2
  - the output destination text file path
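The three recipe steps can be chained from a small Python driver. The sketch below is one possible arrangement, under several assumptions: it is started from the repository root, sbt and python3 are on the PATH, main.py is run from src/main/python, and all paths are absolute and contain no spaces (sbt splits the runMain arguments on whitespace). The helper names are hypothetical.

```python
import subprocess

EXTRACT_MAIN = "ch.ethz.dalab.web2text.ExtractPageFeatures"
APPLY_MAIN = "ch.ethz.dalab.web2text.ApplyLabelsToPage"


def pipeline_steps(html_path, filename_base, labels_path, text_path):
    """Return (argv, working_directory) pairs for the three recipe steps."""
    return [
        # Step 1: write the block and edge feature CSV files.
        (["sbt", f"runMain {EXTRACT_MAIN} {html_path} {filename_base}"], "."),
        # Step 2: classify the blocks (assumed to run next to main.py).
        (["python3", "main.py", "classify", filename_base, labels_path],
         "src/main/python"),
        # Step 3: apply the predicted labels to the page to get clean text.
        (["sbt", f"runMain {APPLY_MAIN} {html_path} {labels_path} {text_path}"], "."),
    ]


def run_pipeline(html_path, filename_base, labels_path, text_path):
    """Run the steps in order and stop at the first failing command."""
    for argv, cwd in pipeline_steps(html_path, filename_base, labels_path, text_path):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling pipeline_steps on its own is a cheap way to inspect the exact commands before running anything.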
HTML to CDOM
In Scala:
import ch.ethz.dalab.web2text.cdom.CDOM
val cdom = CDOM.fromHTML("""
<body>
<h1>Header</h1>
<p>Paragraph with an <i>Italic</i> section.</p>
</body>
""")
println(cdom)
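To give a feel for what "tree leaves as a sequence of text blocks" means, here is a rough Python sketch that lists the text leaves of the same HTML in document order, together with the tag path leading to each leaf. It uses only the standard html.parser module, skips the CDOM conversion entirely, and is not a port of the Scala code. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LeafTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.blocks.append(("/".join(self.stack), text))


def text_blocks(html):
    collector = LeafTextCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


page = """
<body>
<h1>Header</h1>
<p>Paragraph with an <i>Italic</i> section.</p>
</body>
"""
for path, text in text_blocks(page):
    print(path, "|", text)
# body/h1 | Header
# body/p | Paragraph with an
# body/p/i | Italic
# body/p | section.
```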
Feature extraction
Example:
import ch.ethz.dalab.web2text.features.{FeatureExtractor, PageFeatures}
import ch.ethz.dalab.web2text.features.extractor._
// Keep each + at the end of its line so that Scala 2 does not end the expression early
val unaryExtractor =
  DuplicateCountsExtractor +
  LeafBlockExtractor +
  AncestorExtractor(NodeBlockExtractor + TagExtractor(mode="node"), 1) +
  AncestorExtractor(NodeBlockExtractor, 2) +
  RootExtractor(NodeBlockExtractor) +
  TagExtractor(mode="leaf")
val pairwiseExtractor =
TreeDistanceExtractor +
BlockBreakExtractor +
CommonAncestorExtractor(NodeBlockExtractor)
val extractor = FeatureExtractor(unaryExtractor, pairwiseExtractor)
val features: PageFeatures = extractor(cdom)
println(features)
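The + operator in the example above combines small extractors into a larger one. The Python sketch below imitates that design: each extractor carries its feature labels and a function, and adding two extractors concatenates both. The Extractor class and the two toy features are hypothetical and much simpler than the Scala extractors, which operate on CDOM nodes rather than plain strings.

```python
class Extractor:
    """A feature extractor: a list of labels plus a function block -> values."""

    def __init__(self, labels, fn):
        self.labels = list(labels)
        self.fn = fn

    def __call__(self, block):
        values = list(self.fn(block))
        if len(values) != len(self.labels):
            raise ValueError("extractor returned the wrong number of features")
        return values

    def __add__(self, other):
        # Concatenate labels and feature values, keeping their order aligned.
        return Extractor(
            self.labels + other.labels,
            lambda block: self(block) + other(block),
        )


# Two toy features computed on the text of a block.
n_characters = Extractor(["n_characters"], lambda text: [len(text)])
ends_with_punctuation = Extractor(
    ["ends_with_punctuation"],
    lambda text: [1.0 if text.rstrip().endswith((".", "!", "?")) else 0.0],
)

combined = n_characters + ends_with_punctuation
print(combined.labels)            # ['n_characters', 'ends_with_punctuation']
print(combined("Hello world."))   # [12, 1.0]
```

Keeping labels and values in the same order is what makes it possible to print the names of the exported features afterwards.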
Aligning cleaned text with original source
import ch.ethz.dalab.web2text.alignment.Alignment
val reference = "keep this"
val source = "You should keep this text"
val alignment: String = Alignment.alignment(source, reference)
println(alignment) // □□□□□□□□□□□keep this□□□□□
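The output format of the alignment (source characters that survive in the reference are kept, all others become a placeholder) can be reproduced for simple cases in Python with difflib from the standard library. This is a stand-in technique for illustration only; it is not the alignment algorithm described in the paper, and it can disagree with it on pages with repeated text. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def align_mask(source, reference, blank="\u25a1"):
    """Mask every source character that difflib does not match to the reference."""
    matcher = SequenceMatcher(None, source, reference, autojunk=False)
    out = [blank] * len(source)
    for block in matcher.get_matching_blocks():
        # block.a is the start in source, block.size the length of the match.
        for offset in range(block.size):
            out[block.a + offset] = source[block.a + offset]
    return "".join(out)


mask = align_mask("You should keep this text", "keep this")
print(mask)  # □□□□□□□□□□□keep this□□□□□
```

The result always has the same length as the source, so each position can be mapped back to the raw page character by character.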
Extracting features for CleanEval
import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter
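// fe is the FeatureExtractor to apply, for example the extractor value built in the feature extraction example
val fe = extractor
val data = Util.time{ CleanEval.dataset(fe) }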
// Write block_features.csv and edge_features.csv
// Format of a row: page id, groundtruth label (1/0), features ...
CsvDatasetWriter.write(data, "./src/main/python/data")
// Print the names of the exported features in order
println("# Block features")
fe.blockExtractor.labels.foreach(println)
println("# Edge features")
fe.edgeExtractor.labels.foreach(println)
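The row format noted in the comments above (page id, ground-truth label, then the features) can be read back in Python with the csv module. The sketch assumes comma-separated rows, no header line and numeric feature values; none of these details are confirmed here, so check them against the files that CsvDatasetWriter actually produces. The function name is hypothetical.

```python
import csv
import io


def read_feature_rows(lines):
    """Parse rows of the form: page id, label (1/0), feature values..."""
    rows = []
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        page_id = record[0].strip()
        label = int(float(record[1]))
        features = [float(value) for value in record[2:]]
        rows.append((page_id, label, features))
    return rows


# A tiny in-memory example with made-up values.
sample = io.StringIO("12,1,0.5,3\n12,0,0.25,7\n")
rows = read_feature_rows(sample)
print(rows)  # [('12', 1, [0.5, 3.0]), ('12', 0, [0.25, 7.0])]
```

For real files, pass an open file object instead of the StringIO sample, for example open(path, newline="").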
Training the CNNs
Code related to the CNNs lives in the src/main/python
directory.
To train the CNNs:
- Set the CHECKPOINT_DIR variable in main.py.
- Make sure the files block_features.csv and edge_features.csv are in the src/main/python/data directory. Use the example from the previous section to generate them.
- Convert the CSV files to .npy with data/convert_scala_csv.py.
- Train the unary CNN with python3 main.py train_unary.
- Train the pairwise CNN with python3 main.py train_edge.
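Before starting a training run, it can help to confirm that the two CSV files named in the steps above are in place. The following Python sketch does only that. The default directory assumes the script is started from the repository root, and the function name is hypothetical.

```python
from pathlib import Path

# File names taken from the training steps.
REQUIRED_CSV = ("block_features.csv", "edge_features.csv")


def missing_training_inputs(data_dir="src/main/python/data"):
    """Return the required CSV files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_CSV if not (data_dir / name).is_file()]


missing = missing_training_inputs()
if missing:
    print("Missing training inputs:", ", ".join(missing))
else:
    print("Both feature files are present.")
```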
Evaluating the CNN
To evaluate the CNN:
- Set the CHECKPOINT_DIR variable in main.py to point to a directory with trained weights. We provide trained weights based on the cleaneval split and a custom web2text split (with more training data).
- Run python3 main.py test_structured to test performance on the CleanEval test set.
The performance of other networks is computed in Scala:
import ch.ethz.dalab.web2text.Main
Main.evaluateOthers()
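When comparing cleaners on per-block labels, the usual summary numbers are accuracy, precision, recall and F1. Which of these the evaluation code reports is not stated here, so the Python sketch below is only an assumed, generic way to compute them from two equal-length lists of 1/0 labels, taking 1 to be the positive class.

```python
def binary_metrics(predicted, truth):
    """Accuracy, precision, recall and F1 for 1/0 labels (1 is the positive class)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")

    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)

    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```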