TagSoup

A Haskell library for parsing and extracting information from (possibly malformed) HTML/XML documents.

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information (a minimal sketch of the tag type appears after the list below). This document gives two particular examples of scraping information from the web, while a few more may be found in the Sample file from the source repository. The examples we give are:

  • Obtaining the last modified date of the Haskell wiki
  • Obtaining a list of Simon Peyton Jones' latest papers
  • A brief overview of some other examples
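
Before diving into the examples, here is a minimal sketch of the basic tag type in action; the exact rendering of the output may differ slightly between versions:

import Text.HTML.TagSoup

-- parseTags turns an HTML fragment into a flat list of tags,
-- decoding entities but making no attempt to nest or balance them
main :: IO ()
main = print $ parseTags "<hello>my&amp;<b>world</b>"
-- prints (roughly):
-- [TagOpen "hello" [],TagText "my&",TagOpen "b" [],TagText "world",TagClose "b"]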

The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. The examples include general hints on screen scraping, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any time, you may be in for a shock!

This library was written without knowledge of the Java version of TagSoup. They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.

Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.

Potential Bugs

There are two things that may go wrong with these examples:

  • The websites being scraped may change. There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.
  • The openURL method may not work. This happens quite regularly: depending on your server, your proxies and the direction of the wind, it may fail. The solution is to use wget to download the page locally, then use readFile instead (a sketch of this workaround follows this list). Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.
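
As a concrete illustration of the wget workaround, a minimal sketch (the file name temp.htm is just an arbitrary choice):

module Main where

import Text.HTML.TagSoup

-- Assumes the page was fetched beforehand, e.g. with:
--   wget -O temp.htm "http://wiki.haskell.org/Haskell"
main :: IO ()
main = do
    src <- readFile "temp.htm"
    print $ length $ parseTags src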

Last modified date of Haskell wiki

Our goal is to develop a program that displays the date on which the wiki at wiki.haskell.org was last modified. This example covers the basics of designing a simple web-scraping application.

Finding the Page

We first need to find where the information is displayed and in what format. Taking a look at the front web page, when not logged in, we see:

<ul id="footer-info">
  <li id="footer-info-lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li id="footer-info-copyright">Recent content is available under <a href="/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">simple permissive license</a>.</li>
</ul>

So, we see that the last modified date is available. This leads us to rule 1:

Rule 1: Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, some pages will display differently depending on your cookies. Before you can figure out how to scrape a page, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

Using the HTTP package

We can write a simple HTTP downloader using the HTTP package:

module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src

Now open temp.htm, find the fragment of HTML containing the last modified date, and examine it.

Finding the Information

Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:

Rule 2: Do not be robust to design changes, do not even consider the possibility when writing the code.

If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try to anticipate design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.

So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice id or class attribute - something which is unlikely to occur multiple times. In the above example, an id with value footer-info-lastmod seems perfect.

module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime

Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of Tags with parseTags. The fromFooter function does the interesting thing, and can be read right to left:

  • First we throw away everything (dropWhile) until we get to an li tag with id=footer-info-lastmod. The (~==) and (~/=) operators differ from standard equality and inequality in that they allow additional attributes to be present. We write "<li id=footer-info-lastmod>" as syntactic sugar for TagOpen "li" [("id","footer-info-lastmod")]. If we just wanted any open tag with the given id attribute we could have written (~== TagOpen "" [("id","footer-info-lastmod")]) and this would have matched, since any empty strings in the match pattern are treated as wildcards (see the sketch after this list).
  • Next we take two elements: the <li> tag and the text node immediately following.
  • We call the innerText function to get all the text values from inside, which will just be the text node following the footer-info-lastmod.
  • We split the string into a series of words and drop the first six, i.e. the words This, page, was, last, modified and on.
  • We reassemble the remaining words into the resulting string 9 September 2013, at 22:38.
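
To make these matching rules concrete, a small sketch using a hand-built tag (the class attribute here is invented for illustration):

import Text.HTML.TagSoup

tag :: Tag String
tag = TagOpen "li" [("id","footer-info-lastmod"),("class","extra")]

checks :: [Bool]
checks =
    [ tag ~== "<li id=footer-info-lastmod>"              -- True: extra attributes on the tag are fine
    , tag ~== TagOpen "" [("id","footer-info-lastmod")]  -- True: an empty name matches any tag
    , tag ~== "<div id=footer-info-lastmod>"             -- False: the tag name differs
    ]

main :: IO ()
main = mapM_ print checks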

This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.

Rule 3: TagSoup is for extracting information where structure has been lost, use more structured information if it is available.

Simon's Papers

Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his home page. The largest change from the previous example is that now we desire a list of papers, rather than just a single result.

As before, we start by writing a simple program that downloads the appropriate page, and look for common patterns. This time we want to look for all patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.

First we spot that the page helpfully has named anchors, there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple take/drop pair:

takeWhile (~/= "<a name=haskell>") $
drop 5 $ dropWhile (~/= "<a name=current>") tags

This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:

map f $ sections (~== "<A>") $ ...
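
For intuition, sections returns every suffix of its input whose first element matches the predicate, so each result starts at one matching link and runs to the end of the input. A small sketch on invented input:

import Text.HTML.TagSoup

linkSections :: [[Tag String]]
linkSections = sections (~== "<a>")
    (parseTags "<a href=x>one</a> <a href=y>two</a>")

main :: IO ()
main = mapM_ print linkSections
-- prints two lists: the first starting at <a href=x>, the second at <a href=y>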

Remember that the function to select all tags with name "A" could have been written as (~== TagOpen "A" []), or alternatively isTagOpenName "A". Afterwards we map a function f over each section. This function takes the tags starting at the link, and finds the text inside the link.

f = dequote . unwords . words . fromTagText . head . filter isTagText

Here the complexity of interfacing to human-written markup comes through. Some of the links are in italics, some are not - the filter drops everything until we find a pure text node. The unwords . words combination deletes multiple spaces, replaces tabs and newlines with spaces, and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name, some are not - dequote will remove the quotes if they exist.
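
The unwords . words trick is worth a note of its own; a tiny illustration:

-- words splits on any run of whitespace (spaces, tabs, newlines),
-- and unwords rejoins with single spaces, trimming both ends
normalise :: String -> String
normalise = unwords . words

-- normalise "  A   History\n\tof Haskell  "  ==  "A History of Haskell"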

For completeness, we now present the entire example:

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

spjPapers :: IO ()
spjPapers = do
        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
        let links = map f $ sections (~== "<A>") $
                    takeWhile (~/= "<a name=haskell>") $
                    drop 5 $ dropWhile (~/= "<a name=current>") tags
        putStr $ unlines links
    where
        f :: [Tag String] -> String
        f = dequote . unwords . words . fromTagText . head . filter isTagText

        dequote ('\"':xs) | last xs == '\"' = init xs
        dequote x = x

main :: IO ()
main = spjPapers

Other Examples

Several more examples are given in the Sample.hs file, including obtaining the (short) list of papers from my site, getting the current time, and a basic XML validator. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two of them for enjoyment only.

My Papers

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

ndmPapers :: IO ()
ndmPapers = do
        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
        let papers = map f $ sections (~== "<li class=paper>") tags
        putStr $ unlines papers
    where
        f :: [Tag String] -> String
        f xs = fromTagText (xs !! 2)

main :: IO ()
main = ndmPapers

UK Time

module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

currentTime :: IO ()
currentTime = do
    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
    putStrLn time

main :: IO ()
main = currentTime

Other Examples

In Sample.hs the following additional examples are listed:

  • Google Tech News
  • Package list from Hackage
  • Print names of story contributors on sequence.complete.org
  • Parse rows of a table
