HandsomeSoup
Current Status: Usable and stable. Needs GHC 7.6. Please file bugs!
HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.
It is built on top of HXT and adds a few functions that make it easier to work with HTML.
Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 selector parser for HXT.
Install
cabal install HandsomeSoup
Example
Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:
import Text.XML.HXT.Core
import Text.HandsomeSoup
main = do
    let doc = fromUrl "http://www.google.com/search?q=egon+schiele"
    links <- runX $ doc >>> css "h3.r a" ! "href"
    mapM_ putStrLn links
What can HandsomeSoup do for you?
fromUrl
Easily parse an online page using fromUrl:
let doc = fromUrl "http://example.com"
parseHtml
Or a local page using parseHtml:
contents <- readFile [filename]
let doc = parseHtml contents
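As a minimal sketch of the parseHtml workflow, the program below passes an in-memory string to parseHtml instead of reading a file, so it needs no input. The sampleHtml value is hypothetical markup invented for illustration; runX comes from HXT.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical markup, invented for this example.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<p><a class=\"foo\" href=\"/one\">One</a></p>"
  , "<p><a href=\"/two\">Two</a></p>"
  , "</body></html>"
  ]

main :: IO ()
main = do
  -- parseHtml only builds an HXT arrow; the work happens when runX runs it.
  let doc = parseHtml sampleHtml
  -- Collect the href attribute of every <a> element in the document.
  hrefs <- runX $ doc >>> css "a" ! "href"
  mapM_ putStrLn hrefs
```

To parse a real file, replace sampleHtml with the result of readFile, as in the snippet above.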
css
Easily extract elements using css. Here are some valid selectors:
doc >>> css "a"
doc >>> css "*"
doc >>> css "a#link1"
doc >>> css "a.foo"
doc >>> css "p > a"
doc >>> css "p strong"
doc >>> css "#container h1"
doc >>> css "img[width]"
doc >>> css "img[width=400]"
doc >>> css "a[class~=bar]"
doc >>> css "a:first-child"
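Because css returns an ordinary HXT arrow, a selector can probably be combined with other HXT arrows in the usual way. The sketch below assumes HXT's (//>) and getText to pull out the text of matched elements. The sampleHtml document is hypothetical, and only two of the selectors listed above are exercised.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical markup, invented for this example.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body><div id=\"container\">"
  , "<h1>Title</h1>"
  , "<p><a class=\"foo\" href=\"/one\">One</a> <a href=\"/two\">Two</a></p>"
  , "</div></body></html>"
  ]

main :: IO ()
main = do
  let doc = parseHtml sampleHtml
  -- Descendant selector, followed by HXT's (//>) to reach the text nodes.
  headings <- runX $ doc >>> css "#container h1" //> getText
  mapM_ putStrLn headings
  -- Class selector: only the link carrying class "foo" should match.
  fooLinks <- runX $ doc >>> css "a.foo" ! "href"
  mapM_ putStrLn fooLinks
```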
(!)
Easily get attributes using (!):
doc >>> css "img" ! "src"
doc >>> css "a" ! "href"
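When an attribute is needed together with other data from the same element, one possible approach is to drop down to HXT directly. The sketch below is an assumption about how that might look, not something the library prescribes: it uses HXT's getAttrValue, (&&&), deep, and getText in place of (!), and sampleHtml is again hypothetical.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical markup, invented for this example.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body>"
  , "<a href=\"/one\">One</a>"
  , "<a href=\"/two\">Two</a>"
  , "</body></html>"
  ]

main :: IO ()
main = do
  let doc = parseHtml sampleHtml
  -- For each matched <a>, pair its href with the text found beneath it.
  -- A link with no text at all yields no pair, because (&&&) combines
  -- the results of both branches.
  links <- runX $ doc >>> css "a" >>> (getAttrValue "href" &&& deep getText)
  mapM_ (\(href, label) -> putStrLn (label ++ " -> " ++ href)) links
```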
Docs
Find Haddock docs on Hackage.
I also wrote The Complete Guide To Parsing HXT With Haskell.
Credits
Made by Adit.