clj-tagsoup

This is an HTML parser for Clojure, somewhat akin to Common Lisp's cl-html-parse. It is a wrapper around the TagSoup Java SAX parser, but has a DOM interface. It can be built with Leiningen.

Usage

The two main functions defined by clj-tagsoup are parse and parse-string. The first one can take anything accepted by clojure.java.io's reader function except for a Reader, while the second can parse HTML from a string.

The resulting HTML tree is a vector, consisting of:

  1. a keyword representing the tag name,
  2. a map of tag attributes (mapping keywords to strings),
  3. child nodes (strings or vectors of the same format).

This is the same format as used by hiccup, so the output of parse can be passed directly to hiccup.

There are also utility accessors (tag, attributes, children).
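
To make the tree format concrete, here is a minimal sketch that walks a hand-written tree of that shape using only clojure.core. The names node-tag, node-attributes, node-children and text-content are hypothetical local helpers written for this illustration; they are assumed to behave like the library's tag, attributes and children accessors, but they are not taken from its source.

```clojure
;; A hand-written tree in the vector format described above
;; (illustrative data, not actual parser output).
(def tree
  [:html {}
   [:head {} [:title {} "Example"]]
   [:body {:class "main"}
    [:p {} "Hello, " [:b {} "world"] "!"]]])

;; Hypothetical stand-ins for the accessors: a node is
;; [tag attribute-map & children].
(defn node-tag [node] (first node))
(defn node-attributes [node] (second node))
(defn node-children [node] (drop 2 node))

;; Concatenate all text in document order by recursing into children.
(defn text-content [node]
  (if (string? node)
    node
    (apply str (map text-content (node-children node)))))

(node-tag tree)      ;; => :html
(text-content tree)  ;; => "ExampleHello, world!"
```

Because a node is an ordinary vector, the usual sequence functions (first, second, drop, map) are enough to traverse it.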

clj-tagsoup will automatically use the correct encoding to parse the file if one is specified in either the HTTP headers (if the argument to parse is a URL object or a string representing one) or a <meta http-equiv="..."> tag.
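
The general idea of sniffing a charset from a meta tag and then re-decoding the input can be sketched as follows. This is not clj-tagsoup's implementation: meta-charset and decode-html are hypothetical helpers, the regular expression is a deliberate simplification, and the UTF-8 fallback is an assumption made for this example.

```clojure
;; Hypothetical helper: pull the charset name out of a
;; <meta http-equiv=... content="...; charset=X"> tag, or return nil.
(defn meta-charset [html]
  (second (re-find #"(?i)<meta[^>]*http-equiv[^>]*charset=([\w-]+)" html)))

;; Hypothetical helper: decode the bytes provisionally as ISO-8859-1
;; (which maps every byte to a character, so the ASCII meta tag stays
;; readable), look for a declared charset, then decode again with it.
(defn decode-html [^bytes raw]
  (let [provisional (String. raw "ISO-8859-1")
        charset     (or (meta-charset provisional) "UTF-8")] ; assumed fallback
    (String. raw ^String charset)))

(def page
  (str "<meta http-equiv=\"Content-Type\" "
       "content=\"text/html; charset=ISO-8859-2\">"
       "<p>zażółć</p>"))

(meta-charset page)                               ;; => "ISO-8859-2"
(= page (decode-html (.getBytes page "ISO-8859-2"))) ;; => true
```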

clj-tagsoup is meant to parse HTML tag soup, but in practice nothing prevents you from using it to parse arbitrary (potentially malformed) XML. The :xml keyword argument causes clj-tagsoup to take the XML header into consideration when detecting the encoding.

There are two other options for parsing XML:

  • parse-xml just invokes clojure.xml/parse with TagSoup, so the output format is compatible with clojure.xml and is not the one described above.
  • lazy-parse-xml (introduced in clj-tagsoup 0.3.0) returns a lazy sequence of Event records defined by clojure.data.xml, similarly to the source-seq function from that library.
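
For comparison, clojure.xml represents each element as a map with :tag, :attrs and :content keys rather than as a vector. The sketch below converts a hand-written element of that shape into the vector format described earlier. xml->vector is a hypothetical helper written for this illustration and is not part of clj-tagsoup.

```clojure
;; An element in the clojure.xml map shape (hand-written, not parser output).
;; :attrs is nil here to show an element without attributes.
(def xml-node
  {:tag :p
   :attrs nil
   :content ["See "
             {:tag :a
              :attrs {:href "http://example.com"}
              :content ["here"]}]})

;; Hypothetical converter: map-shaped element -> [tag attrs & children].
(defn xml->vector [node]
  (if (string? node)
    node
    (into [(:tag node) (or (:attrs node) {})]
          (map xml->vector (:content node)))))

(xml->vector xml-node)
;; => [:p {} "See " [:a {:href "http://example.com"} "here"]]
```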

Example

project.clj:

(defproject clj-tagsoup-example "0.0.1"
  :dependencies [[clj-tagsoup/clj-tagsoup "0.3.0"]])

lein repl:

(use 'pl.danieljanus.tagsoup)
=> nil

(parse "http://example.com")
=> [:html {}
          [:head {}
                 [:title {} "Example Web Page"]]
          [:body {}
                 [:p {} "You have reached this web page by typing \"example.com\",\n\"example.net\",\n  or \"example.org\" into your web browser."]
                 [:p {} "These domain names are reserved for use in documentation and are not available \n  for registration. See "
                     [:a {:shape "rect", :href "http://www.rfc-editor.org/rfc/rfc2606.txt"} "RFC \n  2606"]
                     ", Section 3."]]]

FAQ

  • Why not just use Enlive?

    Truth be told, I wrote clj-tagsoup prior to discovering Enlive, which is an excellent library. That said, I believe clj-tagsoup has its niche. Here is an à la carte list of differences between the two:

    • Enlive is a full-blown templating library; clj-tagsoup just parses HTML (and XML).
    • Unlike Enlive, clj-tagsoup's parse function goes out of its way to return parsed data in a proper encoding. It will detect the <meta http-equiv="..."> tag in your data and reinterpret the input stream to the indicated encoding as needed.
    • clj-tagsoup boasts a way to lazily parse XML with TagSoup.
  • What's with the dependency on stax-utils?

    It's for lazy-parse-xml. It's needed because that function uses clojure.data.xml, which under the hood uses the StAX API. TagSoup is a SAX parser, so a bridge between the two parsing APIs is needed.

    If you don't use lazy-parse-xml, you can optionally exclude stax-utils from your project.clj, like this:

     :dependencies [[clj-tagsoup "0.3.0" :exclusions [net.java.dev.stax-utils/stax-utils]]]
    

Author

clj-tagsoup was written by Daniel Janus.
