• Stars: 126
• Rank: 284,543 (Top 6 %)
• Language: Go
• Created over 12 years ago
• Updated over 9 years ago

Repository Details

Streaming XML parser example in Go

go-xml-parse

Intro

I've recently been messing around with the XML dumps of Wikipedia. These are pretty huge XML files - for instance the most recent revision is 36G when uncompressed. That's a lot of XML!

I've been experimenting with a few different languages and parsers for my task (which also happens to involve some non-trivial processing for each article) and found Go to be a great fit.

Go has a standard library package for parsing XML (encoding/xml) which is very convenient to code against. However, the simple version of the API requires parsing the whole document at once, which for 36G is not a viable strategy.

The parser can also be used in a streaming mode, but I found the documentation terse and online examples nonexistent, so here is my example code for parsing Wikipedia with encoding/xml and a little explanation! (full example code at https://github.com/dps/go-xml-parse/blob/master/go-xml-parse.go)

Here's a little snippet of an example Wikipedia page from the dump:

<page> 
  <title>Apollo 11</title> 
    <redirect title="Foo bar" /> 
    ... 
     <revision> 
     ... 
       <text xml:space="preserve"> 
       {{Infobox Space mission 
       |mission_name=<!--See above-->
       |insignia=Apollo_11_insignia.png 
     ... 
       </text> 
     </revision> 
</page>

In our Go code, we define a struct to match the <page> element and its nested <redirect> element, and grab a couple of fields we're interested in (<text> and <title>).

type Redirect struct { 
    Title string `xml:"title,attr"` 
} 

type Page struct { 
    Title string `xml:"title"` 
    Redir Redirect `xml:"redirect"` 
    Text string `xml:"revision>text"` 
}
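To check that the struct tags pick out the right fields, here is a minimal sketch that unmarshals a single in-memory page with xml.Unmarshal. The samplePage constant and the parsePage helper are made up for illustration and are not part of the original example code; this is the whole-document style of the API, so it only makes sense for small inputs.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Redirect struct {
	Title string `xml:"title,attr"`
}

type Page struct {
	Title string   `xml:"title"`
	Redir Redirect `xml:"redirect"`
	Text  string   `xml:"revision>text"`
}

// samplePage is a shortened, made-up <page> element used only for illustration.
const samplePage = `<page>
  <title>Apollo 11</title>
  <redirect title="Foo bar" />
  <revision>
    <text xml:space="preserve">{{Infobox Space mission}}</text>
  </revision>
</page>`

// parsePage (hypothetical helper) unmarshals one complete <page> element
// that is already held in memory.
func parsePage(data string) (Page, error) {
	var p Page
	err := xml.Unmarshal([]byte(data), &p)
	return p, err
}

func main() {
	p, err := parsePage(samplePage)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println("title:   ", p.Title)
	fmt.Println("redirect:", p.Redir.Title)
	fmt.Println("text:    ", p.Text)
}
```

The `title,attr` tag reads the title attribute of <redirect>, and `revision>text` reaches through the <revision> element to the nested <text>.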

Now we would usually tell the parser that a Wikipedia dump contains a bunch of <page>s and try to read the whole thing, but let's see how we stream it instead.

It's quite simple when you know how - iterate over tokens in the file until you encounter a StartElement with the name "page" and then use the magic decoder.DecodeElement API to unmarshal the whole following page into an object of the Page type defined above. Cool!

decoder := xml.NewDecoder(xmlFile) 

for { 
    // Read tokens from the XML document in a stream. 
    t, _ := decoder.Token() 
    if t == nil { 
        break 
    } 
    // Inspect the type of the token just read. 
    switch se := t.(type) { 
    case xml.StartElement: 
        // If we just read a StartElement token 
        // ...and its name is "page" 
        if se.Name.Local == "page" { 
            var p Page 
            // decode a whole chunk of following XML into the
            // variable p which is a Page (see above) 
            decoder.DecodeElement(&p, &se) 
            // Do some stuff with the page. 
            p.Title = CanonicalizeTitle(p.Title)
            ...
        } 
...

I hope this saves you some time if you need to parse a huge XML file yourself.

More Repositories

1

piui

Add a UI to your standalone Raspberry Pi project using your Android phone
JavaScript
416
star
2

nnrccar

nnrccar
C++
234
star
3

remarkable-keywriter

QML
183
star
4

rust-raytracer

🔭 A simple ray tracer in Rust 🦀
Rust
180
star
5

remarkable-wikipedia

QML
142
star
6

rpi-timelapse

Timelapse Camera Controller for Raspberry Pi
Python
104
star
7

montesheet

JavaScript
13
star
8

piui-sdcards

sdcard images for piui
8
star
9

unhumanize

a simple python library to convert humanized time intervals (e.g. 'an hour ago') into timedeltas
7
star
10

northbelt

A belt that buzzes north
Arduino
6
star
11

SwiftUI-Recipes

๐Ÿฝ A SwiftUI demo app showing how to fetch data from the server to populate list views, navigate to detailed results and wire in a search field.
Swift
6
star
12

piui-timelapse

PiUi version of rpi-timelapse
Python
4
star
13

spreadsheet

JavaScript
4
star
14

shortest-sudoku

A collection of tiny Sudoku solvers
Java
4
star
15

android-vector-climacons

Android vector drawable resources for @adamwhitcroft 's Climacons
3
star
16

aoc

Python
3
star
17

recipes

recipes
Python
2
star
18

wear-exchangerates

Exchange Rates complication data provider for Android Wear 2.0
Java
2
star
19

laserUp

๐Ÿ” Create 3D relief map slices for Glowforge. ๐ŸŒŽGenerate input files at
Python
2
star
20

go-zim

Pure Go reader for the ZIM file format
Go
2
star
21

wixel

Wixel Apps
C
2
star
22

lasercut

A collection of laser cutter designs. Most made for my kids.
Python
1
star
23

wordgrid

1
star
24

multihttp

A multiplexing http notifier in golang
Go
1
star
25

webalert

Python
1
star
26

remarkable-ambient-launcher

A launcher for reMarkable
QMake
1
star
27

gcal-cruncher

Crunches .ics files exported from Google calendar to show you where your time has been spent
Ruby
1
star
28

dial.fyi.complications

dial.fyi complications
Java
1
star
29

hotdog

HTML
1
star
30

sundial

A sundial laser-cutter template generator
Python
1
star
31

dial.fyi

Java
1
star
32

simplescheduler

A simple task scheduler using redis for python
Python
1
star