Feed Parser (RSS, Atom)

Package feed implements a flexible, robust and efficient RSS/Atom parser.

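If you just want to parse some bytes quickly into an object without caring about the underlying feed type, start with the Simple Use section below.

For a deeper dive into customizing the parser's behavior, see the later sections on extending BasicFeed, recovering from bad input, specification compliance checking, and RSS/Atom extensions.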

Installation & Use

Get the package

go get github.com/jloup/xml

Use it in code

import "github.com/jloup/xml/feed"

Simple Use : feed.Parse(io.Reader, feed.DefaultOptions)

Example:

f, err := os.Open("feed.txt")

if err != nil {
    fmt.Printf("Cannot open feed: %s\n", err)
    return
}
defer f.Close()

myfeed, err := feed.Parse(f, feed.DefaultOptions)

if err != nil {
    fmt.Printf("Cannot parse feed: %s\n", err)
    return
}

fmt.Printf("FEED '%s'\n", myfeed.Title)
for i, entry := range myfeed.Entries {
    fmt.Printf("\t#%v '%s' (%s)\n\t\t%s\n\n", i, entry.Title,
                                                 entry.Link,
                                                 entry.Summary)
}

Output:

FEED 'Me, Myself and I'
	#0 'Breakfast' (http://example.org/2005/04/02/breakfast)
		eggs and bacon, yup !

	#1 'Dinner' (http://example.org/2005/04/02/dinner)
		got soap delivered !

feed.Parse returns a BasicFeed, whose fields are:

// Rss channel or Atom feed
type BasicFeed struct {
  Title   string
  Id      string // Atom:feed:id | RSS:channel:link 
  Date    time.Time
  Image   string // Atom:feed:logo:iri | RSS:channel:image:url
  Entries []BasicEntryBlock
}

type BasicEntryBlock struct {
	Title   string
	Link    string
	Date    time.Time // Atom:entry:updated | RSS:item:pubDate
	Id      string // Atom:entry:id | RSS:item:guid
	Summary string
}
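
Both Date fields are time.Time values even though the two formats write dates differently: Atom's updated element uses RFC 3339 timestamps, while RSS pubDate uses RFC 822 style dates. How the package itself parses dates is not shown in this README; the following is only a minimal sketch of the idea using the standard library, and parseFeedDate is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// parseFeedDate is a hypothetical helper (not part of the feed package).
// It tries the Atom layout first, then two common RSS layouts.
func parseFeedDate(s string) (time.Time, error) {
	layouts := []string{time.RFC3339, time.RFC1123Z, time.RFC1123}
	var lastErr error
	for _, layout := range layouts {
		t, err := time.Parse(layout, s)
		if err == nil {
			return t, nil
		}
		lastErr = err
	}
	return time.Time{}, lastErr
}

func main() {
	atomDate, _ := parseFeedDate("2005-04-02T08:30:00Z")
	rssDate, _ := parseFeedDate("Sat, 02 Apr 2005 08:30:00 +0000")

	// Both strings describe the same instant, so the values compare equal.
	fmt.Println(atomDate.Equal(rssDate)) // true
}
```

Real-world feeds use more date variants than these three layouts, so a production parser would likely need a longer list.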

Extending BasicFeed

BasicFeed is a deliberately minimal struct that implements the feed.UserFeed interface. If you need access to more of the values extracted from a feed, you can pass your own implementation of feed.UserFeed to feed.ParseCustom.

type UserFeed interface {
    PopulateFromAtomFeed(f *atom.Feed) // see github.com/jloup/xml/feed/atom
    PopulateFromAtomEntry(e *atom.Entry)
    PopulateFromRssChannel(c *rss.Channel) // see github.com/jloup/xml/feed/rss
    PopulateFromRssItem(i *rss.Item)
}

func ParseCustom(r io.Reader, feed UserFeed, options ParseOptions) error

To avoid starting from scratch, you can embed feed.BasicEntryBlock and feed.BasicFeedBlock in your own structs.

Example:

type MyFeed struct {
	feed.BasicFeedBlock
	Generator string
	Entries   []feed.BasicEntryBlock
}

func (m *MyFeed) PopulateFromAtomFeed(f *atom.Feed) {
	m.BasicFeedBlock.PopulateFromAtomFeed(f)

	m.Generator = fmt.Sprintf("%s V%s", f.Generator.Uri.String(), 
	                                    f.Generator.Version.String())
}

func (m *MyFeed) PopulateFromRssChannel(c *rss.Channel) {
	m.BasicFeedBlock.PopulateFromRssChannel(c)

	m.Generator = c.Generator.String()
}

func (m *MyFeed) PopulateFromAtomEntry(e *atom.Entry) {
	newEntry := feed.BasicEntryBlock{}
	newEntry.PopulateFromAtomEntry(e)
	m.Entries = append(m.Entries, newEntry)
}

func (m *MyFeed) PopulateFromRssItem(i *rss.Item) {
	newEntry := feed.BasicEntryBlock{}
	newEntry.PopulateFromRssItem(i)
	m.Entries = append(m.Entries, newEntry)
}

func main() {
    f, err := os.Open("feed.txt")

    if err != nil {
        fmt.Printf("Cannot open feed: %s\n", err)
        return
    }
    defer f.Close()

    myfeed := &MyFeed{}

    err = feed.ParseCustom(f, myfeed, feed.DefaultOptions)

    if err != nil {
        fmt.Printf("Cannot parse feed: %s\n", err)
        return
    }

    fmt.Printf("FEED '%s' generated with %s\n", myfeed.Title, myfeed.Generator)
    for i, entry := range myfeed.Entries {
        fmt.Printf("\t#%v '%s' (%s)\n", i, entry.Title, entry.Link)
    }
}

Output:

FEED 'Me, Myself and I' generated with http://www.atomgenerator.com/ V1.0
	#0 'Breakfast' (http://example.org/2005/04/02/breakfast)
	#1 'Dinner' (http://example.org/2005/04/02/dinner)

Robustness and recovery from bad input

Feeds are widely used, and it is quite common for a single invalid character or a missing opening or closing tag to invalidate a whole feed. The standard encoding/xml package is quite pedantic (as it should be) about its XML input.

To produce an output feed at all costs, you can set the number of times the parser may recover from invalid input via the XMLTokenErrorRetry field in ParseOptions. The strategy is simple: if the XML decoder returns an XMLTokenError while parsing, the faulty token is removed from the input and the parser tries again to build a feed from what remains. This is useful, for example, when invalid HTML or XML is present in an Atom content tag.

Example:

f, err := os.Open("testdata/invalid_atom.xml")
if err != nil {
  fmt.Printf("Cannot open feed: %s\n", err)
  return
}

opt := feed.DefaultOptions
opt.XMLTokenErrorRetry = 1

_, err = feed.Parse(f, opt)

if err != nil {
  fmt.Printf("Cannot parse feed: %s\n", err)
} else {
  fmt.Println("no error")
}

Output:

no error

With XMLTokenErrorRetry set to 0, parsing would have produced the following error:

Cannot parse feed: [XMLTokenError] XML syntax error on line 574: illegal character code U+000C
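
The retry idea can be illustrated with the standard library alone. The following is a minimal sketch and not the package's implementation: it assumes the only faults are characters forbidden by XML 1.0 and removes one such character per retry, whereas the package removes the faulty token. The names isXMLChar, dropFirstIllegal, tokenize and parseWithRetry are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf8"
)

// isXMLChar reports whether r is allowed by the XML 1.0 Char production.
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// dropFirstIllegal returns a copy of data without its first forbidden
// character (or invalid UTF-8 byte), and whether anything was removed.
func dropFirstIllegal(data []byte) ([]byte, bool) {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		if (r == utf8.RuneError && size == 1) || !isXMLChar(r) {
			out := make([]byte, 0, len(data)-size)
			out = append(out, data[:i]...)
			out = append(out, data[i+size:]...)
			return out, true
		}
		i += size
	}
	return data, false
}

// tokenize reads every token and returns the first decoder error, if any.
func tokenize(data []byte) error {
	dec := xml.NewDecoder(bytes.NewReader(data))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// parseWithRetry re-tokenizes the input after each repair, giving up once
// the retry budget is spent or the error is not a syntax error.
func parseWithRetry(data []byte, retries int) ([]byte, error) {
	for attempt := 0; ; attempt++ {
		err := tokenize(data)
		if err == nil {
			return data, nil
		}
		var syn *xml.SyntaxError
		if attempt >= retries || !errors.As(err, &syn) {
			return nil, err
		}
		cleaned, removed := dropFirstIllegal(data)
		if !removed {
			return nil, err // nothing this sketch knows how to repair
		}
		data = cleaned
	}
}

func main() {
	input := []byte("<feed><title>Me\fMyself</title></feed>")

	_, err := parseWithRetry(input, 0)
	fmt.Println(err != nil) // true: the form feed (U+000C) is rejected

	out, err := parseWithRetry(input, 1)
	fmt.Println(string(out), err) // <feed><title>MeMyself</title></feed> <nil>
}
```

As with XMLTokenErrorRetry, the retry count bounds how much damage the parser is willing to paper over: an input with two forbidden characters would still fail with a budget of 1.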

Parse with specification compliancy checking

RSS and Atom feeds should conform to a specification (which is complex for Atom). By default, the Parse functions are not too restrictive about input feeds. To validate feeds, you can pass a custom FlagChecker in ParseOptions. If you really know what you are doing, you can enable or disable individual spec checks.

The error flags for each standard are listed in the corresponding package documentation:

  • RSS : github.com/jloup/xml/feed/rss
  • Atom : github.com/jloup/xml/feed/atom

Example:

// the input feed is not compliant to spec
f, err := os.Open("feed.txt")
if err != nil {
    return
}

// the input feed should be 100% compliant to spec...
flags := xmlutils.NewErrorChecker(xmlutils.EnableAllError)

//... but it is OK if Atom entry does not have <updated> field
flags.DisableErrorChecking("entry", atom.MissingDate)

options := feed.ParseOptions{extension.Manager{}, &flags}

myfeed, err := feed.Parse(f, options)

if err != nil {
    fmt.Printf("Cannot parse feed:\n%s\n", err)
    return
}

fmt.Printf("FEED '%s'\n", myfeed.Title)

Output:

Cannot parse feed:
in 'feed':
[MissingId]
	feed's id should exist
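
The "enable everything, then switch off individual checks per element" pattern used above can be pictured with a small stand-in type. This is a hypothetical sketch for illustration only and not the xmlutils implementation; the checker type, its methods, and the use of plain strings as flag names are all assumptions.

```go
package main

import "fmt"

// checker is a hypothetical stand-in for a flag checker.
// It is NOT the xmlutils implementation.
type checker struct {
	enableAll bool
	disabled  map[string]map[string]bool // element name -> flag name -> disabled
}

func newChecker(enableAll bool) *checker {
	return &checker{enableAll: enableAll, disabled: map[string]map[string]bool{}}
}

// disable turns one check off for one element.
func (c *checker) disable(element, flag string) {
	if c.disabled[element] == nil {
		c.disabled[element] = map[string]bool{}
	}
	c.disabled[element][flag] = true
}

// enabled reports whether a violation of flag inside element should be
// reported as an error.
func (c *checker) enabled(element, flag string) bool {
	return c.enableAll && !c.disabled[element][flag]
}

func main() {
	c := newChecker(true)
	c.disable("entry", "MissingDate")

	fmt.Println(c.enabled("feed", "MissingId"))    // true: still reported
	fmt.Println(c.enabled("entry", "MissingDate")) // false: explicitly disabled
}
```

Keying the disabled set by element as well as by flag is what lets a missing date be tolerated in entries while the same flag stays active elsewhere.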

RSS and Atom extensions

Both formats allow third-party extensions. A few extensions have been implemented as examples, e.g. RSS dc:creator (github.com/jloup/xml/feed/rss/extension/dc).

Example:

type ExtendedFeed struct {
    feed.BasicFeedBlock
    Entries []ExtendedEntry
}

type ExtendedEntry struct {
    feed.BasicEntryBlock
    Creator string // <dc:creator> only present in RSS feeds
}

func (f *ExtendedFeed) PopulateFromAtomEntry(e *atom.Entry) {
    newEntry := ExtendedEntry{}
    newEntry.PopulateFromAtomEntry(e)
    f.Entries = append(f.Entries, newEntry)
}

func (f *ExtendedFeed) PopulateFromRssItem(i *rss.Item) {
    newEntry := ExtendedEntry{}
    newEntry.PopulateFromRssItem(i)

    creator, ok := dc.GetCreator(i)
    // we must check the item actually has a dc:creator element
    if ok {
        newEntry.Creator = creator.String()
    }
    f.Entries = append(f.Entries, newEntry)
}

func main() {
    f, err := os.Open("rss.txt")

    if err != nil {
        fmt.Printf("Cannot open feed: %s\n", err)
        return
    }
    defer f.Close()

    // Manager is in github.com/jloup/xml/feed/extension
    manager := extension.Manager{}
    // add the dc extension to it
    // (the dc extension is in github.com/jloup/xml/feed/rss/extension/dc)
    dc.AddToManager(&manager)

    opt := feed.DefaultOptions
    // pass our custom extension Manager to ParseOptions
    opt.ExtensionManager = manager

    myfeed := &ExtendedFeed{}
    err = feed.ParseCustom(f, myfeed, opt)

    if err != nil {
        fmt.Printf("Cannot parse feed: %s\n", err)
        return
    }

    fmt.Printf("FEED '%s'\n", myfeed.Title)
    for i, entry := range myfeed.Entries {
        fmt.Printf("\t#%v '%s' by %s (%s)\n", i, entry.Title,
                                                 entry.Creator,
                                                 entry.Link)
    }
}

Output:

FEED 'Me, Myself and I'
	#0 'Breakfast' by Peter J. (http://example.org/2005/04/02/breakfast)
	#1 'Dinner' by Peter J. (http://example.org/2005/04/02/dinner)