Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Check the documentation 📙.

Usage

Take this HTML as an example:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>

Here are some queries that you can perform (with return examples):

{:ok, document} = Floki.parse_document(html)

Floki.find(document, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]

document
|> Floki.find("p.headline")
|> Floki.raw_html()
# => <p class="headline">Floki</p>

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
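Because nodes are plain tuples, they can be taken apart with ordinary pattern matching. The sketch below needs no Floki calls at all; the `NodeSketch` module and its function names are hypothetical helpers written for illustration, and it only covers element tuples and text nodes.

```elixir
defmodule NodeSketch do
  # Hypothetical helpers, not part of Floki's API.

  # Returns the tag name of an element node, or nil for a text node.
  def tag({tag_name, _attributes, _children}), do: tag_name
  def tag(text) when is_binary(text), do: nil

  # Looks up one attribute; attributes are a list of {name, value} tuples.
  def attr({_tag_name, attributes, _children}, name) do
    case List.keyfind(attributes, name, 0) do
      {^name, value} -> value
      nil -> nil
    end
  end

  # Joins only the direct text children of an element.
  def own_text({_tag_name, _attributes, children}) do
    children
    |> Enum.filter(&is_binary/1)
    |> Enum.join()
  end
end

headline = {"p", [{"class", "headline"}], ["Floki"]}

NodeSketch.tag(headline)
# => "p"
NodeSketch.attr(headline, "class")
# => "headline"
NodeSketch.own_text(headline)
# => "Floki"
```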

Installation

Add Floki to your mix.exs:

defp deps do
  [
    {:floki, "~> 0.36.0"}
  ]
end

After that, run mix deps.get.

If you are running in Livebook or in a script, you can install it with Mix.install/2:

Mix.install([
  {:floki, "~> 0.36.0"}
])

You can check the changelog for changes.

Dependencies

Floki needs the :leex module in order to compile. Normally this module is installed with Erlang in a complete installation.

If you get the "module :leex is not available" error message, you need to install the erlang-dev and erlang-parsetools packages in order to get the :leex module. The package names may differ depending on your OS.

Alternative HTML parsers

By default Floki uses a patched version of mochiweb_html for parsing fragments due to its ease of installation (it's written in Erlang and has no outside dependencies).

However, you might want to use an alternative parser because of the following concerns:

  • Performance - It can be up to 20 times slower than the alternatives on big HTML documents.
  • Correctness - In some cases mochiweb_html produces results that differ from what the HTML5 specification requires. For example, a correct parser would parse <title> <b> bold </b> text </title> as {"title", [], [" <b> bold </b> text "]}, since content inside <title> is to be treated as plain text. mochiweb_html, however, would parse it as {"title", [], [{"b", [], [" bold "]}, " text "]}.

Floki supports the following alternative parsers:

  • fast_html - A wrapper for lexbor, a pure C HTML parser.
  • html5ever - A wrapper for html5ever written in Rust, developed as a part of the Servo project.

fast_html is generally faster, according to the benchmarks conducted by its developers.

You can perform a benchmark by running the following:

$ sh benchs/extract.sh
$ mix run benchs/parse_document.exs

Extracting the files is needed only once.

Using html5ever as the HTML parser

This dependency is written with a NIF using Rustler, but you don't need to install anything to compile it thanks to RustlerPrecompiled.

defp deps do
  [
    {:floki, "~> 0.36.0"},
    {:html5ever, "~> 0.15.0"}
  ]
end

Run mix deps.get, then compile the project with mix compile to make sure it works.

Then you need to configure your app to use html5ever:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.Html5ever

Note that you can also pass the HTML parser as an option to parse_document/2 and parse_fragment/2.
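As a minimal sketch of the per-call option, the script below passes `:html_parser` to `parse_fragment/2`. It names the built-in `Floki.HTMLParser.Mochiweb` module so that it runs without any extra dependency; swapping in `Floki.HTMLParser.Html5ever` assumes that the html5ever package has been added as shown above.

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

html = ~s(<p class="headline">Floki</p>)

# The parser is chosen for this call only; the application config is untouched.
{:ok, fragment} =
  Floki.parse_fragment(html, html_parser: Floki.HTMLParser.Mochiweb)

result = Floki.find(fragment, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]
```

Choosing the parser per call can be handy when only some inputs need the stricter HTML5 behaviour.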

Using fast_html as the HTML parser

A C compiler, GNU Make and CMake need to be installed on the system in order to compile lexbor.

First, add fast_html to your dependencies:

defp deps do
  [
    {:floki, "~> 0.36.0"},
    {:fast_html, "~> 2.0"}
  ]
end

Run mix deps.get, then compile the project with mix compile to make sure it works.

Then you need to configure your app to use fast_html:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.FastHtml

More about Floki API

To parse an HTML document, try:

html = """
  <html>
  <body>
    <div class="example"></div>
  </body>
  </html>
"""

{:ok, document} = Floki.parse_document(html)
# => {:ok, [{"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}]}

To find elements with the class example, try:

Floki.find(document, ".example")
# => [{"div", [{"class", "example"}], []}]

To convert your node tree back to raw HTML (spaces are ignored):

document
|> Floki.find(".example")
|> Floki.raw_html()
# => <div class="example"></div>

To fetch some attribute from elements, try:

Floki.attribute(document, ".example", "class")
# => ["example"]

You can get attributes from elements that you already have:

document
|> Floki.find(".example")
|> Floki.attribute("class")
# => ["example"]
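Combining `Floki.find/2`, `Floki.attribute/2` and `Floki.text/1` gives a simple way to list links. The following is a sketch that reuses the two links from the Usage example; the variable names are illustrative.

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

html = """
<section id="content">
  <a href="https://github.com/philss/floki">Github page</a>
</section>
<a href="https://hex.pm/packages/floki">Hex package</a>
"""

{:ok, document} = Floki.parse_document(html)

# Pair each link's text with its href, in document order.
links =
  for link <- Floki.find(document, "a") do
    [href] = Floki.attribute([link], "href")
    {Floki.text([link]), href}
  end

# => [
#      {"Github page", "https://github.com/philss/floki"},
#      {"Hex package", "https://hex.pm/packages/floki"}
#    ]
```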

If you want to get the text from an element, try the following (using the document from the Usage section):

document
|> Floki.find("p.headline")
|> Floki.text()
# => "Floki"

Supported selectors

Here are all the CSS selectors supported in the current version:

Pattern                 Description
*                       any element
E                       an element of type E
E[foo]                  an E element with a "foo" attribute
E[foo="bar"]            an E element whose "foo" attribute value is exactly equal to "bar"
E[foo~="bar"]           an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar"
E[foo^="bar"]           an E element whose "foo" attribute value begins exactly with the string "bar"
E[foo$="bar"]           an E element whose "foo" attribute value ends exactly with the string "bar"
E[foo*="bar"]           an E element whose "foo" attribute value contains the substring "bar"
E[foo|="en"]            an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en"
E:nth-child(n)          an E element, the n-th child of its parent
E:nth-last-child(n)     an E element, the n-th child of its parent, counting from the last one
E:first-child           an E element, first child of its parent
E:last-child            an E element, last child of its parent
E:nth-of-type(n)        an E element, the n-th child of its type among its siblings
E:nth-last-of-type(n)   an E element, the n-th child of its type among its siblings, counting from the last one
E:first-of-type         an E element, first child of its type among its siblings
E:last-of-type          an E element, last child of its type among its siblings
E:checked               an E element (checkbox, radio, or option) that is checked
E:disabled              an E element (button, input, select, textarea, or option) that is disabled
E.warning               an E element whose class is "warning"
E#myid                  an E element with ID equal to "myid" (for ids containing periods, use #my\\.id or [id="my.id"])
E:not(s)                an E element that does not match simple selector s
:root                   the root node or nodes (in case of fragments) of the document. Most of the time this is the html tag
E F                     an F element descendant of an E element
E > F                   an F element child of an E element
E + F                   an F element immediately preceded by an E element
E ~ F                   an F element preceded by an E element
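A few of these patterns can be tried against the HTML from the Usage section. This is only a sketch: the variable names are illustrative, and the selectors were picked as a small sample rather than an exhaustive tour.

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

html = """
<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>
"""

{:ok, document} = Floki.parse_document(html)

# E[foo="bar"]: exact attribute match
user = document |> Floki.find(~s(span[data-model="user"])) |> Floki.text()
# => "philss"

# E > F: only the link that is a direct child of #content
inner = document |> Floki.find("#content > a") |> Floki.attribute("href")
# => ["https://github.com/philss/floki"]

# E[foo^="bar"]: attribute value prefix
hex = document |> Floki.find(~s(a[href^="https://hex.pm"])) |> Floki.text()
# => "Hex package"

# E:not(s): elements with the class that are not <p>
tagline = document |> Floki.find(".headline:not(p)") |> Floki.text()
# => "Enables search using CSS selectors"
```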

There are also some selectors based on non-standard specifications. They are:

Pattern                 Description
E:fl-contains('foo')    an E element that contains "foo" inside a text node
E:fl-icontains('foo')   an E element that contains "foo" inside a text node (case insensitive)
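The difference between the two text selectors shows up when the casing differs. Here is a small sketch with made-up markup (the `/a` and `/b` hrefs are placeholders):

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

html = ~s(<div><a href="/a">Github page</a><a href="/b">Hex package</a></div>)

{:ok, fragment} = Floki.parse_fragment(html)

# Case-sensitive match on the text node
exact = fragment |> Floki.find("a:fl-contains('Hex')") |> Floki.attribute("href")
# => ["/b"]

# Lowercase 'github' does not match "Github" when case matters...
miss = Floki.find(fragment, "a:fl-contains('github')")
# => []

# ...but it does with the case-insensitive variant
loose = fragment |> Floki.find("a:fl-icontains('github')") |> Floki.attribute("href")
# => ["/a"]
```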

Suppressing log messages

Floki may log debug messages related to problems in the parsing of selectors, or parsing of the HTML tree. It also may log some "info" messages related to deprecated APIs. If you want to suppress these log messages, please consider setting the :compile_time_purge_matching option for :logger in your compile time configuration.

See https://hexdocs.pm/logger/Logger.html#module-compile-configuration for details.
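One possible configuration is sketched below. It assumes you want to drop every Logger call that comes from the :floki application; narrow the filter if you only want to silence some levels.

```elixir
# in config/config.exs
import Config

# Assumption: purge all log calls whose :application metadata is :floki.
config :logger,
  compile_time_purge_matching: [
    [application: :floki]
  ]
```

Since the purge happens at compile time, the dependency has to be recompiled for the setting to take effect, for example with mix deps.compile floki --force.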

License

Copyright (c) 2014 Philip Sampaio Silva

Floki is released under the MIT license. Check the LICENSE file for more details.
