go-fasttld

Summary

go-fasttld is a high-performance effective top-level domain (eTLD) extraction module that extracts subcomponents from URLs.

URLs can contain hostnames, IPv4 addresses, or IPv6 addresses. eTLD extraction is based on the Mozilla Public Suffix List. Private domains listed in the Mozilla Public Suffix List, such as 'blogspot.co.uk' and 'sinaapp.com', are also supported.

Installation

go get github.com/elliotwutingfeng/go-fasttld

Try the CLI

First, build the CLI application.

# `git clone` and `cd` to the go-fasttld repository folder first
make build_cli

Afterwards, try extracting subcomponents from a URL.

# `git clone` and `cd` to the go-fasttld repository folder first
./dist/fasttld extract https://user@a.subdomain.example.a%63.uk:5000/a/b\?id\=42

Try the example code

All of the following examples can be found at examples/demo.go. To run the demo, use the following command:

# `git clone` and `cd` to the go-fasttld repository folder first
make demo

Hostname

// Initialise fasttld extractor
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Extract URL subcomponents
url := "https://user@a.subdomain.example.a%63.uk:5000/a/b?id=42"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

// Display results
fasttld.PrintRes(url, res) // Pretty-prints res.Scheme, res.UserInfo, res.SubDomain etc.
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | user | a.subdomain | example | a%63.uk | example.a%63.uk | 5000 | /a/b?id=42 | hostname |

IPv4 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://127.0.0.1:5000"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | | 127.0.0.1 | | 127.0.0.1 | 5000 | | ipv4 address |

IPv6 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]:5000"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | | aBcD:ef01:2345:6789:aBcD:ef01:2345:6789 | | aBcD:ef01:2345:6789:aBcD:ef01:2345:6789 | 5000 | | ipv6 address |
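The three host types shown above can be told apart with the standard library alone. The following is a minimal sketch, not go-fasttld's actual parsing logic: `hostType` is a hypothetical helper, and it assumes IPv6 hosts arrive wrapped in square brackets as they do in URLs.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// hostType classifies a URL host as "ipv4 address", "ipv6 address" or
// "hostname". Hypothetical helper for illustration only; it performs no
// hostname validation and treats anything that is not an IP as a hostname.
func hostType(host string) string {
	// IPv6 hosts are written inside square brackets in URLs.
	if strings.HasPrefix(host, "[") && strings.HasSuffix(host, "]") {
		addr, err := netip.ParseAddr(host[1 : len(host)-1])
		if err == nil && addr.Is6() {
			return "ipv6 address"
		}
		return "hostname"
	}
	if addr, err := netip.ParseAddr(host); err == nil && addr.Is4() {
		return "ipv4 address"
	}
	return "hostname"
}

func main() {
	fmt.Println(hostType("127.0.0.1"))
	fmt.Println(hostType("[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]"))
	fmt.Println(hostType("a.subdomain.example.co.uk"))
}
```

The strings returned here match the HostType column in the tables above; the module itself may apply stricter rules.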

Internationalised label separators

go-fasttld supports the following internationalised label separators (IETF RFC 3490):

| Full Stop | Ideographic Full Stop | Fullwidth Full Stop | Halfwidth Ideographic Full Stop |
|---|---|---|---|
| U+002E . | U+3002 。 | U+FF0E . | U+FF61 。 |
extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | brb\u002ei\u3002am\uff0egoing\uff61to | be | a\uff61fk | be\u3002a\uff61fk | | | hostname |
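To illustrate what treating all four code points as label separators means, here is a small standard-library sketch that splits a host into labels. `isLabelSeparator` and `splitLabels` are hypothetical names, and this is not necessarily how go-fasttld does it internally.

```go
package main

import (
	"fmt"
	"strings"
)

// isLabelSeparator reports whether r is one of the four label separators
// listed in RFC 3490. Hypothetical helper for illustration.
func isLabelSeparator(r rune) bool {
	switch r {
	case '\u002e', '\u3002', '\uff0e', '\uff61':
		return true
	}
	return false
}

// splitLabels splits a host into labels on any of the four separators.
// strings.FieldsFunc drops empty fields, so consecutive separators do not
// produce empty labels in this sketch.
func splitLabels(host string) []string {
	return strings.FieldsFunc(host, isLabelSeparator)
}

func main() {
	host := "brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk"
	fmt.Println(strings.Join(splitLabels(host), " | "))
}
```

Run against the host from the example above, this yields the eight labels brb, i, am, going, to, be, a and fk.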

Public Suffix List options

Specify custom public suffix list file

You can use a custom public suffix list file by setting CacheFilePath in fasttld.SuffixListParams{} to its absolute path.

cacheFilePath := "/absolute/path/to/file.dat"
extractor, err := fasttld.New(fasttld.SuffixListParams{CacheFilePath: cacheFilePath})

Updating the default Public Suffix List cache

Whenever fasttld.New is called without specifying CacheFilePath in fasttld.SuffixListParams{}, the local cache of the default Public Suffix List is updated automatically if it is more than 3 days old. You can also manually update the cache by using Update().

// Automatic update performed if `CacheFilePath` is not specified
// and local cache is more than 3 days old
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Manually update local cache
if err := extractor.Update(); err != nil {
    log.Println(err)
}
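The 3-day rule described above amounts to a simple age comparison. The sketch below shows that check in isolation; `maxCacheAge` and `isStale` are hypothetical names chosen for illustration and are not part of the go-fasttld API.

```go
package main

import (
	"fmt"
	"time"
)

// maxCacheAge mirrors the 3-day threshold described above.
const maxCacheAge = 3 * 24 * time.Hour

// isStale reports whether a cache of the given age should be refreshed.
// In practice the age could come from time.Since(fileInfo.ModTime()).
// Hypothetical helper, not part of go-fasttld.
func isStale(age time.Duration) bool {
	return age > maxCacheAge
}

func main() {
	fmt.Println(isStale(24 * time.Hour))     // one day old
	fmt.Println(isStale(4 * 24 * time.Hour)) // four days old
}
```

A one-day-old cache is kept, while a four-day-old cache is reported as stale.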

Private domains

According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.com and sinaapp.com.

By default, these private domains are excluded (i.e. IncludePrivateSuffix = false)

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://google.blogspot.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | google | blogspot | com | blogspot.com | | | hostname |

You can include private domains by setting IncludePrivateSuffix = true

extractor, _ := fasttld.New(fasttld.SuffixListParams{IncludePrivateSuffix: true})
url := "https://google.blogspot.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | | google | blogspot.com | google.blogspot.com | | | hostname |
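The difference between the two results comes down to which suffixes are eligible for matching. The sketch below reproduces both splits with a plain map standing in for the Public Suffix List; `split` is a hypothetical helper, and the two maps are tiny illustrative subsets, not the real list.

```go
package main

import (
	"fmt"
	"strings"
)

// split returns (subdomain, domain, suffix) for host, using the longest
// suffix present in suffixes. Hypothetical helper for illustration; a
// plain map stands in for the Public Suffix List.
func split(host string, suffixes map[string]bool) (sub, domain, suffix string) {
	labels := strings.Split(host, ".")
	// Try the longest candidate suffix first, always leaving at least
	// one label to serve as the domain.
	for i := 1; i < len(labels); i++ {
		candidate := strings.Join(labels[i:], ".")
		if suffixes[candidate] {
			return strings.Join(labels[:i-1], "."), labels[i-1], candidate
		}
	}
	return "", host, ""
}

func main() {
	icannOnly := map[string]bool{"com": true}
	withPrivate := map[string]bool{"com": true, "blogspot.com": true}

	sub, domain, suffix := split("google.blogspot.com", icannOnly)
	fmt.Printf("%q %q %q\n", sub, domain, suffix)

	sub, domain, suffix = split("google.blogspot.com", withPrivate)
	fmt.Printf("%q %q %q\n", sub, domain, suffix)
}
```

With only "com" eligible, google becomes the subdomain; once "blogspot.com" is eligible, google becomes the domain, matching the two tables above.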

Extraction options

Ignore Subdomains

You can ignore subdomains by setting IgnoreSubDomains = true. By default, subdomains are extracted.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://maps.google.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url, IgnoreSubDomains: true})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | | google | com | google.com | | | hostname |

Punycode

By default, internationalised URLs are not converted to punycode before extraction.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://hello.世界.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | hello | 世界 | com | 世界.com | | | hostname |

You can convert internationalised URLs to punycode before extraction by setting ConvertURLToPunyCode = true.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://hello.世界.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: true})
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | hello | xn--rhqv96g | com | xn--rhqv96g.com | | | hostname |

Parsing errors

If the URL is invalid, the second value returned by Extract(), error, will be non-nil. Partially extracted subcomponents can still be retrieved from the first value returned, ExtractResult.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://example!.com" // invalid characters in hostname
color.New().Println("The following line should be an error message")
// Declare res outside the if statement so that it is still in scope for PrintRes
res, err := extractor.Extract(fasttld.URLParams{URL: url})
if err != nil {
    color.New(color.FgHiRed, color.Bold).Print("Error: ")
    color.New(color.FgHiWhite).Println(err)
}
fasttld.PrintRes(url, res) // Partially extracted subcomponents can still be retrieved
| Scheme | UserInfo | SubDomain | Domain | Suffix | RegisteredDomain | Port | Path | HostType |
|---|---|---|---|---|---|---|---|---|
| https:// | | | | | | | | |

Testing

# `git clone` and `cd` to the go-fasttld repository folder first
make tests

# Alternatively, run tests without race detection
# Useful for systems that do not support the -race flag like windows/386
# See https://tip.golang.org/src/cmd/dist/test.go
make tests_without_race

Benchmarks

# `git clone` and `cd` to the go-fasttld repository folder first
make bench

Modules used

| Benchmark Name | Source |
|---|---|
| GoFastTld | go-fasttld (this module) |
| JPilloraGoTld | github.com/jpillora/go-tld |
| JoeGuoTldExtract | github.com/joeguo/tldextract |
| Mjd2021USATldExtract | github.com/mjd2021usa/tldextract |

Results

Benchmarks performed on AMD Ryzen 7 5800X, Manjaro Linux.

go-fasttld performs especially well on longer URLs.


#1

https://iupac.org/iupac-announces-the-2021-top-ten-emerging-technologies-in-chemistry/

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
|---|---|---|---|---|---|
| GoFastTld | 8037906 | 150.8 ns/op | 0 B/op | 0 allocs/op | ✔️ |
| JPilloraGoTld | 1675113 | 716.1 ns/op | 224 B/op | 2 allocs/op | |
| JoeGuoTldExtract | 2204854 | 515.1 ns/op | 272 B/op | 5 allocs/op | |
| Mjd2021USATldExtract | 1676722 | 712.0 ns/op | 288 B/op | 6 allocs/op | |

#2

https://www.google.com/maps/dir/Parliament+Place,+Parliament+House+Of+Singapore,+Singapore/Parliament+St,+London,+UK/@25.2440033,33.6721455,4z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x31da19a0abd4d71d:0xeda26636dc4ea1dc!2m2!1d103.8504863!2d1.2891543!1m5!1m1!1s0x487604c5aaa7da5b:0xf13a2197d7e7dd26!2m2!1d-0.1260826!2d51.5017061!3e4

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
|---|---|---|---|---|---|
| GoFastTld | 6381516 | 181.9 ns/op | 0 B/op | 0 allocs/op | ✔️ |
| JPilloraGoTld | 431671 | 2603 ns/op | 928 B/op | 4 allocs/op | |
| JoeGuoTldExtract | 893347 | 1176 ns/op | 1120 B/op | 6 allocs/op | |
| Mjd2021USATldExtract | 1030250 | 1165 ns/op | 1120 B/op | 6 allocs/op | |

#3

https://a.b.c.d.e.f.g.h.i.j.k.l.m.n.oo.pp.qqq.rrrr.ssssss.tttttttt.uuuuuuuuuuu.vvvvvvvvvvvvvvv.wwwwwwwwwwwwwwwwwwwwww.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy.zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.cc

| Benchmark Name | Iterations | ns/op | B/op | allocs/op | Fastest |
|---|---|---|---|---|---|
| GoFastTld | 833682 | 1424 ns/op | 0 B/op | 0 allocs/op | ✔️ |
| JPilloraGoTld | 734790 | 1640 ns/op | 304 B/op | 3 allocs/op | |
| JoeGuoTldExtract | 695475 | 1452 ns/op | 1040 B/op | 5 allocs/op | |
| Mjd2021USATldExtract | 330717 | 3628 ns/op | 1904 B/op | 8 allocs/op | |

Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element only works for simple eTLDs like com, but not more complex ones like oseto.nagasaki.jp.
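The limitation is easy to see in a few lines of Go. `naiveSuffix` below is a hypothetical helper that implements the naive approach, not anything go-fasttld provides.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSuffix returns the text after the last "." in host.
// This is the naive approach discussed above, shown for contrast.
func naiveSuffix(host string) string {
	if i := strings.LastIndex(host, "."); i >= 0 {
		return host[i+1:]
	}
	return host
}

func main() {
	fmt.Println(naiveSuffix("example.com"))               // com
	fmt.Println(naiveSuffix("example.oseto.nagasaki.jp")) // jp
}
```

The second result is jp, whereas the eTLD according to the Public Suffix List is oseto.nagasaki.jp.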

eTLD tries

go-fasttld stores eTLDs in compressed tries.

Valid eTLDs from the Mozilla Public Suffix List are appended to the compressed trie in reverse-order.

Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.
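The right-to-left walk can be sketched with a simple map-based trie. This is an uncompressed trie written for illustration, not the compressed trie go-fasttld uses; `node`, `newNode`, `insert` and `longestSuffix` are hypothetical names. The sketch also remembers the last node flagged as a valid eTLD, so a partial path such as edu.au on its own is not returned.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one label in a suffix trie keyed by labels in reverse order.
type node struct {
	children map[string]*node
	end      bool // true if the path to this node is a valid eTLD
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds an eTLD such as "nsw.edu.au" as the path au -> edu -> nsw.
func (n *node) insert(etld string) {
	labels := strings.Split(etld, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.end = true
}

// longestSuffix walks the host labels right to left and returns the
// longest matching path that ends on a valid eTLD, or "" if none matches.
func (n *node) longestSuffix(host string) string {
	labels := strings.Split(host, ".")
	cur := n
	best := -1 // index of the leftmost label of the best match so far
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break
		}
		cur = next
		if cur.end {
			best = i
		}
	}
	if best < 0 {
		return ""
	}
	return strings.Join(labels[best:], ".")
}

func main() {
	root := newNode()
	for _, s := range []string{"au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"} {
		root.insert(s)
	}
	fmt.Println(root.longestSuffix("example.nsw.edu.au")) // nsw.edu.au
	fmt.Println(root.longestSuffix("example.edu.au"))     // au
}
```

Wildcard and exception rules from the Public Suffix List are left out of this sketch.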

Acknowledgements

This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs, IPv4 addresses, and IPv6 addresses.
