autolink-java

Java library to extract links such as URLs and email addresses from plain text. It's smart about where a link ends, such as with trailing punctuation.

Introduction

You might think: "Do I need a library for this? I can just write a regex for this!" Let's look at a few cases:

  • In text like https://example.com/. the link should not include the trailing dot
  • https://example.com/, should not include the trailing comma
  • (https://example.com/) should not include the parens

Seems simple enough. But then we also have these cases:

  • https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda) should include the trailing paren
  • https://üñîçøðé.com/ä should also work for Unicode (including Emoji and Punycode)
  • <https://example.com/> should not include angle brackets

This library behaves as you'd expect in the above cases and many more. It parses the input text in one pass with limited backtracking.
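
To make the cases above concrete, here is a minimal sketch of the "just write a regex" approach using only java.util.regex. The class NaiveRegexDemo, its pattern and its helper are illustrative assumptions, not part of this library. The greedy pattern keeps trailing punctuation, and the obvious fix of stripping trailing punctuation then breaks the Wikipedia link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveRegexDemo {
    // Illustrative pattern (an assumption): scheme, "://", then everything
    // up to the next whitespace character.
    private static final Pattern NAIVE_URL = Pattern.compile("https?://\\S+");

    /** Returns every substring the naive pattern considers a link. */
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = NAIVE_URL.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    /** Naive follow-up fix: drop any trailing '.', ',' or ')' characters. */
    static String stripTrailingPunctuation(String link) {
        int end = link.length();
        while (end > 0 && ".,)".indexOf(link.charAt(end - 1)) >= 0) {
            end--;
        }
        return link.substring(0, end);
    }

    public static void main(String[] args) {
        // The match keeps the trailing dot and the closing paren.
        System.out.println(findLinks("See https://example.com/."));
        System.out.println(findLinks("(https://example.com/)"));
        // Stripping fixes those, but also removes the paren that belongs to this link.
        System.out.println(stripTrailingPunctuation(
                "https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda)"));
    }
}
```

Handling both groups of cases needs some awareness of balanced parentheses, which is the kind of heuristic this library provides.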

Thanks to Rinku for the inspiration.

Usage

This library is supported on Java 9 or later. It works on Android (minimum API level 19). It has no external dependencies.

Maven coordinates (the same coordinates apply to other build systems):

<dependency>
    <groupId>org.nibor.autolink</groupId>
    <artifactId>autolink</artifactId>
    <version>0.11.0</version>
</dependency>

Extracting links:

import java.util.EnumSet;

import org.nibor.autolink.*;

String input = "wow, so example: http://test.com";
LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL, LinkType.WWW, LinkType.EMAIL))
        .build();
Iterable<LinkSpan> links = linkExtractor.extractLinks(input);
LinkSpan link = links.iterator().next();
link.getType();        // LinkType.URL
link.getBeginIndex();  // 17
link.getEndIndex();    // 32
input.substring(link.getBeginIndex(), link.getEndIndex());  // "http://test.com"

Note that by default all supported types of links are extracted. If you're only interested in specific types, narrow it down using the linkTypes method.

The above returns all the links. Sometimes what you want to do is go over some input, process the links and keep the surrounding text. For that case, there's an extractSpans method.

Here's an example of using that to transform the text to HTML and wrapping URLs in an <a> tag (escaping is done using owasp-java-encoder):

import java.util.EnumSet;

import org.nibor.autolink.*;
import org.owasp.encoder.Encode;

String input = "wow http://test.com such linked";
LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL)) // limit to URLs
        .build();
Iterable<Span> spans = linkExtractor.extractSpans(input);

StringBuilder sb = new StringBuilder();
for (Span span : spans) {
    String text = input.substring(span.getBeginIndex(), span.getEndIndex());
    if (span instanceof LinkSpan) {
        // span is a URL
        sb.append("<a href=\"");
        sb.append(Encode.forHtmlAttribute(text));
        sb.append("\">");
        sb.append(Encode.forHtml(text));
        sb.append("</a>");
    } else {
        // span is plain text before/after link
        sb.append(Encode.forHtml(text));
    }
}

sb.toString();  // "wow <a href=\"http://test.com\">http://test.com</a> such linked"

Note that this assumes that the input is plain text, not HTML. Also see the "What this is not" section below.
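
The example above only renders LinkType.URL, whose text can be used as the href unchanged. If WWW links or email addresses are rendered too, their text needs a prefix before it works as an href. A minimal sketch follows. The LinkHrefs class and its Kind enum are hypothetical stand-ins for LinkType so the snippet compiles without the library, and http:// as the default scheme for WWW links is an assumption.

```java
final class LinkHrefs {
    /** Hypothetical stand-in mirroring the library's LinkType values. */
    enum Kind { URL, WWW, EMAIL }

    /** Returns a value suitable for an href attribute (still needs HTML escaping). */
    static String hrefFor(Kind kind, String linkText) {
        switch (kind) {
            case WWW:
                // WWW links have no scheme; http:// is an assumed default.
                return "http://" + linkText;
            case EMAIL:
                return "mailto:" + linkText;
            default:
                // URLs already start with scheme://
                return linkText;
        }
    }
}
```

With the real library, the same switch could be written over link.getType() inside the instanceof LinkSpan branch.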

Features

URL extraction

Extracts URLs of the form scheme://example with any potentially valid scheme. URIs such as example:test are not matched (may be added as an option in the future). If only certain schemes should be allowed, the result can be filtered. (Note that schemes can contain dots, so foo.http://example is recognized as a single link.)

Includes heuristics for excluding trailing delimiters such as punctuation and unbalanced parentheses.

Supports internationalized domain names (IDN). Note that they are not validated and as a result, invalid URLs may be matched.

For example inputs and the resulting links, see the cases listed in the Introduction.

Use LinkType.URL for this; the test cases in the repository cover more examples.
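
The scheme filtering mentioned in this section can be applied to the text of each extracted link. A minimal sketch, assuming a hypothetical helper class SchemeFilter (not part of the library) that is given the link text obtained with input.substring(link.getBeginIndex(), link.getEndIndex()):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SchemeFilter {
    private final Set<String> allowedSchemes;

    /** @param allowedSchemes lower-case scheme names without "://", e.g. "https" */
    SchemeFilter(Set<String> allowedSchemes) {
        this.allowedSchemes = new HashSet<>(allowedSchemes);
    }

    /** Returns true if the link text starts with an allowed scheme followed by "://". */
    boolean isAllowed(String linkText) {
        int separator = linkText.indexOf("://");
        if (separator <= 0) {
            return false; // no scheme present, e.g. a WWW link or an email address
        }
        // Schemes are case-insensitive, so compare in lower case.
        String scheme = linkText.substring(0, separator).toLowerCase(Locale.ROOT);
        return allowedSchemes.contains(scheme);
    }
}
```

With new SchemeFilter(Set.of("http", "https")), http://example.com passes, while ftp://example.com and foo.http://example are rejected because their schemes are ftp and foo.http.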

WWW link extraction

Extract links like www.example.com. They need to start with www. but don't need a scheme://. For detecting the end of the link, the same heuristics apply as for URLs.

Not supported:

  • Uppercase www's, e.g. WWW.example.com and wWw.example.com
  • Too many or too few w's, e.g. wwww.example.com

The domain must have at least 3 parts, so www.com is not valid, but www.something.co.uk is.

Use LinkType.WWW for this; the test cases in the repository cover more examples.

Email address extraction

Extracts email addresses such as foo@example.com. Matches international email addresses, but doesn't verify the domain name (may match too much).

Not supported:

  • Quoted local parts, e.g. "this is sparta"@example.com
  • Address literals, e.g. foo@[127.0.0.1]

Note that the domain must have at least one dot (e.g. foo@com isn't matched), unless the emailDomainMustHaveDot option is disabled.
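
As a sketch of the option mentioned above, the builder configuration below turns off the dot requirement. It follows the style of the snippets in the Usage section and assumes that emailDomainMustHaveDot is a builder method taking a boolean; the input string is made up for illustration.

```java
import java.util.EnumSet;

import org.nibor.autolink.*;

LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.EMAIL))
        .emailDomainMustHaveDot(false) // also accept domains without a dot, such as foo@com
        .build();
Iterable<LinkSpan> links = linkExtractor.extractLinks("contact foo@com for details");
```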

Use LinkType.EMAIL for this; the test cases in the repository cover more examples.

What this is not

This library is intentionally not aware of HTML. If it were, it would need to depend on an HTML parser and renderer. Consider this input:

HTML that contains <a href="https://one.example">links</a> but also plain URLs like https://two.example.

If you want to turn the plain links into <a> elements but leave the already linked ones intact, I recommend:

  1. Parse the HTML using an HTML parser library
  2. Walk through the resulting DOM and use autolink-java to find links within text nodes only
  3. Turn those into <a> elements
  4. Render the DOM back to HTML
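
The four steps above can be sketched with the JDK alone. This is a minimal sketch under stated assumptions: the input is well-formed XHTML, so the JDK's XML parser stands in for a real HTML parser, and the class TextNodeLinker with its linkFinder parameter is hypothetical. The finder maps a text node's content to {beginIndex, endIndex} pairs in ascending order.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class TextNodeLinker {

    static String linkify(String xhtml, Function<String, List<int[]>> linkFinder) {
        try {
            // Step 1: parse the markup into a DOM.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Step 2: collect text nodes first, so the tree is not modified while walking it.
            List<Text> textNodes = new ArrayList<>();
            collectTextNodes(doc.getDocumentElement(), textNodes);
            // Step 3: wrap each link found in a text node in a new <a> element.
            for (Text text : textNodes) {
                wrapLinks(doc, text, linkFinder.apply(text.getData()));
            }
            // Step 4: render the DOM back to markup.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process markup", e);
        }
    }

    private static void collectTextNodes(Node node, List<Text> out) {
        if (node instanceof Element && "a".equalsIgnoreCase(node.getNodeName())) {
            return; // already linked, leave its content alone
        }
        if (node instanceof Text) {
            out.add((Text) node);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectTextNodes(child, out);
        }
    }

    private static void wrapLinks(Document doc, Text text, List<int[]> links) {
        String data = text.getData();
        Node parent = text.getParentNode();
        int last = 0;
        for (int[] link : links) {
            String url = data.substring(link[0], link[1]);
            // Plain text before the link, then the link itself as an <a> element.
            parent.insertBefore(doc.createTextNode(data.substring(last, link[0])), text);
            Element anchor = doc.createElement("a");
            anchor.setAttribute("href", url);
            anchor.setTextContent(url);
            parent.insertBefore(anchor, text);
            last = link[1];
        }
        // The original node keeps whatever text follows the last link.
        text.setData(data.substring(last));
    }
}
```

With autolink-java, the finder could be a lambda that calls linkExtractor.extractLinks(text) and collects getBeginIndex() and getEndIndex() of each LinkSpan. For real-world HTML, a proper HTML parser should replace the XML parser used here.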

Contributing

See the CONTRIBUTING.md file.

License

Copyright (c) 2015-2022 Robin Stocker and others, see Git history

MIT licensed, see LICENSE file.
