Ksoup - Kotlin Multiplatform HTML Parser
Ksoup is a lightweight Kotlin Multiplatform library for parsing HTML, extracting HTML tags, attributes, and text, and encoding and decoding HTML entities.
Features
- Parse HTML from String
- Extract HTML tags, attributes, and text
- Encode and decode HTML entities
- Lightweight and does not depend on any other library
- Kotlin Multiplatform support
- Fast and efficient
- Unit tested
Installation
Add the dependency below to your module's build.gradle.kts or build.gradle file:
Kotlin version | Ksoup version |
---|---|
1.9.0 | 0.2.0 |
1.8.22 or lower | 0.1.4 |
val version = "0.2.0"
// For parsing HTML
implementation("com.mohamedrejeb.ksoup:ksoup-html:$version")
// Only for encoding and decoding HTML entities
implementation("com.mohamedrejeb.ksoup:ksoup-entities:$version")
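In a Kotlin Multiplatform module, a dependency like this usually goes into the shared source set. The following is a minimal sketch for build.gradle.kts, assuming Kotlin 1.9.0 with Ksoup 0.2.0 (per the table above) and the conventional commonMain source set name; adapt it to your own project layout:

```kotlin
// build.gradle.kts of the shared module (sketch; the source set name is an assumption)
kotlin {
    sourceSets {
        val commonMain by getting {
            dependencies {
                // HTML parser module, version taken from the compatibility table
                implementation("com.mohamedrejeb.ksoup:ksoup-html:0.2.0")
            }
        }
    }
}
```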
Usage
Parsing HTML
To parse HTML from a String, use the KsoupHtmlParser class. It accepts an implementation of the KsoupHtmlHandler interface and a KsoupHtmlOptions object.
Both are optional; the defaults are used if you omit them.
KsoupHtmlParser
Create a parser with the KsoupHtmlParser() constructor. It exposes several methods, for example write to parse a String and end to close the parser when you are done:
val ksoupHtmlParser = KsoupHtmlParser()
// String to parse
val html = "<h1>My Heading</h1>"
// Pass the HTML to the parser (It is going to parse the HTML and call the callbacks)
ksoupHtmlParser.write(html)
// Close the parser when you are done
ksoupHtmlParser.end()
KsoupHtmlHandler
You can implement the KsoupHtmlHandler interface directly or use KsoupHtmlHandler.Builder():
// Implement `KsoupHtmlHandler` interface
val firstHandler = object : KsoupHtmlHandler {
    override fun onOpenTag(name: String, attributes: Map<String, String>, isImplied: Boolean) {
        println("Open tag: $name")
    }
}
// Use `KsoupHtmlHandler.Builder()`
val secondHandler = KsoupHtmlHandler
    .Builder()
    .onOpenTag { name, attributes, isImplied ->
        println("Open tag: $name")
    }
    .build()
There are several methods that you can override. For example, if you only want to extract the text from the HTML, override the onText method:
// String to parse
val html = """
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>
""".trimIndent()
// String to store the extracted text
var string = ""
// Create a handler
val handler = KsoupHtmlHandler
    .Builder()
    .onText { text ->
        string += text
    }
    .build()
// Create a parser
val ksoupHtmlParser = KsoupHtmlParser(
    handler = handler,
)
// Pass the HTML to the parser (It is going to parse the HTML and call the callbacks)
ksoupHtmlParser.write(html)
// Close the parser when you are done
ksoupHtmlParser.end()
You can also use onOpenTag and onCloseTag to know when a tag is opened or closed, which is useful for scraping data from a website or powering a rich text editor.
Similarly, onComment is called when a comment is found in the HTML, and onAttribute when attributes are found in a tag.
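As an illustration of the scraping use case, here is a minimal sketch that collects the href of every link in a document through onOpenTag. The extractLinks helper is hypothetical (not part of Ksoup), and the import paths are an assumption about the library's package layout:

```kotlin
// Import paths are assumed; adjust them if your Ksoup version differs.
import com.mohamedrejeb.ksoup.html.parser.KsoupHtmlHandler
import com.mohamedrejeb.ksoup.html.parser.KsoupHtmlParser

// Hypothetical helper (not part of Ksoup): returns the href of every <a> tag.
fun extractLinks(html: String): List<String> {
    val links = mutableListOf<String>()

    val handler = KsoupHtmlHandler
        .Builder()
        .onOpenTag { name, attributes, _ ->
            // `attributes` maps attribute names to their values
            if (name == "a") {
                attributes["href"]?.let { links.add(it) }
            }
        }
        .build()

    val parser = KsoupHtmlParser(handler = handler)
    parser.write(html)
    parser.end()

    return links
}

fun main() {
    val html = """<p><a href="https://example.com">Example</a> and <a href="/docs">Docs</a></p>"""
    println(extractLinks(html))
}
```

Collecting into a local list keeps the handler free of global state, so the helper can be called repeatedly with different documents.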
KsoupHtmlOptions
You can also pass KsoupHtmlOptions to the parser to change its behavior. For example, you can disable the decoding of HTML entities, which is enabled by default:
val options = KsoupHtmlOptions(
    decodeEntities = false,
)
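To see the effect of the option, the sketch below extracts text from the same HTML with entity decoding on and off. The extractText helper is hypothetical, and both the import paths and the options parameter name of the KsoupHtmlParser constructor are assumptions about the library's API:

```kotlin
// Import paths are assumed; adjust them if your Ksoup version differs.
import com.mohamedrejeb.ksoup.html.parser.KsoupHtmlHandler
import com.mohamedrejeb.ksoup.html.parser.KsoupHtmlOptions
import com.mohamedrejeb.ksoup.html.parser.KsoupHtmlParser

// Hypothetical helper (not part of Ksoup): concatenates all text found in the HTML.
fun extractText(html: String, options: KsoupHtmlOptions): String {
    val text = StringBuilder()

    val handler = KsoupHtmlHandler
        .Builder()
        .onText { text.append(it) }
        .build()

    // The `options` parameter name is an assumption.
    val parser = KsoupHtmlParser(handler = handler, options = options)
    parser.write(html)
    parser.end()

    return text.toString()
}

fun main() {
    val html = "<p>Fish &amp; chips</p>"
    // Entities left untouched
    println(extractText(html, KsoupHtmlOptions(decodeEntities = false)))
    // Entities decoded (the default behavior)
    println(extractText(html, KsoupHtmlOptions(decodeEntities = true)))
}
```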
Encoding and Decoding HTML Entities
You can use the KsoupEntities class to encode and decode HTML entities:
// Encode HTML entities
val encoded = KsoupEntities.encodeHtml("Hello & World") // return: Hello &amp; World
// Decode HTML entities
val decoded = KsoupEntities.decodeHtml("Hello &amp; World") // return: Hello & World
KsoupEntities also provides methods to encode and decode only XML entities or only HTML4 entities.
The KsoupEntities class is available in the ksoup-entities module.
Both the encodeHtml and decodeHtml methods support all HTML5, XML, and HTML4 entities.
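Because encoding and decoding are inverse operations, a round trip should give back the original text. The sketch below checks this with encodeHtml and decodeHtml; the roundTrip helper is hypothetical and the import path is an assumption about the library's package layout:

```kotlin
// Import path is assumed; adjust it if your Ksoup version differs.
import com.mohamedrejeb.ksoup.entities.KsoupEntities

// Hypothetical helper (not part of Ksoup): encodes a string, then decodes it again.
fun roundTrip(input: String): String {
    val encoded = KsoupEntities.encodeHtml(input)
    return KsoupEntities.decodeHtml(encoded)
}

fun main() {
    val original = "<b>Tom & Jerry</b>"
    // Encoded form, safe to embed in HTML as text
    println(KsoupEntities.encodeHtml(original))
    // Whether decoding the encoded form restores the original string
    println(roundTrip(original) == original)
}
```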
Coming Features
- Add clear documentation
- Add Markdown parser
Contribution
If you've found an error in this library, please file an issue.
Feel free to help out by sending a pull request ❤️.
Find this library useful? Support it by joining stargazers for this repository. ⭐
Also, follow me on GitHub for more libraries!
License
Copyright 2023 Mohamed Rejeb
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.