String Parsing with Soulver Core
A declarative & type-safe approach to parsing data from strings
SoulverCore gives you human-friendly, type-safe & performant data parsing from Swift strings.
Specify types you want to parse from a string. If they are present, you get back ready-to-use data primitives (not strings!).
This approach to data parsing allows you to ignore:
- The specifics of how the data you need is formatted in text
- Random words (or other data points) surrounding the data you need
Examples
Let's look at a few examples:
let (testCount, failureCount, timeTaken) = "Executed 4 tests, with 1 failure in 0.009 seconds".find(.number, .number, .time)!
testCount // 4
failureCount // 1
timeTaken // 0.009 seconds
let (date, temperature, humidity) = "On August 23, 2022 the temperature in Chicago was 68.3 ºF (with a humidity of 74%)".find(.date, .temperature, .percentage)!
date // August 23, 2022
temperature // 68.3 ºF
humidity // 74%
let (earnings, fileSize, url) = "Total Earnings From PDF: $12.2k (3.25 MB, at https://lifeadvice.co.uk/pdfs/download?id=guide)".find(.currency, .fileSize, .url)!
earnings // 12,200 USD
fileSize // 3.25 MB
url // https://lifeadvice.co.uk/pdfs/download?id=guide
Note: the returned data points are not strings. They are native Swift data types (available as elements on a tuple), on which you can immediately perform operations:
let numbers = "100 + 20".find(.number, .number)!
let sum = numbers.0 + numbers.1 // 120
Up to 6 data points can be requested in a single call. Variadic generics are planned for Swift 6, so we'll support more in the future.
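The limit on the number of data points is typical of APIs that return typed tuples without variadic generics: each arity needs its own overload. Here is a minimal sketch of that pattern in plain Swift. It is not SoulverCore's implementation; `Extractor`, `scan`, `integer`, `boolean` and this toy `find` are hypothetical names made up for illustration, and the tokenizer simply splits on spaces and commas.

```swift
// Hypothetical stand-in for a typed data point: wraps a function that tries to
// turn one token into a value of type `Value`. Not SoulverCore API.
struct Extractor<Value> {
    let parse: (Substring) -> Value?
}

// Two illustrative extractors.
let integer = Extractor<Int>(parse: { Int($0) })
let boolean = Extractor<Bool>(parse: { Bool(String($0)) })

extension String {
    /// Splits on spaces and commas, then returns the first token at or after
    /// `start` that the extractor accepts, plus the index just past that token.
    func scan<A>(_ extractor: Extractor<A>, from start: Int = 0) -> (value: A, next: Int)? {
        let tokens = split(whereSeparator: { $0 == " " || $0 == "," })
        guard start <= tokens.count else { return nil }
        for index in start..<tokens.count {
            if let value = extractor.parse(tokens[index]) {
                return (value, index + 1)
            }
        }
        return nil
    }

    // One overload per arity: the tuple's element types are fixed at compile time.
    func find<A>(_ a: Extractor<A>) -> A? {
        scan(a)?.value
    }

    func find<A, B>(_ a: Extractor<A>, _ b: Extractor<B>) -> (A, B)? {
        guard let first = scan(a), let second = scan(b, from: first.next) else { return nil }
        return (first.value, second.value)
    }
}

let result = "Executed 4 tests, passed true".find(integer, boolean)
// result is an optional (Int, Bool) tuple, so no string conversion is needed afterwards
```

Supporting a third or fourth data point in this style means writing a third or fourth overload by hand, which is why such APIs usually stop at a fixed maximum.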
The beauty of higher-order data extraction
Notice the higher-order concepts at work here: numbers come in many formats (1,000, 30k, .456), yet a simple ".number" query "matches" them all. Likewise, .date "matches" dates in commonly used date formats.
For cases where the locale plays a role in the format of data, you may specify a locale in the find method (otherwise the current system Locale is used):
let europeanNumber = "€1.333,24".find(.currency, locale: Locale(identifier: "en_DE"))
let americanDate = "05/30/21".find(.date, locale: Locale(identifier: "en_US")) // month/day/year
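To see why the locale matters, here is a Foundation-only sketch (it does not use SoulverCore) in which the same kind of digit string is read under two locales. `parseDecimal` is a hypothetical helper written for this illustration.

```swift
import Foundation

// Foundation-only illustration of locale-dependent number formats.
// `parseDecimal` is a made-up helper, not part of SoulverCore.
func parseDecimal(_ text: String, localeIdentifier: String) -> Double? {
    let formatter = NumberFormatter()
    formatter.numberStyle = .decimal
    formatter.locale = Locale(identifier: localeIdentifier)
    return formatter.number(from: text)?.doubleValue
}

// German formatting: "." groups thousands and "," marks the decimal point.
let german = parseDecimal("1.333,24", localeIdentifier: "de_DE")

// US formatting uses the two separators the other way round.
let american = parseDecimal("1,333.24", localeIdentifier: "en_US")
```

Both calls should yield the same numeric value, even though the two input strings use the separators in opposite roles.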
Where possible, standard Swift primitives are returned (URL, Date, Decimal, etc). In cases where no Swift primitive wholly captures the data present in the string, a SoulverCore value type is returned with properties containing the relevant data.
Supported data types
Symbol | Match Examples | Return Type |
---|---|---|
.number | 123.45, 10k, -.3, 3,000, 50_000 | Decimal |
.binaryNumber | 0b1011010 | UInt |
.hexNumber | 0x31FE28 | UInt |
.boolean | 'true' or 'false' | Bool |
.percentage | 10%, 230.99% | Decimal |
.date | March 12, 2004, 21/04/77, July the 4th, etc | Date |
.unixTimestamp | 1661259854 | TimeInterval |
.place | Paris, Tokyo, Bali, Israel | SoulverCore.Place |
.airport | SFO, LAX, SYD | SoulverCore.Place |
.timezone | AEST, GMT, EST | SoulverCore.Place |
.currencyCode | USD, EUR, DOGE | String |
.currency | $10.00, AU$30k, 350 JPY | SoulverCore.UnitExpression |
.time | 10 s, 3 min, 4 weeks | SoulverCore.UnitExpression |
.distance | 10 km, 3 miles, 4 cm | SoulverCore.UnitExpression |
.temperature | 25 °C, 77 °F, 10C, 5 F | SoulverCore.UnitExpression |
.weight | 10kg, 45 lb | SoulverCore.UnitExpression |
.area | 30 m2, 40 in2 | SoulverCore.UnitExpression |
.speed | 30 mph | SoulverCore.UnitExpression |
.volume | 3 litres, 4 cups, 10 fl oz | SoulverCore.UnitExpression |
.timespan | 3 hours 12 minutes | SoulverCore.Timespan |
.laptime | 01:30:22.490 (hh:mm:ss.ms) | SoulverCore.Laptime |
.timecode | 03:10:21:16 (hh:mm:ss:frames) | SoulverCore.Frametime |
.pitch | A4, Bb7, C#9 | SoulverCore.Pitch |
.url | https://soulver.app | URL |
.emailAddress | name@example.com | String |
.hashTag | #this_is_a_tag | String |
.whitespace | All whitespace characters (including tabs) are collapsed into a single whitespace token | String |
Getting started
- The SoulverCore framework includes a highly optimized string parser, which can produce an array of tokens representing data types in a given string. This is exactly what we need.
- Add the SoulverCore binary framework to your project. The package is located at https://github.com/soulverteam/SoulverCore (In Xcode, go File > Add Packages…)
- Be sure to "import SoulverCore" at the top of any Swift files in which you wish to process strings
Finding data in strings
As we saw above, finding a data point in a string is as simple as asking for it:
let percent = "Results of likeness test: 83% match".find(.percentage)
// percent is the decimal 0.83
Extracting multiple data points is no harder. A tuple is returned with the correct number of arguments and data types:
let payrollEntry = "CREDIT 03/02/2022 Payroll from employer $200.23" // this string has inconsistent whitespace between entities, but this isn't a problem for us
let (date, currency) = payrollEntry.find(.date, .currency)!
date // Either February 3, or March 2, depending on your system locale
currency // UnitExpression object (use .value to get the decimalValue, and .unit.identifier to get the currency code - USD)
Extracting a data point from an array of strings
We can also call find with a single data type on an array of strings, and get back an array containing the matched value from each string:
let amounts = ["Zac spent $50", "Molly spent US$81.9 (with her 10% discount)", "Jude spent $43.90 USD"].find(.currency)
let totalAmount = amounts.reduce(0.0) {
$0 + $1.value
}
// totalAmount is $175.80
Transforming data in strings
Imagine we wanted to standardize the whitespace in the string from the previous example:
let standardized = "CREDIT 03/02/2022 Payroll from employer $200.23".replacingAll(.whitespace) { whitespace in
return " "
}
// standardized is "CREDIT 03/02/2022 Payroll from employer $200.23"
Or perhaps you want to convert European formatted numbers into Swift "standard" ones:
let standardized = "10.330,99 8.330,22 330,99".replacingAll(.number, locale: Locale(identifier: "en_DE")) { number in
return NumberFormatter.localizedString(from: number as NSNumber, number: .decimal)
}
// standardized is "10,330.99 8,330.22 330.99"
Or perhaps you want to convert Celsius temperatures into Fahrenheit:
let convertedTemperatures = ["25 °C", "12.5 degrees celsius", "-22.6 C"].replacingAll(.temperature) { celsius in
let measurementC: Measurement<UnitTemperature> = Measurement(value: celsius.value.doubleValue, unit: .celsius)
let measurementF = measurementC.converted(to: .fahrenheit)
let formatter = MeasurementFormatter()
formatter.unitOptions = .providedUnit
return formatter.string(from: measurementF)
}
// convertedTemperatures is ["77°F", "54.5°F", "-8.68°F"]
Extending SoulverCore with your own custom types
Let's imagine we had strings with the following format, describing some containers:
- "Color: blue, size: medium, volume: 12.5 cm3"
- "Color: red, size: small, volume: 6.2 cm3"
- "Color: yellow, size: large, volume: 17.82 cm3"
We want to extract this data into a custom Swift type that represents a Container.
- Define our model types (if they don't exist already)
enum Color: String, RawRepresentable {
case blue
case red
case yellow
}
enum Size: String, RawRepresentable {
case small
case medium
case large
}
struct Container {
let color: Color
let size: Size
let volume: Decimal
init(_ data: (Color, Size, UnitExpression)) {
self.color = data.0
self.size = data.1
self.volume = data.2.value
}
}
- Then create parsers for Color and Size, and add them as static variables on DataPoint
struct ColorParser: DataFromTokenParser {
typealias DataType = Color
func parseDataFrom(token: SoulverCore.Token) -> Color? {
return Color(rawValue: token.stringValue.lowercased())
}
}
struct SizeParser: DataFromTokenParser {
typealias DataType = Size
func parseDataFrom(token: SoulverCore.Token) -> Size? {
return Size(rawValue: token.stringValue.lowercased())
}
}
extension DataPoint {
static var color: DataPoint<ColorParser> {
return DataPoint<ColorParser>(parser: ColorParser())
}
static var size: DataPoint<SizeParser> {
return DataPoint<SizeParser>(parser: SizeParser())
}
}
- That's all the setup. You can now parse the data from the string, and populate your model objects:
let container1 = Container("Color: blue, size: medium, volume: 12.5 cm3".find(.color, .size, .volume)!)
let container2 = Container("Color: red, size: small, volume: 6.2 cm3".find(.color, .size, .volume)!)
let container3 = Container("Color: yellow, size: large, volume: 17.82 cm3".find(.color, .size, .volume)!)
Using SoulverCore as a parser inside Swift Regex Builder (coming in 5.7)
You will be able to use SoulverCore to parse data inside the Swift regex builder DSL coming in 5.7. This is often easier than figuring out how to match the format of your data with a regular expression.
if #available(macOS 13.0, iOS 16.0, *) {
let input = "Cost: 365.45, Date: March 12, 2022"
let regex = Regex {
"Cost: "
Capture {
DataPoint<NumberFromTokenParser>.number
}
", Date: "
Capture {
DataPoint<DateFromTokenParser>.date
}
}
let match = input.wholeMatch(of: regex)?.1 // 365.45 (wholeMatch returns an optional, so use optional chaining)
}
Note: it's confusing and unfortunate that the Swift compiler can't seem to infer the DataPoint generic parameter from a static variable on DataPoint (anyone know why?).
Until this is fixed, you must explicitly specify the DataFromTokenParser corresponding to the type of the data you want to match.
Performance
SoulverCore is unlikely to be your app's bottleneck.
In our testing, SoulverCore does ~6k operations/second on Intel and 10k+ operations/second on Apple Silicon.
While this is admittedly not as fast as regex, in fairness, SoulverCore is doing a lot more work. Before your query is checked for matches, SoulverCore parses the complete string into tokens representing various data types, of which it can identify more than 20 (including dates, numbers & units in various formats, places, timezones and more…).
A regex that did this would be impossible to construct, and even if such a regex were possible, it would run much more slowly than SoulverCore does.
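If you want to check throughput on your own hardware, a small timing harness along these lines may help. This is a generic sketch, not the benchmark behind the figures above; `operationsPerSecond` is a hypothetical helper, and the workload shown is a placeholder to be replaced with the parsing call you care about.

```swift
import Foundation

/// Runs `work` `iterations` times and returns the average number of calls
/// completed per second, measured with a monotonic clock.
func operationsPerSecond(iterations: Int, _ work: () -> Void) -> Double {
    precondition(iterations > 0, "iterations must be positive")
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations {
        work()
    }
    // Clamp to 1 ns so a very fast workload cannot cause a division by zero.
    let elapsedNanoseconds = max(DispatchTime.now().uptimeNanoseconds - start, 1)
    return Double(iterations) / (Double(elapsedNanoseconds) / 1_000_000_000)
}

// Placeholder workload: count the digits in a string.
// Substitute the call you actually want to measure.
var digitsSeen = 0
let rate = operationsPerSecond(iterations: 10_000) {
    digitsSeen += "user logged (3 attempts)".filter(\.isNumber).count
}
print("about \(Int(rate)) operations/second")
```

Measure in a release build, since debug builds of Swift code are often much slower.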
Comparison with other data parsing approaches
Apple's toolkit for string parsing includes Regex, NSScanner & NSDataDetector. Let's compare and contrast each of these with SoulverCore.
Regular Expressions
Regular expressions will always be with us, but ask yourself, do you really want to use them for data processing?
They're non-trivial to understand at a glance, and constructing a correct regex to match data is, at the minimum, tedious (if not mentally quite challenging sometimes).
Regex only "sees" sets of characters/numbers/whitespace so it forces you to think about the string format of the data you want to parse, and also often about how to skip past other strings leading up to it.
So even with the significant enhancements to regex in Swift 5.7 (type-safe tuple matches & the regex builder syntax), regex makes you think about data parsing at the wrong level of abstraction (i.e. characters, rather than data types).
If Swift is to achieve its goal of becoming the world's greatest string & data processing language, it needs something more human friendly at the level of abstraction of data, not character sets.
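As a small illustration of the point about string formats, here is a Foundation-only sketch in which a pattern written for one way of writing a dollar amount silently misses another. The pattern and the `containsDollarAmount` helper are made up for this example.

```swift
import Foundation

// A pattern written with one specific format in mind: "$" followed by digits,
// optionally followed by two decimal places.
let dollarPrefixed = try! NSRegularExpression(pattern: #"\$[0-9]+(\.[0-9]{2})?"#, options: [])

// Hypothetical helper: true if the pattern matches anywhere in `text`.
func containsDollarAmount(_ text: String) -> Bool {
    let range = NSRange(text.startIndex..., in: text)
    return dollarPrefixed.firstMatch(in: text, options: [], range: range) != nil
}

let prefixed = containsDollarAmount("Payroll from employer $200.23")   // true
let suffixed = containsDollarAmount("Payroll from employer 200.23 USD") // false: same data, different format
```

Each additional format (a currency code suffix, thousands separators, a "k" abbreviation) needs another alternative in the pattern.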
NSScanner
A scanner is an imperative (rather than declarative) approach to parsing data out of strings. You move a scanner through a string step-by-step, scanning out the components that you want.
One benefit of NSScanner is that it's able to ignore parts of strings you don't care about. However, a scanner still only knows about numbers and strings, not higher-level data types.
Here is a StackOverflow post that illustrates the use of NSScanner to scan the integer from the string "user logged (3 attempts)".
NSString *logString = @"user logged (3 attempts)";
NSString *numberString;
NSScanner *scanner = [NSScanner scannerWithString:logString];
[scanner scanUpToCharactersFromSet:[NSCharacterSet decimalDigitCharacterSet] intoString:nil];
[scanner scanCharactersFromSet:[NSCharacterSet decimalDigitCharacterSet] intoString:&numberString];
NSLog(@"Attempts: %i", [numberString intValue]); // 3
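For comparison in the same language as the other snippets, the same scan can be sketched in Swift with Foundation's Scanner. This assumes the newer Scanner methods that return optionals (macOS 10.15 / iOS 13 and later).

```swift
import Foundation

let scanner = Scanner(string: "user logged (3 attempts)")

// Step 1: skip everything up to the first decimal digit.
_ = scanner.scanUpToCharacters(from: .decimalDigits)

// Step 2: read the integer that starts at the scanner's current position.
let attempts = scanner.scanInt() // Optional(3)
```

It is shorter than the Objective-C version, but still imperative: you move the scanner to the right place before asking it for a value.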
Regex (in Swift 5.7+) is somewhat more concise:
if #available(macOS 13.0, iOS 16.0, *) {
let match = "user logged (3 attempts)".firstMatch(of: /([+-]?[0-9]+)/) // optional sign, then one or more digits
let numberSubstring = match!.0
let number = Int(numberSubstring)
}
And now SoulverCore:
let number = "user logged (3 attempts)".find(.number)
NSDataDetector
NSDataDetector is an NSRegularExpression subclass that is able to scan a string for dates, URLs, phone numbers, addresses, and flight details. It's a great class, and supports many different formats. Additionally, it returns proper data types from strings, like URL and Date (much like SoulverCore).
Compare:
NSDataDetector
let input = "Learn more at https://fascinatingcaptian.com today."
let detector = try! NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue)
let url = detector.firstMatch(in: input, options: [], range: NSRange(location: 0, length: input.utf16.count))!.url!
SoulverCore
let url = "Learn more at https://fascinatingcaptian.com today".find(.url)
NSDataDetector's downsides are that the API is not particularly "Swifty", the supported data types are limited, and it's not part of the platform-independent implementation of Foundation (so you can't use it on Linux, Windows, etc).
Licence
SoulverCore is a commercially licensable, closed-source Swift framework. The standard licensing terms of SoulverCore do apply for its use in string processing (see SoulverCore Licence).
For personal (non-commercial) projects, you do not need a licence. So go ahead and use this great library in your personal projects!
There are also attribution-only licences available for a few commercial use cases.