Myers diff algorithm in Scala.
"app.tulz" %%% "stringdiff" % "0.3.4"
The core algorithm is a Scala translation of the Python implementation described here: https://blog.robertelder.org/diff-algorithm/
Additionally, the following is implemented on top of it:
- interpretation of the algorithm's output
- customizable formatting of the interpreted result into a string (with a few provided out of the box formatters)
- additional transformations for when the input is expected to be a sequence of tokens (
TokenDiff
)
The core algorithm outputs a set of instructions to get from SeqA
to SeqB
, something like this:
in order to get from s1="bcdefgzio" to s2="abcxyfgi":
Insert a from s2 before position 0 into s1.
Delete d from s1 at position 2 in s1.
Insert x from s2 before position 3 into s1.
Insert y from s2 before position 3 into s1.
Delete e from s1 at position 3 in s1.
Delete z from s1 at position 6 in s1.
Delete o from s1 at position 8 in s1.
It's hard to work with it. Also, it is somewhat tricky at first to understand how to follow these instructions (try it! :) ).
MyersInterpret
object parses the raw output into a List[DiffElement]
:
trait DiffElement[Repr]
object DiffElement {
final case class InBoth[Repr](v: Repr) extends DiffElement[Repr]
final case class InFirst[Repr](v: Repr) extends DiffElement[Repr]
final case class InSecond[Repr](v: Repr) extends DiffElement[Repr]
final case class Diff[Repr](x: Repr, y: Repr) extends DiffElement[Repr]
}
The result of the interpretation (without any transformations applied) for the same example:
StringDiff
.raw(
"bcdefgzio",
"abcxyfgi",
collapse = false
).mkString("[\n ", "\n ", "\n]")
[
InSecond(a)
InBoth(bc)
InFirst(d)
InSecond(x)
InSecond(y)
InFirst(e)
InBoth(fg)
InFirst(z)
InBoth(i)
InFirst(o)
]
By default, diff functions will collapse the diff:
StringDiff
.raw(
"bcdefgzio",
"abcxyfgi",
collapse = true // default is true
).mkString("[\n ", "\n ", "\n]")
[
InSecond(a)
InBoth(bc)
Diff(de,xy)
InBoth(fg)
InFirst(z)
InBoth(i)
InFirst(o)
]
Here, the following list of DiffElement
s:
[
InFirst(d)
InSecond(x)
InSecond(y)
InFirst(e)
]
got collapsed into a single one:
[
Diff(de,xy)
]
In a nutshell, collapsing removes empty elements and joins same or otherwise "join-able" subsequent DiffElement
s.
Examples:
- any
InFirst
,InLast
,Diff
orInBoth
gets removed if the element is empty Diff
becomesInFirst
orInSecond
if one the elements is emptyInFirst(a)+InFirst(b) -> InFirst(ab)
InFirst(a)+InSecond(b) -> Diff(a,b)
InSecond(a)+InDiff(b,c) -> Diff(ab,c)
- etc
SeqDiff
.seq(
Seq(1, 2, 3, 4, 5).toIndexedSeq,
Seq(1, 2, 8, 3, 8, 4, 5, 0).toIndexedSeq
).mkString("[\n ", "\n ", "\n]")
[
InBoth(Vector(1, 2))
InSecond(Vector(8))
InBoth(Vector(3))
InSecond(Vector(8))
InBoth(Vector(4, 5))
InSecond(Vector(0))
]
println(
StringDiff.diff(
"bcdefgzio",
"abcxyfgi"
)
)
[
InSecond(a)
InBoth(bc)
Diff(de,xy)
InBoth(fg)
InFirst(z)
InBoth(i)
InFirst(o)
]
println(
StringDiff.text(
"bcdefgzio",
"abcxyfgi"
)
)
[∅|a]]bc][de|xy]]fg][z|∅]]i][o|∅]
println(
StringDiff.ansi(
"bcdefgzio",
"abcxyfgi"
)
)
(default formatters highlight missing text with yellow, extraneous text — with red, and matching text is underlined)
When the inputs are strings that are expected to contain whitespace-separated tokens, TokenDiff
will try to make the diff more comprehensible in terms of tokens (while preserving the accuracy).
println(
TokenDiff.ansi(
"match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
"prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
)
)
With a StringDiff
the output would look like the following:
println(
StringDiff.ansi(
"match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
"prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
)
)
println(
TokenDiff.ansiBoth(
"match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
"prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
)
)
SeqDiff.seq(
Seq(1, 2, 3),
Seq(2, 3, 4)
)
StringDiff.ansi("abc", "acb")
// OR
StringDiff("abc", "acb")
StringDiff.ansiBoth("abc", "acb")
StringDiff.text("abc", "acb")
StringDiff.diff("abc", "acb")
StringDiff.raw("abc", "acb")
TokenDiff.ansi("abc", "acb")
// OR
TokenDiff("abc", "acb")
TokenDiff.ansiBoth("abc", "acb")
TokenDiff.text("abc", "acb")
TokenDiff.diff("abc", "acb")
TokenDiff.raw("abc", "acb")
Formatters are instances of the DiffFormat[Out]
trait:
trait DiffFormat[Out] {
def apply(diff: List[DiffElement[String]]): Out
}
object MyFormat extends DiffFormat[MyDiffOutput] { ... }
val diff: MyDiffOutput = MyFormat(StringDiff("abc", "acb"))
For example, the TextDiffFormat
is implemented like this:
object TextDiffFormat extends DiffFormat[String] {
import DiffElement._
def apply(diff: List[DiffElement[String]]): String = {
val sb = new StringBuilder
diff.foreach {
case InBoth(both) =>
sb.append("]")
sb.appendAll(both)
sb.append("]")
case InSecond(second) =>
sb.append("[∅|")
sb.appendAll(second)
sb.append("]")
case InFirst(first) =>
sb.append("[")
sb.appendAll(first)
sb.append("|∅]")
case Diff(first, second) =>
sb.append("[")
sb.appendAll(first)
sb.append("|")
sb.appendAll(second)
sb.append("]")
case _ =>
}
sb.toString()
}
Iurii Malchenko – @yurique
stringdiff
is provided under the MIT license.