clj-diff
Provides diff
and patch
functions for Clojure sequences where (diff a b) -> x
and (patch a x) -> b
. Also provides edit-distance
and levenshtein-distance
functions for calculating the difference between two sequences.
Usage
user=> (use 'clj-diff.core)
user=> (diff "John went to the movies." "John was in the movies.")
{:+ [[5 \a \s \space \i]], :- [6 8 9 10 11]}
user=> (patch "John went to the movies." *1)
"John was in the movies."
user=> (edit-distance "John went to the movies." "John was in the movies."))
9
user=> (levenshtein-distance "John went to the movies." "John was in the movies.")
8
There is already a Java library which does this well. Why create a Clojure version? So that we can do this:
user=> (def a [{:a 1} {:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7}])
user=> (def b [{:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7} {:a 1}])
user=> (diff a b)
{:+ [[6 {:a 1}]], :- [0]}
user=> (patch a *1)
({:a 2} {:a 3} {:a 4} {:a 5} {:a 6} {:a 7} {:a 1})
user=> (edit-distance a b)
2
Notes
The current diff algorithm comes from the paper An O(NP) Sequence Comparison Algorithm by Sun Wu, Udi Manber, Gene Myers and Webb Miller. It is fast and memory efficient. It also makes use of the pre-diff optimizations mentioned in Neil Fraser’s Diff Strategies. The worst-case running time of the algorithm is dependent only on the length of the longest sequence (N) and the number of deletions (P). This is much better than the Myers algorithm which is an O(ND) algorithm where N is the sum of the length of the two sequences and D is the edit distance.
I have created a separate project for comparing the performance of different diff algorithms. The main performance goal of this project is to create a Clojure diff that can outperform Fraser’s Java implementation. One of the most interesting results is show below.
This chart shows the most interesting range of comparison between the three algorithms. The current algorithm being used by clj-diff is labeled “Miller”. For larger sequences, the Miller algorithm outperforms Fraser. To summarize all of the performance test results; the Miller algorithm is never more than 50 milliseconds slower than the Fraser algorithm but the Fraser algorithm can be multiple seconds slower, and will run out of memory more quickly, for large sequences. For more information about performance see the clj-diff-performance project.
Installation
Leiningen
Add [clj-diff "1.0.0-SNAPSHOT"]
to your :dependencies in project.clj.
Maven
Add the following dependency:
<dependency>
<groupId>clj-diff</groupId>
<artifactId>clj-diff</artifactId>
<version>1.0.0-SNAPSHOT</version>
</dependency>
…which comes from Clojars…
<repository>
<id>clojars.org</id>
<url>http://clojars.org/repo</url>
</repository>
References
- An O(ND) Difference Algorithm and Its Variations by Eugene W. Myers
- An O(NP) Sequence Comparison Algorithm by Sun Wu, Udi Manber, Gene Myers and Webb Miller
In order to understand the above two titles, you will need to know what N, D and P stand for. Let A and B be two sequences with lengths Q and M where Q >= M. Let D be the length of the minimum edit script for A and B. In the title of the first paper N = Q + M. In the second paper N = Q and P = (1/2)D – (1/2)(Q – M). Therefore, the O(NP) version is faster which is visualized in the charts below.
Roadmap
Version 1.0
- Text diff visualization.
- HTML diff visualization.
- Semantic cleanup.
- Sequence chunking. Allow line based diff for strings.
Version 2.0
- Improve performance by searching the edit graph from both ends at the same time.
- Arbitrary Clojure form diff/Nested diffs. Integrate with clojure.data/diff in Clojure 1.3
- Set the maximum time to look for an optimal diff.
- Add a fast and correct levenshtein-distance function.
License
Copyright © 2010-2011 Brenton Ashworth
Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.