xdiff: vendor xdiff library from git

Authored by ryanmce on Mar 2 2018, 7:25 PM.



Vendor git's xdiff library from git commit
d7c6c2369d7c6c2369ac21141b7c6cceaebc6414ec3da14ad using GPL2+ license.

There is another recent user report that hg diff generates suboptimal
result. It seems the fix to issue4074 isn't good enough. I crafted some
other interesting cases, and hg diff barely has any advantage compared with
gnu diffutils or git diff.

testcasegnu diffutilshg diffgit diff
lines timelines timelines time
patience6 0.00602 0.086 0.00
random91772 0.90109462 0.7091772 0.24
json2 0.031264814 1.812 0.29

"lines" means the size of the output, i.e. the count of "+/-" lines. "time"
means seconds needed to do the calculation. Both are the smaller the better.
"hg diff" counts Python startup overhead.

Git and GNU diffutils generate optimal results. For the "json" case, git can
have an optimization that does a scan for common prefix and suffix first,
and match them if the length is greater than half of the text. See
https://neil.fraser.name/news/2006/03/12/. That would make git the fastest
for all above cases.

About testcases:

Aiming for the weakness of the greedy "patience diff" algorithm. Using
git's patience diff option would also get suboptimal result. Generated using
the Python script:

open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))

Generated using the script in test-issue4074.t. It practically makes the
algorithm suffer. Impressively, git wins in both performance and diff

The recent user reported case. It's a single line movement near the end of a
very large (800K lines) JSON file.

Test Plan

Code taken as-is.

Diff Detail

rHG Mercurial
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.
ryanmce created this revision.Mar 2 2018, 7:25 PM
quark edited the summary of this revision. (Show Details)Mar 2 2018, 8:57 PM
ryanmce added a subscriber: quark.Mar 3 2018, 9:22 AM

To be clear, this patch series was created by @quark, but phabricator did not keep the author information from the patches.

I've ported it here to generate discussion about this as a path forward.

quark updated this revision to Diff 6455.Mar 3 2018, 3:07 PM
quark edited the summary of this revision. (Show Details)
quark edited the test plan for this revision. (Show Details)
ryanmce planned changes to this revision.Mar 3 2018, 3:25 PM

This existing directory is called thirdparty not third-party, unfortunately.

quark updated this revision to Diff 6472.Mar 3 2018, 3:30 PM
indygreg accepted this revision as: indygreg.Mar 3 2018, 3:48 PM
indygreg added a subscriber: indygreg.

I'm OK taking this provided we set up fuzzers in contrib/fuzz. We can do that post landing though.

indygreg accepted this revision.Mar 3 2018, 4:21 PM
This revision is now accepted and ready to land.Mar 3 2018, 4:21 PM
durin42 accepted this revision.Mar 3 2018, 6:11 PM
This revision was automatically updated to reflect the committed changes.