xdiff: remove patience and histogram diff algorithms
ClosedPublic

Authored by ryanmce on Mar 2 2018, 7:25 PM.

Details

Summary

Patience diff is the normal diff algorithm, plus some greediness that
unconditionally matches common common unique lines. That means it is easy to
construct cases to let it generate suboptimal result, like:

open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))

Patience diff has been advertised as being able to generate better results for
some C code changes. However, the more scientific way to do that is the
indention heuristic [1].

Since patience diff could generate suboptimal result more easily and its
"better" diff feature could be replaced by the new indention heuristic, let's
just remove it and its variant histogram diff to simplify the code.

[1]: https://github.com/git/git/commit/433860f3d0beb0c6f205290bd16cda413148f098

Test Plan

gcc -fPIC *.c --shared -o xdiff.so still builds.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
ryanmce created this revision.Mar 2 2018, 7:25 PM
quark updated this revision to Diff 6456.Mar 3 2018, 3:07 PM
quark edited the summary of this revision. (Show Details)
quark retitled this revision from xdiff: remove patience and histogram diffs to xdiff: remove patience and histogram diff algorithms.
quark updated this revision to Diff 6473.Mar 3 2018, 3:30 PM
indygreg accepted this revision as: indygreg.Mar 3 2018, 3:50 PM
indygreg added a subscriber: indygreg.

I'm leaning towards keeping these algorithms so we can expose them as alternate implementations in the future. It will also making syncing code from upstream easier. But removing the unused-by-us code is fine.

quark added a subscriber: quark.Mar 3 2018, 4:21 PM

Googling "patience diff", most results will say it is slower but has better diff quality sometimes.

That is very misleading - the algorithm adds greediness (i.e. incorrectness) and is in theory faster in some cases. The "better" quality is also untrue comparing to indent heuristic.

Technically, I don't think it's worthwhile to keep - definitely worse quality and unpredictable performance changes comparing to default diff + indent heuristic setup.

From a non-technical point of view, I think educating people that they cannot be misled is very important. This one alone is an enough reason to do the cleanup, imo.

durin42 accepted this revision.Mar 3 2018, 6:11 PM
This revision is now accepted and ready to land.Mar 3 2018, 6:11 PM
This revision was automatically updated to reflect the committed changes.