( )⚙ D3212 patch: implement a new worddiff algorithm

This is an archive of the discontinued Mercurial Phabricator instance.

patch: implement a new worddiff algorithm
ClosedPublic

Authored by quark on Apr 9 2018, 6:59 PM.

Details

Summary

The previous worddiff algorithm has many problems. The major problem is it
does a "similarity check" that selects a subset of matched lines to do
inline diffs. It is a bad idea because:

  • The "similarity check" is non-obvious to users. For example, a simple change from "long long x" to "int64_t x" will fail the similarity check and won't be diff-ed as expected.
  • Selecting "lines" to diff won't work as people expect if there are line wrapping changes.
  • It has a sad time complexity if lines do not match, could be O(N^2)-ish.

There are other problems in implementation details.

  • Lines can match across distant hunks (if the next hunk does not have "-" lines).
  • "difflib" is slow.

The solution would be removing the "similarity check", and just diff all
words in a same hunk. So no content will be missed and everything will be
diff-ed as expected. This is similar to what code review tool like
Phabricator does.

This diff implements the word diff algorithm as described above. It also
avoids difflib to be faster.

Note about colors: To be consistent, "changed inserted" parts and "purely
insertion blocks" should have a same color, since they do not exist in the
previous version. Instead of highlighting differences, this patch chooses to
dim common parts. This is also more consistent with Phabricator or GitHub
webpage. That said, the labels are defined in a way that people can still
highlight changed parts and leave purely inserted/deleted hunks use the
"non-highlighted" color.

As one example, running:

hg log -pr df50b87d8f736aff8dc281f816bddcd6f306930c mercurial/commands.py \
  --config experimental.worddiff=1 --color=debug --config diff.unified=0

The previous algorithm outputs:

[diff.file_a|--- a/mercurial/commands.py	Fri Mar 09 15:53:41 2018 +0100]
[diff.file_b|+++ b/mercurial/commands.py	Sat Mar 10 12:33:19 2018 +0530]
[diff.hunk|@@ -2039,1 +2039,4 @@]
[diff.deleted|-][diff.deleted.highlight|@command('^forget',][diff.deleted| ][diff.deleted.highlight|walkopts,][diff.deleted| _('[OPTION]... FILE...'), inferrepo=True)]
[diff.inserted|+@command(]
[diff.inserted|+    '^forget',]
[diff.inserted|+    walkopts + dryrunopts,]
[diff.inserted|+ ][diff.inserted.highlight|  ][diff.inserted| _('[OPTION]... FILE...'), inferrepo=True)]
[diff.hunk|@@ -2074,1 +2077,3 @@]
[diff.deleted|-    rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.highlight| explicitonly=False)[0]]
[diff.inserted|+    dryrun = opts.get(r'dry_run')]
[diff.inserted|+    rejected = cmdutil.forget(ui, repo, m, prefix="",]
[diff.inserted|+                              explicitonly=False, dryrun=dryrun)[0]]

The new algorithm outputs:

[diff.file_a|--- a/mercurial/commands.py	Fri Mar 09 15:53:41 2018 +0100]
[diff.file_b|+++ b/mercurial/commands.py	Sat Mar 10 12:33:19 2018 +0530]
[diff.hunk|@@ -2039,1 +2039,4 @@]
[diff.deleted|-][diff.deleted.unchanged|@command(][diff.deleted.unchanged|'^forget',][diff.deleted.unchanged| ][diff.deleted.changed|walkopts][diff.deleted.unchanged|,][diff.deleted.changed| ][diff.deleted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)]
[diff.inserted|+][diff.inserted.unchanged|@command(]
[diff.inserted|+][diff.inserted.changed|    ][diff.inserted.unchanged|'^forget',]
[diff.inserted|+][diff.inserted.changed|    walkopts][diff.inserted.unchanged| ][diff.inserted.changed|+ dryrunopts][diff.inserted.unchanged|,]
[diff.inserted|+][diff.inserted.changed|    ][diff.inserted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)]
[diff.hunk|@@ -2074,1 +2077,3 @@]
[diff.deleted|-][diff.deleted.unchanged|    rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.changed| ][diff.deleted.unchanged|explicitonly=False][diff.deleted.unchanged|)[0]]
[diff.inserted|+][diff.inserted.changed|    dryrun = opts.get(r'dry_run')]
[diff.inserted|+][diff.inserted.unchanged|    rejected = cmdutil.forget(ui, repo, m, prefix="",]
[diff.inserted|+][diff.inserted.changed|                              ][diff.inserted.unchanged|explicitonly=False][diff.inserted.changed|, dryrun=dryrun][diff.inserted.unchanged|)[0]]

Practically, when diffing a 8k line change, the time spent on worddiff
reduces from 4 seconds to 0.14 seconds.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

quark created this revision.Apr 9 2018, 6:59 PM
quark added a comment.Apr 9 2018, 7:11 PM

This is the before and after comparison:

spectral added inline comments.
mercurial/color.py
94

These are the first uses of 'dim' in the default set of things, and I don't think we can rely on it; for color.mode=auto, we really mean "ansi" (aka ecma48) (unless on windows), and don't do any detection of capabilities, so we just output \e[2m and some terminals just ignore that (like mine, rxvt-unicode v9.22). If using color.mode=terminfo, we at least get error messages (I did --config color.log.user='dim green'):

ignoring unknown color/effect 'dim' (configured in color.log.user)

Apparently cygwin doesn't advertise 'dim', and while the linux virtual console advertises it and supports it, it doesn't actually support a dim color (at least on my machine), it just always puts it in a weird blue :)

I think I'd prefer that changed be bold and unchanged be non-bold. For most terminals, that'll lead to a visible difference in intensity (bold being brighter unless using a weird palette), and for those that aren't configured for that, it'll at least be a heavier weight. It's better than having literally zero difference between them without any explanation why. I think it'll also be more obvious which lines have it; in your screenshot the difference between dim and regular is pretty subtle.

quark added inline comments.Apr 9 2018, 9:57 PM
mercurial/color.py
94

As I mentioned in the summary, I believe diff.inserted and diff.inserted.changed should have a same color. And diff.inserted probably shouldn't be bold.

Looking at this review page, you will notice the diff.inserted and diff.inserted.changed are using a same color, where diff.inserted.unchanged is using a different (lighter) background color.

dim works fine where it is supported. For terminals that do not support it, people can override the settings. For dim feature detection, that's an issue in the color code which is unrelated to this change. Since worddiff is experimental and off by default, I don't think dim detection should block this patch.

yuja requested changes to this revision.Apr 10 2018, 10:59 AM
yuja added a subscriber: yuja.

I have no opinion about the "dim" thingy, but the series generally looks
good to me.

Thanks for tackling on the painfully slow SequenceMatcher.ratio() issue.

mercurial/patch.py
53–54

Nit: _wordsplitter as it is private constant

54

Missed br'' here though "\t" and "\x" of string escape are compatible with regexp's.

2542

Nit: maybe we can sort out tokens here instead of re-parsing tabs, newlines, trailing whitespaces later.

But I'm not sure if that will make things simpler.

This revision now requires changes to proceed.Apr 10 2018, 10:59 AM
yuja added a comment.Apr 10 2018, 11:40 AM

BTW, what's the default of git?

The comment in color.py is a bit scary, which says "most terminals don't
support dim." If that's true, we shouldn't use "dim" by default.

quark added a comment.Apr 10 2018, 2:46 PM

Git first had a contrib/diff-highlight/diff-highlight script which inverts foreground/background for hunks with len(deleted_lines) = len(inserted_lines).

Then the latest version shows diff inline. That is:

common words [+inserted words with green color][-deleted words with red color] common words

I dislike that, since it could be ambiguous ("[+ x]" could be part of the original text).

For colors, it's really a hard question. I think if we can detect "dim" is unsupported and make it a no-op, then it'd be fine to use. infocmp can report "dim" correctly on my Linux terminal. Internally, we patched color.py to use 16 and 256 colors even if terminfo reports 8 colors. But I guess that's not an acceptable solution here.

mercurial/patch.py
2542

For a split list ['\n', '\t', ' '], mdiff might return a hunk that joins them. So ''.join(al[a1:a2]) will become more complex.

I think having mdiff step free from EOL/tab handling makes the code easier to read.

spectral added inline comments.
mercurial/color.py
94

I understand the reasoning for wanting diff.inserted and diff.inserted.changed to be the same color, but unfortunately I think it might not be super feasible in upstream.

'dim' support in my terminfo database (and via testing) is actually pretty common among the biggest terminals; for oddities, I'm seeing:

  • the linux console (very few people use it) has weird support
  • screen simulates it with underline, tmux passes it through (so I don't see it)
  • rxvt-unicode doesn't support it
  • the terminfo profile for cygwin doesn't indicate support for it
  • Apple's terminal advertises itself as xterm-compatible and seems to support it
  • iTerm2 similarly

I'm willing to be convinced it's OK, especially since this mode isn't the default. @dhduvall originally wrote the "not supported by many terminals" bit, I wonder if they have any suggestions. @indygreg has also dealt with terminal issues (and deals with windows more than me, so might know more there).

quark added inline comments.Apr 10 2018, 5:11 PM
mercurial/color.py
94

There are not many choices - dim, 16/256 colors, or bold. We ended up with 16/256 colors internally for wider support (ex. tmux). But I'd like to express my (strong) options that:

  • diff.inserted.changed and diff.inserted are same
  • diff.inserted is not bold

in this patch. Because color.py has no 16/256 support yet (and if it does detection conservatively, most terminals will only report 8 color support). And the only remaining choice is "dim". For weird terminals like tmux, I think it's their bugs to fix, not this patch.

That said, I'm fine with changing the defaults to whatever. So feel free to send follow-ups changing it.

yuja added a comment.Apr 11 2018, 10:42 AM

That said, I'm fine with changing the defaults to whatever. So feel
free to send follow-ups changing it.

Can you split a patch changing the color scheme so we can easily
back it out as needed?

mercurial/patch.py
2542

Ah, got it. Let's leave mdiff refactoring until later, then.

quark added a comment.EditedApr 11 2018, 5:07 PM
In D3212#51917, @yuja wrote:

Can you split a patch changing the color scheme so we can easily
back it out as needed?

Note the color configs are not entirely equivalent to the old code. To give an example:

-LINE-1
-LINE-2
-LINE-3
-LINE-4-FOO
+LINE-4-BAR

The old code will use "normal" color for LINE-1 to LINE-3. And highlight only "FOO", "BAR". The new code will treat LINE-1 to 3 and "FOO", "BAR" the same, because they are the "changed" part, and treat "LINE-4-" differently. If we use "bold" for "changed" then it's much easier to make large hunks of code bold, which will look noisy.

Therefore I still think it's reasonable to avoid using "bold" in this patch to avoid noisy outputs.

yuja added a comment.Apr 12 2018, 7:51 AM

Note the color configs are not entirely equivalent to the old code.

I know. It would be roughly equivalent to the old code of sim >= 0.0.

Therefore I still think it's reasonable to avoid using "bold" in this patch to avoid noisy outputs.

I agree on that, but the default output has to be somewhat distinguishable
even if "dim" isn't supported.

dhduvall added inline comments.Apr 12 2018, 10:10 AM
mercurial/color.py
94

I'm not sure what testing I did to claim "most", but I did make that comment in 2011, and xterm didn't start supporting dim until 2014 (it had long supported invisible), so that certainly helped inform that comment. Given @spectral's survey, I'd probably rewrite the comment simply to say that we're using ui.debug() to not be noisy in the wide world of different terminals.

Given support by xterm, iTerm2, and Terminal.app (and I just verified gnome-terminal supports it, too), I'd lean towards just using dim whenever it made sense now, though it'd be nice to know that at least one major Windows terminal app supported it, too. I don't know how widespread rxvt-unicode is in the Linux world, or other terminal emulators that might not support it.

I won't comment on the dim-the-unchanged vs bright-the-changed bits, or how this interacts with the ANSI color mode, other than to say that the default colors aren't an interface, and can be tweaked later without worries about backwards compatibility. Obviously you don't want to change them radically without good reason once people get used to them, but tweaking dim/bold/normal/background/whatever once people have had a chance to experiment more widely might be a reasonable thing to do.

spectral added inline comments.Apr 12 2018, 8:42 PM
mercurial/color.py
94

Ah, didn't know that xterm support for it was that recent :)

I'm fine with assuming that rxvt-unicode isn't used enough to warrant avoiding it here, especially if someone with a windows machine can chime in with whether or not it's supported. For platforms where the default terminal emulator doesn't support it, the maintainers can adjust it to be whatever they like (or we could do so if we create official packages for that platform/distribution).

tmux *does* support it, it just doesn't support *emulating* it if the terminal on the outside of tmux doesn't.

I'm fine with the change using dim. Maybe I can use this to get rxvt-unicode to improve their support. :P

I think one alternative is just to use green_background like git/contrib/diff-highlight/diff-highlight.perl. It satisfies all properties I'd like to have, and is supported by weird terminals including cmd.exe and less.exe. And is different from diff.file_a, diff.file_b colors. I'll probably just use this.

I can't really tell what's going on there, frankly, unlike with the dim case, where it make sense, but I can always change the colors myself.

durin42 accepted this revision.Apr 16 2018, 7:01 PM

In the name of progress, I'm going to land these. If we find problems with color defaults, let's try and fix them in the RC period, but it's still an experimental feature so we've got plenty of wiggle room.

(I'll say that red-dimmed text is very hard for this moderate protan with a black terminal window.)

This revision was automatically updated to reflect the committed changes.
In D3212#52767, @quark wrote:

👍 for git inspired colored output. I gave a try to the dim version and it was less readable than the old version.