This is an archive of the discontinued Mercurial Phabricator instance.

patch: implement a new worddiff algorithm
ClosedPublic

Authored by quark on Apr 9 2018, 6:59 PM.

Download Raw Diff

Details

Reviewers

durin42
yuja

Group Reviewers

hg-reviewers

Commits

rHG35632d392279: patch: implement a new worddiff algorithm

Summary

The previous worddiff algorithm has many problems. The major problem is it
does a "similarity check" that selects a subset of matched lines to do
inline diffs. It is a bad idea because:

The "similarity check" is non-obvious to users. For example, a simple change from "long long x" to "int64_t x" will fail the similarity check and won't be diff-ed as expected.
Selecting "lines" to diff won't work as people expect if there are line wrapping changes.
It has a sad time complexity if lines do not match, could be O(N^2)-ish.

There are other problems in implementation details.

Lines can match across distant hunks (if the next hunk does not have "-" lines).
"difflib" is slow.

The solution would be removing the "similarity check", and just diff all
words in a same hunk. So no content will be missed and everything will be
diff-ed as expected. This is similar to what code review tool like
Phabricator does.

This diff implements the word diff algorithm as described above. It also
avoids difflib to be faster.

Note about colors: To be consistent, "changed inserted" parts and "purely
insertion blocks" should have a same color, since they do not exist in the
previous version. Instead of highlighting differences, this patch chooses to
dim common parts. This is also more consistent with Phabricator or GitHub
webpage. That said, the labels are defined in a way that people can still
highlight changed parts and leave purely inserted/deleted hunks use the
"non-highlighted" color.

As one example, running:

hg log -pr df50b87d8f736aff8dc281f816bddcd6f306930c mercurial/commands.py \
  --config experimental.worddiff=1 --color=debug --config diff.unified=0

The previous algorithm outputs:

[diff.file_a|--- a/mercurial/commands.py	Fri Mar 09 15:53:41 2018 +0100]
[diff.file_b|+++ b/mercurial/commands.py	Sat Mar 10 12:33:19 2018 +0530]
[diff.hunk|@@ -2039,1 +2039,4 @@]
[diff.deleted|-][diff.deleted.highlight|@command('^forget',][diff.deleted| ][diff.deleted.highlight|walkopts,][diff.deleted| _('[OPTION]... FILE...'), inferrepo=True)]
[diff.inserted|+@command(]
[diff.inserted|+    '^forget',]
[diff.inserted|+    walkopts + dryrunopts,]
[diff.inserted|+ ][diff.inserted.highlight|  ][diff.inserted| _('[OPTION]... FILE...'), inferrepo=True)]
[diff.hunk|@@ -2074,1 +2077,3 @@]
[diff.deleted|-    rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.highlight| explicitonly=False)[0]]
[diff.inserted|+    dryrun = opts.get(r'dry_run')]
[diff.inserted|+    rejected = cmdutil.forget(ui, repo, m, prefix="",]
[diff.inserted|+                              explicitonly=False, dryrun=dryrun)[0]]

The new algorithm outputs:

[diff.file_a|--- a/mercurial/commands.py	Fri Mar 09 15:53:41 2018 +0100]
[diff.file_b|+++ b/mercurial/commands.py	Sat Mar 10 12:33:19 2018 +0530]
[diff.hunk|@@ -2039,1 +2039,4 @@]
[diff.deleted|-][diff.deleted.unchanged|@command(][diff.deleted.unchanged|'^forget',][diff.deleted.unchanged| ][diff.deleted.changed|walkopts][diff.deleted.unchanged|,][diff.deleted.changed| ][diff.deleted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)]
[diff.inserted|+][diff.inserted.unchanged|@command(]
[diff.inserted|+][diff.inserted.changed|    ][diff.inserted.unchanged|'^forget',]
[diff.inserted|+][diff.inserted.changed|    walkopts][diff.inserted.unchanged| ][diff.inserted.changed|+ dryrunopts][diff.inserted.unchanged|,]
[diff.inserted|+][diff.inserted.changed|    ][diff.inserted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)]
[diff.hunk|@@ -2074,1 +2077,3 @@]
[diff.deleted|-][diff.deleted.unchanged|    rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.changed| ][diff.deleted.unchanged|explicitonly=False][diff.deleted.unchanged|)[0]]
[diff.inserted|+][diff.inserted.changed|    dryrun = opts.get(r'dry_run')]
[diff.inserted|+][diff.inserted.unchanged|    rejected = cmdutil.forget(ui, repo, m, prefix="",]
[diff.inserted|+][diff.inserted.changed|                              ][diff.inserted.unchanged|explicitonly=False][diff.inserted.changed|, dryrun=dryrun][diff.inserted.unchanged|)[0]]

Practically, when diffing a 8k line change, the time spent on worddiff
reduces from 4 seconds to 0.14 seconds.

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

quark created this revision.Apr 9 2018, 6:59 PM

Herald added a reviewer: hg-reviewers. · View Herald TranscriptApr 9 2018, 6:59 PM

Herald added a subscriber: mercurial-devel. · View Herald Transcript

quark added a reviewer: durin42.Apr 9 2018, 7:00 PM

This is the before and after comparison:

spectral added a subscriber: spectral.Apr 9 2018, 9:03 PM

spectral added inline comments.

mercurial/color.py
94	These are the first uses of 'dim' in the default set of things, and I don't think we can rely on it; for color.mode=auto, we really mean "ansi" (aka ecma48) (unless on windows), and don't do any detection of capabilities, so we just output \e[2m and some terminals just ignore that (like mine, rxvt-unicode v9.22). If using color.mode=terminfo, we at least get error messages (I did --config color.log.user='dim green'): ignoring unknown color/effect 'dim' (configured in color.log.user) Apparently cygwin doesn't advertise 'dim', and while the linux virtual console advertises it and supports it, it doesn't actually support a dim color (at least on my machine), it just always puts it in a weird blue :) I think I'd prefer that `changed` be bold and `unchanged` be non-bold. For most terminals, that'll lead to a visible difference in intensity (bold being brighter unless using a weird palette), and for those that aren't configured for that, it'll at least be a heavier weight. It's better than having literally zero difference between them without any explanation why. I think it'll also be more obvious which lines have it; in your screenshot the difference between dim and regular is pretty subtle.

quark added inline comments.Apr 9 2018, 9:57 PM

mercurial/color.py
94	As I mentioned in the summary, I believe `diff.inserted` and `diff.inserted.changed` should have a same color. And `diff.inserted` probably shouldn't be bold. Looking at this review page, you will notice the `diff.inserted` and `diff.inserted.changed` are using a same color, where `diff.inserted.unchanged` is using a different (lighter) background color. `dim` works fine where it is supported. For terminals that do not support it, people can override the settings. For `dim` feature detection, that's an issue in the color code which is unrelated to this change. Since worddiff is experimental and off by default, I don't think dim detection should block this patch.

I have no opinion about the "dim" thingy, but the series generally looks
good to me.

Thanks for tackling on the painfully slow SequenceMatcher.ratio() issue.

mercurial/patch.py
53–54	Nit: `_wordsplitter` as it is private constant
54	Missed `br''` here though "\t" and "\x" of string escape are compatible with regexp's.
2542	Nit: maybe we can sort out tokens here instead of re-parsing tabs, newlines, trailing whitespaces later. But I'm not sure if that will make things simpler.

This revision now requires changes to proceed.Apr 10 2018, 10:59 AM

BTW, what's the default of git?

The comment in color.py is a bit scary, which says "most terminals don't
support dim." If that's true, we shouldn't use "dim" by default.

Git first had a contrib/diff-highlight/diff-highlight script which inverts foreground/background for hunks with len(deleted_lines) = len(inserted_lines).

Then the latest version shows diff inline. That is:

common words [+inserted words with green color][-deleted words with red color] common words

I dislike that, since it could be ambiguous ("[+ x]" could be part of the original text).

For colors, it's really a hard question. I think if we can detect "dim" is unsupported and make it a no-op, then it'd be fine to use. infocmp can report "dim" correctly on my Linux terminal. Internally, we patched color.py to use 16 and 256 colors even if terminfo reports 8 colors. But I guess that's not an acceptable solution here.

mercurial/patch.py
2542	For a split list `['\n', '\t', ' ']`, mdiff might return a hunk that joins them. So `''.join(al[a1:a2])` will become more complex. I think having mdiff step free from EOL/tab handling makes the code easier to read.

spectral added subscribers: dhduvall, indygreg.Apr 10 2018, 3:38 PM

spectral added inline comments.

mercurial/color.py
94	I understand the reasoning for wanting `diff.inserted` and `diff.inserted.changed` to be the same color, but unfortunately I think it might not be super feasible in upstream. 'dim' support in my terminfo database (and via testing) is actually pretty common among the biggest terminals; for oddities, I'm seeing: the linux console (very few people use it) has weird support screen simulates it with underline, tmux passes it through (so I don't see it) rxvt-unicode doesn't support it the terminfo profile for cygwin doesn't indicate support for it Apple's terminal advertises itself as xterm-compatible and seems to support it iTerm2 similarly I'm willing to be convinced it's OK, especially since this mode isn't the default. @dhduvall originally wrote the "not supported by many terminals" bit, I wonder if they have any suggestions. @indygreg has also dealt with terminal issues (and deals with windows more than me, so might know more there).

quark added inline comments.Apr 10 2018, 5:11 PM

mercurial/color.py
94	There are not many choices - dim, 16/256 colors, or bold. We ended up with 16/256 colors internally for wider support (ex. tmux). But I'd like to express my (strong) options that: diff.inserted.changed and diff.inserted are same diff.inserted is not bold in this patch. Because color.py has no 16/256 support yet (and if it does detection conservatively, most terminals will only report 8 color support). And the only remaining choice is "dim". For weird terminals like tmux, I think it's their bugs to fix, not this patch. That said, I'm fine with changing the defaults to whatever. So feel free to send follow-ups changing it.

That said, I'm fine with changing the defaults to whatever. So feel
free to send follow-ups changing it.

Can you split a patch changing the color scheme so we can easily
back it out as needed?

mercurial/patch.py
2542	Ah, got it. Let's leave mdiff refactoring until later, then.

In D3212#51917, @yuja wrote:

Can you split a patch changing the color scheme so we can easily
back it out as needed?

Note the color configs are not entirely equivalent to the old code. To give an example:

-LINE-1
-LINE-2
-LINE-3
-LINE-4-FOO
+LINE-4-BAR

The old code will use "normal" color for LINE-1 to LINE-3. And highlight only "FOO", "BAR". The new code will treat LINE-1 to 3 and "FOO", "BAR" the same, because they are the "changed" part, and treat "LINE-4-" differently. If we use "bold" for "changed" then it's much easier to make large hunks of code bold, which will look noisy.

Therefore I still think it's reasonable to avoid using "bold" in this patch to avoid noisy outputs.

Note the color configs are not entirely equivalent to the old code.

I know. It would be roughly equivalent to the old code of sim >= 0.0.

Therefore I still think it's reasonable to avoid using "bold" in this patch to avoid noisy outputs.

I agree on that, but the default output has to be somewhat distinguishable
even if "dim" isn't supported.

dhduvall added inline comments.Apr 12 2018, 10:10 AM

mercurial/color.py
94	I'm not sure what testing I did to claim "most", but I did make that comment in 2011, and xterm didn't start supporting dim until 2014 (it had long supported invisible), so that certainly helped inform that comment. Given @spectral's survey, I'd probably rewrite the comment simply to say that we're using `ui.debug()` to not be noisy in the wide world of different terminals. Given support by xterm, iTerm2, and Terminal.app (and I just verified gnome-terminal supports it, too), I'd lean towards just using dim whenever it made sense now, though it'd be nice to know that at least one major Windows terminal app supported it, too. I don't know how widespread rxvt-unicode is in the Linux world, or other terminal emulators that might not support it. I won't comment on the dim-the-unchanged vs bright-the-changed bits, or how this interacts with the ANSI color mode, other than to say that the default colors aren't an interface, and can be tweaked later without worries about backwards compatibility. Obviously you don't want to change them radically without good reason once people get used to them, but tweaking dim/bold/normal/background/whatever once people have had a chance to experiment more widely might be a reasonable thing to do.

spectral added inline comments.Apr 12 2018, 8:42 PM

mercurial/color.py
94	Ah, didn't know that xterm support for it was that recent :) I'm fine with assuming that rxvt-unicode isn't used enough to warrant avoiding it here, especially if someone with a windows machine can chime in with whether or not it's supported. For platforms where the default terminal emulator doesn't support it, the maintainers can adjust it to be whatever they like (or we could do so if we create official packages for that platform/distribution). tmux does support it, it just doesn't support emulating it if the terminal on the outside of tmux doesn't. I'm fine with the change using dim. Maybe I can use this to get rxvt-unicode to improve their support. :P

I think one alternative is just to use green_background like git/contrib/diff-highlight/diff-highlight.perl. It satisfies all properties I'd like to have, and is supported by weird terminals including cmd.exe and less.exe. And is different from diff.file_a, diff.file_b colors. I'll probably just use this.

I can't really tell what's going on there, frankly, unlike with the dim case, where it make sense, but I can always change the colors myself.

durin42 accepted this revision.Apr 16 2018, 7:01 PM

In the name of progress, I'm going to land these. If we find problems with color defaults, let's try and fix them in the RC period, but it's still an experimental feature so we've got plenty of wiggle room.

(I'll say that red-dimmed text is very hard for this moderate protan with a black terminal window.)

Closed by commit rHG35632d392279: patch: implement a new worddiff algorithm (authored by quark). · Explain WhyApr 16 2018, 7:11 PM

This revision was automatically updated to reflect the committed changes.

In D3212#52767, @quark wrote:

👍 for git inspired colored output. I gave a try to the dim version and it was less readable than the old version.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/color.py (6 lines)
M			mercurial/patch.py (75 lines)
M			tests/test-diff-color.t (54 lines)

Status	Author	Revision
Closed	quark	D3212 patch: implement a new worddiff algorithm
Closed	quark	D3211 patch: buffer lines for a same hunk
Closed	quark	D3210 patch: move yielding "\n" to the end of loop

Diff 8335

mercurial/color.py

	'grep.date': 'magenta',			'grep.date': 'magenta',
	'bookmarks.active': 'green',			'bookmarks.active': 'green',
	'branches.active': 'none',			'branches.active': 'none',
	'branches.closed': 'black bold',			'branches.closed': 'black bold',
	'branches.current': 'green',			'branches.current': 'green',
	'branches.inactive': 'none',			'branches.inactive': 'none',
	'diff.changed': 'white',			'diff.changed': 'white',
	'diff.deleted': 'red',			'diff.deleted': 'red',
	'diff.deleted.highlight': 'red bold underline',			'diff.deleted.changed': 'red',
				'diff.deleted.unchanged': 'red dim',
				spectralUnsubmitted Not Done These are the first uses of 'dim' in the default set of things, and I don't think we can rely on it; for color.mode=auto, we really mean "ansi" (aka ecma48) (unless on windows), and don't do any detection of capabilities, so we just output \e[2m and some terminals just ignore that (like mine, rxvt-unicode v9.22). If using color.mode=terminfo, we at least get error messages (I did --config color.log.user='dim green'): ignoring unknown color/effect 'dim' (configured in color.log.user) Apparently cygwin doesn't advertise 'dim', and while the linux virtual console advertises it and supports it, it doesn't actually support a dim color (at least on my machine), it just always puts it in a weird blue :) I think I'd prefer that `changed` be bold and `unchanged` be non-bold. For most terminals, that'll lead to a visible difference in intensity (bold being brighter unless using a weird palette), and for those that aren't configured for that, it'll at least be a heavier weight. It's better than having literally zero difference between them without any explanation why. I think it'll also be more obvious which lines have it; in your screenshot the difference between dim and regular is pretty subtle. spectral: These are the first uses of 'dim' in the default set of things, and I don't think we can rely…
				quarkAuthorUnsubmitted Not Done As I mentioned in the summary, I believe `diff.inserted` and `diff.inserted.changed` should have a same color. And `diff.inserted` probably shouldn't be bold. Looking at this review page, you will notice the `diff.inserted` and `diff.inserted.changed` are using a same color, where `diff.inserted.unchanged` is using a different (lighter) background color. `dim` works fine where it is supported. For terminals that do not support it, people can override the settings. For `dim` feature detection, that's an issue in the color code which is unrelated to this change. Since worddiff is experimental and off by default, I don't think dim detection should block this patch. quark: As I mentioned in the summary, I believe `diff.inserted` and `diff.inserted.changed` should…
				spectralUnsubmitted Not Done I understand the reasoning for wanting `diff.inserted` and `diff.inserted.changed` to be the same color, but unfortunately I think it might not be super feasible in upstream. 'dim' support in my terminfo database (and via testing) is actually pretty common among the biggest terminals; for oddities, I'm seeing: the linux console (very few people use it) has weird support screen simulates it with underline, tmux passes it through (so I don't see it) rxvt-unicode doesn't support it the terminfo profile for cygwin doesn't indicate support for it Apple's terminal advertises itself as xterm-compatible and seems to support it iTerm2 similarly I'm willing to be convinced it's OK, especially since this mode isn't the default. @dhduvall originally wrote the "not supported by many terminals" bit, I wonder if they have any suggestions. @indygreg has also dealt with terminal issues (and deals with windows more than me, so might know more there). spectral: I understand the reasoning for wanting `diff.inserted` and `diff.inserted.changed` to be the…
				quarkAuthorUnsubmitted Not Done There are not many choices - dim, 16/256 colors, or bold. We ended up with 16/256 colors internally for wider support (ex. tmux). But I'd like to express my (strong) options that: diff.inserted.changed and diff.inserted are same diff.inserted is not bold in this patch. Because color.py has no 16/256 support yet (and if it does detection conservatively, most terminals will only report 8 color support). And the only remaining choice is "dim". For weird terminals like tmux, I think it's their bugs to fix, not this patch. That said, I'm fine with changing the defaults to whatever. So feel free to send follow-ups changing it. quark: There are not many choices - dim, 16/256 colors, or bold. We ended up with 16/256 colors…
				dhduvallUnsubmitted Not Done I'm not sure what testing I did to claim "most", but I did make that comment in 2011, and xterm didn't start supporting dim until 2014 (it had long supported invisible), so that certainly helped inform that comment. Given @spectral's survey, I'd probably rewrite the comment simply to say that we're using `ui.debug()` to not be noisy in the wide world of different terminals. Given support by xterm, iTerm2, and Terminal.app (and I just verified gnome-terminal supports it, too), I'd lean towards just using dim whenever it made sense now, though it'd be nice to know that at least one major Windows terminal app supported it, too. I don't know how widespread rxvt-unicode is in the Linux world, or other terminal emulators that might not support it. I won't comment on the dim-the-unchanged vs bright-the-changed bits, or how this interacts with the ANSI color mode, other than to say that the default colors aren't an interface, and can be tweaked later without worries about backwards compatibility. Obviously you don't want to change them radically without good reason once people get used to them, but tweaking dim/bold/normal/background/whatever once people have had a chance to experiment more widely might be a reasonable thing to do. dhduvall: I'm not sure what testing I did to claim "most", but I did make that comment in 2011, and xterm…
				spectralUnsubmitted Not Done Ah, didn't know that xterm support for it was that recent :) I'm fine with assuming that rxvt-unicode isn't used enough to warrant avoiding it here, especially if someone with a windows machine can chime in with whether or not it's supported. For platforms where the default terminal emulator doesn't support it, the maintainers can adjust it to be whatever they like (or we could do so if we create official packages for that platform/distribution). tmux does support it, it just doesn't support emulating it if the terminal on the outside of tmux doesn't. I'm fine with the change using dim. Maybe I can use this to get rxvt-unicode to improve their support. :P spectral: Ah, didn't know that xterm support for it was that recent :) I'm fine with assuming that rxvt…
	'diff.diffline': 'bold',			'diff.diffline': 'bold',
	'diff.extended': 'cyan bold',			'diff.extended': 'cyan bold',
	'diff.file_a': 'red bold',			'diff.file_a': 'red bold',
	'diff.file_b': 'green bold',			'diff.file_b': 'green bold',
	'diff.hunk': 'magenta',			'diff.hunk': 'magenta',
	'diff.inserted': 'green',			'diff.inserted': 'green',
	'diff.inserted.highlight': 'green bold underline',			'diff.inserted.changed': 'green',
				'diff.inserted.unchanged': 'green dim',
	'diff.tab': '',			'diff.tab': '',
	'diff.trailingwhitespace': 'bold red_background',			'diff.trailingwhitespace': 'bold red_background',
	'changeset.public': '',			'changeset.public': '',
	'changeset.draft': '',			'changeset.draft': '',
	'changeset.secret': '',			'changeset.secret': '',
	'diffstat.deleted': 'red',			'diffstat.deleted': 'red',
	'diffstat.inserted': 'green',			'diffstat.inserted': 'green',
	'formatvariant.name.mismatchconfig': 'red',			'formatvariant.name.mismatchconfig': 'red',

mercurial/patch.py

	procutil,			procutil,
	stringutil,			stringutil,
	)			)

	stringio = util.stringio			stringio = util.stringio

	gitre = re.compile(br'diff --git a/(.) b/(.)')			gitre = re.compile(br'diff --git a/(.) b/(.)')
	tabsplitter = re.compile(br'(\t+\|[^\t]+)')			tabsplitter = re.compile(br'(\t+\|[^\t]+)')
	_nonwordre = re.compile(br'([^a-zA-Z0-9_\x80-\xff])')			wordsplitter = re.compile(br'(\t+\| +\|[a-zA-Z0-9_\x80-\xff]+\|'
				'[^ \ta-zA-Z0-9_\x80-\xff])')
				yujaUnsubmitted Not Done Nit: `_wordsplitter` as it is private constant yuja: Nit: `_wordsplitter` as it is private constant
				yujaUnsubmitted Not Done Missed `br''` here though "\t" and "\x" of string escape are compatible with regexp's. yuja: Missed `br''` here though "\t" and "\x" of string escape are compatible with regexp's.

	PatchError = error.PatchError			PatchError = error.PatchError

	# public functions			# public functions

	def split(stream):			def split(stream):
	'''return an iterator of individual patches from a stream'''			'''return an iterator of individual patches from a stream'''
	def isheader(line, inheader):			def isheader(line, inheader):
	else:			else:
	yield (token, label)			yield (token, label)

	if chompline != stripline:			if chompline != stripline:
	yield (chompline[len(stripline):], 'diff.trailingwhitespace')			yield (chompline[len(stripline):], 'diff.trailingwhitespace')
	if chompline != line:			if chompline != line:
	yield (line[len(chompline):], '')			yield (line[len(chompline):], '')

				def diffsinglehunkinline(hunklines):
				"""yield tokens for a list of lines in a single hunk, with inline colors"""
				# prepare deleted, and inserted content
				a = ''
				b = ''
				for line in hunklines:
				if line[0] == '-':
				a += line[1:]
				elif line[0] == '+':
				b += line[1:]
				else:
				raise error.ProgrammingError('unexpected hunk line: %s' % line)
				# fast path: if either side is empty, use diffsinglehunk
				if not a or not b:
				for t in diffsinglehunk(hunklines):
				yield t
				return
				# re-split the content into words
				al = wordsplitter.findall(a)
				bl = wordsplitter.findall(b)
				# re-arrange the words to lines since the diff algorithm is line-based
				aln = [s if s == '\n' else s + '\n' for s in al]
				bln = [s if s == '\n' else s + '\n' for s in bl]
				an = ''.join(aln)
				bn = ''.join(bln)
				# run the diff algorithm, prepare atokens and btokens
				atokens = []
				btokens = []
				blocks = mdiff.allblocks(an, bn, lines1=aln, lines2=bln)
				for (a1, a2, b1, b2), btype in blocks:
				changed = btype == '!'
				for token in mdiff.splitnewlines(''.join(al[a1:a2])):
				atokens.append((changed, token))
				for token in mdiff.splitnewlines(''.join(bl[b1:b2])):
				btokens.append((changed, token))
				yujaUnsubmitted Not Done Nit: maybe we can sort out tokens here instead of re-parsing tabs, newlines, trailing whitespaces later. But I'm not sure if that will make things simpler. yuja: Nit: maybe we can sort out tokens here instead of re-parsing tabs, newlines, trailing…
				quarkAuthorUnsubmitted Not Done For a split list `['\n', '\t', ' ']`, mdiff might return a hunk that joins them. So `''.join(al[a1:a2])` will become more complex. I think having mdiff step free from EOL/tab handling makes the code easier to read. quark: For a split list `['\n', '\t', ' ']`, mdiff might return a hunk that joins them. So `''.join(al…
				yujaUnsubmitted Not Done Ah, got it. Let's leave mdiff refactoring until later, then. yuja: Ah, got it. Let's leave mdiff refactoring until later, then.

				# yield deleted tokens, then inserted ones
				for prefix, label, tokens in [('-', 'diff.deleted', atokens),
				('+', 'diff.inserted', btokens)]:
				nextisnewline = True
				for changed, token in tokens:
				if nextisnewline:
				yield (prefix, label)
				nextisnewline = False
				# special handling line end
				isendofline = token.endswith('\n')
				if isendofline:
				chomp = token[:-1] # chomp
				token = chomp.rstrip() # detect spaces at the end
				endspaces = chomp[len(token):]
				# scan tabs
				for maybetab in tabsplitter.findall(token):
				if '\t' == maybetab[0]:
				currentlabel = 'diff.tab'
				else:
				if changed:
				currentlabel = label + '.changed'
				else:
				currentlabel = label + '.unchanged'
				yield (maybetab, currentlabel)
				if isendofline:
				if endspaces:
				yield (endspaces, 'diff.trailingwhitespace')
				yield ('\n', '')
				nextisnewline = True

	def difflabel(func, args, *kw):			def difflabel(func, args, *kw):
	'''yields 2-tuples of (output, label) based on the output of func()'''			'''yields 2-tuples of (output, label) based on the output of func()'''
				if kw.get(r'opts') and kw[r'opts'].worddiff:
				dodiffhunk = diffsinglehunkinline
				else:
				dodiffhunk = diffsinglehunk
	headprefixes = [('diff', 'diff.diffline'),			headprefixes = [('diff', 'diff.diffline'),
	('copy', 'diff.extended'),			('copy', 'diff.extended'),
	('rename', 'diff.extended'),			('rename', 'diff.extended'),
	('old', 'diff.extended'),			('old', 'diff.extended'),
	('new', 'diff.extended'),			('new', 'diff.extended'),
	('deleted', 'diff.extended'),			('deleted', 'diff.extended'),
	('index', 'diff.extended'),			('index', 'diff.extended'),
	('similarity', 'diff.extended'),			('similarity', 'diff.extended'),
	('---', 'diff.file_a'),			('---', 'diff.file_a'),
	('+++', 'diff.file_b')]			('+++', 'diff.file_b')]
	textprefixes = [('@', 'diff.hunk'),			textprefixes = [('@', 'diff.hunk'),
	# - and + are handled by diffsinglehunk			# - and + are handled by diffsinglehunk
	]			]
	head = False			head = False

	# buffers a hunk, i.e. adjacent "-", "+" lines without other changes.			# buffers a hunk, i.e. adjacent "-", "+" lines without other changes.
	hunkbuffer = []			hunkbuffer = []
	def consumehunkbuffer():			def consumehunkbuffer():
	if hunkbuffer:			if hunkbuffer:
	for token in diffsinglehunk(hunkbuffer):			for token in dodiffhunk(hunkbuffer):
	yield token			yield token
	hunkbuffer[:] = []			hunkbuffer[:] = []

	for chunk in func(args, *kw):			for chunk in func(args, *kw):
	lines = chunk.split('\n')			lines = chunk.split('\n')
	linecount = len(lines)			linecount = len(lines)
	for i, line in enumerate(lines):			for i, line in enumerate(lines):
	if head:			if head:

tests/test-diff-color.t

	[diff.inserted\|+ assuming this works)]			[diff.inserted\|+ assuming this works)]
	[diff.inserted\|+be changed into four!]			[diff.inserted\|+be changed into four!]

	[diff.deleted\|-three of those lines will]			[diff.deleted\|-three of those lines will]
	[diff.deleted\|-collapse onto one]			[diff.deleted\|-collapse onto one]
	[diff.deleted\|-(to see if it works)]			[diff.deleted\|-(to see if it works)]
	[diff.inserted\|+three of those lines have]			[diff.inserted\|+three of those lines have]
	[diff.inserted\|+collapsed onto one]			[diff.inserted\|+collapsed onto one]
	#if false
	$ hg diff --config experimental.worddiff=True --color=debug			$ hg diff --config experimental.worddiff=True --color=debug
	[diff.diffline\|diff --git a/file1 b/file1]			[diff.diffline\|diff --git a/file1 b/file1]
	[diff.file_a\|--- a/file1]			[diff.file_a\|--- a/file1]
	[diff.file_b\|+++ b/file1]			[diff.file_b\|+++ b/file1]
	[diff.hunk\|@@ -1,16 +1,17 @@]			[diff.hunk\|@@ -1,16 +1,17 @@]
	[diff.deleted\|-this is the ][diff.deleted.highlight\|first][diff.deleted\| line]			[diff.deleted\|-][diff.deleted.changed\|this][diff.deleted.unchanged\| is the first ][diff.deleted.changed\|line]
	[diff.deleted\|-this is the second line]			[diff.deleted\|-][diff.deleted.unchanged\|this is the second line]
	[diff.deleted\|-][diff.deleted.highlight\| ][diff.deleted\|third line starts with space]			[diff.deleted\|-][diff.deleted.changed\| ][diff.deleted.unchanged\|third line starts with space]
	[diff.deleted\|-][diff.deleted.highlight\|+][diff.deleted\| starts with a ][diff.deleted.highlight\|plus][diff.deleted\| sign]			[diff.deleted\|-][diff.deleted.changed\|+][diff.deleted.unchanged\| starts with a ][diff.deleted.changed\|plus][diff.deleted.unchanged\| sign]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|this one with ][diff.deleted.highlight\|one][diff.deleted\| tab]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|this one with ][diff.deleted.changed\|one][diff.deleted.unchanged\| tab]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|now with full ][diff.deleted.highlight\|two][diff.deleted\| tabs]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|now with full ][diff.deleted.changed\|two][diff.deleted.unchanged\| tabs]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|now tabs][diff.tab\| ][diff.deleted\|everywhere, much fun]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|now ][diff.deleted.unchanged\|tabs][diff.tab\| ][diff.deleted.unchanged\|everywhere, much fun]
	[diff.inserted\|+that is the first paragraph]			[diff.inserted\|+][diff.inserted.changed\|that][diff.inserted.unchanged\| is the first ][diff.inserted.changed\|paragraph]
	[diff.inserted\|+][diff.inserted.highlight\| ][diff.inserted\|this is the ][diff.inserted.highlight\|second][diff.inserted\| line]			[diff.inserted\|+][diff.inserted.changed\| ][diff.inserted.unchanged\|this is the second line]
	[diff.inserted\|+third line starts with space]			[diff.inserted\|+][diff.inserted.unchanged\|third line starts with space]
	[diff.inserted\|+][diff.inserted.highlight\|-][diff.inserted\| starts with a ][diff.inserted.highlight\|minus][diff.inserted\| sign]			[diff.inserted\|+][diff.inserted.changed\|-][diff.inserted.unchanged\| starts with a ][diff.inserted.changed\|minus][diff.inserted.unchanged\| sign]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|this one with ][diff.inserted.highlight\|two][diff.inserted\| tab]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|this one with ][diff.inserted.changed\|two][diff.inserted.unchanged\| tab]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|now with full ][diff.inserted.highlight\|three][diff.inserted\| tabs]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|now with full ][diff.inserted.changed\|three][diff.inserted.unchanged\| tabs]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|now][diff.inserted.highlight\| there are][diff.inserted\| tabs][diff.tab\| ][diff.inserted\|everywhere, much fun]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|now ][diff.inserted.changed\|there are ][diff.inserted.unchanged\|tabs][diff.tab\| ][diff.inserted.unchanged\|everywhere, much fun]

	this line won't change			this line won't change

	two lines are going to			two lines are going to
	[diff.deleted\|-be changed into ][diff.deleted.highlight\|three][diff.deleted\|!]			[diff.deleted\|-][diff.deleted.unchanged\|be changed into ][diff.deleted.changed\|three][diff.deleted.unchanged\|!]
	[diff.inserted\|+(entirely magically,]			[diff.inserted\|+][diff.inserted.changed\|(entirely magically,]
	[diff.inserted\|+ assuming this works)]			[diff.inserted\|+][diff.inserted.changed\| assuming this works)]
	[diff.inserted\|+be changed into ][diff.inserted.highlight\|four][diff.inserted\|!]			[diff.inserted\|+][diff.inserted.unchanged\|be changed into ][diff.inserted.changed\|four][diff.inserted.unchanged\|!]

	[diff.deleted\|-three of those lines ][diff.deleted.highlight\|will]			[diff.deleted\|-][diff.deleted.unchanged\|three of those lines ][diff.deleted.changed\|will]
	[diff.deleted\|-][diff.deleted.highlight\|collapse][diff.deleted\| onto one]			[diff.deleted\|-][diff.deleted.changed\|collapse][diff.deleted.unchanged\| onto one]
	[diff.deleted\|-(to see if it works)]			[diff.deleted\|-][diff.deleted.changed\|(to see if it works)]
	[diff.inserted\|+three of those lines ][diff.inserted.highlight\|have]			[diff.inserted\|+][diff.inserted.unchanged\|three of those lines ][diff.inserted.changed\|have]
	[diff.inserted\|+][diff.inserted.highlight\|collapsed][diff.inserted\| onto one]			[diff.inserted\|+][diff.inserted.changed\|collapsed][diff.inserted.unchanged\| onto one]
	#endif

	multibyte character shouldn't be broken up in word diff:			multibyte character shouldn't be broken up in word diff:

	$ $PYTHON <<'EOF'			$ $PYTHON <<'EOF'
	> with open("utf8", "wb") as f:			> with open("utf8", "wb") as f:
	> f.write(b"blah \xe3\x82\xa2 blah\n")			> f.write(b"blah \xe3\x82\xa2 blah\n")
	> EOF			> EOF
	$ hg ci -Am 'add utf8 char' utf8			$ hg ci -Am 'add utf8 char' utf8
	$ $PYTHON <<'EOF'			$ $PYTHON <<'EOF'
	> with open("utf8", "wb") as f:			> with open("utf8", "wb") as f:
	> f.write(b"blah \xe3\x82\xa4 blah\n")			> f.write(b"blah \xe3\x82\xa4 blah\n")
	> EOF			> EOF
	$ hg ci -m 'slightly change utf8 char' utf8			$ hg ci -m 'slightly change utf8 char' utf8

	#if false
	$ hg diff --config experimental.worddiff=True --color=debug -c.			$ hg diff --config experimental.worddiff=True --color=debug -c.
	[diff.diffline\|diff --git a/utf8 b/utf8]			[diff.diffline\|diff --git a/utf8 b/utf8]
	[diff.file_a\|--- a/utf8]			[diff.file_a\|--- a/utf8]
	[diff.file_b\|+++ b/utf8]			[diff.file_b\|+++ b/utf8]
	[diff.hunk\|@@ -1,1 +1,1 @@]			[diff.hunk\|@@ -1,1 +1,1 @@]
	[diff.deleted\|-blah ][diff.deleted.highlight\|\xe3\x82\xa2][diff.deleted\| blah] (esc)			[diff.deleted\|-][diff.deleted.unchanged\|blah ][diff.deleted.changed\|\xe3\x82\xa2][diff.deleted.unchanged\| blah] (esc)
	[diff.inserted\|+blah ][diff.inserted.highlight\|\xe3\x82\xa4][diff.inserted\| blah] (esc)			[diff.inserted\|+][diff.inserted.unchanged\|blah ][diff.inserted.changed\|\xe3\x82\xa4][diff.inserted.unchanged\| blah] (esc)
	#endif

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	7924		Apr 9 2018, 6:59 PM	★	★
Diff 2	8335	rHG35632d3922797b5ec69907ea631197e3474f0bc5	Apr 9 2018, 6:58 PM	★	★

	[diff.inserted\|+ assuming this works)]			[diff.inserted\|+ assuming this works)]
	[diff.inserted\|+be changed into four!]			[diff.inserted\|+be changed into four!]

	[diff.deleted\|-three of those lines will]			[diff.deleted\|-three of those lines will]
	[diff.deleted\|-collapse onto one]			[diff.deleted\|-collapse onto one]
	[diff.deleted\|-(to see if it works)]			[diff.deleted\|-(to see if it works)]
	[diff.inserted\|+three of those lines have]			[diff.inserted\|+three of those lines have]
	[diff.inserted\|+collapsed onto one]			[diff.inserted\|+collapsed onto one]
	#if false
	$ hg diff --config experimental.worddiff=True --color=debug			$ hg diff --config experimental.worddiff=True --color=debug
	[diff.diffline\|diff --git a/file1 b/file1]			[diff.diffline\|diff --git a/file1 b/file1]
	[diff.file_a\|--- a/file1]			[diff.file_a\|--- a/file1]
	[diff.file_b\|+++ b/file1]			[diff.file_b\|+++ b/file1]
	[diff.hunk\|@@ -1,16 +1,17 @@]			[diff.hunk\|@@ -1,16 +1,17 @@]
	[diff.deleted\|-this is the ][diff.deleted.highlight\|first][diff.deleted\| line]			[diff.deleted\|-][diff.deleted.changed\|this][diff.deleted.unchanged\| is the first ][diff.deleted.changed\|line]
	[diff.deleted\|-this is the second line]			[diff.deleted\|-][diff.deleted.unchanged\|this is the second line]
	[diff.deleted\|-][diff.deleted.highlight\| ][diff.deleted\|third line starts with space]			[diff.deleted\|-][diff.deleted.changed\| ][diff.deleted.unchanged\|third line starts with space]
	[diff.deleted\|-][diff.deleted.highlight\|+][diff.deleted\| starts with a ][diff.deleted.highlight\|plus][diff.deleted\| sign]			[diff.deleted\|-][diff.deleted.changed\|+][diff.deleted.unchanged\| starts with a ][diff.deleted.changed\|plus][diff.deleted.unchanged\| sign]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|this one with ][diff.deleted.highlight\|one][diff.deleted\| tab]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|this one with ][diff.deleted.changed\|one][diff.deleted.unchanged\| tab]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|now with full ][diff.deleted.highlight\|two][diff.deleted\| tabs]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|now with full ][diff.deleted.changed\|two][diff.deleted.unchanged\| tabs]
	[diff.deleted\|-][diff.tab\| ][diff.deleted\|now tabs][diff.tab\| ][diff.deleted\|everywhere, much fun]			[diff.deleted\|-][diff.tab\| ][diff.deleted.unchanged\|now ][diff.deleted.unchanged\|tabs][diff.tab\| ][diff.deleted.unchanged\|everywhere, much fun]
	[diff.inserted\|+that is the first paragraph]			[diff.inserted\|+][diff.inserted.changed\|that][diff.inserted.unchanged\| is the first ][diff.inserted.changed\|paragraph]
	[diff.inserted\|+][diff.inserted.highlight\| ][diff.inserted\|this is the ][diff.inserted.highlight\|second][diff.inserted\| line]			[diff.inserted\|+][diff.inserted.changed\| ][diff.inserted.unchanged\|this is the second line]
	[diff.inserted\|+third line starts with space]			[diff.inserted\|+][diff.inserted.unchanged\|third line starts with space]
	[diff.inserted\|+][diff.inserted.highlight\|-][diff.inserted\| starts with a ][diff.inserted.highlight\|minus][diff.inserted\| sign]			[diff.inserted\|+][diff.inserted.changed\|-][diff.inserted.unchanged\| starts with a ][diff.inserted.changed\|minus][diff.inserted.unchanged\| sign]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|this one with ][diff.inserted.highlight\|two][diff.inserted\| tab]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|this one with ][diff.inserted.changed\|two][diff.inserted.unchanged\| tab]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|now with full ][diff.inserted.highlight\|three][diff.inserted\| tabs]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|now with full ][diff.inserted.changed\|three][diff.inserted.unchanged\| tabs]
	[diff.inserted\|+][diff.tab\| ][diff.inserted\|now][diff.inserted.highlight\| there are][diff.inserted\| tabs][diff.tab\| ][diff.inserted\|everywhere, much fun]			[diff.inserted\|+][diff.tab\| ][diff.inserted.unchanged\|now ][diff.inserted.changed\|there are ][diff.inserted.unchanged\|tabs][diff.tab\| ][diff.inserted.unchanged\|everywhere, much fun]

	this line won't change			this line won't change

	two lines are going to			two lines are going to
	[diff.deleted\|-be changed into ][diff.deleted.highlight\|three][diff.deleted\|!]			[diff.deleted\|-][diff.deleted.unchanged\|be changed into ][diff.deleted.changed\|three][diff.deleted.unchanged\|!]
	[diff.inserted\|+(entirely magically,]			[diff.inserted\|+][diff.inserted.changed\|(entirely magically,]
	[diff.inserted\|+ assuming this works)]			[diff.inserted\|+][diff.inserted.changed\| assuming this works)]
	[diff.inserted\|+be changed into ][diff.inserted.highlight\|four][diff.inserted\|!]			[diff.inserted\|+][diff.inserted.unchanged\|be changed into ][diff.inserted.changed\|four][diff.inserted.unchanged\|!]

	[diff.deleted\|-three of those lines ][diff.deleted.highlight\|will]			[diff.deleted\|-][diff.deleted.unchanged\|three of those lines ][diff.deleted.changed\|will]
	[diff.deleted\|-][diff.deleted.highlight\|collapse][diff.deleted\| onto one]			[diff.deleted\|-][diff.deleted.changed\|collapse][diff.deleted.unchanged\| onto one]
	[diff.deleted\|-(to see if it works)]			[diff.deleted\|-][diff.deleted.changed\|(to see if it works)]
	[diff.inserted\|+three of those lines ][diff.inserted.highlight\|have]			[diff.inserted\|+][diff.inserted.unchanged\|three of those lines ][diff.inserted.changed\|have]
	[diff.inserted\|+][diff.inserted.highlight\|collapsed][diff.inserted\| onto one]			[diff.inserted\|+][diff.inserted.changed\|collapsed][diff.inserted.unchanged\| onto one]
	#endif

	multibyte character shouldn't be broken up in word diff:			multibyte character shouldn't be broken up in word diff:

	$ $PYTHON <<'EOF'			$ $PYTHON <<'EOF'
	> with open("utf8", "wb") as f:			> with open("utf8", "wb") as f:
	> f.write(b"blah \xe3\x82\xa2 blah\n")			> f.write(b"blah \xe3\x82\xa2 blah\n")
	> EOF			> EOF
	$ hg ci -Am 'add utf8 char' utf8			$ hg ci -Am 'add utf8 char' utf8
	$ $PYTHON <<'EOF'			$ $PYTHON <<'EOF'
	> with open("utf8", "wb") as f:			> with open("utf8", "wb") as f:
	> f.write(b"blah \xe3\x82\xa4 blah\n")			> f.write(b"blah \xe3\x82\xa4 blah\n")
	> EOF			> EOF
	$ hg ci -m 'slightly change utf8 char' utf8			$ hg ci -m 'slightly change utf8 char' utf8

	#if false
	$ hg diff --config experimental.worddiff=True --color=debug -c.			$ hg diff --config experimental.worddiff=True --color=debug -c.
	[diff.diffline\|diff --git a/utf8 b/utf8]			[diff.diffline\|diff --git a/utf8 b/utf8]
	[diff.file_a\|--- a/utf8]			[diff.file_a\|--- a/utf8]
	[diff.file_b\|+++ b/utf8]			[diff.file_b\|+++ b/utf8]
	[diff.hunk\|@@ -1,1 +1,1 @@]			[diff.hunk\|@@ -1,1 +1,1 @@]
	[diff.deleted\|-blah ][diff.deleted.highlight\|\xe3\x82\xa2][diff.deleted\| blah] (esc)			[diff.deleted\|-][diff.deleted.unchanged\|blah ][diff.deleted.changed\|\xe3\x82\xa2][diff.deleted.unchanged\| blah] (esc)
	[diff.inserted\|+blah ][diff.inserted.highlight\|\xe3\x82\xa4][diff.inserted\| blah] (esc)			[diff.inserted\|+][diff.inserted.unchanged\|blah ][diff.inserted.changed\|\xe3\x82\xa4][diff.inserted.unchanged\| blah] (esc)
	#endif