This is an archive of the discontinued Mercurial Phabricator instance.

grep: fixes errorneous output of grep in forward order
ClosedPublic

Authored by pulkit on Mar 26 2018, 10:21 AM.

Download Raw Diff

Details

Reviewers

yuja
sangeet259

Group Reviewers

hg-reviewers

Summary

If grep is passed a revset in forwards order via -r , say -r 0:tip
Then the output is errorneous. This patch fixes that. The ouput was wrong because
we deleted the last revision key in the matches and when we moved to the next
revision we didn't had this to comapare the diff. So the pstates dict was always
empty and in the SequenceMatcher, to convert and empty pstate to the states
dictionary you would always insert. This patch keeps the matches dictionary until
the end of this window and clears it at once when this window ends. This solves the
above mentioned problem and also do not cause any memory leak.

A more verbose explanation:

Demonstrating the problem with forward ordered grep `--all` and the solution

So if you do a hg grep "pattern" --all, you would get all the lines along with revision numbers showing when this pattern appeared or disappeared from the repository. Cool, that's what we want.

And by default, it does a reverse revlog search. That is, it starts from the latest revision and goes all the way upto the oldest one. And accordingly shows you the result.

All the demonstration are performed on this repository : https://bitbucket.org/sangeet259/hg_grep_all

*You can try to reproduce the following results by cloning it*

Okay, now coming to the point.
This is the output of hg grep "ruleid" --all on that repo.

histedit.py:6:-:        # So adding a ruleid in the last line !
histedit.py:6:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:        #     rev = node.bin(ruleid)
histedit.py:4:-:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:-:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:                    raise error.ParseError("invalid changeset %s" % ruleid)

This is what we expect. Now if you do hg grep -r 0:tip "ruleid" --all, what you want is that the older hits are shown first, that we start from the oldest revision and iterate upto the latest revision. But the result is unexpected :
Output of **hg grep -r 0:tip "ruleid" --all** on that repo.

histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:3:+:        #     rev = node.bin(ruleid)
histedit.py:3:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:3:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:3:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:3:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:5:+:        # ruleid bruhh.
histedit.py:5:+:        #     rev = node.bin(ruleid)
histedit.py:5:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:6:+:        # ruleid bruhh.
histedit.py:6:+:        #     rev = node.bin(ruleid)
histedit.py:6:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:            raise error.ParseError("invalid changeset %s" % ruleid)

Points to note:

There are no removals
Even in case of additions, they are just not correct

I realised this was due to the empty pstate dictionary that was being passed. And this dictionary was empty because at the end of each revision we are simply deleting the matches[rev].
So in the next revision it's passing [] as pstate, as the computation of diff via difflib.SequenceMatcher() requires comparing the previous state with this state, it's simply showing that the previous state is empty and to get the current state, you just need to add everything from the current state to the pstate. This is the reason why there are no removals in the output and the discrepancy even in the additions.

To solve this we need to simply keep the matches and not delete it. But as Jordi told, there will be a huge memory leak which we can not afford, so I came up with another solution of keeping the matches dictionary only till the end of this window and clearing it up once the window ends.
I did not make any changes to the line del revfiles[rev] . So we will know when a window ends when this revfiles dict gets empty.
Hence this :

del revfiles[rev]
if not revfiles:
    matches.clear()

Now see the result of hg grep --all -r 0:tip "ruleid" [The modified hg]

histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:-:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:        #     rev = node.bin(ruleid)
histedit.py:4:-:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:-:        # So adding a ruleid in the last line !
histedit.py:6:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:                    raise error.ParseError("invalid changeset %s" % ruleid)

We got the older hits first and the results are correct.

PS: There are other benifts of using clear over one by ine del on a dictionary.

Time: I compared the times of dict.clear() and one by one del dict[key]

IPython Shell

In [80]: a = dict.fromkeys(range(1000))

In [81]: %%time
    ...: for i in range(1000):
    ...:     del a[i]
    ...:
CPU times: user 261 micro sec, sys: 22 micro sec, total: 283 micro sec
Wall time: 191 micro sec

In [82]: a = dict.fromkeys(range(1000))

In [83]: %time a.clear()
CPU times: user 35 micro sec, sys: 0 ns, total: 35 micro sec
Wall time: 42 micro sec

Memory: clear() method apparently, according to this SO answer is more memory efficient

https://stackoverflow.com/questions/10446839/does-dictionarys-clear-method-delete-all-the-item-related-objects-from-memory/30519908#30519908

Diff Detail

Repository

rHG Mercurial

Lint

Lint Skipped

Unit

Unit Tests Skipped

Event Timeline

sangeet259 created this revision.Mar 26 2018, 10:21 AM

Herald added a reviewer: hg-reviewers. · View Herald TranscriptMar 26 2018, 10:21 AM

Herald added a subscriber: mercurial-devel. · View Herald Transcript

Since it required some explanation I have written a post on it.
http://sangeetmishra.me/2018/03/26/demonstrating-the-problem-with-forward-ordered-grep-all-in-mercurial.html

sangeet259 added a comment.Mar 26 2018, 10:24 AM

This comment was removed by sangeet259.

sangeet259 edited the summary of this revision. (Show Details)Mar 26 2018, 3:50 PM

sangeet259 updated this revision to Diff 7321.

This patch keeps the matches dictionary until
the end of this window and clears it at once when this window ends.

This is really helpful while reading the patch. Perhaps it's worth adding
an inline comment?

The change looks good, but can't be applied on the current tip. Can you
rebase?

This revision now requires changes to proceed.Mar 27 2018, 8:34 AM

@yuja Yeah, doing it right away, my bad. I should have based it on some revision which is already in the stable.

sangeet259 edited the summary of this revision. (Show Details)Mar 27 2018, 10:55 AM

sangeet259 updated this revision to Diff 7342.

Queued, thanks. Adjusted the commit message to close " (issue3885)".

This revision is now accepted and ready to land.Mar 27 2018, 11:16 AM

This was committed as https://www.mercurial-scm.org/repo/hg/rev/a2a6755a3def.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/commands.py (5 lines)
M			tests/test-grep.t (12 lines)

Diff 7342

mercurial/commands.py

	pstates = matches.get(parent, {}).get(copy or fn, [])			pstates = matches.get(parent, {}).get(copy or fn, [])
	if pstates or states:			if pstates or states:
	r = display(fm, fn, ctx, pstates, states)			r = display(fm, fn, ctx, pstates, states)
	found = found or r			found = found or r
	if r and not opts.get('all'):			if r and not opts.get('all'):
	skip[fn] = True			skip[fn] = True
	if copy:			if copy:
	skip[copy] = True			skip[copy] = True
	del matches[rev]
	del revfiles[rev]			del revfiles[rev]
				# We will keep the matches dict for the duration of the window
				# clear the matches dict once the window is over
				if not revfiles:
				matches.clear()
	fm.end()			fm.end()

	return not found			return not found

	@command('heads',			@command('heads',
	[('r', 'rev', '',			[('r', 'rev', '',
	_('show only heads which are descendants of STARTREV'), _('STARTREV')),			_('show only heads which are descendants of STARTREV'), _('STARTREV')),
	('t', 'topo', False, _('show topological heads only')),			('t', 'topo', False, _('show topological heads only')),

tests/test-grep.t

	1 files updated, 0 files merged, 0 files removed, 0 files unresolved			1 files updated, 0 files merged, 0 files removed, 0 files unresolved
	$ echo blue > color			$ echo blue > color
	$ hg commit -A -m "3 blue"			$ hg commit -A -m "3 blue"

	$ hg grep --all red			$ hg grep --all red
	color:3:-:red			color:3:-:red
	color:1:+:red			color:1:+:red


				Issue3885: test that changing revision order does not alter the
				revisions printed, just their order.

				$ hg grep --all red -r "all()"
				color:1:+:red
				color:3:-:red

				$ hg grep --all red -r "reverse(all())"
				color:3:-:red
				color:1:+:red

	$ cd ..			$ cd ..

	$ hg init a			$ hg init a
	$ cd a			$ cd a
	$ cp "$TESTDIR/binfile.bin" .			$ cp "$TESTDIR/binfile.bin" .
	$ hg add binfile.bin			$ hg add binfile.bin
	$ hg ci -m 'add binfile.bin'			$ hg ci -m 'add binfile.bin'
	$ hg grep "MaCam" --all			$ hg grep "MaCam" --all
	binfile.bin:0:+: Binary file matches			binfile.bin:0:+: Binary file matches

	$ cd ..			$ cd ..

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	7290		Mar 26 2018, 10:21 AM	★	★
Diff 2	7321		Mar 26 2018, 3:50 PM	★	★
Diff 3	7342		Mar 27 2018, 10:54 AM	★	★