This is an archive of the discontinued Mercurial Phabricator instance.

grep: fixes errorneous output of grep in forward order
ClosedPublic

Authored by pulkit on Mar 26 2018, 10:21 AM.

Details

Reviewers
yuja
sangeet259
Group Reviewers
hg-reviewers
Summary

If grep is passed a revset in forwards order via -r , say -r 0:tip
Then the output is errorneous. This patch fixes that. The ouput was wrong because
we deleted the last revision key in the matches and when we moved to the next
revision we didn't had this to comapare the diff. So the pstates dict was always
empty and in the SequenceMatcher, to convert and empty pstate to the states
dictionary you would always insert. This patch keeps the matches dictionary until
the end of this window and clears it at once when this window ends. This solves the
above mentioned problem and also do not cause any memory leak.


A more verbose explanation:

Demonstrating the problem with forward ordered grep --all and the solution

So if you do a hg grep "pattern" --all, you would get all the lines along with revision numbers showing when this pattern appeared or disappeared from the repository. Cool, that's what we want.

And by default, it does a reverse revlog search. That is, it starts from the latest revision and goes all the way upto the oldest one. And accordingly shows you the result.

All the demonstration are performed on this repository : https://bitbucket.org/sangeet259/hg_grep_all

*You can try to reproduce the following results by cloning it*

Okay, now coming to the point.
This is the output of hg grep "ruleid" --all on that repo.

histedit.py:6:-:        # So adding a ruleid in the last line !
histedit.py:6:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:        #     rev = node.bin(ruleid)
histedit.py:4:-:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:-:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:                    raise error.ParseError("invalid changeset %s" % ruleid)

This is what we expect. Now if you do hg grep -r 0:tip "ruleid" --all, what you want is that the older hits are shown first, that we start from the oldest revision and iterate upto the latest revision. But the result is unexpected :
Output of **hg grep -r 0:tip "ruleid" --all** on that repo.

histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:3:+:        #     rev = node.bin(ruleid)
histedit.py:3:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:3:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:3:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:3:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:5:+:        # ruleid bruhh.
histedit.py:5:+:        #     rev = node.bin(ruleid)
histedit.py:5:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:            raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:6:+:        # ruleid bruhh.
histedit.py:6:+:        #     rev = node.bin(ruleid)
histedit.py:6:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:            ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:            _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:            raise error.ParseError("invalid changeset %s" % ruleid)

Points to note:

  1. There are no removals
  2. Even in case of additions, they are just not correct

I realised this was due to the empty pstate dictionary that was being passed. And this dictionary was empty because at the end of each revision we are simply deleting the matches[rev].
So in the next revision it's passing [] as pstate, as the computation of diff via difflib.SequenceMatcher() requires comparing the previous state with this state, it's simply showing that the previous state is empty and to get the current state, you just need to add everything from the current state to the pstate. This is the reason why there are no removals in the output and the discrepancy even in the additions.

To solve this we need to simply keep the matches and not delete it. But as Jordi told, there will be a huge memory leak which we can not afford, so I came up with another solution of keeping the matches dictionary only till the end of this window and clearing it up once the window ends.
I did not make any changes to the line del revfiles[rev] . So we will know when a window ends when this revfiles dict gets empty.
Hence this :

del revfiles[rev]
if not revfiles:
    matches.clear()

Now see the result of hg grep --all -r 0:tip "ruleid" [The modified hg]

histedit.py:0:+:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:        # # ruleid can be anything from rev numbers, hashes, "bookmarks" etc
histedit.py:0:+:        #     rev = node.bin(ruleid)
histedit.py:0:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:0:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:0:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:0:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:3:-:        # ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:        #     rev = node.bin(ruleid)
histedit.py:4:-:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:        # ruleid bruhh.
histedit.py:4:+:        #     rev = node.bin(ruleid)
histedit.py:4:+:        #         _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:        #         raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:4:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:4:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:4:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:5:+:        # So adding a ruleid in the last line !
histedit.py:5:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:5:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:5:+:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:-:        # So adding a ruleid in the last line !
histedit.py:6:-:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:-:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:-:                    raise error.ParseError("invalid changeset %s" % ruleid)
histedit.py:6:+:                    ruleid = rule.strip().split(' ', 1)[0]
histedit.py:6:+:                    _ctx = scmutil.revsingle(state.repo, ruleid)
histedit.py:6:+:                    raise error.ParseError("invalid changeset %s" % ruleid)

We got the older hits first and the results are correct.

PS: There are other benifts of using clear over one by ine del on a dictionary.

  1. Time: I compared the times of dict.clear() and one by one del dict[key]
IPython Shell

In [80]: a = dict.fromkeys(range(1000))

In [81]: %%time
    ...: for i in range(1000):
    ...:     del a[i]
    ...:
CPU times: user 261 micro sec, sys: 22 micro sec, total: 283 micro sec
Wall time: 191 micro sec

In [82]: a = dict.fromkeys(range(1000))

In [83]: %time a.clear()
CPU times: user 35 micro sec, sys: 0 ns, total: 35 micro sec
Wall time: 42 micro sec
  1. Memory: clear() method apparently, according to this SO answer is more memory efficient

https://stackoverflow.com/questions/10446839/does-dictionarys-clear-method-delete-all-the-item-related-objects-from-memory/30519908#30519908

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

sangeet259 created this revision.Mar 26 2018, 10:21 AM
This comment was removed by sangeet259.
sangeet259 edited the summary of this revision. (Show Details)Mar 26 2018, 3:50 PM
sangeet259 updated this revision to Diff 7321.
yuja requested changes to this revision.Mar 27 2018, 8:34 AM
yuja added a subscriber: yuja.

This patch keeps the matches dictionary until
the end of this window and clears it at once when this window ends.

This is really helpful while reading the patch. Perhaps it's worth adding
an inline comment?

The change looks good, but can't be applied on the current tip. Can you
rebase?

This revision now requires changes to proceed.Mar 27 2018, 8:34 AM
sangeet259 added a comment.EditedMar 27 2018, 9:43 AM

@yuja Yeah, doing it right away, my bad. I should have based it on some revision which is already in the stable.

sangeet259 edited the summary of this revision. (Show Details)Mar 27 2018, 10:55 AM
sangeet259 updated this revision to Diff 7342.
yuja accepted this revision.Mar 27 2018, 11:16 AM

Queued, thanks. Adjusted the commit message to close " (issue3885)".

This revision is now accepted and ready to land.Mar 27 2018, 11:16 AM
pulkit commandeered this revision.
pulkit closed this revision.