This is an archive of the discontinued Mercurial Phabricator instance.

tersestatus: re-implement the functionality to terse the status
ClosedPublic

Authored by pulkit on Oct 6 2017, 4:12 PM.

Details

Summary

The previous terse status implementation was hacking around os.listdir() and was
flaky. There have been a lot of instances of mercurial buildbots failing
and google's internal builds failing because of the
hacky implementation of terse status. Even though I wrote the last
implementation but it was hard for me to find the reason for the flake.

The new implementation can be slower than the old one but is clean and easy to
understand.

In this we create a node object for each directory and create a tree
like structure starting from the root of the working copy. While building the
tree like structure we store some information on the nodes which will be helpful
for deciding later whether we can terse the dir or not.
Once the whole tree is build we traverse and built the list of files for each
status with required tersing.

There is no behaviour change as the old test, test-status-terse.t passes with
the new implementation.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

pulkit created this revision.Oct 6 2017, 4:12 PM
dlax requested changes to this revision.Oct 7 2017, 4:23 AM
dlax added a subscriber: dlax.

Just a couple of comments (mostly Python things) based on a first quick read. I haven't closely looked at the algorithm yet.

Thanks for working back on this!

mercurial/cmdutil.py
445–567
if s not in stdic:
    raise ...
454

No need for the intermediate statusdic, just iterate as for name in ('modified', 'added', ...): and use name[0] as sname.

Alternatively (probably better), dirnode's __init__ could take the statuslist as an argument and do the addfile() calls itself.

459

I wonder if using a dict keyed by "status short name" (later transformed into a list of lists) instead of tersedlist that is queried by index through the stdic indirection wouldn't be simpler.

468

Unless I'm missing something, there's no need for this subdirlist, you can just do for subdir in rootobj.subdirs.values():.

473
for sublist in tersedlist:
    sublist.sort()
mercurial/commands.py
4818

For readability, I'd suggest to only have a single call to repo.status() and define the ignored, clean and unknown kwargs in the if terse: before that call.

This revision now requires changes to proceed.Oct 7 2017, 4:23 AM
pulkit updated this revision to Diff 2520.Oct 7 2017, 5:40 PM
pulkit marked 4 inline comments as done.Oct 7 2017, 5:53 PM
pulkit added inline comments.
mercurial/cmdutil.py
445–567

Done thanks!

454

I have implemented the first advice.

For passing statlist to __init__, I disagree with that. If you see the line above, that's also calling __init__ which is creating node objects for the subdirectories. If I pass statlist to __init__, I will need some kind of new way to find what will be the statlist which is needed to pass there.

459

I didn't exactly understand whether you mean that will be simpler or not but I have updated the implementation using a dictionary. It that's what you meant, thanks. Good advice!

468

Ah yes. Forgot to cleanup. Thanks!

473

Thanks!

mercurial/commands.py
4818

I can implement your suggestion using following ways:

  1. add 'clean' and 'unknown' to show variable and take them out later as that variable is used later also
  1. add a new variable which handles this.

I certainly don't like first one and hesitant to go for second one as that seems kind of ugly. But if you insist me I can go with second option or a better way if you suggest any.

Have not adopted the suggestion in the updated version.

dlax added a comment.Oct 12 2017, 6:22 AM

Took the time to read this in details. I agree that the algorigthm/idea is fairly simple to grasp.

I don't remember the old algorithm so I'd trust @pulkit when he says that it's easier to follow/understand. I'm not sure about the "flakyness" mentioned in the commit message, but even if this does not resolve this issue (which is pretty vaguely described at this point), I think having an understandable algorithm would help in the future.

mercurial/cmdutil.py
452

if subdir not in self.subdirs

476

I think this function and the one above could be methods of the dirnode class as they essentially manipulate the state of a dirnode object. At least for me, while reading the code, that would have been clearer.

559

Here and in the above _addfilestotersed() call, it's not easy to follow if something happens to tersedict. In fact, from a general point of view, I think it's not a very good practice to mutate function arguments. So instead, you could have these functions return (or better yield) (shortstatus, filepath) tuples from which you could update the dict here and thus avoid passing it in these functions.

In the caller, that would give:

diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
--- a/mercurial/cmdutil.py
+++ b/mercurial/cmdutil.py
@@ -551,11 +549,13 @@ def tersedir(statuslist, terseargs):
         tersedict[attrname[0]] = []
 
     # we won't be tersing the root dir, so add files in it
-    _addfilestotersed(rootobj, tersedict)
+    for st, fpath in _addfilestotersed(rootobj):
+        tersedict[st].append(fpath)
 
     # process each sub-directory and build tersedlist
     for subdir in rootobj.subdirs.values():
-        _processtersestatus(subdir, tersedict, terseargs)
+        for st, f in _processtersestatus(subdir, terseargs):
+            tersedict[st].append(f)

(of course function names should be changed).

Finally, I think _processtersestatus() can be slightly modified to avoid the call to _addfilestotersed() here by behaving differently if the current dirnode is the root dir or not. That would give a single loop in the diff above.

dlax requested changes to this revision.Oct 12 2017, 6:23 AM
This revision now requires changes to proceed.Oct 12 2017, 6:23 AM
dlax added inline comments.Oct 12 2017, 6:24 AM
mercurial/commands.py
4818

Ok, let's keep this for later (but not forget it :))

pulkit marked 3 inline comments as done.Oct 12 2017, 1:20 PM
In D985#17043, @dlax wrote:

Took the time to read this in details. I agree that the algorigthm/idea is fairly simple to grasp.

Yay! thanks.

I don't remember the old algorithm so I'd trust @pulkit when he says that it's easier to follow/understand. I'm not sure about the "flakyness" mentioned in the commit message, but even if this does not resolve this issue (which is pretty vaguely described at this point), I think having an understandable algorithm would help in the future.

For context, the idea of old algorithm was using os.listdir() to list files in each directory and check their status and terse accordingly. But due to edge cases arising due to the idea, the implementation was very much hacky.
I am sorry I didn't mentioned about flakyness. The flakyness is observed in our mercurial buildbots and some internal buildbots at Google. I am unable to find failed builds at the moment but the builds occasionally fails with test-status-terse.t and it was happening from couple of months. I will include this thing in updated version.

mercurial/cmdutil.py
559

Well I agree with you on this. Yesterday I accidentally stumbled on your repo at hg.logilab.org/ by clicking a link in one of the commit message and saw that you have the above feedback implemented yourself. Being lazy, I want to ask you, will you like to followup on these or I can just pull from them and fold them in this commit giving proper credits to you.

dlax added inline comments.Oct 12 2017, 3:21 PM
mercurial/cmdutil.py
559

I'm uncovered... Please pull and fold; that's a good start and there's probably more to do.

pulkit edited the summary of this revision. (Show Details)Oct 12 2017, 7:56 PM
pulkit updated this revision to Diff 2645.

The next two patches D1042 and D1043, though sent by me are patches by Denis. They were some good cleanups which I think was wrong to fold into my original patch, hence sent separately.

dlax added a comment.Oct 13 2017, 5:42 AM
In D985#17380, @pulkit wrote:

The next two patches D1042 and D1043, though sent by me are patches by Denis. They were some good cleanups which I think was wrong to fold into my original patch, hence sent separately.

Surprisingly, Phabricator does not make the patch user information visible... but through phabread, I get it.

durin42 added inline comments.
mercurial/cmdutil.py
406–408

Perversely, locals are significantly faster than globals in Python. Not going to block landing this patch on the change, but it *might* show up in a perf* command to cache this locally in the function it's used in.

durin42 accepted this revision.Oct 14 2017, 12:46 AM
This revision was automatically updated to reflect the committed changes.