( )⚙ D3990 linelog: add a Python implementation of the linelog datastructure

This is an archive of the discontinued Mercurial Phabricator instance.

linelog: add a Python implementation of the linelog datastructure
ClosedPublic

Authored by durin42 on Aug 1 2018, 11:17 AM.

Details

Summary

This datastructure was originally developed by Jun Wu at Facebook,
inspired by SCCS weaves. It's useful as a cache for blame information,
but also is the magic that makes hg absorb easy to implement. In
service of importing the code to Mercurial, I wanted to actually
/understand/ it, and once I did I decided to take a run at
implementing it.

The help/internals/linelog.txt document is the README from Jun Wu's
implementaiton. It all applies to our linelog implementation.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

durin42 created this revision.Aug 1 2018, 11:17 AM
indygreg requested changes to this revision.Aug 1 2018, 1:59 PM
indygreg added a subscriber: indygreg.

I wrote several comments. But overall this code seems very reasonable as a first implementation. Especially if we want to maintain backwards compatibility with the C implementation for the initial import. My mind was blown when I realized linelog was an interpreted bytecode. Crazy town.

Most of my comments can be deferred to follow-ups. I'd be tempted to defer all the performance-related changes to follow-ups so we can measure the impact they have. I'm actually curious about that.

I probably could grant review. I figured I'd send this back to you in case you wanted to make some changes. I also think this may want another set of eyes because it is a big piece of code. And I'm not that great with algorithms and kind of skimmed some of the lower-level logic around operation traversal. There's a lot to digest!

mercurial/linelog.py
31

Nit: I think we use > to indicate big-endian elsewhere.

Also, I'm a fan of using little-endian for on-disk formats to save conversion operations since x86 is little-endian. Not that it matters given the overhead of Python. But it can come into play when e.g. implementing these things in Rust. I'm inclined to ignore it for now. As long as we have a mechanism for versioning the on-disk and exchanged formats.

36

We may want to define slots=True on this and annotateresult so objects take up less space. Could be done as a follow-up.

56

abc requires module import time computation, which adds overhead.

I'd encourage you to use interfaceutil add supplement test-check-interfaces.py to perform the interface conformance tests not at run time.

78–89

While I like the abstraction of instructions, given the simplicity of the language and the overhead of function calls in Python, I wonder if we'd be better off with the execution logic inlined. The performance speedup is already significant with this code. So deferring on performance optimization seems reasonable.

91–96

It feels like we may want to use attrs with slots=True for these types.

211

Constants might be a bit nicer to read.

223–224

I think these want docstrings.

240–241

Nit: maybe display the instruction count instead / as well?

248

We don't really use @classmethod in Mercurial. Consider breaking out into a normal function.

254

Nit: drop the parens

266

Use pycompat.xrange (we should probably establish a lint for this).

267

Would it be faster to implement this as a list comprehension? I can't recall if Python optimizes away the overhead of list.append in that case. Could be done as a follow-up.

270–272

I'm assuming programs can get a bit large? We may want to turn this into a generator of chunks.

371–377

I agree. I'm not a fan of the API. But this can be cleaned up later.

379–380

self._lastannotate may be None. I assume this is part of the API for the same reason as annotateresult.

394

pycompat.xrange.

This revision now requires changes to proceed.Aug 1 2018, 1:59 PM

Oh, I'd also appreciate replacing blame with annotate throughout this series so we can avoid the culture of negativity. I've actually heard people commend Mercurial over <other VCSs> because annotate is the primary verb in the UI!

durin42 marked 4 inline comments as done.Aug 1 2018, 6:47 PM
durin42 updated this revision to Diff 9759.
durin42 added inline comments.Aug 1 2018, 6:47 PM
mercurial/linelog.py
78–89

Yep. This code is shaped for maximum comprehension now, and I'm sure we can buy more speed later if we want.

379–380

Yep. I wanted to import absorb and fastannotate before iterating on the API so we'd be looking at both consumers as we refactored.

indygreg accepted this revision.Aug 1 2018, 7:29 PM

I'm OK breaking out my rubber stamp for this.

This revision is now accepted and ready to land.Aug 1 2018, 7:29 PM
This revision was automatically updated to reflect the committed changes.

I'm still just trying to understand how weaves work. Here are some questions for you for now. We may want to document some of the answers in a follow-up patch (not just here in Phabricator).

mercurial/help/internals/linelog.txt
111

Does that mean that we don't produce these cases? Or we fail if they happen and we fall back to old annotate?

116–119

Could we add content to these examples to make them clearer? I don't follow how the rewrite works without seeing the content in there. I can imagine a rewrite that looks something like the following, but that's not what the example says, so I'm probably missing something.

^AI/D x     ^AI/D x
foo         foo
            ^AE x
^AD/I y  -> ^AI/D x
            ^AD/I y
bar         bar
^AE x       ^AE y
            ^AE x
            ^AD/I y
baz         baz
^AE y       ^AE y
124–127

Same here, adding content might help. I imagine it means something like this (just the LHS):

^AI x + 1
foo
^AI x
bar
^AE x
baz
^AE x + 1

But what does that even mean? That bar got added in revision x and then it got added again in revision x+1? I suppose it means it won't be added the second time, but I agree that it makes sense to consider that malformed.

129

Why is insertion inside an earlier deletion considered invalid? That seems like what would happen when you revive a line (as you also say below). How do we model that instead?

129

I agree that deletion inside deletion should be invalid, but why isn't that part of item 2 instead? It seems more similar to that case (i.e. nested deletion seems more like nested insertion to me).

129

How about nested deletion inside insertion? Why is that different?

135

I assume this should read ^AE x + 1 on the RHS

I'm still just trying to understand how weaves work. Here are some questions for you for now. We may want to document some of the answers in a follow-up patch (not just here in Phabricator).

Tragically, I think you understand linelog.txt better than I do at this point - maybe send me patches as you work things out?