Page MenuHomePhabricator

linelog: add a Python implementation of the linelog datastructure

Authored by durin42 on Aug 1 2018, 11:17 AM.



This datastructure was originally developed by Jun Wu at Facebook,
inspired by SCCS weaves. It's useful as a cache for blame information,
but also is the magic that makes hg absorb easy to implement. In
service of importing the code to Mercurial, I wanted to actually
/understand/ it, and once I did I decided to take a run at
implementing it.

The help/internals/linelog.txt document is the README from Jun Wu's
implementaiton. It all applies to our linelog implementation.

Diff Detail

rHG Mercurial
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

durin42 created this revision.Aug 1 2018, 11:17 AM
indygreg requested changes to this revision.Aug 1 2018, 1:59 PM
indygreg added a subscriber: indygreg.

I wrote several comments. But overall this code seems very reasonable as a first implementation. Especially if we want to maintain backwards compatibility with the C implementation for the initial import. My mind was blown when I realized linelog was an interpreted bytecode. Crazy town.

Most of my comments can be deferred to follow-ups. I'd be tempted to defer all the performance-related changes to follow-ups so we can measure the impact they have. I'm actually curious about that.

I probably could grant review. I figured I'd send this back to you in case you wanted to make some changes. I also think this may want another set of eyes because it is a big piece of code. And I'm not that great with algorithms and kind of skimmed some of the lower-level logic around operation traversal. There's a lot to digest!


Nit: I think we use > to indicate big-endian elsewhere.

Also, I'm a fan of using little-endian for on-disk formats to save conversion operations since x86 is little-endian. Not that it matters given the overhead of Python. But it can come into play when e.g. implementing these things in Rust. I'm inclined to ignore it for now. As long as we have a mechanism for versioning the on-disk and exchanged formats.


We may want to define slots=True on this and annotateresult so objects take up less space. Could be done as a follow-up.


abc requires module import time computation, which adds overhead.

I'd encourage you to use interfaceutil add supplement to perform the interface conformance tests not at run time.


While I like the abstraction of instructions, given the simplicity of the language and the overhead of function calls in Python, I wonder if we'd be better off with the execution logic inlined. The performance speedup is already significant with this code. So deferring on performance optimization seems reasonable.


It feels like we may want to use attrs with slots=True for these types.


Constants might be a bit nicer to read.


I think these want docstrings.


Nit: maybe display the instruction count instead / as well?


We don't really use @classmethod in Mercurial. Consider breaking out into a normal function.


Nit: drop the parens


Use pycompat.xrange (we should probably establish a lint for this).


Would it be faster to implement this as a list comprehension? I can't recall if Python optimizes away the overhead of list.append in that case. Could be done as a follow-up.


I'm assuming programs can get a bit large? We may want to turn this into a generator of chunks.


I agree. I'm not a fan of the API. But this can be cleaned up later.


self._lastannotate may be None. I assume this is part of the API for the same reason as annotateresult.



This revision now requires changes to proceed.Aug 1 2018, 1:59 PM

Oh, I'd also appreciate replacing blame with annotate throughout this series so we can avoid the culture of negativity. I've actually heard people commend Mercurial over <other VCSs> because annotate is the primary verb in the UI!

durin42 marked 4 inline comments as done.Aug 1 2018, 6:47 PM
durin42 updated this revision to Diff 9759.
durin42 added inline comments.Aug 1 2018, 6:47 PM

Yep. This code is shaped for maximum comprehension now, and I'm sure we can buy more speed later if we want.


Yep. I wanted to import absorb and fastannotate before iterating on the API so we'd be looking at both consumers as we refactored.

indygreg accepted this revision.Aug 1 2018, 7:29 PM

I'm OK breaking out my rubber stamp for this.

This revision is now accepted and ready to land.Aug 1 2018, 7:29 PM
This revision was automatically updated to reflect the committed changes.

I'm still just trying to understand how weaves work. Here are some questions for you for now. We may want to document some of the answers in a follow-up patch (not just here in Phabricator).


Does that mean that we don't produce these cases? Or we fail if they happen and we fall back to old annotate?


Could we add content to these examples to make them clearer? I don't follow how the rewrite works without seeing the content in there. I can imagine a rewrite that looks something like the following, but that's not what the example says, so I'm probably missing something.

^AI/D x     ^AI/D x
foo         foo
            ^AE x
^AD/I y  -> ^AI/D x
            ^AD/I y
bar         bar
^AE x       ^AE y
            ^AE x
            ^AD/I y
baz         baz
^AE y       ^AE y

Same here, adding content might help. I imagine it means something like this (just the LHS):

^AI x + 1
^AI x
^AE x
^AE x + 1

But what does that even mean? That bar got added in revision x and then it got added again in revision x+1? I suppose it means it won't be added the second time, but I agree that it makes sense to consider that malformed.


Why is insertion inside an earlier deletion considered invalid? That seems like what would happen when you revive a line (as you also say below). How do we model that instead?


I agree that deletion inside deletion should be invalid, but why isn't that part of item 2 instead? It seems more similar to that case (i.e. nested deletion seems more like nested insertion to me).


How about nested deletion inside insertion? Why is that different?


I assume this should read ^AE x + 1 on the RHS

I'm still just trying to understand how weaves work. Here are some questions for you for now. We may want to document some of the answers in a follow-up patch (not just here in Phabricator).

Tragically, I think you understand linelog.txt better than I do at this point - maybe send me patches as you work things out?