We've spent a lot of time hacking around the outputs of hg in jf (a new
tool we use at FB to interact with our code-review system). We've figured out that it
could be all avoided and made cleaner with simple hg extension that dumps all
the data we need in a format that we can easily consume.
Details
- Reviewers
ryanmce - Group Reviewers
Restricted Project - Commits
- rFBHGX1843e9c0a000: crdump: introduce extension to dump data for code review tools
see attached test
Diff Detail
- Repository
- rFBHGX Facebook Mercurial Extensions
- Branch
- default
- Lint
Lint OK - Unit
Unit Test Errors
Event Timeline
If it's not end-user facing, maybe make it a debug command? Note some fields (ex. desc) would have encoding issues if it's not UTF-8.
Looks super solid overall. Just a few issues and nitpicks. Nice work!
hgext3rd/crdump.py | ||
---|---|---|
18–21 | Shouldn't this use extensions.find()? | |
26 | this could be a revset, so it might be worth making "revision" plural: "revisions to dump" | |
43 | full path or relative to output_directory? | |
48 | base_tmp_dir isn't a thing according to this doc | |
77 | nit: prefer cdata: it's clearly data about commits, not the commits themselves. | |
92 | I'd feel happier if we used phases.CONSTANT rather than hardcoding the "public" name here. | |
104–105 | Comment about the format of the data here and why splitting on @ is the right thing please. | |
115 | What does except None do? Seems weird and not what you want: https://stackoverflow.com/questions/19327320/python-except-none | |
120 | We want full context -- as much as possible at any rate, right? | |
128 | So this would create paths named like foo/bar/baz_HASH, right? That looks pretty weird to me. I'd rather do something like: HASH.bin/foo/bar/baz (I acknowledge that this is bike-shedding so feel free to ignore) | |
138 | you're mixing styles here. Prefer binaryfiles = [] | |
140–156 | What happens when a binary file becomes non-binary, or visa-versa? What does the diff look like? Is this tested? What happens on the phabricator side? | |
160–163 | No, let's not replicate this logic please. Let's reuse what we already have in this repo. Also, I don't think it will always necessarily have facebook.com -- what about our inevitable switch to fb.com again? | |
166 | Is this the same revset we use in jf? Should this be passed in in case we need to change it in jf? |
hgext3rd/crdump.py | ||
---|---|---|
18–21 | That's how phrevset is doing it. | |
115 | My bad. | |
120 | It's using the diff.unified from the config. But I suppose we can hardcode it here. I'll add also git=True and binary=False. | |
140–156 | no idea what happens on phabricator side, we need to figure it out. I'll add a test. | |
160–163 | I don't want this extension to have to rely on phrevset extension being enabled just to use this little regex from it. I'll copy it to a constant. | |
166 | In jf we don't have "public base" revset. And I think for the function name it's the only revset that's correct :) |
responded to all the issues except enconding issue mentioned by @quark (will do it later)
https://www.mercurial-scm.org/wiki/EncodingStrategy says that commit messages and usernames are stored in UTF8 encoding.
Just a few nitpicks, land when responded to or addressed.
hgext3rd/crdump.py | ||
---|---|---|
18–21 | Does that mean it's right? | |
29 | Comment about this number? Or pass it in via -U? (we can do this later though) | |
51 | for clarity and consistency with below, let's say "path to file containing..." | |
56 | for clarity and consistency with the below, let's use "path to file relative to repo root" | |
120 | That sounds good for now. | |
166 | We do have a public base revset for the bundle upload, actually. Can you check what we use there? I optimized it to be fast and correct iirc. | |
183 | jf uses: const base = `last(::ancestor(${revs}) & public())`; I believe this may be faster than what you have; can you test? Note: you don't need the ancestor bit because there's only one rev at this point. | |
tests/test-crdump.t | ||
220 | What? |
hgext3rd/crdump.py | ||
---|---|---|
128 | it's hash of the filenode |
One final thing: I think it still might make sense to split up the information collection calls into two: one for the "heavy" data (diff, binary files, list of filenames), and one for the changelog-only data (so we can reserve revisions earlier, especially if the diff generation is slow). It doesn't prevent us from keeping crdump this way though.
Regardless, let's ship this and start testing it out!
hgext3rd/crdump.py | ||
---|---|---|
128 | Ah, in that case let's just drop the path part. |
Shouldn't this use extensions.find()?