Details

Reviewers

Group Reviewers

Restricted Project

Commits

rFBHGX1843e9c0a000: crdump: introduce extension to dump data for code review tools

Summary

We've spent a lot of time hacking around the output of hg in jf (a new
tool we use at FB to interact with our code-review system). We've figured out that
all of this could be avoided, and made cleaner, with a simple hg extension that
dumps all the data we need in a format that we can easily consume.

Test Plan

see attached test

Diff Detail

Repository

rFBHGX Facebook Mercurial Extensions

Branch

default

Lint

Lint OK

Unit

Unit Test Errors

Event Timeline

mitrandir created this revision. · Aug 1 2017, 10:49 AM

Herald added a reviewer: Restricted Project. · Aug 1 2017, 10:49 AM

refine help

mitrandir added a reviewer: ryanmce. · Aug 1 2017, 10:56 AM

tested and fixed the case of binary file removal

If it's not end-user facing, maybe make it a debug command? Note that some fields (e.g. desc) would have encoding issues if they are not UTF-8.

Looks super solid overall. Just a few issues and nitpicks. Nice work!

hgext3rd/crdump.py
18–21	Shouldn't this use `extensions.find()`?
26	this could be a revset, so it might be worth making "revision" plural: "revisions to dump"
43	full path or relative to `output_directory`?
48	`base_tmp_dir` isn't a thing according to this doc
77	nit: prefer `cdata`: it's clearly data about commits, not the commits themselves.
92	I'd feel happier if we used phases.CONSTANT rather than hardcoding the "public" name here.
104–105	Comment about the format of the data here and why splitting on @ is the right thing please.
115	What does `except None` do? Seems weird and not what you want: https://stackoverflow.com/questions/19327320/python-except-none
120	We want full context -- as much as possible at any rate, right?
128	So this would create paths named like `foo/bar/baz_HASH`, right? That looks pretty weird to me. I'd rather do something like: `HASH.bin/foo/bar/baz` (I acknowledge that this is bike-shedding so feel free to ignore)
138	you're mixing styles here. Prefer `binaryfiles = []`
140–156	What happens when a binary file becomes non-binary, or vice versa? What does the diff look like? Is this tested? What happens on the phabricator side?
160–163	No, let's not replicate this logic please. Let's reuse what we already have in this repo. Also, I don't think it will always necessarily have facebook.com -- what about our inevitable switch to fb.com again?
166	Is this the same revset we use in jf? Should this be passed in in case we need to change it in jf?
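
The concern on line 115 is about cleaning up the temporary directory when the dump fails. A minimal sketch of that pattern follows; it is not the code from this diff, and `dump_with_cleanup` and its `write_files` callback are hypothetical names used only for illustration. It catches `Exception` explicitly and re-raises with a bare `raise`, so the original traceback is preserved.

```python
import shutil
import tempfile
from pathlib import Path


def dump_with_cleanup(write_files):
    """Create a temp directory, fill it, and remove it if anything fails.

    `write_files` is a hypothetical callback that receives the directory path.
    """
    outdir = Path(tempfile.mkdtemp(suffix="hg.crdump"))
    try:
        write_files(outdir)
    except Exception:
        # Remove the partial output, then re-raise the original exception.
        shutil.rmtree(outdir, ignore_errors=True)
        raise
    return outdir
```

On success the caller owns the directory and is responsible for deleting it, which matches how the test in this revision removes `output_directory` at the end.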

This revision now requires changes to proceed. · Aug 1 2017, 2:51 PM

mitrandir added inline comments. · Aug 2 2017, 8:40 AM

hgext3rd/crdump.py
18–21	That's how phrevset is doing it.
115	My bad.
120	It's using the diff.unified from the config. But I suppose we can hardcode it here. I'll add also git=True and binary=False.
140–156	no idea what happens on phabricator side, we need to figure it out. I'll add a test.
160–163	I don't want this extension to have to rely on phrevset extension being enabled just to use this little regex from it. I'll copy it to a constant.
166	In jf we don't have "public base" revset. And I think for the function name it's the only revset that's correct :)
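
For the 160–163 thread, a sketch of what "copy it to a constant" could look like while also tolerating a future fb.com host. The pattern below is an assumption made for illustration; it is not the regex from phrevset, and `DIFFERENTIAL_REGEX` and `phabricator_revision` are hypothetical names.

```python
import re

# Hypothetical constant: accepts both facebook.com and fb.com hosts.
DIFFERENTIAL_REGEX = re.compile(
    r"Differential Revision:\s*\S*?(?:facebook|fb)\.com\S*/D(\d+)"
)


def phabricator_revision(description):
    """Return the numeric revision id from a commit message, or '' if absent."""
    match = DIFFERENTIAL_REGEX.search(description)
    return match.group(1) if match else ""
```

Keeping the pattern in one module-level constant means a host change touches a single line, without requiring another extension to be enabled.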

responded to all the issues except the encoding issue mentioned by @quark (will do it later)

https://www.mercurial-scm.org/wiki/EncodingStrategy says that commit messages and usernames are stored in UTF8 encoding.
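
A small sketch of the encoding concern, assuming the dumping side holds commit metadata as bytes and wants JSON-safe text. `to_json_text` is a hypothetical helper, and replacing undecodable bytes is one possible policy, not something this revision decided on.

```python
import json


def to_json_text(raw):
    """Decode commit metadata bytes as UTF-8, replacing undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


# Well-formed UTF-8 round-trips unchanged.
desc = to_json_text("fix caf\u00e9 menu".encode("utf-8"))
# Bytes in a legacy encoding are kept, with U+FFFD marking the bad byte.
legacy = to_json_text("caf\u00e9".encode("latin-1"))
payload = json.dumps({"desc": desc, "legacy": legacy})
```

If the stored data really is always UTF-8, the replacement branch never triggers and the helper is a plain decode.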

Just a few nitpicks, land when responded to or addressed.

hgext3rd/crdump.py
18–21	Does that mean it's right?
29	Comment about this number? Or pass it in via -U? (we can do this later though)
51	for clarity and consistency with below, let's say "path to file containing..."
56	for clarity and consistency with the below, let's use "path to file relative to repo root"
120	That sounds good for now.
166	We do have a public base revset for the bundle upload, actually. Can you check what we use there? I optimized it to be fast and correct iirc.
183	jf uses: const base = `last(::ancestor(${revs}) & public())`; I believe this may be faster than what you have; can you test? Note: you don't need the ancestor bit because there's only one rev at this point.
tests/test-crdump.t
220	What?
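
The two public-base revsets discussed for lines 166 and 183 select the same revision; only their evaluation cost may differ. The toy model below illustrates that on a plain parent map. It is not Mercurial code: `ancestors`, `public_base`, and `public_base_by_subtraction` are hypothetical names, and `last()` is modelled as "highest revision number".

```python
def ancestors(parents, rev):
    """Return rev plus all of its ancestors in a {rev: [parent revs]} map."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def public_base(parents, public, rev):
    """Model of last(::rev & public())."""
    candidates = ancestors(parents, rev) & public
    return max(candidates) if candidates else None


def public_base_by_subtraction(parents, public, rev):
    """Model of last(::rev - not public())."""
    not_public = set(parents) - public
    candidates = ancestors(parents, rev) - not_public
    return max(candidates) if candidates else None


# Linear history 0 <- 1 <- 2 <- 3, where 0 and 1 are public.
history = {0: [], 1: [0], 2: [1], 3: [2]}
public_revs = {0, 1}
base = public_base(history, public_revs, 3)
```

The subtraction form has to materialize the set of non-public revisions first, which is one plausible reason the intersection form could be faster on a large repository; that would need measuring, as requested in the comment.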

This revision is now accepted and ready to land. · Aug 2 2017, 12:30 PM

mitrandir marked 20 inline comments as done. · Aug 2 2017, 3:20 PM

mitrandir added inline comments.

hgext3rd/crdump.py
128	it's hash of the filenode

One final thing: I think it still might make sense to split up the information collection calls into two: one for the "heavy" data (diff, binary files, list of filenames), and one for the changelog-only data (so we can reserve revisions earlier, especially if the diff generation is slow). It doesn't prevent us from keeping crdump this way though.

Regardless, let's ship this and start testing it out!
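
One way to picture the split suggested above, as a sketch over plain dictionaries rather than Mercurial contexts. `light_data`, `heavy_data`, and the `make_diff` callback are hypothetical names, and the field grouping is an assumption.

```python
def light_data(commit):
    """Changelog-only fields: cheap to collect, enough to reserve a revision."""
    return {"node": commit["node"], "user": commit["user"], "desc": commit["desc"]}


def heavy_data(commit, make_diff):
    """Fields that need file contents; `make_diff` is a hypothetical callback."""
    return {"files": sorted(commit["files"]), "patch": make_diff(commit)}


def dump(commit, make_diff):
    """Single-call form: light data first, heavy data merged in afterwards."""
    data = light_data(commit)
    data.update(heavy_data(commit, make_diff))
    return data
```

A caller that only needs to reserve revisions can stop after `light_data`, while the combined `dump` keeps the single-call behaviour this revision ships with.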

hgext3rd/crdump.py
128	Ah, in that case let's just drop the path part.
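
A sketch of the "drop the path part" naming: each dumped file is stored under its filenode hash alone, so no subdirectories are needed and a repeated dump of the same filenode is skipped. `dump_blob` is a hypothetical stand-in for `dumpfctx`, taking the hash and bytes directly, and the hash value below is made up.

```python
import os
import tempfile


def dump_blob(outdir, filenode_hex, data):
    """Write data to outdir/<filenode_hex> unless it already exists."""
    writepath = os.path.join(outdir, filenode_hex)
    if not os.path.isfile(writepath):
        # Binary mode keeps the bytes intact on every platform.
        with open(writepath, "wb") as f:
            f.write(data)
    return filenode_hex


with tempfile.TemporaryDirectory() as outdir:
    # Hypothetical filenode hash and contents, for illustration only.
    name = dump_blob(outdir, "23c26c825bdd", b"A\x00\n")
    # A second dump of the same filenode is a no-op.
    dump_blob(outdir, "23c26c825bdd", b"ignored")
    with open(os.path.join(outdir, name), "rb") as f:
        content = f.read()
```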

ryanmce retitled this revision from "crdump: introduce extension" to "crdump: introduce extension to dump data for code review tools". · Aug 3 2017, 4:50 AM

ryanmce edited the summary of this revision.

mitrandir marked 10 inline comments as done. · Aug 3 2017, 7:40 AM

mitrandir added inline comments.

hgext3rd/crdump.py
18–21	fixed.
tests/test-crdump.t
220	facepalm

fixes for the rest of the comments

Closed by commit rFBHGX1843e9c0a000: crdump: introduce extension to dump data for code review tools (authored by mitrandir). · Aug 4 2017, 9:19 AM

This revision was automatically updated to reflect the committed changes.

			Path	Packages
A	M		hgext3rd/crdump.py (165 lines)
A	M		tests/test-crdump.t (177 lines)

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	473		Aug 1 2017, 10:49 AM	★	★
Diff 2	474	refine help	Aug 1 2017, 10:55 AM	★	★
Diff 3	478	tested and fixed the case of binary file removal	Aug 1 2017, 2:20 PM	★	★
Diff 4	500	responded to all the issues except encoding issue mentioned by @quark (will do…	Aug 2 2017, 8:42 AM	★	★
Diff 5	522	fixes for the rest of the comments	Aug 3 2017, 7:47 AM	★	★
Diff 6	536	rFBHGX1843e9c0a00055a62f02178d9b08fc2ac5c18e15	Aug 3 2017, 7:50 AM	★	★

Commit	Local	Parents	Author	Summary	Date
78d840783bc2	3663	397926ee66b0	Mateusz Kwapich	crdump: introduce extension (Show More…)	Aug 1 2017, 10:55 AM

Diff 474

hgext3rd/crdump.py

This file was added.

				# crdump.py - dump changesets information to filesystem
				#
				from __future__ import absolute_import

				import json, os, re, shutil, tempfile
				from os import path

				from mercurial import (
				    error,
				    registrar,
				    scmutil,
				)

				from mercurial.i18n import _
				from mercurial.node import hex

				try:
				    from hgsubversion import util as svnutil
				except ImportError:
				    svnutil = None

				ryanmce [Done]: Shouldn't this use `extensions.find()`?
				mitrandir (author) [Done]: That's how phrevset is doing it.
				ryanmce [Done]: Does that mean it's right?
				mitrandir (author) [Not Done]: fixed.
				cmdtable = {}
				command = registrar.command(cmdtable)

				@command('crdump',
				         [('r', 'rev', [], _("revision to dump"))],
				ryanmce [Done]: this could be a revset, so it might be worth making "revision" plural: "revisions to dump"
				         _('hg crdump [OPTION]... [-r] [REV]'))
				def crdump(ui, repo, *revs, **opts):
				    """
				ryanmce [Done]: Comment about this number? Or pass it in via -U? (we can do this later though)
				    Dump the info about the revisions in format that's friendly for sending the
				    patches for code review.

				    The output is a JSON list with dictionary for each specified revision: ::

				        {
				          "output_directory": an output directory for all temporary files
				          "commits": [
				            {
				              "node": commit hash,
				              "date": date in format [unixtime, timezone offset],
				              "desc": commit message,
				              "patch_file": file containing patch in unified diff format,
				              "files": list of files touched by commit,
				ryanmce [Done]: full path or relative to `output_directory`?
				              "binary_files": [
				                {
				                  "filename": filename relative to repo root,
				                  "old_file": path to file (relative to base_tmp_dir) with a dump
				                              of the old version of the file,
				ryanmce [Done]: `base_tmp_dir` isn't a thing according to this doc
				                  "new_file": path to file (relative to base_tmp_dir) with a dump
				                              of the new version of the file,
				                },
				ryanmce [Done]: for clarity and consistency with below, let's say "path to file containing..."
				                ...
				              ],
				              "user": commit author,
				              "p1": {
				                "node": hash,
				ryanmce [Done]: for clarity and consistency with the below, let's use "path to file relative to repo root"
				                "differential_revision": xxxx
				              },
				              "public_base": {
				                "node": public base commit hash,
				                "svnrev": svn revision of public base (if hgsvn repo),
				              }
				            },
				            ...
				          ]
				        }
				    """

				    revs = list(revs)
				    revs.extend(opts['rev'])

				    if not revs:
				        raise error.Abort(_('revisions must be specified'))
				    revs = scmutil.revrange(repo, revs)

				    commits = []
				    outdir = tempfile.mkdtemp(suffix='hg.crdump')
				ryanmce [Done]: nit: prefer `cdata`: it's clearly data about commits, not the commits themselves.
				    try:
				        for rev in revs:
				            ctx = repo[rev]
				            rdata = {
				                'node': hex(ctx.node()),
				                'date': map(int, ctx.date()),
				                'desc': ctx.description(),
				                'files': ctx.files(),
				                'p1': {
				                    'node': ctx.parents()[0].hex(),
				                },
				                'user': ctx.user(),
				            }
				            if ctx.parents()[0].phasestr() != "public":
				                # we need this only if parent is in the same draft stack
				ryanmce [Done]: I'd feel happier if we used phases.CONSTANT rather than hardcoding the "public" name here.
				                rdata['p1']['differential_revision'] = \
				                    phabricatorrevision(ctx.parents()[0])

				            pbctx = publicbase(repo, ctx)
				            if pbctx:
				                rdata['public_base'] = {
				                    'node': hex(pbctx.node()),
				                }
				                if svnutil:
				                    svnrev = svnutil.getsvnrev(pbctx)
				                    rdata['public_base']['svnrev'] = \
				                        svnrev.split('@')[1] if svnrev else None
				            rdata['patch_file'] = dumppatch(ui, repo, ctx, outdir)
				ryanmce [Done]: Comment about the format of the data here and why splitting on @ is the right thing please.
				            rdata['binary_files'] = dumpbinaryfiles(ui, repo, ctx, outdir)
				            commits.append(rdata)

				        ui.write(json.dumps({
				            'output_directory': outdir,
				            'commits': commits,
				        }, sort_keys=True, indent=4, separators=(',', ': ')))
				        ui.write('\n')
				    except Exception as e:
				        shutil.rmtree(outdir)
				ryanmce [Done]: What does `except None` do? Seems weird and not what you want: https://stackoverflow.com/questions/19327320/python-except-none
				mitrandir (author) [Done]: My bad.
				        raise e

				def dumppatch(ui, repo, ctx, outdir):
				    chunks = ctx.diff()
				    patchfile = '%s.patch' % hex(ctx.node())
				ryanmce [Done]: We want full context -- as much as possible at any rate, right?
				mitrandir (author) [Done]: It's using the diff.unified from the config. But I suppose we can hardcode it here. I'll add also git=True and binary=False.
				ryanmce [Done]: That sounds good for now.
				    with open(path.join(outdir, patchfile), 'w') as f:
				        for chunk in chunks:
				            f.write(chunk)
				    return patchfile

				def dumpfctx(outdir, fctx):
				    outfile = '%s_%s' % (fctx.path(), hex(fctx.filenode()))
				    writepath = path.join(outdir, outfile)
				ryanmce [Done]: So this would create paths named like `foo/bar/baz_HASH`, right? That looks pretty weird to me. I'd rather do something like: `HASH.bin/foo/bar/baz` (I acknowledge that this is bike-shedding so feel free to ignore)
				mitrandir (author) [Done]: it's hash of the filenode
				ryanmce [Done]: Ah, in that case let's just drop the path part.
				    if not path.isdir(path.dirname(writepath)):
				        os.makedirs(path.dirname(writepath))
				    if not path.isfile(writepath):
				        with open(writepath, 'w') as f:
				            f.write(fctx.data())
				    return outfile

				def dumpbinaryfiles(ui, repo, ctx, outdir):
				    binary_files = []
				    for fname in ctx.files():
				ryanmce [Done]: you're mixing styles here. Prefer `binaryfiles = []`
				        fctx = ctx[fname]
				        pctx = ctx.parents()[0]
				        if fctx.isbinary():
				            newfile = dumpfctx(outdir, fctx)
				            oldfile = None
				            if fname in pctx:
				                pfctx = pctx[fname]
				                if pfctx.isbinary():
				                    oldfile = dumpfctx(outdir, pfctx)
				            binary_files.append({
				                'file_name': fname,
				                'old_file': oldfile,
				                'new_file': newfile,
				            })

				    return binary_files

				def phabricatorrevision(ctx):
				ryanmce [Done]: What happens when a binary file becomes non-binary, or vice versa? What does the diff look like? Is this tested? What happens on the phabricator side?
				mitrandir (author) [Done]: no idea what happens on phabricator side, we need to figure it out. I'll add a test.
				    match = re.search('Differential Revision:.*facebook\.com.*/D(\d+)',
				                      ctx.description())
				    return match.group(1) if match else ''

				def publicbase(repo, ctx):
				    base = repo.revs('last(::%d - not public())', ctx.rev())
				    if len(base):
				ryanmce [Done]: No, let's not replicate this logic please. Let's reuse what we already have in this repo. Also, I don't think it will always necessarily have facebook.com -- what about our inevitable switch to fb.com again?
				mitrandir (author) [Done]: I don't want this extension to have to rely on phrevset extension being enabled just to use this little regex from it. I'll copy it to a constant.
				        return repo[base.first()]
				    return None
				ryanmce [Done]: Is this the same revset we use in jf? Should this be passed in in case we need to change it in jf?
				mitrandir (author) [Done]: In jf we don't have "public base" revset. And I think for the function name it's the only revset that's correct :)
				ryanmce [Done]: We do have a public base revset for the bundle upload, actually. Can you check what we use there? I optimized it to be fast and correct iirc.
				ryanmce [Done]: jf uses: const base = `last(::ancestor(${revs}) & public())`; I believe this may be faster than what you have; can you test? Note: you don't need the ancestor bit because there's only one rev at this point.

tests/test-crdump.t

This file was added.

				$ cat >> $HGRCPATH << EOF
				> [extensions]
				> drawdag=$RUNTESTDIR/drawdag.py
				> crdump=$TESTDIR/../hgext3rd/crdump.py
				> EOF

				Create repo
				$ mkdir repo
				$ cd repo
				$ hg init
				$ echo A > a
				$ echo -e "A\0" > bin1
				$ hg addremove
				adding a
				adding bin1
				$ hg commit -m a
				$ hg phase -p .

				$ echo A > a
				$ echo -e "a\0b" > bin1
				$ echo -e "b\0" > bin2
				$ hg addremove
				adding bin2
				$ hg commit -m "b
				> Differential Revision: phabricator.facebook.com/D123"

				$ echo C > c
				$ hg addremove
				adding c
				$ hg commit -m c

				Test basic dump of two commits

				$ hg crdump -r ".^^::." | tee ../json_output
				{
				"commits": [
				{
				"binary_files": [
				{
				"file_name": "bin1",
				"new_file": "bin1_23c26c825bddcb198e701c6f7043a4e35dcb8b97",
				"old_file": null
				}
				],
				"date": [
				0,
				0
				],
				"desc": "a",
				"files": [
				"a",
				"bin1"
				],
				"node": "65d913976cc18347138f7b9f5186010d39b39b0f",
				"p1": {
				"node": "0000000000000000000000000000000000000000"
				},
				"patch_file": "65d913976cc18347138f7b9f5186010d39b39b0f.patch",
				"public_base": {
				"node": "65d913976cc18347138f7b9f5186010d39b39b0f",
				"svnrev": null
				},
				"user": "test"
				},
				{
				"binary_files": [
				{
				"file_name": "bin1",
				"new_file": "bin1_5f54dc7f5b744f0bf88fcfe31eaba3cabc7a5f0c",
				"old_file": "bin1_23c26c825bddcb198e701c6f7043a4e35dcb8b97"
				},
				{
				"file_name": "bin2",
				"new_file": "bin2_31f7b4d23cf93fd41972d0a879086e900cbf06c9",
				"old_file": null
				}
				],
				"date": [
				0,
				0
				],
				"desc": "b\nDifferential Revision: phabricator.facebook.com/D123",
				"files": [
				"bin1",
				"bin2"
				],
				"node": "bfcf9917c5dbf2b3b24f0bb4bf5b73611c5fe573",
				"p1": {
				"node": "65d913976cc18347138f7b9f5186010d39b39b0f"
				},
				"patch_file": "bfcf9917c5dbf2b3b24f0bb4bf5b73611c5fe573.patch",
				"public_base": {
				"node": "65d913976cc18347138f7b9f5186010d39b39b0f",
				"svnrev": null
				},
				"user": "test"
				},
				{
				"binary_files": [],
				"date": [
				0,
				0
				],
				"desc": "c",
				"files": [
				"c"
				],
				"node": "bfaadbd049a38e851652d584627afd83a7298969",
				"p1": {
				"differential_revision": "123",
				"node": "bfcf9917c5dbf2b3b24f0bb4bf5b73611c5fe573"
				},
				"patch_file": "bfaadbd049a38e851652d584627afd83a7298969.patch",
				"public_base": {
				"node": "65d913976cc18347138f7b9f5186010d39b39b0f",
				"svnrev": null
				},
				"user": "test"
				}
				],
				"output_directory": "*" (glob)
				}

				>>> import json
				>>> from os import path
				>>> with open("../json_output") as f:
				...     data = json.loads(f.read())
				>>> outdir = data['output_directory']
				>>> for commit in data['commits']:
				...     print "#### commit %s" % commit['node']
				...     print open(path.join(outdir, commit['patch_file'])).read()
				...     for binfile in commit['binary_files']:
				...         print "######## file %s" % binfile['file_name']
				...         if binfile['old_file'] is not None:
				...             print "######## old"
				...             print open(path.join(outdir, binfile['old_file'])).read().encode('hex')
				...         if binfile['new_file'] is not None:
				...             print "######## new"
				...             print open(path.join(outdir, binfile['new_file'])).read().encode('hex')
				#### commit 65d913976cc18347138f7b9f5186010d39b39b0f
				diff -r 000000000000 -r 65d913976cc1 a
				--- /dev/null Thu Jan 01 00:00:00 1970 +0000
				+++ b/a Thu Jan 01 00:00:00 1970 +0000
				@@ -0,0 +1,1 @@
				+A
				diff -r 000000000000 -r 65d913976cc1 bin1
				Binary file bin1 has changed

				######## file bin1
				######## new
				41000a
				#### commit bfcf9917c5dbf2b3b24f0bb4bf5b73611c5fe573
				diff -r 65d913976cc1 -r bfcf9917c5db bin1
				Binary file bin1 has changed
				diff -r 65d913976cc1 -r bfcf9917c5db bin2
				Binary file bin2 has changed

				######## file bin1
				######## old
				41000a
				######## new
				6100620a
				######## file bin2
				######## new
				62000a
				#### commit bfaadbd049a38e851652d584627afd83a7298969
				diff -r bfcf9917c5db -r bfaadbd049a3 c
				--- /dev/null Thu Jan 01 00:00:00 1970 +0000
				+++ b/c Thu Jan 01 00:00:00 1970 +0000
				@@ -0,0 +1,1 @@
				+C



				>>> import shutil
				>>> shutil.rmtree(outdir)
				NameError("name 'outdir' is not defined",)
				ryanmce [Done]: What?
				mitrandir (author) [Not Done]: facepalm
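
For readers on Python 3, a consumer along the lines of the test's inline check could look like the sketch below. It only relies on the keys visible in the expected output above (`output_directory`, `commits`, `patch_file`, `binary_files`, `file_name`, `old_file`, `new_file`); `describe_dump` is a hypothetical helper and `/tmp/example` is a made-up directory.

```python
import json
import os


def describe_dump(data):
    """Yield one summary line per commit and per binary file."""
    outdir = data["output_directory"]
    for commit in data["commits"]:
        patch = os.path.join(outdir, commit["patch_file"])
        yield "commit %s patch=%s" % (commit["node"], patch)
        for binfile in commit["binary_files"]:
            # old_file is null when the file is new in this commit.
            old = binfile["old_file"] or "(none)"
            yield "  binary %s old=%s new=%s" % (
                binfile["file_name"], old, binfile["new_file"])


# Trimmed sample in the shape printed by the test above.
sample = json.loads("""
{
    "commits": [
        {
            "binary_files": [
                {"file_name": "bin1", "new_file": "bin1_23c2", "old_file": null}
            ],
            "node": "65d913976cc1",
            "patch_file": "65d913976cc1.patch"
        }
    ],
    "output_directory": "/tmp/example"
}
""")
lines = list(describe_dump(sample))
```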

This is an archive of the discontinued Mercurial Phabricator instance.

crdump: introduce extension to dump data for code review tools
Closed · Public