copies: add config option for writing copy metadata to file and/or changeset
ClosedPublic

Authored by martinvonz on Tue, Apr 2, 3:30 PM.

Details

Summary

This introduces a config option that lets you choose to write copy
metadata to the changeset extras instead of to the filelog. There's also
an option to write it to both places. I imagine that may be
useful when transitioning an existing repo.

The copy metadata is stored as two fields in extras: one for copies
since p1 and one for copies since p2.
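As a rough illustration only (the exact key names and value encoding are worked out later in this thread; `p1copies` is the name that appears below, and `p2copies` is the analogous field for p2), the extras might end up looking something like:

```python
# Hypothetical sketch: each field maps copy destinations to sources,
# encoded as "dest\0source" pairs (the separator choice is debated below).
extra = {
    'p1copies': 'renamed.py\x00original.py',  # copies since p1
    'p2copies': '',                           # no copies since p2
}

# Each entry pairs a destination path with its copy source.
dest, source = extra['p1copies'].split('\x00')
```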

I may need to add more information later in order to make copy tracing
faster. Specifically, I'm thinking of recording which files were
added or removed so that copies._chaincopies() doesn't have to look at
the manifest for that. But that would just be an optimization, and it
can be added once we know whether it's necessary.
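For context, chaining copy information across a range of changesets might look roughly like the following (a hypothetical simplification, not the real copies._chaincopies(), which additionally consults manifests to detect files added or removed in between — the cost alluded to above):

```python
def chain_copies(per_changeset_copies):
    """Fold per-changeset {dest: source} dicts, oldest first, into one
    dict mapping final names back to their original names.

    Simplification: every copy is treated as a rename, so intermediate
    names are dropped from the result."""
    result = {}
    for copies in per_changeset_copies:
        new = {}
        for dest, src in copies.items():
            # If src was itself a copy destination earlier, chain
            # through to its original source.
            new[dest] = result.pop(src, src)
        result.update(new)
    return result
```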

I have also considered saving space by replacing the destination
file path with an index into the "files" list, but that can also be
changed later (though before the feature is ready to release).

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped
martinvonz created this revision. Tue, Apr 2, 3:30 PM
martinvonz updated this revision to Diff 14657. Thu, Apr 4, 7:58 PM
martinvonz updated this revision to Diff 14663. Fri, Apr 5, 1:48 AM

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes). In addition, storing copies in the changeset is a kind of "schema breakage", making its adoption slower.

Instead, I would advocate keeping the copy data inside the filelog, using a changeset-centric cache summing up the information. The entries from this changeset-centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog use case.

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

In addition, storing copies in the changeset is a kind of "schema breakage", making its adoption slower.

Instead, I would advocate keeping the copy data inside the filelog, using a changeset-centric cache summing up the information. The entries from this changeset-centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog use case.

That sounds more complicated for unclear benefit.

Hi Martin,

Thanks for taking on copy tracing, it's been on our mind for a while, too.

Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.

We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.

If that suits you, we could have a video chat this week to go over the details - of which we could post a summary here to keep the whole community in the loop.

What do you think?

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

The performance impact is terrible, however. hg st --rev last-mozilla-central --rev GECKO_2_1_BASE (~30k commits apart) went from about 5 seconds to about 6 minutes. That's because the current code reads manifests. We should be able to remove that.

Hi Martin,

Thanks for taking on copy tracing, it's been on our mind for a while, too.

Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.

We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.

If that suits you, we could have a video chat this week to go over the details - of which we could post a summary here to keep the whole community in the loop.

What do you think?

I see benefits of both solutions. As you said, the cache solution's primary benefit is that it works on existing repos. For new repos (in environments where everyone using the repo has a new Mercurial version), it seems simpler to store and exchange the copy information in the changeset instead of writing it to the filelog and also to a cache.

I think the way I've written these patches, it shouldn't be too hard for you guys to extend it (in core) by adding another option for experimental.copies.read-from to make it read from the cache. I'm happy to talk about that. I'm honestly not happy to talk about doing only the cache solution. Okay with you?

pulkit added a subscriber: pulkit. Mon, Apr 8, 12:06 PM

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copy tracing is very painful for us too. The above numbers look nice. How can I try this myself on some internal repo?

In D6183#90486, @pulkit wrote:

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copy tracing is very painful for us too. The above numbers look nice. How can I try this myself on some internal repo?

You'll need to apply this series and also D6219. Then run hg convert --config experimental.copies.write-to=changeset-only. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.

In D6183#90486, @pulkit wrote:

I am quite enthusiastic about non-filelog-based copy tracing using commit-level information. However, I am very unenthusiastic about the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% of the changelog.d file, possibly taking hundreds of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copy tracing is very painful for us too. The above numbers look nice. How can I try this myself on some internal repo?

You'll need to apply this series and also D6219. Then run hg convert --config experimental.copies.write-to=changeset-only. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.

Thanks, I will check the change in size of .hg/ and changelog size.

We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary

  • The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  • The documentation for the new config option should clearly state that changeset-only is applicable only in cases where all the peers enable it and prior changeset hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  • Octobus hopes that the caching strategy could eventually become useful for the general public.
  • @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.

Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.
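For illustration, enabling the changeset-only mode on both the read and write side might look like this in an hgrc (experimental.copies.write-to and experimental.copies.read-from are the option names mentioned elsewhere in this thread; treat the exact accepted values as per the patch):

```ini
[experimental]
# Write copy metadata to changeset extras only (not to filelogs).
copies.write-to = changeset-only
# Read copy metadata back from the changeset.
copies.read-from = changeset-only
```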

We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary

  • The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  • The documentation for the new config option should clearly state that changeset-only is applicable only in cases where all the peers enable it and prior changeset hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  • Octobus hopes that the caching strategy could eventually become useful for the general public.
  • @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.

Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.

Thanks for the summary, Georges! Reviewers, feel free to queue this if you think it looks good.

yuja added a subscriber: yuja. Sat, Apr 13, 7:45 AM

+def encodecopies(copies):
+    items = [
+        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
+        for k in sorted(copies)
+    ]
+    return "\n".join(items)

It might be nitpicky, but I think it's better not to embed \0 into the
extras field. Almost all extras data is text, and IIRC we regret that
transplant sources are stored in binary form.

In D6183#90698, @yuja wrote:

+def encodecopies(copies):
+    items = [
+        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
+        for k in sorted(copies)
+    ]
+    return "\n".join(items)

It might be nitpicky, but I think it's better not to embed \0 into the
extras field. Almost all extras data is text, and IIRC we regret that
transplant sources are stored in binary form.

Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

martinvonz updated this revision to Diff 14734. Sat, Apr 13, 6:16 PM
yuja added a comment. Sat, Apr 13, 8:26 PM
> It might be nitpicky, but I think it's better not to embed `\0` into the
>  extras field. Almost all extras data is text, and IIRC we regret that
>  transplant sources are stored in binary form.

Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

I don't remember, but we do store even boolean values as text, not in binary
\0/\1 form. transplant_source is the sole exception.

https://www.mercurial-scm.org/wiki/ChangesetExtra

And if we pick \0/\n separators, _string_escape() wouldn't be needed
at the encodecopies() layer.

In D6183#90722, @yuja wrote:
> It might be nitpicky, but I think it's better not to embed `\0` into the
>  extras field. Almost all extras data is text, and IIRC we regret that
>  transplant sources are stored in binary form.

Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

I don't remember, but we do store even boolean values as text, not in binary
\0/\1 form. transplant_source is the sole exception.

Perhaps it's just so {extras} doesn't print ANSI escape codes and such? (I assume that can still happen if you put escape characters in your filenames, for example.)

https://www.mercurial-scm.org/wiki/ChangesetExtra

And if we pick \0/\n separators, _string_escape() wouldn't be needed
at the encodecopies() layer.

Oh, now I see what you're saying! That's embarrassing. So maybe we should _string_escape() the whole thing? I'll do that.

martinvonz updated this revision to Diff 14738. Sun, Apr 14, 1:29 AM
yuja added a comment. Sun, Apr 14, 7:02 PM
> >   >  extras field. Almost all extras data is text, and IIRC we regret that
> >   >  transplant sources are stored in binary form.
> >   
> >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
>
> I don't remember, but we do store even boolean values as text, not in binary
>  `\0`/`\1` form. `transplant_source` is the sole exception.


Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if you put escape characters in your filenames, for example.)

Ok. That might be the reason, and I'm fine with the \0 separator.

> https://www.mercurial-scm.org/wiki/ChangesetExtra
> 
> And if we pick \0/\n separators, _string_escape() wouldn't be needed
>  at the encodecopies() layer.

Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

Not really. I meant _string_escape() could be removed entirely if we store
copies in binary (valid_filename + invalid_filename_separator) form. The extra
dict will be encoded later.

In D6183#90738, @yuja wrote:
> >   >  extras field. Almost all extras data is text, and IIRC we regret that
> >   >  transplant sources are stored in binary form.
> >   
> >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
>
> I don't remember, but we do store even boolean values as text, not in binary
>  `\0`/`\1` form. `transplant_source` is the sole exception.


Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if you put escape characters in your filenames, for example.)

Ok. That might be the reason, and I'm fine with the \0 separator.

> https://www.mercurial-scm.org/wiki/ChangesetExtra
> 
> And if we pick \0/\n separators, _string_escape() wouldn't be needed
>  at the encodecopies() layer.

Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

Not really. I meant _string_escape() could be removed entirely if we store
copies in binary (valid_filename + invalid_filename_separator) form. The extra
dict will be encoded later.

Sure, if we're okay with the \0 and \n separators being printed to the terminal when the user uses the {extras} template, then we can just drop the encoding. Sounds like you're okay with that, and I also don't care too much, so I'll drop the encoding.
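With the escaping dropped, the final encoding and a matching decoder might look like this (decodecopies is a hypothetical counterpart shown for illustration; the actual reader lives in the patch):

```python
def encodecopies(copies):
    # Join "dest\0source" pairs with newlines. Neither \0 nor \n can
    # appear in a tracked file name, so no further escaping is needed.
    items = ['%s\x00%s' % (dest, copies[dest]) for dest in sorted(copies)]
    return '\n'.join(items)

def decodecopies(data):
    # Inverse of encodecopies; an empty string means no copies.
    if not data:
        return {}
    copies = {}
    for item in data.split('\n'):
        dest, source = item.split('\x00')
        copies[dest] = source
    return copies
```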

This revision was automatically updated to reflect the committed changes.

I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).

I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.

That being said, we may want to be more aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.

Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.

Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...

I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).

Yes, Pierre-Yves and Georges wanted to work on adding them to a cache. As you hint at, that would also allow copy detection to be run at a later point to update the cache. And yes, the previous patches should have cleanly decoupled the algorithms from the storage (the new assumption is that we can cheaply get copy metadata for a whole changeset).

I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.

I had considered that, but figured that copy information isn't essential enough to warrant a repo requirement. But I agree with your next paragraph.

That being said, we may want to be more aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.

That's still not a disaster, but I agree that it still seems better to lock them out.

Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.

Good point!

Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...

That's actually what I did initially, and context._copies() is still written to work that way (not consult filelogs if an empty p1copies entry was recorded in the changeset), but I haven't added a mode where we write empty entries. I should do that. Thanks for pointing that out.
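The fallback behavior described above could be sketched like this (hypothetical names; context._copies() in the patch is the real implementation):

```python
def p1copies(extra, read_filelog_copies):
    # A recorded p1copies entry, even an empty one, is authoritative:
    # do not fall back to the filelogs in that case.
    if 'p1copies' in extra:
        encoded = extra['p1copies']
        if not encoded:
            return {}
        return dict(item.split('\x00', 1) for item in encoded.split('\n'))
    # No entry at all (e.g. written by an old client): use filelogs.
    return read_filelog_copies()
```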