This is an archive of the discontinued Mercurial Phabricator instance.

copies: add config option for writing copy metadata to file and/or changset
ClosedPublic

Authored by martinvonz on Apr 2 2019, 3:30 PM.

Details

Summary

This introduces a config option that lets you choose to write copy
metadata to the changeset extras instead of to filelog. There's also
an option to write it to both places. I imagine that may possibly be
useful when transitioning an existing repo.

The copy metadata is stored as two fields in extras: one for copies
since p1 and one for copies since p2.

I may need to add more information later in order to make copy tracing
faster. Specifically, I'm thinking out recording which files were
added or removed so that copies._chaincopies() doesn't have to look at
the manifest for that. But that would just be an optimization and that
can be added once we know if it's necessary.

I have also considered saving space by using replacing the destination
file path by an index into the "files" list, but that can also be
changed later (but before the feature is ready to release).

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

martinvonz created this revision.Apr 2 2019, 3:30 PM
martinvonz updated this revision to Diff 14657.Apr 4 2019, 7:58 PM
martinvonz updated this revision to Diff 14663.Apr 5 2019, 1:48 AM

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower.

Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase.

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower.
Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase.

That sounds more complicated for unclear benefit.

Hi Martin,

Thanks for taking on copy tracing, it's been on our mind for a while, too.

Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.

We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.

If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop.

What do you think?

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

The performance impact is terrible, however. hg st --rev last-mozilla-central --rev GECKO_2_1_BASE (~30k commits apart) went from about 5 seconds to about 6 minutes. That's because the current code reads manifests. We should be able to remove that.

Hi Martin,
Thanks for taking on copy tracing, it's been on our mind for a while, too.
Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.
We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.
If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop.
What do you think?

I see benefits of both solutions. As you said, the cache solution's primary benefit is that it works on existing repos. For new repos (in environments where everyone using the repo has a new Mercurial version), it seems simpler to store and exchange the copy information in the changeset instead of writing it to the filelog and also to a cache.

I think the way I've written these patches, it shouldn't be too hard for you guys to extend it (in core) by adding another option for experimental.copies.read-from to make it read from the cache. I'm happy to talk about that. I'm honestly not happy to talk about doing only the cache solution. Okay with you?

pulkit added a subscriber: pulkit.Apr 8 2019, 12:06 PM

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?

In D6183#90486, @pulkit wrote:

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?

You'll need to patch in this series and also D6219. Then run hg convert --config experimental.copies.write-to=changeset-only. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.

In D6183#90486, @pulkit wrote:

I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes).

It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)

It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:

          size           |   in filelog | in changeset | increase |
.hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
.hg/                     |   2866813806 |   2804688010 |   -2.17% |

I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?

You'll need to patch in this series and also D6219. Then run hg convert --config experimental.copies.write-to=changeset-only. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.

Thanks, I will check the change in size of .hg/ and changelog size.

We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary

  • The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  • The documentation for the new config option should clearly state that changeset-only is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  • Octobus hopes that the caching strategy could eventually become useful for the general public.
  • @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.

Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.

We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary

  • The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  • The documentation for the new config option should clearly state that changeset-only is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  • Octobus hopes that the caching strategy could eventually become useful for the general public.
  • @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.

Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.

Thanks for the summary, Georges! Reviewers, feel free to queue this if you think it looks good.

yuja added a subscriber: yuja.Apr 13 2019, 7:45 AM

+def encodecopies(copies):
+ items = [
+ '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
+ for k in sorted(copies)
+ ]
+ return "\n".join(items)

It might be nitpicky, but I think it's better to not embed \0 into the
extras field. Almost all extras data are texts, and IIRC we regret that
transplant sources are stored in binary form.

In D6183#90698, @yuja wrote:

+def encodecopies(copies):
+ items = [
+ '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
+ for k in sorted(copies)
+ ]
+ return "\n".join(items)

It might be nitpicky, but I think it's better to not embed \0 into the
extras field. Almost all extras data are texts, and IIRC we regret that
transplant sources are stored in binary form.

Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

martinvonz updated this revision to Diff 14734.Apr 13 2019, 6:16 PM
yuja added a comment.Apr 13 2019, 8:26 PM
> It might be nitpicky, but I think it's better to not embed `\0` into the
>  extras field. Almost all extras data are texts, and IIRC we regret that
>  transplant sources are stored in binary form.
Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

I don't remember, but we do store even boolean value as text, not in binary
\0/\1 form. transplant_source is the solo exception.

https://www.mercurial-scm.org/wiki/ChangesetExtra

And if we pick \0/\n separators, _string_escape() wouldn't be needed
at the encodecopies() layer.

In D6183#90722, @yuja wrote:
> It might be nitpicky, but I think it's better to not embed `\0` into the
>  extras field. Almost all extras data are texts, and IIRC we regret that
>  transplant sources are stored in binary form.
Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

I don't remember, but we do store even boolean value as text, not in binary
\0/\1 form. transplant_source is the solo exception.

Perhaps it's just so {extras} doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)

https://www.mercurial-scm.org/wiki/ChangesetExtra
And if we pick \0/\n separators, _string_escape() wouldn't be needed
at the encodecopies() layer.

Oh, now I see what you're saying! That's embarrassing. So maybe we should _string_escape() the whole thing? I'll do that.

martinvonz updated this revision to Diff 14738.Apr 14 2019, 1:29 AM
yuja added a comment.Apr 14 2019, 7:02 PM
> >   >  extras field. Almost all extras data are texts, and IIRC we regret that
> >   >  transplant sources are stored in binary form.
> >   
> >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
>
> I don't remember, but we do store even boolean value as text, not in binary
>  `\0`/`\1` form. `transplant_source` is the solo exception.
Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)

Ok. That might be the reason, and I'm fine with the \0 separator.

> https://www.mercurial-scm.org/wiki/ChangesetExtra
> 
> And if we pick \0/\n separators, _string_escape() wouldn't be needed
>  at the encodecopies() layer.
Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

Not really. I meant _string_escape() could be removed entirely if we store
copies in binary (valid_filename + invalid_filename_separator) form. The extra
dict will be encoded later.

In D6183#90738, @yuja wrote:
> >   >  extras field. Almost all extras data are texts, and IIRC we regret that
> >   >  transplant sources are stored in binary form.
> >   
> >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
>
> I don't remember, but we do store even boolean value as text, not in binary
>  `\0`/`\1` form. `transplant_source` is the solo exception.
Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)

Ok. That might be the reason, and I'm fine with the \0 separator.

> https://www.mercurial-scm.org/wiki/ChangesetExtra
> 
> And if we pick \0/\n separators, _string_escape() wouldn't be needed
>  at the encodecopies() layer.
Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

Not really. I meant _string_escape() could be removed entirely if we store
copies in binary (valid_filename + invalid_filename_separator) form. The extra
dict will be encoded later.

Sure, if we're okay with the \0 and \n separators being printed to the terminal when the user uses the {extras} template, then we can just drop the encoding. Sounds like you're okay with that, and I also don't care too much, so I'll drop the encoding.

This revision was automatically updated to reflect the committed changes.

I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).

I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.

That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.

Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.

Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...

I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).

Yes, Pierre-Yves and Georges wanted to work on adding them to a cache. As you hint at, that would also allow copy detection to be run at a later point to update the cache. And yes, the previous patches should have decoupled the algorithms from the storage well (the new assumption is that we can cheaply get copy metadata for a whole changeset).

I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.

I had considered that, but figured that copy information isn't essential enough to warrant a repo requirement. But I agree with your next paragraph.

That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.

That's still not a disaster, but I agree that it still seems better to lock them out.

Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.

Good point!

Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...

That's actually what I did initially, and context._copies() is still written to work that way (not consult filelogs if an empty p1copies entry was recorded in the changeset), but I haven't added a mode where we write empty entries. I should do that. Thanks for pointing that out.