This is an archive of the discontinued Mercurial Phabricator instance.

wireprotov2: define and implement "rawstorefile" command
Abandoned, Public

Authored by indygreg on May 11 2018, 6:35 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

stream_out - the previous command for sending raw revlog files -
was not carried forward to protocol version 2.

This commit introduces a minimal viable replacement for stream_out
in wire protocol version 2.

The new command allows obtaining "raw store files" - essentially
files from Mercurial's store as they exist on disk.

The command currently only allows obtaining the changelog or the
changelog plus root manifestlog. This is the feature set required
to support partial clones where only file data is partial.

We'll probably want to implement support for retrieving changelog
and manifest data via dedicated commands in order to facilitate
partial clone. And if we do decide to keep a command for streaming
"raw" files, we'll want to support tree manifests. This command
is very much a minimum viable implementation. I foresee things
changing substantially. None of wire protocol version 2 is
covered by BC yet. So hopefully the barrier to entry is low.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

indygreg created this revision. May 11 2018, 6:35 PM
martinvonz added inline comments.
mercurial/wireprotov2server.py
548–549

I understand that you don't want to lock the repo for the entire operation, but I assume that also means the result may fail hg verify? If the changelog is always sent before the manifest (is it?), you might end up with orphan entries in the manifest if it was written to while the changelog was being read.

554

Tree manifest support might be as easy as also iterating over data files, stripping the leading 'meta/' and trailing '00manifest.[id]', and testing against repo.narrowmatch().visitdir().
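
A rough sketch of that path filtering, for illustration only. The helper names are hypothetical, `visitdir` is a plain stand-in callable rather than the real matcher method, and store path encoding is ignored:

```python
def manifest_dir_from_store_path(path):
    """Return the directory a tree manifest revlog covers, or None.

    Assumes tree manifest revlogs look like "meta/<dir>/00manifest.i"
    or "meta/<dir>/00manifest.d" (store filename encoding is ignored).
    """
    prefix = "meta/"
    if not path.startswith(prefix):
        return None
    for suffix in ("00manifest.i", "00manifest.d"):
        if path.endswith(suffix):
            # "meta/foo/bar/00manifest.i" -> "foo/bar"
            return path[len(prefix):-len(suffix)].rstrip("/")
    return None


def wanted_tree_manifests(paths, visitdir):
    """Keep only tree manifest paths whose directory visitdir accepts.

    visitdir is a hypothetical callable taking a directory and returning
    a truthy value when that directory should be visited.
    """
    wanted = []
    for path in paths:
        directory = manifest_dir_from_store_path(path)
        if directory is not None and visitdir(directory):
            wanted.append(path)
    return wanted
```

Anything that is not under 'meta/' or does not end in a manifest revlog name is skipped, so filelogs under 'data/' never reach the matcher.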

Similar to Martin's question, I would like to allow streaming clones without any locks. For that to work, one party needs to know how to truncate additional undesired data. That can be fully done by the client or the client could send a size or revision hint to the server.

In order to support streaming clone without any locks *and* for the result of that clone to pass hg verify with no warnings about unreferenced revisions (assuming the server was clean to begin with), I believe we would need to scan the changelog for all referenced manifest nodes and then find the end offset of the last node in the manifest. We would then send the manifest up to that offset.

Then for filelogs, we would need to do something similar for every file.

This doesn't scale.
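
To make the cost concrete, here is a toy model of the manifest scan described above. It is a sketch under simplifying assumptions: `IndexEntry` and `manifest_cutoff` are hypothetical names, and a real revlog index carries more fields than a node, an offset, and a length.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Simplified stand-in for one manifest revlog index entry."""
    node: str
    offset: int  # byte offset where the revision's data starts
    length: int  # number of bytes the revision's data occupies


def manifest_cutoff(changelog_manifest_nodes, manifest_index):
    """Return the end offset of the last manifest revision that any
    changeset references, or 0 if none is referenced.

    Every changeset and every manifest index entry is visited once.
    """
    referenced = set(changelog_manifest_nodes)
    return max(
        (e.offset + e.length for e in manifest_index if e.node in referenced),
        default=0,
    )
```

The same scan would have to run once per filelog, with the set of referenced nodes derived from the manifests, which is where the cost adds up.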

I see the following solutions to this problem:

  1. Obtaining a lock, determining file sizes, and only streaming files up to their known sizes. (This is the current solution.)
  2. Do not obtain a lock, send all manifest and filelog data. This potentially results in the client receiving extra manifest and file revisions if the server was in the middle of a transaction when obtaining data. Also, there are race conditions involving a rollback that we'd need to worry about.
  3. Tracking the offsets of all files up to the last transaction so they can be obtained with a lock.
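
Option 1 can be sketched roughly as follows. This is not the actual server code: `threading.Lock` stands in for the repository lock, and `snapshot_sizes` and `stream_file` are hypothetical helpers.

```python
import os
import threading


def snapshot_sizes(paths, lock):
    """Record each file's size while holding the lock.

    The lock is held only long enough to stat the files, not while
    their contents are sent.
    """
    with lock:
        return [(path, os.path.getsize(path)) for path in paths]


def stream_file(path, size, chunk_size=65536):
    """Yield at most the first `size` bytes of path.

    Bytes appended after the snapshot was taken are never read.
    """
    remaining = size
    with open(path, "rb") as fh:
        while remaining > 0:
            chunk = fh.read(min(chunk_size, remaining))
            if not chunk:
                break  # file is shorter than the recorded size
            remaining -= len(chunk)
            yield chunk
```

Because the files are append-only in this model, a prefix of the recorded length matches what was on disk when the lock was held. A file truncated after the snapshot (for example by a rollback) would simply yield fewer bytes, which this sketch does not try to detect.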

Anyway, I'm not a super big fan of *stream clones*. Their existence is a glorified hack to make clones faster. Their existence is a massive layering violation because it makes the server's storage implementation the client's. This *can* be useful. But I'd rather focus on making normal [partial] clones fast.

I'm considering dropping this command from the series and implementing commands to obtain changeset and manifest data. Then we could implement partial clones without *stream clones* and avoid this debate :)

First, thank you for your work on the new wire protocol.

We used to send cache files as well in streaming clones with the V1 wire protocol. Do you think we would use the new rawstorefile command for this purpose? If so, the command name might be confusing, as cache files reside outside the store directory.

You were suggesting dropping support for stream clones. Whatever the design, do you have ideas for how to be as fast as stream clones? Stream clones make cloning huge repositories hours faster than traditional clones.

I'm not yet sure what will be done with stream clones. There's a good chance the existing approach more or less gets carried forward. I do concede that it is pretty optimal and we'll have a hard time reproducing its performance.

I'm...not thrilled by the abstraction leak in this, but as long as it's strictly temporary on the path to saner partial clones I can live with it.

mercurial/help/internals/wireprotocol.txt
1935

Having this here makes it infeasible to generate the revlog files on the fly, which means non-traditional storage backends won't ever be able to implement this endpoint efficiently.

Do we care? Should we note that this method is going to go away in all likelihood?

mercurial/wireprotov2server.py
548–549

By grabbing the size along with the list of files inside the lock, that won't happen. :)

indygreg abandoned this revision. Aug 21 2018, 4:23 PM

I'll be taking a different approach to partial clone and the wire protocol in an upcoming series. These patches may get revived someday. But not now.