wireproto: experimental command to emit file data
Changes Planned, Public

Authored by indygreg on Mar 16 2018, 7:07 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

Partial clones will require new wire protocol functionality to
retrieve repository data. The remotefilelog extension, which
implements various aspects of partial clone, adds a handful of
wire protocol commands:

getflogheads

Obtain heads of a filelog

getfile

Obtain data for an individual file revision

getfiles

Batch version of getfile

getpackv1

Obtain a "pack file" containing index and data on multiple
files

(among others)

Recently, the wire protocol has gained support for "obtain repository
data" in the form of overloading the "getbundle" wire protocol
command. This is arguably OK in the context of "all data is attached
to bundles" and "bundles are a self-contained representation of
complete repository data." But partial clone invalidates these
assumptions because in a partial clone world, we can no longer assume
things like "the client has all the base revisions."

In a partial clone world, we'll need wire protocol commands that allow
clients to obtain specific pieces of data with vastly different
access patterns. For example, a client may want to obtain "index"
data but keep the fulltext data on the server. Or vice-versa. Or a
client may wish to fetch all revisions of a specific file but only
the latest revision of another. These access patterns will be
difficult to shoehorn into single, powerful commands (like
"getbundle"). Even if we could, doing that isn't wise from a server
implementation perspective because it makes implementing scalable
servers hard. We want server-side commands to be small and simple
so alternate server implementations can come into existence more
easily.

This is one reason why the frame-based wire protocol I'm implementing
supports command pipelining and out-of-order responses. These
properties will enable clients performing complex operations to send
command streams containing dozens or even hundreds of small command
requests to servers.
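
As a rough illustration of pipelining with out-of-order responses, the
sketch below tags each queued command with a request ID and matches
responses by that ID. The PipelinedClient class, its methods, and the
argument names are hypothetical and are not the frame-based protocol's
actual API; only the "getfile" command name comes from the list above.

```python
import itertools


class PipelinedClient:
    """Toy client that tags each command with a request ID (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request ID -> (command name, arguments)
        self.results = {}   # request ID -> response payload

    def send(self, name, **args):
        """Queue a command without waiting for earlier ones to finish."""
        request_id = next(self._ids)
        self._pending[request_id] = (name, args)
        return request_id

    def receive(self, request_id, payload):
        """Accept a response in whatever order the server produced it."""
        if request_id not in self._pending:
            raise KeyError("unknown request ID: %d" % request_id)
        del self._pending[request_id]
        self.results[request_id] = payload

    def outstanding(self):
        """Return the IDs of commands still awaiting a response."""
        return sorted(self._pending)


client = PipelinedClient()
first = client.send("getfile", path=b"README", node=b"aaaa")
second = client.send("getfile", path=b"setup.py", node=b"bbbb")

# The server answers the second request before the first.
client.receive(second, b"setup contents")
client.receive(first, b"readme contents")
```

Because responses are keyed by request ID, the client never has to
block on one slow command before reading the result of another.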

Anyway, this commit implements an experimental wire protocol command
for "get files data." Essentially, you give it a changeset revision
you are interested in and it spits back all the files and their data
in that revision, as fulltexts.
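
A minimal sketch of the idea, assuming an in-memory stand-in for the
repository and an invented length-prefixed framing. FAKE_REPO,
emit_files_data, and the record layout are illustrative assumptions;
they are not the command's actual name or response format.

```python
import struct

# Hypothetical stand-in for a repository: changeset -> {path: fulltext}.
FAKE_REPO = {
    b"rev1": {
        b"README": b"hello\n",
        b"src/main.py": b"print('hi')\n",
    },
}


def emit_files_data(repo, changeset):
    """Yield length-prefixed path and fulltext chunks for every file."""
    manifest = repo[changeset]
    for path in sorted(manifest):
        data = manifest[path]
        # Assumed framing: 2-byte path length, path, 4-byte data length, data.
        yield struct.pack(">H", len(path)) + path
        yield struct.pack(">I", len(data)) + data


stream = b"".join(emit_files_data(FAKE_REPO, b"rev1"))
```

Because the function is a generator, a server built along these lines
could start sending data before it has read every file.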

This command is just one way a server could emit data for files.
A variation of this command that accepts specific file paths and nodes
whose data is to be retrieved would also be useful. And I imagine we'll
eventually implement that. It would also be useful to emit index
data. Or have each file blob be individually compressed. (Right now
compression is performed on the whole stream because that's how the
wire protocol currently works - but I have plans to evolve the
frame-based protocol to do novel things here.)
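
To make the difference between the two compression strategies concrete,
here is a small sketch using Python's standard zlib module at level 6.
The sample blobs are made up, and this only illustrates the two shapes
of output; it is not how the wire protocol applies compression.

```python
import zlib

# Two made-up file fulltexts with a lot of shared content.
blobs = [
    b"int main(void) { return 0; }\n" * 20,
    b"int main(void) { return 1; }\n" * 20,
]

# Whole-stream compression: a single compressor sees every blob in turn.
compressor = zlib.compressobj(6)
whole = b"".join(compressor.compress(blob) for blob in blobs)
whole += compressor.flush()

# Per-blob compression: each file is compressed independently.
per_blob = [zlib.compress(blob, 6) for blob in blobs]

# Both forms round-trip to the original data.
restored_whole = zlib.decompress(whole)
restored_blobs = [zlib.decompress(chunk) for chunk in per_blob]
```

Whole-stream compression can exploit redundancy across files, while
per-blob compression lets a receiver decompress (or skip) each file on
its own.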

I'm not even sure this variation of the wire protocol command is a
good one to have! One reason I want to start with this command is
that it seems like a useful primitive. For example, with this
command, one could build a client that is able to realize a working
directory from a single wire protocol request: you can literally
stream the response to this command and turn the data into files on
the filesystem with minimal stream processing!
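
The following sketch shows how little processing such a client might
need. It assumes a hypothetical response made of length-prefixed
(path, data) records; that framing, the realize_working_directory
function, and the record helper are all invented for illustration. It
also skips things a real client would need, such as path validation
and handling of truncated streams.

```python
import io
import os
import struct
import tempfile


def realize_working_directory(stream, dest):
    """Read (path, data) records from stream and write each as a file."""
    written = []
    while True:
        header = stream.read(2)
        if not header:
            break  # clean end of stream
        # Assumed framing: 2-byte path length, path, 4-byte data length, data.
        (path_len,) = struct.unpack(">H", header)
        path = stream.read(path_len).decode("utf-8")
        (data_len,) = struct.unpack(">I", stream.read(4))
        data = stream.read(data_len)
        # NOTE: a real client must reject absolute paths and ".." components.
        target = os.path.join(dest, *path.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as fh:
            fh.write(data)
        written.append(path)
    return written


def record(path, data):
    """Build one record in the assumed framing (test helper)."""
    return (struct.pack(">H", len(path)) + path
            + struct.pack(">I", len(data)) + data)


payload = record(b"README", b"hello\n") + record(b"src/main.py", b"print('hi')\n")
dest = tempfile.mkdtemp()
files = realize_working_directory(io.BytesIO(payload), dest)
```

Each record is written as soon as it is read, so memory use is bounded
by the largest single file rather than by the size of the checkout.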

As implemented, this command is effectively a benchmark of revlog
reading and/or compression. On the mozilla-unified repository when
operating on revision c488b8d0e074efb490ebca32db68eb77871bfd2f (a
recent revision of mozilla-central, the head of Firefox development),
my i7-6700K yields the following:

  • no compression: 1478MB; ~94s wall; ~56s CPU
  • zstd level 3: 343MB; ~97s wall; ~57s CPU
  • zlib level 6: 367MB; ~116s wall; ~74s CPU

For comparison, hg bundle --base null -r c488b8d0e0 -t zstd-v2
(which approximates what hg clone -r would be doing on the server)
yields:

1397MB; ~624s wall; ~225s CPU

Of course, these are vastly different operations. But this does
demonstrate that if your use case of version control is "check out
revision X" and you were previously relying on hg clone [without
stream clone bundles] to do that, this wire protocol command
is overall much more efficient on servers. It's worth noting that
the use case of version control for many automated systems *is*
"check out revision X." So I think providing a clone mode that can
realize a working copy as fast as possible is a worthwhile feature
to have!
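
For a quick side-by-side view, the figures quoted above can be turned
into ratios. The numbers are copied from this summary and are
approximate; the variable names are only for illustration.

```python
# (size in MB, wall seconds, CPU seconds), copied from the figures above.
measurements = {
    "no compression": (1478, 94, 56),
    "zstd level 3": (343, 97, 57),
    "zlib level 6": (367, 116, 74),
}
bundle = (1397, 624, 225)  # hg bundle with zstd-v2

raw_size = measurements["no compression"][0]

# How much smaller each stream is than the uncompressed stream.
compression_ratio = {
    name: raw_size / size for name, (size, _, _) in measurements.items()
}

# How much less wall and CPU time each run needs than the bundle.
wall_speedup = {
    name: bundle[1] / wall for name, (_, wall, _) in measurements.items()
}
cpu_speedup = {
    name: bundle[2] / cpu for name, (_, _, cpu) in measurements.items()
}

for name in measurements:
    print("%-15s %.1fx smaller, %.1fx less wall, %.1fx less CPU" % (
        name, compression_ratio[name], wall_speedup[name], cpu_speedup[name]))
```

By these numbers, zstd level 3 shrinks the stream by roughly 4.3x for
about 3 extra seconds of wall time, and even the zlib run uses about a
third of the CPU time of the bundle.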

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped