This is an archive of the discontinued Mercurial Phabricator instance.

cborutil: implement support for streaming encoding, bytestring decoding
ClosedPublic

Authored by indygreg on Apr 13 2018, 2:26 AM.

Details

Summary

The vendored cbor2 package is... a bit disappointing.

On the encoding side, it insists that you pass it something with
a write() to send data to. That means if you want to emit data to
a generator, you have to construct an e.g. io.BytesIO(), write()
to it, then get the data back out. There can be non-trivial overhead
involved.

The encoder also doesn't support indefinite types - bytestrings, arrays,
and maps that don't have a known length. Again, this is really
unfortunate because it requires you to buffer the entire source and
destination in memory to encode large things.

On the decoding side, it supports reading indefinite length types.
But it buffers them completely before returning. More sadness.

This commit implements "streaming" encoders for various CBOR types.
Encoding emits a generator of hunks. So you can efficiently stream
encoded data elsewhere.

It also implements support for emitting indefinite length bytestrings,
arrays, and maps.

On the decoding side, we only implement support for decoding an
indefinite length bytestring from a file object. It will emit a
generator of raw chunks from the source.

I didn't want to reinvent so many wheels. But profiling the wire
protocol revealed that the overhead of constructing io.BytesIO()
instances to temporarily hold results has a non-trivial overhead.
We're talking >15% of execution time for operations like
"transfer the fulltexts of all files in a revision." So I can
justify this effort.

Fortunately, CBOR is a relatively straightforward format. And we have
a reference implementation in the repo we can test against.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

indygreg created this revision.Apr 13 2018, 2:26 AM
yuja added a subscriber: yuja.Apr 13 2018, 8:34 AM
yuja added inline comments.
mercurial/utils/cborutil.py
74

I don't think yielding encoder.encode would make much sense
because an array item can also be a nested indefinite array, in
which case, we can't use writeitem().

indygreg added inline comments.Apr 13 2018, 4:00 PM
mercurial/utils/cborutil.py
74

Indeed.

Proper support for nesting will likely require a whole new high-level encoder API. Because state of the nesting needs to be tracked somewhere.

FWIW, the more I'm looking at the CBOR code, the more I'm thinking we will end up having to reinvent the full wheel. Not-yet-submitted commits to add wire protocol commands to do CBOR things are spending a *ton* of time in cbor2. The reason appears to be primarily driven by cbor2's insistence on using write(). There are a few places where we need to emit a generator of chunks. And the overhead from instantiating io.BytesIO instances to handle the write() from cbor2 only to call getvalue() to retrieve the data is non-trivial.

The next version of this may just invent a whole new CBOR encoder with only limited support for types. Or at least I'll change the API so a streaming array doesn't require an encoder be passed in.

indygreg edited the summary of this revision. (Show Details)Apr 13 2018, 5:31 PM
indygreg retitled this revision from cborutil: implement support for indefinite length CBOR types to cborutil: implement support for streaming encoding, bytestring decoding.
indygreg updated this revision to Diff 8152.

I ended up implementing my own CBOR encoder for a starting subset of types. Profiling the wire protocol server inspired me to do this.

I have a future wire protocol command that emits the fulltext data of every file in a revision. It was taking ~45s CPU to run. After plugging in the new generator based CBOR encoder, that drops to ~31s. All pure Python still too.

indygreg planned changes to this revision.Apr 13 2018, 8:13 PM

I'll probably give this one a little bit more love before I ask for review. We need to at least implement support encoding floats in order to support -Tcbor.

yuja added a comment.Apr 13 2018, 11:18 PM

FWIW, the more I'm looking at the CBOR code, the more I'm thinking
we will end up having to reinvent the full wheel.

Sounds reasonable to me.

We need to at least implement support encoding floats in order to
support -Tcbor.

If it's just for date tuple, I have a patch to get rid of floating-point timestamps
as our timestamps should be effectively ints.

In D3303#53422, @yuja wrote:

If it's just for date tuple, I have a patch to get rid of floating-point timestamps
as our timestamps should be effectively ints.

I believe that's the only place we use floats. Or at least the only place I saw it in test failures.

indygreg updated this revision to Diff 8301.Apr 14 2018, 8:21 PM
yuja added a comment.Apr 16 2018, 9:06 AM

Queued, thanks.

mercurial/utils/cborutil.py
207

Nit: We might have to support subtypes such as util.sortdict.

This revision was automatically updated to reflect the committed changes.