This is an archive of the discontinued Mercurial Phabricator instance.

vlqencoding: encodes integers to variable-length byte arrays
ClosedPublic

Authored by quark on Oct 3 2017, 9:41 PM.

Details

Summary

This is a common technique to store variable-length integers efficiently.
It's compatible with both Thrift and Protobuf [1].

It's intended to be used in:

  • On-disk file format to make the file compact and avoid issues like https://bz.mercurial-scm.org/5681 (Obsolete markers code crashes with metadata keys/values longer than 255 bytes).
  • Thrift layer.

[1]: https://developers.google.com/protocol-buffers/docs/encoding#varints

Test Plan
cargo test
cargo clippy

Also ran a kcov coverage check and it says 100%.

cargo rustc --lib --profile test -- -Ccodegen-units=1 -Clink-dead-code -Zno-landing-pads
kcov --include-path $PWD/src --verify target/kcov ./target/debug/*-????????????????

Diff Detail

Repository
rFBHGX Facebook Mercurial Extensions
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

quark created this revision.Oct 3 2017, 9:41 PM
Herald added a reviewer: Restricted Project. · View Herald TranscriptOct 3 2017, 9:41 PM
quark updated this revision to Diff 2396.Oct 3 2017, 9:49 PM
ryanmce added a subscriber: ryanmce.Oct 4 2017, 6:34 AM
ryanmce added inline comments.
rust/vlq/src/lib.rs
16 ↗(On Diff #2396)

I don't know too much about rust, but it seems like it would be great to have a type associated with the Vec<u8> to make it safer and more self-explanatory to use.

34–51 ↗(On Diff #2396)

nit: why /// rather than /** style comments?

quark added inline comments.Oct 4 2017, 12:05 PM
rust/vlq/src/lib.rs
34–51 ↗(On Diff #2396)

Rust documentation style is different. /// is what Rust stdlib uses.

By having an example section, the code could be doctest-ed by running cargo test.

quark added inline comments.Oct 4 2017, 12:23 PM
rust/vlq/src/lib.rs
16 ↗(On Diff #2396)

I imagined the caller end up with x::ByteArray, y::ByteArray, vlq::ByteArray. It seems redundant to have that many types.

I'm not sure about the best practice here. Question for more experienced Rust reviewers.

52 ↗(On Diff #2396)

Maybe it's better to use Result<u64> here to detect the error case. But I'm not sure how much runtime overhead that will introduce.

I don't mean vlq::byteareay, but vlq::isn't or something, which happens to be a byteareay under the covers. But I also don't know what the "oxidized" way to do it is.

mbthomas requested changes to this revision.Oct 5 2017, 11:20 AM
mbthomas added a subscriber: mbthomas.

You should consider adding method overrides for Write and Read in the same way as the byteorder crate does. See https://docs.rs/byteorder/1.1.0/byteorder/trait.WriteBytesExt.html for documentation and example. What I'd like to be able to do is:

filehandle.write_vlq(12345)?;

and

let n = filehandle.read_vlq()?;

This is actually a really useful concept. Does something like this already exist as a crate? If not, can we publish this as a crate to use in other Rust code?

rust/vlq/src/lib.rs
16 ↗(On Diff #2396)

Vec<u8> is kind of special in that it's considered the de-facto Rust variable-length buffer type. For example, std::io::Read contains a method to read directly into a Vec<u8>: std::io::Read::read_to_end.

52 ↗(On Diff #2396)

Result type is cheap, so you should use it if you can (from what I understand, the overhead is about the same as the cost of passing back an additional boolean return value, and having the caller check it). If the caller is reading from a file, they will be using io::Result all over the place anyway.

Use the error-chain crate to define errors easily.

52 ↗(On Diff #2396)

decode should take a slice (&[u8]) so that it can decode from within any buffer. Vec<u8> can decay automatically to &[u8] depending on context anyway.

This revision now requires changes to proceed.Oct 5 2017, 11:20 AM
mbthomas added inline comments.Oct 5 2017, 1:38 PM
rust/vlq/src/lib.rs
52 ↗(On Diff #2396)

Actually, I don't like the signature of this function at all. You should prefer tuple returns to mutable in/out parameters. The latter aren't very Rustish. Something like:

fn decode(buf: &[u8]) -> (u64, usize);
quark added a comment.EditedOct 10 2017, 4:37 PM

You should consider adding method overrides for Write and Read in the same way as the byteorder crate does. See https://docs.rs/byteorder/1.1.0/byteorder/trait.WriteBytesExt.html for documentation and example. What I'd like to be able to do is:

Sorry for the late response. I was distracted by oncall stuff.

My original idea is to read (or mmap) a flat file to Vec<u8> and vlq operates on that memory buffer, similar to what Mercurial does with revlog.i and linelog does for linelog.l. So there is no file handler involved. I think it's generally more flexible to decouple computation logic from I/O.

I noticed that Vec<u8> does not implement WriteBytesExt. We can implement a wrapper around Vec<u8> that implements WriteBytesExt, but that might make things unnecessarily complex.

@jsgf: How do you think?

fn decode(buf: &[u8]) -> (u64, usize);

This is a good point. I forgot that Rust could return a tuple and was a bit concerned about the slicing overhead. I'll make the change.

jsgf requested changes to this revision.Oct 10 2017, 5:48 PM
jsgf added inline comments.
rust/vlq/src/lib.rs
16 ↗(On Diff #2396)

The trouble with returning Vec<u8> is that requires a new Vec to be allocated for every u64, which is pretty expensive - and the chances are the caller won't be able to use it directly in that form.

It would be better to use the Write trait which generalizes over a stream of output - as @mbthomas mentioned, Vec<u8> implements this (along with Read) so it's still easy for callers. The only downside is that it can fail with io::Result so you need to propagate that up, even if a write to - say - Vec<u8> can never actually fail.

For example:

use std::io::{self, Write};

pub fn encode<W: Write>(value: u64, out: &mut W) -> io::Result<()> {
...
}
52 ↗(On Diff #2396)

A more common signature for this would be:

pub fn decode(buf: &[u8]) -> Result<(u64, &[u8]), E> {
...
}

ie, successful result returns the value and a slice representing the remaining input.

It would also be possible to use std::io::Cursor, but that's probably overkill here.

The error type E would need to represent at least too-short input (we ran out of input before seeing the high bit set) and overflow (the number overflowed before we saw the high bit).

55–62 ↗(On Diff #2396)

This is fairly inefficient because buf[*pos] will do an array bounds check for each byte.

I'd do something like:

const MAX: usize = 10;
struct Bad;
pub fn decode(input: &[u8]) -> Result<(u64, &[u8]), Bad> {
    let mut res = 0;
    let buf = &input[..cmp::min(MAX, input.len())];
    for (idx, x) in buf.iter().enumerate() {
        res |= ((x & 127) as u64) << (idx * 7);
        if x & 128 == 0 {
            return Ok((res, &input[idx+1..]));
        }
    }
    Err(Bad)
}

where MAX is the max possible encoding length of u64. This doesn't distinguish the short input from overflow.

(There's probably something fancier using more combinators, but this will be fine.)

72–75 ↗(On Diff #2396)

I would use quickcheck to generate these kinds of round-trip test cases (perhaps in addition to the hand-picked ones).

jsgf added a comment.Oct 10 2017, 5:55 PM

Getting more fancy, you could define VLQEncode/VLQDecode traits, then implement them for u64 and i64 (with zigzag), and perhaps other integer types if it makes sense/is useful.

quark added a comment.Oct 10 2017, 6:13 PM

Thanks for the very detailed explanation!

I made a mistake thinking WriteBytesExt was in stdlib and only checked Vec from stdlib. I'll revise the interface.

quark updated this revision to Diff 2574.Oct 11 2017, 2:31 AM
quark updated this revision to Diff 2575.Oct 11 2017, 2:36 AM
quark updated this revision to Diff 2576.Oct 11 2017, 3:22 AM
quark added a comment.EditedOct 11 2017, 3:37 AM

I did some benchmarks encoding 0..100000 to a Vec::with_capacity(283488) - Vec does be very slow. So the increased complexity is worthwhile!

return_vec3,252,425 ns/iter (+/- 518,303)
return_slice1,414,582 ns/iter (+/- 377,286)
write_u64_buffered946,138 ns/iter (+/- 213,011)
write_u64_unbuffered902,002 ns/iter (+/- 92,514)
write_num_traits_buffered1,009,901 ns/iter (+/- 183,366)
write_num_traits_unbuffered (current version)907,168 ns/iter (+/- 58,427)

I like the Cursor idea since that makes the return value simpler and more intuitive.

num-traits seems to provide some flexibility (ex. might be easier to support BigInt in the future) at the cost of a slight overhead. Or maybe we can just use u64, which also sounds reasonable since our main target platform is amd64.

I didn't put zigzag support in this version. If we continue using num-traits, zigzag may be implemented as a separate trait:

trait ZigZagInt {
   type Output;
   fn to_zigzag(&self) -> Output;
}

And x.to_zigzag().write_vlq(...) may be neat enough. Or we can do something like impl<T: PrimInt + ZigZagInt> VLQEncode for T to merge the two features.

If we do not use num-traits, we can just implement VLQ encoder/decoder for u64 and i64 and cast everything to one of them.

rust/vlq/src/lib.rs
63 ↗(On Diff #2576)

I'm not sure if this is a good idea or not. It seems with impl<T: TraitX>, impl<T: TraitY> becomes impossible, because they may overlap. But maybe that can be solved in a newer rust.

74 ↗(On Diff #2576)

I benched multiple writes vs a single write with a pre-allocated buffer in stack. To my surprise, the former is a bit faster:

test vlq::tests::bench_write_vlq_num_traits              ... bench:   1,085,880 ns/iter (+/- 92,873)
test vlq::tests::bench_write_vlq_num_traits_multi_writes ... bench:     929,348 ns/iter (+/- 106,070)

So I'll leave the multi-write version. The pre-allocate version has difficulty figuring out the buffer size. Ideally it's (8 * mem::size_of::<T>() + 6) / 7, but Rust disallows T in a constant expression and associated constant is still experimental.

This looks mostly ok to me, but I think @jsgf should weigh in again.

I'd also like to reiterate that I think this has more wide appeal than just in mercurial and we should consider sharing more widely. Unfortunately there is already a crate on crates.io called vlq (similar thing but with a different encoding).

rust/vlq/src/lib.rs
63 ↗(On Diff #2576)

It seems a bit odd to have VLQEncode on the integer type rather than the Write type.

106–108 ↗(On Diff #2576)

This pattern can be more succinctly written as:

base = base.checked_mul(&base_multiplier).ok_or(DecodeError::Overflow)?;

Option::ok_or(e) translates Some(x) into Ok(x) and None into Err(e). The ? operator unwraps an Ok value and returns an Err.

123 ↗(On Diff #2576)

You can also use x.write_vlq(&mut v).expect(msg); with the added advantage that you can write a message for when it fails.

quark added inline comments.Oct 11 2017, 1:20 PM
rust/vlq/src/lib.rs
63 ↗(On Diff #2576)

Good point! I think it should be on the writer type for symmetry.

106–108 ↗(On Diff #2576)

Sounds good. I guess line 100 could also be simplified. I'll scan Rust doc.

123 ↗(On Diff #2576)

Nice to know!

quark added a comment.EditedOct 11 2017, 1:27 PM

I'd also like to reiterate that I think this has more wide appeal than just in mercurial and we should consider sharing more widely. Unfortunately there is already a crate on crates.io called vlq (similar thing but with a different encoding).

I agree. It could be something like vlqencoding. I think the source of truth could just be this repo so we have monorepo advantage - atomic refactoring, etc.

quark edited the summary of this revision. (Show Details)Oct 11 2017, 1:35 PM
quark updated this revision to Diff 2600.Oct 11 2017, 4:44 PM
quark updated this revision to Diff 2601.
quark edited the summary of this revision. (Show Details)Oct 11 2017, 4:48 PM
quark retitled this revision from vlq: add a library that encodes integers to variable-length byte arrays to vlqencoding: encodes integers to variable-length byte arrays.
quark updated this revision to Diff 2602.
ryanmce resigned from this revision.Oct 13 2017, 10:19 AM
jsgf requested changes to this revision.Oct 13 2017, 6:22 PM
jsgf added inline comments.
rust/vlq/src/lib.rs
63 ↗(On Diff #2576)

Agreed.

68–69 ↗(On Diff #2576)

This is looking pretty unpleasant. I was thinking more along the lines of having a helper function to implement this for u64, and then just manually implement it for each type:

pub fn encode_u64<W: Write>(v: u64, out: &mut W) -> Result<()> {
...
}

impl VLQEncode for u64 {
    fn write_vlq(&self, writer: &mut W) -> Result<()> {
        encode_u64(*self, writer)
    }
}

impl VLQEncode for u32 {
    fn write_vlq(&self, writer: &mut W) -> Result<()> {
        encode_u64(*self as u64, writer)
    }
}

//...

It's a bit cut'n'paste, but I think it's cleaner overall (and it would be a simple macro). It also makes implementing zigzag versions for i64/i32/... much more straightforward.

Also, since all the interesting types for this are Copy, you can make the trait take self (rather than &self) as it would be cleaner to just pass in values by value than reference. If you later want to implement it for a complex type, then you can implement impl<'a> VLQEncode for &'a ComplexType later on.

70 ↗(On Diff #2574)

Having an iterator of indexes like this is pretty much an antipattern in Rust. It's possible the compiler could work out that the range is bounded and eliminate the array bounds check below, but if it doesn't it makes this pretty inefficient.

This revision now requires changes to proceed.Oct 13 2017, 6:22 PM
quark updated this revision to Diff 2834.Oct 16 2017, 2:32 PM
quark updated this revision to Diff 2835.
quark updated this revision to Diff 3044.Oct 19 2017, 6:43 PM
mbthomas requested changes to this revision.Nov 2 2017, 12:24 PM
mbthomas added inline comments.
rust/vlqencoding/src/lib.rs
30

nit: I think this is called zig-zag.

156

I don't think this works for the most negative integer values. e.g. for i8, -128 is encoded as 255u8. (255u8 >> 1) + 1 is 128, which will cause overflow (and panic in debug) if you try to convert it to i8 and then negate it.

You can use:

((n >> 1) as $T) ^ -((n & 1) as $T)

which also has the advantage of working for both positive and negative numbers, so it doesn't need the if block.

171

I'd like to see tests for limits (in particular i{8,16,32,64}::min_value(), after my comments above).

This revision now requires changes to proceed.Nov 2 2017, 12:24 PM
quark added inline comments.Nov 2 2017, 5:37 PM
rust/vlqencoding/src/lib.rs
30

Thanks!

156

Good catch and nice bit expression! Although tests are passing even for the -128 case, it seems an undefined behavior.

171

I will add tests around every bit which will cover {integer}::{MIN,MAX}.

quark edited the test plan for this revision. (Show Details)Nov 2 2017, 5:37 PM
quark updated this revision to Diff 3241.
quark updated this revision to Diff 3242.Nov 2 2017, 5:46 PM
mbthomas accepted this revision.Nov 8 2017, 6:26 AM

LGTM. I plan to use this, too.

This revision was automatically updated to reflect the committed changes.