This is an archive of the discontinued Mercurial Phabricator instance.

hg-core: avoid memory allocation (D8958#inline-14990 followup)
AbandonedPublic

Authored by acezar on Sep 28 2020, 9:48 AM.

Details

Reviewers
Alphare
Group Reviewers
hg-reviewers

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

acezar created this revision.Sep 28 2020, 9:48 AM
Alphare accepted this revision.

I personally find this harder to read, and the use of usize before bit shifting making this technically platform-dependent makes me a bit uneasy. I can't easily compare the compiler output since godbolt does not support crates and I'd have to find out how to do it by hand, but maybe you or @martinvonz did?

I personally find this harder to read

I find it easier to read :)

, and the use of usize before bit shifting making this technically platform-dependent makes me a bit uneasy.

It probably doesn't matter if we're truncating (on small platforms) the result to usize anyway? I suppose we can use usize::try_from() to check (whether or not we make the change in this patch).

I can't easily compare the compiler output since godbolt does not support crates and I'd have to find out how to do it by hand, but maybe you or @martinvonz did?

I hadn't, but I just tried it using the criterion benchmarking crate. This is the code I used:

let input: Vec<u8> = (0..60000).map(|_| { rand::random::<u8>() }).collect();
let routine = || {
    for i in 0..input.len()/6 {
        let slice = &input[6*i..6*i+6];
        let val = ((BigEndian::read_u16(&slice[0..2]) as usize) << 32)
            + (BigEndian::read_u32(&slice[2..6]) as usize);
    }
};

Criterion says that it takes around 5.43 us both before and after this patch. So I guess the compiler is smart enough to eliminate the allocation, or maybe I messed something up.

I personally find this harder to read

I find it easier to read :)

heh

, and the use of usize before bit shifting making this technically platform-dependent makes me a bit uneasy.

It probably doesn't matter if we're truncating (on small platforms) the result to usize anyway? I suppose we can use usize::try_from() to check (whether or not we make the change in this patch).

I can't easily compare the compiler output since godbolt does not support crates and I'd have to find out how to do it by hand, but maybe you or @martinvonz did?

I hadn't, but I just tried it using the criterion benchmarking crate. This is the code I used:

let input: Vec<u8> = (0..60000).map(|_| { rand::random::<u8>() }).collect();
let routine = || {
    for i in 0..input.len()/6 {
        let slice = &input[6*i..6*i+6];
        let val = ((BigEndian::read_u16(&slice[0..2]) as usize) << 32)
            + (BigEndian::read_u32(&slice[2..6]) as usize);
    }
};

Criterion says that it takes around 5.43 us both before and after this patch. So I guess the compiler is smart enough to eliminate the allocation, or maybe I messed something up.

Seems fine indeed.

I personally find this harder to read

I find it easier to read :)

heh

, and the use of usize before bit shifting making this technically platform-dependent makes me a bit uneasy.

It probably doesn't matter if we're truncating (on small platforms) the result to usize anyway? I suppose we can use usize::try_from() to check (whether or not we make the change in this patch).

I can't easily compare the compiler output since godbolt does not support crates and I'd have to find out how to do it by hand, but maybe you or @martinvonz did?

I hadn't, but I just tried it using the criterion benchmarking crate. This is the code I used:

let input: Vec<u8> = (0..60000).map(|_| { rand::random::<u8>() }).collect();
let routine = || {
    for i in 0..input.len()/6 {
        let slice = &input[6*i..6*i+6];
        let val = ((BigEndian::read_u16(&slice[0..2]) as usize) << 32)
            + (BigEndian::read_u32(&slice[2..6]) as usize);
    }
};

Criterion says that it takes around 5.43 us both before and after this patch. So I guess the compiler is smart enough to eliminate the allocation, or maybe I messed something up.

Seems fine indeed.

To check that I hadn't messed something up, I ran the benchmark on a debug build and there this patch speeds it up from 1.51ms to 1.38ms (which is still ~240x as slow as the release build! :) ).

To check that I hadn't messed something up, I ran the benchmark on a debug build and there this patch speeds it up from 1.51ms to 1.38ms (which is still ~240x as slow as the release build! :) ).

Yeah debug builds have notoriously little to no optimizations, so that makes sense.

To check that I hadn't messed something up, I ran the benchmark on a debug build and there this patch speeds it up from 1.51ms to 1.38ms (which is still ~240x as slow as the release build! :) ).

Yeah debug builds have notoriously little to no optimizations, so that makes sense.

@acezar said on chat that they find the old version clearer, and the compiler is apparently smart enough to optimize out the allocation, so I'll skip this patch.

acezar updated this revision to Diff 22909.Sep 29 2020, 4:19 AM
acezar abandoned this revision.Sep 29 2020, 11:59 AM