This is an archive of the discontinued Mercurial Phabricator instance.

bundle: optional multithreaded compression, ATM zstd-only
ClosedPublic

Authored by joerg.sonnenberger on Nov 8 2020, 2:49 PM.

Details

Summary

Compression time can be a huge chunk of "hg bundle", especially when
using the higher compression levels. With level=22 and threads=7, the
NetBSD test repository took 28:39 wall time and 157:47 user time.
Before, level=22 would take 129:20 wall time and 129:07 user time.
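
As a back-of-the-envelope reading of those figures, the timings can be turned into ratios. This is only a sketch: it assumes all four figures use the same base-60 format (the ratios come out the same whether they are minutes:seconds or hours:minutes), and the helper name `to_units` is made up for illustration.

```python
def to_units(stamp):
    """Convert an 'A:B' base-60 timestamp into a count of the smaller unit."""
    major, minor = stamp.split(":")
    return int(major) * 60 + int(minor)


# Figures quoted in the summary above.
wall_before, user_before = to_units("129:20"), to_units("129:07")
wall_after, user_after = to_units("28:39"), to_units("157:47")
threads = 7

wall_speedup = wall_before / wall_after  # how much sooner the bundle finishes
cpu_overhead = user_after / user_before  # total CPU time relative to before
efficiency = wall_speedup / threads      # fraction of ideal linear scaling

print(f"wall-time speedup:   {wall_speedup:.2f}x")
print(f"user-time overhead:  {cpu_overhead:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

By this arithmetic the threaded run finishes roughly 4.5 times sooner while spending about 22% more CPU time in total.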

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

indygreg added inline comments.
mercurial/configitems.py
865–873

None of these support multithreaded compression. So why define the config options and give false promises?

mercurial/utils/compression.py
692

Multithreaded compression is a compile-time feature. python-zstandard builds the zstd library with this feature enabled. But I'm unsure what downstream packagers who unbundle libzstd are doing. For all I know they have removed the feature.

I'm fine with landing this and letting packagers who insist on making their lives difficult deal with it.

mercurial/configitems.py
865–873

Primarily because otherwise the code has to check a list before querying the config option to avoid warnings. In principle, there are concurrent implementations of both gzip and bzip2, just not available as a nicely packaged library. Emulating the bzip2 behavior in a small work queue would actually be moderately easy, I think. But I'm not sure how much interest there still is in bzip2 compression.
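
The trade-off described here can be sketched with a toy registry. Everything in it is hypothetical: the `register` and `config` helpers, the `example` section and the `threads.<engine>` option names are stand-ins for illustration, not Mercurial's actual config API or option names.

```python
import warnings

# Hypothetical stand-in for a config registry; not Mercurial's real API.
REGISTERED = set()


def register(section, name):
    REGISTERED.add((section, name))


def config(values, section, name, default=None):
    """Look up an option, warning when it was never registered."""
    if (section, name) not in REGISTERED:
        warnings.warn(f"unregistered config item: {section}.{name}")
    return values.get((section, name), default)


ENGINES = ["none", "gzip", "bzip2", "zstd"]

# Registering one option per engine keeps the lookup below uniform,
# even if only one engine actually acts on the value.
for engine in ENGINES:
    register("example", f"threads.{engine}")


def threads_for(values, engine):
    """Query the per-engine option without consulting an allow-list first."""
    return config(values, "example", f"threads.{engine}")
```

Without the blanket registration, `threads_for` would need to check the engine against a list of supported names before every lookup to stay warning-free.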

mercurial/utils/compression.py
692

I thought we don't support generic python-zstandard at this point anyway? That said, if a user has a zstd build without multithreading support, it would result in an error when asking for non-default options, which sounds perfectly acceptable to me.
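
One way to get the "error only for non-default options" behavior is to pass the thread count to the compressor only when the user actually set it. The sketch below assumes python-zstandard's `ZstdCompressor(level=..., threads=...)` keyword arguments; the helper names are invented for illustration, and this is not necessarily how the patch under review does it.

```python
def zstd_compressor_kwargs(level=None, threads=None):
    """Build keyword arguments, leaving out anything the user did not set."""
    kwargs = {}
    if level is not None:
        kwargs["level"] = level
    if threads is not None:
        kwargs["threads"] = threads
    return kwargs


def make_compressor(level=None, threads=None):
    """Create a compressor; `threads` is forwarded only when requested."""
    # Imported lazily so the helper above works without python-zstandard.
    import zstandard

    return zstandard.ZstdCompressor(**zstd_compressor_kwargs(level, threads))
```

With this shape, a default invocation never requests threading at all, so any failure from a build without multithreading would presumably surface only for users who opted in.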

I was looking specifically at bzip2 for a bit. There are essentially two kinds of threaded compressors for it. pbzip2 is the more common and effectively creates multiple independent streams. That's not handled transparently by Python's bz2, so it would break all existing clients, making this a big no-go. Sourceforge has a more proper implementation for POSIX platforms (http://bzip2smp.sourceforge.net/) which doesn't have that problem, and it would be nice if someone re-implemented the idea for modern libbz2. It can be done more cleanly too. While this doesn't allow multi-threaded decompression for multi-stream-aware clients, it does work with all bzip2 decoders. Sadly, the way it is done can't be replicated from Python without re-implementing a good chunk of bz2, as it hooks deeply into the implementation. So in short, it would be possible to provide it as a C extension and possibly even vendored, but it is more work than I currently want to do. I haven't looked into the state of pigz.
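
The multi-stream concern can be reproduced with the standard library alone. The sketch below imitates the pbzip2 approach by compressing fixed-size chunks in a thread pool and concatenating the resulting streams; the function name, chunk size and payload are made up for illustration. It then feeds the result to a single incremental `bz2.BZ2Decompressor`, which stops at the end of the first stream. Whether existing clients decode bundles this way is an assumption here; the one-shot `bz2.decompress` helper in current Python versions does accept concatenated streams.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_as_independent_streams(data, chunk_size, workers=4):
    """Compress chunks in parallel; each chunk becomes its own bzip2 stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        return b"".join(pool.map(bz2.compress, chunks))


payload = b"changeset data " * 1000
multi = compress_as_independent_streams(payload, chunk_size=5000)

# A multi-stream-aware decoder recovers the whole payload.
assert bz2.decompress(multi) == payload

# A single incremental decompressor stops after the first stream and
# leaves the remaining streams untouched in `unused_data`.
decomp = bz2.BZ2Decompressor()
first = decomp.decompress(multi)
print(len(first), decomp.eof, len(decomp.unused_data) > 0)
```

A reader built around one decompressor object would therefore see only the first chunk unless it is changed to restart on `unused_data`.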

pulkit requested changes to this revision. Mar 10 2021, 4:14 AM
pulkit added a subscriber: pulkit.

I like the idea and since it's an experimental feature, I am happy to push it. @joerg.sonnenberger kindly rebase over current default tip and resend.

mercurial/utils/compression.py
692

Can we add this to documentation somewhere and probably in release notes also?

This revision now requires changes to proceed. Mar 10 2021, 4:14 AM
pulkit added inline comments. Mar 10 2021, 1:38 PM
mercurial/utils/compression.py
692

I would still like to have an explicit comment stating that multithreading is a compile-time feature and .... somewhere. Above this line of code will work too. I personally know a few companies which build hg in their own way and use zstandard compression.

D10226 adds a separate packaging note about zstd.

pulkit accepted this revision. Mar 17 2021, 9:50 AM
This revision is now accepted and ready to land. Mar 17 2021, 9:50 AM