
bundle: optional multithreaded compression, ATM zstd-only
Needs Review (Public)

Authored by joerg.sonnenberger on Nov 8 2020, 2:49 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

Compression time can be a huge chunk of the "hg bundle" run time, especially when
using the higher compression levels. With level=22 and threads=7, bundling the
NetBSD test repository took 28:39 wall time and 157:47 user time.
Before, level=22 would take 129:20 wall time and 129:07 user time.
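
To put the figures above in perspective, the following sketch works out the implied speedup. It assumes the timings are in minutes:seconds; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to seconds (assumed format)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


threads = 7
wall_before = to_seconds("129:20")  # level=22, single-threaded
user_before = to_seconds("129:07")
wall_after = to_seconds("28:39")    # level=22, threads=7
user_after = to_seconds("157:47")

# How much faster the bundle finishes in wall-clock terms.
speedup = wall_before / wall_after
# Fraction of the ideal 7x speedup actually achieved.
efficiency = speedup / threads
# Extra CPU time spent in exchange for the shorter wall time.
cpu_overhead = user_after / user_before - 1

print(f"wall-time speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"extra user time: {cpu_overhead:.0%}")
```

Under that reading, seven threads cut the wall time by roughly 4.5x at the cost of about a fifth more total CPU time.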

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

indygreg added inline comments.
mercurial/configitems.py
551–559

None of these support multithreaded compression. So why define the config options and give false promises?

mercurial/utils/compression.py
689

Multithreaded compression is a compile-time feature. python-zstandard builds the zstd library with this feature enabled. But I'm unsure what downstream packagers who unbundle libzstd are doing. For all I know they have removed the feature.

I'm fine with landing this and letting packagers who insist on making their lives difficult deal with it.

mercurial/configitems.py
551–559

Primarily because otherwise the code has to check a list before querying the config option to avoid warnings. In principle, there are concurrent implementations of both gzip and bzip2, just not available as a nicely packaged library. Emulating the bzip2 behavior in a small work queue would actually be moderately easy, I think. But I'm not sure how much interest there still is in bzip2 compression.
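
The trade-off described here can be illustrated with a toy registry. This is only a sketch: the registry, the `threads.<engine>` key names, and the engine list are hypothetical and are not Mercurial's actual config API or option names.

```python
import warnings

# Hypothetical engine names and registry, for illustration only.
ENGINES = ("none", "gzip", "bzip2", "zstd")
_registered = set()


def register(key):
    """Declare a config item so that lookups do not warn."""
    _registered.add(key)


def lookup(settings, key, default=None):
    """Return a config value, warning if the item was never declared."""
    if key not in _registered:
        warnings.warn("accessing unregistered config item: %r" % key)
    return settings.get(key, default)


# Declaring one item per engine means the caller below needs no
# "does this engine support threads?" check before querying.
for engine in ENGINES:
    register("threads.%s" % engine)


def threads_for(settings, engine):
    """Thread count for an engine; 0 means the engine's default."""
    return lookup(settings, "threads.%s" % engine, default=0)
```

With every engine registered, `threads_for(settings, "bzip2")` returns the default without a warning. The alternative is to register only the zstd item and guard each lookup with a list of thread-capable engines.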

mercurial/utils/compression.py
689

I thought we don't support generic python-zstandard at this point anyway? That said, if a user has a zstd build without multithreading support, asking for non-default options would result in an error, which sounds perfectly acceptable to me.
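
For reference, a minimal sketch of what requesting threads looks like with python-zstandard, whose `ZstdCompressor` accepts a `threads` argument. The helper name `compress_payload` is hypothetical, and the sketch assumes that a libzstd built without multithreading reports the problem as a `ZstdError`; this is not the code from the patch.

```python
try:
    import zstandard
except ImportError:  # python-zstandard may not be installed
    zstandard = None


def compress_payload(data: bytes, level: int = 3, threads: int = 0) -> bytes:
    """Compress data with zstd; threads=0 keeps the single-threaded default.

    Hypothetical helper for illustration only.
    """
    if zstandard is None:
        raise RuntimeError("python-zstandard is not installed")
    try:
        cctx = zstandard.ZstdCompressor(level=level, threads=threads)
        return cctx.compress(data)
    except zstandard.ZstdError as exc:
        # Assumption: a build without multithreading support fails here
        # when a non-default thread count is requested.
        raise RuntimeError(
            "zstd compression failed (threads=%d): %s" % (threads, exc)
        ) from exc
```

A call such as `compress_payload(data, level=22, threads=7)` mirrors the settings quoted in the summary. The resulting frame is an ordinary zstd frame, so decompression does not depend on how many threads produced it.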

I was looking specifically at bzip2 for a bit. There are essentially two kinds of threaded compressors for it. pbzip2 is the more common one and effectively creates multiple independent streams. That's not handled transparently by Python's bz2, so it would break all existing clients, making this a big no-go. There is a more proper implementation for POSIX platforms on SourceForge (http://bzip2smp.sourceforge.net/) which doesn't have that problem, and it would be nice if someone re-implemented the idea for modern libbz2. It can be done more cleanly, too. While this doesn't allow multi-threaded decompression for multi-stream-aware clients, it does work with all bzip2 decoders. Sadly, the way it is done can't be replicated from Python without re-implementing a good chunk of bz2, as it hooks deeply into the implementation. So in short, it would be possible to provide it as a C extension and possibly even vendor it, but it is more work than I currently want to do. I haven't looked into the state of pigz.
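
The multi-stream problem can be reproduced with the standard library alone. The sketch below concatenates independently compressed chunks, which approximates the shape of pbzip2-style output (an assumption; pbzip2 itself is not involved), and feeds the result to an incremental decompressor.

```python
import bz2

chunks = [b"first chunk " * 100, b"second chunk " * 100]

# Compress each chunk on its own and concatenate the results, giving
# several independent bzip2 streams back to back.
multi_stream = b"".join(bz2.compress(chunk) for chunk in chunks)

# An incremental decompressor stops at the end of the first stream and
# leaves everything after it in unused_data.
decompressor = bz2.BZ2Decompressor()
first = decompressor.decompress(multi_stream)
leftover = decompressor.unused_data

# The one-shot bz2.decompress() in Python 3.3+ loops over the streams.
whole = bz2.decompress(multi_stream)

print(len(first), len(leftover), len(whole))
```

Here `first` holds only the first chunk, so a client built around a single `BZ2Decompressor` silently loses the rest unless it restarts on `unused_data`. The one-shot `bz2.decompress()` in Python 3.3 and later does handle concatenated streams, but that does not help clients that decompress incrementally or run on older Python versions.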