D11204 hgwebdir: avoid systematic full garbage collection

This is an archive of the discontinued Mercurial Phabricator instance.

hgwebdir: avoid systematic full garbage collection
ClosedPublic

Authored by gracinet on Jul 20 2021, 12:19 PM.

Details

Summary

Forcing a systematic full garbage collection upon each request
can seriously harm performance. This is reported as
https://bz.mercurial-scm.org/show_bug.cgi?id=6075

With this change we're performing the full collection according
to a new setting, experimental.web.full-garbage-collection-rate.
The default value is 1, which doesn't change the behavior and will
allow us to test on real use cases. If the value is 0, no full garbage
collection occurs.

Regardless of the value of the setting, a partial garbage collection
still occurs upon each request (not attempting to collect objects from
the oldest generation). This should be enough to take care of
reference cycles that have been created by the last request
(assessing this requires changing the setting to a value other than 1).
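As an illustration, the intended per-request behaviour amounts to something
like the sketch below (the class and attribute names are made up for the
example, not the actual hgweb code):

import gc

class GcPolicy:
    """Illustrative sketch: full collection every `rate` requests,
    partial collection (young generations only) otherwise."""

    def __init__(self, rate):
        # rate corresponds to experimental.web.full-garbage-collection-rate
        self.rate = rate
        self.requests = 0

    def after_request(self):
        self.requests += 1
        if self.rate and self.requests % self.rate == 0:
            gc.collect()  # full collection, including the oldest generation
        else:
            gc.collect(generation=1)  # leaves the oldest generation alone

policy = GcPolicy(rate=1)  # default: full collection on every request
policy.after_request()
policy = GcPolicy(rate=0)  # rate 0: never a full collection
policy.after_request()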

In my experience chasing memory leaks in Mercurial servers,
the full collection never reclaimed any memory, but this is with
Python 3 and biased towards small repositories.

On the other hand, as explained in the Python developer docs [1],
frequent full collections are very harmful in terms of performance if
lots of objects survive the collection, and hence stay in the
oldest generation. Note that gc.collect() is indeed trying to
collect the oldest generation [2]. This happens usually in two cases:

  • unwanted lingering objects (i.e., an actual memory leak that the GC cannot do anything about). Sadly, we have lots of those these days.
  • desirable long-term objects, typically in caches (not inner caches carried by repositories, which should be collected with them). This is a subject of interest for the Heptapod project.

In short, the flat rate that this change still permits is
probably a bad idea in most cases, and the default value can
be tweaked later on (or even be set to 0) according to experiments
in the wild.

The test is inspired by test-hgwebdir-paths.py.

[1] https://devguide.python.org/garbage_collector/#collecting-the-oldest-generation
[2] https://docs.python.org/3/library/gc.html#gc.collect

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

gracinet created this revision. Jul 20 2021, 12:19 PM
pulkit accepted this revision. Jul 20 2021, 3:19 PM
This revision is now accepted and ready to land. Jul 20 2021, 3:19 PM
av6 added a subscriber: av6. Jul 20 2021, 3:47 PM

Thank you for caring about hgweb, it doesn't get this treatment often.

It is possible, though, that a value like 100 or 1000 could be a good trade-off if someone has a repository for which the GC can actually mitigate excessive memory usage.

I feel that you're downplaying the problem. The original ff2370a70fe8 states that every raw-file request to e.g. the firefox repo leaks ~100 MB, and I don't think people would like to have *each* hgwebdir process get to 10-100 GB before it gets gc'd.

Here's how to check if the issue is still present in the current code on python3:

hg serve --web-conf=foo.conf

where foo.conf contains:

[paths]
/ = /path/to/repos/*
I'm going to use hg-committed as an example repo because it's reasonably sized and readily available. Just browsing around in an hg-committed clone locally makes the hg serve process quickly grow to 1 GB rss over practically nothing (first page of log, directory listing, tags, branches, etc). It grows by 100 to 300 MB per request.

Now, I know hgweb is supposed to serve only actual generated content from the repo and we're here making it serve static files as well, but this memory-leaking behavior depends on the way hgweb is deployed, and even in perfect setups this problem can manifest itself if e.g. the WSGI runner decides to use multiple threads or adjust gc frequency (or if a random spider starts hitting all the URLs on the server). I haven't actually figured out what makes hgweb in gunicorn leak, even though it shouldn't be multithreaded, I think? It was a long time ago, but I remember that gc.collect() at least made hgweb processes manageable for a small vps.
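A rough way to reproduce this kind of measurement (the PID argument, port and
URLs are placeholders for whatever hg serve --web-conf=foo.conf is running;
it reads /proc, so Linux only):

import sys
import urllib.request

def rss_kb(pid):
    # VmRSS of the server process, in kilobytes (Linux /proc).
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

pid = int(sys.argv[1])  # pid of the running `hg serve` process
urls = [
    'http://localhost:8000/hg-committed/',
    'http://localhost:8000/hg-committed/tags',
    'http://localhost:8000/hg-committed/branches',
]
for url in urls:
    urllib.request.urlopen(url).read()
    print('%8d kB after %s' % (rss_kb(pid), url))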

desirable long-term objects, typically in caches. This is an area of interest of mine.

When I looked at why hg-committed repo takes so much memory, the biggest consumer by far was obsstore and its cached attributes. obsstore is not only fitting the entirety of .hg/store/obsstore (hundreds of thousands of obsmarkers) into memory in a not very memory-friendly format, it's doing it multiple times. obsstore.successors and obsstore.predecessors basically contain the same obsmarkers, just reorganized for different uses. They all use basic python structures. This takes about 300 MB of memory for every instance of hg-committed repo. And if you create an instance of that for every request, you get crazy memory consumption way before python figures out that maybe it should collect some unused objects.
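A rough way to check that figure for a given clone (the repository path is a
placeholder, and tracemalloc only counts allocations made after start(), which
is enough for a comparison like this):

import tracemalloc
from mercurial import hg, ui as uimod

tracemalloc.start()
repo = hg.repository(uimod.ui.load(), b'/path/to/hg-committed')
# Touch the cached attributes discussed above so they actually get built.
nsucc = len(repo.obsstore.successors)
npred = len(repo.obsstore.predecessors)
current, peak = tracemalloc.get_traced_memory()
print('successors=%d predecessors=%d peak=%.1f MB' % (nsucc, npred, peak / 1e6))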

In fact, python does eventually collect garbage on its own, but it takes like 15 requests. So full-garbage-collection-rate of 100 (let alone 1000) doesn't change anything, since the process will either collect on its own, or it'll grow in size so much that it gets a visit from OOM killer.

@av6 thanks for the detailed perspective. I will do some testing along the lines you suggest tomorrow (it's late here now), but this here is worrying:

I'm going to use hg-committed as an example repo because it's reasonably sized and readily available. Just browsing around in an hg-committed clone locally makes the hg serve process quickly grow to 1 GB rss over practically nothing (first page of log, directory listing, tags, branches, etc). It grows by 100 to 300 MB per request.

The thing is, unless I'm badly mistaken, hgwebdir creates a new localrepo object for each request (that's why I'd like it eventually to use a repo cache). So if those figures you're quoting are without this patch, it means that:

  1. the GC call per request is useless
  2. the leak is very worrying

So, was it with or without the patch applied? Was it with hgwebdir or just hg serve, by the way?

In fact, python does eventually collect garbage on its own, but it takes like 15 requests. So full-garbage-collection-rate of 100 (let alone 1000) doesn't change anything, since the process will either collect on its own, or it'll grow in size so much that it gets a visit from OOM killer.

Perhaps a `gc.collect(generation=1)` would be in order. That one would be more acceptable on each request.
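For reference, the difference is easy to see with the stdlib gc module alone
(a quick interactive illustration, nothing Mercurial-specific):

import gc

print(gc.get_threshold())  # e.g. (700, 10, 10): allocation thresholds per generation
print(gc.get_count())      # objects currently pending in each generation

# gc.collect() with no argument is a full collection of the oldest
# generation (2); gc.collect(generation=1) only examines generations 0
# and 1, so long-lived objects are not repeatedly re-traversed.
print(gc.collect(generation=1))
print(gc.collect())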

baymax edited the summary of this revision. Jul 21 2021, 8:07 AM
baymax updated this revision to Diff 29703.

✅ refresh by Heptapod after a successful CI run (🙌 💚)

av6 added a comment. Jul 21 2021, 8:26 AM

So, was it with or without the patch applied?

These were the figures to show what happens without ff2370a70fe8. More specifically, it's current hg built from public default of hg-committed, but with the gc.collect() line commented out. The leak is indeed very worrying, and currently this gc.collect() on every request is the only solution that we have.

Was it with hgwebdir or just hg serve, by the way ?

hgwebdir, which is used if you provide --web-conf to hg serve.

Perhaps a `gc.collect(generation=1)` would be in order.

I tried it and it was kinda middle of the road. Here's what hgwebdir's behavior with the previously described setup looks like (memory sizes approximate):
  • no gc: grows in size rapidly, shrinks rarely; growth happens on every request, easily gets to 2-2.5 GB, then drops to 390 MB
  • gc.collect(generation=1): grows to 390 MB, stays there half of the time, but can easily grow to 1 GB; drops to 70 MB every so often
  • gc.collect(): grows to 390 MB and usually stays there, occasionally spikes to 750 MB or drops to 70 MB

One thing to keep in mind is that this is not a production setup: a simple hg serve will serve static files as well, and this, I believe, gives python more chances to start a gc after a request. For production the memory leak problem is worse, because there are no static file requests (less possibility of a "natural" gc), but every request creates a repo (guaranteed memory consumption).

Bottom line: this looks riskier than I thought, and we don't have time to investigate before 5.9rc, so I've changed the patch to:

  • not change the default behavior (new setting default is 1)
  • have gc.collect(generation=1) on each request.

Therefore, this can easily be used to experiment with various repos, and it still solves bug 6075, whose reporter can now use the setting instead of patching out the gc.collect().

Also the setting is now in the experimental namespace.

av6 accepted this revision. Jul 21 2021, 9:15 AM

Okay, let's compromise.

pulkit accepted this revision. Jul 21 2021, 4:55 PM

Should this go to stable or should it wait for the end of the freeze? I'm not too sure.

This revision was automatically updated to reflect the committed changes.