This is an archive of the discontinued Mercurial Phabricator instance.

lfs: add a switch to `hg verify` to ignore the content of blobs
ClosedPublic

Authored by mharbison72 on Dec 23 2019, 2:47 AM.

Details

Summary

Trying to validate the fulltext of an external revision causes missing blobs to
be downloaded and cached. Since the downloads aren't batch prefetched[1] and
aren't compressed, this can be expensive both in terms of time and space.

I made this a tri-state instead of a simple bool because there's an existing
(undocumented) config to handle this, and it would be weird if hg verify were
to suddenly start ignoring that config but an hg recover initiated verify
honors it. Since this uses the same config setting, it too will skip
rename verification (which requires fulltext, but not for LFS).

[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2018-April/116118.html

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mharbison72 created this revision.Dec 23 2019, 2:47 AM
pulkit added a subscriber: pulkit.Dec 23 2019, 12:06 PM
pulkit added inline comments.
hgext/lfs/__init__.py
408

IIUC, we can have a --lfs flag and --no-lfs will work.

Also by default, we may not want to verify large files content?

indygreg requested changes to this revision.Dec 23 2019, 12:32 PM
indygreg added a subscriber: indygreg.

Per Pulkit's review, please change the argument to --lfs, as --no-lfs should automatically work.

Also, I didn't realize verify.skipflags existed and exposes low-level bit flags of revlog storage. That's a massive abstraction leak and I'd be in favor of reinventing that feature so it worked based on string feature names.

This revision now requires changes to proceed.Dec 23 2019, 12:32 PM

Per Pulkit's review, please change the argument to --lfs, as --no-lfs should automatically work.

Does my response to @pulkit change your mind any? (Not sure how familiar you are with the lfs internals- I apparently forgot a lot myself).

Also, I didn't realize verify.skipflags existed and exposes low-level bit flags of revlog storage. That's a massive abstraction leak and I'd be in favor of reinventing that feature so it worked based on string feature names.

Any thumbnail sketch of what this should look like? Having to know that things are externally stored or that rename checking has special requirements for example, also seems leaky to a lesser degree. I was surprised that revlog bits were being passed around like this too. My plan is/was to add an lfs configbool that's documented, which could toggle this on or off, so users don't need to know bit patterns. (I was also surprised that hex numbers in the config don't work, and it pukes when you fetch the value.) Obviously I wouldn't want the lack of --lfs or --no-lfs to override this. IDK if it's better to have an LFS config for it's bit, and then another for that other bit... or some setting that is a list of skip names. I assume whatever pattern is designed here would influence the internal replacement.

hgext/lfs/__init__.py
408

The way it is now, both --lfs and --no-lfs are accepted. I actually started with --lfs, defaulting to on, but then changed it and forget why. I'll try to figure that out today, but it may have been the code was simpler when defaulting to verifying the blobs and honoring the config setting. It may have also had to do with shifting to verifying the locally accessible blobs later in the series. I didn't want to over complicate things with --none, --local-only, and --all. The options with largefiles are kind of annoying. I guess the question is, "What would an average user who doesn't know hg internals want/understand?"

As far as defaulting to *not* verifying blobs, I'm uncertain. I'd think that verifying local blobs wouldn't add much overhead. But the deciding factor for me was that there's no way to verify the content in the revlog without verifying the blob itself- the stored hash in the revlog is for the blob, not the stored JSON data. That's why I asked in IRC the other day about how strong the guarantees are about repo integrity after running a verify. IIRC, largefiles doesn't verify content by default, but it will at least make sure the revlog for the standin file is OK.

mharbison72 requested review of this revision.Jan 8 2020, 11:37 PM

Setting this back to wanting a review because I suspect my response questions were overlooked with vacations, and it won't show up in anybody's dashboard otherwise.

pulkit accepted this revision.Jan 14 2020, 10:01 AM
This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.