This is an archive of the discontinued Mercurial Phabricator instance.

Differential D2883

revlogstore: create and implement an interface for repo files storage
Abandoned · Public

Authored by indygreg on Mar 16 2018, 7:07 PM.

Details

Reviewers

None

Group Reviewers

hg-reviewers

Summary

In order to better support partial clones, we will need to overhaul
local repository storage. This will be a major effort, as many parts
of the code assume things like the existence of revlogs for storing
data.

To help support alternate storage implementations, we will create
interfaces for accessing storage. The idea is that consumers will
all code to an interface and any new interface-conforming
implementation can come along and be swapped in to provide new and
novel storage mechanisms.

This commit starts the process of defining those interfaces.

We define an interface for accessing files data. It has a single
method for resolving the fulltext of an iterable of inputs.

The interface is specifically defined to allow out-of-order responses.
It also provides a mechanism for declaring that files data is censored.
We *may* also want a mechanism to declare LFS or largefiles data.
But I'm not sure how that mechanism works or what the best way to
handle that would be, if any.
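
As a rough illustration of what out-of-order responses mean for a caller, here is a minimal Python 3 sketch. It is not part of this revision: `fake_resolvefilesdata` and `collect` are hypothetical names, and the stand-in store deliberately emits results in reverse order. The point is only that a consumer should key results by (path, node) rather than rely on emission order.

```python
def fake_resolvefilesdata(entries):
    """Hypothetical stand-in store: yields (result, path, node, data)
    tuples in reverse request order to simulate out-of-order responses."""
    # A value of None marks the entry as censored in this toy data set.
    known = {(b'a.txt', b'n1'): b'hello', (b'b.txt', b'n2'): None}
    for path, node in reversed(list(entries)):
        key = (path, node)
        if key not in known:
            yield 'missing', path, node, None
        elif known[key] is None:
            yield 'censored', path, node, None
        else:
            yield 'ok', path, node, known[key]


def collect(results):
    """Index results by (path, node) so emission order does not matter."""
    found, missing, censored = {}, [], []
    for result, path, node, data in results:
        if result == 'ok':
            found[(path, node)] = data
        elif result == 'missing':
            missing.append((path, node))
        elif result == 'censored':
            censored.append((path, node))
        else:
            raise ValueError('unknown result: %r' % (result,))
    return found, missing, censored


wanted = [(b'a.txt', b'n1'), (b'b.txt', b'n2'), (b'c.txt', b'n3')]
found, missing, censored = collect(fake_resolvefilesdata(wanted))
```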

We introduce a new "revlogstore" module to hold the definitions of
these interfaces that use our existing revlog-based storage
mechanism.

An attribute pointing to the "files store" has been added to
localrepository.

No consumers of the new interface have been added. The interface
should still be considered highly experimental and details are
expected to change.

It was tempting to define the interface as one level higher than
file storage - in such a way as to facilitate accessing changeset
and manifest data as well. However, I believe these 3 primitives -
changesets, manifests, and files - each have unique requirements
that will dictate special, one-off methods on their storage
interfaces. I'd rather we define our interfaces so they are
tailored to each type initially. If an implementation wants to
shoehorn all data into a generic key-value blob store, they can
still do that. And we also reserve the right to combine interfaces
in the future. I just think that attempting to have the initial
versions of the interfaces deviate too far from current reality will
make it very challenging to define and implement them.
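
To make the "generic key-value blob store" remark concrete, the following is a minimal Python 3 sketch of an alternate implementation. It is an assumption-laden illustration, not code from this revision: `blobfilesstore` is a hypothetical class, the local `basefilesstore` is a simplified copy of the interface, and marking the method with `abc.abstractmethod` is a choice made here rather than something the revision does.

```python
import abc


class basefilesstore(abc.ABC):
    """Simplified local copy of the files storage interface (assumption)."""

    @abc.abstractmethod
    def resolvefilesdata(self, entries):
        """Yield (result, path, node, data) for each (path, node) entry."""


class blobfilesstore(basefilesstore):
    """Hypothetical files store backed by a plain key-value mapping."""

    def __init__(self, blobs, censored=()):
        self._blobs = blobs                  # {(path, node): bytes}
        self._censored = frozenset(censored)

    def resolvefilesdata(self, entries):
        for path, node in entries:
            key = (path, node)
            if key in self._censored:
                yield 'censored', path, node, None
            elif key in self._blobs:
                yield 'ok', path, node, self._blobs[key]
            else:
                yield 'missing', path, node, None


store = blobfilesstore({(b'a.txt', b'n1'): b'hello'},
                       censored=[(b'b.txt', b'n2')])
results = list(store.resolvefilesdata(
    [(b'a.txt', b'n1'), (b'b.txt', b'n2'), (b'c.txt', b'n3')]))
```

A consumer written against the interface would not need to know whether it is talking to this store or to a revlog-backed one.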

The reason I'm defining and implementing this interface now is to
support new (experimental) wire protocol commands to be used to
support partial clone. Some of these commands will benefit from
aggressive caching. I want to prove out the efficacy of the interfaces
approach by implementing cache-based speedups in the interface layer.
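
One way cache-based speedups could live in the interface layer is a wrapper that sits in front of another files store. The Python 3 sketch below is hypothetical and not part of this revision: `cachingfilesstore` and `countingstore` are illustrative names, the cache is an unbounded in-memory dict, and only 'ok' results are cached. It relies on the interface permitting out-of-order responses, since cache hits are emitted before misses are resolved.

```python
class cachingfilesstore(object):
    """Hypothetical wrapper that caches 'ok' results of another store."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}  # {(path, node): bytes}; unbounded in this sketch

    def resolvefilesdata(self, entries):
        misses = []
        for path, node in entries:
            key = (path, node)
            if key in self._cache:
                # Cache hits are emitted immediately, ahead of any misses.
                yield 'ok', path, node, self._cache[key]
            else:
                misses.append(key)
        for result, path, node, data in self._inner.resolvefilesdata(misses):
            if result == 'ok':
                self._cache[(path, node)] = data
            yield result, path, node, data


class countingstore(object):
    """Toy backing store that counts how many entries it had to look up."""

    def __init__(self, blobs):
        self._blobs = blobs
        self.lookups = 0

    def resolvefilesdata(self, entries):
        for path, node in entries:
            self.lookups += 1
            data = self._blobs.get((path, node))
            result = 'ok' if data is not None else 'missing'
            yield result, path, node, data


inner = countingstore({(b'a.txt', b'n1'): b'hello'})
store = cachingfilesstore(inner)
first = list(store.resolvefilesdata([(b'a.txt', b'n1')]))
second = list(store.resolvefilesdata([(b'a.txt', b'n1'), (b'x', b'n9')]))
```

A real cache would also need to account for data that becomes censored after it was cached; this sketch ignores that.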

Diff Detail

Repository

rHG Mercurial

Lint

Lint Skipped

Unit

Unit Tests Skipped

Event Timeline

indygreg created this revision. Mar 16 2018, 7:07 PM

Herald added a reviewer: hg-reviewers. Mar 16 2018, 7:07 PM

Herald added a subscriber: mercurial-devel.

indygreg added a child revision: D2884: wireproto: experimental command to emit file data. Mar 16 2018, 7:07 PM

indygreg added a parent revision: D2901: wireproto: explicitly track which requests are active. Mar 19 2018, 5:44 PM

indygreg updated this revision to Diff 7149. Mar 19 2018, 7:59 PM

It's probably too early to worry about for the experimenting that you're doing, but at some point, maybe this should also allow yielding the full text in chunks? As it stands now, there are a couple places where LFS has to read in the full file, and one of those places is the filelog/revlog. IIRC, largefiles manages to avoid that completely.

This dated page is the only thing that I could find talking about the issues with that approach:

https://www.mercurial-scm.org/wiki/HandlingLargeFiles

I'm not sure what this should look like either, but it seemed worthwhile to point out that page, with the accompanying discussion of revlog limitations.

I'll rebase this on top of zope.interface (D2928 and friends). Please defer reviewing for now.

In D2883#46712, @mharbison72 wrote:

It's probably too early to worry about for the experimenting that you're doing, but at some point, maybe this should also allow yielding the full text in chunks? As it stands now, there are a couple places where LFS has to read in the full file, and one of those places is the filelog/revlog. IIRC, largefiles manages to avoid that completely.

Yes, we should definitely design the interface such that file fulltexts can be expressed as chunks. That doesn't mean we have to implement things to actually use chunks. But it will at least give us an escape hatch so we can do more reasonable things for large files in the future.
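
As a rough sketch of what "expressed as chunks" could look like (hypothetical, Python 3, not part of this revision), the data field of an 'ok' result could be an iterator of byte chunks instead of a single bytes object. The names `iterchunks` and `resolvefilesdata_chunked`, the dict-backed storage, and the `chunksize` parameter are all assumptions made for illustration.

```python
def iterchunks(data, chunksize):
    """Yield successive slices of data, each at most chunksize bytes."""
    for start in range(0, len(data), chunksize):
        yield data[start:start + chunksize]


def resolvefilesdata_chunked(blobs, entries, chunksize=4):
    """Like resolvefilesdata(), but the data field of an 'ok' result is
    an iterator of chunks rather than one bytes object."""
    for path, node in entries:
        data = blobs.get((path, node))
        if data is None:
            yield 'missing', path, node, None
            continue
        yield 'ok', path, node, iterchunks(data, chunksize)


blobs = {(b'big.bin', b'n1'): b'hello world'}
results = list(resolvefilesdata_chunked(
    blobs, [(b'big.bin', b'n1'), (b'gone', b'n2')]))
# A consumer that wants the fulltext can still join the chunks.
fulltext = b''.join(results[0][3])
```

A store that holds the whole fulltext in memory gains nothing from this shape, but it leaves room for an implementation that streams large files.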

I'm not actively working on this.

Revision Contents
Changeset List

		Path
M		mercurial/localrepo.py (3 lines)
M		mercurial/repository.py (30 lines)
A	M	mercurial/revlogstore.py (37 lines)

Status	Author	Revision
Closed	indygreg	D2987 stringutil: add function to pretty print an object
Closed	indygreg	D2986 wireproto: add frame flag to denote payloads as CBOR
Closed	indygreg	D2985 wireproto: implement custom __repr__ for frame
Closed	indygreg	D2984 keepalive: implement readinto()
Closed	indygreg	D2983 wireproto: port protocol handler to zope.interface
Closed	indygreg	D2982 wireproto: separate commands tables for version 1 and 2 commands
Closed	indygreg	D2981 wireproto: mark SSHv2 as a version 1 transport
Closed	indygreg	D2979 wireproto: stop aliasing wire protocol types (API)
Closed	indygreg	D2951 wireproto: use CBOR for command requests
Closed	indygreg	D2902 wireproto: define frame to represent progress updates
Closed	indygreg	D2948 wireproto: syntax for encoding CBOR into frames
Closed	indygreg	D2947 wireproto: explicit API to create outgoing streams
Closed	indygreg	D2907 wireproto: add streams to frame-based protocol
Closed	indygreg	D2906 wireproto: start to associate frame generation with a stream
Closed	indygreg	D2950 tests: fix duplicate and failing test
Closed	indygreg	D2978 cbor: import CBORDecoder and CBOREncoder
Closed	indygreg	D2949 setup: install cbor packages
Abandoned	indygreg	D2885 RFC: use Redis to cache file data
Changes Planned	indygreg	D2884 wireproto: experimental command to emit file data
Abandoned	indygreg	D2883 revlogstore: create and implement an interface for repo files storage
Closed	indygreg	D2901 wireproto: explicitly track which requests are active
Closed	indygreg	D2900 wireproto: use named arguments when passing around frame data
Closed	indygreg	D2899 wireproto: define attr-based classes for representing frames
Closed	indygreg	D2872 wireproto: define human output side channel frame
Closed	indygreg	D2871 wireproto: service multiple command requests per HTTP request
Closed	indygreg	D2870 wireproto: support for receiving multiple requests
Closed	indygreg	D2869 wireproto: add request IDs to frames
Closed	indygreg	D2860 wireproto: buffer output frames when in half duplex mode
Closed	indygreg	D2858 wireproto: define and implement responses in framing protocol
Closed	indygreg	D2857 wireproto: implement basic command dispatching for HTTPv2
Closed	indygreg	D2856 wireproto: nominally don't expose "batch" to version 2 wire transports
Closed	indygreg	D2852 wireproto: implement basic frame reading and processing
Closed	indygreg	D2851 wireproto: define and implement protocol for issuing requests
Closed	indygreg	D2868 util: prefer "bytesio" to "stringio"
Closed	indygreg	D2850 wireproto: define content negotiation for HTTPv2
Closed	indygreg	D2849 hgweb: also set Content-Type header
Closed	indygreg	D2837 wireproto: require POST for all HTTPv2 requests
Closed	indygreg	D2836 wireproto: define permissions-based routing of HTTPv2 wire protocol
Closed	indygreg	D2834 wireproto: support /api/* URL space for exposing APIs
Closed	indygreg	D2843 url: support suppressing Accept header
Closed	indygreg	D2842 util: don't log low-level I/O calls for HTTP peer
Closed	indygreg	D2841 debugcommands: support sending HTTP requests with debugwireproto
Closed	indygreg	D2726 debugcommands: support connecting to HTTP peers
Closed	indygreg	D2722 url: add HTTP handler that uses a proxied socket
Closed	indygreg	D2721 util: observable proxy objects for sockets
Closed	indygreg	D2840 hgweb: allow defining Server response header for HTTP server
Closed	indygreg	D2839 tests: use $HTTP_DATE$ for Date header
Closed	indygreg	D2720 debugcommands: introduce actions to perform deterministic reads
Closed	indygreg	D2725 httppeer: refactor how httppeer is created (API)
Closed	indygreg	D2724 httppeer: alias url as urlmod
Closed	indygreg	D2723 httppeer: consolidate _requestbuilder assignments and document
Closed	indygreg	D2832 hgweb: remove wsgirequest (API)
Closed	indygreg	D2831 hgweb: store the raw WSGI environment dict
Closed	indygreg	D2830 hgweb: remove dead wsgirequest code
Closed	indygreg	D2829 hgweb: port to new response API
Closed	indygreg	D2828 hgweb: pass modern request type into templater()
Closed	indygreg	D2827 hgweb: use modern response type for index generation
Closed	indygreg	D2826 hgweb: don't pass wsgireq to makeindex and other functions
Closed	indygreg	D2825 hgweb: replace PATH_INFO with dispatchpath
Closed	indygreg	D2824 hgweb: rewrite path generation for index entries
Closed	indygreg	D2823 hgweb: construct {url} with req.apppath
Closed	indygreg	D2822 hgweb: support constructing URLs from an alternate base URL
Closed	indygreg	D2821 hgweb: clarify that apppath begins with a forward slash
Closed	indygreg	D2820 hgweb: change how dispatch path is reported
Closed	indygreg	D2819 hgweb: refactor repository name URL parsing
Closed	indygreg	D2818 tests: add test coverage for parsing WSGI requests
Closed	indygreg	D2817 hgweb: construct static URL like hgweb does
Closed	indygreg	D2816 hgweb: remove unused **map argument
Closed	indygreg	D2815 hgweb: extract entries() to standalone function
Closed	indygreg	D2814 hgweb: move rawentries() to a standalone function
Closed	indygreg	D2813 hgweb: move archivelist to standalone function
Closed	indygreg	D2812 hgweb: move readallowed to a standalone function
Closed	indygreg	D2805 hgweb: remove some use of wsgireq in hgwebdir
Closed	indygreg	D2804 hgweb: fix a bug due to variable name typo
Closed	indygreg	D2803 hgweb: stop passing req and tmpl into @webcommand functions (API)
Closed	indygreg	D2802 hgweb: pass modern request type into various webutil functions (API)
Closed	indygreg	D2801 hgweb: don't redundantly pass templater with requestcontext (API)
Closed	indygreg	D2800 hgweb: use templater on requestcontext instance
Closed	indygreg	D2799 hgweb: add a sendtemplate() helper function
Closed	indygreg	D2798 hgweb: use web.req instead of req.req
Closed	indygreg	D2797 hgweb: stop setting headers on wsgirequest
Closed	indygreg	D2796 hgweb: always return iterable from @webcommand functions (API)
Closed	indygreg	D2795 hgweb: send errors using new response API
Closed	indygreg	D2794 hgweb: refactor 304 handling code
Closed	indygreg	D2793 hgweb: transition permissions hooks to modern request type (API)
Closed	indygreg	D2792 hgweb: port archive command to modern response API
Closed	indygreg	D2791 hgweb: refactor fake file object proxy for archiving
Closed	indygreg	D2790 tests: additional test coverage of archive web command
Closed	indygreg	D2789 hgweb: port static file handling to new response API
Closed	indygreg	D2788 hgweb: remove one-off routing for file?style=raw
Closed	indygreg	D2787 hgweb: port most @webcommand to use modern response type
Closed	indygreg	D2786 hgweb: support using new response object for web commands
Closed	indygreg	D2785 hgweb: inline caching() and port to modern mechanisms
Closed	indygreg	D2784 hgweb: expose repo name on parsedrequest
Closed	indygreg	D2783 hgweb: expose URL scheme and REMOTE_* attributes
Closed	indygreg	D2782 hgweb: remove wsgirequest.form (API)
Closed	indygreg	D2781 hgweb: perform all parameter lookup via qsparams
Closed	indygreg	D2780 hgweb: set variables in qsparams
Closed	indygreg	D2779 hgweb: use our new request object for "style" parameter
Closed	indygreg	D2776 hgweb: use a multidict for holding query string parameters
Closed	indygreg	D2775 hgweb: create dedicated type for WSGI responses
Closed	indygreg	D2778 tests: add test for a wire protocol request to wrong base URL
Closed	indygreg	D2773 hgweb: remove support for short query string based aliases (BC)
Closed	indygreg	D2774 hgweb: remove support for POST form data (BC)
Closed	indygreg	D2771 hgweb: expose input stream on parsed WSGI request object
Closed	indygreg	D2770 hgweb: make parsedrequest part of wsgirequest
Closed	indygreg	D2769 hgweb: refactor the request draining code
Closed	indygreg	D2768 hgweb: use a capped reader for WSGI input stream
Closed	indygreg	D2767 hgweb: document continuereader
Closed	indygreg	D2749 hgweb: remove wsgirequest.__iter__
Closed	indygreg	D2748 hgweb: remove wsgirequest.read()
Closed	indygreg	D2747 hgweb: remove unused methods on wsgirequest
Closed	indygreg	D2746 wireprotoserver: remove unused argument from _handlehttperror()
Closed	indygreg	D2745 hgweb: store and use request method on parsed request
Closed	indygreg	D2744 hgweb: handle CONTENT_LENGTH
Closed	indygreg	D2743 wireprotoserver: access headers through parsed request
Closed	indygreg	D2742 hgweb: parse and store HTTP request headers
Closed	indygreg	D2741 wireprotoserver: remove broken optimization for non-httplib client
Closed	indygreg	D2740 wireprotoserver: move all wire protocol handling logic out of hgweb
Closed	indygreg	D2739 hgweb: use parsed request to construct query parameters
Closed	indygreg	D2738 hgweb: only recognize wire protocol commands from query string (BC)
Closed	indygreg	D2737 hgweb: teach WSGI parser about query strings
Closed	indygreg	D2736 hgweb: use the parsed application path directly
Closed	indygreg	D2735 hgweb: use computed base URL from parsed request
Closed	indygreg	D2734 hgweb: parse WSGI request into a data structure
Closed	indygreg	D2733 hgweb: always use "?" when writing session vars
Closed	indygreg	D2732 hgweb: rename req to wsgireq
Closed	indygreg	D2731 hgweb: validate WSGI environment dict
Closed	indygreg	D2730 hgweb: ensure all wsgi environment values are str

Diff 7149

mercurial/localrepo.py

     obsolete,
     pathutil,
     peer,
     phases,
     pushkey,
     pycompat,
     repository,
     repoview,
+    revlogstore,
     revset,
     revsetlang,
     scmutil,
     sparse,
     store,
     subrepoutil,
     tags as tagsmod,
     transaction,
...
         self.cachevfs.createmode = self.store.createmode
         if (self.ui.configbool('devel', 'all-warnings') or
             self.ui.configbool('devel', 'check-locks')):
             if util.safehasattr(self.svfs, 'vfs'): # this is filtervfs
                 self.svfs.vfs.audit = self._getsvfsward(self.svfs.vfs.audit)
             else: # standard vfs
                 self.svfs.audit = self._getsvfsward(self.svfs.audit)
         self._applyopenerreqs()
+        self.filesstore = revlogstore.revlogfilesstore(self.svfs)
+
         if create:
             self._writerequirements()

         self._dirstatevalidatewarned = False

         self._branchcaches = {}
         self._revbranchcache = None
         self.filterpats = {}

mercurial/repository.py

             return

         raise error.CapabilityError(
             _('cannot %s; remote repository does not support the %r '
               'capability') % (purpose, name))

 class legacypeer(peer, _baselegacywirecommands):
     """peer but with support for legacy wire protocol commands."""
+
+class basefilesstore(object):
+    """Storage interface for repository files data.
+
+    This interface defines mechanisms to access repository files data in a
+    storage agnostic manner. The goal of this interface is to abstract storage
+    implementations so implementation details of storage don't leak into
+    higher-level repository consumers.
+    """
+
+    __metaclass__ = abc.ABCMeta
+
+    def resolvefilesdata(self, entries):
+        """Resolve the fulltext data for an iterable of files.
+
+        Each entry is defined by a 2-tuple of (path, node).
+
+        The method is a generator that emits results as they become available.
+        Each emitted item is a 4-tuple of (result, path, node, data), where
+        the first element can be one of the following to represent the operation
+        result for this request:
+
+        ok
+           Successfully resolved fulltext data. Data field is a bytes-like
+           object.
+        missing
+           Data for this item not found. Data field is ``None``.
+        censored
+           Data for this revision is censored. Data field is ``None``.
+        """

mercurial/revlogstore.py

This file was added.

+# revlogstore.py - storage interface for repositories using revlog storage
+#
+# Copyright 2018 Gregory Szorc <gregory.szorc@gmail.com>
+#
+# This software may be used and distributed according to the terms of the
+# GNU General Public License version 2 or any later version.
+
+from __future__ import absolute_import
+
+from . import (
+    error,
+    filelog,
+    repository,
+)
+
+class revlogfilesstore(repository.basefilesstore):
+    """Files storage layer using revlogs for files storage."""
+
+    def __init__(self, svfs):
+        self._svfs = svfs
+
+    def resolvefilesdata(self, entries):
+        for path, node in entries:
+            fl = filelog.filelog(self._svfs, path)
+
+            try:
+                rev = fl.rev(node)
+            except error.LookupError:
+                yield 'missing', path, node, None
+                continue
+
+            if fl.iscensored(rev):
+                yield 'censored', path, node, None
+                continue
+
+            data = fl.read(node)
+            yield 'ok', path, node, data

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	7080		Mar 16 2018, 7:07 PM	★	★
Diff 2	7149		Mar 19 2018, 7:59 PM	★	★