diff --git a/mercurial/help.py b/mercurial/help.py --- a/mercurial/help.py +++ b/mercurial/help.py @@ -369,6 +369,7 @@ _(b'Extension API'), loaddoc(b'extensions', subdir=b'internals'), ), + ([b"locks"], _(b"Lock Files"), loaddoc(b"locks", subdir=b"internals")), ( [b'mergestate'], _(b'Mergestate'), @@ -380,11 +381,21 @@ loaddoc(b'requirements', subdir=b'internals'), ), ( + [b"repolayout"], + _(b"Repository Layout"), + loaddoc(b"repolayout", subdir=b"internals"), + ), + ( [b'revlogs'], _(b'Revision Logs'), loaddoc(b'revlogs', subdir=b'internals'), ), ( + [b"store-revlog"], + _(b"Revlog Store"), + loaddoc(b"store-revlog", subdir=b"internals"), + ), + ( [b'wireprotocol'], _(b'Wire Protocol'), loaddoc(b'wireprotocol', subdir=b'internals'), diff --git a/mercurial/helptext/internals/locks.txt b/mercurial/helptext/internals/locks.txt new file mode 100644 --- /dev/null +++ b/mercurial/helptext/internals/locks.txt @@ -0,0 +1,73 @@ +This document describes the file-based locking mechanism that Mercurial +uses. + +Lock Holders +============ + +Each lock has the concept of a *holder* or *owner*. The holder is identified +by the content of the lock consisting of a string of the form +``<host>:<pid>``. + +``<host>`` is the value of the ``gethostname()`` POSIX function and +``<pid>`` is the process ID. + +The ``<host>`` component can be supplemented with an additional identifier +if the PID may be ambiguous. For example, different containers may have +identical hostnames and PIDs due to the existence of PID namespaces. On Linux, +the hostname component is supplemented with the value of ``/proc/self/ns/pid`` +(if available) to help disambiguate PID namespaces. + +The *holder* of the lock might be displayed in user-facing output when informing +about the existence of an already held lock. + +Lock Files +========== + +Locks are represented by files on the filesystem. 
+ +On platforms where symlinks are available, a lock file is a symlink where +the target of the symlink is the ``<host>:<pid>`` value denoting the +lock holder. + +If symlinks are not available, locks are normal files and the content of +the file is the ``<host>:<pid>`` value denoting the lock holder. + +Acquiring A Lock +================ + +If symlinks are available, an attempt is made to create a symlink. + +If symlinks are not available, an attempt is made to ``open()`` a file +with ``O_CREAT | O_WRONLY | O_EXCL`` flags, then write the lock content, +and finally ``close()`` the file. + +If the symlink creation or file open fails due to an existing lock file, +software attempts to deal with the failure. + +An attempt is made to read the content of the lock file. If the current +process or parent process holds the lock, the lock may be considered +acquired if the software architecture allows this. For example, a process +may assume that there aren't multiple threads operating independently and +any lock held by that process is valid. + +Software may determine that a lock should be forcefully *broken*. For +example, it may determine that the process holding the lock is not running +or it may have been told to forcefully acquire the lock by the user. + +A lock is broken by creating a sibling lock file named ``<lock>.break``. +If this lock can be acquired, the original lock file should be read again. +If it hasn't changed, the original lock file can be removed, effectively +releasing the lock. This should all be done with the ``<lock>.break`` file +held. + +Software may attempt to sleep and retry lock acquisition if an existing +lock is held. + +Releasing a Lock +================ + +A lock is released by deleting the corresponding lock file. + +In cases where the lock acquisition is nested, care needs to be taken to not +remove the lock file before all nested locks are also released. There is no +file-based mechanism to track nesting of locks: this must be done in software. 
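The acquisition and release steps described in locks.txt can be illustrated with a minimal Python sketch. It is not Mercurial's implementation: the function names (``lock_holder``, ``try_acquire``, ``read_holder``, ``release``) are hypothetical, the holder string omits the PID-namespace suffix, and lock breaking, retries and nesting are left out.

```python
import os
import socket


def lock_holder():
    """Build a holder string from the hostname and PID (no namespace suffix)."""
    return "%s:%d" % (socket.gethostname(), os.getpid())


def try_acquire(lock_path, holder):
    """Make one attempt to create the lock file; return True on success."""
    try:
        # Preferred form: a symlink whose target is the holder string.
        os.symlink(holder, lock_path)
        return True
    except FileExistsError:
        return False  # somebody already holds the lock
    except (AttributeError, NotImplementedError, OSError):
        pass  # assumption: any other failure means symlinks are unavailable
    try:
        # Fallback: exclusively create a regular file with the same content.
        fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as fh:
        fh.write(holder.encode("utf-8"))
    return True


def read_holder(lock_path):
    """Return the holder recorded in an existing lock file."""
    try:
        return os.readlink(lock_path)
    except OSError:
        # Not a symlink: the holder is the file content.
        with open(lock_path, "rb") as fh:
            return fh.read().decode("utf-8")


def release(lock_path):
    """Release the lock by deleting the lock file."""
    os.unlink(lock_path)
```

A caller that fails to acquire the lock would typically call ``read_holder()`` to decide whether to wait, report the holder to the user, or attempt to break the lock.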
diff --git a/mercurial/helptext/internals/repolayout.txt b/mercurial/helptext/internals/repolayout.txt new file mode 100644 --- /dev/null +++ b/mercurial/helptext/internals/repolayout.txt @@ -0,0 +1,69 @@ +A Mercurial repository is divided into user-facing and non user-facing +components. The "working directory" (where files are checked out) is +the user-facing component. The ``.hg`` directory (where Mercurial +maintains all its internal data and state) is the non user-facing +component. + +This document describes the layout and operation of the ``.hg`` +directory. In doing so, it effectively describes the file-based +API and access patterns that repository readers must conform to. + +Requirements File +================= + +When opening repositories, an attempt is made to read a ``requires`` file. +This file contains a newline (``\n``) delimited list of strings denoting +**required** capabilities of readers. If a client sees any unrecognized +entry in this file, it should fail fast and refuse to continue, as any +unknown value represents undefined behavior to that client. + +Absence of the ``requires`` file is interpreted as an empty set of +requirements. This translates to the legacy layout and semantics from +before Mercurial 0.9.2 (released 2006). This scenario should be rare +in modern times. + +See :hg:`help internals.requirements` for the full list of requirements +that are defined. Keep in mind that extensions may define their own +requirements. + +Configuration File +================== + +The ``.hg`` directory can optionally contain an ``hgrc`` file. If present, +this Mercurial configuration file will be loaded and processed before any +other file is accessed. This enables per-repository settings - including +extensions - to have near full control over repository opening behavior. + +Legacy Layout +============= + +Versions of Mercurial before 0.9.2 (released 2006) use a legacy layout +that lacks some substantial features from newer versions. 
Specifically, +it lacks a ``requires`` file (explained above) and therefore assumes +locations of certain files. + +These legacy versions of Mercurial always attempt to load a +``.hg/00changelog.i`` revlog when opening a repo. Thus, modern +repositories contain a placeholder ``.hg/00changelog.i`` file with a revlog +version that these legacy versions of Mercurial will refuse to load. This +effectively locks out legacy clients from accessing modern repositories. + +All content below documents the modern repository layout and behavior. + +The Store +========= + +The *store* is an abstract concept referring to a storage backend holding +repository data (such as changelog, manifest, and file revisions) and +history metadata, such as phases and obsolescence markers. + +Modifications to the store are made through a locking mechanism to ensure +an exclusive writer and are transactional in nature: changes either all +apply or are all rolled back to the previous transaction. + +There exist multiple implementations of a Mercurial store: as long as the +implementation conforms to the software-defined interfaces in Mercurial, +Mercurial can interact with the store. + +See :hg:`help internals.store-revlog` for details about the revlog-based +store, which is the default store used by Mercurial. diff --git a/mercurial/helptext/internals/store-revlog.txt b/mercurial/helptext/internals/store-revlog.txt new file mode 100644 --- /dev/null +++ b/mercurial/helptext/internals/store-revlog.txt @@ -0,0 +1,382 @@ +This document describes the file layout and semantics of the +*revlog store*, a concrete repository storage implementation/backend +which uses revlogs for revision storage. + +The ``store`` Requirement +========================= + +Use of the revlog store is denoted by the presence of the ``store`` +requirement in the ``.hg/requires`` file. This requirement was introduced +in Mercurial 0.9.2 and has existed since the ``requires`` file was +introduced. 
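As an illustration of how a reader might process ``.hg/requires``, here is a minimal Python sketch. The helper names and the ``supported`` set are hypothetical; the behavior follows the description of the requirements file (newline-delimited entries, a missing file meaning an empty set, unknown entries causing a hard failure).

```python
import os


def read_requirements(hg_dir):
    """Return the set of requirement strings listed in ``<hg_dir>/requires``.

    A missing file is treated as an empty set, i.e. the legacy layout.
    """
    try:
        with open(os.path.join(hg_dir, "requires"), "rb") as fh:
            data = fh.read()
    except FileNotFoundError:
        return set()
    return {line for line in data.split(b"\n") if line}


def check_requirements(requirements, supported):
    """Fail fast if the repository needs something this reader lacks."""
    unknown = requirements - supported
    if unknown:
        names = b", ".join(sorted(unknown)).decode("ascii", "replace")
        raise ValueError("unsupported repository requirements: " + names)


def uses_revlog_store(requirements):
    """The revlog store lives in ``.hg/store`` when ``store`` is required."""
    return b"store" in requirements
```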
+ +If this requirement is present, the store physically resides in a +``.hg/store`` directory, which we'll call the *store root*. + +If the ``shared`` or ``relshared`` repository requirements are present, +the path to the *store* resides not in ``.hg/store`` from the ``.hg`` +directory in question but in a path defined by the ``.hg/sharedpath`` file. + +If the ``shared`` requirement is present, the path as defined in +``.hg/sharedpath`` is used verbatim. If ``relshared`` is present, the +path in this file is interpreted as relative to the ``.hg`` directory. + +If the ``store`` requirement is not present, the files belonging to the +store reside in the ``.hg`` directory itself, not in an isolated directory. +This scenario should be rare. + +Content +======= + +The revlog store contains revlogs, other data files such as phases and +obsolescence markers data, and files related to locking and transaction +management. + +The *store root* contains some well-known revlog files: + +``00changelog.i`` + The changelog revlog (stores changesets describing commits). +``00manifest.i`` + The root manifest revlog (stores lists of file revisions at various + points in time). + +Both ``00changelog.i`` and ``00manifest.i`` can have a corresponding +``.d`` file if the revlog is not *inline*. See :hg:`help internals.revlogs` +for more. + +Additional files in the *store root* directory include: + +``bookmarks`` + Defines repository bookmarks. +``fncache`` + Holds a list of known data files in the store. +``journal`` + Defines an in-progress transaction. +``narrowspec`` + Defines how storage is *narrowed* and not complete (a subset of + repository history). +``obsstore`` + Defines obsolescence markers. +``phaseroots`` + Defines the phase state of changesets. +``undo`` and ``undo.*`` + Defines how a transaction can be rolled back. + +Not all files are present in all stores. + +The *store root* may also contain the sub-directories ``data`` and +``meta``. 
+ +Conceptually, the store is divided into top-level files and *data* +files. Top-level files are specialized and well-defined files (see +above lists) that define primitives common to Mercurial repositories. +Data files provide path-based storage for repository content, notably +the specific file paths under version control. + +Path-Based Revlog Storage +========================= + +The ``data`` and ``meta`` directories in the store effectively provide +a key-value store where the keys are a relative filesystem path and values +are revlogs, which themselves are a mapping of a node to revision data. + +The ``data`` directory holds history of individual files under version +control. The ``meta`` directory holds everything else, notably *tree +manifest* revlogs. + +The store's logical keys are filesystem paths and to Mercurial a path +is any sequence of bytes ending at the first ``\0`` byte and ``/`` is +the directory separator. Since various operating systems and filesystems +are incapable of storing all path byte sequences supported by Mercurial, +there is an encoding mechanism to convert requested store keys into +concrete filesystem paths. This allows bypassing limits around lengths of +paths, prohibited character limitations or filename sequences, lack of +case preservation, and lack of case sensitivity. + +The path encoding strategy uses a combination of discrete encoding steps. +The set of encoders to use is defined by repository requirements. + +The following sections describe the individual path encoders. + +Directory Encoder +----------------- + +The *directory encoder* performs the following substitutions in the +following order: + +* ``.hg/`` -> ``.hg.hg/`` +* ``.i/`` -> ``.i.hg/`` +* ``.d/`` -> ``.d.hg/`` + +Decoding applies those substitutions in reverse order. On the decoder +side, if the input string does not contain ``.hg/``, no transformation +is necessary. 
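The substitutions of the directory encoder can be sketched in a few lines of Python. The function names are illustrative only and are not taken from the Mercurial source; the replacement order follows the description above.

```python
def encode_dir(path: bytes) -> bytes:
    """Escape path components that look like ``.hg`` dirs or revlog files."""
    return (
        path.replace(b".hg/", b".hg.hg/")
        .replace(b".i/", b".i.hg/")
        .replace(b".d/", b".d.hg/")
    )


def decode_dir(path: bytes) -> bytes:
    """Undo encode_dir() by applying the substitutions in reverse order."""
    if b".hg/" not in path:
        return path  # nothing was escaped
    return (
        path.replace(b".d.hg/", b".d/")
        .replace(b".i.hg/", b".i/")
        .replace(b".hg.hg/", b".hg/")
    )
```

Only components followed by ``/`` are affected, so a trailing file name such as ``foo.i`` passes through unchanged.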
+ +Encoding examples: + +* ``foo.i`` -> ``foo.i`` +* ``foo.i/bar.i`` -> ``foo.i.hg/bar.i`` +* ``foo.hg/bar`` -> ``foo.hg.hg/bar`` + +The goal of the directory encoder is to escape directory path components +resembling ``.hg`` directories and revlog files. + +Filename Encoder +---------------- + +The *filename encoder* performs normalization of paths to help ensure +filesystem portability. This encoder effectively maps input bytes to output +byte sequences. The byte transformations are as follows: + +* Byte values 0-31, 34, 42, 58, 60, 62, 63, 92, 124, and 126-255 are encoded + to ``~xx``, where ``xx`` represents the lowercase hexadecimal encoding of + that value. e.g. integer ``24`` would be encoded as ``~18``. +* Byte values 65 through 90 (inclusive) (representing ASCII A-Z) and 95 (ASCII + ``_``) are encoded to ``_x``, where ``x`` is the lowercase version of the + input. e.g. ``A`` is encoded as ``_a`` and ``_`` as ``__``. +* All other input byte sequences are preserved as-is. + +This encoding escapes characters outside the standard printable ASCII +range, special/reserved characters on common filesystems (notably Windows), +and uppercase letters (to account for case sensitivity/preservation +limitations). + +Decoders apply the reverse transformation. + +Encoding examples: + +* ``FOO`` -> ``_f_o_o`` +* ``foo:bar?`` -> ``foo~3abar~3f`` +* ``foo\x07bar\xadbaz`` -> ``foo~07bar~adbaz`` + +Auxiliary Encoder +----------------- + +The *auxiliary encoder* normalizes some special string sequences which can't +be represented on certain operating systems or filesystems, notably Windows. + +This encoder escapes the byte sequences ``aux``, ``con``, ``prn``, ``nul``, +``com<N>``, and ``lpt<N>``, where ``<N>`` can be the digits 1-9. It also +handles trailing spaces and ``.``. + +The input path is first split on ``/`` to produce its discrete components. 
+ +For each path component, we search the bytes up to the first ``.`` or +end of string for an exact match against one of the special byte sequences +above (e.g. ``con`` or ``lpt6``). If there is a match, the third letter/byte +is transformed to the byte sequence ``~xx``, where ``xx`` is the lowercase +hexadecimal value of that byte value. + +In addition, if the last byte of the path component is a space or ``.``, +that byte is encoded as ``~xx``, where ``xx`` is the lowercase hexadecimal +value of that byte value. + +Encoding examples: + +* ``.com1com2`` -> ``.com1com2`` +* ``aux.txt`` -> ``au~78.txt`` +* ``foo/lpt6.bar`` -> ``foo/lp~746.bar`` +* ``foo.`` -> ``foo~2e`` + +Lower Encoder +------------- + +The *lower encoder* normalizes inputs to lowercase and escapes special +byte sequences. It is similar to the *filename encoder* in that the +transformation consists of mapping input bytes to output byte sequences. +The *lower encoder* is not reversible. + +The byte transformations are as follows: + +* Byte values 0-31, 34, 42, 58, 60, 62, 63, 92, 124, and 126-255 are encoded + to ``~xx``, where ``xx`` represents the lowercase hexadecimal encoding of + that value. e.g. ``24`` would be encoded as ``~18``. +* Byte values 65 through 90 (inclusive) (representing ASCII A-Z) are encoded + to the lowercase version of the input. e.g. ``A`` is encoded as ``a``. +* All other byte sequences are preserved as-is. + +Encoding examples: + +* ``foo`` -> ``foo`` +* ``FOO`` -> ``foo`` +* ``hello:world`` -> ``hello~3aworld`` + +Path Hash Encoder +----------------- + +The *path hash encoder* encodes long input paths using a hashing mechanism. +This encoder is used to shorten the lengths of paths such that excessively +long input paths won't exceed path length limits. The hash encoder is not +reversible. + +The hash encoder is not yet documented here. Please see the source code +for details. 
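The filename, auxiliary and lower encoders described in the preceding sections are plain byte mappings, which the following Python sketch illustrates. It is written from the prose above rather than from the Mercurial source, so the names (``encode_filename``, ``encode_lower``, ``encode_aux``) are hypothetical. The path hash encoder is deliberately not covered.

```python
# Bytes escaped as ~xx: control bytes, 126-255, and Windows-reserved bytes.
_ESCAPED = set(range(32)) | set(range(126, 256)) | {34, 42, 58, 60, 62, 63, 92, 124}

# Reserved basenames handled by the auxiliary encoder.
_RESERVED = {b"aux", b"con", b"prn", b"nul"}
_RESERVED |= {b"com%d" % n for n in range(1, 10)}
_RESERVED |= {b"lpt%d" % n for n in range(1, 10)}


def encode_filename(path: bytes) -> bytes:
    """Reversible mapping: escape special bytes, uppercase and underscore."""
    out = []
    for b in path:
        if b in _ESCAPED:
            out.append(b"~%02x" % b)
        elif 65 <= b <= 90 or b == 95:
            out.append(b"_" + bytes([b]).lower())
        else:
            out.append(bytes([b]))
    return b"".join(out)


def encode_lower(path: bytes) -> bytes:
    """Non-reversible mapping: escape special bytes and fold A-Z to a-z."""
    out = []
    for b in path:
        if b in _ESCAPED:
            out.append(b"~%02x" % b)
        else:
            out.append(bytes([b]).lower())
    return b"".join(out)


def encode_aux(path: bytes) -> bytes:
    """Escape reserved basenames and a trailing space or dot per component."""
    parts = []
    for comp in path.split(b"/"):
        if comp:
            # Compare the bytes before the first "." with the reserved names.
            stem = comp.split(b".", 1)[0]
            if stem in _RESERVED:
                comp = comp[:2] + b"~%02x" % comp[2] + comp[3:]
            if comp[-1:] in (b" ", b"."):
                comp = comp[:-1] + b"~%02x" % comp[-1]
        parts.append(comp)
    return b"/".join(parts)
```

For instance, ``encode_aux(b"aux.txt")`` escapes the ``x`` (byte ``0x78``) and yields ``au~78.txt``, matching the example in the auxiliary encoder section.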
+ +Dot Encoder +----------- + +The *dot encoder* normalizes leading and trailing spaces or ``.`` in path +components. + +If a path component begins or ends with a space or ``.``, that byte is +transformed to ``~xx``, where ``xx`` is the lowercase hexadecimal value of +that byte. + +In addition, the *auxiliary encoder* is used to transform reserved file +basename sequences. + +Encoding examples: + +* ``.foo`` -> ``~2efoo`` +* ``.foo/aux.txt`` -> ``~2efoo/au~78.txt`` +* ``foo. `` -> ``foo.~20`` + +Requirements and Encoders +------------------------- + +Repository requirements dictate which path encoders are used. + +If no ``store`` requirement is present, paths are encoded with just the +``directory encoder`` and data paths are stored in the store root directory. + +If just the ``store`` requirement is present, paths are encoded with the +``directory encoder`` then that output is fed into the ``filename encoder``. +All paths are placed in the ``data`` or ``meta`` directory of the store. + +If the ``store`` and ``fncache`` requirements are present, paths are +encoded with the ``directory encoder``, then ``filename encoder``, then +the ``auxiliary encoder``. If the resulting path is longer than 120 +bytes, it is deemed too large to use and is thrown away. Instead, the original +path is encoded with the ``directory encoder`` and then fed into the ``path +hash encoder``. The hash encoder is a non-reversible transform and there are +additional mechanisms to facilitate walking all tracked paths in the store +(see the section on the ``fncache`` below). + +If the ``store``, ``fncache``, and ``dotencode`` requirements are present, +encoding is similar to that of ``store`` + ``fncache`` with the difference +being that the *dot encoder* is used to transform path components beginning +with a leading space or ``.``. 
+ +The ``fncache`` File +-------------------- + +Presence of the ``fncache`` repository requirement engages the *hash encoder*, +which performs non-reversible transformations on paths. Various repository +operations require *walking* or discovering store-tracked filenames. To +facilitate these operations in the presence of lossy filename encoding, +the ``fncache`` file exists to store a list of all logical filenames tracked +by the store. + +The ``fncache`` file consists of a ``\n`` delimited list of paths. Paths are +encoded with the *directory encoder*. There should always be a ``\n`` at the +end of the file. Paths are prefixed with ``data/`` or ``meta/``. The order of +entries is not relevant (but implementations are encouraged to sort entries +for determinism). + +The ``fncache`` file is updated when the set of paths tracked by the store +changes. + +Locking and Transactions +======================== + +This section documents the mechanisms for locking and mutating the *revlog +store*. + +The *revlog store* does not have the concept of a *reader lock*. Instead, +mutations to the files within are expected to be performed in such a way +that readers are able to read a consistent snapshot of repository data. +(In reality there are some corner cases that can arise.) + +A *writer lock* **must** be held before mutating any files in the store. +This writer lock is defined by the ``.hg/store/lock`` file. See +:hg:`help internals.locks` for details on how lock files work. + +Mutations to the *revlog store* are transactional in nature: a change either +completes in its entirety or is rolled back to the previous state. Mutations +are aimed to be atomic, meaning that readers are capable of reading all of the +old state or all of the new state, never a hybrid in-between state. 
In reality, +an atomic switchover of all file state is not performed: instead, writers and +readers write and read files in such an order and with such semantics that +newer changes are not exposed, even if current file state exposes info from +a partially committed transaction. There are likely corner cases where this +doesn't hold. + +An in-progress transaction is indicated by the presence of a ``journal`` +file in the root directory of the store. Nested or multiple transactions +are not allowed: it should not be possible to create a new transaction +if the ``journal`` file exists. + +Store files changed via transactions are classified as append-only (e.g. +revlogs) or random-access. In order to facilitate rolling back a transaction +and not exposing mutated state to a reader before a transaction is committed, +transactions track changes to each file type separately. + +Append-only files are tracked in the ``journal`` file. This file consists +of entries of the form ``<path>\0<offset>\n``. That is, the store relative +filesystem path of the file being appended, followed by a NULL byte, followed +by an integer offset (normalized to a base 10 string), followed by a newline. +The integer offset here is the original size of the file before the +transaction. + +Random access files are tracked in the ``journal.backupfiles`` file. This file +starts with a header of the form ``<version>\n``, where ``<version>`` is +currently the ASCII character ``2``. Following the header are entries of the +form ``<location>\0<path>\0<backup_path>\0<is_cache>\n``. ``location`` is the +string symbolic name of the virtual filesystem handle used by the writer. The +value is dependent on the software performing the transaction. ``path`` is the +store relative filesystem path of the file being modified. ``backup_path`` +is the store relative filesystem path of the file containing a backup of the +original content. ``is_cache`` is a ``0`` or ``1`` boolean flag indicating if +the changed file is a special *cache* file. 
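The two journal formats are simple enough to parse by hand. The Python sketch below is written from the description above, not from Mercurial's transaction code. It assumes that ``journal`` entries hold a path and a base 10 offset separated by a NULL byte, that ``journal.backupfiles`` starts with a version line followed by four NULL-separated fields per entry, and that paths contain no newline bytes. The function names are hypothetical.

```python
def parse_journal(data: bytes):
    """Parse ``journal`` content into (path, original_size) tuples."""
    entries = []
    for line in data.split(b"\n"):
        if not line:
            continue  # the trailing newline leaves an empty last element
        path, offset = line.split(b"\0")
        entries.append((path, int(offset.decode("ascii"))))
    return entries


def parse_backupfiles(data: bytes):
    """Parse ``journal.backupfiles`` content.

    Returns (location, path, backup_path, is_cache) tuples, where
    backup_path is empty when the original file did not exist.
    """
    lines = data.split(b"\n")
    version = lines[0]
    if version != b"2":
        raise ValueError("unknown journal.backupfiles version: %r" % version)
    entries = []
    for line in lines[1:]:
        if not line:
            continue
        location, path, backup_path, is_cache = line.split(b"\0")
        entries.append((location, path, backup_path, is_cache == b"1"))
    return entries
```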
+ +When a transaction is created, the ``journal`` and ``journal.backupfiles`` +are created. The former is initially empty and the latter just contains its +header. + +When writing to an append-only file, the writer first persists that file's +path and current size to the ``journal`` file. The original file is then +appended to. If no original file exists, its ``<offset>`` is recorded as +``0``. + +When changing a random access file, the writer copies the original file +to a backup path (Mercurial has the convention of naming these files +``journal.backup.``) and writes an entry to ``journal.backupfiles`` +to record a planned mutation to this file. This copy may be a hardlink on +filesystems that support it. For a file that doesn't exist, the +``<backup_path>`` recorded in the ``journal.backupfiles`` entry is the +empty string. + +Mercurial also makes full file copies of certain files to files named +``journal.*``. For example, the ``bookmarks`` file is read and written +to ``journal.bookmarks``. These copies are always performed at transaction +open time and are not initially tracked in the ``journal`` or +``journal.backupfiles`` logs. + +When the transaction is closed/committed, various activities occur. +The handles on the ``journal`` and ``journal.backupfiles`` are closed. +A set of undo files are written to facilitate rolling back this transaction +(see next paragraph). The ``journal.backupfiles`` and ``journal`` files +are deleted. The ``<backup_path>`` values from the ``journal.backupfiles`` +file entries are deleted. + +Before a transaction is committed, various files are written to facilitate +an undo of that commit. This essentially entails copying files so they are +named ``undo.*``. + +When a transaction rollback occurs, entries from the ``journal`` and +``journal.backupfiles`` files are essentially reversed. For entries in +``journal``, the file is truncated to the original offset reported in +``journal``. 
For entries in ``journal.backupfiles``, the ``<backup_path>`` +is copied to ``<path>`` using the named VFS in the entry. Finally, +``journal.backupfiles`` and ``journal`` files are removed. + +There is special handling of the ``00changelog.i`` file (the changelog revlog) +during transactions. This is to prevent new repository readers from reading +new changelog revisions before they have been committed by an active +transaction. A copy of ``00changelog.i`` is made to ``00changelog.i.a`` and +writes to the changelog are made to ``00changelog.i.a``. When the +transaction is committed, ``00changelog.i.a`` is automatically renamed to +``00changelog.i``, replacing the canonical changelog revlog index. Under special +circumstances, a new reader may opt to read the ``00changelog.i.a`` file instead +of ``00changelog.i``. For example, a hook process may want to operate on the +proposed new state of the repository.
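The rollback procedure described earlier in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not Mercurial's transaction code: the ``rollback`` function and the ``vfs_roots`` mapping (symbolic VFS name to directory) are hypothetical, both ``path`` and ``backup_path`` are resolved against the named VFS root, and removing a file whose recorded backup path is empty is an assumption about how a previously missing file is restored.

```python
import os
import shutil


def rollback(store_root, vfs_roots, journal_entries, backup_entries):
    """Reverse the recorded journal entries, then drop the journal files.

    journal_entries: iterable of (path, original_size)
    backup_entries:  iterable of (location, path, backup_path)
    """
    # Append-only files: truncate back to the size before the transaction.
    for path, offset in journal_entries:
        target = os.path.join(store_root, path)
        if os.path.exists(target):
            with open(target, "r+b") as fh:
                fh.truncate(offset)

    # Random access files: copy the backup over the modified file.
    for location, path, backup_path in backup_entries:
        root = vfs_roots[location]
        target = os.path.join(root, path)
        if backup_path:
            shutil.copyfile(os.path.join(root, backup_path), target)
        elif os.path.exists(target):
            # Assumption: an empty backup path means the file did not
            # exist before the transaction, so it is removed.
            os.unlink(target)

    # Finally, remove the journal files themselves.
    for name in ("journal.backupfiles", "journal"):
        journal_path = os.path.join(store_root, name)
        if os.path.exists(journal_path):
            os.unlink(journal_path)
```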