Page MenuHomePhabricator

remotenames: store journal entry for bookmarks if journal is loaded
AbandonedPublic

Authored by pulkit on Nov 10 2017, 2:01 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

pulkit created this revision.Nov 10 2017, 2:01 PM
dlax added a subscriber: dlax.Nov 13 2017, 3:53 AM

The state of this stack is not quite clear: there are abandoned revisions and the first changeset (introducing "mercurial/remotenames.py" file) seems to be missing.

In D1358#22836, @dlax wrote:

The state of this stack is not quite clear: there are abandoned revisions and the first changeset (introducing "mercurial/remotenames.py" file) seems to be missing.

Ah, I will resend the series maybe with new differentials. Thanks!

smf added a subscriber: smf.Nov 13 2017, 2:18 PM

pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:

pulkit created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.
REPOSITORY

rHG Mercurial

REVISION DETAIL

https://phab.mercurial-scm.org/D1358

AFFECTED FILES

mercurial/remotenames.py

CHANGE DETAILS
diff --git a/mercurial/remotenames.py b/mercurial/remotenames.py

  • a/mercurial/remotenames.py

+++ b/mercurial/remotenames.py

A bit jumping the gun, no? I don't think we should be skipping putting
this initial stuff into hgext/remotenames.py.

In D1358#22899, @smf wrote:

pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:

pulkit created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.
REPOSITORY

rHG Mercurial

REVISION DETAIL

https://phab.mercurial-scm.org/D1358

AFFECTED FILES

mercurial/remotenames.py

CHANGE DETAILS
diff --git a/mercurial/remotenames.py b/mercurial/remotenames.py

  • a/mercurial/remotenames.py

+++ b/mercurial/remotenames.py

A bit jumping the gun, no? I don't think we should be skipping putting
this initial stuff into hgext/remotenames.py.

I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.

smf added a comment.Nov 14 2017, 6:20 PM

durin42 (Augie Fackler) <phabricator@mercurial-scm.org> writes:

durin42 added a comment.

In https://phab.mercurial-scm.org/D1358#22899, @smf wrote:
> pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:
>
> > pulkit created this revision.
> >  Herald added a subscriber: mercurial-devel.
> >  Herald added a reviewer: hg-reviewers.
> >
> > REPOSITORY
> >
> >   rHG Mercurial
> >
> > REVISION DETAIL
> >
> >   https://phab.mercurial-scm.org/D1358
> >
> > AFFECTED FILES
> >
> >   mercurial/remotenames.py
> >
> > CHANGE DETAILS
> >
> > diff --git a/mercurial/remotenames.py b/mercurial/remotenames.py
> >
> > - a/mercurial/remotenames.py +++ b/mercurial/remotenames.py
>
> A bit jumping the gun, no? I don't think we should be skipping putting
>  this initial stuff into hgext/remotenames.py.
I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.

I agree with putting data generating functions and storage into core.
Though, I disagree with the approach shown so far.

We currently have at least two extensions in core that do something
similar: blackbox and journal. Including remotenames, all three have
overlapping data storage needs. The storage design of each is roughly:

  • blackbox logs each command
  • journal logs update / move commands
  • remotenames logs exchange commands

For something to make it to core, I would expect a unified way to deal
with this duplication. For example, one solution could be:

  1. Create a v1 storage format api to hold this information (maybe start with sqlite or something very simple at first for the backend)
  1. Create a helper utility (decorator, or just wrap ui, etc)
  1. Use helper utility on all these commands to keep the data extension-agnostic

This would be a great way to improve all our data logging extensions as
well as keep things tidy in core. If it’s needed, I’d be more than happy
to guide Pulkit and do the review for him if that is a bottleneck.

In D1358#23388, @smf wrote:

For something to make it to core, I would expect a unified way to deal
with this duplication. For example, one solution could be:

  1. Create a v1 storage format api to hold this information (maybe start with sqlite or something very simple at first for the backend)
  2. Create a helper utility (decorator, or just wrap ui, etc)
  3. Use helper utility on all these commands to keep the data extension-agnostic

This would be a great way to improve all our data logging extensions as
well as keep things tidy in core. If it’s needed, I’d be more than happy
to guide Pulkit and do the review for him if that is a bottleneck.

I like where your head is at, but I'm unclear if Pulkit et al will be willing to do this extra work. Pulkit, is this something that we could get you to put on the path of "remotenames in core"?

In D1358#23388, @smf wrote:

durin42 added a comment.

I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.

I agree with putting data generating functions and storage into core.
Though, I disagree with the approach shown so far.
We currently have at least two extensions in core that do something
similar: blackbox and journal. Including remotenames, all three have
overlapping data storage needs. The storage design of each is roughly:

  • blackbox logs each command
  • journal logs update / move commands
  • remotenames logs exchange commands

For something to make it to core, I would expect a unified way to deal
with this duplication.

While I agree with you that we should have a unified way to deal with any kind of duplication, I am unable to understand the duplication between data stored by remotanemes and one stored by journal and blackbox.

Data stored for journal, blackbox and remotenames are as follows:

Journal: time, user, namespace, name of changed item, command, nodes(old and new)
Blackbox: date, user, node, process id, source and some kind of formatter message
Remotenames: node, remotepath, name of bookmark or branch

I strongly feel that data storage for journal and blackbox should be unified, I am not sure how these overlaps with remotenames one. I want to keep remotenames stored separately as information about a remoterepo can be used in operations like push and pull. @durin42 and others had some ideas around it. Moreover journal and blackbox data serves different purpose than that of remotenames.

Can you please help me understanding how you think these data can be merged in one format.

For example, one solution could be:

  1. Create a v1 storage format api to hold this information (maybe start with sqlite or something very simple at first for the backend)
  2. Create a helper utility (decorator, or just wrap ui, etc)
  3. Use helper utility on all these commands to keep the data extension-agnostic

This would be a great way to improve all our data logging extensions as
well as keep things tidy in core. If it’s needed, I’d be more than happy
to guide Pulkit and do the review for him if that is a bottleneck.

I will also like to work on improving storage things in core but I don't think that intersects very much with remotenames. This thread will also interests you: https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-November/107411.html

quark added a subscriber: quark.Nov 15 2017, 1:57 PM
In D1358#23388, @smf wrote:
  1. Create a v1 storage format api to hold this information (maybe start with sqlite or something very simple at first for the backend)

On sqlite:

  • Some people use traditional tools like grep, less on blackbox.log. Using sqlite would make blackbox harder to use.
  • sqlite calls fsync by default. It's a performance issue under heavy I/O.
  1. Create a helper utility (decorator, or just wrap ui, etc)
  2. Use helper utility on all these commands to keep the data extension-agnostic

The current APIs (ui.log and repo.journal.record) do not look
terrible. Could you give an example of how this diff could be
improved with your proposed new API? i.e. is it becoming shorter,
less cryptic?

If the code under the new API look similar to the existing one,
I wonder whether they could be implemented independently - i.e.
this series continue using the existing API and changing the
format is independent from this series.

durham added a subscriber: durham.Nov 15 2017, 2:00 PM

I'm a little confused here, in the same way Pulkit has expressed. remotenames has a couple parts related to storage: one is the storing of the remotenames themselves, and the other is tracking how they change over time. The remotename storage itself doesn't seem like it overlaps with journal or blackbox, as it's not a log. The tracking of how they change over time seems to fit the journal exactly, which is a store for tracking how various references to commits (working copy parent, bookmarks, and now remotenames) change over time.

So I guess I'm confused about what part of remotenames needs unification? They seem to either be completely orthogonal, or a complete match for an existing system.

In D1358#23388, @smf wrote:

For something to make it to core, I would expect a unified way to deal
with this duplication. For example, one solution could be:

  1. Create a v1 storage format api to hold this information (maybe start with sqlite or something very simple at first for the backend)
  2. Create a helper utility (decorator, or just wrap ui, etc)
  3. Use helper utility on all these commands to keep the data extension-agnostic

This would be a great way to improve all our data logging extensions as
well as keep things tidy in core. If it’s needed, I’d be more than happy
to guide Pulkit and do the review for him if that is a bottleneck.

I like where your head is at, but I'm unclear if Pulkit et al will be willing to do this extra work. Pulkit, is this something that we could get you to put on the path of "remotenames in core"?

I will be happy to do that work but there are two things involved here:

  1. Improving end-user experience
  2. Improving our backend storage

If we do 1) after the completion of 2), then I think we won't have enough time to test the new feature in this cycle or maybe we need to wait for the next cycle. I don't see a very hard dependency here between both and they can go in parallel which I am again happy to do. @smf how does this sounds to you? If this sounds good to you, I will send a new version of the series and start a thread on mailing related to improving the storage API's.

smf added a comment.Nov 16 2017, 2:30 PM

pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:

pulkit added a comment.

In https://phab.mercurial-scm.org/D1358#23388, @smf wrote:
> > durin42 added a comment.
> > 
> >   I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.
>
> I agree with putting data generating functions and storage into core.
>  Though, I disagree with the approach shown so far.
>
> We currently have at least two extensions in core that do something
>  similar: blackbox and journal. Including remotenames, all three have
>  overlapping data storage needs. The storage design of each is roughly:
>
> - blackbox logs each command
> - journal logs update / move commands
> - remotenames logs exchange commands
>
>   For something to make it to core, I would expect a unified way to deal with this duplication.
While I agree with you that we should have a unified way to deal with any kind of duplication, I am unable to understand the duplication between data stored by remotanemes and one stored by journal and blackbox.
Data stored for journal, blackbox and remotenames are as follows:
Journal: time, user, namespace, name of changed item, command, nodes(old and new)
Blackbox: date, user, node, process id, source and some kind of formatter message
Remotenames: node, remotepath, name of bookmark or branch

Why not union them as such:

datetime, user, namespace, node, process id, args (name of changed item,
remotepath, etc), input (old nodes), result of command (new nodes, etc)

I strongly feel that data storage for journal and blackbox should be unified, I am not sure how these overlaps with remotenames one. I want to keep remotenames stored separately as information about a remoterepo can be used in operations like push and pull. @durin42 and others had some ideas around it. Moreover journal and blackbox data serves different purpose than that of remotenames.

I don't really see them as different purposes. Questions like, "What was
the state of my repo yesterday (journal)?" or "What was the state of
remote yesterday (remotenames, though future features that is append
only)?" or "What were the last five commands and their input
(blackbox)?"

We're talking about putting something in core and I believe we should
really think this stuff out.

In D1358#23866, @smf wrote:

pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:

pulkit added a comment.

In https://phab.mercurial-scm.org/D1358#23388, @smf wrote:
> > durin42 added a comment.
> > 
> >   I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.
>
> I agree with putting data generating functions and storage into core.
>  Though, I disagree with the approach shown so far.
>
> We currently have at least two extensions in core that do something
>  similar: blackbox and journal. Including remotenames, all three have
>  overlapping data storage needs. The storage design of each is roughly:
>
> - blackbox logs each command
> - journal logs update / move commands
> - remotenames logs exchange commands
>
>   For something to make it to core, I would expect a unified way to deal with this duplication.
While I agree with you that we should have a unified way to deal with any kind of duplication, I am unable to understand the duplication between data stored by remotanemes and one stored by journal and blackbox.
Data stored for journal, blackbox and remotenames are as follows:
Journal: time, user, namespace, name of changed item, command, nodes(old and new)
Blackbox: date, user, node, process id, source and some kind of formatter message
Remotenames: node, remotepath, name of bookmark or branch

Why not union them as such:
datetime, user, namespace, node, process id, args (name of changed item,
remotepath, etc), input (old nodes), result of command (new nodes, etc)

  1. remotenames don't need datetime, user, process id, result of commands and the etc's.
  2. if we store all the journal data, blackbox data and remotenames data all at one place, we will certainly need one more field which will store whether it's remotenames data, journal or blackbox as otherwise it will be slow to read all the data and find the data we need.
I strongly feel that data storage for journal and blackbox should be unified, I am not sure how these overlaps with remotenames one. I want to keep remotenames stored separately as information about a remoterepo can be used in operations like push and pull. @durin42 and others had some ideas around it. Moreover journal and blackbox data serves different purpose than that of remotenames.

I don't really see them as different purposes. Questions like, "What was
the state of my repo yesterday (journal)?" or "What was the state of
remote yesterday (remotenames, though future features that is append
only)?" or "What were the last five commands and their input
(blackbox)?"

Sorry Sean, but I do see them as different purposes because of the following points:

  1. This information is related to a remote repo. It will be very good if we have storage of remote repo information and local repo information separately.
  2. remotenames will be a part of the UI (commands like log etc.)
  3. There are revsets related to remotenames
  4. remotenames will have new namespaces
  5. There is lot of data stored by journal and blackbox which is not required for remotenames.

Even we get to a stage where we pulls the journal or blackbox information from a remote repo and store them locally, we should store them atleast in different files as that of local journal and blackbox information.

If we want to unify the storage for remote and local things, then we should think about having a unified storage for bookmarks(local+remote) and branches(local+remote).

We're talking about putting something in core and I believe we should
really think this stuff out.

smf added a comment.Nov 16 2017, 7:57 PM

pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:

pulkit added a comment.

In https://phab.mercurial-scm.org/D1358#23866, @smf wrote:
> pulkit (Pulkit Goyal) <phabricator@mercurial-scm.org> writes:
>
> > pulkit added a comment.
> >
> >   In https://phab.mercurial-scm.org/D1358#23388, @smf wrote:
> >
> >   > > durin42 added a comment.
> >   > >
> >   > >   I don't think so. IMO the storage-only parts can and should go straight in core and be on by default. Exposing the data (by default, at least) we're collecting should probably go through an extension rodeo first.
> >   >
> >   > I agree with putting data generating functions and storage into core.
> >   >  Though, I disagree with the approach shown so far.
> >   >
> >   > We currently have at least two extensions in core that do something
> >   >  similar: blackbox and journal. Including remotenames, all three have
> >   >  overlapping data storage needs. The storage design of each is roughly:
> >   >
> >   > - blackbox logs each command
> >   > - journal logs update / move commands
> >   > - remotenames logs exchange commands
> >   >
> >   >   For something to make it to core, I would expect a unified way to deal with this duplication.
> >
> >
> >   While I agree with you that we should have a unified way to deal with any kind of duplication, I am unable to understand the duplication between data stored by remotanemes and one stored by journal and blackbox.
> >
> >   Data stored for journal, blackbox and remotenames are as follows:
> >
> >   Journal: time, user, namespace, name of changed item, command, nodes(old and new)
> >   Blackbox: date, user, node, process id, source and some kind of formatter message
> >   Remotenames: node, remotepath, name of bookmark or branch
>
> Why not union them as such:
>
> datetime, user, namespace, node, process id, args (name of changed item,
>  remotepath, etc), input (old nodes), result of command (new nodes, etc)
1. remotenames don't need datetime, user, process id, result of commands and the etc's.

Just because we don't use them now doesn't mean we won't in future (in
fact, I could easily see a use-case for seeing the previous locations of
a remote repo by date or even user).

  1. if we store all the journal data, blackbox data and remotenames data all at one place, we will certainly need one more field which will store whether it's remotenames data, journal or blackbox as otherwise it will be slow to read all the data and find the data we need.

Perhaps but not necessarily. We could just query, for instance, all the
times we've pulled or pushed. The resulting query set would have the
same-ish data as remotenames has now.

>
>
>>   I strongly feel that data storage for journal and blackbox should be unified, I am not sure how these overlaps with remotenames one. I want to keep remotenames stored separately as information about a remoterepo can be used in operations like push and pull. @durin42 and others had some ideas around it. Moreover journal and blackbox data serves different purpose than that of remotenames.
>
> I don't really see them as different purposes. Questions like, "What was
>  the state of my repo yesterday (journal)?" or "What was the state of
>  remote yesterday (remotenames, though future features that is append
>  only)?" or "What were the last five commands and their input
>  (blackbox)?"
Sorry Sean, but I do see them as different purposes because of the following points:
1. This information is related to a remote repo. It will be very good if we have storage of remote repo information and local repo information separately.

Perhaps. But that sounds more like an implementation detail than a
logging data design.

  1. remotenames will be a part of the UI (commands like log etc.)

Sure, but so could journal and others?

  1. There are revsets related to remotenames

Are you asking points (2) and (3) due to performance reasons?

  1. remotenames will have new namespaces

Only when reading the data which will still live in an extension for now.

  1. There is lot of data stored by journal and blackbox which is not required for remotenames.

I mean, sort of. After using remotenames for so long, it would have been
nice if I stored the datetime, proc, command, etc because I found myself
needing that information sometimes.

Even we get to a stage where we pulls the journal or blackbox information from a remote repo and store them locally, we should store them atleast in different files as that of local journal and blackbox information.

My main point in this thread is that this information (journal,
blackbox, remotenames, etc) should be, at least from core's perspective,
agnostic of which extension wants that data. By writing code with only
one extension in mind, we end up in a place where extensions may have
(unwritten) dependencies. I ran into this often with hg-git,
hgsubversion, and remotenames.

Keeping the data in a centralized place through a common api (e.g.
something akin to namespace) we avoid having extensions step on each
other's toes.

If we want to unify the storage for remote and local things, then we should think about having a unified storage for bookmarks(local+remote) and branches(local+remote).

That's not a bad idea, though a bit more specialized than a data store.
I have long wanted a generic cache object so that extensions didn't have
to implement their own.

Journal seems to have a lot of similar code to remotenames. In fact,
probably even better code because I see it has an actual storage class.
We could start by abstracting out journalstorage and friends to another
module and then importing that into remotenames for use as a storage
layer.

Restricting to just journal and remotenames seems easier for now.
Something like this is what I have in mind:

  1. abstract out storage of journal (hgext/extstorage.py or something)
  2. use new module in remotenames
  3. iterate on usage
  4. graduate extstorage to core

I might be able to take a crack at (1) this weekend.

There seems like two use cases for remotenames: knowledge about the most recently seen location of each remote bookmark, and knowledge about where the remote bookmarks have been over time. I think 99% of the benefit of remotenames comes from the first part. Building a storage layer which suits both the journal and remotenames seems like massive scope creep to address the 1% use case. Even if we did unify the storage models, I still don't believe remotenames and the journal could share a common storage format because they don't have similar access patterns. Remotenames is much more of a random access dict-like object, while the journal is much more of a sequential log. If I want to lookup the value of the latest remote master I don't want to deal with scanning some log of recent activity, and if I want to maintain a log of recent activity I don't want to build an index that allows for random access.

pulkit abandoned this revision.Nov 29 2017, 5:05 PM

I have left out this patch and updated the series with new differentials D1547 to D1551.