This is an archive of the discontinued Mercurial Phabricator instance.

cstore: uniondatapackstore support for python stores
ClosedPublic

Authored by ms2316 on Sep 5 2017, 10:49 AM.
Tags
None
Subscribers
None

Details

Summary

Now uniondatapackstore can also hold python data stores. PythonDataStore
wrapper simply passes function calls to underlying python objects and marshals
the output.

Test Plan
  • Added test case
  • Tested on fbsource with treemanifest.usecunionstore=True

Diff Detail

Repository
rFBHGX Facebook Mercurial Extensions
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ms2316 created this revision.Sep 5 2017, 10:49 AM
Herald added a reviewer: Restricted Project. · View Herald TranscriptSep 5 2017, 10:49 AM
ms2316 added inline comments.Sep 5 2017, 10:51 AM
cstore/py-structs.h
21

By the way, I read here that semicolon is not needed as PyObject_HEAD is a macro that already contains semicolon. And that the behavior is implementation dependent if the semicolon is added.

durham requested changes to this revision.Sep 5 2017, 12:18 PM

I don't think any of the current tests add a python store to the cunionstore. Could we get some tests to ensure this is working? cstore-uniondatapackstore.py is a unit test that makes it easy to write a new test for this.

Reviewed half of this. Will look at the rest later.

cstore/py-datapackstore.h
31

Not quite alphabetical

225

This requires that the caller create a pythondatastore type instead of passing in the actual python store (like a DatapackStore) they made right? Why not allow them to pass in an arbitrary store and wrap it in a PythonDataStore in here?

This revision now requires changes to proceed.Sep 5 2017, 12:18 PM
durham added inline comments.Sep 5 2017, 1:07 PM
cstore/py-pythondatastore.h
24 ↗(On Diff #1612)

Why void*? Couldn't it be PyObject* since this is always going to be called from python code and therefore all objects are PyObject's?

26 ↗(On Diff #1612)

When you do this, you are telling the PythonObj that it now own's this PyObject, and that it should call decrement when it goes out of scope. In this case, it would mean it decrements the store ref count when this function is over, which isn't what we want. I think if you change the _store to be a PythonObj from the beginning, then you never have to cast it to anything, and the ref count will just be handled for you.

40 ↗(On Diff #1612)

PyIter_Next returns a new reference, which means we need to decrement the ref count when we're done with it. So I'd change tuples type to be PythonObj so the ref count is handled for you. You'll then have to convert it to PyObject* to pass to PyArg_ParseTuple below, but that's ok.

46 ↗(On Diff #1612)

When using s or z without # it means it can't contain a null character. It looks like you are using:

s# for filename, filesize
z for node <--- node can contain null characters so this needs to be z#
s for deltabasefilename <--- the deltabasefilename can be a different length to the normal filename, so we need to store it's size as well
z for deltabasenode <--- can also contain null characters, so needs to be z#
z# for delta, delta size

52 ↗(On Diff #1612)

If PyArg_ParseTuple fails, we need to raise the exception. PyArg_ParseTuple will set the python exception flag when it fails, so you just need to throw a pyexception (defined in our pythonutil.h), and then you can have a catch in the top level python-to-c interface function that catches pyexception and returns the appropriate error code to python. Search for pyexception in py-datapackstore.h to see how it's done elsewhere.

55 ↗(On Diff #1612)

DeltaChainLink doesn't take ownership of the strings you pass in right? If these strings are temporarily created for the duration of the iterator, then when tuple goes out of scope and is freed, I think these pointers will become invalid? So we may need to either A) copy the strings when they are passed in to DeltaChainLink(), or B) attach all the returned tuple objects to PyDeltaChain as well, so they have the same lifetime as their associated DeltaChainLink objects. Option B would be more efficient

62 ↗(On Diff #1612)

Same void* issue as above

cstore/py-structs.h
30

The existence of this means we'll have python code constructing these. I think we want the PythonDataStore concept to be internal to the C++ implementation, and actual python code should never construct it.

cstore/pythondatastore.cpp
22

Why isn't the implementation of getdeltachani, getdeltachainraw, and getmessing just put inside this class? Instead of calling out to a global function?

45

Why copy this?

cstore/pythondatastore.h
21

If this is a python object, this should be of type PythonObj, so all the ref counting is handled for you.

ms2316 added inline comments.Sep 5 2017, 1:20 PM
cstore/pythondatastore.cpp
22

I wanted to separate python specific from non python specific code because of 'Eden related stuff' as you told me

45

Because key is a reference to const Key. In which case Single has to accept pointer to const Key in cosntructor and also return pointer to const Key. And we cannot simply pass a const pointer to Single class because it extends KeyIterator. So the simples solution was to make a copy.

cstore/pythondatastore.h
21

Yes, it should be PythonObj. However, I tried to separate python specific code from non python specific code and decided that PythonDataStore could be a wrapper which is not python specific (ironically) so that we dont iclude Python.h (bc it is needed only for PythonObj here)

ms2316 added inline comments.Sep 5 2017, 1:25 PM
cstore/py-pythondatastore.h
24 ↗(On Diff #1612)

Of course It could and should be but I tried not to include Python.h in PythonDataStore.

26 ↗(On Diff #1612)

Ok

46 ↗(On Diff #1612)

Ok

55 ↗(On Diff #1612)

Yes, you are right. I was not sure what ParseTuple does and how it allocates memory

durham added inline comments.Sep 5 2017, 2:24 PM
cstore/py-pythondatastore.h
24 ↗(On Diff #1612)

I think including python.h in PythonDataStore will be ok since PythonDataStore will only ever be included by a top level python-c file.

cstore/pythondatastore.cpp
22

Since this is specific to the PythonDataStore, which will only ever be linked into Python specific implementations, I think it's ok to include python stuff here.

45

You could probably use const_cast or static_cast or something to cast it to non-const at compile time, since we know the Key isn't being modified. Alternatively, it might make sense to change KeyIterator to return (const key)*, but that's probably more code churn that necessary right now. I'd just do the cast, since copying Key's requires allocations that we'd rather avoid.

cstore/pythondatastore.h
21

I think since this whole class is specific to python usage of the data store, it's ok for it to #include python.h. That won't prevent us from linking the other code into non-python implementations (like Eden) since they wouldn't be using pythondatastore.h anyway.

ms2316 marked 20 inline comments as done.Sep 5 2017, 4:41 PM
ms2316 added inline comments.
cstore/py-pythondatastore.h
55 ↗(On Diff #1612)

To be honest, I am not sure about option B, even though I like it more. As I am not completely sure what PyArg_ParseTuple does, but my guess is that it simply allocates memory for filename, node ... arrays and puts the contents of the tuple in there. Which means that we will need to copy these arrays because on the next iteration of the cycle PyArg_ParseTuple can (potentially) override existing arrays, because there is no memory management for the filename, node.. allocated variables.

ms2316 marked an inline comment as done.Sep 5 2017, 6:30 PM
ms2316 added inline comments.
cstore/py-datapackstore.h
225

Tbh, now I am a little bit confused about how we deal with memory management of PythonDataStores.

225

I essentially followed the pattern of py_datapackstore, because my previous implementation was ugly(in uniondatapackstore destructor I iterated over the vector of stores and explicitly freed the memory of pythondatastores only). Again, If I push_back pointers to PythonDataStore into stores vector then memory management has to be done by uniondatapackstore. Or if we make stores the vector of shared_ptr then the memory management of cdatapacks will be performed by uniondatapackstores which is not what we want. So I am a little bit confused about this part.

ms2316 marked 6 inline comments as done.Sep 7 2017, 8:36 AM
ms2316 edited edge metadata.Sep 7 2017, 8:39 AM
ms2316 updated this revision to Diff 1659.

Moved py-pythondatastore functions to pythondatastore.cpp where they belong.
Addressed memory management of PythonDataStores and memory management of
DeltaChainLink's when passed from python.
This diff does not include unit tests for PythonDataStore's yet.

durham requested changes to this revision.Sep 7 2017, 12:16 PM

Looking good! I think one more round will do it. The only real blocker now is getting the python.h references out of the core deltachain files

cstore/deltachain.cpp
10

Why does this import python.h and pythonutil.h? I don't see it used in the file, and this would prevent Eden from linking in non-python deltachain code.

cstore/deltachain.h
163

Hmm, since PyDeltaChain is specific to python implementations, it should probably be moved out to another file. Then you could import the python util headers and avoid the forward declaration here.

163

I'd name this _pythonChainLinks, so it refers to the semantic use instead of the object type.

cstore/py-datapackstore.h
227

I'd use make_shared here. It saves you one allocation (the shared_pointer ref count and the PythonDataStore are allocated together), and avoiding 'new' helps ensure maintainability of memory safety (i.e. if someone later added lines after this new and before the shared_ptr constructor, an exception could cause a memory leak).

Then for the stores.push_back below you can ask the shared_ptr for the raw ptr

cstore/pythondatastore.cpp
73

We should probably throw a real exception. Probably pyexception if PyList_Append sets the exception bit for us.

This revision now requires changes to proceed.Sep 7 2017, 12:16 PM
ms2316 marked 5 inline comments as done.Sep 7 2017, 12:52 PM
ms2316 edited edge metadata.Sep 7 2017, 2:53 PM
ms2316 updated this revision to Diff 1675.

Addressed issues mentioned above and added test cases. One of the tests is
failing.

ms2316 added a comment.Sep 7 2017, 2:57 PM

There is a test failure. I am not sure that fastdatapacks have to be in the uniondatapackstore.

tests/cstore-uniondatapackstore.py
160

+ ERROR: testGetDeltaChainMultiPackPyStore (__main__.uniondatapackstoretests)
+
+ Traceback (most recent call last):
+ File "/data/users/ms2316/facebook-hg-rpms/fb-hgext/tests/cstore-uniondatapackstore.py", line 220, in testGetDeltaChainMultiPackPyStore
+ chain = unionstore.getdeltachain(revisions2[0][0], revisions2[0][1])
+ File "/data/users/ms2316/facebook-hg-rpms/fb-hgext/remotefilelog/datapack.py", line 319, in getdeltachain
+ raise KeyError((name, hex(node)))
+ KeyError: ('foo', '4fbae6af78196f9ce9f77add42ce8c03b866d3ab')
+ [1]

durham requested changes to this revision.Sep 7 2017, 4:03 PM

Code looks good. Need to fix up some test stuff then we're good to go.

tests/cstore-uniondatapackstore.py
31

Instead of copying the tests like this, I think you can define a new class and just override createpack:

class uniondatastorepythonpacktests(uniondatapackstoretests):
    def createPack(self, *args, **kwargs):
        fastpack = super(uniondatapackstorepythonpacktests).createPack(*args, **kwargs)
        return datapack(fastpack.path)
# you will need to import datapack from remotefilelog.datapack

Then it will run all the same tests, but with python packs instead of fastdatapacks. Technically a fastdatapack is python store, and it just happens to wrap a cdatapack

160

I think this is because UnionDeltaChainIterator::getNextChain tries to combine chains from multiple stores at once, and it's not handling the case where one store doesn't have the Key at all. It looks like the expected error path is for getDeltaChainRaw to return a DeltaChain with a code() value that is not GET_DELTA_CHAIN_OK. I think we need to change PythonDataStore::getDeltaChainRaw to catch KeyError exceptions when it calls PythonObj list = _store.callmethod("getdeltachain", pyKey); and turn those into return values.

This revision now requires changes to proceed.Sep 7 2017, 4:03 PM
ms2316 marked 2 inline comments as done.Sep 8 2017, 1:34 PM
ms2316 added inline comments.
tests/cstore-uniondatapackstore.py
31

I dont think we can do that. Because in the first case uniondatapackstore accepts datapackstore(packdir), but when testing datapacks we want it to accepts multiple datapack(packpath) i.e. not packdir

ms2316 edited edge metadata.Sep 8 2017, 6:27 PM
ms2316 updated this revision to Diff 1684.

Updated tests and got rid of a bug. However, been stuck with anothoer for quite a while now.

ms2316 added inline comments.Sep 8 2017, 6:33 PM
cstore/pythondatastore.cpp
40

Took me quite a while to figure out that we need to clear python exception, otherwise after entering once again we will exit immediately

112–114

I have been stuck with this thing for quite a while now. it->next() is giving me segmentation fault. Is this because variable 'it' has a pointer to KeyIterator but KeyIterator is an abstract class and doesnt have implementation of next? I.e. at runtime it is not known that it is actually a pointer to PythonKeyIterator, right?

ms2316 added inline comments.Sep 8 2017, 6:40 PM
cstore/py-datapackstore.h
346–347

for debugging purposes only

durham added inline comments.Sep 8 2017, 6:53 PM
cstore/py-datapackstore.h
348

I'd expand this to the full 3 lines, with braces. Generally we've avoided single line if's like this because they are error prone during later refactors.

Also, why change away from the normal while loop pattern?

cstore/pythondatastore.cpp
38

It'd be nice if we could verify that the current exception is a KeyError before we eat it.

42

If the store doesn't contain the key, we need to return an error code. UnionDeltaChainIterator.getNextChain expects it to return a chain->status() != GET_DELTA_CHAIN_OK if there's an error. Returning an empty chain just causes the union getNextChain to return the empty chain and stop searching other stores.

113

Was this a debug line?

tests/cstore-uniondatapackstore.py
60–61

We should probably rename this function to createPackStore instead of createPack, since it doesn't actually always return a pack.

ms2316 added inline comments.Sep 8 2017, 6:58 PM
cstore/pythondatastore.cpp
42

chain->status() of the empty delta chain returns GET_DELTA_CHAIN_NOT_FOUND.

ms2316 added inline comments.Sep 8 2017, 6:59 PM
cstore/pythondatastore.cpp
113

Yes. It is a debug line.

durham added inline comments.Sep 8 2017, 7:01 PM
cstore/pythondatastore.cpp
42

Eh, that seems pretty subtle. But ok for now I guess.

ms2316 marked 7 inline comments as done.Sep 8 2017, 8:35 PM
ms2316 updated this revision to Diff 1686.

Fixed bugs. Added exception type checking

ms2316 marked an inline comment as done.Sep 8 2017, 8:40 PM
simonfar added inline comments.Sep 11 2017, 8:39 AM
cstore/py-datapackstore.h
220–234

I'd find this easier to read as a switch block:

switch(is_cdatapack) {
    case 1:
        // Do things
        break;
    case 0:
        // Do things
        break;
    default:
        // Error
       return -1;
}
ms2316 marked an inline comment as done.Sep 11 2017, 10:27 AM
ms2316 edited the test plan for this revision. (Show Details)
ms2316 updated this revision to Diff 1713.

Replaced if else block with switch block for readability

simonfar accepted this revision.Sep 11 2017, 11:25 AM
This revision is now accepted and ready to land.Sep 11 2017, 11:58 AM
This revision was automatically updated to reflect the committed changes.