Ticket #135 (new defect)

Opened 1 year ago

Last modified 1 year ago

Inability to serialize parser, due to minor things

Reported by: nneonneo <nneonneo@gmail.com> Assigned to: haypo
Priority: normal Milestone:
Component: core Keywords:
Cc:

Description

There are several minor issues that prevent parser serialization on Hachoir. 1) Generators cannot be pickled; this can be worked around by exhausting all parsers recursively (i.e. parsing the whole file); the resulting pickle (if it could be generated) would be at least as large as the original file, but would carry data about the parser and hence would still load quickly. 2) Files cannot be pickled; workaround is to avoid a file input stream (e.g. by using StringInputStream? on the whole file) for parsers intended to be pickled 3) Some classes have the name attribute changed; this breaks pickle! Example: String is given a name of FixedString?; pickle complains that it cannot find FixedString? as hachoir_core.field.string_field.FixedString?; fix is either a) to not mess up name, or b) to create an alias of String in that module (e.g. FixedString? = String) 4) A few of the properties defined in Field use lambda when they could simply use the function directly. Example:

value = property(lambda self: self._getValue(), "Value of field")

in hachoir_core.field.field could be expressed without the lambda (yes, it may ruin caching, but read #5)

value = property(_getValue, "Value of field")

just like _getParent is. 5) Yes, caching the value is a good idea, I suppose, in Field._getValue(), but setting the function to a lambda is not a good way to do it; lambdas are not pickleable. The better way, in my opinion, would be to do this:

    def _getValue(self):
        if hasattr(self,'_value_cached'): return self._value_cached
        try:
            value = self.createValue()
        except HACHOIR_ERRORS, err:
            self.error(_("Unable to create value: %s") % unicode(err))
            value = None
        self._cached_value = value
        return value
    value = property(_getValue, "Value of field")

similar to what _getDisplay does.

So, this will eliminate all lambda from Field, which makes it pickleable, except for... 6) Enum returns a dynamically created createDisplay function which, as pickle claims, is not found in the field.enum module (rightly so; it is dynamically generated by Enum()). This is a bit more difficult to fix, but it does not affect *all* parsers (after 1-5 are fixed, many parsers will become perfectly serializable). Perhaps the best policy would be to use a wrapper class, i.e. make Enum into a class instead of a function which wraps the Field:

class Enum:
    def __init__(self, field, enum, key_func=None):
        self.__field=field
        self.__enum=enum
        self.__key_func=key_func
        self.__class__=field.__class__
    def __getattr__(self, key):
        try: return self.__dict__[key]
        except LookupError: return getattr(self.__field,key)
    def createDisplay(self):
        if key_func:
            try:
                key = key_func(self.__field.value)
                return self.__enum[key]
            except LookupError:
                return self.__field.createDisplay()
        else:
            try:
                return self.__enum[self.__field.value]
            except LookupError:
                return self.__field.createDisplay()
    def getEnum(self):
        return self.__enum

Here's the problem with this (admittedly) quick hack: the class will be pickled fine, but will unpickle without the Enum wrapper; removing the self.class=field.class line prevents isinstance from working and other nasties.

Anyway, there are my ideas on that.

Attachments

Change History

06/28/07 17:35:09 changed by haypo

Hachoir use most Python dynamic features like redefinition of method or lambda functions. We may avoid some of them but not all. « value = property(lambda self: self._getValue()) » is used because _getValue() can be changed dynamically.

See also ticket:26

06/28/07 20:21:12 changed by nneonneo <nneonneo@gmail.com>

Yes, I have seen that ticket, but no activity on that ticket has been seen.

I have a new suggestion: given that it would be foolhardy to cripple Hachoir just so it can be pickled, how about we add getstate and setstate to the methods?

e.g. FileInputStream?'s getstate and setstate could produce the filename, instead of the file object, thus allowing persistence of the file object (essentially): the file object points to an existing, open file anyway so if it doesn't exist, then the file object wouldn't have worked anyway ;) This avoids the need to build a StringInputStream? on a 10GB file and attempting to pickle it (point #2)

More importantly, this works around limitations #1 and #4. By adding a getstate/setstate mechanism, we can define our own way to serialize the parser internal state, e.g. pickle the function code of the few functions which aren't equal to their original forms, then for the rest pickle the reference to the original function. For generators, though, I'm less sure about what can be done.


Add/Change #135 (Inability to serialize parser, due to minor things)