There are several minor issues that prevent parser serialization on Hachoir.
1) Generators cannot be pickled; this can be worked around by exhausting all parsers recursively (i.e. parsing the whole file); the resulting pickle (if it could be generated) would be at least as large as the original file, but would carry data about the parser and hence would still load quickly.
2) Files cannot be pickled; workaround is to avoid a file input stream (e.g. by using StringInputStream? on the whole file) for parsers intended to be pickled
3) Some classes have the name attribute changed; this breaks pickle! Example: String is given a name of FixedString?; pickle complains that it cannot find FixedString? as hachoir_core.field.string_field.FixedString?; fix is either a) to not mess up name, or b) to create an alias of String in that module (e.g. FixedString? = String)
4) A few of the properties defined in Field use lambda when they could simply use the function directly.
Example:
value = property(lambda self: self._getValue(), "Value of field")
in hachoir_core.field.field could be expressed without the lambda (yes, it may ruin caching, but read #5)
value = property(_getValue, "Value of field")
just like _getParent is.
5) Yes, caching the value is a good idea, I suppose, in Field._getValue(), but setting the function to a lambda is not a good way to do it; lambdas are not pickleable. The better way, in my opinion, would be to do this:
def _getValue(self):
if hasattr(self,'_value_cached'): return self._value_cached
try:
value = self.createValue()
except HACHOIR_ERRORS, err:
self.error(_("Unable to create value: %s") % unicode(err))
value = None
self._cached_value = value
return value
value = property(_getValue, "Value of field")
similar to what _getDisplay does.
So, this will eliminate all lambda from Field, which makes it pickleable, except for...
6) Enum returns a dynamically created createDisplay function which, as pickle claims, is not found in the field.enum module (rightly so; it is dynamically generated by Enum()). This is a bit more difficult to fix, but it does not affect *all* parsers (after 1-5 are fixed, many parsers will become perfectly serializable). Perhaps the best policy would be to use a wrapper class, i.e. make Enum into a class instead of a function which wraps the Field:
class Enum:
def __init__(self, field, enum, key_func=None):
self.__field=field
self.__enum=enum
self.__key_func=key_func
self.__class__=field.__class__
def __getattr__(self, key):
try: return self.__dict__[key]
except LookupError: return getattr(self.__field,key)
def createDisplay(self):
if key_func:
try:
key = key_func(self.__field.value)
return self.__enum[key]
except LookupError:
return self.__field.createDisplay()
else:
try:
return self.__enum[self.__field.value]
except LookupError:
return self.__field.createDisplay()
def getEnum(self):
return self.__enum
Here's the problem with this (admittedly) quick hack: the class will be pickled fine, but will unpickle without the Enum wrapper; removing the self.class=field.class line prevents isinstance from working and other nasties.
Anyway, there are my ideas on that.