While editing the OLE2 [MSOffice] parser, I noticed that it is possible, through rather arcane hacks, to sew together unattached fragments (by way of the fragment_group field, which is created as the properties are explored). The fragments are sewn together by FragmentGroup and presented as a substream, which is subsequently left unused.
At the end of createFields in OLE2, I added this code:
    if "root[0]" in self:
        self.seekBit(0)
        stream = self["root[0]"].group.createInputStream()
        psfield = OfficeRootEntry(stream)
        RootSeekableFieldSet.__init__(psfield, self, "root", stream, "Document Fragment Group: root", stream.size)
        psfield.ole2 = self
        yield psfield
Yes, there is voodoo. Basically, if any root entries were found (summary and doc_summary entries follow), the code seeks to the beginning (to allow for enough perceived space to store the data), creates a stream from the fragment_group (which happens to be a StringInputStream), initializes a parser for it, and then attaches the parser to the main document.
Now, with an unmodified base library, this fails with an AssertionError: the parent stream must match the child stream. If the assertion is commented out, the substream functions perfectly, allowing inspection of the contained elements.
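To illustrate the failure in isolation, here is a minimal sketch that does not use hachoir at all. ToyFieldSet and its check_stream flag are hypothetical stand-ins invented for this example; only the identity comparison itself is taken from the assertion quoted in this report. It assumes Python 3.

```python
class ToyFieldSet:
    """Hypothetical stand-in for a field set; not hachoir's real class."""

    def __init__(self, parent, name, stream, check_stream=True):
        self.parent = parent
        self.name = name
        self.stream = stream
        if parent is not None and check_stream:
            # Same identity comparison as the assertion quoted in this report.
            assert id(self.stream) == id(parent.stream), \
                "parent and child must share one stream"


file_stream = object()      # stands in for the OLE2 file's input stream
fragment_stream = object()  # stands in for the stream built from the fragment group

root = ToyFieldSet(None, "ole2", file_stream)

# A child backed by a different stream object trips the check...
try:
    ToyFieldSet(root, "root", fragment_stream)
    failed = False
except AssertionError:
    failed = True

# ...while the same child attaches once the check is skipped.
relaxed = ToyFieldSet(root, "root", fragment_stream, check_stream=False)
```

The point is only that any field set backed by a substream, rather than by the parent's own stream object, cannot pass an identity check of this kind.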
The OfficeRootEntry parser isn't registered in the main parser list, as it doesn't function on its own (it requires variables and methods from the parent OLE2 parser, which the code above links to it). Thus, the substream "technique" appears to be the only truly viable way of reading the data: the parser can't be attached to the main stream directly, because the data is heavily fragmented in my test file (over 12 separate fragments) and the function seekSBlock assumes a contiguous stream.
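As a rough picture of why a substream helps, the sketch below glues scattered byte ranges into one contiguous in-memory stream that can then be seeked normally. It is not hachoir's FragmentGroup implementation; join_fragments, the sample bytes and the (offset, size) layout are all made up for illustration, and it assumes Python 3.

```python
import io


def join_fragments(data, fragments):
    """Concatenate the (offset, size) byte ranges of data, in the order given."""
    return b"".join(data[offset:offset + size] for offset, size in fragments)


# Made-up file contents: three payload pieces scattered between filler bytes.
raw = b"AAAA" + b"wor" + b"BBBB" + b"hello " + b"CC" + b"ld"

# Hypothetical fragment chain in logical order (not in file order).
fragments = [(11, 6), (4, 3), (19, 2)]

# The joined bytes form a contiguous stream that a parser could seek freely.
substream = io.BytesIO(join_fragments(raw, fragments))
substream.seek(6)
tail = substream.read()
```

Once the pieces are joined, offsets inside the substream are contiguous, which is the property that code written for an unfragmented stream relies on.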
I therefore suggest removing the assertion
    assert id(self.stream) == id(parent.stream)
from hachoir_core/field/basic_field_set.py.