Serialization for Grown-Ups

I don’t want to discredit XML again. You know, yet another bloated, machine-readable text format that saves the world.

There are alternatives, and people not smart enough to choose wisely may use XML as a general-purpose scripting language (as a good friend was forced to implement for a company that needed an intelligent way to do voice scripting).

One single rule for not losing friends: If there are humans involved in your data processing, don’t use XML.

XML is unreadable, it contains redundancy, and the things it solves are not understandable at all. Moreover, every generic XML editing tool seems to suck… Those poor developers must have blown their minds by reading the XML specification over and over again.

XML is the poor mind’s type system. Developers specify the equivalent of DTDs all the time, but in a concise, human-readable format built from classes, references and properties.

So what else is there? There is YAML, and there is, yes, JSON: simple to understand, simple to generate and to parse, with good platform support and … well, underused. It combines human and machine readability, so it is well suited for serialization.

Usually, an object model in memory is implemented by building up and maintaining a graph of objects. Then the requirement appears to serialize this object model to disk and back, so that an application can continue where it was at a previous point in time. All applications should be designed to do so anyway, but we live in a Microsoft world, so they don’t.

It is always a struggle for a developer to serialize the live set of an object graph, be it configuration data or a complete document.

Several problems appear:

1. Metadata is not available (as in ancient languages, for example C++).

2. Serialization is not capable of using the metadata wisely: arbitrary graphs cannot be serialized and deserialized properly, references to base classes are not possible, custom collections cannot be used.

3. Schema evolution renders serialized content unusable. Change your classes, and all the serialized information is lost.

I stumbled over the above problems several times in the past years, and tried to find a solution.

One idea was to create an entire type system and use tools to generate source code to overcome the missing metadata. I called it metatype; it implemented hierarchies using C++ discriminated unions and so solved the first problem for C++. It was capable of creating binary data or XML.

Now, in C#, we get the metadata by using reflection. Serialization using reflection is common, so we move on to problem number 2. Here is my recommendation: use JsonExSerializer from Ted Elliot. The defaults have to be tweaked, but it delivers what it promises. It extends JSON slightly and is capable of serializing and deserializing object graphs connected by references.
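To make the "object graphs connected by references" part concrete, here is a minimal sketch in Python (chosen for brevity, not C#). It does not reproduce JsonExSerializer's actual format or API; the Node class, the "id"/"links" keys and both function names are assumptions made up for illustration. The idea is simply to give every object an id and to store links as ids, so shared objects and cycles survive a round trip.

```python
import json


class Node:
    """Hypothetical model class: a named node with links to other nodes."""

    def __init__(self, name):
        self.name = name
        self.links = []


def serialize(root):
    """Flatten the graph reachable from root into a JSON string.

    Every object gets a numeric id and links are written as ids, so an
    object that is referenced several times is still stored only once.
    """
    ids = {}      # id(obj) -> numeric id
    order = []    # objects in discovery order; the root is always first
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in ids:
            continue  # already visited: shared reference or cycle
        ids[id(node)] = len(order)
        order.append(node)
        stack.extend(reversed(node.links))
    records = [
        {"id": ids[id(n)], "name": n.name, "links": [ids[id(m)] for m in n.links]}
        for n in order
    ]
    return json.dumps({"root": 0, "objects": records})


def deserialize(text):
    """Rebuild the graph: create all objects first, then wire up the links."""
    data = json.loads(text)
    nodes = {rec["id"]: Node(rec["name"]) for rec in data["objects"]}
    for rec in data["objects"]:
        nodes[rec["id"]].links = [nodes[i] for i in rec["links"]]
    return nodes[data["root"]]
```

The two-pass deserialization (objects first, references second) is what allows cycles: by the time a link is resolved, its target already exists.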

Two of four problems solved; we got lucky. If you counted, you noticed that the list above has only three: a fourth step is missing, one that most developers don’t even see. We will come to that later.

Schema evolution, the third step in sane serialization, can probably be solved by strong discipline. Storing a version number of the schema in the serialized data, and attributing new or changed properties together with conversion methods that are called dynamically, should do it. Ideas are starting to grow; it is just a matter of time, and hopefully a solution can be integrated into JsonExSerializer soon.
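The version-number idea can be sketched roughly as follows, again in Python for brevity. Everything here is hypothetical: the "version" key, the example fields and the migration functions are invented for illustration, and conversion functions in a lookup table stand in for the attributed conversion methods described above.

```python
import json

CURRENT_VERSION = 3


def v1_to_v2(doc):
    # Hypothetical change: version 2 split "name" into "first" and "last".
    first, _, last = doc.pop("name").partition(" ")
    doc["first"], doc["last"] = first, last
    return doc


def v2_to_v3(doc):
    # Hypothetical change: version 3 added an "email" property with a default.
    doc.setdefault("email", None)
    return doc


# Maps a stored version to the function that upgrades it by one step.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def load(text):
    """Parse serialized data and upgrade it step by step to the current schema."""
    doc = json.loads(text)
    version = doc.get("version", 1)  # assume files without a version are v1
    if version > CURRENT_VERSION:
        raise ValueError(f"file version {version} is newer than this program")
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version += 1
    doc["version"] = version
    return doc
```

Chaining one-step conversions means each schema change needs exactly one new function, and old files of any age pass through all of them in order.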

And now the fourth serialization feature, the dot on the i, the icing on the cake. What is truly wanted is not to serialize an object graph but to serialize all the incremental information that built the object graph (not even databases do that), so that we finally serialize undo information for as long as the user wants and, if done carefully, build a file format that stores changes live.

History serialization.
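As a rough illustration of what that could mean, the following Python sketch stores every change as one JSON line in an append-only log and rebuilds the state by replaying the log. The record layout ("set"/"value") and the function names are assumptions for this example only, and a flat dictionary stands in for a real object graph.

```python
import io
import json


def record(log, path, value):
    """Append one change (set `path` to `value`) to the log as a JSON line."""
    log.write(json.dumps({"set": path, "value": value}) + "\n")


def replay(lines, upto=None):
    """Rebuild the state from the first `upto` changes (all of them if None)."""
    state = {}
    for i, line in enumerate(lines):
        if upto is not None and i >= upto:
            break
        change = json.loads(line)
        state[change["set"]] = change["value"]
    return state


# An in-memory buffer stands in for a file opened in append mode.
log = io.StringIO()
record(log, "title", "Draft")
record(log, "title", "Final")
record(log, "author", "armin")

lines = log.getvalue().splitlines()
latest = replay(lines)             # state after all three changes
undone = replay(lines, upto=1)     # state after only the first change
```

Undo falls out for free in this scheme: going back in time is just replaying a shorter prefix of the log.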

To lay out this path we first have to solve automatic property-based change tracking, notification and transaction management. In-memory solutions are implemented by any serious application that processes user-editable documents, but stable serialization still seems to be an issue in 2008.

The main problem I see is that object models must be completely isolated from presentation code. Such object models can now be built (and probably have been, but did not survive) using modern type systems and VMs that use garbage collection.

One idea is to automatically generate the implementation of data objects based on interface descriptions by using dynamic code generation. This enables property change notification, which is the initial step to automatically track changes to the model. By bundling changes, these sets can be stored to disk so that they can be applied incrementally … like in a revision control system.
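A small Python sketch of the notification and bundling steps might look like this. Note the substitution: instead of generating implementations from interface descriptions, it intercepts attribute assignment with __setattr__, which is the cheapest way to get the same notifications in Python. Observable, ChangeTracker and Person are hypothetical names invented for the example.

```python
import json
from contextlib import contextmanager


class Observable:
    """Base class whose attribute assignments notify registered listeners."""

    def __init__(self):
        # Bypass our own __setattr__ while the listener list does not exist yet.
        object.__setattr__(self, "_listeners", [])

    def subscribe(self, listener):
        self._listeners.append(listener)

    def __setattr__(self, name, value):
        old = getattr(self, name, None)  # None if the property is new
        object.__setattr__(self, name, value)
        for listener in self._listeners:
            listener(self, name, old, value)


class ChangeTracker:
    """Bundles property changes made inside a transaction into a change set."""

    def __init__(self):
        self.change_sets = []   # committed bundles, oldest first
        self._pending = None    # changes of the open transaction, if any

    def on_change(self, obj, name, old, new):
        # Changes made outside a transaction are ignored in this sketch.
        if self._pending is not None:
            self._pending.append({"property": name, "old": old, "new": new})

    @contextmanager
    def transaction(self, label):
        self._pending = []
        try:
            yield
            # Reached only if the block finished without an exception.
            self.change_sets.append({"label": label, "changes": self._pending})
        finally:
            self._pending = None

    def dump(self):
        """One JSON line per change set, ready to be appended to a file."""
        return "\n".join(json.dumps(cs) for cs in self.change_sets)


class Person(Observable):
    """Hypothetical data object."""

    def __init__(self, name):
        super().__init__()
        self.name = name
```

Used like this, a single "rename" change set ends up holding both property changes, with old and new values, which is enough to apply or revert it later:

    person = Person("Ada")
    tracker = ChangeTracker()
    person.subscribe(tracker.on_change)
    with tracker.transaction("rename"):
        person.name = "Grace"
        person.age = 36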

Building applications should be easy, and I think we will soon come to the point where it actually is. I imagine a world where everyone can create applications in front of a zooming, multi-touch, paper-sized web tablet by moving graphical symbols around.

yours
armin

Stay safe, use revision control systems.

Intelligent people try to understand stupidity. All others struggle for identification.

Comments

CSV

I never understood: if XML is so good, how do you explain the undying popularity of the CSV format?

Then Google go with a binary format – Protocol Buffers.

I think the choice forces people to think about appropriateness. No longer One Buzzword to rule them all.