Message Microformat and Serialisation

In my last post I talked a little bit about my ideas for encoding my messages, so I thought I’d elaborate a little bit.

Basic Premise

To encode an object, I’m going to start by converting it into a proxy object, which is simply a map of string -> string. For example, an action would be encoded as:

actionProxy = [
  "type" => "GATHER",
  "resource" => "HERB",
  "timeLeft" => "10"
]

This proxy can now be serialised into a string by flattening its key/value pairs into a list and joining the items with commas, so for example: "type,GATHER,resource,HERB,timeLeft,10"
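As a quick sketch of that flattening in Kotlin (naiveSerialise is a made-up name for illustration, not something from the real code):

```kotlin
// Hypothetical helper: turns [k1 => v1, k2 => v2] into "k1,v1,k2,v2".
// No escaping or nesting is handled here, which is exactly the weakness
// discussed next.
fun naiveSerialise(proxy: Map<String, String>): String =
    proxy.flatMap { listOf(it.key, it.value) } // [k1, v1, k2, v2]
        .joinToString(",")
```

Called with the action proxy above (built with mapOf, which keeps insertion order), this gives "type,GATHER,resource,HERB,timeLeft,10".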

There are a few problems here:

  • What do I do if I wanted to encode a list of items or another map?
  • How do I cope with an object that has other objects inside it?
  • What do I do if I need to encode a string with a comma in it?

Complex objects

Let’s plough ahead and consider what happens if I encode an object that has another object, or a list or map, inside it. We can make the proxy map as before:

actorProxy = [
  "action" => "type,GATHER,resource,HERB,timeLeft,10",
  "name" => "Tim, the enchanter!",
  "stats" => "5,2,4"
]

Here we can see the problems. If I just encoded this by flattening the map directly, we’d get:

"action,type,GATHER,resource,HERB,timeLeft,10,name,Tim, the enchanter!,stats,5,2,4"

Here, the commas are everywhere and we can’t work out where the boundaries of the objects are. So the next step is to consider boundary markers.

Boundary Markers

It would be a lot easier if I knew where, in the above example, the action encoding starts and ends, and where the stats encoding starts and ends. To do this we can use a boundary marker. For the example we’ll use [] around the objects, lists and maps. I could put it around everything, but that shouldn’t be necessary.

"[action,[type,GATHER,resource,HERB,timeLeft,10],name,Tim, the enchanter!,stats,[5,2,4]]"

Now, when I deserialise this, I can detect which object I’m in. If I strip off the outer brackets (which I know to expect), I can keep count of the brackets as I read: if there have been as many close brackets as open brackets, I know I’m in the main part of the object. Otherwise I’m inside a child, and I can pass its characters through without interpreting them. This makes my deserialisation a lot easier, as I don’t need to know what to expect in the child object to decode the parent.

I still have the problem of a comma appearing inside a string, and now a bracket inside a string is a problem too.

Choosing separators

Although this isn’t vital, it would be a bit simpler and more obvious if, in the normal case, no characters in the data needed special treatment. Because of this, rather than using , and [], I’ll use characters that are rarely present: ¬ as a separator, and |¦ as the brackets. I’ll also need an escape character (see the next section) and for that I’m using £.

Escaping strings

To avoid clashes in strings, especially for user input, I need to escape special characters so I can parse them. This means I need to translate, for example, ¬ as something else.

I can make my life a lot easier if I translate each special character into a completely different sequence of characters. Then, when parsing, I can trust that the special characters mean what I think they do, regardless of the characters around them. For this, I am going to encode ‘¬’ as ‘£1’, ‘|’ as ‘£2’ etc. There’s one more problem here, though: what if my string has a £ in it already? This one I can escape with a £ too, so £ becomes £0.

This has to be done in the right order. First all the £ characters have to be converted – then I can safely swap the rest of the special characters. To decode, firstly I swap all £x sequences other than £0 with their relevant characters, then finally swap all £0 with £. So we have:

"1¬2¬3" => "1£12£13"
"£1" => "£01"
"||" => "£2£2"

This is roughly analogous to percent-encoding in URLs, but a little bit simpler.
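A minimal Kotlin sketch of the escaping, following the ordering described above. The function names are invented for illustration, and mapping ¦ to £3 is an assumption, since the text only spells out £0, £1 and £2:

```kotlin
// Special characters and their escape codes.
// Assumption: '¦' takes the next code, 3; only £0..£2 are given in the text.
private val escapeCodes = mapOf('¬' to '1', '|' to '2', '¦' to '3')

fun escape(raw: String): String {
    // The escape character itself goes first, so a literal "£1" in the
    // input becomes "£01" rather than being read back as '¬'.
    var out = raw.replace("£", "£0")
    for ((special, code) in escapeCodes) {
        out = out.replace(special.toString(), "£$code")
    }
    return out
}

fun unescape(encoded: String): String {
    var out = encoded
    // Reverse order: every other code first, then £0 back to £ last.
    for ((special, code) in escapeCodes) {
        out = out.replace("£$code", special.toString())
    }
    return out.replace("£0", "£")
}
```

With these, escape("1¬2¬3") gives "1£12£13", escape("£1") gives "£01", and unescape undoes both. Swapping the order in unescape would break the second case: "£01" would first become "£1" and then wrongly turn into "¬".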

Serialisation

To serialise an object, we now have an algorithm:

  • Create a proxy map by converting fields to strings
    • Enums, numbers, booleans etc can be converted directly
    • Strings have special characters escaped
    • Objects have their own serialisation using the same algorithm
  • The proxy map is then converted to a list [key1, value1, key2, value2]
  • The list is converted into a string by joining it together with the separator character
  • Add boundary characters around the string and return it
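The steps above could be sketched in Kotlin roughly as follows. The Action and Actor classes, their fields and the helper names are all assumptions based on the examples in this post, with enum values kept as plain strings for brevity and ¦ assumed to escape to £3:

```kotlin
const val SEP = "¬"
const val OPEN = "|"
const val CLOSE = "¦"

// Escape £ first, then the other special characters ('¦' => £3 is assumed).
fun escape(raw: String): String =
    raw.replace("£", "£0").replace("¬", "£1").replace("|", "£2").replace("¦", "£3")

// Proxy map -> [k1, v1, k2, v2] -> joined with the separator -> wrapped.
fun wrapMap(proxy: Map<String, String>): String =
    proxy.flatMap { listOf(it.key, it.value) }
        .joinToString(SEP, prefix = OPEN, postfix = CLOSE)

// Lists are joined and wrapped in the same way.
fun wrapList(items: List<String>): String =
    items.joinToString(SEP, prefix = OPEN, postfix = CLOSE)

// Hypothetical model classes mirroring the actionProxy/actorProxy examples.
data class Action(val type: String, val resource: String, val timeLeft: Int) {
    fun serialise(): String = wrapMap(mapOf(
        "type" to type,                    // enum-like value, converted directly
        "resource" to resource,
        "timeLeft" to timeLeft.toString()  // number, converted directly
    ))
}

data class Actor(val name: String, val action: Action, val stats: List<Int>) {
    fun serialise(): String = wrapMap(mapOf(
        "action" to action.serialise(),    // child object serialises itself
        "name" to escape(name),            // free text must be escaped
        "stats" to wrapList(stats.map { it.toString() })
    ))
}
```

For Tim from the earlier example, serialise() produces "|action¬|type¬GATHER¬resource¬HERB¬timeLeft¬10¦¬name¬Tim, the enchanter!¬stats¬|5¬2¬4¦¦". The comma in the name no longer needs any special handling.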

Deserialisation

In order to use these I need to be able to deserialise. For this, I can run a simple split routine for pretty much everything. After removing the boundary markers from my input string, I process the characters one by one. For each character, I do the following:

  • If this is not a special character, I add it to the buffer
  • If it is an open boundary marker, add one to my boundary counter and add the character to the buffer
  • If it is a close boundary marker, subtract one from my boundary counter and add the character to the buffer
  • If it is my separator character, then:
    • If the boundary counter is greater than 0, add the character to the buffer
    • If the boundary counter is 0, save the buffer to the list and start a new item
  • When the input runs out, save whatever is left in the buffer as the last item

I can use this to deserialise maps and lists, and this means that I can deserialise my proxy objects. For objects with child objects, I can run the same deserialisation on those child objects.
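A rough Kotlin sketch of that split routine is below. The name splitTopLevel is invented for illustration, and treating an empty body as an empty list is an assumption. Boundary markers of child items are kept in the buffer so that each child can later be handed to its own deserialiser intact:

```kotlin
// Splits one serialised object or list on its top-level separators only.
fun splitTopLevel(serialised: String): List<String> {
    // Strip the outer boundary markers, which we know to expect.
    val body = serialised.removePrefix("|").removeSuffix("¦")
    if (body.isEmpty()) return emptyList() // assumption: "|¦" means "no items"

    val items = mutableListOf<String>()
    val buffer = StringBuilder()
    var depth = 0 // how many child boundaries we are currently inside

    for (ch in body) {
        when (ch) {
            '|' -> { depth++; buffer.append(ch) }
            '¦' -> { depth--; buffer.append(ch) }
            '¬' -> {
                if (depth > 0) {
                    buffer.append(ch) // separator belongs to a child
                } else {
                    items.add(buffer.toString()) // end of a top-level item
                    buffer.clear()
                }
            }
            else -> buffer.append(ch) // ordinary (or escaped) text
        }
    }
    items.add(buffer.toString()) // the last item has no trailing separator
    return items
}
```

Splitting "|5¬2¬4¦" gives ["5", "2", "4"], while splitting a serialised actor leaves the nested action as the single item "|type¬GATHER¬resource¬HERB¬timeLeft¬10¦".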

Effectively then to deserialise my input string into an object:

  • Remove the boundary markers
  • Split the string using the above algorithm into a flat list of strings
  • Recreate the proxy mapping by converting the pairs into a map such that [a, b, c, d…] = [a => b, c => d…]
  • For each of the map entries, convert it into a property of the object
    • For integers, booleans, enums, etc, convert the string value as is
    • For string values, unescape the string
    • For object values, pass the string to that object’s deserialise function in turn.
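The last two steps might look like this in Kotlin, assuming the flat list has already been produced by the split routine. The Action class, the two enums (which list only the values seen in this post) and the function names are hypothetical:

```kotlin
// Hypothetical enums containing only the values that appear in the examples.
enum class ActionType { GATHER }
enum class Resource { HERB }

data class Action(val type: ActionType, val resource: Resource, val timeLeft: Int)

// [a, b, c, d] => [a => b, c => d]
fun toProxy(items: List<String>): Map<String, String> {
    require(items.size % 2 == 0) { "Expected key/value pairs, got ${items.size} items" }
    return items.chunked(2).associate { (key, value) -> key to value }
}

// Convert each map entry into a property of the object.
fun actionFromProxy(proxy: Map<String, String>): Action = Action(
    type = ActionType.valueOf(proxy.getValue("type")),       // enum: parsed directly
    resource = Resource.valueOf(proxy.getValue("resource")),
    timeLeft = proxy.getValue("timeLeft").toInt()             // integer: parsed directly
)
```

A string field would additionally go through the unescape step, and a child object field would be passed, boundary markers and all, to that child's own deserialise function. Here getValue throws if a key is missing, which seems a reasonable default for a sketch; a real implementation might prefer default values.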

Critique

This is a fairly heavy-handed approach to something that is normally readily available. It wouldn’t really be necessary to do this in most circumstances: we’re not doing much here that you can’t do with, say, JSON encoding.

It also creates quite a lot of extra development work, as each object needs its own custom serialise and deserialise.

That being said, in my Kotlin JavaScript implementation, I don’t have easy access to reflection, and it would be very difficult to implement a generalised serialisation without it.

The other advantage of this is that I can choose which fields I serialise and how I do it, which could help keep the data size smaller, and that matters for this project. I can also make this relatively quick as it’s a simple encoding; it would be much more expensive if I had to do reflection on every object.

Regardless, this is a project for fun, and hopefully this will provide some insight into the thought process behind serialisation!

