“Programming isn’t about what you know; it’s about what you can figure out.”
Chris Pine
The third article on Rocinante finally deals with the interesting part of the protocol. How the parts of a message are encoded into the binary Protocol Buffer representation and decoded again. Rocinante uses a two-stage approach. The basis is formed by a ProtoInputStream
and a ProtoOutputStream
, in which the basic mechanisms are implemented. For each message a generated Proto
class with individual read
and write
methods is based on this.
The following Proto
class ExampleProto
was generated for our example message from the first article about Rocinante. The class also contains a builder, which has been omitted for the sake of clarity.
public final class ExampleProto implements Proto<Example> { @Override public Example read(InputStream inputStream) throws IOException { ProtoInputStream protoInputStream = new ProtoInputStream(inputStream); String text = null; boolean flag = false; int number = 0; BitSet initialized = new BitSet(); WireType wireType; while ((wireType = protoInputStream.readType()) != null) { int lastFieldNumber = protoInputStream.getLastFieldNumber(); switch (lastFieldNumber) { case 1: text = protoInputStream.readString(wireType); break; case 2: flag = protoInputStream.readBoolean(wireType); initialized.set(2); break; case 3: number = protoInputStream.readInteger(wireType); initialized.set(3); break; default: throw new IOException("invalid field number: " + lastFieldNumber); } } if (initialized.cardinality() != 2) { throw new IOException("some fields are not initialized"); } return new Example(text, flag, number); } @Override public void write(OutputStream outputStream, Example value) throws IOException { ProtoOutputStream protoOutputStream = new ProtoOutputStream(outputStream); if (value.getText().isPresent()) { protoOutputStream.writeString(1, value.getText().get()); } protoOutputStream.writeBoolean(2, value.flag()); protoOutputStream.writeInteger(3, value.number()); } }
In the read
method, a type field is read and then, depending on the field number it contains, the corresponding data type is read. In fact, the link between the fields of the record and the field number is only realized in the read
and write
methods.
The read-in values are collected and if everything has been processed successfully, the Example
record is generated. The Protocol Buffer requires that all mandatory fields have actually been read in. The check in Rocinante is handled by setting a correnspondent bit for the field number in a BitSet
. If at the end as many bits are set as there are mandatory fields in the message, then all mandatory fields have been read in.
An important feature of the Protocol Buffer is the reduction of the payload of a message. Compared to JSON, a lot is achieved by replacing the attribute names with field numbers. But the values are also saved a little.
In Java, integers are stored as Integer
with 4 bytes or as Long
with 8 bytes. However, as most of the numbers used can be coded in fewer bytes, Protocol Buffer uses the VarInt format.
VarInt stands for “Variable Integer” and is a format for storing integers in a compact form. The VarInt format adapts the size of the memory used to the value of the number. With the VarInt format, the number is stored in grouped 7-bit blocks. Each block contains a continuity bit that indicates whether further bytes are to follow. If the bit is set, then a further byte must be read in, otherwise the coding ends with the current 7 bits.
Numbers up to 127 therefore only require one byte and numbers up to 16383 only two bytes. The advantages of the VarInt format are obvious. Less memory is required if the numbers are small and the messages can be transmitted faster because they are smaller, too. Of course, the advantages of the reduction in memory space are offset by the disadvantage of increased computing time. In computer science, you don’t get anything for free.
The VarInt format is not only used for the obvious int32
, int64
, uint32
and uint64
, but also for bool
, enum
and the lengths in string
and bytes
. For now, Rocinante does not yet support uint32
and uint64
. Probably because I don’t like the distinction between signed and unsigned number types. The type uint32
can be mapped to Long
. Large values of uint64
are either dropped or BigInteger
has to enter the stage. Let’s just see what the future brings.
The type bool
is represented as VarInt with the values 0
or 1
and therefore only requires one byte for its value. Whether it is a Boolean
, an Enum
, an Integer
or a Long
cannot be recognized by the binary data, the correct interpretation is made by the code in the read
method.
The same applies to the types string
and bytes
. both encode their content as a VarInt for the length of the following byte sequence. The read
method decides whether this sequence is interpreted as a byte[]
or as an utf-8 encoded String
.
The following code shows a small excerpt from the ProtoInputStream
class.
public class ProtoInputStream extends FilterInputStream { private int lastFieldNumber; public ProtoInputStream(InputStream in) { super(in); } public WireType readType() throws IOException { int value = read(); if (value == -1) { return null; } lastFieldNumber = value >> 3; return WireType.values()[value & 0b00000111]; } protected Integer readVariableByteInt() throws IOException { byte digit; int msgLength = 0; int multiplier = 1; do { digit = (byte) read(); msgLength += ((digit & 0x7F) * multiplier); multiplier *= 128; } while ((digit & 0x80) != 0); return msgLength; } // ... }
The readType
method returns the type of the current field in the InputStream
and also saves the field number, which is encoded in the same byte. The readVariableByteInt
reads a VarInt as an int
. At the moment it is neither particularly optimized nor is error handling carried out here in the event of an integer overflow. However, the latter is already on the TODO list.
The types int32
and int64
have an unpleasant disadvantage when using negative numbers. These must be represented as 2’s complement. For some migration reason it is always 64bit long. This means that the number -2
is stored as feffffffffffffffffffff01
in 10 bytes. To avoid this waste of space, two further data types sint32
and sint64
were defined in Protocol Buffer.
Unfortunately, Rocinante is still incompatible with Protocol Buffer and only uses 5 bytes for negative values in int32
. When I first read the documentation, I thought this was a badly formulated passage. I would never have dreamed that such a waste would be allowed at this point.
In the sint32
and sint64
types, the numbers are coded according to the ZigZag coding scheme. The easiest way to explain this scheme is with the following table.
Negative Values | Positive Values | Encoding |
---|---|---|
0 | 0 | |
-1 | 1 | |
1 | 2 | |
-2 | 3 | |
2 | 4 |
The coding switches back and forth between the negative and positive values, so every positive value p is coded as p\cdot2 and every negative value n as \lvert n\rvert\cdot2-1. This means that numbers above 63 already need two bytes, but all numbers between -1 and -64 only need one byte instead of 10 bytes.
Since the ZigZag format only requires the transformation of the input values, it was also implemented in Rocinante.
public int readZigZagInteger(WireType wireType) throws IOException { validate(wireType, WireType.VARINT); Integer value = readVariableByteInt(); return (value >>> 1) ^ -(value & 1); }
Once the basic functionalities have been described and implemented, the binary representation of our example can be processed.
0a 09 52 6f 63 69 6e 61 6e 74 65 10 01 18 2a
The first byte 0a
contains the value 2 for LEN
encoded types such as string
, bytes
and message
in the lower three bits. The value 1 is encoded in the upper 5 bits for the first field text
from the Example
. This is followed by 09
, a VarInt that encodes the length 9 of the fields content. This is followed by 9 bytes as an utf-8 coded String
.
The next byte 10
contains the value 0 in the lower three bits. the value 0 encodes the type VARINT
. The upper 5 bits encode the value 2 for the flag
field from the Example
. The next value 01
encodes the boolean
value true
.
The second last byte 18
again encodes the type VARINT
in the lower three bits. The upper five bits encode the value 3 for the number
field. The last byte 2a
as VarInt encodes the number 42.
We have done this manually here, but of course it also works with our ExampleProto
class.
byte[] bytes = HexFormat.of().parseHex("0a09526f63696e616e74651001182a"); Example example = ExampleProto.read(new BytesArrayInputStream(bytes));
In the next article we will implement a Maven plugin to automate the generation of our Protocol Buffer classes.
1 thought on “Rocinante – encoding and decoding”