Rocinante – encoding and decoding

“Programming isn’t about what you know; it’s about what you can figure out.”

Chris Pine

The third article on Rocinante finally deals with the interesting part of the protocol. How the parts of a message are encoded into the binary Protocol Buffer representation and decoded again. Rocinante uses a two-stage approach. The basis is formed by a ProtoInputStream and a ProtoOutputStream, in which the basic mechanisms are implemented. For each message a generated Proto class with individual read and write methods is based on this.

The following Proto class ExampleProto was generated for our example message from the first article about Rocinante. The class also contains a builder, which has been omitted for the sake of clarity.

public final class ExampleProto implements Proto<Example> {

  @Override
  public Example read(InputStream inputStream) throws IOException {
    ProtoInputStream protoInputStream = new ProtoInputStream(inputStream);
    String text = null;
    boolean flag = false;
    int number = 0;
    BitSet initialized = new BitSet();

    WireType wireType;
    while ((wireType = protoInputStream.readType()) != null) {
      int lastFieldNumber = protoInputStream.getLastFieldNumber();
      switch (lastFieldNumber) {
        case 1:
          text = protoInputStream.readString(wireType);
          break;
        case 2:
          flag = protoInputStream.readBoolean(wireType);
          initialized.set(2);
          break;
        case 3:
          number = protoInputStream.readInteger(wireType);
          initialized.set(3);
          break;
        default:
          throw new IOException("invalid field number: " + lastFieldNumber);
      }
    }
    if (initialized.cardinality() != 2) {
      throw new IOException("some fields are not initialized");
    }
    return new Example(text, flag, number);
  }

  @Override
  public void write(OutputStream outputStream, Example value) throws IOException {
    ProtoOutputStream protoOutputStream = new ProtoOutputStream(outputStream);
    if (value.getText().isPresent()) {
      protoOutputStream.writeString(1, value.getText().get());
    }
    protoOutputStream.writeBoolean(2, value.flag());
    protoOutputStream.writeInteger(3, value.number());
  }
}

In the read method, a type field is read and then, depending on the field number it contains, the corresponding data type is read. In fact, the link between the fields of the record and the field number is only realized in the read and write methods.

The read-in values are collected and if everything has been processed successfully, the Example record is generated. The Protocol Buffer requires that all mandatory fields have actually been read in. The check in Rocinante is handled by setting a correnspondent bit for the field number in a BitSet. If at the end as many bits are set as there are mandatory fields in the message, then all mandatory fields have been read in.

An important feature of the Protocol Buffer is the reduction of the payload of a message. Compared to JSON, a lot is achieved by replacing the attribute names with field numbers. But the values are also saved a little.

In Java, integers are stored as Integer with 4 bytes or as Long with 8 bytes. However, as most of the numbers used can be coded in fewer bytes, Protocol Buffer uses the VarInt format.

VarInt stands for “Variable Integer” and is a format for storing integers in a compact form. The VarInt format adapts the size of the memory used to the value of the number. With the VarInt format, the number is stored in grouped 7-bit blocks. Each block contains a continuity bit that indicates whether further bytes are to follow. If the bit is set, then a further byte must be read in, otherwise the coding ends with the current 7 bits.

Numbers up to 127 therefore only require one byte and numbers up to 16383 only two bytes. The advantages of the VarInt format are obvious. Less memory is required if the numbers are small and the messages can be transmitted faster because they are smaller, too. Of course, the advantages of the reduction in memory space are offset by the disadvantage of increased computing time. In computer science, you don’t get anything for free.

The VarInt format is not only used for the obvious int32, int64, uint32 and uint64, but also for bool, enum and the lengths in string and bytes. For now, Rocinante does not yet support uint32 and uint64. Probably because I don’t like the distinction between signed and unsigned number types. The type uint32 can be mapped to Long. Large values of uint64 are either dropped or BigInteger has to enter the stage. Let’s just see what the future brings.

The type bool is represented as VarInt with the values 0 or 1 and therefore only requires one byte for its value. Whether it is a Boolean, an Enum, an Integer or a Long cannot be recognized by the binary data, the correct interpretation is made by the code in the read method.

The same applies to the types string and bytes. both encode their content as a VarInt for the length of the following byte sequence. The read method decides whether this sequence is interpreted as a byte[] or as an utf-8 encoded String.

The following code shows a small excerpt from the ProtoInputStream class.

public class ProtoInputStream extends FilterInputStream {

  private int lastFieldNumber;

  public ProtoInputStream(InputStream in) {
    super(in);
  }

  public WireType readType() throws IOException {
    int value = read();
    if (value == -1) {
      return null;
    }
    lastFieldNumber = value >> 3;
    return WireType.values()[value & 0b00000111];
  }

  protected Integer readVariableByteInt() throws IOException {
    byte digit;
    int msgLength = 0;
    int multiplier = 1;

    do {
      digit = (byte) read();
      msgLength += ((digit & 0x7F) * multiplier);
      multiplier *= 128;
    } while ((digit & 0x80) != 0);
    return msgLength;
  }

  // ...
}

The readType method returns the type of the current field in the InputStream and also saves the field number, which is encoded in the same byte. The readVariableByteInt reads a VarInt as an int. At the moment it is neither particularly optimized nor is error handling carried out here in the event of an integer overflow. However, the latter is already on the TODO list.

The types int32 and int64 have an unpleasant disadvantage when using negative numbers. These must be represented as 2’s complement. For some migration reason it is always 64bit long. This means that the number -2 is stored as feffffffffffffffffffff01 in 10 bytes. To avoid this waste of space, two further data types sint32 and sint64 were defined in Protocol Buffer.

Unfortunately, Rocinante is still incompatible with Protocol Buffer and only uses 5 bytes for negative values in int32. When I first read the documentation, I thought this was a badly formulated passage. I would never have dreamed that such a waste would be allowed at this point.

In the sint32 and sint64 types, the numbers are coded according to the ZigZag coding scheme. The easiest way to explain this scheme is with the following table.

Negative ValuesPositive ValuesEncoding
00
-11
12
-23
24

The coding switches back and forth between the negative and positive values, so every positive value p is coded as p\cdot2 and every negative value n as \lvert n\rvert\cdot2-1. This means that numbers above 63 already need two bytes, but all numbers between -1 and -64 only need one byte instead of 10 bytes.

Since the ZigZag format only requires the transformation of the input values, it was also implemented in Rocinante.

public int readZigZagInteger(WireType wireType) throws IOException {
  validate(wireType, WireType.VARINT);
  Integer value = readVariableByteInt();
  return (value >>> 1) ^ -(value & 1);
}

Once the basic functionalities have been described and implemented, the binary representation of our example can be processed.

0a 09 52 6f 63 69 6e 61 6e 74 65 10 01 18 2a

The first byte 0a contains the value 2 for LEN encoded types such as string, bytes and message in the lower three bits. The value 1 is encoded in the upper 5 bits for the first field text from the Example. This is followed by 09, a VarInt that encodes the length 9 of the fields content. This is followed by 9 bytes as an utf-8 coded String.

The next byte 10 contains the value 0 in the lower three bits. the value 0 encodes the type VARINT. The upper 5 bits encode the value 2 for the flag field from the Example. The next value 01 encodes the boolean value true.

The second last byte 18 again encodes the type VARINT in the lower three bits. The upper five bits encode the value 3 for the number field. The last byte 2a as VarInt encodes the number 42.

We have done this manually here, but of course it also works with our ExampleProto class.

byte[] bytes = HexFormat.of().parseHex("0a09526f63696e616e74651001182a");
Example example = ExampleProto.read(new BytesArrayInputStream(bytes));

In the next article we will implement a Maven plugin to automate the generation of our Protocol Buffer classes.

1 Gedanke zu „Rocinante – encoding and decoding“

Schreibe einen Kommentar