Monday, March 30, 2020

Parsing unknown protobufs with python

Protocol Buffers are quite popular, more and more apps and system files are storing data in this format in both iOS and Android operating systems. If you aren't familiar with Protocol Buffers, read this post. There I use the protoc.exe utility (by google), as does everyone else who needs to view this data, when you do not have the corresponding .proto file.

This is great! But the raw view/output has one big disadvantage. While this approach (--decode_raw) works fine if you just want to see the text strings stored in your data, it does not always provide the correct conversions for all the raw data types! 

According to google, when the message (data) is encoded, there are only 6 different types of data types allowed. These are known as wire types. Here are the allowed types (below).

Figure - Allowed wire types from https://developers.google.com/protocol-buffers/docs/encoding#structure

Unless you have the .proto file, you really don't know what the original data type may be. Even protoc.exe just makes a best guess. For instance, all binary blobs are also converted to strings with protoc as both the string and bytes type use the Length-delimited wiretype. There is also no way to tell if a number is to be interpreted as signed or unsigned, because they both use the same underlying type (varint)!

Now to raw-decode a protobuf in python, there are a couple of libraries I have seen so far that do a decent job. I will list out the libraries, then demonstrate parsing with them, and compare.

This seems to be more than 4 years old and not maintained any more. It is also in python2. There is a python3 port somewhere. It makes several assumptions regarding data types and attempts to produce output similar to protoc.

This is a more mature library that provides much more in functionality. It makes relatively few assumptions about data types. In addition to parsing the protobuf and returning a dictionary object, it also provides a type definition dictionary for the parsed data. 

To demonstrate what I am talking about, I created a demo protocol buffer file called addressbook.proto and defined a protobuf message as shown below.

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
  required int64 id64 = 4;
  required uint64 uid64 = 5;
  optional double double = 6;
  optional bytes bytes = 7;
}

Then compiled it using protoc.exe.

protoc --python_out=. addressbook.proto

Now used a python script to include the compiled python protobuf header and generated a binary protobuf file called tester_pb. The data contained in it is shown below.

Actual data
[
  name: "John Doe"
  id: 1234
  email: "[email protected]"
  id64: -22
  uid64: 13360317589766481554
  double: 4.5566
  bytes: b'\x00\x124V'
]

Protoc outputprotoc --decode_raw < ..\tester_pb  )
1 {
  1: "John Doe"
  2: 1234
  3: "[email protected]"
  4: 18446744073709551594
  5: 13360317589766481554
  6: 0x401239f559b3d07d
  7: "\000\0224V"
}

protobuf-decoder output
{
  '01:00:string': 'John Doe',
  '02:01:Varint': 1234,
  '03:02:string': '[email protected]',
  '04:03:Varint': 18446744073709551594,
  '05:04:Varint': 13360317589766481554,
  '06:05:64-bit': 4.5566
  '07:06:string': '\x00\x124V'
}

blackboxprotobuf output (includes data dictionary and types dictionary)
{'1':
 {
  '1': b'John Doe',
  '2': 1234,
  '3': b'[email protected]',
  '4': -22,
  '5': -5086426483943070062,
  '6': 4616816293942907005,
  '7': b'\x00\x124V'
 }
}
{'1': {'type': 'message', 'message_typedef':
 {
  '1': {'type': 'bytes', 'name': ''},
  '2': {'type': 'int', 'name': ''},
  '3': {'type': 'bytes', 'name': ''},
  '4': {'type': 'int', 'name': ''},
  '5': {'type': 'int', 'name': ''},
  '6': {'type': 'fixed64', 'name': ''},
  '7': {'type': 'bytes', 'name': ''}
 }, 'name': ''}
}

As seen in the outputs above, each decoder makes some default assumptions about the data types encountered. The items highlighted in red are the ones that are interpreted using incorrect types. I like blackboxprotobuf because it lets you specify the real type via a types dictionary similar to the one it outputs. So once we have figured out the correct types, we can pass this into the decode_message() function to get the correct output. See code snippet below.


import blackboxprotobuf

with open('tester_pb', 'rb') as f:
    pb = f.read()
    types = {'1': {'type': 'message', 'message_typedef':
              {
               '1': {'type': 'str', 'name': 'name'},

               '2': {'type': 'int', 'name': 'id'},
               '3': {'type': 'str', 'name': 'email'},
               '4': {'type': 'int', 'name': 'id64'},
               '5': {'type': 'uint', 'name': 'uid64'},
               '6': {'type': 'double', 'name': 'double'},
               '7': {'type': 'bytes', 'name': 'bytes'}
              }, 'name': ''}
         }
    values, _ = blackboxprotobuf.decode_message(pb, types)
    print(values)



That produces the desired output -


{'1':
 {
  'name': 'John Doe',
  'id': 1234,
  'email': '[email protected]',
  'id64': -22,
  'uid64': 13360317589766481554,
  'double': 4.5566,
  'bytes': b'\x00\x124V'
 }
}


In summary, I recommend using the blackboxprotobuf library, however note that it is not exactly plug and play. Since it is not on pypi, you have to use it from code. Also, to use it with python3, I had to make one small tweak. I also added the 'str' type decode as that was not available. Since then I have tested this with numerous protobuf streams and it has not failed me! For my updated version of this library, get it here.

No comments:

Post a Comment