Monday, March 30, 2020

Parsing unknown protobufs with python

Protocol Buffers are quite popular, more and more apps and system files are storing data in this format in both iOS and Android operating systems. If you aren't familiar with Protocol Buffers, read this post. There I use the protoc.exe utility (by google), as does everyone else who needs to view this data, when you do not have the corresponding .proto file.

This is great! But the raw view/output has one big disadvantage. While this approach (--decode_raw) works fine if you just want to see the text strings stored in your data, it does not always provide the correct conversions for all the raw data types! 

According to google, when the message (data) is encoded, there are only 6 different types of data types allowed. These are known as wire types. Here are the allowed types (below).

Figure - Allowed wire types from

Unless you have the .proto file, you really don't know what the original data type may be. Even protoc.exe just makes a best guess. For instance, all binary blobs are also converted to strings with protoc as both the string and bytes type use the Length-delimited wiretype. There is also no way to tell if a number is to be interpreted as signed or unsigned, because they both use the same underlying type (varint)!

Now to raw-decode a protobuf in python, there are a couple of libraries I have seen so far that do a decent job. I will list out the libraries, then demonstrate parsing with them, and compare.

This seems to be more than 4 years old and not maintained any more. It is also in python2. There is a python3 port somewhere. It makes several assumptions regarding data types and attempts to produce output similar to protoc.

This is a more mature library that provides much more in functionality. It makes relatively few assumptions about data types. In addition to parsing the protobuf and returning a dictionary object, it also provides a type definition dictionary for the parsed data. 

To demonstrate what I am talking about, I created a demo protocol buffer file called addressbook.proto and defined a protobuf message as shown below.

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
  required int64 id64 = 4;
  required uint64 uid64 = 5;
  optional double double = 6;
  optional bytes bytes = 7;

Then compiled it using protoc.exe.

protoc --python_out=. addressbook.proto

Now used a python script to include the compiled python protobuf header and generated a binary protobuf file called tester_pb. The data contained in it is shown below.

Actual data
  name: "John Doe"
  id: 1234
  email: "[email protected]"
  id64: -22
  uid64: 13360317589766481554
  double: 4.5566
  bytes: b'\x00\x124V'

Protoc outputprotoc --decode_raw < ..\tester_pb  )
1 {
  1: "John Doe"
  2: 1234
  3: "[email protected]"
  4: 18446744073709551594
  5: 13360317589766481554
  6: 0x401239f559b3d07d
  7: "\000\0224V"

protobuf-decoder output
  '01:00:string': 'John Doe',
  '02:01:Varint': 1234,
  '03:02:string': '[email protected]',
  '04:03:Varint': 18446744073709551594,
  '05:04:Varint': 13360317589766481554,
  '06:05:64-bit': 4.5566
  '07:06:string': '\x00\x124V'

blackboxprotobuf output (includes data dictionary and types dictionary)
  '1': b'John Doe',
  '2': 1234,
  '3': b'[email protected]',
  '4': -22,
  '5': -5086426483943070062,
  '6': 4616816293942907005,
  '7': b'\x00\x124V'
{'1': {'type': 'message', 'message_typedef':
  '1': {'type': 'bytes', 'name': ''},
  '2': {'type': 'int', 'name': ''},
  '3': {'type': 'bytes', 'name': ''},
  '4': {'type': 'int', 'name': ''},
  '5': {'type': 'int', 'name': ''},
  '6': {'type': 'fixed64', 'name': ''},
  '7': {'type': 'bytes', 'name': ''}
 }, 'name': ''}

As seen in the outputs above, each decoder makes some default assumptions about the data types encountered. The items highlighted in red are the ones that are interpreted using incorrect types. I like blackboxprotobuf because it lets you specify the real type via a types dictionary similar to the one it outputs. So once we have figured out the correct types, we can pass this into the decode_message() function to get the correct output. See code snippet below.

import blackboxprotobuf

with open('tester_pb', 'rb') as f:
    pb =
    types = {'1': {'type': 'message', 'message_typedef':
               '1': {'type': 'str', 'name': 'name'},

               '2': {'type': 'int', 'name': 'id'},
               '3': {'type': 'str', 'name': 'email'},
               '4': {'type': 'int', 'name': 'id64'},
               '5': {'type': 'uint', 'name': 'uid64'},
               '6': {'type': 'double', 'name': 'double'},
               '7': {'type': 'bytes', 'name': 'bytes'}
              }, 'name': ''}
    values, _ = blackboxprotobuf.decode_message(pb, types)

That produces the desired output -

  'name': 'John Doe',
  'id': 1234,
  'email': '[email protected]',
  'id64': -22,
  'uid64': 13360317589766481554,
  'double': 4.5566,
  'bytes': b'\x00\x124V'

In summary, I recommend using the blackboxprotobuf library, however note that it is not exactly plug and play. Since it is not on pypi, you have to use it from code. Also, to use it with python3, I had to make one small tweak. I also added the 'str' type decode as that was not available. Since then I have tested this with numerous protobuf streams and it has not failed me! For my updated version of this library, get it here.

Update (8/2020): I made more bug fixes and published it to pypi, so you can install it via pip now. 
  pip install blackboxprotobuf

Saturday, March 28, 2020

Google Search & Personal Assistant data on android

The Google app, previously known as Google Now, is installed by default on most phones. From the app's description -

The Google app keeps you in the know about things that matter to you. Find quick answers, explore your interests, and stay up to date with Discover. The more you use the Google app, the better it gets.

Search and browse:
- Nearby shops and restaurants
- Live sports scores and schedules
- Movies times, casts, and reviews
- Videos and images
- News, stock information, and more
- Anything you’d find on the web

It is that ubiquitous bar/widget sometimes called the Google Assistant Search Bar or just google Search widget found on the phone's home screen.

Figure 1 - Google Search / Personal Assistant Bar 

The internal package goes by the name It's artifacts are found at /data/data/

There are many files and folders here, but the most interesting data is the sub-folder files/recently

Your recent searches along with some full screen screenshots of search results are stored here. Screenshots (saved as jpg) are in .webp format. The unique number in the name is referenced by the data in the protobuf file (file name is the email address of the logged in user account). If you are not logged in, nothing is populated in this folder. See screenshots below.

Figure 2 - Folder 'recently' has no entries when no account was logged on.

Figure 3 - Folder 'recently' has files when searches were performed after logging in

The protobuf file ([email protected] in this case) when decoded has entries that look like this (see below) for a typical search. If you aren't familiar with protobuf decoding, read this.

1 {
  1: 15485946382791341007
  3: 0
  4: 1585414188066
  5: "dolphin"
  8 {
    1: "web"
    2: ""
  9: 10449902870035666886
  17: 1585413397978

In the protobuf data (decoded using protoc.exe), as seen above, we can easily distinguish the relevant fields:

Item Description
1 session id
4 timestamp1 (unix epoch)
5 search query
8 dictionary
1 = type of search (web, video, ..)
2 = search engine
9 screenshot-id (needs conversion to int from uint)
17 timestamp2 (unix epoch)

Here is the corresponding screenshot saved in the same folder -
Figure 4 - Screenshot of search for"dolphin"

If you clicked on a recent news story in the app, the protobuf entry looks like this (below):

1 {
  1: 9016892896339717414
  3: 1
  4: 1572444614834
  5: ""
  7 {
    1: ""
    2: ""
    3: "Photos of AirPods Pro arriving in stores around the world - 9to5Mac"
  9: 9016892896339717414
  10: 9
  17: 1572444614834
Figure 5 - Screenshot for news article clicked from link in google app

Last week, I added a plugin for ALEAPP to read these recent search artifacts. This isn't all, there is actually more data to be read here.

The search widget can be used to make any kind of query, which may then be forwarded to the web browser or Android Auto or the Email or Messaging apps depending on what was queried for. This makes for an interesting artifact.

From my test data, all searches are stored in the app_session folder as protobuf files having the extension .binarypb. See screenshot below.

Figure 6 - .binarypb files
Each of these files is a protobuf that stores a lot of data about the searches. This includes searches from Android Auto too. Josh Hickman did some excellent research on Android Auto and addressed some of this briefly in his talk here. A parser is not available to read this as the format of the data contained in the protobufs is unknown. I've attempted to reverse-engineer parts of it enough to get the useful bits of information out, such as the search queries. There are also mp3 recordings of the replies from google assistant stored in some of them. These are being added to ALEAPP to parse.

The format here is a bit too much to write about. Below is the raw protobuf structure (sans the binary blobs, replaced by ...). The search term here was "tom and jerry".

  1: 0x00000053b0c63c1b
  2: 0x11f0299e
  3: "search"
  132242267: ""
  132264001 {
    1: "..."
    2: 0x00000000
    3: 0
    4: 0x00000000000e75fe
  132269388 {
    2: 0x0000000000000040
    3 {
      1: "..."
      2: ""
      3: "and.gsa.launcher.allapps.appssearch"
  132269847 {
    1 {
      1: "..."
      2: ""
      3: "and.gsa.launcher.allapps.appssearch"
    2 [
      0: "...",
      1: "... tom and jerry ..."
      2: "..."
      3: 1
  146514374 {
    1: "and.gsa.launcher.allapps.appssearch"
  206022552 {
    1: 0

After studying this and several other samples, here are the important pieces in the parsed protobuf dictionary:

1session id (same number as in filename)
3type of query (search, car_assistant, opa)
car_assistant = Android Auto
opa = personal assistant
1 = mp3 recording of response

1 = dictionary
2last query

2 = List of session queries (in blobs)

For more details, refer the module in ALEAPP. Below is a screenshot of the parsed out data.
Figure 7 - ALEAPP output showing Google App / Personal assistant queries