Monkey Unpacks Some Python

UNPACK-ing Python .. Now with added monkey!

Some forensic folks have suggested that a Python tutorial on how to read/print binary data types might be helpful to budding Python programmers in the community.
So in this post, we will simulate reverse engineering a fictional contact file format and then write a Python script to extract/print out the values.
For brevity, this post ass-umes the reader has a basic knowledge of Python (i.e. they can launch a script and know about functions/assigning variables etc.). There are plenty of introductory tutorials online - if you are a beginner, you might want to check out the Google Developer Python course before proceeding.

The script (unpack-tute.py) has been tested with both Python v2.7.12 and Python 3.4.1 on a Win7x64 PC.
Historically, Python 2 had more supported 3rd party libraries. Consequently, it was the first version of Python that this monkey learned and we are actually more familiar with Python 2. Python 2's End Of Life is currently scheduled for April 2020 so there's a few years left. However, as this script does not rely on 3rd party libraries, we have adapted it to run on both Python 2 and 3.
The main difference affecting this script was that Python 3 treats strings as Unicode by default so we had to add an encode('utf-8') call when searching through our data file.

There is more than one way to code a solution. We have tried to make this code easy to follow instead of making it "Pythonic" (whatever that even means) or by adding lots of error checking code (if you write a script, you should know how it to use it!).
The Python script (unpack-tute.py) and sample binary file (testctx.bin) will be posted to my brand-monkey-spanking-new GitHub Python Tutorials folder.

So, here's a screenshot of the "testctx.bin" file we want to read:

Screenshot of "testctx.bin" (brought to us courtesy of WinHex!)

Note: The first contact record is highlighted and offsets are listed in decimal. Curious George is ... curious?

Using some reverse engineering strategies that we previously wrote about here we can make a few observations regarding the structure of each Contact record ...

We can see there's a repeated "ctx!" string before each Contact record.
After each record's "ctx!" field, there is a Little Endian 2 byte field that seems to increase with each subsequent record (eg 0x0100 at decimal offset 68, 0x02FF at decimal offset 516, 0xFFFF at decimal offset 804). For initial classification, we will say its an index record number.
Each record has a UTF16LE (ie 2 bytes per character) string that contains a name (eg George).
Each record has a UTF8/ASCII (ie 1 byte per character) string that contains a phone number (eg 5551234).
Before each of the strings, there is a one byte integer corresponding to the string size in bytes.
The last field seems to be a 4 byte field. By observing which bytes vary and which bytes remain constant(ie the left most bytes change more rapidly than the rightmost bytes), we suspect the last field is a Little Endian timestamp field. Feeding in the first record's last 4 bytes (ie 0x26CDDB56) into DCode results in a valid date/time for a Unix 32 bit Little Endian timestamp

DCoding the Contact timestamp

So here's our contact record format:

Contact record data structure

And here's a summary of what we want the script to do:

1. Open "testctx.bin" file (read only)
2. Store file contents
3. Search file contents for ctx! markers
4. For each hit:
    4a. Print hit offset
    4b. Extract Index Number field and print
    4c. Extract Name Length field and print
    4d. Extract Name String (UTF16LE) field and print
    4e. Extract Phone Length field and print
    4f. Extract Phone String (UTF8) field and print
    4g. Extract Unix Timestamp field and print (in ISO format)

5. Close file

Simples!

The Script

OK now that we know what we want to do, here's how we implement each step in code ...

Steps 1 & 2 Open file and store file contents (See "unpack-tute.py" lines 25-33):
1. Open "testctx.bin" file (read only)
2. Store file contents

For step 1, we open the "textctx.bin" file in read-only binary mode (what the "rb" stands for):
    fb = open(filename, "rb")

We chose read-only mode because we don't want to change the file contents and we chose binary mode because we are interpreting the file as raw bytes (not text).
Then to read/store the file contents, we call:
    filecontent = fb.read()

So the "filecontent" variable will now contain every byte from the "testctx.bin" file and individual bytes can be accessed directly using the "slice" notation.
For example, filecontent[0:3] is 3 bytes long and includes the bytes at offsets 0, 1 and 2. It does NOT include the byte at offset 3.
If we replace the start/end locations of our slice example with a variable called startoffset, we get:
    filecontent[startoffset:(startoffset+3)]
This will include the 3 bytes at start, start+1, start+2 only.
The reader might want to remember that little notation nugget as monkey has the feeling it will be popping up again later ... (Hehe, Poo jokes are still floating around in 2017!)

Step 3: Search file contents for "ctx!" markers (See "unpack-tute.py" lines 35-49):
Knowing that "ctx!" encoded in ASCII/UTF8 is x63 x74 x78 x21, we can use a variable "searchstring" to represent our search term in hex:
    searchstring = "\x63\x74\x78\x21"

We now consider the "filecontent" variable as one big string of bytes ...
Python string types have a find() method which searches the parent string for a substring. The find() method returns -1 if the substring is not found otherwise, it returns the first offset where it found the substring. The find() method can also take an starting offset argument so we can use a while loop to repeatedly call find() with an incrementing starting offset until we get no more hits. Thus we can find an offset for each substring hit in the parent string, which we can then store in a Python list called "hitlist".
Here's the code:
    nexthit = filecontent.find(searchstring.encode('utf-8'), 0)
    hitlist = []
    while nexthit >= 0:
        hitlist.append(nexthit)
        nexthit = filecontent.find(searchstring.encode(), nexthit + 1)

We use searchstring.encode('utf-8') because of Python 3 compatibility issues. Python 3 treats all strings as Unicode by default, where as we need to search in UTF8 (ie byte by byte). So we have to encode the searchstring as UTF8 before running the search.
Default Python 2 strings are represented as sequences of raw bytes so calling searchstring.encode('utf-8') in Python 2 has no real effect - we could have used Python 2 lines such as:
    nexthit = filecontent.find(searchstring, 0)
and
    nexthit = filecontent.find(searchstring, nexthit + 1)
This was the only major script change required for Python2 and Python 3 compatibility.

Step 4: Looping through each hit (See "unpack-tute.py" lines 50-88):
Now we have our hitlist of offsets to "ctx!" markers and we know how each contact record is structured, so we can iterate through the filecontent variable using a for loop and extract/print the data we need using the slice notation we discussed previously.

4a. We print out each hit offset in both decimal and hexadecimal.
    print("\nHit found at offset: " + str(hit) + " decimal = " + hex(hit) + " hex")

We use the str() function to convert the "hit" offset variable into a decimal string for printing and the hex() function to convert the hit offset variable into a hexadecimal string.

4b. The first field ("Index Number") after the "ctx!" marker will start 4 bytes after the hit offset. To calculate the offset, we can use code like:
    indexnum_offset = hit + 4
As we have already read the entire file into filecontent, we can access the 2 byte "Index Number" field and interpret it as a Little Endian 2 byte integer as follows:
    indexnum = struct.unpack("<H", filecontent[indexnum_offset:(indexnum_offset+2)])[0]

We are using the struct module's "unpack" function on the given filecontent slice to interpret the slice as a LE 2 byte integer and store it in the "indexnum" variable.
The "<H" argument tells unpack how to interpret the raw bytes i.e. "<" for Little Endian, "H" for unsigned 2 byte integer.
The unpack function returns a tuple (kinda like a sequence of variables) so we specify the "[0]" at the end to retrieve the first converted value. It seems a bit weird until you find out that you can chain types together in the same unpack call. For example, "<HH" specifies 2 consecutive LE unsigned 2 byte integers. Unfortunately, we cannot use chaining here due to the variable length of Name/Phone strings in the contact record.
There's a bunch of other unpack types defined in the Python help documentation (search for "pack unpack").

We can now print out our interpreted "indexnum" value but we need to use the str() function to convert our Index Number integer into a printable string. We can use code such as:
    print("indexnum = " + str(indexnum))

We can re-use a similar pattern of code for the remaining fields in the record.
That is, we calculate the offset of field X, interpret those slice bytes and then print.
Because we know the record field sizes (or can read them e.g. via "Name Length" size byte), calculating the offsets becomes an exercise in adding field sizes to previous field offsets to get to the next offset address.

4c. So for the second field ("Name Length") we can use:
    namelength_offset = indexnum_offset + 2
    print("namelength_offset = " + str(namelength_offset))
    namelength = struct.unpack("B", filecontent[namelength_offset:(namelength_offset+1)])[0]
    print("namelength = " + str(namelength))

For the "Name Length" field (one byte long), we use a starting offset ("namelength_offset") which is 2 bytes past the "Index Number" offset ("Index Number" field is 2 bytes long).
We use unpack with the "B" argument as we are interpreting the 1 byte at filecontent[name_length_offset:(namelength_offset+1)] as an unsigned 1 byte integer and storing it in the "namelength" variable.

4d. For the third field ("Name String") we can use:
    namestring_offset = namelength_offset + 1
    namestring = filecontent[namestring_offset:(namestring_offset+namelength)].decode('utf-16-le')
    print("namestring = " + namestring)

After calculating the "Name String" field offset (should be one byte past the "Name Length" field), we can use the string.decode('utf-16-le') method to interpret the filecontent[namestring_offset:(namestring_offset+namelength)] slice as a UTF16LE string and store it in the "namestring" variable.

4e. For the fourth field ("Phone Length") we can use:
    phonelength_offset = namestring_offset + namelength
    phonelength = struct.unpack("B", filecontent[phonelength_offset:(phonelength_offset+1)])[0]
    print("phonelength = " + str(phonelength))

After calculating the "Phone Length" field offset (should be "Name Length" bytes past the "Name String" offset), we use unpack with the "B" argument as we are interpreting the 1 byte at filecontent[phonelength_offset:(phonelength_offset+1)] as an unsigned 1 byte integer and storing it in the "phonelength" variable.

4f. For the fifth field ("Phone String") we can use:
    phonestring_offset = phonelength_offset + 1
    print("phonestring_offset = " + str(phonestring_offset))
    phonestring = filecontent[phonestring_offset:(phonestring_offset+phonelength)].decode('utf-8')
    print("phonestring = " + phonestring)

After calculating the "Phone String" field offset (should be one byte past the "Phone Length" field), we can use the string.decode('utf-8') method to interpret the filecontent[phonestring_offset:(phonestring_offset+phonelength)] slice as a UTF8 string and store it in the "phonestring" variable.

4g. For the sixth and last field ("Unix Timestamp") we can use:
    timestamp_offset = phonestring_offset + phonelength
    print("timestamp_offset = " + str(timestamp_offset))
    timestamp = struct.unpack("<I", filecontent[timestamp_offset:(timestamp_offset+4)])[0]
    print("raw timestamp decimal value = " + str(timestamp))
    timestring = datetime.datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%dT%H:%M:%S")
    print("timestring = " + timestring)

We calculate the timestamp offset as being "Phone Length" bytes past the "Phone String" field and print the timestamp offset to help with debugging.
We use unpack with the "<I" argument to interpret the 4 byte filecontent[timestamp_offset:(timestamp_offset+4)] slice as a LE unsigned 4 byte integer and then store the integer value in the "timestamp" variable.
eg interprets 0x26CDDB56 LE as 0x56DBCD26 BE = 1457245478 decimal = number of seconds since 1JAN1970.
We then call the datetime.datetime.utcfromtimestamp() method to create a Python "datetime" object using the number of seconds since 1JAN1970. The returned datetime object has a "strftime" method we can call to obtain a human readable ISO format string. The "%Y-%m-%dT%H:%M:%S" argument to strftime() specifies that we want a datetime string formatted as Year-Month-DayTHour:Minute:Second.

Step 5: After we process all of the "ctx!" hits, we close the file (See "unpack-tute.py" line 89):
    fb.close()

For shiggles, we also print out the number of hits in the hitlist on line 91 before the script finishes.
    print("\nProcessed " + str(len(hitlist)) + " ctx! hits. Exiting ...\n")

Running the script

For Python v2.7.12:
In a Win7x64 command terminal window with "unpack-tute.py" and "testctx.bin" copied to "c:\":

C:\>c:\Python27\python.exe unpack-tute.py
Running unpack-tute.py v2017-08-19

Hit found at offset: 64 decimal = 0x40 hex
indexnum = 1
namelength_offset = 70
namelength = 12
namestring = George
phonelength = 7
phonestring_offset = 84
phonestring = 5551234
timestamp_offset = 91
raw timestamp decimal value = 1457245478
timestring = 2016-03-06T06:24:38

Hit found at offset: 512 decimal = 0x200 hex
indexnum = 65282
namelength_offset = 518
namelength = 18
namestring = King Kong
phonelength = 9
phonestring_offset = 538
phonestring = +15554321
timestamp_offset = 547
raw timestamp decimal value = 1457245695
timestring = 2016-03-06T06:28:15

Hit found at offset: 800 decimal = 0x320 hex
indexnum = 65535
namelength_offset = 806
namelength = 30
namestring = Magilla Gorilla
phonelength = 10
phonestring_offset = 838
phonestring = +445552468
timestamp_offset = 848
raw timestamp decimal value = 1457258495
timestring = 2016-03-06T10:01:35

Processed 3 ctx! hits. Exiting ...

C:\>

For Python 3.4.1:
In a Win7x64 command terminal window with "unpack-tute.py" and "testctx.bin" copied to "c:\":

C:\>c:\Python34\python.exe unpack-tute.py
Running unpack-tute.py v2017-08-19

Hit found at offset: 64 decimal = 0x40 hex
indexnum = 1
namelength_offset = 70
namelength = 12
namestring = George
phonelength = 7
phonestring_offset = 84
phonestring = 5551234
timestamp_offset = 91
raw timestamp decimal value = 1457245478
timestring = 2016-03-06T06:24:38

Hit found at offset: 512 decimal = 0x200 hex
indexnum = 65282
namelength_offset = 518
namelength = 18
namestring = King Kong
phonelength = 9
phonestring_offset = 538
phonestring = +15554321
timestamp_offset = 547
raw timestamp decimal value = 1457245695
timestring = 2016-03-06T06:28:15

Hit found at offset: 800 decimal = 0x320 hex
indexnum = 65535
namelength_offset = 806
namelength = 30
namestring = Magilla Gorilla
phonelength = 10
phonestring_offset = 838
phonestring = +445552468
timestamp_offset = 848
raw timestamp decimal value = 1457258495
timestring = 2016-03-06T10:01:35

Processed 3 ctx! hits. Exiting ...

C:\>

We can see that all of the name and phone strings are complete / as shown in the Hex view picture.
We also verified that each "timestring" value corresponded to it's raw LE hex value using Dcode.

Final Thoughts

After you know the basics of a language, programming is a skill best sharpened by working on actual projects (not reading books or blog posts).
Google and StackOverflow are your friends when researching how to code common tasks in Python.
Which makes print statements your No-BS-tell-it-like-it-is best friend when debugging (e.g. print offset addresses and/or values to debug). A well placed print statement can be the easiest way of finding out that your fifth cola/coffee didn't do you any favours.

The code in this script is intended for use with files that can fit into memory (ie 0 MB to *maybe* hundreds of MB).
Larger files may require breaking up the file into chunks before reading/processing.

In writing this script, we used Notepad++ (v6.7.9.2) with the Language set to Python to get the funky syntax highlighting (eg comments in green, auto-indenting). The TAB size was set to 4 spaces via the Settings, Preferences, Tab Settings menu. We disabled "Word Wrap" (under View menu) and enabled line numbers (under Settings, Preferences, Editing menu) so if/when you get a runtime error, you can find the relevant line more readily.

If you are in the forensic community and found this post helpful or you're in the forensic community and had some questions/thoughts about the code, please leave a comment or send me an email (No, I will not do your homework/assignment! But if its for a new artifact for a case, monkey might be convinced ;).

Monday, 21 August 2017