Wednesday 18 December 2013

Monkey Vs Python = Template Based Data Extraction Python Script



There seem to be 2 steps to forensically reverse engineering a file format:
- Figuring out how the data is structured
- Extracting that data for subsequent presentation

The dextract.py script is supposed to help out between the two stages. Obviously, I was battling to come up with a catchy script name ("dextract" = data extract). Meh ...

The motivation for this script came when I was trying to reverse engineer an SMS Inbox binary file format and really didn't want to write a separate data extraction script for every subsequent file format. I also wanted to have a wrestle with Python so this seemed like as good an opportunity as any.

Anyhoo, while 9 out of 10 masochists agree that reverse engineering file formats can be fun, I thought why not save some coding time and have one configurable extraction script that can handle a bunch of different file formats.
This led me to the concept of a "template definition" file. This means one script (with different templates) could extract/print data from several different file types.
Some quick Googling showed that the templating concept has already been widely used in various commercial hex editors:
http://sandersonforensics.com/forum/content.php?119-RevEnge
http://www.x-ways.net/winhex/index-m.html
http://www.sweetscape.com/010editor/
http://www.hexworkshop.com/


Nevertheless, I figured an open source template based script that extracts/prints data might still prove useful to my fellow frugal forensicators - especially if it could extract/interpret/output data to a Tab Separated Values (TSV) file for subsequent presentation.
It is hoped that dextract.py will save analysts from writing customized code and also allow them to share their template files so that others don't have to re-do the reverse engineering. It has been developed and tested (somewhat) on SIFT v2.14 running Python 2.6.4. There may still be some bugs in it so please let me know if you're lucky/unlucky enough to find some.

You can get a copy of the dextract.py script and an example dextract.def template definition file from my Google code page here.
But before we begin, Special Thanks to Mari DeGrazia (@maridegrazia) and Eric Zimmerman (@EricRZimmerman) for their feedback/encouragement during this project. When Monkey starts flinging crap ideas around, he surely tests the patience of all those unlucky enough to be in the vicinity.

So here's how it works:

Everyone LOVES a good data extraction!


Given a list of field types in a template definition file, dextract.py will extract/interpret/print the respective field values starting from a given file offset (defaults to beginning of the file).
After it has iterated through each field in the template definition file once, it assumes the data structure/template repeats until the end offset (defaults to end of file) and the script iterates repeatedly until then.
Additionally, by declaring selected timestamp fields in the template definition, the script will interpret the hex values and print them in a human readable ISO format (YYYY-MM-DDThh:mm:ss).
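To give a rough idea of what that timestamp interpretation involves, here is a minimal sketch (written for Python 3, and not the actual dextract.py code). It unpacks 4 bytes as a Unix timestamp with the standard struct module and formats the result as an ISO string. The function name unix32_to_iso is made up for this example, and UTC is assumed.

```python
import struct
from datetime import datetime, timedelta

def unix32_to_iso(raw, big_endian=True):
    """Interpret 4 raw bytes as seconds since 1970-01-01T00:00:00 (UTC assumed)."""
    fmt = ">I" if big_endian else "<I"
    (secs,) = struct.unpack(fmt, raw)
    # Adding a timedelta to the epoch avoids any local timezone conversion
    stamp = datetime(1970, 1, 1) + timedelta(seconds=secs)
    return stamp.strftime("%Y-%m-%dT%H:%M:%S")

raw = struct.pack(">I", 1387069205)   # the 4 bytes as they would sit in a file
print(unix32_to_iso(raw))             # 2013-12-15T01:00:05
```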

To make things clearer, here's a fictional SMS Inbox file example ... Apparently, Muppets love drunk SMS-ing their ex-partners. Who knew?
So here's the raw data file ("test-sms-inbox.bin") as seen by WinHex:

"test-sms-inbox.bin"


OK, now say that we have determined that the SMS Inbox file is comprised of distinct data records with each record looking like:

Suspected "test-sms-inbox.bin" record structure

Observant monkeys will notice the field marked with the red X. For the purposes of this demo, the X indicates that we suspect that field is the "message read" flag but we're not 100% sure. Consequently, we don't want to clutter our output with the data from this field and need a way of suppressing this output. More on this later ...

And now we're ready to start messing with the script ...

Defining the template file

The template file lists each of the data fields on a separate line.
There are 3 column attributes for each line.
  • The "field_name" is a unique placeholder for whatever the analyst wants to call the field. It must be unique or you will get funky results.
  • The "num_types" field is used to specify the number of "types". This should usually be set to 1 except for strings. For strings, the "num_types" field corresponds to the number of bytes in the string. You can set it to 0 if unknown and the script will extract from the given offset until it reaches a NULL character. Or you can also set it to a previously declared "field_name" (eg "msgsize") and the script will use the value it extracted for that previous "field_name" as the size of the string.
  • The "type" field defines how the script interprets the data. It can also indicate endianness for certain types via the "<" (LE) or ">" (BE) characters at the start of the type.
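The single letter types and the "<" / ">" prefixes look just like the format characters of Python's standard struct module. Assuming that is how the values get unpacked (an assumption on my part for this sketch), the effect of the endianness prefix on the same 2 bytes can be seen with:

```python
import struct

raw = b"\x00\x12"   # 2 example bytes

(be_value,) = struct.unpack(">H", raw)   # Big Endian unsigned 16 bit short
(le_value,) = struct.unpack("<H", raw)   # Little Endian unsigned 16 bit short

print(be_value)   # 18   (0x0012)
print(le_value)   # 4608 (0x1200)
```

Get the prefix wrong and the numbers will still extract without complaint, they will just be the wrong numbers.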

Here's the contents of our SMS Inbox definition file (called "dextract.def").
Note: comment lines begin with "#" and are ignored by the script.

# Note: Columns MUST be seperated by " | " (spaces included)
# field_name | num_types | type
contactname | 0 | s
phone | 7 | s
msgsize | 1 | B
msg | msgsize | s
readflag | 1 | x
timestamp | 1 | >unix32

So we can see that a record consists of a "contactname" (null terminated string), "phone" (7 byte string), "msgsize" (1 byte integer), "msg" (string of "msgsize" bytes), "readflag" (1 byte whose output will be ignored/skipped) and "timestamp" (Big Endian 4 byte No. of seconds since Unix epoch).
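For the curious, reading a definition like this into something a script can loop over might look like the following. This is only a sketch in Python 3, assuming the " | " separator and "#" comment rules described above. The function name parse_definition and the duplicate name check are my own illustrative choices, not necessarily what dextract.py does.

```python
def parse_definition(text):
    """Return (field_names, sizes, types) from template definition text."""
    field_names = []   # keeps the order the fields were declared in
    sizes = {}         # field_name -> num_types column (kept as a string)
    types = {}         # field_name -> type column
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue   # skip blank lines and comment lines
        name, num_types, ftype = line.split(" | ")
        if name in sizes:
            raise ValueError("duplicate field_name: " + name)
        field_names.append(name)
        sizes[name] = num_types
        types[name] = ftype
    return field_names, sizes, types

DEFN = """\
# field_name | num_types | type
contactname | 0 | s
phone | 7 | s
msgsize | 1 | B
msg | msgsize | s
readflag | 1 | x
timestamp | 1 | >unix32
"""

names, sizes, types = parse_definition(DEFN)
print(names)
print(sizes["msg"], types["timestamp"])   # msgsize >unix32
```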

Remember that "readflag" field we weren't sure about extracting earlier?
By defining it as a type "x" we can tell the script to skip processing those "Don't care" bytes.
So if you haven't reverse engineered every field (sleep is for chumps!), you can still extract the fields that you have figured out without any unnecessary clutter.

Running the script

Typing the scriptname without arguments will print the usage help.  
Note: I used the Linux command "chmod a+x" to make my dextract.py executable.

sansforensics@SIFT-Workstation:~$ ./dextract.py
Running dextract v2013-12-11 Initial Version

Usage:
Usage#1: dextract.py -d defnfile -f inputfile
Usage#2: dextract.py -d defnfile -f inputfile -a 350 -z 428 -o outputfile

Options:
  -h, --help      show this help message and exit
  -d DEFN         Template Definition File
  -f FILENAME     Input File To Be Searched
  -o TSVFILE      (Optional) Tab Seperated Output Filename
  -a STARTOFFSET  (Optional) Starting File Offset (decimal). Default is 0.
  -z ENDOFFSET    (Optional) End File Offset (decimal). Default is the end of
                  file.
sansforensics@SIFT-Workstation:~$

The following values are output by the script:
  • Filename
  • File_Offset (offset in decimal for the extracted field value)
  • Raw_Value (uninterpreted value from extracted field)
  • Interpreted_Value (currently used only for dates, it uses the Raw_Value field and interprets it into something meaningful)

The default outputting to the command line can be a little messy so the script can also output to a tab separated file (eg smstest.txt).
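Writing those four values out as tab separated text is straightforward with Python's standard csv module. The sketch below (Python 3) is an assumption about the layout only: the header row simply reuses the four value names listed above, and the real output file may arrange its columns differently. The write_tsv name is made up for this example.

```python
import csv
import io

HEADER = ["Filename", "File_Offset", "Raw_Value", "Interpreted_Value"]

def write_tsv(out, rows):
    """Write a header row plus one tab separated line per extracted value."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)

rows = [
    ("test-sms-inbox.bin", 43, "kermit", ""),
    ("test-sms-inbox.bin", 65, 1387069427, "2013-12-15T01:03:47"),
]
buf = io.StringIO()   # stands in for an open output file
write_tsv(buf, rows)
print(buf.getvalue())
```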

So getting back to our SMS example ...
We can run the script like this:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:0, nullterm str field = contactname, value = fozzie bear
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:12, defined str field = phone, value = 5551234
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:19, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:20, deferred str field = msg, value = Wokka Wokka Wokka!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:39, field = timestamp, value = 1387069205, interpreted date value = 2013-12-15T01:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:69, nullterm str field = contactname, value = Swedish Chef
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:82, defined str field = phone, value = 5554000
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:89, field = msgsize, value = 31
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:90, deferred str field = msg, value = Noooooooony Nooooooony Nooooooo
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:122, field = timestamp, value = 1387080005, interpreted date value = 2013-12-15T04:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:126, nullterm str field = contactname, value = Beaker
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:133, defined str field = phone, value = 5550240
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:140, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:141, deferred str field = msg, value = Mewww Mewww Mewww!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:160, field = timestamp, value = 1387082773, interpreted date value = 2013-12-15T04:46:13

Exiting ...
sansforensics@SIFT-Workstation:~$

And if we import our "smstest.txt" output TSV into a spreadsheet application for easier reading, we can see:

Tab Separated Output File for all records in "test-sms-inbox.bin"


Note: The "readflag" field has not been printed and also note the Unix timestamps have been interpreted into a human readable format.

Now, say we're only interested in one record - the potentially insulting one from "kermit" that starts at (decimal) offset 43 and ends at offset 68.
We can run something like:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt -a 43 -z 68
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47

Exiting ...
sansforensics@SIFT-Workstation:~$

and the resultant output file looks like:


Tab Separated Output File for the "kermit" record only

Let's see our amphibious amore wriggle out of that one eh?

Limitations

The main limitation is that dextract.py relies on files having their data in distinctly ordered blocks (eg same ordered fields for each record type). Normally, this isn't a problem with most flat files containing one type of record.
If you have a file with more than one type of record (eg randomly combined SMS Inbox/Outbox with 2 types of record) then this script can still be useful but the process will be a bit longer/cumbersome.
You can use the start/end offset arguments to tell the script to extract a specific record from the file using a particular template definition (as shown previously).
For extracting another type of record, re-adjust the start/end offsets and point the script to the other template file.
Unfortunately, I couldn't think of a solution to extracting multiple record types randomly ordered in the same file (eg mixed Inbox/Outbox messages). Usually, there would be a record header/number preceding the record data but we can't be sure that would always be the case. So for randomly mixed records, we're kinda stuck with the one record at a time method.
However, if the records were written in a repeated fixed pattern eg recordtypeA, recordtypeB (then back to recordtypeA), the script should be able to deal with that. You could set up a single template file with the definition of recordtypeA then recordtypeB and then the script will repeatedly try to extract records in that order until the end offset/end of file.

FYI As SQLite databases do NOT write NULL valued column values to file, they can have a varying number of fields for each row record depending on the data values. Consequently, dextract.py and SQLite probably won't play well together (unless possibly utilized on a per record basis).

Obviously there are too many types of data fields to cater for them all. So for this initial version, I have limited it to the in-built Python types and some selected timestamps from Paul Sanderson's "History of Timestamps" post.

These selected timestamps also reflect the original purpose of cell phone file data extraction.

Supported extracted data types include:

# Number types:
# ==============
# x (Ignore these No. of bytes)
# b or B (signed or unsigned byte)
# h or H (BE/LE signed or unsigned 16 bit short integer)
# i or I (BE/LE signed or unsigned 32 bit integer)
# l or L (BE/LE signed or unsigned 32 bit long)
# q or Q (BE/LE signed or unsigned 64 bit long long)
# f (BE/LE 32 bit float)
# d (BE/LE 64 bit double float)
#
# String types:
# ==============
# c (ascii string of length 1)
# s (ascii string)
# Note: "s" types will have length defined in "num_types" column. This length can be:
# - a number (eg 140)
# - 0 (will extract string until first '\x00')
# - Deferred. Deferred string lengths must be set to a previously declared field_name
# See "msgsize" in following example:
# msg-null-termd | 0 | s
# msg-fixed-size | 140 | s
# msgsize | 1 | B
# msg-deferred | msgsize | s
# msg-to-ignore | msgsize | x
#
# Also supported are:
# UTF16BE (BE 2 byte string)
# UTF16LE (LE 2 byte string)
# For example:
# UTF16BE-msg-null-termd | 0 | UTF16BE
# UTF16BE-msg-fixed-size | 140 | UTF16BE
# UTF16BE-msgsize | 1 | B
# UTF16BE-msg-deferred | msgsize | UTF16BE
#
# Timestamp types:
# =================
# unix32 (BE/LE No. of secs since 1JAN1970T00:00:00 stored in 32 bits)
# unix48ms (BE/LE No. of millisecs since 1JAN1970T00:00:00 stored in 48 bits)
# hfs32 (BE/LE No. of secs since 1JAN1904T00:00:00)
# osx32 (BE/LE No. of secs since 1JAN2001T00:00:00)
# aol32 (BE/LE No. of secs since 1JAN1980T00:00:00)
# gps32 (BE/LE No. of secs since 6JAN1980T00:00:00)
# unix10digdec (BE only 10 digit (5 byte) decimal No. of secs since 1JAN1970T00:00:00)
# unix13digdec (BE only 13 digit (7 byte) decimal No. of ms since 1JAN1970T00:00:00)
# bcd12 (BE only 6 byte datetime hex string  eg 071231125423 = 31DEC2007T12:54:23)
# bcd14 (BE only 7 byte datetime hex string eg 20071231125423 = 31DEC2007T12:54:23)
# dosdate_default (BE/LE 4 byte int eg BE 0x3561A436 = LE 0x36A46135 = 04MAY2007T12:09:42)
# dosdate_wordswapped (BE/LE 4 byte int eg BE 0xA4363561 = LE 0x613536A4 = 04MAY2007T12:09:42)
#
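Most of these timestamps boil down to "count some seconds from an epoch" or "pick apart some bit/digit fields". The sketch below (Python 3, and not the script's own code) uses the epochs and example values from the list above. The function names are made up, the "20" century prefix for bcd12 is an assumption, and byte order/word swapping is left out: dosdate_to_iso takes the already assembled 32 bit value with the date in the high word.

```python
from datetime import datetime, timedelta

# Epochs taken from the timestamp type list above
EPOCHS = {
    "unix32": datetime(1970, 1, 1),
    "hfs32": datetime(1904, 1, 1),
    "osx32": datetime(2001, 1, 1),
    "aol32": datetime(1980, 1, 1),
    "gps32": datetime(1980, 1, 6),
}

def epoch_to_iso(kind, secs):
    """Add a seconds count to the epoch for the given timestamp type."""
    return (EPOCHS[kind] + timedelta(seconds=secs)).isoformat()

def dosdate_to_iso(value):
    """Decode a 32 bit DOS datetime (date in high word, time in low word)."""
    date, time = value >> 16, value & 0xFFFF
    stamp = datetime(1980 + (date >> 9), (date >> 5) & 0x0F, date & 0x1F,
                     time >> 11, (time >> 5) & 0x3F, (time & 0x1F) * 2)
    return stamp.isoformat()

def bcd12_to_iso(raw):
    """Decode 6 BCD bytes YYMMDDhhmmss (the "20" century is an assumption)."""
    return datetime.strptime("20" + raw.hex(), "%Y%m%d%H%M%S").isoformat()

print(epoch_to_iso("unix32", 1387069205))            # 2013-12-15T01:00:05
print(dosdate_to_iso(0x36A46135))                    # 2007-05-04T12:09:42
print(bcd12_to_iso(bytes.fromhex("071231125423")))   # 2007-12-31T12:54:23
```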


How the code works? A brief summary ...

The code reads each line of the specified template definition file and creates a list of field names. It also creates a dictionary (keyed by field name) for sizes and another dictionary for types.
Starting at the given file offset, the script now iterates through the list of field names and extracts/interprets/prints the data via the "parse_record" function. It repeats this until the end offset (or end of file) is reached.
The main function doesn't involve many lines of code at all. The "parse_record" function and the other subfunctions are where things start to get more involved and they make up the bulk of the code. I think I'll leave things there - no one in their right mind wants to read a blow by blow description about the code.
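That said, for anyone who prefers code to prose, here is a heavily cut down sketch of the same idea in Python 3. It is NOT the real dextract.py: it only handles the field kinds from the SMS example, the field list is hard coded rather than read from a template file, ">I" stands in for ">unix32", and the end offset is treated as inclusive (which is how the -z 68 example above behaves). The make_record helper just fabricates test data.

```python
import struct

# (field_name, num_types, type) mirroring the SMS example definition
FIELDS = [
    ("contactname", "0", "s"),
    ("phone", "7", "s"),
    ("msgsize", "1", "B"),
    ("msg", "msgsize", "s"),
    ("readflag", "1", "x"),
    ("timestamp", "1", ">I"),   # plain struct code standing in for ">unix32"
]

def parse_record(data, offset):
    """Parse one record at offset; return (values, offset of next record)."""
    values = {}
    for name, size, ftype in FIELDS:
        if ftype == "s":
            if size == "0":                        # null terminated string
                end = data.index(b"\x00", offset)
                values[name] = data[offset:end].decode("ascii")
                offset = end + 1                   # step over the NULL
            else:                                  # fixed or deferred length
                length = int(size) if size.isdigit() else values[size]
                values[name] = data[offset:offset + length].decode("ascii")
                offset += length
        elif ftype == "x":                         # "don't care" bytes
            offset += int(size)
        else:                                      # numeric struct type
            (values[name],) = struct.unpack_from(ftype, data, offset)
            offset += struct.calcsize(ftype)
    return values, offset

def extract(data, start=0, end=None):
    """Repeat the template from start until the (inclusive) end offset."""
    end = len(data) - 1 if end is None else end
    records, offset = [], start
    while offset <= end:
        record, offset = parse_record(data, offset)
        records.append(record)
    return records

def make_record(name, phone, msg, flag, secs):
    """Build one fake SMS record for testing (hypothetical helper)."""
    return (name.encode("ascii") + b"\x00" + phone.encode("ascii") +
            struct.pack("B", len(msg)) + msg.encode("ascii") +
            struct.pack("B", flag) + struct.pack(">I", secs))

data = (make_record("fozzie bear", "5551234", "Wokka Wokka Wokka!", 1, 1387069205) +
        make_record("kermit", "5551235", "Hi Ho!", 0, 1387069427))

for record in extract(data):
    print(record)
print(extract(data, start=43, end=68))   # just the "kermit" record
```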

Thoughts on Python

I can see why it has such a growing following. It's similar enough to C and Perl that you can figure out what a piece of code does fairly quickly.
The indents can be a bit annoying but it also means you don't have to spend extra lines for the enclosing {}s. So code seems shorter/purdier.

The online documentation and Stackoverflow website contained pretty much everything I needed - from syntax/recipe examples to figuring out which library functions to call.
It's still early days - I haven't written any classes or tried any inheritance. For short scripts this might be overkill anyway *shrug*.
As others have mentioned previously, it probably comes down to which scripting language has the most appropriate libraries for the task.
SIFT v2.14 uses Python 2.6.4 so whilst it isn't the latest development environment, I figured having a script that works with a widely known/used forensic VM is preferable to having the latest/greatest environment running Python 3.3.
I used jedit for my Python editor but could have also used the gedit text editor already available on SIFT. You can install jedit easily enough via SIFT's Synaptic Package Manager. If you think there's a better Python editor, let me know in the comments.

So ... that's all I got for now. If you find it useful or have some suggestions (besides "Get into another line of work Monkey!"), please leave a comment. Hopefully, it will prove useful to others ... At the very least, I got to play around with Python. Meh, I'm gonna claim that Round 1 was a draw :)