Thursday, 23 January 2014

Android SMS script update and a bit of light housekeeping

Knock, Knock ...

During recent research into Android SQLite databases (eg sms), Mari DeGrazia discovered a bug in the script.
Mari's test data was from a Samsung Galaxy S II. It turns out the script wasn't handling Cell Header "Serial Type" values of 8 or 9.
These Cell Header values are respectively used to represent "0" and "1" integer constants and eliminate the need for a corresponding 0x0/0x1 byte value in the Cell Data field section.
So this meant that some fields were being interpreted as "0" when they were actually set to "1". DOH!

The previous Android test data I used did not utilize these particular cell header values which is why it escaped my monkey like attention to detail. Banana? Where?!

Anyway, there's an updated version of the script available from GitHub here.

Pictures speak louder than words so lets look at a simplified example of an SQLite cell record:

SQLite Cell Record Structure

From the diagram above, we can see the usual SQLite record format. A Cell Size, Rowid and Cell Header Size followed by the rest of the Cell Header and the Cell Data sections.
Notice how HeaderField-B = 0x8? This means there will be no corresponding value written in the Cell Data section (ie there is no DataField-B).
When read, the extracted value of DataField-B will be set (to 0) based on the HeaderField-B type (0x8).
Alternatively, if the HeaderField-B type value was 0x9, the extracted value of DataField-B would be set to 1.

Additionally, since the previous post here - both Mari and I have used to carve sms messages from a cellphone's free space.
Here's how it played out:
- Cellebrite UFED was used to generate the .bin physical image file(s) from an Android phone.
- Then the .bin file(s) were added to a new X-Ways Forensics case.
- A keyword search for various phone numbers turned up multiple hits in the 1 GB+ "Free Space" file (ie unallocated space) which was then exported/copied to SIFT v2.14.
- The script's schema config file was adjusted to match the sms table schema.
- After trying the script with a 1GB+ file, we were consistently getting out of memory errors (even after increasing the SIFT VM RAM to 3 GB).
So the Linux "split" command was used to split the 1GB+ file into 3 smaller 500 MB files.
This ran error free although it meant running the script a few more times. Meh, still better than doing it by hand!
As mentioned in a previous post, this script can potentially be used with non-sms SQLite databases especially if the search term field appears near the start of the cell data section.

From now on, all of my scripts will be hosted at GitHub. I'm not sure how much longer GoogleCode will keep my existing scripts so I have also transferred most of those to GitHub.
Because I can no longer update on GoogleCode, I have removed the previous version to minimize further confusion.

Apologies for any inconvenience caused by this script oversight and Special Thanks to Mari for both spotting and letting me know about the error!

Monday, 13 January 2014

Facebook / Facebook Messenger Android App Parser Script

Poorly drawn parody of the Faceoff movie poster

Not satisfied with how your forensic tools are currently presenting Facebook (v3.3 for Android) / Facebook Messenger (v2.5.3 for Android) messages and contacts?
Would you also like a GoogleMaps URL that plots each message using available geographic metadata?
Or maybe you're just curious about how Facebook / Facebook Messenger stores contacts/messages on Android devices?
If so, read on! If not, then there's nothing to see here ... move along.

This Python script is the brainchild of Shafik Punja (@qubytelogic).  I recently contacted him regarding mobile device script ideas and he quickly got back to me with sample Android test databases and cell phone screenshots (for validation). Without his assistance and feedback, this script would not be here today. So, Thankyou Shafik! The Commonwealth of Forensic Monkeys salutes you!

It's also pretty fortunate timing because Heather Mahalik (@heathermahalik) recently gave an awesomely detailed SANS webcast about Data Retention on Android/iOS devices. In her talk she covered where to look for various application data artefacts and also mentioned a few fun Facebook facts.

BTW I don't use Facebook / Facebook Messenger (monkey has no social life) and no one in their right mind would volunteer their personal data for this blog. So just for you ingrates, I made up a test scenario involving 3 muppets and several Facebook messages.
Due to time contraints, I have only fabricated the script relevant data fields just so we had something to print out.
Any id's I use (hopefully) do not correspond to valid Facebook accounts.
Your own data will probably have more populated fields/longer field lengths. Meh.

The script has been developed/tested on SANS SIFT v2.14 running Python 2.6.4. It has also been run successfully on Win7x64 running Python 2.7.6. You can download it from my Google Code page here.

UPDATE 2014-02-05 Two potential issues have arisen with this code.
1. emoji/non-printable characters in the message text may cause the script to crash. Currently, I do not have the Android test data to verify this.
2. If the keywords "to" or "from" are in the message text and it is subsequently used in a GoogleMaps URL, the URL will not plot properly. This is because GoogleMaps interprets these keywords as routing instructions. I can't think of a way around this without changing the text for the GoogleMaps plot.

Data, Data ... where's the data?

For sanity's sake, I am limiting the scope of this post to actual message content and contacts information. There's a crapload of databases/tables that Facebook uses so I had to draw the line somewhere (even if it's in crayon!). From Shafik's test data, there are 3 tables ("contacts", "threads" and "messages") that we are going to extract data from. These tables are stored in 2 separate SQLite database files ("contacts_db2" and "threads_db2").

The "contacts" table
The Facebook app for Android (katana) stores it's "contacts" table data in:

Notice there's no file extension but it's actually SQLite.

Similarly, the Facebook Messenger app for Android (orca) stores it's "contacts" table data in:

Notice how the filenames are the same?
If you compare their table schemas, you will find that they are identical.
Using the SQLite Manager plugin for Firefox on SIFT v2.14, I opened both "contacts_db2" files and checked the "contacts" table schema (found under the Structure tab).

Facebook App (katana) / Facebook Messenger App (orca) "contacts" table schema:

The "data" column is actually a JSON encoded series of key/value pairs. JSON (JavaScript Object Notation) is just another way of exchanging information. See here for further details.

Using the SQLite Manager Firefox plugin, our fictional muppet test scenario "contacts" table looks like:

Muppet test data "contacts" table

Note: If you hover your cursor over the data cell you're interested in, it brings up the neat yellow box displaying the whole string. So you don't have to waste time re-sizing/scrolling.
If it makes it easier, you can also copy the data cell string and use the JSON validator here to pretty print/validate the string.

In addition to the "contact_id" column, the script extracts/outputs the following JSON fields from the "data" column:

The "PictureUrl" values were usually observed to be based at the "" domain.
The "timelineCoverPhoto" values were usually observed to be based at the "" domain.
For the muppet test scenario data, I've just used picture URLs from wikipedia.

The "threads" table

The Facebook app for Android (katana) stores it's "messages" and "threads" table data in:

Similarly, the Facebook Messenger app for Android (orca) stores it's "messages" and "threads" table data in:

For the "threads" table, the Facebook / Facebook Messenger schemas are identical.

Facebook App (katana) / Facebook Messenger App (orca) "threads" table schema :
CREATE TABLE threads (thread_id TEXT PRIMARY KEY, thread_fbid TEXT, action_id INTEGER, refetch_action_id INTEGER, last_visible_action_id INTEGER, name TEXT, participants TEXT, former_participants TEXT, object_participants TEXT, senders TEXT, single_recipient_thread INTEGER, single_recipient_user_key TEXT, snippet TEXT, snippet_sender TEXT, admin_snippet TEXT, timestamp_ms INTEGER, last_fetch_time_ms INTEGER, unread INTEGER, pic_hash TEXT, pic TEXT, can_reply_to INTEGER, mute_until INTEGER, is_subscribed INTEGER, folder TEXT, draft TEXT )

For the "threads" table, we are only interested in the "thread_id" and "participants" columns.
The "thread_id" can be used to group all the messages from a particular conversation.
Later, we will use the "thread_id" to link "messages" table entries with the "participants" of that thread.
The "participants" column is formatted in JSON and looks something like:
[{"email":"","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog","mute":0,"lastReadReceiptTimestampMs":0},{"email":"","user_key":"FACEBOOK:1087654322","name":"Rowlf","mute":0,"lastReadReceiptTimestampMs":0}]

The script currently only extracts/prints the "name" data field. It is currently left to the analyst to match these "name" values with the "displayName" fields extracted from the "contacts" table mentioned previously.

Here's a screenshot of our fictional muppet "threads" table:

Muppet test data"threads" table

The "messages" table
The Facebook / Facebook Messenger apps' "messages" table schemas differ by one column - Facebook Messenger's "messages" table has an extra column called "auto_retry_count". We're not going to extract this field anyway so our idea of using one extraction script for both apps is still viable. Phew!

Facebook App (katana) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING )

Facebook Messenger App (orca) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING, auto_retry_count INTEGER )

For our test scenario, we will be using the Facebook Messenger App schema (orca) for the "messages" table. It should not matter either way.
Our fictional muppet test scenario "messages" table looks like this:

Muppet test data "messages" table

Note: This screenshot does not show all of the column values, just the script relevant ones (ie "msg_id", "thread_id", "text", "sender" (JSON formatted), "timestamp_ms", "coordinates" (JSON formatted), "source").

The "msg_id" is a unique identifier for each message stored in the table.
The "thread_id" is used to group messages from the same conversation thread.
The "text" string stores the message's text. Note: For formatting reasons, the script converts any "/r/n" and "/n" to spaces.
The "sender" column JSON looks like:
{"email":"","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog"}
From testing observations, this "name" field should correspond to a "displayName" JSON field from the "contacts" table.

The "timestamp_ms" column seems to be the ms since 1JAN1970 in UTC/GMT. It was verified by comparing the message timestamps with screenshots taken from the test Android phone. The test phone displayed the local time of this timestamp.

The "coordinates" column JSON looks like:
Sometimes this column was blank, while other times there were only values defined for latitude/longitude/accuracy.

The "source" column values have been observed to be "messenger", "chat", "web", "mobile".
At this time, I don't know what all of the values indicate. Further testing is required as the Messaging Help from Facebook does not mention this "source" field. Boo!
It's probably safe to say that "messenger" indicates the Facebook Messenger app (not sure if this includes the mobile versions).
The "chat" probably indicates the source being the chat sidebar from within a browser.
The "mobile" possibly indicates the source is a Facebook app running on mobile device (eg Android/iPhone).
The "web" could mean a "Facebook message" was sent from a browser logged into Facebook?
There is also a Firefox add-on for sending Facebook messages but it's unknown which category this would fall under.
BTW if you know what all these values stand for please let us know via the comments section!

So putting it all together ... here's our script relevant data all nicely mapped out for you :)

Facebook messaging schema

Note: The JSON encoded data fields are highlighted in blue ("Columbia Blue" according to Wikipedia for all you interior decorator types :). The remaining uncoloured fields are either text or numeric in nature.

From the diagram above, we can use the "thread_id" to match "participants" (senders/receivers) to a particular thread (collection of messages). See the red link in the diagram.
As mentioned earlier, we can also link the "messages" table's "sender" column ("name") back to an entry in the "contacts" table's "data" column ("displayName"). See the yellowy-orange link in the diagram above.

Due to "sender" columns sometimes being blank, the script does not currently do this automagically (Drats!). Instead, it is suggested that the analyst manually matches each participant "name" from the extracted contacts output using the contacts "displayName" field.
From the test data supplied, these two fields seem to correspond. Future versions of the script could also print the "user_key" field in case there are multiple contacts with the same "displayName"s.

How the script works ...

OK, enough about the data. Let's see what the script does eh?

The script connects to the given "threads_db2" and "contacts_db2" SQLite files and runs queries to extract the stored contacts and messages.
It then sorts/outputs these values to the command line and optionally to the nominated Tab Separated Variable files.
The script converts the "timestamp_ms" column values into the form YYYY-MM-DDThh:mm:ss.
If a message has latitude/longitude data, it will also provide a plot of the position via a GoogleMaps URL. The message text and timestamp are also included on the plot.

In case you were wondering about the SQLite query the script uses to extract the messages ...
select messages.msg_id, messages.thread_id, messages.text, messages.sender, threads.participants, messages.timestamp_ms, messages.source, messages.coordinates from messages, threads where messages.thread_id=threads.thread_id order by messages.thread_id, messages.timestamp_ms;

And for the contacts ...
select contact_id, data from contacts;

To make things easier, Python has some existing libraries we can use:
sqlite3 (for querying the SQLite files)
json (for converting the JSON strings to a Python object we can parse)
datetime (for converting "the timestamp_ms" field into a readable date/time string)
urllib (used to ensure our GoogleMaps URL doesn't contain any illegal characters ... it makes 'em an offer they can't refuse!)

On SIFT v2.14, I've used the following command to make the script directly executable (ie no need to type "python" before the script name):
sudo chmod a+x

Here's the help text ...

sansforensics@SIFT-Workstation:~$ ./
Running fbmsg-extractor v2014-01-08 Initial Version
Usage: -t threads_db -c contacts_db -x contacts.tsv -z messages.tsv

  -h, --help      show this help message and exit
  -t THREADSDB    threads_db2 input file
  -c CONTACTSDB   contacts_db2 input file
  -x CONTACTSTSV  (Optional) Contacts Tab Separated Output Filename
  -z MESSAGESTSV  (Optional) Messages Tab Separated Output Filename

And here's what happens when we run it with our muppet test data ...

sansforensics@SIFT-Workstation:~$ ./ -t facebook/test/threads_db2 -c facebook/test/contacts_db2 -x muppet-contacts.txt -z muppet-messages.txt
Running fbmsg-extractor v2014-01-08 Initial Version

Extracted CONTACTS Data

contact_id    profileFbid    displayName    displayNumber    universalNumber    smallPictureUrl    bigPictureUrl    hugePictureUrl    timelineCoverPhoto

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMwo=    1087654323    Fozzie Bear    (555) 555-0003    +15555550003

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTAwMDAwOTg3NjU0MzIxCg==    100000987654321    Kermit The Frog    NA    NA

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=    1087654321    Miss Piggy    (555) 555-0001    +15555550001

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMgo=    1087654322    Rowlf    (555) 555-0002    +15555550002

Extracted MESSAGES Data

msg_id    thread_id    text    sender    participants    timestamp_ms    source    latitude    longitude    accuracy    heading    speed    altitude    googlemaps
m_id.123456789012345678    t_1234567890abcdefghijk1    Hi-ho! You coming to the show?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:45:03    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28Hi-ho%21+You+coming+to+the+show%3F+%402014-01-03T23%3A45%3A03%29&iwloc=A&hl=en

m_id.123456789012345679    t_1234567890abcdefghijk1    Yes Kermie! Just powdering my nose ...    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:49:05    mobile    33.802399    -117.914954    1500.0    NA    NA    NA,+-117.914954+%28Yes+Kermie%21+Just+powdering+my+nose+...+%402014-01-03T23%3A49%3A05%29&iwloc=A&hl=en

m_id.123456789012345680    t_1234567890abcdefghijk1    So ... At IHOP again huh?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:50:05    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28So+...+At+IHOP+again+huh%3F+%402014-01-03T23%3A50%3A05%29&iwloc=A&hl=en

m_id.123456789012345683    t_1234567890abcdefghijk1    More Pork Rolls for you to love!    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:50:45    mobile    33.802399    -117.914954    1500.0    NA    NA    NA,+-117.914954+%28More+Pork+Rolls+for+you+to+love%21+%402014-01-03T23%3A50%3A45%29&iwloc=A&hl=en

m_id.123456789012345689    t_1234567890abcdefghijk2    Yo Fozzie! Where u at?    Kermit The Frog    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:13    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28Yo+Fozzie%21+Where+u+at%3F+%402014-01-03T23%3A47%3A13%29&iwloc=A&hl=en

m_id.123456789012345690    t_1234567890abcdefghijk2    Hey Kermie! I'm almost BEAR ! Wokka!Wokka!Wokka!    Fozzie Bear    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:43    mobile    33.808227    -117.918948    12.0    90.0    1.0    0.0,+-117.918948+%28Hey+Kermie%21+I%27m+almost+BEAR+%21+Wokka%21Wokka%21Wokka%21+%402014-01-03T23%3A47%3A43%29&iwloc=A&hl=en

4 contacts were processed

6 messages were processed


As you can see, the script prints out the contacts information first followed by the message data. Columns are tab separated but when dealing with large numbers of contacts/messages, the command line quickly becomes unreadable. It is HIGHLY recommended that analysts utilize the output to TSV functionality.

Here's what the outputted TSV data looks like after being imported into a spreadsheet program:

Script's Output TSV for Muppet test data contacts

Script's Output TSV for Muppet test data messages

Contacts are sorted alphabetically by "displayName".
Messages are sorted first by thread, then in chronological order (using the "timestamp_ms" value).

Not all messages will have defined geodata. Some may be blank or only have lat/long/accuracy with no speed/heading/altitude.
CAUTION: Not sure what the units are for the accuracy/speed/heading/altitude
In general, the script outputs the string "NA" if there is no defined value.

Just for shiggles, lets plot Miss Piggy's position for her first reply back to Kermit (ie "Yes Kermie! Just powdering my nose ...") using the GoogleMaps URL from the messages TSV.

Where's Piggy?

From the example screenshot, we can see the message text and timestamp plotted along with her position in GoogleMaps. Somebody's telling porkies eh?

Some Other Trivia ...

Format of "contact_id"
The funky looking "contact_id" (eg "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=") from the "contacts" table is actually base64 encoded. Looking closer at the letters and numbers comprising the "contact_id", we can see an "=" character.
I remembered seeing similar strings in base64 encoded emails ... so just for shiggles, I tried decoding it via the "base64" command.
Here's a fictional demo example:

sansforensics@SIFT-Workstation:~$ echo 'Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=' | base64 --decode

The decoded format appears to be "contact:XXX:YYY"
XXX = remains constant for each "contact_id" and corresponds to /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml's "/auth/user_data/fb_uid" value. It's believed to be used as a unique user id for the Facebook account user.
YYY = Seems to be a user id field for the stored contact (ie equals their "profileFbid" value).

Don't believe everything in the "contacts" database however. Heather Mahalik mentioned in her SANS webcast that Facebook can add contact entries when the app suggests potential friends. Consequently, stored messages should be used to indicate whether a contact entry is someone the account holder has communicated with.

XML files
Be sure to also check out the various XML files containing potentially relevant info (such as username, times, account info). You can find these under:

Even though they share the same filename, /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml differs from the /data/data/com.facebook.katana/shared_prefs/com.facebook.orca_preferences.xml.
In addition to having a different order of declarations, the katana version mentions what appears to be the user's email address.

Other databases
Also check out the "prefs" database tables for username, times, account info. This can be found under /data/data/com.facebook.katana/databases/prefs_db and /data/data/com.facebook.orca/databases/prefs_db.

Facebook Messaging Documentation
Just in case you're as semi-oblivious to Facebook messaging as I am, here's some messaging help I found from the Facebook website.

When you send someone a message, it gets delivered to the person’s Facebook Messages.
If the person you messaged has turned chat on, your message will appear as a chat. If they have chat off, the message will appear in their message inbox and they will receive a notification

Chat and message histories are threaded together — you can think of them as one and the same. When you open a conversation, you’ll see a conversation that includes all your messages along with your entire chat history. If you send a chat message to a friend who has turned chat off, the chat message will be sent directly to their message inbox.

Can I message my mobile contacts if we’re not Facebook friends?
Yes. Confirming your phone number when you first sign in helps ensure that your contacts will be able to find you. Messenger works similar to texting or other mobile messaging apps, and you can add people to your Messenger contacts by entering their phone number.
To allow people who have your phone number to reach you in Messenger, the app will ask you to set the control called "Who can look you up by the phone number you provided?" to Public.

Can I message friends who aren’t using the Facebook Messenger mobile app?
Yes. People who don't have the Facebook Messenger app on their phone will receive chats and messages you send whenever they log into Facebook.

How does location work with the Messenger mobile app?
When you send a message from the Messenger app, your location is included by default. You can turn this feature off by tapping before you send a message, which turns the arrow from blue (on) to gray (off). Location remains off for that conversation until you tap the arrow again.
In order to share your location in messages, you'll need to turn on location services in the main settings of your smartphone.

Who can see my location when I share it in a conversation in Messenger?
Your location is only visible to the people in that conversation.

Does Facebook store my location when I include it in a message?
When you add your location to a message, the location becomes a permanent part of the message history.
When you send a message to a friend with your location, that friend can see it as a pin on a map when they tap on your message. Your location won't appear anywhere outside of the message.

Final Thoughts

This script has been tested with a limited amount of test data. It is possible that some non-defined/blank fields might cause the script to fall over in a screaming heap. If so, let me know and I will try to fix it although some test data may be required to locate the bug(s) quicker. Knowing this monkey's luck, Facebook will probably change their database schema next week anyway LOL.

In the end, it took just as long to write this blog article as it did to write the script. Once you familarize yourself with the relevant libraries / Google for what you want the code to do (Thankyou Stack Overflow !) - it's pretty straight forward. Even a monkey can do it! And while this functionality is possibly already implemented in a number of forensic tools, writing this script provided me with a deeper understanding of the data involved whilst also allowing me to improve my Python programming skills. So I'd say it was well worth the effort.

It seems GoogleCode is stopping uploads from mid-January so this will probably be the last script I put on there. I'll have to find a new (free) home for future scripts. Anyone have suggestions? I am thinking of sharing a GoogleDrive / DropBox folder but that might not allow for easily viewed release notes. Not that anyone reads the release notes anyway LOL.

As usual, please feel free to leave comments/suggestions in the comments section below.

Wednesday, 18 December 2013

Monkey Vs Python = Template Based Data Extraction Python Script

There seems to be 2 steps to forensically reverse engineering a file format:
- Figuring out how the data is structured
- Extracting that data for subsequent presentation

The script is supposed to help out between the two stages. Obviously, I was battling to come up with a catchy script name ("dextract" = data extract). Meh ...

The motivation for this script came when I was trying to reverse engineer an SMS Inbox binary file format and really didn't want to write a separate data extraction script for every subsequent file format. I also wanted to have a wrestle with Python so this seemed like as good an opportunity as any.

Anyhoo, while 9 out of 10 masochists agree that reverse engineering file formats can be fun, I thought why not save some coding time and have one configurable extraction script that can handle a bunch of different file formats.
This lead me to the concept of a "template definition" file. This means one script (with different templates) could extract/print data from several different file types.
Some quick Googling showed that the templating concept has already been widely used in various commercial hex editors

Nevertheless, I figured an open source template based script that extracts/prints data might still prove useful to my fellow frugal forensicators - especially if it could extract/interpret/output data to a Tabbed Separated (TSV) file for subsequent presentation.
It is hoped that will save analysts from writing customized code and also allow them to share their template files so that others don't have to re-do the reverse engineering. It has been developed and tested (somewhat) on SIFT v2.14 running Python 2.6.4. There may still be some bugs in it so please let me know if you're lucky/unlucky enough to find some.

You can get a copy of the script and an example dextract.def template definition file from my Google code page here.
But before we begin, Special Thanks to Mari DeGrazia (@maridegrazia) and Eric Zimmerman (@EricRZimmerman) for their feedback/encouragement during this project. When Monkey starts flinging crap ideas around, he surely tests the patience of all those unlucky enough to be in the vicinity.

So here's how it works:

Everyone LOVES a good data extraction!

Given a list of field types in a template definition file, will extract/interpret/print the respective field values starting from a given file offset (defaults to beginning of the file).
After it has iterated through each field in the template definition file once, it assumes the data structure/template repeats until the end offset (defaults to end of file) and the script iterates repeatedly until then.
Additionally, by declaring selected timestamp fields in the template definition, the script will interpret the hex values and print them in a human readable ISO format (YYYY-MM-DDThh:mm:ss).

To make things clearer, here's a fictional SMS Inbox file example ... Apparently, Muppets love drunk SMS-ing their ex-partners. Who knew?
So here's the raw data file ("test-sms-inbox.bin") as seen by WinHex:


OK, now say that we have determined that the SMS Inbox file is comprised of distinct data records with each record looking like:

Suspected "test-sms-inbox.bin" record structure

Observant monkeys will notice the field marked with the red X. For the purposes of this demo, the X indicates that we suspect that field is the "message read" flag but we're not 100% sure. Consequently, we don't want to clutter our output with the data from this field and need a way of suppressing this output. More on this later ...

And now we're ready to start messing with the script ...

Defining the template file

The template file lists each of the data fields on a seperate line.
There are 3 column attributes for each line.
  • The "field_name" is a unique placeholder for whatever the analyst wants to call the field. It must be unique or you will get funky results.
  • The "num_types" field is used to specify the number of "types". This should usually be set to 1 except for strings. For strings, the "num_types" field corresponds to the number of bytes in the string. You can set it to 0 if unknown and the script will extract from the given offset until it reaches a NULL character. Or you can also set it to a previously declared "field_name" (eg "msgsize") and the script will use the value it extracted for that previous "field_name" as the size of the string.
  • The "type" field defines how the script interprets the data. It can also indicate endianness for certain types via the "<" (LE) or ">" (BE) characters at the start of the type.

Here's the contents of our SMS Inbox definition file (called "dextract.def").
Note: comment lines begin with "#" and are ignored by the script.

# Note: Columns MUST be seperated by " | " (spaces included)
# field_name | num_types | type
contactname | 0 | s
phone | 7 | s
msgsize | 1 | B
msg | msgsize | s
readflag | 1 | x
timestamp | 1 | >unix32

So we can see that a record consists of a "contactname" (null terminated string), "phone" (7 byte string), "msgsize" (1 byte integer), "msg" (string of "msgsize" bytes), "readflag" (1 byte whose output will be ignored/skipped) and "timestamp" (Big Endian 4 byte No. of seconds since Unix epoch).

Remember that "readflag" field we weren't sure about extracting earlier?
By defining it as a type "x" we can tell the script to skip processing those "Don't care" bytes.
So if you haven't reverse engineered every field (sleep is for chumps!), you can still extract the fields that you have figured out without any unnecessary clutter.

Running the script

Typing the scriptname without arguments will print the usage help.  
Note: I used the Linux command "chmod a+x" to make my executable.

sansforensics@SIFT-Workstation:~$ ./
Running dextract v2013-12-11 Initial Version

Usage#1: -d defnfile -f inputfile
Usage#2: -d defnfile -f inputfile -a 350 -z 428 -o outputfile

  -h, --help      show this help message and exit
  -d DEFN         Template Definition File
  -f FILENAME     Input File To Be Searched
  -o TSVFILE      (Optional) Tab Seperated Output Filename
  -a STARTOFFSET  (Optional) Starting File Offset (decimal). Default is 0.
  -z ENDOFFSET    (Optional) End File Offset (decimal). Default is the end of

The following values are output by the script:
  • Filename
  • File_Offset (offset in decimal for the extracted field value)
  • Raw_Value (uninterpreted value from extracted field)
  • Interpreted_Value (currently used only for dates, it uses the Raw_Value field and interprets it into something meaningful)

The default outputting to the command line can be a little messy so the script can also output to a tab separated file (eg smstext.txt).

So getting back to our SMS example ...
We can run the script like this:

sansforensics@SIFT-Workstation:~$ ./ -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:0, nullterm str field = contactname, value = fozzie bear
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:12, defined str field = phone, value = 5551234
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:19, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:20, deferred str field = msg, value = Wokka Wokka Wokka!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:39, field = timestamp, value = 1387069205, interpreted date value = 2013-12-15T01:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:69, nullterm str field = contactname, value = Swedish Chef
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:82, defined str field = phone, value = 5554000
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:89, field = msgsize, value = 31
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:90, deferred str field = msg, value = Noooooooony Nooooooony Nooooooo
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:122, field = timestamp, value = 1387080005, interpreted date value = 2013-12-15T04:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:126, nullterm str field = contactname, value = Beaker
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:133, defined str field = phone, value = 5550240
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:140, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:141, deferred str field = msg, value = Mewww Mewww Mewww!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:160, field = timestamp, value = 1387082773, interpreted date value = 2013-12-15T04:46:13

Exiting ...

And if we import our "smstest.txt" output TSV into a spreadsheet application for easier reading, we can see:

Tab Separated Output File for all records in "test-sms-inbox.bin"

Note: The "readflag" field has not been printed and also note the Unix timestamps have been interpreted into a human readable format.

Now, say we're only interested in one record - the potentially insulting one from "kermit" that starts at (decimal) offset 42 and ends at offset 68.
We can run something like:

sansforensics@SIFT-Workstation:~$ ./ -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt -a 43 -z 68
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47

Exiting ...

and the resultant output file looks like:

Tab Separated Output File for the "kermit" record only

Lets see our amphibious amore wriggle out of that one eh?


The main limitation is that relies on files having their data in distinctly ordered blocks (eg same ordered fields for each record type). Normally, this isn't a problem with most flat files containing one type of record.
If you have a file with more than one type of record (eg randomly combined SMS Inbox/Outbox with 2 types of record) then this script can still be useful but the process will be a bit longer/cumbersome.
You can use the start/end offset arguments to tell the script to extract a specific record from the file using a particular template definition (as shown previously).
For extracting another type of record, re-adjust the start/end offsets and point the script to the other template file.
Unfortunately, I couldn't think of a solution to extracting multiple record types randomly ordered in the same file (eg mixed Inbox/Outbox messages). Usually, there would be a record header/number preceeding the record data but we can't be sure that would always be the case. So for randomly mixed records, we're kinda stuck with the one record at a time method.
However, if the records were written in a repeated fixed pattern eg recordtypeA, recordtypeB (then back to recordtypeA), the script should be able to deal with that. You could set up a single template file with the definition of recordtypeA then recordtypeB and then the script will repeatedly try to extract records in that order until the end offset/end of file.

FYI As SQLite databases do NOT write NULL valued column values to file, they can have varying number of fields for each row record depending on the data values. Consequently, and SQLite probably won't play well together (unless possibly utilized on a per record basis).

Obviously there are too many types of data fields to cater for them all. So for this initial version, I have limited it to the in-built Python types and some selected timestamps from Paul Sanderson's "History of Timestamps" post.

These selected timestamps also reflect the original purpose of cell phone file data extraction.

Supported extracted data types include:

# Number types:
# ==============
# x (Ignore these No. of bytes)
# b or B (signed or unsigned byte)
# h or H (BE/LE signed or unsigned 16 bit short integer)
# i or I (BE/LE signed or unsigned 32 bit integer)
# l or L (BE/LE signed or unsigned 32 bit long)
# q or Q (BE/LE signed or unsigned 64 bit long long)
# f (BE/LE 32 bit float)
# d (BE/LE 64 bit double float)
# String types:
# ==============
# c (ascii string of length 1)
# s (ascii string)
# Note: "s" types will have length defined in "num_types" column. This length can be:
# - a number (eg 140)
# - 0 (will extract string until first '\x00')
# - Deferred. Deferred string lengths must be set to a previously declared field_name
# See "msgsize" in following example:
# msg-null-termd | 0 | s
# msg-fixed-size | 140 | s
# msgsize | 1 | B
# msg-deferred | msgsize | s
# msg-to-ignore | msgsize | x
# Also supported are:
# UTF16BE (BE 2 byte string)
# UTF16LE (LE 2 byte string)
# For example:
# UTF16BE-msg-null-termd | 0 | UTF16BE
# UTF16BE-msg-fixed-size | 140 | UTF16BE
# UTF16BE-msgsize | 1 | B
# UTF16BE-msg-deferred | msgsize | UTF16BE
# Timestamp types:
# =================
# unix32 (BE/LE No. of secs since 1JAN1970T00:00:00 stored in 32 bits)
# unix48ms (BE/LE No. of millisecs since 1JAN1970T00:00:00 stored in 48 bits)
# hfs32 (BE/LE No. of secs since 1JAN1904T00:00:00)
# osx32 (BE/LE No. of secs since 1JAN2001T00:00:00)
# aol32 (BE/LE No. of secs since 1JAN1980T00:00:00)
# gps32 (BE/LE No. of secs since 6JAN1980T00:00:00)
# unix10digdec (BE only 10 digit (5 byte) decimal No. of secs since 1JAN1970T00:00:00)
# unix13digdec (BE only 13 digit (7 byte) decimal No. of ms since 1JAN1970T00:00:00)
# bcd12 (BE only 6 byte datetime hex string  eg 071231125423 = 31DEC2007T12:54:23)
# bcd14 (BE only 7 byte datetime hex string eg 20071231125423 = 31DEC2007T12:54:23)
# dosdate_default (BE/LE 4 byte int eg BE 0x3561A436 = LE 0x36A46135 = 04MAY2007T12:09:42)
# dosdate_wordswapped (BE/LE 4 byte int eg BE 0xA4363561 = LE 0x613536A4 = 04MAY2007T12:09:42)

How the code works? A brief summary ...

The code reads each line of the specified template definition file and creates a list of field names. It also creates a dictionary (keyed by field name) for sizes and another dictionary for types.
Starting at the given file offset, the script now iterates through the list of fieldnames and extracts/interprets/prints the data via the "parse_record" method. It repeats this until the end offset (or end of file) is reached.
The main function doesn't involve many lines of code at all. The "parse_record" function and other subfunctions is where things start to get more involved and they make up the bulk of the code. I think I'll leave things there - no one in their right mind wants to read a blow by blow description about the code.

Thoughts on Python

I can see why it has such a growing following. It's similar enough to C and Perl that you can figure out what a piece of code does fairly quickly.
The indents can be a bit annoying but it also means you don't have to spend extra lines for the enclosing {}s. So code seems shorter/purdier.

The online documentation and Stackoverflow website contained pretty much everything I needed - from syntax/recipe examples to figuring out which library functions to call.
It's still early days - I haven't written any classes or tried any inheritance. For short scripts this might be overkill anyway *shrug*.
As others have mentioned previously, it probably comes down to which scripting language has the most appropriate libraries for the task.
SIFT v2.14 uses Python 2.6.4 so whilst it isn't the latest development environment, I figured having a script that works with a widely known/used forensic VM is preferable to having the latest/greatest environment running Python 3.3.
I used jedit  for my Python editor but could have also used the gedit text editor already available on SIFT. You can install jedit easily enough via the SIFT's Synaptic Package Manager. Let me know if you think there's a better Python editor in the comments?

So ... that's all I got for now. If you find it useful or have some suggestions (besides "Get into another line of work Monkey!"), please leave a comment. Hopefully, it will prove useful to others ... At the very least, I got to play around with Python. Meh, I'm gonna claim that Round 1 was a draw :)

Wednesday, 18 September 2013

Reflections of a Monkey Intern and some HTCIA observations

Inspired by the approaching 12 month point of my internship and this Lifehacker article, I thought I'd share some of my recent thoughts/experiences. Hopefully, writing this drivel will force me to better structure/record my thoughts. It's kinda like a memo to myself but feel free to share your thoughts in the comments section.


This is vital to any healthy internship. Ensuring that both intern/employer have the same/realistic expectations will help in all other areas.
Initially, I found it beneficial to over-communicate if I was unsure (eg explain what I did and then ask about any uncertainties). Interns asking questions are also a good way for supervisors to gauge understanding. Perhaps the intern's line of questioning might uncover additional subjects which the supervisor can help with.

Take detailed notes of any tasks you have performed. This includes the time spent (be honest!) and any notable achievements/findings. These notes will make it easier for you to communicate to your supervisor exactly what has been done.
Later, you can also use these notes to:
- help you pimp your CV (eg "I peer-reviewed X client deliverable reports") and
- see how far you've progressed (eg now it only takes me Y minutes to do that task).

Goal Setting & Feedback

Having an initial goal of "getting more experience" is OK but when the work load surges/subsides, it's easy to lose track of where your training was up to before the interruption. Regular feedback sessions where both parties can communicate short term goals (eg get more experience with X-Ways) can help keep you on track. They don't have to be long, formal discussions - if things are going well, it might only be a 5 minute conversation.
It's also easy to fall into a comfort zone and say "everythings peachy". Don't leave it all to your supervisor - think about other new skills/tools you might like to learn/apply.
Regular communication with your supervisor about the internship will also encourage/help them think about your progress.

The internship should be geared more for the intern's benefit rather than the employer but it is still a two way street. If you feel like your needs are not being met, speak up but also realise that there's mundane tasks in every job and that you can usually learn something from almost any task. The internship is all about experiencing the good, the not so good and the "I never want to do that ever again!".

Rules & Guidelines

Follow your supervisor's instructions but you don't have to be a mindless robot about it. Whatever the task, try to think of ways to improve/streamline a process/description. eg Would a diagram help with this explanation? Can I write a script to automate this task? Could we word this description better - if so, be prepared to provide alternatives. However, before you implement your game changing improvements, be sure to discuss them with your supervisor first!

Pace Yourself

As an intern, you are not expected to know everything. However, you can't sit on your paws and expect to be taught everything either. I guess it's like learning to ride a bike - your supervisor has done it before but there's only so much they can tell you before it's time for you to do it yourself. Along the way, you might fall/stuff up but that's all part of learning.
Everyone learns at different rates. Try not to get too high/too low about your progress. At the start, it's tempting to "go hard" but interns should also make the time to ensure that they are on-track. In this regard, knowing when to ask for help or for extra info can make an internship so much easier. If something feels like its taking too long, it's probably time to ask your supervisor for help.
Also, allow yourself time to decompress/be simian. This will require you to ask/know what work is coming up. Remember, they wouldn't be taking on an intern if business was slow but interns are (supposedly!) human too. We all need a break now and then. If you have a prior commitment, let your supervisor know as soon as possible.
I have noticed that I tend to get absorbed in a problem and can work long hours on it until it's resolved. However, when that's over, I like to slow things down to recharge the batteries. During this slower period (when the case load wanes), I might be doing research or writing scripts or just relaxing and doing non-forensic stuff. Knowing and being honest about your preferred working style can also help you choose the most appropriate forensics job (eg a small private company vs a large law enforcement agency).

Confidence & Mistakes

Despite my awesome cartooning ability, I would not say that I am a naturally confident and sociable person. New unknowns (eg social situations) can be a little daunting for me. However, I am learning that confidence is an extension of experience. The more experience you get, the more situational awareness you develop. I think this "awareness" can then appear to others as confidence (eg "Oh I've seen this before ... if we do ABC we need to think about XYZ").
I still cringe every time I realise that I've made a mistake but I also realise that mistakes are part of the learning process/experience. The main thing is to get back on the bike and not to repeat the mistake.
I also like to use my mistakes as motivation to achieve something extra positive. For example, if I make a mistake in one section of a report, I use it to as motivation to look for areas of improvement for the other sections. It's kinda corny but this pseudo self-competitiveness keeps things interesting (especially when writing reports).

Use Your Breaks/Free Time Wisely

Like most monkeys, I have found it easier to retain information by doing rather than reading (ie monkey-see, monkey-do). That said, there's no way I'm gonna be able to do everything.
One thing I like to do with my spare time is to try to keep current with DFIR news (eg new tools/technology, popular consumer applications). The trends of today will probably contain the evidence we'll need tomorrow. My approach is to read as many relevant blogs/forums as possible and understand that whilst I may not remember every detail, I understand enough so if/when I need this information, my monkey-brain goes "Yup so and so did a post on this last year" and I can re-familarize myself with the specific details.

Certification ... blech! I have mixed feelings about this. I am sure many recruiters just skim resumes looking for key words such as EnCe or ACE. Knowing a tool doesn't necessarily make you a better investigator. Knowing what artifacts the tools are processing and how they work, does. Writing your own tools to process artifacts? Even better!
However, as an intern looking for a full time job we also have to think of how to sell ourselves to an employer (no, not like that...). ie What skills/experience are employers looking for?
Obviously your chances of landing a full time job improve if you have some (certified) experience with the forensic tools that they use. While I have used various commercial tools for casework, I've also been fortunate that my supervisor has also let me use them to do additional practice cases. This has given me enough experience to get a vendor based cell phone certification that I can now add to my CV.
Regardless of whether your shop uses commercial or open source tools, getting some extra "seat time" working on previous/practice cases is a great way to improve the confidence/speed at which you work. And being an intern, your supervisor can also act as a trainer/coach.

Meeting New People

It's becoming apparent to me that in DFIR, who you know plays just as an important role as what you know. For example, your business might get a referral from someone you meet at a conference or maybe that someone can help you with some forensic analysis or land a new job.  Being a non-drinking, naturally shy intern monkey, meeting new people can intimidate the crap outta me. However, I also realise that it's a small DFIR world and that we really should make the time to connect with other DFIRers. Even if it's as simple as reading someone's blog post and sending them an email to say thank you. Or perhaps suggesting some improvements for their process/program. FYI Bloggers REALLY appreciate hearing about how their post helped someone.
Your supervisor is also probably friendly with a bunch of other DFIRers. Use the opportunity to make some new acquaintances.

HTCIA Thoughts

I recently spent 2 weeks with my supervisor before heading out to the HTCIA conference. It was the first time we had met in person since I started the internship but because we had already worked together for a while, it felt more like catching up with a friend.
During the first week, I got some hands-on experience imaging hard drives and cell phones (both iPhone/Android) for some practice cases. Having a remote internship meant that this was the first time I got to use this equipment which was kinda cool. I also practiced filling out Chain of Custodys and following various company examination procedures.
During the second week, I got to observe the business side of a private forensics company as we visited some new clients on site. I noticed that private forensics involves more than just technical skills and the ability to explain your analysis. A private forensics company also has to convince prospective clients that they can help and then regularly address any of the client's concerns. This increased level of social interaction was something that I hadn't really thought about previously. The concept of landing/keeping clients is probably the main difference between Law Enforcement and private practice.
As part of my supervisor's plan to improve their public speaking skills, they gave a presentation on Digital Forensics to a local computer user's group. After the main presentation, I talked for 10 minutes on cell phone forensics. Whilst it had been a while since I last talked in public, I was not as nervous as I'd thought I'd be. I think I found it easier because my supervisor gave great presentation and I could kinda base my delivery on theirs. I noticed that an effective presentation involves engaging the audience with questions (ie making them think), keeping a brisk pace and keeping the technical material at an audience appropriate level. The use of humour (eg anecdotes, pictures) can also help with pacing. Later, I would see these same characteristics during the better HTCIA labs.

HTCIA was held this year at the JW Marriott Hotel in Summerlin, Nevada. About a 20 min drive from the Las Vegas strip, you really needed a car otherwise you were kinda stuck at the hotel.
The labs/lectures started on Monday afternoon and ended on Wednesday afternoon.
The first couple of days allowed for plenty of face time with the vendors. Each vendor usually had something to give away. At the start, I *almost* felt guilty about taking the free stuff but towards the end it was more like "what else is up for grabs?" LOL. I probably did not maximise my swag but how many free pens/usb sticks/drink bottles can you really use?

Socially, I probably didn't mix as much as I could have. My supervisor and I spent a fair amount of time working on the new cases whenever we weren't attending labs/lectures. I still managed to meet a few people though and when I was feeling tired/shy I could just hang around my supervisor and just listen in/learn more about the industry. The good thing about forensic conferences is that most of the attendees are fellow geeks and so when there's a lull in the conversation, we can default to shop talk (eg What tools do you use? How did you get started in forensics?).

There were several labs that stood out to me. Listed in chronological order, they were:
Monday PM: Sumuri's "Mac Magic - Solving Cases with Apple Metadata" presented by Steve Whalen. This lab mentioned that Macs have extended metadata attributes which get lost when analysing from non HFS+ platforms. Hence, it's better to use a Mac to investigate another Mac. The lab also covered Spotlight indexing, importers and exiftool. As a novice Mac user, this was all good stuff to know. Steve has a witty and quick delivery but he also took the time and ensured that everyone could follow along with any demos.

Tuesday PM: SANS "Memory Forensics For The Win" presented by Alissa Torres ( @sibertor ). Alissa demonstrated Volatility 2.2 on SIFT using a known malware infected memory dump. She also gave out a DVD with SIFT and various malware infected memory captures. Alissa mentioned that the material was taken from a week long course so even with her energetic GO-GO-GO delivery, it was a lot to cover in 1.5 hours. The exercises got students to use Volatility to identify malicious DLLs/processes from a memory image, extract malicious DLLs for further analysis and also inspect an infected registry key. The handout also included the answers which made it easier to follow along/catch up if you fell behind. I had seen Alissa's SANS 360 presentation on Shellbags and Jesse Kornblum's SANS Webcast on Memory Forensics so I kinda had an inkling of what to expect. But there is just so much to know about how Windows works (eg which processes do what, how process data is stored in memory) that this HTCIA session could be compared to drinking from a fire hose. It would be interesting to see if the pace is a bit more easy going when Alissa teaches "SANS FOR526: Windows Memory Forensics In-Depth". However, I definitely think this session was worth attending - especially as I got a hug after introducing myself :) Or maybe I just need to get out of the basement more often LOL.

Wednesday AM: SANS "Mac Intrusion Lab" presented by Sarah Edwards ( @iamevltwin ). Sarah's talk was enthusiastic, well paced and well thought out - she would discuss the theory and then show corresponding example Macintosh malware artefacts. Sarah covered quite a bit in the 1.5 hours - how to check for badness in installed applications/extensions (drivers), autoruns, Internet history, Java, email, USB and log analysis. Interestingly, she also mentioned that Macs usually get hacked via a Java vulnerability/social engineering. It was good to meet Sarah in person and it also let me figure out the significance of her email address. It looks like her SANS 518 course on Mac and iOS forensics will be a real winner.

Overall, it was an awesome trip visiting my supervisor and a good first conference experience.  Hopefully, I can do it again real soon.
Please feel free to leave a comment about internships and/or the HTCIA conference below.

Friday, 23 August 2013

HTCIA Monkey

Just a quick post to let you know that this monkey (and friends) will be attending HTCIA 2013 from 8-11 Sept in Summerlin, Nevada.
 So if you're in the neighbourhood, please feel free to play spot the monkey and say hello. I promise I won't bite ... unless you try to touch my bananas (heh-heh).

Friday, 26 July 2013

Determining (phone) offset time fields

Let me preface this by saying this post is not exhaustive - it only details what I have been able to learn so far. There's bound to be other strategies/tips but a quick Google didn't return much (hence this post). Hopefully, both the post and accompanying script ( available from here) will help my fellow furry/not-so-furry forensicators determine possible timestamp fields.

In a recent case, we had a discovery document listing various SMS/MMS messages and their time as determined by both manual phone inspection and telecommunications provider logs.
However, whilst our  phone extraction tool was able to image the phone and display all of the files, it wasn't able to automagically parse the SMS/MMS databases. Consequently, we weren't immediately able to correlate what was in the discovery document with our phone image. Uh-oh ...

So what happens when you have an image of a phone but your super-duper phone tool can't automagically parse the SMS/MMS entries?
Sure, you might be able to run "strings" or "grep" and retrieve some message text but without the time information, it's probably of limited value.
Methinks it's time to strap on the caffeine helmet and fire up the Hex editor!


Time information is typically stored as an offset number of units (eg seconds) since a particular reference date/time. Knowing the reference date is half the battle. The other half is knowing how the date is stored. For example, does it use bit fields for day/month/year etc. or just a single Big or Little Endian integer representing the number of seconds since date X? Sanderson Forensics has an excellent summary of possible date/time formats here.

Because we have to start somewhere, we are going to assume that the date/time fields are represented by a 32 bit integer representing the number of seconds since date X. This is how the very popular Unix epoch format is stored. One would hope that the simplest methods would also be the most popular or that there would be some universal standard/consistency for phone times right? Right?!  
In the case mentioned previously, the reference dates actually varied depending on what database you were looking at. For example, timestamps in the MMS database file used Unix timestamps (offset from 1JAN1970) where as the SMS Inbox/Outbox databases used GPS time (offset from 6JAN1980). Nice huh?
Anyhow, what these dates had in common was that they both used a 4 byte integer to store the amount of seconds since their respective reference dates. If only we had a script that could take a target time and reference date and print out the (Big Endian/Little Endian) hex representations of the target time. Then we could look for these hex representations in our messages in order to figure out which data corresponds to our target time.

Where to begin?

Ideally, there will be a file devoted to each type of message (eg SMS inbox, SMS outbox, MMS). However, some phones use a single database file with multiple tables (eg SQLite) to store messages.
Either way, we should be able to use a Hex editor (eg WinHex) to view/search the data file(s) and try to determine the record structure.

Having a known date/time for a specific message will make things a LOT easier. For example, if someone allegedly sent a threatening SMS at a particular time and you have some keywords from that message, then using a hex editor you can search your file(s) for those keywords to find the corresponding SMS record(s). Even a rough timeframe (eg month/year) will help narrow the possible date/time values.
For illustrative purposes, let's say the following fictional SMS was allegedly sent on 27 April 2012 at 23:46:12 "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!".

OK assuming that we have searched and found a relevant text string and we know its purported target date, we now take a look at the byte values occurring before/after the message text.
Here's a fictional (and over simplified) example record ...

<44 F2 C5 3C> <ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"> <12 34 54 67> <ascii text="555-1234"> <89 12 67 89>

Note: I'm using the "< >" to group likely fields together and make things easier to read.

This is where our simple script ( comes in to play. Knowing the target date/date range, we can try our script with various reference dates and see if the script output matches/approximates a particular group of 4 bytes around our text string.
Here's an example of using the script:

sansforensics@SIFT-Workstation:~$ ./ -ref 1970-01-01T00:00:00 -target 2012-04-27T23:46:12

Running v2013.07.23

2012-04-27T23:46:12 is 1335570372 (decimal)
2012-04-27T23:46:12 is 0x4F9B2FC4 (BE hex)
2012-04-27T23:46:12 is 0xC42F9B4F (LE hex)


We're using the Unix epoch (1JAN1970 00:00:00) as reference date and our alleged target date of 27APR2012 23:46:12.
Our script tells us the number of seconds between the 2 dates is 1335570372 (decimal). Converted to a Big Endian hexadecimal value this is 0x4F9B2FC4. The corresponding Little Endian hexadecimal value is 0xC42F9B4F.
So now we scan the bytes around the message string for these hex values ...
Checking our search hit, we don't see any likely date/time field candidates.

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

OK now lets try our script with the GPS epoch (6JAN1980 00:00:00) as our reference date ...

sansforensics@SIFT-Workstation:~$ ./ -ref 1980-01-06T00:00:00 -target 2012-04-27T23:46:12

Running v2013.07.23

2012-04-27T23:46:12 is 1019605572 (decimal)
2012-04-27T23:46:12 is 0x3CC5F244 (BE hex)
2012-04-27T23:46:12 is 0x44F2C53C (LE hex)


Now our script tells us the number of seconds between the 2 dates is 1019605572 (decimal). Converted to a Big Endian hexadecimal value this is 0x3CC5F244 . The corresponding Little Endian hexadecimal value is 0x44F2C53C .
Returning to our message string hit, we scan for any of these hex values and ...

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

Aha! The 4 byte field immediately before the text string "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!" appears to match our script output for a LE GPS epoch! Fancy that! Almost like it was planned eh? ;)

So now we have a suspected date/time field location, we should look at other messages to confirm there's a date/time field occurring just before the message text. Pretty much rinse/repeat what we just did. I'll leave that to your twisted imagination.

If we didn't find that hex hit, we could keep trying different reference dates. There's a shedload of potential reference dates listed here but there's also the possibility that the source phone is not using a 4 byte integer to store the number of seconds since a reference date.
If you suspect the latter, you should probably check this out for other timestamp format possibilities.

OK so we've tried out our script on other messages and have confirmed that the date/time field immediately precedes the message text. What's next? Well my script monkey instincts tells me to create a script that can search a file for a text message, parse any metadata fields (eg date, read flag) and then print the results to a file for presentation/further processing (eg print to TSV and view in Excel). This would require a bit more hex diving to determine the metadata fields and message structure but the overall process would be the same ie start with known messages and try to determine which bytes correspond to what metadata. I'm not gonna hold your paw for that one - just trying to show you some further possibilities. In case you're still interested, Mari DeGrazia has written an excellent post on reverse engineering sms entries here.

Further notes on determining date/time fields

It is likely that there will be a several groups of bytes that consistently change between message entries. Sometimes (if you're lucky) these fields will consistently increase as you advance through the file (ie newer messages occur later in the file). So if you consistently see X bytes in front of/behind a text message and the value of those X bytes changes incrementally - it's possibly a date field or maybe its just an index.

An interesting observation for date/field offset integers is that as time increases, the least significant byte will change more rapidly than the most significant byte. So 0x3CC5F244 (BE hex) might be followed by 0x3CC5F288 (BE hex). Or 0x44F2C53C (LE hex) might be followed by 0x88F2C53C (LE hex). This can help us decide whether a date field is Big Endian or Little Endian and/or it might be used to determine suspected date/time fields.

Be aware not all time fields use the same epoch/are stored the same (even on the same phone).

I found that writing down the suspected schema helped me to later interpret any subsequent messages (in hex). For example:
<suspected 4 byte date field><SMS message text><unknown 4 bytes><ASCII string of phone number><unknown 4 bytes>
So when I starting looking at multiple messages, I didn't need to be Rain-man and remember all the offsets (eg "Oh that's right, there's 4 bytes between the phone number and last byte of the SMS message text"). In my experience, there are usually a lot more fields (10+) than shown in the simplified example above.

How the Script works

The script takes a reference date/time and a target date/time and then calculates the number of days/hours/minutes/seconds between the two (via the Date::Calc::Delta_DHMS function).
It then converts this result into seconds and prints the decimal/Big Endian hexadecimal/Little Endian hexadecimal values.
The Big Endian hexadecimal value can be printed via the printf "%x" argument (~line 90).
To calculate the Little Endian hexadecimal value we have to use the pack / unpack Perl functions. Basically we convert ("pack") our decimal number into a Big-endian unsigned 32 bit integer binary representation and then unconvert ("unpack") that binary representation as a Little-endian unsigned 32 bit integer (~line 92). This effectively byte swaps a Big-endian number into a Little endian number. It shouldn't make a difference if we pack BE and unpack LE or if we pack LE and then unpack BE. The important thing is the pack/unpack combination uses different "endian-ness" so the bytes get swapped/reversed.


This script has been tested on both the SANS SIFT VM (v2.14) and ActiveState Perl (v5.16.1) on Win7.

Selected epochs have been validated either using the DCode test data listed on or via known case data. I used selected dates given in the DCode table as my "target" arguments and then verified that my script output raw hex/decimal values that matched the table's example values.
The tested epochs were:
Unix (little/big endian ref 1JAN1970), HFS/HFS+ (little/big endian ref 1JAN1904), Apple Mac Absolute Time/OS X epoch (ref 1JAN2001) and GPS time (tested using case data ref 6JAN1980).

Note: Due to lack of test data, I have not been able to test the script with target dates which occur BEFORE the reference date. This is probably not an issue for most people but I thought I should mention it in case your subject succeeded in travelling back in time/reset their phone time.

Final Words

We've taken a brief look at how we can use a new script ( to determine one particular type of timestamp (integer seconds since a reference date).
While there are excellent free tools such as DCode by Digital Detective and various other websites that can take a raw integer/hex value and calculate a corresponding date, if your reference date is not catered for, you have to do it yourself. Additionally, what happens when you have a known date but no raw integer/hex values? How can we get a feel for what values could be timestamps?
With this script it is possible to enter in a target date and get a feel for what the corresponding integer/hex values should look like under many different reference dates (assuming they are stored in integer seconds).

If you have any other hints/suggestions for determining timestamp fields please leave a comment.

Wednesday, 17 July 2013

G is 4 cookie! (nomnomnom)

What is it?

A Linux/Unix based Perl script for parsing cached Google Analytic requests. Coded/tested on SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script ( can be downloaded from:

The script name is pronounced "G is for cookie". The name was inspired by this ...

Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie type information such as the visiting times, page title, website hostname, referral page, any search terms used to find website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari Degrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "" string and store any hit file offsets.
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytic arguments which are printed either to the command line or to a user specified Tab Separated Variable file.

The following Google Analytic arguments are currently parsed/printed:
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.

To Use It

You can type something like:
./gis4cookie -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable.
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmz's utmr variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in ""). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)