Thursday, 21 February 2013

Creating a Perl script to retrieve Android SMS


This script/post was inspired by Mari DeGrazia after she had to manually parse hundreds of Android SMS messages. Without her prior research and the principles she discusses in her post, there's little chance I would have attempted this script. Thanks for sharing Mari!
This post continues on from where Mari's post ended. We'll look further at an example Android SMS SQLite schema and then use it to explain how our SMS extraction script (sms-grep.pl) works. We will also walk you through how to use our script and what kind of output you can expect.

UPDATE 2014-01-23:
The code for sms-grep.pl has been revised/relocated to GitHub
It now handles Cell Header "Serial Type" values of 8 and 9 (ie Header values associated with 0 and 1).




UPDATE 2013-04-25:
Like a punch in the guts, some additional issues have arisen with the previous sms-grep.pl script (v2013-02-16).
Changes were made. Some bananas may have been thrown.


Issue #1
During further testing, we noticed that the initial version of sms-grep.pl was not reading some INTEGER fields correctly.
This has been corrected in the latest version 2013-04-14 with the addition of a new function "read_payload_int".
The previous script version tried to read payload INTEGER fields as VARINTs. This seems to work with positive integers less than 128 and so went unnoticed - as integer payload data in the "sms" table is typically limited to 1's or 0's. However, the "status" field can be -1 and "thread_id" can be greater than 127, so the correction was made.

Issue #2
The script searches for a known phone number ("address" field) and then tries to go backwards a set number of fields until it hits the cell header size. Previously, it treated the payload fields prior to the "address" field (ie "thread_id") as VARINTs (like the cell header fields). As mentioned previously, this should not prove troublesome if the "thread_id" field is a single byte between 0 and 127. However, if the "thread_id" is greater than 127 or uses multiple bytes, there may be issues with ascertaining the cell header size and hence parsing the sms cell. See also the sms-cell-example-revised.png pic shown below in the original post.

The new version of the script requires the "-p" argument which represents the number of bytes between the last cell header field (VARINT) and the phone number "address" field. For our example schema, using "-p 2" means there are 2 bytes being used for the "thread_id", which sits between the last cell header field and the "address" field.
This also means that to be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes. I decided to make it an argument rather than hard code it because the schema may change in the future. In practice, the "thread_id" will probably not exceed 2 bytes as the maximum "thread_id" value of 0xFFFF should be sufficiently large. If there are no bytes between the last cell header field and the phone number/search term field, you can use "-p 0".



Issue #3
As the rowid value is stored outside of the Cell Header and Cell Data sections, the script is currently unable to report the rowid value accurately. Typically, the Cell Header section will store a 0x0 value for the field that contains the rowid. Consequently, the script interprets the field value as 0.

Changes to Configuration File Format

As a result of the changes made for Issue #2, the configuration file no longer requires the PHONE type marker for the "address" field.
Instead, the "address" field can be declared as TEXT and the "-p" argument is used to define the relationship between the "address" field and the last cell header field. The example pic of the sample configuration file format has been edited accordingly.

To run it you now type something like:

sms-grep.pl -c config.txt -f mmssms.db -p 1 -s "5555551234" -o output.tsv
and 

sms-grep.pl -c config.txt -f mmssms.db -p 2 -s "5555551234" -o output.tsv

Testing
Limited testing with Android sms test data was again performed.
The script now seems to handle multiple byte payload integers correctly with the new configuration file format.
As always, users should validate this script for themselves before relying upon the returned results (this is my CYA = Cover Your Ass section). What worked for our test data may not work for yours ...

END UPDATE 2013-04-25 (Original post below also edited/revised)

Introduction

Android stores SMS records in the "sms" table of /data/data/com.android.providers.telephony/databases/mmssms.db. SQLite can also store backups of "sms" table data in the /data/data/com.android.providers.telephony/databases/mmssms.db-journal file (in case it needs to undo a transaction). Journal files are a potential forensic gold mine because they may contain previously deleted data which is no longer visible in the current database.
As far as I'm aware, there is currently no freely available way to easily view/print the sms contents of mmssms.db-journal files.
And while you can query the mmssms.db database directly via SQLite, this will not return any older (deleted) sms entries from database pages which have since been re-purposed.
Our sms-grep.pl script seems to work well with mmssms.db and mmssms.db-journal files and also with unallocated space (although this is limited by input file size and available hardware).
Additionally, our script will interpret date fields and print them in a human readable format so no more hours spent manually checking/converting timestamps!
Our script is also configurable - so you should be able to use it to look at multiple Android SMS SQLite schemas without having to modify the underlying code.

But before we dive into the script - it's probably a good idea to learn about how SQLite stores data ...

The SQLite Basics

The SQLite database file format is described in detail in Richard Drinkwater's blog posts here  and here.
There's also some extra information at the official SQLite webpage.

OK, now for the lazy monkeys who couldn't be bothered reading those links ...
The basic summary is that all SQLite databases have a main header section, followed by a bunch of fixed size storage pages.
There are several different types of page but each page is the same size as declared in the header.
One type of page is the "table B-Tree" type which has a 0xD byte flag marker. This type of page is used to store field data from the whole database (ie data from all of the tables) in units called "cells". Consequently, this page type will be used to store "sms" table data and because the page format is common to both mmssms.db and mmssms.db-journal files - our carving job is potentially much simpler.
Pages can also be deleted/re-allocated for another type of page so we must also be vigilant about non-"table B-tree" pages having free space which contains old "table B-tree" cell data. Think of it like file slack except for a database.

A 0xD type (ie "table B-tree") page will look like:





Generic Layout of a 0xD page

We can see the 0xD byte is followed by:
- 2 bytes containing the 1st free cell offset (0 if full)
- 2 bytes containing the number of used data cells in page
- 2 bytes containing the 1st used cell offset
- 1 byte fragmentation marker

Then depending on the number of used data cells, there will be a series of 2 byte offsets which point to each used data cell (see the green section in the pic). The cell pointer list starts with the closest data cell first and ends with the "1st used offset" data cell. Each used data cell should correspond to a row entry in a table of the database (eg an "sms" row entry).
Following those cell pointers (green), will be any free/unallocated space (blue) followed by the actual data cells (purple). The blue area is where we might see older previously deleted "table B-tree" data.
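To make that more concrete, here's a minimal Perl sketch (assuming the raw page bytes have already been read into $page - variable names are illustrative only) of how those header values and the cell pointer list could be unpacked:

# sketch only - field layout as per the 0xD page description above
my ($flag, $first_free_offset, $num_cells, $first_cell_offset, $frag_marker) =
    unpack("C n n n C", $page);   # "C" = 1 byte, "n" = 2 byte big endian value
# the cell pointer list (2 bytes per used cell) follows this 8 byte page header
my @cell_pointers = unpack("n$num_cells", substr($page, 8));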

Breaking it down further, the general form of a data cell (from the purple section) looks like:

Generic Layout of a Cell

We can see there's a:
- Cell Size (which is the size of the cell header section + cell data section)
- Rowid (ie Primary Key of the row)
- Cell Header section (comprised of a "Cell Header Size" field + a bunch of fields used to describe each type/size of field data)
- Cell Data section (comprised of a bunch of fields containing the actual data)

You might have noticed an unfamiliar term called a "varint".
Varints are a type of encoded data and are used to save space. They can be 1 to 9 bytes and require a bit of decoding.
Basically, you read the most significant byte (data is stored big endian) and if its most significant bit is set to 1, it means there's another byte to follow/include. Then there's a bunch of dropping most significant bits and concatenating the leftovers into a single binary value.
Richard Drinkwater's got a better explanation (with example) here.
Later for our script, we will need to write a function to read these varints but for now, just know that a varint can store anywhere from 1-9 bytes (usually 1-2 bytes though) and it requires some decoding to arrive at the "original value".
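Just to make the idea concrete, here's a simplified Perl sketch of a varint reader (illustrative only - sms-grep.pl has its own version, and this one ignores negative/two's complement values). It returns the decoded value plus the number of bytes the varint used:

# sketch only - reads a varint starting at offset $pos within $data
sub read_varint
{
    my ($data, $pos) = @_;
    my $value = 0;
    for my $i (0 .. 8)
    {
        my $byte = unpack("C", substr($data, $pos + $i, 1));
        if ($i == 8)
        {
            return (($value << 8) | $byte, 9);   # 9th byte contributes all 8 bits
        }
        $value = ($value << 7) | ($byte & 0x7F);   # keep the lower 7 bits
        return ($value, $i + 1) unless ($byte & 0x80);   # MSB clear = last byte
    }
}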

So for our Android SMS scenario, a typical used "sms" data cell might look like:

Android SMS Cell example

You'll notice that there's a "Cell Header" section highlighted in purple and a "Cell Data" section highlighted in pink.
Think of the Cell Header section as a template that tells us how many bytes to expect for each field in the Cell Data section. The Cell Data section does not use varints to store data.
From the sms-cell-example-revised pic, we can see that most of the Cell Header field types are 0x01 - which means those fields use one byte of data in the subsequent cell data section (pink). Also please note the potential for multi-byte varints for the "thread_id" data field circled in red.
The official SQLite documentation refers to these cell header field type values as "Serial Type Codes" and there's a comprehensive definition table about halfway down the page here.

For our sms example, we can see from the purple section that the sms "Read" and "Type" fields will use 1 byte each to store their data in the Cell Data (pink) section. Looking at the pink section confirms this - the "Read" field value is 0 (0 for unread, 1 for read) and the "Type" field is 1 (1 for received, 2 for sent).
As a rule, if the value of the cell header field type (purple section) is between 0x0 and 0x4, the corresponding data field (pink) will use that many bytes (eg 0x1 means 1 byte data field, 0x4 means 4 bytes)
If the value of a cell header field (purple section) is 0x5 (eg "Date" & "Date_sent" fields), it will take 6 bytes in the cell data (pink) section. The "Date" and "Date_sent" data fields are 6 byte Big Endian values which (for Android) contain the number of milliseconds since the Unix epoch (1 Jan 1970).
There's a special case for handling strings. Firstly, the cell header field type value must be odd and greater than or equal to 13. Then to calculate the number of bytes required in the data section we use this formula:

Number of bytes in string = (cell header field type value - 13)/2.

So in our sms-cell-example-revised pic, the corresponding string size for the "Address" field is (0x21 - 0xD) / 0x2 = (33 - 13) / 2 = 10 bytes. I haven't actually shown a value for the "Address" in the pink section so just use your imagination!
Similarly, we can see that the "Body" field will take (0x23 - 0xD) / 0x2 = (35 - 13) / 2 = 11 bytes.
Note: For long sms, the varint containing the "body" header field type has been observed to require 2 bytes.
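If it helps, here's a small Perl sketch of that size calculation covering just the Serial Type values discussed in this post (the official table at sqlite.org covers a few more cases):

# sketch only - returns the number of cell data bytes used for a given
# cell header "Serial Type" value (only the cases discussed in this post)
sub serial_type_to_bytes
{
    my $type = shift;
    return $type if ($type >= 0 and $type <= 4);   # 0 = NULL (no data bytes), 1-4 = that many bytes
    return 6 if ($type == 5);                      # eg the "date"/"date_sent" fields
    return 0 if ($type == 8 or $type == 9);        # constants 0 and 1 (no data bytes)
    return ($type - 13) / 2 if ($type >= 13 and $type % 2);   # string
    return undef;                                  # anything else is not handled by this sketch
}

For example, serial_type_to_bytes(0x21) returns (33 - 13)/2 = 10 bytes - matching the "Address" calculation above.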

You might also have noticed that not all of the cell header fields declared in the cell header section (purple) have a matching entry in the cell data section (pink). This is because if a cell header field is marked as NULL (ie 0x00), it does not get recorded in the cell data section (eg the purple "Rowid" header field's 0x00 value means there won't be a corresponding data field in the pink section).
So if we want to retrieve data, we can't go strictly off the schema - we have to pay attention to the cell header section so we can interpret the cell data fields correctly.

So how do/did we know what cell data field was what?
It turns out that SQLite ensures that the order of cell header fields in the cell header (purple section) is the same order as the database schema field order. Consequently, the cell data section (pink) will also appear in schema order (notwithstanding any missing null fields).
We can get the schema of a database file using the sqlite3 command line exe like this:

sansforensics@SIFT-Workstation:~$ sqlite3 mmssms.db
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .headers on
sqlite> pragma table_info(sms);
cid|name|type|notnull|dflt_value|pk
0|_id|INTEGER|0||1
1|thread_id|INTEGER|0||0
2|address|TEXT|0||0
3|person|INTEGER|0||0
4|date|INTEGER|0||0
5|protocol|INTEGER|0||0
6|read|INTEGER|0|0|0
7|status|INTEGER|0|-1|0
8|type|INTEGER|0||0
9|reply_path_present|INTEGER|0||0
10|subject|TEXT|0||0
11|body|TEXT|0||0
12|service_center|TEXT|0||0
13|locked|INTEGER|0|0|0
14|error_code|INTEGER|0|0|0
15|seen|INTEGER|0|0|0
16|date_sent|INTEGER|0|0|0
sqlite>


So we can see that the "sms" table consists of 17 fields with the first being the "_id" (ie rowid) primary key and the last being the "date_sent" field. In practice, the "_id" is typically unused/set to NULL as it is just duplicating the Cell's Rowid varint (from the white section). Some fields are declared as INTEGERS and others TEXT. Notice how the "date" and "date_sent" are declared as INTEGERS? These represent milliseconds since the Unix epoch.
At this stage, I'm not 100% certain on every field's meaning. We know the "address" field is used to store phone numbers and the "body" field stores the sms text string. From Mari's research we also know that "read" is 1 for a read sms, 0 otherwise and "type" indicates sent (2) or received sms (1). That should suffice for now.

So that's the basics of the data structures we will be looking at. In the next section, we'll share some thoughts on the script. Oh goody!

The Script

At first I thought we could find each 0xD page and iterate through the data cells that way but this would miss any old sms messages contained in pages which have since been re-purposed by SQLite. That method would also miss any corrupted/partial pages containing sms in unallocated space.
So to find the sms messages, we are going to have to come up with a way of detecting/printing individual sms data cells.

The strategy we ended up using was based on the "address" field (ie the phone number).

1. We read in our schema and print flags from a configuration file.

2. We create one big string from the nominated input file.
Perl has a handy function called "index" that lets you find out if a given string is contained in a larger string. We use this "index" function to find all phone number matches and their respective file offsets (see the sketch after this list).

3. For each match's file offset, we then work backwards and try to find the cell header size field (ie the start of cell header).
Looking at the sms-cell-example-revised pic, we can see that there are 17 (purple) varint fields plus the "thread_id" (pink) data field between the "Address" cell value (in pink section) and the cell header length/size field (in purple section). The number of varint fields should be constant for a given schema but it is possible for the number of bytes required for each varint to change (eg the "thread_id" data field is typically 1-2 bytes).

4. Now that we've determined the cell header size file offset, we can read in the header field type varints (ie find out how many bytes each field requires/uses in the cell data section) and also read in/store the actual data.

5. We then repeat steps 3 and 4 until we have processed all our search hits.

6. We can then sort the data in chronological order before printing to screen/file.
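To illustrate step 2, the search loop might look something like this (a sketch only - $filedata and $searchterm are assumed to already hold the file contents and the phone number respectively):

# sketch only - record the file offset of every search hit
my @hits;
my $offset = index($filedata, $searchterm, 0);
while ($offset != -1)
{
    push(@hits, $offset);                                   # remember this hit's file offset
    $offset = index($filedata, $searchterm, $offset + 1);   # look for the next hit
}

Each offset in @hits then gets handed to the backwards-walking logic described in step 3.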

The main sanity check of this process is checking the cell header size value range. Remember, the cell header size value should tell us the number of bytes required for the entire cell header (including itself). So for our example schema above, this value should be:
- above the 18 byte minimum (ie number of schema fields plus the size of the cell header length = 17 + 1) and
- below a certain threshold (18+5 at this time).
Most "sms" cell header sizes should be 18 bytes (most of the fields are one byte flags) but for longer "body" fields or large "thread_id" field values, multi-byte varints have been observed which would obviously increase number of bytes required for that cell header. Allowing for an extra 5 bytes seemed like a good start for now.

For more information on how the script works (eg see how painful it is to read a multi-byte varint!) you can read the comments in the code. I dare you ;)

Making it Schema Configurable
As Mari has noted, not every Android phone will have the same schema. So instead of having a different script for each schema, we'll be utilising a configuration text file. Think of the configuration file as a kind of plugin for the script. Each phone schema will have its own configuration file. This file will tell the script:
- what the schema fields are and more importantly, their order,
- which fields are DATES or STRINGS or INTEGERS and
- whether we want to print this field's values


For possible future needs, we have also declared a "c4n6mtype=android" field. This is in case we need to read an iPhone schema sometime in the future (iPhones use seconds since UTC for their DATE fields).

Here's an example of a configuration file (also provided from my GoogleCode Download page as "sms-grep-sample-config.txt"):

Sample Android Configuration File

Notice that it's very similar to the schema we got earlier from sqlite3?
The most significant differences are:

- "date" field (which is now listed as a DATE type)
- "date_sent" field (which is now listed as a DATE type)
- the configuration file uses ":" as a field separator (sqlite3 uses "|")
- the print flags (1 prints the field value, 0 does not print)

The script will ignore any blank lines and lines starting with "#".

Update 2013-04-25: The "address" field is now declared as TEXT (previously declared as PHONE).
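To give a rough text rendition of the sample config pic above (the exact layout/field ordering shown here is indicative only - confirm it against the provided "sms-grep-sample-config.txt"):

# hypothetical sketch only - check the provided sample config for the real layout
c4n6mtype=android
_id:INTEGER:0
thread_id:INTEGER:0
address:TEXT:1
person:INTEGER:0
date:DATE:1
protocol:INTEGER:0
read:INTEGER:1
status:INTEGER:0
type:INTEGER:1
reply_path_present:INTEGER:0
subject:TEXT:1
body:TEXT:1
service_center:TEXT:0
locked:INTEGER:0
error_code:INTEGER:0
seen:INTEGER:1
date_sent:DATE:1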

Running the Script
The first step would be to determine the schema. The easiest way to do this is to use the sqlite3 client with the mmssms.db as previously shown. Admittedly, this requires access to a database file so if you don't have a sample to work with, you're out of luck.
Next it's time to create the configuration file - making sure to mark any DATE fields (and, as per the 2013-04-25 update, declaring the "address" field as TEXT rather than PHONE). Also remember to specify which fields you wish to print.
Once that is done, we can run the script using something like:

sms-grep.pl -c config.txt -f mmssms.db -s "5555551234" -s "(555) 555-1234" -p 1  -o output.tsv

Note: Users can specify multiple phone numbers/formats to search for using -s arguments. At least one -s argument is required.
If no -o argument is specified, the results will be printed to the screen in Tab separated columns - which can get messy with lots of messages. Alternatively, an output Tab separated file (TSV) can be generated (eg using -o output.tsv).

Update 2013-04-25: To be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes.

Any extracted hits will be printed in chronological order based upon the first DATE type schema field declared in the configuration file (eg "date" field for our example configuration file). You will probably see multiple entries for the same SMS which was stored at different file offsets. The date sorting makes this situation easier to detect/filter.

Here's a fictional TSV output example based on the previously shown config file:

Fictional sample TSV output from sms-grep.pl

The arrows in the pic are used by Notepad++ to indicate TABs. We can see that only the print fields marked with a 1 in the configuration file (ie address, date, read, type, subject, body, seen, date_sent) are printed along with the file offset in hex.

Note: If you run the script on a partially complete cell (eg the cell header is truncated by the end of file so there's no corresponding cell data), the script will print out "TRUNCATED" for any strings and -999 for any integer fields. If you see these values, further manual parsing/inspection is recommended.

Testing
Limited testing of the script (initial version) has been performed with:
- 2 separate Android schemas
- Unallocated space (as retrieved by Cellebrite and exported into a new 600 MB file using X-Ways/WinHex)
- A Raw Cellebrite .bin file (1 GB size)

Note: The script failed to run with a 16 GB .bin file - we suspect this is due to a RAM deficiency in our test PC.

As I don't have an Android phone, I've relied pretty heavily on Mari for testing. We've tested it using ActiveState Perl v5.16 on 64 bit Windows 7 PCs. It should also work on *nix distributions with Perl. I have run it successfully on SIFT v2.14.

Additionally, an outputted TSV file has also been successfully imported into MS Excel for subsequent analysis.

Update 2013-04-25: Subsequent re-testing of the new version (2013-04-14) was limited to 2 different schema mmssms.db files in allocated space only. I reckon if it worked for these, it should still work for the other previous test cases.

Validation tips:
grep -oba TERM FILE
Can be used in Linux/SIFT to print the list of search hit file offsets (in decimal).
For example: grep -oba "5555551234" mmssms.db
Additionally, WinHex can be used to search for phone number strings and will also list the location of any hits in a clickable index table.  The analyst can then easily compare/check the output of sms-grep.pl.

What it will do
Find sms records to/from given phone numbers with valid formatted cell headers. This includes from both the SQLite database files (ie mmssms.db) and backup journal files (ie mmssms.db-journal). It should also find any existing sms records (with valid headers) that appear in pages which have been since re-allocated by SQLite for new data. Finally, it should also be able to find SMS from unallocated space (assuming the size isn't larger than your hardware can handle).

What it doesn't do very well
If the cell header is corrupted/missing, the script will not detect sms data.

The script does some range checking on the cell header size and prints a warning message if required. However, it is possible to get a false positive (eg the phone number string is found and there's a valid cell header size value at the expected cell header size field). This error should be obvious from the output (eg the "body" field has nonsensical values). The analyst is encouraged to view any such errors in a Hex viewer to confirm the misinterpreted data.

Unallocated space has proved troublesome due to size limitations. The code reads the input file into one big string for searching purposes, so running out of memory is a possibility when running on large input data such as unallocated. However, if you break up the unallocated space into smaller chunks (eg 1 GB), the script should work OK. I have also used WinHex to copy SMS data out from unallocated and paste it into a separate smaller file. This smaller file was then parsed correctly by our script.
The big string approach seemed like the quickest/easiest way at the time. I was targeting the actual database files/database journals rather than unallocated. So consider unallocated a freebie ;)

We have seen some SQLite sms records from an iPhone 4S which does NOT include the phone numbers. There may be another field we can use instead of phone numbers (perhaps we can use a phone book id?). This requires further investigation/testing.

Final words

As always, you should validate any results you get from your tools (including this one!).

This script was originally created for the purposes of saving time (ie reducing the amount of time spent manually parsing sms entries). However, along the way we also learnt more about how SQLite stores data and how we can actually retrieve data that even SQLite doesn't know it still has (eg re-purposed pages).

The past few weeks have flown by. I originally thought we were creating a one-off script for Android but due to the amount of different schemas available, we ended up with something more configurable. This flexibility should also make it easier to adapt this script for future SQLite carving use (provided we know the schema). It doesn't have to be limited to phones!

However, this scripting method relies on providing a search term and knowing what schema field that term will occur in. There's no "magic number" that marks each data cell, so if you don't know/cannot provide that keyword and schema, you are out of luck.

I would be interested to hear any comments/suggestions. It can probably be improved upon but at some point you have to stop tinkering so others can use it and hopefully suggest improvements. If you do have problems with it please let me know. Just a heads up - for serious issues, I probably won't be able to help you unless you supply some test data and the schema. 
To any SANS Instructors reading, any chance we can get a shoutout for this in SANS FOR 563? Monkey needs to build some street cred ... ;)

Thursday, 3 January 2013

Dude, Where's My Banana? Retrieving data from an iPhone voicemail database


This is a complementary post to Mari DeGrazia's post here about what to do when your tools don't quite cut the mustard. In today's post, I'll show how we can write a Perl script to retrieve the contents of an iPhone's voicemail database and then display those contents in a nice HTML table.

The first thing I *should* have done was Google it and see if anyone had written a similar script ... D'Oh!
But due to my keenness, I dived right in. Using iPhone and IOS Forensics by Hoog and Strzempka (2011) plus some previous code I'd written, it took me a couple of days (at a leisurely end-of-year pace) to write this script.

Soon after I wrote this script, I learned that John Lehr had already written a bunch of similar iPhone scripts in Python in 2011. So while it looks like this monkey was a little late to the party, I still had fun learning and creating something.
You can view John's iPhone Voicemail script here.

My Python skills are pretty limited but it looks like my script is very similar to John's (except for the HTML generation part). So I guess that's comforting - ie I didn't miss out on some obscure Apple incantation to Lord Jobs (I'm joking OK? Please don't sue me LOL).

Writing the script

First we use the DBI Perl package to read "voicemail.db". Next, we use the HTML::QuickTable package to print out the HTML table.
We've used both of these packages before (see exif2map.pl and squirrelgripper.pl posts), so it should be pretty straight-forward. Not being able to think of a clever and punny name, I'm just calling this script "vmail-db-2-html.pl". Catchy huh?

You can download the script from here. I'll spare you the agony of a line-by-line commentary and just delve into the most interesting parts.

So this is what the voicemail.db schema looks like (via the sqlite command line interface):
sqlite> .schema
CREATE TABLE _SqliteDatabaseProperties (key TEXT, value TEXT, UNIQUE(key));
CREATE TABLE voicemail (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, remote_uid INTEGER, date INTEGER, token TEXT, sender TEXT, callback_num TEXT, duration INTEGER, expiration INTEGER, trashed_date INTEGER, flags INTEGER);

CREATE INDEX date_index on voicemail(date);
CREATE INDEX remote_uid_index on voicemail(remote_uid);


Using iPhone and IOS Forensics by Hoog and Strzempka (2011) pp. 193, 194 - the important bits (for us anyway) are located in the "voicemail" table. These are the:
ROWID =  Unique index number for each voicemail entry. Each entry's voicemail file uses the format "ROWID.amr" for the voicemail's filename. ROWID increments by 1 so if voicemails are deleted there will be discrepancies between the ROWID numbers and the current number of voicemail entries.
date = Date and time relative to the Unix epoch (ie seconds since 1 Jan 1970).
sender = Phone number of person who left the voicemail. Can be "null" presumably if the number is withheld.
duration = Duration of voicemail in seconds.
trashed_date = Time when the user placed the voicemail in the "Deleted" folder or "0" if not deleted. This field is a Mac "CF Absolute Time" = number of seconds since 1 JAN 2001 (Thanks to Mari for pointing this out!). Consequently, we have to add 978307200 to our "trashed_date" before we can use it with any Unix epoch date functions (eg "gmtime"). Note: 978307200 is the number of seconds between 1 JAN 1970 and 1 JAN 2001.
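As a rough illustration of that "trashed_date" handling (a sketch only - the script's actual function is called "printCFTime" and may differ in detail):

# sketch only - convert a Mac "CF Absolute Time" value to a human readable UTC string
sub cftime_to_string
{
    my $cftime = shift;
    return "0" if ($cftime == 0);          # 0 = voicemail not deleted
    my @t = gmtime($cftime + 978307200);   # shift to the Unix epoch first
    return sprintf("%04d-%02d-%02d %02d:%02d:%02d",
                   $t[5] + 1900, $t[4] + 1, $t[3], $t[2], $t[1], $t[0]);
}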

Once we know the schema we can formulate our SQLite query (see line 74's "$db->prepare" argument):
"SELECT rowid as Rowid, sender as Sender, datetime(date, 'unixepoch') AS Date, duration as 'Duration (secs)', rowid as Filename, trashed_date as 'Deleted Date' from voicemail ORDER BY rowid ASC"

We're using the SQLite "as" functionality to create pretty alias names for the table headings. We're also using the SQLite "datetime" function to convert the Unix epoch "date" field into a YYYY-MM-DD HH:MM:SS string. The "trashed_date" will be handled later via the script's "printCFTime" function. For the moment, we will just retrieve the raw Mac "CF Absolute time" value.
The query results will be returned in order of ascending "rowid" and subsequently processed via the "PrintResults" function. 
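If you haven't used Perl's DBI before, the query/retrieval part boils down to something like this sketch (illustrative only - the connect string shown is the standard DBD::SQLite form and the variable names are not necessarily those used in the script):

# sketch only - connect, run the query shown above and loop through the results
use DBI;
my $db = DBI->connect("dbi:SQLite:dbname=voicemail.db", "", "");
my $sth = $db->prepare("SELECT rowid as Rowid, sender as Sender, datetime(date, 'unixepoch') AS Date, duration as 'Duration (secs)', rowid as Filename, trashed_date as 'Deleted Date' from voicemail ORDER BY rowid ASC");
$sth->execute();
while (my @row = $sth->fetchrow_array)
{
    # @row = (rowid, sender, date string, duration, rowid again, raw trashed_date)
    # ... this is roughly where the script's "PrintResults" processing kicks in
}
$sth->finish();
$db->disconnect();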

Once we have the results from the database, we then store them in a variable (imaginatively) called "results_hash".
The "results_hash" variable is set from within the "PrintResults" function and involves some mucking around to get the required table fields (eg human readable trash date, HTML link to .amr files). Essentially, each entry of the "results_hash" has a key (the rowid) and an associated array of values (eg From, Date, Duration, Filename, Deleted Date).
Once we've got the "results_hash" all set up, we can then call HTML::QuickTable's "render" function to do the actual HTML table generation and then add in some of our own text for the number of rows returned.
The resultant HTML file will be called "vmail-db-2-html-output-X.html" where X represents a timestamp of the number of seconds since 1 Jan 1970.
Note: Due to how the HTML::QuickTable renders hashes, the HTML table "rowid" entries are printed in textual rowid order (eg 1, 10, 2, 3).
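Putting those pieces together, the rendering step boils down to something like this sketch (the HTML::QuickTable options shown are assumptions - check the script for the actual settings; %results_hash is as described above):

# sketch only - render %results_hash as an HTML table
use HTML::QuickTable;
my $qt = HTML::QuickTable->new(border => 1, labels => 1);
my $outfile = "vmail-db-2-html-output-" . time . ".html";   # time = secs since 1 Jan 1970
open(my $html, ">", $outfile) or die "Can't create $outfile : $!";
print $html $qt->render(\%results_hash);   # hash keys sort as text, hence 1, 10, 2, 3 ...
close($html);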

Running the script

I tested the script on SIFT v2.14 with Perl v5.10 and also on Win 7 Pro-64 with ActiveState Perl v5.16.1.
Here are the Perl package dependencies:
DBI
HTML::QuickTable
Getopt::Long
File::Spec


If you run the script and it doesn't work, it's probably complaining that it can't find one of those packages.
To install a package X on SIFT you can use:
"sudo cpan X"
eg1 "sudo cpan HTML::QuickTable"
eg2 "sudo cpan DBI"
The 2 examples shown above will probably be the most likely culprits.
Also, after downloading the script on SIFT, you should ensure that it is executable by typing something like:
"sudo chmod a+x vmail-db-2-html.pl"

If you're using ActiveState Perl, just use the Perl Package Manager to install the relevant packages.

And here's the help text - if I've written it right, it should be all you need (Ha!)

sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -h
vmail-db-2-html.pl v2012.12.28

Perl script to conjure up an HTML table from the contents of an iPhone's voicemail.db SQLite database.

Usage: vmail-db-2-html.pl [-h|help] [-db database] [-f folder]
-h|help ........ Help (print this information). Does not run anything else.
-db database ... SQLite database to extract voicemail data from.
-f folder ...... Optional foldername containing the .amr files for linking. If not specified,
the script assumes the .amr files are in the current directory.

Example: vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

The script will extract the voicemail data from voicemail.db and then
write HTML links to the relevant .amr using the nominated directory (eg "heavy-breather/vmails/1.amr")
The .amr files must be copied to the nominated directory before the link(s) will work.


Script Output

The script was tested using data from an iPhone 4S running iOS 6. Unfortunately, I cannot show you any actual case output and I also do not have any iPhone data of my own - so here's some fictional output just so you can see how purdy everything is ...

Example of command line output:
sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

Now Retrieving Voicemail data ...

Rowid | Sender | Date | Duration (secs) | Filename | Deleted Date
1 | +12005551234 | 2013-01-01 00:00:01 | 25 | 1.amr | 2013-01-01 12:00:01
2 | +12005552468 | 2013-01-01 01:00:01 | 10 | 2.amr | 0
3 | +12005551357 | 2013-01-01 02:00:01 | 28 | 3.amr | 0
4 | +12005551123 | 2013-01-01 03:00:01 | 30 | 4.amr | 0
5 | +12005554321 | 2013-01-01 04:00:01 | 19 | 5.amr | 0
6 | +12005558642 | 2013-01-01 05:00:01 | 17 | 6.amr | 0
7 | +12005557531 | 2013-01-01 06:00:01 | 26 | 7.amr | 0
8 | +12005551234 | 2013-01-01 07:00:01 | 51 | 8.amr | 0
9 |  | 2013-01-01 08:00:01 | 41 | 9.amr | 2013-01-01 12:01:01
10 | +12005551234 | 2013-01-01 10:00:01 | 15 | 10.amr | 0

10 Rows returned

Please refer to "vmail-db-2-html-output-1357011655.html" for a clickable link output table

sansforensics@SIFT-Workstation:~$


Note1: Rows are printed in numerical rowid order for the command line output.
Note2: Null value for rowid 9 is left as a blank.

Here's the corresponding HTML generated file output example:




Note1: Rows are printed in textual rowid order for the HTML table (due to how the HTML::QuickTable renders)
Note2: Null values (eg for rowid 9) are displayed as a "-".
Note3: The HTML link to Filename will assume the user has copied the .amr files into the user specified folder (eg heavy-breather/vmails/1.amr). If no folder argument is given, the script will assume the .amr files are in the current local directory and link accordingly (eg 1.amr).

Final Thoughts

Mari's "Swiss Army Knife A $$$$$" tool did not process iPhone voicemail deleted dates or indicate if the voicemails were deleted. By writing this Perl script we were able to obtain this extra information that otherwise may have been missed.

By writing this script I also feel like I:
- Helped a friend and by sharing the solution, potentially helped other DFIRers.
- Improved my knowledge of iPhone voicemail. I had skim read iPhone and IOS Forensics by Hoog and Strzempka about 6 months ago but writing this script provided some much needed reinforcement. Additionally, I also learned how to handle yet another time format - the Mac "CF Absolute Time".
- Exercised my Perl coding skills. Like any language, skills atrophy if you don't use them regularly. This exercise also showed me the benefit of building up your own code library - I was able to "cut and paste" parts of my previous scripts into this new script thus saving time.

I'm not really bothered that I re-invented the wheel for this script. While John Lehr's script already provides the trashed date information - if I hadn't tried writing this, I would have missed out on a great learning opportunity.
I think in my case, "learning by doing" sticks in my brain better than learning exclusively via reading someone else's work. "Having a go" at something doesn't mean it has to be original or even successful so long as you are able to learn something from it. Sharing what you've learnt/helping others is just an added bonus.

Finally, one helpful tool for converting different date formats is the free "DCode" Windows exe from www.digital-detective.co.uk.
I used this tool to verify my script's arithmetic in converting "CF Absolute time" to a human readable time but it will also do a bunch of other conversions.

So that's about it for my first post of 2013. Any comments/suggestions are welcome.

Friday, 14 December 2012

Cheeky Season's Greetings


Today I thought I would do a brain-dump of some things/tips I've done/encountered since starting my internship about 6 weeks ago.
Hopefully some of it will be useful to others but at the very least it will help ensure I don't forget stuff (an insidious affliction affecting 2 out of 3 older monkeys).
But before I get to that, I'd like to thank everyone who has helped me learn more about DFIR over the past year or so. This includes the generous DFIRers that have responded to my emails/Tweets or shared their knowledge (via blog, book, presentation or those unlucky enough to have a socially awkward meeting with me). Hopefully, I have also thanked you individually in an email/Tweet at some point. Mother monkey did not raise an ungrateful bastard - so if I have forgotten to thank you I apologise.
When I started this blog,  I was hoping it might help me connect with some other newbies. But what I've since realised is that as DFIR constantly changes,  everyone is a potential newbie and having the newbie curiosity is a pre-requisite. Perhaps that is why DFIRers seem so helpful/approachable?
Originally, it was also more of an excuse to draw crude (in every sense of the word) cartoons. But somehow it ended up becoming more technical - hopefully the picture above helps re-balance the universe somewhat.
Anyway, enough of the touchy-feely stuff - on to the forensics!

Bing Bar Parser

Based heavily on the work of others (*cough rip off *cough), I wrote a "bing-bar-parser.pl" Perl script to parse Bing Bar history files.
The Bing Bar is Microsoft's attempt at combining search, Hotmail and other stuff into an IE toolbar. I can't say how popular it is in the age of Google but just in case, here's a tool. This script is intended to provide a Perl based method to view a user's Bing Bar search terms as stored in "searchhs.dat". By writing it in Perl I hoped to both improve my Perl programming and provide a cross-platform tool for fellow SIFT/Linux users.

Specifically, the script was based on the following sources:
- Mari DeGrazia's comprehensive blog post on interpreting the Bing Bar's "searchhs.dat" file.
- The "processBingToolbar" function from the "sep-history-viewer" C# Google code project by "ForensicGeek InTheCorner"
- The SystemTime 128 bit data structure as defined by Microsoft.
- Mark Stosberg's blog on Percent-encoding URIs in Perl.

Note: According to the "sep-history-viewer" page, the Bing Bar's "searchhs.dat" file is typically located for XP in:
\Documents and Settings\<user>\Local Settings\Application Data\Microsoft\Search Enhancement Pack\Search Box Extension

For Vista / Win7 it is located in:
\Users\<user>\AppData\LocalLow\Microsoft\Search Enhancement Pack\Search Box Extension

Also note that for a Win7 system, an additional searchhs.dat file (with URI encoding) has been found in:
\Users\<user>\AppData\Local\Microsoft\BingBar\Apps\Search_6f21d9007fa34bc78d94309126de58f5\VersionIndependent

You can use the "bing-bar-parser.pl" script's -d option to decode the URI encoding to something more readable (eg convert %20 to a space character). For the help hints, type "bing-bar-parser.pl -h".
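For the curious, the decoding itself is just standard percent-decoding - along the lines of this Perl sketch (illustrative only; the script itself may differ - see Mark Stosberg's post referenced above for more detail):

# sketch only - decode %XX escapes, eg %20 becomes a space
my $term = "monkeys%20on%20typewriters";
$term =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
print "$term\n";   # prints "monkeys on typewriters"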

Here's a fictional example script output (fields are pipe separated):

.\bing-bar-parser.pl -f /cases/searchhs.dat -d

Running bing-bar-parser.pl v2012-11-10

File Header field check (should be "FACADE0"): FACADE0
Bing Bar File Version: 1
Record ID for next new record: 341
Number of Records in this file: 335

Record id: 1 | Search Count: 2 | Last Search Time (UTC) = 2012-12-14 Fri 16:06:36 | Term: monkeys
... [ bunch of records edited out ]
Record id: 340 | Search Count: 7 | Last Search Time (UTC) = 2012-12-15 Sat 01:26:39 | Term: monkeys on typewriters

bing-bar-parser.pl v2012-11-10 Finished!

Notice that the "Record ID for next new record" does not reflect the actual number of records in the file.

Limited testing has been done on SIFT v2.14 and on 64 bit Windows 7 with ActiveState Perl v5.16. The script seems to parse data in accordance with Mari's blog post.

For more information, please read the comments in the code (you can download it from my google code page here).
Special Thank-yous to Mari, "ForensicGeek InTheCorner" and Mark Stosberg - they saved me a bunch of time!

Viewing previous versions of Websites

 Sometimes forensicators might need to look at previous versions of a website.
The Wayback Machine takes historical snapshots of selected websites and allows users to view them in their former glory.

Google also caches pages. You can view the latest cached page using the URL:
http://webcache.googleusercontent.com/search?q=cache:http://nfl.com

where http://nfl.com is the URL in question.
Images/animations may not load so you might have to click on the "Text-only version" link provided.
Incidentally, this cached page is what is loaded when you hover over a Google search result.
The Google cache might be handy if the page has been recently taken offline and has not made it into the Wayback Machine.

Steam Profiles

One of my practice cases involved investigating a PC for communications between a particular user who had "run away" and another party.
I noticed various references to a Steam profile in the Internet History.
Steam is used to download/play online games (eg Call of Duty, Left for Dead) and keeps various statistics on:
- if a user is currently online
- a user's playing time (eg 5 hours in the last 2 weeks)
- a user's game achievements
- a user's friends

So I thought it might be worth looking up this user's Steam profile. It's easy enough - go to: http://steamcommunity.com/apps
and type in the user's Steam profile name. If their profile is public, it should be returned by the search.

In my case, I was able to view the profile and saw that they had recently played and had also included their current location in their profile information. Of course, the user could have typed anything in for their current location but at least there was some indication that the user was still active.

Extracting from an SQLite Database

Say you want to extract the information from an SQLite database in CSV format for re-importing into another program. Here's one way of doing it courtesy of The Definitive Guide to SQLite (2nd ed., 2010) by Allen and Owens ...

1. If required, install the latest "sqlite" command line shell from here.
Note: SIFT already has a version of "sqlite" installed.

2. Run the "sqlite" command with the input sqlite file as an argument. For example type:
"sqlite cookies.sqlite"

3. (Optional) Type ".schema" to list the tables, indexes etc. contained in the given sqlite file.

4. Type ".separator ," to change the default separator to a comma.

5. Type ".output ffcookies.csv" to specify an output file (eg "ffcookies.csv").

6. Run your query for the data that you're interested in. For example type:
"select * from moz_cookies;"

7. Type ".quit" to quit the "sqlite" command line shell.

The query results should now be CSV formatted in the nominated output file.
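Putting steps 2 to 7 together for the Firefox cookies example, the whole session boils down to:

sqlite cookies.sqlite
sqlite> .separator ,
sqlite> .output ffcookies.csv
sqlite> select * from moz_cookies;
sqlite> .quit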
Alternatively, you can also use the SQLiteManager Firefox plugin to open an sqlite file and export it as CSV.

So that's all folks! Thanks for reading/commenting over the past year and here's wishing you a Happy Holiday Season (and Happy New Year too if I don't post anything in the near future).

Sunday, 21 October 2012

Thoughts on Intern Monkeys


I apologise for the long break between posts. I've been doing some renovation work and my well of ideas seems to have run dry. In an attempt to kickstart some creativeness, I recently contacted some people to volunteer my limited testing services. Even though I didn't end up testing much, one of the parties (let's call them an "Anonymous Benefactor") offered me an unpaid remote internship. It has the potential to help both of us - I get some actual hands-on experience and they get a (hopefully timesaving) research monkey.
So this got me to thinking that a post on internship issues could prove useful to my fellow noobs and/or a prospective employer who is considering taking an intern on.

Duties
Both parties should agree on what tasks the intern is expected to do before they commence. For example, the intern will conduct supervised forensic exams on cases, or subsets of data provided by a senior examiner. This may include ongoing cases, past cases, or simulated cases. Analysis may include, but is not limited to Windows registry files, Internet History, keyword searches, timeline construction, data carving and data recovery. Other duties may include report review and writing, research and testing, and script/programming development.

Position Details / Terms and Conditions of Internship
Some other issues which could be addressed include:
Timeframe: List the Start and End date (if applicable/known).
Working Hours: Is the internship part-time/full-time? It might be helpful to list the maximum number of hours per week expected. Time zone differences should also be taken into account for remote internships.
Location: Can the duties be performed remotely (ie via Internet) or is the intern required on site/to travel?
Scheduling: Agree on how work is assigned, what to do if a deadline is unachievable etc.
Remuneration: Spell out if it's an Unpaid Internship and if there is (not) a promise of future employment.
Termination: State the agreed period of notice, if both parties can terminate and what happens to any relevant data/hardware/software after termination (eg gets returned/wiped).
Liability: State who is legally responsible for the intern's work. For example, the intern's work will be verified/reviewed before being used in a report. Any liability then remains with the employer.
Travel costs: Obviously this is more of an issue with remote internships. Should the intern be required to travel / testify in court, both parties should agree beforehand on who will pay for reasonable travel costs.
Equipment: Both parties should agree on what hardware/software will be provided by the intern and what hardware/software will be supplied by the employer. Also, what happens to data/software/hardware upon the ending of the internship. One potential requirement which might surprise a new intern is that analysis computer(s) must not be connected to the Internet whilst the Intern is working with client data. Separate PCs and/or use of Virtual Machines could assist with this requirement.
Software Authorship: If the intern writes a script/program during the internship, do they own it? Or does the employer?
Blogging: If the intern wishes to blog about something they learned/observed, is it OK? Employers should be given the chance to review/approve any content which could potentially disclose confidential information.
Additional work for 3rd parties: Can the intern perform tasks for other parties (eg beta testing)? The employer might want final say just in case the 3rd party is a potential competitor.

Confidentiality
Obviously, the employer is trusting the intern not to disclose sensitive data but if Johnny/Janet Law comes knocking, the intern should be aware that they are obligated to obey any lawful orders. Some orders may even prohibit the intern from notifying their employer.
As an example of a confidentiality clause - the intern is not to disclose any confidential information (eg client data, employer's business data) unless with the employer's consent or as required by law.

Non compete
Address any restrictions on who the intern can work for after their internship ends. This could prove difficult to agree on because the intern will likely be interested in any/all potential offers. Limiting the intern's knowledge of the employer's business practices (eg client list, pricing list) could be one strategy to reduce an intern's ability to "compete" in the future. A remote internship is also less likely to result in the need for a non-compete agreement.

Applicable Labour Laws
This will vary from each state/country. I am not a lawyer so please don't rely on this monkey's ramblings - seek your own legal advice! Some things you may need to consider - term limits, start/end dates, which party is benefiting more (it should probably be the intern).
In general, I think most governments realise that unpaid internships are a good thing (especially in the current economy). As long as everyone agrees to what is expected of them, then there should be no need for lawyers. To minimise any surprises, spell out as much as you think relevant in any internship agreement. It may take a little longer to get started, but it really should be worth the effort.

Final Thoughts
As an intern, my attitude should be to learn as much as possible and to protect the employer's interests. The employer is doing me a favour, so I should treat them accordingly. By addressing the above issues before the internship starts, both parties can then focus on the forensic work at hand.
If you have any questions/thoughts, please leave a comment. Just note my "Anonymous Benefactor" is not seeking any other interns at this time. So please don't ask for their contact details!
I'd like to finish off by thanking a few forensic friends for sharing their valuable thoughts about internships - Mari DeGrazia, Syd Pleno and Carl House.
Hopefully, I will be able post something new and more forensically interesting soon ...

Wednesday, 23 May 2012

A Perl script plays Matchmaker with ExifTool and SQLite


Imagine their cute little offspring!
- Query-able metadata = The ability to sort/query by several metadata fields (not just time).
- Metadata from a well established/tested source library (ExifTool)
- SQLite cross platform compatibility
- And all available for the one low price on SIFT (free!)

Being Perl, it won't be as fast as a compiled language like C/C++.
Also, at this time only .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf and .jpg files are supported.
As a result, I'm pretty sure you wouldn't want to blindly point it at "C:\" as it could take a while/use a lot of memory.
Instead, say you found a directory of interest (eg a suspect's "naughty" directory). You could extract the metadata to the database, take a hash of the database and then send it on to another analyst for further analysis/reporting. Or you could analyse it yourself using standard SQL queries from either the "sqlite3" client or the Firefox "SQLite Manager" plugin.

As far as I know, this capability does not currently exist (at least for open source) so I think this script could prove handy for several different types of investigations. For example: fraud, e-discovery processing/culling, processing jpgs (eg exploitation).

Having a warped sense of humour, I'm calling this extraction tool "SquirrelGripper".
In Australia, to grab a bloke by his "nuts" (accidentally or on purpose or sometimes both) is also known as "the Squirrel Grip". This has been known to happen in the heat of battle/whilst tackling an opponent during a Rugby or Aussie Rules match. Thankfully, I have never had this done to me - I'd like to think I was just too quick but maybe they just couldn't find them LOL.
The idea behind the name is that "SquirrelGripper" (aka "SG") will lead analysts to the low hanging fruit right at the crux of the matter ;)
In polite company (eg client briefing), one could say that "SquirrelGripper" finds all the little nuts/nuggets of information.
By running this script an analyst will hopefully save time. Rather than having to launch the ExifTool exe multiple times for different files/metadata, they can launch this script once and then perform SQL queries for retrieving/presenting their metadata of interest.

Whilst developing this tool, I contacted some more experienced DFIR'ers for feedback. So I'd just like to say a big "ralphabetical" Thankyou to Brad Garnett, Cindy Murphy, Corey Harrell, Gerald Combs, "Girl, Unallocated" and Ken Pryor. Sorry for clogging your inboxes with my semi-organised/semi-chaotic thoughts. Having an audience for my email updates provided both structure and motivation for my thoughts.

An extra bunch of thanks to Corey Harrell (does this guy ever rest?) who was both my testing "guinea pig" and who also made some excellent suggestions. Most, if not all of his suggestions were incorporated into this latest version of SG. I can't wait to see/read how he will use it in his investigations.

Enough already! Let's get started ...

The Design

As SquirrelGripper has grown to over 1000 lines, I have decided to provide the code via my new Google Code project ("cheeky4n6monkey") and just give a general description in this blog post. Like all my scripts, it is provided "as is" and whilst I have done my best, there may still be bugs in the code. As always, perform your own validation testing before using it live.

We'll start by describing how SquirrelGripper will store metadata.
Currently, there are 9 tables in the created database - FileIndex, XLSXFiles, XLSFiles, DOCXFiles, DOCFiles, PPTXFiles, PPTFiles, PDFFiles, JPEGFiles.

FileIndex is the main index table and contains the following fields:
-"AbsFileName" (primary key) = Absolute path and filename. Used as a common key to FileIndex and other file specific tables.
-"DateTimeMetaAdded" = Date and time SquirrelGripper extracted the metadata for a file.
-"FileType" = ExifTool field indicating file type. We can use this to determine which table should contain the remaining file metadata.
-"CaseRef" = Mandatory User specified case tag string.
-"UserTag" = Optional User specified tag string. Can be used for labelling directories/files for a particular SG launch.

The 8 other tables will be used to record file type specific metadata (eg "Slides" in PPTXFiles = Number of Slides for a .pptx file). I decided to use separate tables for each file type because there are simply too many fields for a single table (over 100 in total).

Most of the metadata fields are stored as TEXT (strings) in the database. Some obvious integer fields are stored as INTs (eg Words, Slides, FileSize). GPS co-ordinates from tagged JPEGs are stored in both TEXT and REAL (decimal) format. If in doubt (eg ExifToolVersion), I made the script store the data as TEXT.
Dates are stored as TEXT in the database and I have tested that they can be processed by SQLite functions. For example, "julianday(CreateDate)" works OK. I have also been able to retrieve/sort records by date.
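For instance, creating the main FileIndex table via DBI would boil down to something like this (a sketch based on the fields listed above, assuming $dbh is a connected handle eg from DBI->connect("dbi:SQLite:dbname=nutz2u.sqlite", "", "") - the definitive CREATE TABLE statements are in the script itself):

# sketch only - see squirrelgripper.pl for the actual table definitions
$dbh->do("CREATE TABLE IF NOT EXISTS FileIndex (AbsFileName TEXT PRIMARY KEY, DateTimeMetaAdded TEXT, FileType TEXT, CaseRef TEXT, UserTag TEXT)");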

In the interests of everyone's sanity (including my own) I won't detail any more of the schema. You can see them yourself in the code by searching for the "CREATE TABLE" strings.

The code is structured quite similarly to "exif2map.pl". We are using the File::Find package to recursively search for the handled filetypes and then we call Image::ExifTool's "GetFoundTags" and "GetInfo" functions to extract the metadata. Depending on the extracted "FileType" tag, we call the appropriate handling function (eg "ProcessPPT"). These handling functions will read selected fields and if the field is defined, the function will insert/replace the metadata into the user specified SQLite database. If the metadata does not exist, a "Not Present" string will be inserted into the table instead.
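As a very rough sketch of that flow (the "ProcessPPT" style handler names are from the script; the callback name, test path and everything else here is illustrative only):

# sketch only - recurse through the nominated directories and extract metadata
use File::Find;
use Image::ExifTool;

my @directories = ("/home/sansforensics/squirrel-testdata/subdir1");
my $exiftool = new Image::ExifTool;
find(\&ProcessFile, @directories);

sub ProcessFile
{
    return unless (-f $_ and /\.(docx?|xlsx?|pptx?|pdf|jpe?g)$/i);
    $exiftool->ExtractInfo($File::Find::name) or return;
    my @tags = $exiftool->GetFoundTags();
    my $info = $exiftool->GetInfo(\@tags);
    my $filetype = $info->{"FileType"} || "Unknown";
    # ... then call the matching handler (eg "ProcessPPT") to insert the
    # selected metadata fields into the user specified SQLite database
}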

Selections of metadata fields were made by throwing darts, er, estimating which of the fields would be most likely to hold information of forensic interest.
For example, MS Office 2007+ files have a variable list of "Zip" fields which I thought was of limited value so the script doesn't store these. However, the main fields such as "FileName", "CreateDate", "FileModifyDate" etc. are all included.
Having said that, if you think SquirrelGripper should include a particular field that I've left out, please let me know.

Installing SquirrelGripper

The Image::ExifTool Perl package is already installed on SIFT. We have previously used it in our "exif2map.pl" script. However due to some internal label changes to ExifTool, we need to grab the latest version (v8.90) for our script. We can do this on SIFT by typing: "sudo cpan Image::ExifTool".

The last few posts have also detailed our travels with Perl and SQLite. We have been using the DBI Perl package to interface with SQLite databases. So if you haven't already done it, grab the latest DBI package from CPAN by typing: "sudo cpan DBI".

Next, you can download/unzip/copy "squirrelgripper.pl" (from here) to "/usr/local/bin" and make it executable (by typing "sudo chmod a+x /usr/local/bin/squirrelgripper.pl")

Now you should be ready to run SquirrelGripper on SIFT in all its glory.

To run on Windows, install ActiveState Perl and use the Perl Package Manager to download the ExifTool package (v8.90). DBI should already be installed. Next, copy the "squirrelgripper.pl" script to the directory of your choice.
You should now be able to run SG at the command prompt by typing something like:
"perl c:\squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir c:\squirrel-testdata\subdir1"
See the next section for what each of the arguments means.


Running SquirrelGripper

For this test scenario, I have various .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf files in the "/home/sansforensics/squirrel-testdata/subdir1" directory.
I have also copied various .jpg files to the "/home/sansforensics/squirrel-testdata/subdir2" directory.

It is assumed that a new database will be created for each case. However, the same database can also be used for multiple iterations of the script. Just FYI - you can get a help/usage message by typing "squirrelgripper.pl -h".

The script recursively searches through sub-directories, so please ensure you've pointed it at the right level before launching. It is also possible to mark different sub-directories with different case tags - eg launch the script with one directory using the case tag "2012-04-18-caseA-companyA", then launch it a second time pointing to another directory using the case tag "2012-04-18-caseA-companyB". SG can also handle multiple -dir arguments in case you need to extract data from more than one directory (eg "-dir naughty/pics -dir naughty/docs"). If a "-tag" argument is also specified, it will apply to files from both directories.
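For example, a single (hypothetical) run covering two directories with one user tag could look like:

squirrelgripper.pl -db nutz2u.sqlite -case caseABC -tag naughty-stuff -dir /home/sansforensics/naughty/pics -dir /home/sansforensics/naughty/docs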

The first example uses "-newdb" to create the "nutz2u.sqlite" database in the current directory. It also tags all "subdir1" files with the "fraud-docs" user tag (you can see the "UserTag" value in the "FileIndex" table). Currently, the "-db", "-case" and "-dir" arguments are mandatory.
Note: the -dir directory can be an absolute path or a relative one.

squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir /home/sansforensics/squirrel-testdata/subdir1/

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir1/ is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir1/

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/word2k7.docx
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into DOCXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into DOCFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into PPTFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into PPTXFiles table
sansforensics@SIFT-Workstation:~$



The second call assumes the "nutz2u.sqlite" database already exists and tags all "subdir2" files with the "fraud-pics" tag.

squirrelgripper.pl -db nutz2u.sqlite -case caseABC -tag fraud-pics -dir /home/sansforensics/squirrel-testdata/subdir2

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir2 is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir2

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Lat/Long data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg
lat = 36.1147630001389, long = -115.172811
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into JPEGFiles table
sansforensics@SIFT-Workstation:~$


SquirrelGripper has also been tested by myself using ActiveState Perl v5.14.2 on Win XP, and by Corey Harrell using ActiveState Perl v5.12 on Windows 7 (32/64-bit). Please note, however, that the majority of my testing was done on SIFT v2.12 with Perl 5.10.0.


Once the metadata has been extracted to the SQLite database we can use SQL queries to find specific files of interest. This can be done via the "sqlite3" client and/or more easily via the Firefox "SQLite Manager" plugin.
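For example, a quick (hypothetical) look at the "FileIndex" table via the sqlite3 client could look like:

sqlite3 -header -column nutz2u.sqlite "SELECT AbsFileName, FileType, UserTag FROM FileIndex;"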


Extracting SquirrelGripper results

So now we have our SQLite database chock full of metadata - how can we search it?
We need some basic SQLite commands which we can learn here.
W3Schools and Wikipedia also have some useful general information on SQL queries.

For our first example query we will be finding pictures from a particular camera model and ordering the results by date/time of the original.
Building upon the previous section, we have opened the "nutz2u.sqlite" database by using the Firefox "SQLite Manager" plugin on SIFT.
To do this, open Firefox on SIFT and select "SQLite Manager" from the "Tools" menu to launch it.
Then under the "Database" menu, select "Connect Database", browse to the user specified database (eg "nutz2u.sqlite") and press the "Open" button. Click on the "FileIndex" table tree item on the left hand side and then look under the "Browse & Search" tab. You should now see something like:


nutz2u.sqlite's FileIndex Table

We can see that we've got 12 files of all the different supported file types (see "File Type" column in pic). We can also see the absolute path filename ("AbsFileName"), the Date/Time the data was extracted ("DateTimeMetaAdded"), the case tag field ("CaseRef") and the user tag field ("UserTag").
If we now click on the "JPEGFiles" table tree item and look under the "Browse & Search" tab, we can see which jpeg files had their metadata extracted:

nutz2u.sqlite's JPEGFiles Table

Note: For the file specific tables, you will probably have to use the scroll bar to see all of the fields. Not shown in the pic above are the "Model" and "DateTimeOriginal" fields. We will be using these fields in our SQL query.
To execute a query against the JPEGFiles table, we click on the "Execute SQL" tab and then enter in the following:

SELECT * FROM JPEGFiles WHERE Model='Canon PowerShot SX130 IS' ORDER BY DateTimeOriginal;


We then press the "Run SQL" button and we see that we have found two jpegs that meet our search criteria:

nutz2u.sqlite's SQL Search Results

Closer inspection of the "AbsFileName" fields shows the filenames are "/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg" and "/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg". So given a bunch of .jpgs, we have quickly found multiple pictures that were taken by the same camera model. I thought that was pretty cool, eh?

Similarly, we can run other queries against the database such as:

Finding .docs by same author and sorting by page count:
SELECT * FROM DOCFiles WHERE Author='Joe Friday' ORDER BY PageCount;

Sorting Powerpoint files by Revision Number:
SELECT * FROM PPTFiles ORDER BY RevisionNumber;

Finding all "fraud-pics" user tagged .JPEGs:
SELECT * FROM JPEGFiles, FileIndex WHERE JPEGFiles.AbsFileName=FileIndex.AbsFileName AND FileIndex.UserTag='fraud-pics';

Finding the LastModified time and the user responsible (for .xlsx files) BEFORE a given date/time:
SELECT LastModifiedBy, ModifyDate from XLSXFiles where DATETIME(ModifyDate) < DATETIME('2012-05-01 06:36:54')
Note: We are using the SQLite DATETIME function to convert our string date into a format we can perform comparisons with.

Finding the Filename and Keywords (for .pptx files) where the Keywords contains "dea". This will find keywords such as "death", "deal", "idea". The % represents a wildcard:
SELECT FileName, Keywords from PPTXFiles where Keywords LIKE '%dea%'

Finding the Filename and Template names used (for .doc files) when the Template name starts with "Normal.dot":
SELECT FileName, Template FROM DOCFiles WHERE Template LIKE 'Normal.dot%'

Find Unique Company names by performing a UNION of each table's Company fields. For more information on the UNION keyword see here.
SELECT Company FROM DOCFiles UNION SELECT Company FROM DOCXFiles UNION SELECT Company FROM XLSXFiles UNION SELECT Company FROM XLSFiles UNION SELECT Company FROM PPTXFiles UNION SELECT Company FROM PPTFiles

Finding the Filename and GPS Longitude (for .jpg files) where the GPS Longitude is less than "-115.0":
SELECT FileName, GPSLongitude FROM JPEGFiles WHERE GPSLongitude < '-115.0'

There are LOTS more ways of querying the database because there's a shedload of other fields we can use. Check out the script code (search for "CREATE TABLE") or create your own example database (using SG) for more details on what metadata fields are being extracted. Where possible, I have used the same field names as the ExifTool exe prints to the command line.


Final Words

We have written a Perl script ("squirrelgripper.pl") to extract file metadata into an SQLite Database. We have also shown how to perform some basic SQL queries against that database.

In total, for the .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf, .jpeg file types there were over 100 fields to extract. Whilst I double-checked my script, it is possible I've made some errors/overlooked something. It is also possible to process other file types (eg EXE, DLL) but I thought I would wait for further feedback/specific requests before continuing.

UPDATE: Just had this thought - If/when I do add the EXE & DLL functionality, SquirrelGripper could also be used for locating known Malware/indicators of compromise as well. I've created a monster!

I suspect there will be some commonly used SQL queries across certain types of investigations. So analysts could build up a standard set of queries to run across multiple cases of the same type. That could save analysts time / help narrow down files of interest.
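For example, a standard set of queries could be saved to a plain text file (eg a hypothetical "fraud-standard-queries.sql") and then re-run against each new case database via the sqlite3 client:

sqlite3 -header -column nutz2u.sqlite < fraud-standard-queries.sql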

If you found this script useful and/or you have any questions, please drop me a Tweet / email / leave a comment. The more I know about if/how forensicators are using SquirrelGripper - the better I can make it. Now that the basic framework is in place, it should be pretty quick to add new fields/file types so long as they are supported by the Image::ExifTool package (v8.90).

Thursday, 3 May 2012

Extracting Font metadata from MS Excel 2007 and MS Word 2007 files


Ever wondered how to programmatically extract a list of fonts from MS Excel 2007 or Word 2007?
Me neither ... However, thanks to a Corey Harrell Tweet, I read about a case where documents could be proved false based upon the fonts used. That is, the forgers used a font which did not yet exist at the purported creation time. Corey also helpfully provided links describing which fonts are installed with the various MS Office versions (MS Office 2010, Office 2007 & other MS Office versions).

My first stop was Exiftool (on SIFT v2.12) which extracted plenty of metadata but unfortunately not font information.
So my next stop was a Perl script. But to do this requires some background knowledge of how Excel 2007 and Word 2007 files are structured.

Background

Office 2007 uses a zipped series of XML files to construct the various file types.
These file structures are detailed in the Office Open XML Format specification.
Some helpful summaries are available here and here.
Some Word 2007 specific information is available here.

Another resource is "Digital Forensics with Open Source Tools" by Altheide & Carvey pp.201-203.

The 1 minute briefing is:
- Office Open XML files can be unzipped into a consistent file structure.
- At the top level, there is a "_rels" folder, a "docProps" folder, a "[Content_Types].xml" file and an application specific folder (eg "word" or "xl"). Most sub-folders will also contain a "_rels" folder (containing various .xml.rels files) which are then used to construct relationships between the various folders/files (via an XML schema).
- Document metadata such as Subject, Creator are located in "docProps/core.xml". Additional application specific metadata (eg word count, template used) is located in "docProps/app.xml".
- "[ContentTypes].xml" lists all files (aka "document parts") which make up the ZIP archive.
- Within the application specific folder, there is a "_rels" folder and other various folders and XML files which contain the actual user content (eg headers, footers, footnotes, graphics, comments, Word document text, Excel worksheet contents).
- Any media inserted into a document has its original filename changed to a generic filename (eg "image1.png") but these files can still retain their internal metadata (eg EXIF data).

After unzipping some sample .xlsx and .docx files, I found a list of fonts in the "xl/styles.xml" and "word/fontTable.xml" files respectively. Powerpoint 2007 .pptx files seem to be different - I could not find a central non-theme related file that lists all fonts used :(

XML documents look similar to HTML. Basically, an "element" (or "node") is considered the smallest building block of an XML schema. An element is delineated by a beginning "tag" and an end "tag".  Elements can also contain "attributes" which are indicated by the use of "=" within a "tag".

It'll probably help if I list an (edited) "word/fontTable.xml" excerpt here:

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"> ... </w:font></w:fonts>

So we can see that there is a "w:fonts" element which in turn contains a "w:font" element. The "w:font" element has an "attribute" called "w:name" which has the value "Calibri" (which is what we're interested in).

So now that we know WHAT we want to read, we need a way of programmatically reading the XML.
I found this helpful nugget of code on the StackOverflow forum.

Enter ... CPAN's XML::XPath!

The Code

#CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# MS Office 2010 fonts
# http://support.microsoft.com/kb/2121313

# MS Office 2k7 fonts
# http://support.microsoft.com/kb/924623

# MS Office2k3, 2k, 97 fonts
# http://support.microsoft.com/kb/837463


use strict;

use Getopt::Long;
use XML::XPath;
#use XML::XPath::XMLParser;

my $version = "docx-font-extractor.pl v2012.04.29";
my $help = 0; # help flag
my $isdocx = 0;
my $isxlsx = 0;
my $fontfile = "";

# TODO my @OFFICE2007_FONTS = ();
# TODO my @OFFICE2010_FONTS = ();


GetOptions('help|h' => \$help,
    'd' => \$isdocx,
    'x' => \$isxlsx,
    'f=s' => \$fontfile);

if ($help || $fontfile eq "" || ($isdocx eq 0 and $isxlsx eq 0) || ($isdocx eq 1 and $isxlsx eq 1) )
{
    print("\n$version\n");
    print("Perl script to list fonts used in an MS Office .docx or .xlsx file\n");
    print("Assumes .docx or .xlsx has already been unzipped to a local directory\n\n");
    print("Example: docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml\n");
    print("Example: docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml\n");
    exit;
}

my $xpath = XML::XPath->new(filename => $fontfile);
my $nodeset;
my $xmlfontfield;

if ($isdocx)
{
    # .docx stores font names in the "w:name" attribute of "w:font" elements
    $nodeset = $xpath->find("/w:fonts/w:font");
    $xmlfontfield = "w:name";
}
elsif ($isxlsx)
{
    # .xlsx stores font names in the "val" attribute of "name" elements
    $nodeset = $xpath->find("/styleSheet/fonts/font/name");
    $xmlfontfield = "val";
}

print "Found ".$nodeset->size." results\n";

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute($xmlfontfield);
    print "Found font = $fontname\n";

    # TODO Lookup $fontname in list of stored Office fonts

    # TODO Print "The ... font is installed on MS Office ..."

}

#END CODE


Code Summary

We start off with a "GetOptions" and Help section as usual.

Then we create a new "XML::XPath" object passing in the user specified filename (ie "$fontfile") as the argument.

If the user has specified the .docx flag ("-d"), we get our XML::XPath object ($xpath) to "find" the MS Word specific font element ("/w:fonts/w:font"). Note: the "/"s are used to separate the element names. We then set the "$xmlfontfield" variable to the attribute name we're interested in (ie "w:name"). We'll use this later.

Similarly, if the user has specified the .xlsx flag ("-x"), we get our XML::XPath object ($xpath) to "find" the MS Excel specific font element ("/styleSheet/fonts/font/name") and then set "$xmlfontfield" to the attribute name we're interested in (ie "val").

Our various calls to "$xpath->find" should return results in the form of an XML::XPath::NodeSet object. We can then call XML::XPath::NodeSet's "get_nodelist" to return a list of search results. Next, we can iterate through each "node" and print out the relevant "attribute" value by calling the XML::XPath::Node::Element's "getAttribute" function with the "$xmlfontfield" variable we set previously.

Testing

After unzipping example .xlsx and .docx files to their respective "/home/sansforensics/" sub-folders (eg "unzip /cases/excel2k7-Book1.xlsx -d /home/sansforensics/excelbk1/") we now run our Perl script.

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml
Found 4 results
Found font = Calibri
Found font = Calibri
Found font = Calibri
Found font = Arial Black
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "xl/styles.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><fonts count="4"><font><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><u/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><sz val="11"/><color theme="1"/><name val="Arial Black"/><family val="2"/></font></fonts><fills count="2"><fill><patternFill patternType="none"/></fill><fill><patternFill patternType="gray125"/></fill></fills><borders count="1"><border><left/><right/><top/><bottom/><diagonal/></border></borders><cellStyleXfs count="1"><xf numFmtId="0" fontId="0" fillId="0" borderId="0"/></cellStyleXfs><cellXfs count="5"><xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/><xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/><xf numFmtId="0" fontId="1" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0" applyFont="1"/></cellXfs><cellStyles count="1"><cellStyle name="Normal" xfId="0" builtinId="0"/></cellStyles><dxfs count="0"/><tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/></styleSheet>


The 3 Calibri declarations appear to be due to the use of "normal", "Bold" and "Bold Underlined" versions of the Calibri font. The bold Calibri font is declared using a "<b/>" following the second "<font>" element tag. Similarly, I suspect "<b/><u/>" represents "Bold Underlined". Unfortunately, Word 2007 does not seem to implement the same XML mechanisms for denoting "Bold" and "Underline" fonts as Excel 2007.
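As a rough illustration (a separate sketch, not part of the script above), we could flag those bold/underlined variants by checking each Excel "font" element for "b"/"u" child elements via XML::XPath's "getParentNode" and "exists" functions:

#!/usr/bin/perl -w
# Hedged sketch ONLY - lists .xlsx fonts and flags Bold/Underlined variants.
# The styles.xml path below is just the example location used earlier in this post.
use strict;
use XML::XPath;

my $stylesfile = "/home/sansforensics/excelbk1/xl/styles.xml";
my $xpath = XML::XPath->new(filename => $stylesfile);
my $nodeset = $xpath->find("/styleSheet/fonts/font/name");

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute("val");
    my $fontelement = $node->getParentNode; # the enclosing <font> element
    my $style = "";
    $style .= " (Bold)" if ($xpath->exists("b", $fontelement));
    $style .= " (Underlined)" if ($xpath->exists("u", $fontelement));
    print "Found font = $fontname$style\n";
}

Run against the example "xl/styles.xml" above, this should report the second and third Calibri entries as Bold and Bold Underlined respectively.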

Here's our script's results with Word 2007:

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml
Found 3 results
Found font = Calibri
Found font = Times New Roman
Found font = Cambria
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "word/fontTable.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"><w:panose1 w:val="020F0502020204030204"/><w:charset w:val="00"/><w:family w:val="swiss"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000207B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font><w:font w:name="Times New Roman"><w:panose1 w:val="02020603050405020304"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="20002A87" w:usb1="00000000" w:usb2="00000000" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/></w:font><w:font w:name="Cambria"><w:panose1 w:val="02040503050406030204"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000004B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font></w:fonts>


Note how Word's XML schema is completely different to Excel's.

Conclusion

We have successfully written a "docx-font-extractor.pl" Perl script to extract the generic font names used from (unzipped) MS Office 2007 .docx and .xlsx files.

Whilst the extracted font names generally correspond to Microsoft's lists of installed fonts, they do not match exactly. For example, the MS Office 2007 font table lists "Calibri (TrueType)", "Calibri Bold (TrueType)", "Calibri Italic (TrueType)" and "Calibri Bold Italic (TrueType)", whereas our script can only report the attribute value, ie "Calibri".
If we had a 1:1 mapping, then it would be simple enough to declare an array list of fonts for each MS Office release and then search them for each font found (see TODO comments in code).
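A minimal sketch of that lookup (using a deliberately incomplete, hypothetical Office 2007 font list and re-using the "$nodeset"/"$xmlfontfield" variables from the script above) might look like:

# Hedged sketch ONLY - would replace the print loop in the script above
my @OFFICE2007_FONTS = ("Calibri", "Cambria", "Candara", "Consolas", "Constantia", "Corbel");

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute($xmlfontfield);
    if (grep { $_ eq $fontname } @OFFICE2007_FONTS)
    {
        print "The $fontname font is installed with MS Office 2007\n";
    }
    else
    {
        print "The $fontname font is NOT in our MS Office 2007 font list\n";
    }
}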
Unfortunately, it looks like there's more work required to extract any further font metadata from each of the differing Excel/Word font mechanisms. If only MS had made their font attributes consistent throughout Office *sigh*. And having the full font names in the XML would have been nice too (eg "Calibri Bold (TrueType)") *double sigh*. Anyhoo, that's where I'll leave things (for now) ... I'll finish up by wishing you *Sigh O Nara* ;)