Friday 26 July 2013

Determining (phone) offset time fields



Let me preface this by saying this post is not exhaustive - it only details what I have been able to learn so far. There's bound to be other strategies/tips but a quick Google didn't return much (hence this post). Hopefully, both the post and accompanying script (timediff32.pl available from here) will help my fellow furry/not-so-furry forensicators determine possible timestamp fields.

In a recent case, we had a discovery document listing various SMS/MMS messages and their time as determined by both manual phone inspection and telecommunications provider logs.
However, whilst our  phone extraction tool was able to image the phone and display all of the files, it wasn't able to automagically parse the SMS/MMS databases. Consequently, we weren't immediately able to correlate what was in the discovery document with our phone image. Uh-oh ...

So what happens when you have an image of a phone but your super-duper phone tool can't automagically parse the SMS/MMS entries?
Sure, you might be able to run "strings" or "grep" and retrieve some message text but without the time information, it's probably of limited value.
Methinks it's time to strap on the caffeine helmet and fire up the Hex editor!

Background

Time information is typically stored as an offset number of units (eg seconds) since a particular reference date/time. Knowing the reference date is half the battle. The other half is knowing how the date is stored. For example, does it use bit fields for day/month/year etc. or just a single Big or Little Endian integer representing the number of seconds since date X? Sanderson Forensics has an excellent summary of possible date/time formats here.

Because we have to start somewhere, we are going to assume that the date/time fields are represented by a 32 bit integer representing the number of seconds since date X. This is how the very popular Unix epoch format is stored. One would hope that the simplest methods would also be the most popular or that there would be some universal standard/consistency for phone times right? Right?!  
In the case mentioned previously, the reference dates actually varied depending on what database you were looking at. For example, timestamps in the MMS database file used Unix timestamps (offset from 1JAN1970) where as the SMS Inbox/Outbox databases used GPS time (offset from 6JAN1980). Nice huh?
Anyhow, what these dates had in common was that they both used a 4 byte integer to store the amount of seconds since their respective reference dates. If only we had a script that could take a target time and reference date and print out the (Big Endian/Little Endian) hex representations of the target time. Then we could look for these hex representations in our messages in order to figure out which data corresponds to our target time.

Where to begin?

Ideally, there will be a file devoted to each type of message (eg SMS inbox, SMS outbox, MMS). However, some phones use a single database file with multiple tables (eg SQLite) to store messages.
Either way, we should be able to use a Hex editor (eg WinHex) to view/search the data file(s) and try to determine the record structure.

Having a known date/time for a specific message will make things a LOT easier. For example, if someone allegedly sent a threatening SMS at a particular time and you have some keywords from that message, then using a hex editor you can search your file(s) for those keywords to find the corresponding SMS record(s). Even a rough timeframe (eg month/year) will help narrow the possible date/time values.
For illustrative purposes, let's say the following fictional SMS was allegedly sent on 27 April 2012 at 23:46:12 "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!".

OK assuming that we have searched and found a relevant text string and we know its purported target date, we now take a look at the byte values occurring before/after the message text.
Here's a fictional (and over simplified) example record ...

<44 F2 C5 3C> <ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"> <12 34 54 67> <ascii text="555-1234"> <89 12 67 89>

Note: I'm using the "< >" to group likely fields together and make things easier to read.

This is where our simple script (timediff32.pl) comes in to play. Knowing the target date/date range, we can try our script with various reference dates and see if the script output matches/approximates a particular group of 4 bytes around our text string.
Here's an example of using the script:

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1970-01-01T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1335570372 (decimal)
2012-04-27T23:46:12 is 0x4F9B2FC4 (BE hex)
2012-04-27T23:46:12 is 0xC42F9B4F (LE hex)

sansforensics@SIFT-Workstation:~$


We're using the Unix epoch (1JAN1970 00:00:00) as reference date and our alleged target date of 27APR2012 23:46:12.
Our script tells us the number of seconds between the 2 dates is 1335570372 (decimal). Converted to a Big Endian hexadecimal value this is 0x4F9B2FC4. The corresponding Little Endian hexadecimal value is 0xC42F9B4F.
So now we scan the bytes around the message string for these hex values ...
Checking our search hit, we don't see any likely date/time field candidates.

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

OK now lets try our script with the GPS epoch (6JAN1980 00:00:00) as our reference date ...

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1980-01-06T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1019605572 (decimal)
2012-04-27T23:46:12 is 0x3CC5F244 (BE hex)
2012-04-27T23:46:12 is 0x44F2C53C (LE hex)

sansforensics@SIFT-Workstation:~$


Now our script tells us the number of seconds between the 2 dates is 1019605572 (decimal). Converted to a Big Endian hexadecimal value this is 0x3CC5F244 . The corresponding Little Endian hexadecimal value is 0x44F2C53C .
Returning to our message string hit, we scan for any of these hex values and ...

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

Aha! The 4 byte field immediately before the text string "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!" appears to match our script output for a LE GPS epoch! Fancy that! Almost like it was planned eh? ;)

So now we have a suspected date/time field location, we should look at other messages to confirm there's a date/time field occurring just before the message text. Pretty much rinse/repeat what we just did. I'll leave that to your twisted imagination.

If we didn't find that hex hit, we could keep trying different reference dates. There's a shedload of potential reference dates listed here but there's also the possibility that the source phone is not using a 4 byte integer to store the number of seconds since a reference date.
If you suspect the latter, you should probably check this out for other timestamp format possibilities.

OK so we've tried out our script on other messages and have confirmed that the date/time field immediately precedes the message text. What's next? Well my script monkey instincts tells me to create a script that can search a file for a text message, parse any metadata fields (eg date, read flag) and then print the results to a file for presentation/further processing (eg print to TSV and view in Excel). This would require a bit more hex diving to determine the metadata fields and message structure but the overall process would be the same ie start with known messages and try to determine which bytes correspond to what metadata. I'm not gonna hold your paw for that one - just trying to show you some further possibilities. In case you're still interested, Mari DeGrazia has written an excellent post on reverse engineering sms entries here.

Further notes on determining date/time fields

It is likely that there will be a several groups of bytes that consistently change between message entries. Sometimes (if you're lucky) these fields will consistently increase as you advance through the file (ie newer messages occur later in the file). So if you consistently see X bytes in front of/behind a text message and the value of those X bytes changes incrementally - it's possibly a date field or maybe its just an index.

An interesting observation for date/field offset integers is that as time increases, the least significant byte will change more rapidly than the most significant byte. So 0x3CC5F244 (BE hex) might be followed by 0x3CC5F288 (BE hex). Or 0x44F2C53C (LE hex) might be followed by 0x88F2C53C (LE hex). This can help us decide whether a date field is Big Endian or Little Endian and/or it might be used to determine suspected date/time fields.

Be aware not all time fields use the same epoch/are stored the same (even on the same phone).

I found that writing down the suspected schema helped me to later interpret any subsequent messages (in hex). For example:
<suspected 4 byte date field><SMS message text><unknown 4 bytes><ASCII string of phone number><unknown 4 bytes>
So when I starting looking at multiple messages, I didn't need to be Rain-man and remember all the offsets (eg "Oh that's right, there's 4 bytes between the phone number and last byte of the SMS message text"). In my experience, there are usually a lot more fields (10+) than shown in the simplified example above.

How the Script works

The script takes a reference date/time and a target date/time and then calculates the number of days/hours/minutes/seconds between the two (via the Date::Calc::Delta_DHMS function).
It then converts this result into seconds and prints the decimal/Big Endian hexadecimal/Little Endian hexadecimal values.
The Big Endian hexadecimal value can be printed via the printf "%x" argument (~line 90).
To calculate the Little Endian hexadecimal value we have to use the pack / unpack Perl functions. Basically we convert ("pack") our decimal number into a Big-endian unsigned 32 bit integer binary representation and then unconvert ("unpack") that binary representation as a Little-endian unsigned 32 bit integer (~line 92). This effectively byte swaps a Big-endian number into a Little endian number. It shouldn't make a difference if we pack BE and unpack LE or if we pack LE and then unpack BE. The important thing is the pack/unpack combination uses different "endian-ness" so the bytes get swapped/reversed.

Testing

This script has been tested on both the SANS SIFT VM (v2.14) and ActiveState Perl (v5.16.1) on Win7.

Selected epochs have been validated either using the DCode test data listed on http://www.digital-detective.co.uk/freetools/decode.asp or via known case data. I used selected dates given in the DCode table as my "target" arguments and then verified that my script output raw hex/decimal values that matched the table's example values.
The tested epochs were:
Unix (little/big endian ref 1JAN1970), HFS/HFS+ (little/big endian ref 1JAN1904), Apple Mac Absolute Time/OS X epoch (ref 1JAN2001) and GPS time (tested using case data ref 6JAN1980).

Note: Due to lack of test data, I have not been able to test the script with target dates which occur BEFORE the reference date. This is probably not an issue for most people but I thought I should mention it in case your subject succeeded in travelling back in time/reset their phone time.

Final Words

We've taken a brief look at how we can use a new script (timediff32.pl) to determine one particular type of timestamp (integer seconds since a reference date).
While there are excellent free tools such as DCode by Digital Detective and various other websites that can take a raw integer/hex value and calculate a corresponding date, if your reference date is not catered for, you have to do it yourself. Additionally, what happens when you have a known date but no raw integer/hex values? How can we get a feel for what values could be timestamps?
With this script it is possible to enter in a target date and get a feel for what the corresponding integer/hex values should look like under many different reference dates (assuming they are stored in integer seconds).

If you have any other hints/suggestions for determining timestamp fields please leave a comment.

Wednesday 17 July 2013

G is 4 cookie! (nomnomnom)

What is it?

A Linux/Unix based Perl script for parsing cached Google Analytic requests. Coded/tested on SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script (gis4cookie.pl) can be downloaded from:
http://code.google.com/p/cheeky4n6monkey/downloads/list

The script name is pronounced "G is for cookie". The name was inspired by this ...




Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie type information such as the visiting times, page title, website hostname, referral page, any search terms used to find website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari Degrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "google-analytics.com/__utm.gif?" string and store any hit file offsets.
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytic arguments which are printed either to the command line or to a user specified Tab Separated Variable file.

The following Google Analytic arguments are currently parsed/printed:
utma_first_time
utma_prev_time
utma_last_time
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_last_time
utmz_sessions
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.

To Use It

You can type something like:
./gis4cookie -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "google-analytics.com/__utm.gif?".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable.
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmz's utmr variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in "google-analytics.com/__utm.gif?"). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)