
Friday, 3 July 2015

Extracting One's BLOBs From The Clutches Of SQLite

SQLite BLOB work used to be an adventure ... Not anymore!

Did you know that SQLite databases can also hold binary data? BLOB fields can contain pictures, audio, base64 encoded data and any other binary data you care to wobble your gelatinous finger at.
Monkey's recent foray into SQLite led him to the magic of the "pragma table_info" SQLite query (which returns a table's column configuration). This means that we don't have to hard code our queries and/or know the database schema before we query a table. Consequently, given a table name, we can now query that table for any BLOBs and dump them to file for further analysis. This could come in handy when analysing a browser cache history database (for images) or a voicemail database (for recorded messages) or a contact database (for images) or any other stored binary BLOB (eg binary .plists). So it also sounds like a good fit for mobile device forensic analysis ... awww, my favorite - how did you know? :)

Special Thanks to Alex Caithness of CCLForensics for inspiring/laying the groundwork for this script idea via his 2013 post. Monkey salutes you Alex (in the nice way)!

You can download the Python script (sqlite-blob-dumper.py) from my GitHub page.

The Script

Here's the script's help text:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py -h
Running sqlite-blob-dumper.py v2015-07-03

usage: sqlite-blob-dumper.py [-h] db table outputdir

Extracts BLOB fields from a given SQLite Database

positional arguments:
  db          SQLite DB filename
  table       SQLite DB table name containing BLOB(s)
  outputdir   Output directory for storing extracted BLOBs

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Given an SQLite database filename, the table name containing BLOBs and an output directory, this script will query the table for any BLOB columns, process the BLOB content for file type and then dump the contents to a file in the specified output directory. The file type processing is currently limited to some common mobile device file types but it should be easy to modify for other types in the future (depending on their file signatures). All the processing really does is determine the output filename extension. All BLOBs are extracted - some just have a more user friendly file extension (vs. the default .blob extension).
And before you get on Monkey's back about the lack of service, have you seen how many different file formats there are these days?! ;)

Back to the script ... There's a bit of housekeeping at the beginning of the script, eg checking that the database file exists and creating the output directory if required.
The real fun begins when we do the "pragma table_info" query. This returns a row entry for each column in the specified table. Each row entry (ie each column) has the following fields:
ColumnID, Name, Type, NotNull, DefaultValue, PrimaryKey

The Type field should be set to "BLOB" for BLOBs. The Name field is set to the column name.
We can also figure out which column is being used for the Primary Key (ie unique key index for each record) by looking at the PrimaryKey value (which should be set to 1 for the Primary Key).
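To illustrate, here's a minimal Python sketch (not the actual sqlite-blob-dumper.py code) of how the "pragma table_info" results can be used to find the Primary Key and any BLOB columns. It uses the test database/table names from the Testing section below.

import sqlite3

# Minimal sketch only - not the actual sqlite-blob-dumper.py code
conn = sqlite3.connect("testblob.sqlite")   # test database from the Testing section below
cursor = conn.cursor()
cursor.execute("pragma table_info(main)")   # "main" = test table name

blob_columns = []
primary_key = None
for column_id, name, col_type, not_null, default_value, pk in cursor.fetchall():
    if col_type.upper() == "BLOB":
        blob_columns.append(name)
    if pk == 1:
        primary_key = name

print("Primary Key name is: " + str(primary_key))
print("Detected BLOB columns = " + ", ".join(blob_columns))
conn.close()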

Now that we know which columns contain BLOBs, we can formulate a SELECT query which looks something like:
SELECT primarykeyname, columnname(s) FROM tablename ORDER BY primarykeyname;
Then we iterate through each of the returned table rows and, for each BLOB column, we call the user-defined "calculate_filename" function to construct an output filename using the table name, the primary key value (rowid), the BLOB column name and the first several bytes of the BLOB.
From the first several bytes of each BLOB, we look for certain file signatures (.jpg, .png, .zip, .bplist, .3gp, .3g2, .amr) and name the output file's extension accordingly. If the BLOB is not one of the previously mentioned types, it is given the default file extension of ".blob".
The file naming convention is:
tablename_row_rowid_columnname.ext

Where .ext can be: .jpg, .png, .zip, .blob (default), .bplist (untested), .3gp (untested), .3g2 (untested) or .amr (untested).
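Here's a rough, simplified Python sketch of the extraction loop described above. It is not the actual script - it only checks a few of the listed signatures, skips the error handling and the helper/variable names are made up for illustration.

import os
import sqlite3

SIGNATURES = [
    (b"\xff\xd8\xff", ".jpg"),   # JPEG
    (b"\x89PNG", ".png"),        # PNG
    (b"PK\x03\x04", ".zip"),     # ZIP / Office document
]

def guess_extension(data):
    # Compare the first few BLOB bytes against known file signatures
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return ".blob"   # default extension

def dump_blobs(db, table, pkname, blob_columns, outputdir):
    conn = sqlite3.connect(db)
    query = "SELECT %s, %s FROM %s ORDER BY %s" % (pkname, ", ".join(blob_columns), table, pkname)
    for row in conn.execute(query):
        rowid = row[0]
        for colname, blob in zip(blob_columns, row[1:]):
            if blob is None:
                continue
            data = bytes(blob)
            # tablename_row_rowid_columnname.ext naming convention
            outname = "%s_row_%s_%s%s" % (table, rowid, colname, guess_extension(data))
            with open(os.path.join(outputdir, outname), "wb") as outfile:
                outfile.write(data)
    conn.close()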

For Base64 encoded BLOBs - unfortunately, there does not appear to be a reliable way of determining if a field is Base64 encoded unless you actually try to base64 decode it and the output "looks valid". Counting the Base64 encoded bytes and monitoring the characters used might find some Base64 encodings but it could also catch some strings which are not necessarily Base64 encoded. So, any Base64 BLOBs will end up with a .blob extension.

For further information on file signatures, check out Gary Kessler's huge compendium of file signatures.

Testing

On Ubuntu 14.04 LTS x64 (running Python 2.7.6), we used the Firefox SQLite Manager plugin to create 3 test databases. One had no BLOB data (testnoblob.sqlite), one had one BLOB column (testblob.sqlite) and the last had three BLOB columns (testblobs.sqlite). Due to lack of test data, we only tested the script with .jpg, .png and .zip BLOBs.
Fun fact: You can also use SQLite Manager to insert existing files as BLOBs into a table.

Here's what the database which does not contain any BLOBs (testnoblob.sqlite) looks like:



Now we run the script against the database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testnoblob.sqlite main noblobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
No BLOB columns detected ... Exiting
cheeky@ubuntu:~$


Here's the contents of the "noblobop" directory ... NOTHING! Because there's no BLOBs in the database, silly!




Here's what the second database containing one BLOB column (testblob.sqlite) looks like:



Now we run the script against the database containing one BLOB column:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblob.sqlite main blobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby
Extracting ... blobop/main_row_1_blobby.jpg
Extracting ... blobop/main_row_2_blobby.zip
Extracting ... blobop/main_row_3_blobby.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobop" output directory:



We can see that the .jpg, .zip / office document and base64 text BLOBs have all been extracted successfully.
This was also confirmed by checking the file sizes of each output file against its BLOB size in the table.

Finally, here's the database containing three BLOB columns (testblobs.sqlite):



Now we run the script against the three BLOB database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblobs.sqlite main blobops
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby, blobby2, blobby3
Extracting ... blobops/main_row_1_blobby.jpg
Extracting ... blobops/main_row_1_blobby2.jpg
Extracting ... blobops/main_row_1_blobby3.png
Extracting ... blobops/main_row_2_blobby.zip
Extracting ... blobops/main_row_2_blobby2.zip
Extracting ... blobops/main_row_2_blobby3.zip
Extracting ... blobops/main_row_3_blobby.blob
Extracting ... blobops/main_row_3_blobby2.blob
Extracting ... blobops/main_row_3_blobby3.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobops" output directory:



Note: The .png BLOB (for row id=1, column = blobby3) has also been successfully extracted along with .jpg, .zip and base64 text BLOBs.

We have also run the script on Windows 7 x64 with Python 2.7.6 with the same results.

So there you have it, repeat after me - "Everything seems to be in order ... I think I'll go across the street for some orange sherbet ..."

Final Thoughts

It is hoped that this script (sqlite-blob-dumper.py) can be used during an SQLite database / mobile forensics analysis to quickly retrieve embedded binary data such as pictures, voicemail recordings, video, binary .plist and .zip BLOBs.

As mentioned in the last post, to avoid changing data, analysts should only interface with copies of an SQLite database and not the original file.

This script does not handle deleted BLOB data.

And ... I'm spent! (For those too young, that's another classic Austin Powers movie reference)

Sunday, 28 June 2015

How u like Base(64)?




Monkey was having such a great time, no one had the heart to tell him he had the wrong type of base ...


A recent blog post by Heather Mahalik (@HeatherMahalik) mentioned that a multiple Base64 decoding tool would be useful for mobile application analysis. What is Base64? Basically, it converts bytes into a printable 64 character set. This encoding is typically used when sending email and/or transferring/obfuscating data. Check out the Wikipedia page for more gory details.
There are already several existing tools we can use to perform Base64 decoding. For example, *nix systems have the "base64" command and recently Monkey found that Notepad++ (v6.7.9.2) will handle multiple Base64 encodes/decodes.
However, as most mobile apps use SQLite databases for storage, it would be pretty painful to first query the database and then manually perform each Base64 decode - especially if the field was Base64 encoded multiple times ... Unless of course, you had your own army of monkey interns!

Thankfully, we have previously used Python to interface with SQLite databases and after some quick Googling, we also found that Python has baked in Base64 encode/decode functionality.
So a scripted solution seems like the way to go (Sorry, intern monkey army!).

You can download the script (sqlite-base64-decode.py) from my GitHub page.

The Script

The user has to provide the script with the database filename, the table name, the Base64 encoded field's name and the number of iterations to run the Base64 decode.
The script will then query the database and print each row's column values plus the respective Base64 decode result in tab separated format.

Each app's database will have its own schema so we first need to run a "pragma table_info" query to find out how the database is laid out. 
Specifically, we want to find out:
- the table's Primary Key name (for ordering the main query by),
- the table column names (for printing) and
- the index (column number) of the Base64 encoded column (the user provided the encoded field's name but we also need to know the index)

Once we have this info, we can then run our main query which will be the equivalent of:
SELECT * FROM tablename ORDER BY primarykeyname;
We then iterate through each returned row, run the base64.decodestring function the requested number of times and print both the returned row data and the decoded result.
On a decode error, the script prints "*** UNKNOWN ***" for the decoded value.
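As a hedged illustration (again, not the actual script code), the repeated-decode-with-fallback idea boils down to something like this (Python 2.x, as used in the post):

import base64
import binascii

def multi_b64_decode(encoded, count):
    # Decode "count" times; any decode error returns the UNKNOWN marker
    decoded = encoded
    for _ in range(count):
        try:
            decoded = base64.decodestring(decoded)
        except binascii.Error:
            return "*** UNKNOWN ***"
    return decoded

print(multi_b64_decode("U0dWNUlFMXZibXRsZVNFPQ==", 2))   # prints: Hey Monkey!
print(multi_b64_decode("U0dWNUlFMXZibXRsZVNFPQ==", 3))   # prints: *** UNKNOWN ***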

Here's the help text:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py -h
Running sqlite-base64-decode v2015-06-27
usage: sqlite-base64-decode.py [-h] db table b64field b64count

Extracts/decodes a base64 field from a SQLite DB

positional arguments:
  db          Sqlite DB filename
  table       Sqlite DB table name containing b64field
  b64field    Suspected Sqlite Base64 encoded column name
  b64count    Number of times to run base64 decoding on b64field

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Future work might have the script sample each column's data to figure out which is Base64 encoded.
Base64 encoded data is typically limited to the following characters:
A-Z
a-z
0-9
+
/
=


Because the = sign is used for padding, it is usually a good indicator of Base64 encoding (especially at the end of the encoded string).
Base64 encoding usually takes 3 binary bytes (24 bits) and turns them into 4 printable bytes (32 bits). So the final encoding should be a multiple of 4 bytes.
Additionally, the more times you encode in Base64, the longer the resultant string.
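If you did want to take a punt, a rough heuristic based on the observations above might look like the sketch below - but as noted, short plain text strings can still slip through as false positives.

import re

# Allowed Base64 characters, with "=" only permitted as trailing padding
B64_PATTERN = re.compile(r'^[A-Za-z0-9+/]+={0,2}$')

def looks_like_base64(text):
    # Length must also be a multiple of 4 bytes
    return (len(text) % 4 == 0) and bool(B64_PATTERN.match(text))

print(looks_like_base64("U0dWNUlFMXZibXRsZVNFPQ=="))   # True
print(looks_like_base64("Hey Monkey!"))                # False (space and "!")
print(looks_like_base64("Pass"))                       # True - a false positive!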

Testing

For testing, we added the "base64enc" column to our previous post's testsms.sqlite database (specifically, the "sms" table). The test data looked like this:

Modified "sms" table with "base64enc" column added

The values for "base64enc" correspond to 2 x Base64 encoding the "message" value.
To obtain the 2 x Base64 encoded value, on Ubuntu we can do this:

cheeky@ubuntu:~$ echo -n 'Hey Monkey!' | base64
SGV5IE1vbmtleSE=
cheeky@ubuntu:~$

cheeky@ubuntu:~$ echo -n 'SGV5IE1vbmtleSE=' | base64
U0dWNUlFMXZibXRsZVNFPQ==
cheeky@ubuntu:~$ 


Note: The "-n" removes the newline character added by the "echo" command

So we can see our last encoding result corresponds to our "sms" table pic above.
ie 2 x Base64 encoding of 'Hey Monkey!' is U0dWNUlFMXZibXRsZVNFPQ==
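The same 2 x encoding can also be done in Python (2.x) if you don't have the "base64" command handy:

import base64

once = base64.b64encode('Hey Monkey!')
twice = base64.b64encode(once)
print(once)    # SGV5IE1vbmtleSE=
print(twice)   # U0dWNUlFMXZibXRsZVNFPQ==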

Similarly, we can also use Notepad++ to do the encoding via "Plugins ... MIME Tools ... Base64 Encode".



As we see in the pic above, I used Notepad++ to 2 x Base64 encode the various "message" values and then inserted those values into the "sms" table's "base64enc" field using the SQLite Manager Firefox Plugin.

Now we run our script on our newly modified testsms.sqlite file ...
For shiggles, let's initially specify a 1 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 1
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    SGV5IE1vbmtleSE=
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    SGV5IFN0cmFuZ2VyIQ==
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    UCBpcyBmb3IgUEFHRURVTVAh
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    SSB3b25kZXIgd2hhdCBwZW9wbGUgd2l0aCBhIGxpZmUgYXJlIGRvaW5nIHJpZ2h0IG5vdyAuLi4=
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    VGhpcyBpcyBzbyBleGNpdGluZyEgSXQgcmVtaW5kcyBtZSBvZiBvbmUgdGltZSAuLi4gYXQgQmFuZCBDYW1wIC4uLg==

Exiting ...
cheeky@ubuntu:~$


No real surprises here. We can see the "B64Decoded" fields are still Base64 encoded. Also, apologies for the crappy layout ...
Now let's try a 2 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 2
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    Hey Monkey!
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    Hey Stranger!
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    P is for PAGEDUMP!
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    I wonder what people with a life are doing right now ...
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    This is so exciting! It reminds me of one time ... at Band Camp ...

Exiting ...
cheeky@ubuntu:~$


Note: The "message" and "B64Decoded" fields are the same - we have found our original message! :)
Finally, let's try a 3 x Base64 decode to see if the script falls into a screaming heap:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 3
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    *** UNKNOWN ***
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    *** UNKNOWN ***
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    *** UNKNOWN ***
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    *** UNKNOWN ***
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    *** UNKNOWN ***

Exiting ...
cheeky@ubuntu:~$ 


Note: The "*** UNKNOWN ***" values indicate that a decoding error has occurred (from testing this is usually due to a padding error).

We also ran these tests on a Windows 7x64 PC running Python 2.7.6 with the same results.

Final Thoughts

Special Thanks to Heather Mahalik for mentioning the need for the script. One of the great things about getting script ideas from Rockstar practitioners in the field is that it's not going to be some banana-in-the-sky idea that no one uses. This script might actually be useful LOL.

The script ass-umes only one field is Base64 encoded and that the Primary Key only uses one field.
The script has only been tested with Monkey's own funky data - it will be interesting to see how it goes against some real life user data.

The "pragma table_info" query is something Monkey will probably re-use in the future because it allows us to discover a database table's schema rather than hard-coding a bunch of assumptions about the table.

Deleted table data is not addressed by this script.

Monkey's recent blue period of posts might be drawing to a close. Oh well, it was fun while it lasted. Maybe I can now get a life ... yeah, right ;)

Tuesday, 23 June 2015

Deleted SQLite Parser Script Update (Now With Added DFIR Rockstar!)


Monkey says: "Knowing DFIR Rockstars has its privileges!" (Mari's picture courtesy of her Google+ Profile)


This post aims to build upon Mari DeGrazia's sqlparse Python script which harvests data from unallocated and free blocks in SQLite databases. It is also available as a Windows command line exe and/or a Windows GUI exe here.
Further details regarding her initial script can be found here. Mari's script has proven so useful that it's referred to in the SANS585 Advanced Smartphone Forensics course and by at least 2 books on mobile forensics (Practical Mobile Forensics by Bommisetty, Tamma and Mahalik (2014) and Learning iOS Forensics by Epifani and Stirparo (2015)).
Mari's impressive DFIR research in a variety of areas has also led her to attain her well deserved DFIR Rockstar status, as attested to by her fellow DFIR Rockstar, Heather Mahalik.
That's a pretty impressive introduction eh? Mari - my promotions check is in the mail right? Right? ;)

OK, so whats Monkey got to do with it?
I was spinning my paws looking at deleted SMS from an Android (circa 4.1.2) LG-E425 phone (aka LG Optimus L3 E430) when I remembered Mari's script and thought of some minor modifications that would allow analysts to recover additional string data from re-purposed SQLite pages.
A commercial mobile forensics tool was reporting a different number of deleted SMS on consecutive reads via flasher box. Specifically, the parsing of read 1 was producing X deleted SMS while the parsing of read 2 was producing X-1 deleted SMS.
Admittedly, the flasher box was initiating a reboot after each acquisition - so any unused pages in the phone's memory could have been freed/reused, thus affecting the number of recoverable deleted SMS.
However, as Monkey actually gets paid to do this for a living (pinch me, I must be dreaming!), a closer inspection was carried out.
While the total number of deleted SMS varied by one, there were two deleted SMS in report 1 that weren't in report 2. Additionally, there was one deleted SMS in report 2 that wasn't in report 1.
So while the net difference was one less SMS, there was a bit more going on behind the scenes.
Fortunately, the commercial forensic tool also showed the image offset where these "deleted" SMS entries were found so we had a good starting point ...

OK Monkey, put your floaties on. It's time for some SQLite diving!


An SQLite database is comprised of a number of fixed sized pages. The number of pages and page size are declared in the file header. According to the official documentation, there are 4 types of page. The first byte of each page (occurring after the file header) tells us what type of page it is. The actual row data from a database table lives in a "Leaf Table B-Tree" page type. This has the flag value of 0xD (13 decimal). In the interests of readability / reducing carpal tunnel syndrome, we shall now refer to these pages as LTBT pages.
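Here's a small Python sketch (assuming the database is a regular file and ignoring some header subtleties) showing how the page size and each page's first/flag byte can be read, per the description above:

import os
import struct

def list_page_flags(dbfile):
    filesize = os.path.getsize(dbfile)
    with open(dbfile, "rb") as f:
        header = f.read(100)                              # 100 byte file header
        pagesize = struct.unpack(">H", header[16:18])[0]  # big-endian value at offset 16
        num_pages = filesize // pagesize
        for page in range(num_pages):
            f.seek(page * pagesize)
            flag = ord(f.read(1))
            # Note: page 1 actually begins with the "SQLite format 3" header string,
            # so its first byte is the 'S' (0x53) rather than a true flag value.
            print("Page %d at offset %d starts with byte 0x%X" % (page + 1, page * pagesize, flag))

list_page_flags("testsms.sqlite")   # eg the test database created later in this post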

A typical LTBT page looks like this:

An 0xD (LTBT) page with unallocated, allocated cells and free blocks

Back to Monkey's problem (well, one he actually has a chance of solving!) ... I observed that some of those "deleted" SMS were appearing in non-LTBT pages. The commercial mobile forensic tool was then finding/listing some of these entries but not all of them.
To accurately carve an entire SQLite record, you need to know the record's schema (order and type of column data) before reading the actual data. Any pages with overwritten cell headers (eg repurposed pages) may be difficult to accurately carve for all records. However, if we narrow our record recovery to any string content within a page, it becomes a lot easier. See this previous blog post for further details on carving SQLite records.

Within our LG phone data, it appeared that a page previously employed as an LTBT page was then re-purposed as another type (flag = 5, an "Interior Table B-tree" page). However, as this new page only used the first (and last) few bytes of the page, it still had previous record data left over in the Unallocated region (see picture below).


The Unallocated region in non-LTBT pages can contain previous record data!


This previous data included some SMS records - some of which were being reported by the tool as deleted, while others were not.
This reporting discrepancy might have been because some of these SMS records also existed in allocated LTBT pages elsewhere or maybe it was due to the method the commercial tool was using to carve for SMS records. Due to the closed source nature of commercial tools, we can only speculate.
So rather than try to reverse engineer a proprietary tool, Monkey remembered Mari's sqlparse Python script and thought it might be easier/more beneficial to extend Mari's script to print the strings from all non-LTBT pages. By doing this, we can find which non-LTBT pages have previous row data in them (assuming the row data contained printable ASCII strings like SMS records). This will allow us to hunt for deleted records more efficiently (versus running strings over the whole file and having to figure out which strings are allocated / not allocated).
Because Mari had written her code in a logical / easy to understand manner (and commented it well!), it didn't take long to modify and only required about 10 extra lines of code.

You can download the updated software (command line Python, command line Windows exe, Windows GUI exe) from Mari's Github page. She is also writing an accompanying blog post which you can find at her blog here.

The Script

From my GitHub account, I "forked" (created my own copy of) Mari's SQLite-Deleted-Records-Parser project, made my changes and then committed them to my own branch. Then I submitted a "pull" request to Mari so she could review and accept the changes. Mari then found an interoperability bug between the new code and the existing raw mode, which she also fixed. Thanks Mari!

At the start of the script, I added code to parse the optional -p flag (which is stored as the "options.printpages" boolean) so the script knows when to print the non-LTBT page printable characters to the user specified output file.
Next, I added an "elif" (else if) to handle non-LTBT pages (ie pages where the flag does not equal 13). This is where I stuffed up as I did not allow for the user specifying -r for raw mode (dumps deleted binary data) at the same time as the -p option. Mari fixed it so that in raw + printpages mode, the printable strings are now dumped from non-LTBT pages and deleted content is dumped from LTBT pages (as before).

Here's our cross-bred "elif" code (as of version 1.3):

    elif (options.printpages):
        # read block into one big string, filter unprintables, then print
        pagestring = f.read(pagesize-1) # we've already read the flag byte
        printable_pagestring = remove_ascii_non_printable(pagestring)
      
        if options.raw == True:
            output.write("Non-Leaf-Table-Btree-Type_"+ str(flag) + ", Offset " + str(offset) + ", Length " + str(pagesize) + "\n")
            output.write("Data: (ONLY PRINTABLE STRINGS ARE SHOWN HERE. FOR RAW DATA, CHECK FILE IN HEX VIEWER AT ABOVE LISTED OFFSET):\n\n")
            output.write(printable_pagestring)
            output.write( "\n\n")
        else:
            output.write("Non-Leaf-Table-Btree-Type_" + str(flag) + "\t" +  str(offset) + "\t" + str(pagesize) + "\t" + printable_pagestring + "\n" )


The code above is called for each page.
Depending on whether we are in raw mode, the output is written as binary (raw mode) or tab separated text (non-raw mode) to the user specified output file.
Depending on the number of non-LTBT pages and their string content, the output file might be considerably larger when you run the script with the -p argument than without it.

In both raw and non-raw modes, the output files share some common fields.
The "Non-Leaf-Table-Btree-Type_Z" field shows what type of page is being output, where Z is the flag type of the non-LTBT page (eg 2, 5, 10, 0 etc).
The offset field represents the file offset for that page (should be a multiple of the page size).
No prizes for guessing what the page size field represents (this should be constant).
The last field will be the actual printable text. Because the script removes unprintable characters, the output string should not be too large, which should make it easier to spot any strings of interest.

Here's the help text:

cheeky@ubuntu:~$ python ./sqlparse.py -h
Usage: Parse deleted records from an SQLite file into a TSV File or text file
Examples:
-f /home/sanforensics/smsmms.db -o report.tsv
-f /home/sanforensics/smssms.db -r -o report.txt


Options:
  -h, --help            show this help message and exit
  -f smsmms.db, --file=smsmms.db
                        sqlite database file
  -o output.tsv, --output=output.tsv
                        Output to a tsv file. Strips white space, tabs and
                        non-printable characters from data field
  -r, --raw             Optional. Will out put data field in a raw format and
                        text file.
  -p, --printpages      Optional. Will print any printable non-whitespace
                        chars from all non-leaf b-tree pages (in case page has
                        been re-purposed). WARNING: May output a lot of string
                        data.
cheeky@ubuntu:~$


Testing

I tested the new script with an existing test Android mmssms.db and it seemed to work OK as I was able to see non-LTBT string content for various pages.
To show you that the new code doesn't fall in a screaming heap, on an Ubuntu 14.04 64-bit VM running Python 2.7, we're going to use the SQLite Manager Firefox plugin to create a test database (testsms.sqlite) with a test table (sms). Then we'll populate the table with some semi-amusing test data and then use a hex editor to manually add a test string ("OMG! Such WOW!!!") into a non-LTBT page (freelist page).

Here's the test "sms" table row data:

One time ... at Band Camp ...

Here's the relevant database header info screenshot:

Note: The page size is 1024 bytes and there are 3 total pages. The last page is on the freelist (unused).

To create a non-LTBT page (page with a non 0xD flag value), I added another test table, added some rows and then dropped (deleted) that table. The database's auto-vacuum was not set. This resulted in the third page being created and then having its type flag set to 0 (along with any row data it seems). This suggests that pages on the free list have their first byte set to zero and it also may not be possible to recover strings from zeroed freelist pages. At any rate, we now have a non-LTBT page we can manually modify and then parse with our new code.

Here's the gory page by page breakdown of our testsms.sqlite file ...

Page 1 starts with the "SQLite format 3" string and not a flag type.

Page 2 contains the test "sms" table data (ie LTBT page).

Page 3 contains the freelist page (non-LTBT) and our test string.

After using WinHex to add our test string to an arbitrary offset in the last page, we then ran our script (without the -p) and checked the contents of the outputfile.

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv
cheeky@ubuntu:~$


Here's what the testoutput.tsv file looked like:


As expected, our test string in a non-LTBT page was not extracted.
Then we re-ran the script with the -p argument ...

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p
cheeky@ubuntu:~$

Here's the output file:


The new version of the script has successfully extracted string content from both of the non-LTBT pages (ie page 1 and page 3).

OMG! Such WOW!!! Indeed ...

You might have noticed the first entry (at offset 0) being "Non-Leaf-Table-Btree-Type_83". This is because the very first page in an SQLite database starts with the string "SQLite format 3" - there is no flag as such. "S" in ASCII is decimal 83, so that's why the type is declared as 83. You can also see the rest of the string ("QLite format 3") following on with the rest of the printable string data in the Data column.

OK now we try adding the -r (raw) mode argument:

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p -r
cheeky@ubuntu:~$


Because there's now binary content in the output file, Ubuntu's gedit spits the dummy when viewing it. So we use the Bless Hex editor to view the output file instead.

Raw Mode + PrintPages Mode Output

Notice how the first page's string content is shown (look for the "QLite format 3" string towards the top of the pic). Remember, the first page is not considered an LTBT page, so its printable strings are retrieved.
There's also a bunch of Unallocated bytes retrieved (values set to zero) from offset 1042 which corresponds to Page 2's Unallocated area. Remember, Page 2 is an LTBT page - so the script only extracts the unallocated and free blocks (if present).
And finally, circled in red is our test string from Page 3 (a non-LTBT page type).
Cool! It looks like everything works!

Similarly, I re-ran the script on a Windows 7 Pro 64-bit PC running Python 2.7 with the same results.

Final Thoughts

Special Thanks again to Mari for releasing her sqlparse script and also for her prompt help and patience in rolling out the new updates.
I contacted her on Thursday about the idea and just a few short days later we were (well, she was) already releasing this solution ... Awesome!
A lesson (re)learned was that even if you're only adding a few lines of code, be aware of how they fit into the existing code structure. None of this, "I'm not touching that part of the code, so it should be OK *crosses fingers*". Especially because it was someone else's baby, Monkey should have re-tested all of the existing functionality before requesting a merge. Thankfully in this case, the original author was available to quickly fix the problem and the solution was relatively straightforward.
During the initial investigation of the LG mmssms.db database, I checked the file header and there were no freelist (unused) pages allocated. The official documentation says:

A database file might contain one or more pages that are not in active use. Unused pages can come about, for example, when information is deleted from the database. Unused pages are stored on the freelist and are reused when additional pages are required.

The lack of freelist pages might have been because of the vacuum settings.
Anyhoo, with the -p option enabled, this new version of the script will process freelist pages and print any strings from there too (just in case).

Also, don't forget to check for rollback journal files (have "-journal" appended to DB filename) and write-ahead logs (have "-wal" appended to DB filename) as other potential sources of row data. They should be small enough to quickly browse with a hex editor. When using an SQLite reader to read a database file, be careful not to open it with those journal/log files in the same directory as that could result in the addition/removal of table data.

Be wary of the difference between documentation and implementation (eg the official SQLite documentation didn't mention pages with flag values of zero). Reading the available documentation is just one facet of research. Simulating / seeing real-world data is another. Reading the available source code is yet another. But for practical analysis, nothing beats having real-world data to compare/analyze.
After all, it was real-world data (and some well-timed curiosity) that lead us to adding this new functionality.

In keeping with the Rockstar theme ... Monkey, OUT! (drops microphone)

Thursday, 31 July 2014

Squirrelling Away Plists

Just grabbin some acorns ...

Plists are Apple's way of retaining configuration information. They're scattered throughout OS X and iOS like acorns and come in 2 main types - XML and binary. Due to their scattered nature and the potential for containing juicy artefacts, monkey thought a script to read plists and extract their data into an SQLite database might prove useful. The idea being analysts run the script (plist2db.py) against a directory of plists and then browse the resultant table for any interesting squirrels. Analysts could also execute the same queries against different sets of data to find common items of interest (eg email addresses, filenames, usernames).
Similar in concept to SquirrelGripper which extracted exiftool data to a DB, the tool will only be as good as the data fields extracted and the analyst's queries. At the very least, it allows analysts to view the contents of multiple plists at the same time. Plus we get to try out Python 3.4's newly revised native "plistlib" which now parses BOTH binary and XML plists. Exciting times!

Not having easy access to an OS X or iOS system, monkey is going to have to improvise a bit for this post and also rely upon the kindness of plist donators. Special Thanks to Sarah Edwards (@iamevltwin) and Mari DeGrazia (@maridegrazia) for sharing some sample plists used for testing.

XML based plists are text files which can be read using a text editor. Binary plists follow a different file format and typically require a dedicated reader (eg plist Editor Pro) or conversion to XML to make it human readable.
Both types of plist support the following data types:

CFString = Used to store text strings. In XML, these fields are denoted by the <string> tag.
CFNumber = Used to store numbers. In XML, the <real> tag is used for floats (eg 1.0) and the <integer> tag is used for whole numbers (eg 1).
CFDate = Used to store dates. In XML, the <date> tag is used to mark ISO formatted dates (eg 2013-11-17T20:10:06Z).
CFBoolean = Used to store true/false values. In XML, these correspond to <true/> or <false/> tags.
CFData = Used to store binary data. In XML, the <data> tag marks base64 encoded binary data.
CFArray = Used to group a list of values. In XML, the <array> tag is used to mark the grouping.
CFDictionary = Used to store sets of data values keyed by name. Typically data is grouped into dictionaries with <key> and <value> elements. The <key> fields use name strings. The <value> elements are typically one of the following - <string>, <real>, <integer>, <date>, <true/>, <false/>, <data>. The order of key declaration is not significant. In XML, the <dict> tag is used to mark the dictionary boundaries.

To show how it all fits together, let's take a look at an XML plist example featuring everyone's favourite TV squirrel ...


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Name</key>
    <string>Rocket J. Squirrel</string>
    <key>Aliases</key>
    <array>
        <string>Rocky the Flying Squirrel</string>
        <string>Rocky</string>
    </array>
    <key>City of Birth</key>
    <string>Frostbite Falls</string>
    <key>DNA</key>
    <data>
    cm9ja3ktZG5hCg==
    </data>
    <key>Year Of Birth</key>
    <integer>1959</integer>
    <key>Weight</key>
    <real>2.5</real>
    <key>Flight Capable</key>
    <true/>
</dict>
</plist>

Note: The DNA <data> field "cm9ja3ktZG5hCg==" is the base64 encoding of "rocky-dna".

We can cut and paste the above XML into plist Editor Pro and save it as a binary plist.
We can also open a new text file and paste the above XML into it to create an XML plist.
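Alternatively (assuming you already have Python 3.4 installed - see the Setting Up section below), plistlib itself can do the XML-to-binary conversion. A quick sketch using the test filenames from later in this post:

import plistlib

# Read the XML plist ...
with open("rocky-xml.plist", "rb") as infile:
    data = plistlib.load(infile)            # plistlib.load() handles XML and binary plists

# ... and write it back out as a binary plist
with open("rocky.plist", "wb") as outfile:
    plistlib.dump(data, outfile, fmt=plistlib.FMT_BINARY)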

Further Resources

The Mac Developer Library documentation describes Plists here and the Apple manual page describes XML plists here.

Michael Harrington has a great working example / explanation of the binary plist file format here and here.

Setting Up

Using the binary capable "plistlib" requires Python v3.4+. So if you don't have it installed, you're gonna be disappointed. Note: Ubuntu 14.04 has Python 3.4 already installed so if you're already running that, you don't have to worry about all this setup stuff.

To install Python 3.4 on Ubuntu 12.04 LTS (eg like SANS SIFT 3), there's a couple of methods.
I used this guide from James Nicholson to install the 3.4.0beta source onto my development VM.
FYI 3.4.1 is currently the latest stable release and should be able to be installed in a similar manner.

There's also this method that uses an Ubuntu Personal Package Archive from Felix Krull.
But Felix makes no guarantees, so I thought it'd be better to install from source.

Alternatively, you can install Python 3.4.1 on Windows (or for OS X) from here.

Not having a Mac or iPhone, monkey created his own binary and XML plist files. First, we define/save the new binary plist file using plist Editor Pro (v2.1 for Windows), then we copy/paste the XML into a new text file on our Ubuntu development VM and save it. This way we can have both binary and XML versions of our plist information. Note: Binary plists created by plist Editor Pro in Windows were read OK by the script in Ubuntu. However, Windows created XML plists proved troublesome (possibly due to Windows carriage returns/linefeeds?) - hence the cut and paste from the XML in plist Editor Pro to the Ubuntu text editor for saving.

For squirrels and giggles, we'll continue to base our test data on characters from the Rocky and Bullwinkle Show. For those that aren't familiar with the squirrel and moose, commiserations and see here.

The Script

For each file in the specified input directory (or just for an individual file), the script calls the "plistlib.load()" function.
This does the heavy lifting and returns the "unpacked root object" (usually a dictionary).
The script then calls a recursive "print_object" function (modified/re-used from here) to go into each/any sub-object of the root object and store the filename, plist path and plist value in the "rowdata" list variable.

Once all plist objects have been processed, the script creates a new database using the specified output filename and SQL "replaces" the extracted "rowdata" into a "plists" table. We use SQL "replace" instead of SQL "insert" so we don't get "insert" errors when running the script multiple times using the same source data and target database file. Although to be prudent, it's just as easy to define a different output database name each time ... meh.
The "plists" table schema looks like:

CREATE TABLE plists(filename TEXT NOT NULL, name TEXT NOT NULL, value TEXT NOT NULL, PRIMARY KEY (filename, name, value) )

Note: The "plists" table uses the combination of filename + name + value as a Primary Key. This should make it impossible to have duplicate entries.

See comments in code for further details.
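If you just want the gist without reading the full script, here's a heavily condensed sketch of the approach described above (it skips the error handling and uses simplified path/helper names):

import plistlib
import sqlite3

def walk(obj, path, filename, rowdata):
    # Recursively flatten dicts/arrays into (filename, name, value) rows
    if isinstance(obj, dict):
        for key, value in obj.items():
            walk(value, path + "/" + str(key), filename, rowdata)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            walk(value, path + "/" + str(index), filename, rowdata)
    else:
        rowdata.append((filename, path, str(obj)))

def plist_to_db(plistfile, dbfile):
    with open(plistfile, "rb") as f:
        root = plistlib.load(f)              # handles both XML and binary plists

    rowdata = []
    walk(root, "", plistfile, rowdata)

    conn = sqlite3.connect(dbfile)
    conn.execute("CREATE TABLE IF NOT EXISTS plists(filename TEXT NOT NULL, "
                 "name TEXT NOT NULL, value TEXT NOT NULL, "
                 "PRIMARY KEY (filename, name, value))")
    conn.executemany("REPLACE INTO plists VALUES (?, ?, ?)", rowdata)
    conn.commit()
    conn.close()

plist_to_db("rocky.plist", "bullwinkles.sqlite")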

Testing

To run the script we just point it at a directory or individual plist and give it a filename for the output SQLite database.
Here we are using the python3.4 beta exe from my Ubuntu development VM's locally installed directory ...

cheeky@ubuntu:~/python3.4b/bin$ ./python3.4 /home/cheeky/plist2db.py
Running  plist2db.py v2014-07-24

Usage: plist2db.py -f plist -d database

Options:
  -h, --help   show this help message and exit
  -f FILENAME  XML/Binary Plist file or directory containing Plists
  -d DBASE     SQLite database to extract Plist data to
cheeky@ubuntu:~/python3.4b/bin$

Here's how the test data was stored ...

cheeky@ubuntu:~/python3.4b/bin$ tree /home/cheeky/test-plists/
/home/cheeky/test-plists/
├── bin-plists
│   ├── boris.plist
│   ├── bullwinkle.plist
│   ├── natasha.plist
│   └── rocky.plist
├── Red-Herring.txt
└── xml-plists
    ├── boris-xml.plist
    ├── bullwinkle-xml.plist
    ├── natasha-xml.plist
    └── rocky-xml.plist

2 directories, 9 files
cheeky@ubuntu:~/python3.4b/bin$

Note: "Red-Herring.txt" is text file included to show how non-plist files are handled by the script.

Now we can try our extraction script with our test data ...

cheeky@ubuntu:~/python3.4b/bin$ ./python3.4 /home/cheeky/plist2db.py -f /home/cheeky/test-plists/ -d /home/cheeky/bullwinkles.sqlite
Running  plist2db.py v2014-07-24

*** WARNING /home/cheeky/test-plists/Red-Herring.txt is not a valid Plist!

cheeky@ubuntu:~/python3.4b/bin$

Here is a screenshot of the resultant "bullwinkles.sqlite" database ...

Test Data Output Database

Note: The XML plist DNA <data> fields shown have been extracted and base64 *decoded* automatically by the "plistlib" library. Our test data binary plists store the raw ASCII values we entered and the XML plists store the base64 encoded values. Being text based, I can understand why XML encodes binary data as base64 (so it's printable). But binary plists don't have the printable requirement, so there's no base64 encoding/decoding step and the raw binary values are written directly to the binary plist file.

By having the raw hexadecimal values from the <data> fields in the DB, we can cut and paste these <data> fields into a hex editor to see if there's any printable characters ...



Binary rocky.plist's DNA data value

From the previous 2 pictures, we can see that the "DNA" value from our binary "rocky.plist" is actually UTF-8/ASCII for "rocky-dna".

One nifty feature of plist Editor Pro is that from the "List view" tab, you can double click on a binary value represented by a "..." and it opens the data in a hex editor window. This binary inspection would be handy when looking at proprietary encoded data fields (eg MS Office FileAlias values). Or we could just run our script as above and cut and paste any <data> fields to a hex editor ...

From our results above, we can also see that the "Red-Herring.txt" file was correctly ignored by the script and that a total of 66 fields were extracted from our binary and XML plists (as expected).

Now we have a database, we can start SQL querying it for values ...
As the "name" and "value" columns are currently defined as text types, limited sorting functionality is available.

Here are a few simple queries for our test data scenario. Because our test plists are not actual OS X / iOS plists, you'll have to use your imagination/your own test data to come up with other queries that you might find useful/practical. More info on forming SQLite queries is available here.

Find distinct "Aliases"
SELECT DISTINCT value FROM plists
WHERE name LIKE '%Alias%';


Find all the values from the "rocky-xml.plist"
SELECT * FROM plists
WHERE filename LIKE '%rocky-xml.plist';


Find/sort records based on "Weight" value
SELECT * FROM plists
WHERE name LIKE '%Weight' ORDER BY value;

Note: Sort is performed textually as the value column is TEXT.
So the results will be ordered like 125.0, 125.0, 2.5, 2.5, 53.5, 53.5, 65.7, 65.7.

Find/sort records by "Info Expiry Date" value
SELECT * FROM plists
WHERE name LIKE '%Info Expiry Date' ORDER BY value;

Note: This works as expected as the date text is an ISO formatted text string.

The script has been developed/tested on Ubuntu 12.04 LTS (64bit) with Python 3.4.0beta.
It was also tested (not shown) with OS X MS Office binary plists, a Time Machine binary backup plist and a cups.printer XML plist.

Additionally, the script has been run with the bullwinkle test data on Win7 Pro (64 bit) with Python 3.4.1 and on a Win 8.1 Enterprise Evaluation (64 bit) VM with Python 3.4.1

Final Words

The idea was to write a script that grabs as much plist data as it can and leave it to the analyst to formulate their own queries for finding the data they consider important.
The script also allowed monkey to sharpen his knowledge on how plists are structured and granted some valuable Python wrestling time (no, not like that!).
By re-using a bunch of existing Python libraries/code, the script didn't take much time (or lines of code) to put together.
The native Python "plistlib" also allows us to execute on any system installed with Python 3.4 (OS X, Windows, Linux) without having to install any 3rd party libraries/packages.
I have not been able to run/test it on a complete OS X system (or on iOS plist files) but in theory it *should* work (wink, wink). I am kinda curious to see how many plists/directories it can process and how long it takes. The bullwinkle test data took less than a second to execute.

Depending on what artefacts you're looking for, you can use the script as an application artefact browsing tool or by using the same queries on data from different sources, you could use it to detect known keywords/values (eg IP theft email addresses, app configuration). Or perhaps you have a bunch of application directories from an iOS device that you're curious about. Rather than having to inspect each plist individually, you can run this script once and snoop away.

The sorting could be made more comprehensive if each data type was extracted to its own table (ie floats in one table, ints in another). However, given that sorting by time already works, that additional functionality might not be of much use.

If anyone uses this script, I'd appreciate hearing any feedback either via the comments section below or via email. It'd be nice to know if this script is actually of some use and not just another theoretical tool LOL.


Thursday, 23 January 2014

Android SMS script update and a bit of light housekeeping

Knock, Knock ...

During recent research into Android SQLite databases (eg sms), Mari DeGrazia discovered a bug in the sms-grep.pl script.
Mari's test data was from a Samsung Galaxy S II. It turns out the script wasn't handling Cell Header "Serial Type" values of 8 or 9.
These Cell Header values are respectively used to represent "0" and "1" integer constants and eliminate the need for a corresponding 0x0/0x1 byte value in the Cell Data field section.
So this meant that some fields were being interpreted as "0" when they were actually set to "1". DOH!

The previous Android test data I used did not utilize these particular cell header values, which is why it escaped my monkey-like attention to detail. Banana? Where?!

Anyway, there's an updated version of the sms-grep.pl script available from GitHub here.

Pictures speak louder than words so lets look at a simplified example of an SQLite cell record:

SQLite Cell Record Structure


From the diagram above, we can see the usual SQLite record format. A Cell Size, Rowid and Cell Header Size followed by the rest of the Cell Header and the Cell Data sections.
Notice how HeaderField-B = 0x8? This means there will be no corresponding value written in the Cell Data section (ie there is no DataField-B).
When read, the extracted value of DataField-B will be set (to 0) based on the HeaderField-B type (0x8).
Alternatively, if the HeaderField-B type value was 0x9, the extracted value of DataField-B would be set to 1.
Simples!
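For reference (the script itself is Perl, but here's the idea in Python), the full Serial Type to Cell Data length mapping looks like this - note how types 8 and 9 use zero data bytes:

def serial_type_to_length(serial_type):
    # Returns the number of Cell Data bytes used by a given Cell Header Serial Type
    if serial_type in (0, 8, 9):       # NULL, integer constant 0, integer constant 1
        return 0
    if 1 <= serial_type <= 4:          # 1, 2, 3 or 4 byte integers
        return serial_type
    if serial_type == 5:               # 6 byte integer
        return 6
    if serial_type in (6, 7):          # 8 byte integer / 8 byte float
        return 8
    if serial_type >= 12 and serial_type % 2 == 0:   # BLOB
        return (serial_type - 12) // 2
    if serial_type >= 13:                            # text string
        return (serial_type - 13) // 2
    raise ValueError("Reserved serial type: %d" % serial_type)

print(serial_type_to_length(8))   # 0 - no Cell Data byte, the value is the constant 0
print(serial_type_to_length(9))   # 0 - no Cell Data byte, the value is the constant 1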

Additionally, since the previous sms-grep.pl post here - both Mari and I have used sms-grep.pl to carve sms messages from a cellphone's free space.
Here's how it played out:
- Cellebrite UFED was used to generate the .bin physical image file(s) from an Android phone.
- Then the .bin file(s) were added to a new X-Ways Forensics case.
- A keyword search for various phone numbers turned up multiple hits in the 1 GB+ "Free Space" file (ie unallocated space) which was then exported/copied to SIFT v2.14.
- The script's schema config file was adjusted to match the sms table schema.
- After trying the script with a 1GB+ file, we were consistently getting out of memory errors (even after increasing the SIFT VM RAM to 3 GB).
So the Linux "split" command was used to split the 1GB+ file into 3 smaller 500 MB files.
This ran error free although it meant running the script a few more times. Meh, still better than doing it by hand!
As mentioned in a previous post, this script can potentially be used with non-sms SQLite databases especially if the search term field appears near the start of the cell data section.

From now on, all of my scripts will be hosted at GitHub. I'm not sure how much longer GoogleCode will keep my existing scripts so I have also transferred most of those to GitHub.
Because I can no longer update sms-grep.pl on GoogleCode, I have removed the previous version to minimize further confusion.

Apologies for any inconvenience caused by this script oversight and Special Thanks to Mari for both spotting and letting me know about the error!

Friday, 14 December 2012

Cheeky Season's Greetings


Today I thought I would do a brain-dump of some things/tips I've done/encountered since starting my internship about 6 weeks ago.
Hopefully some of it will be useful to others but at the very least it will help ensure I don't forget stuff (an insidious affliction affecting 2 out of 3 older monkeys).
But before I get to that, I'd like to thank everyone who has helped me learn more about DFIR over the past year or so. This includes the generous DFIRers that have responded to my emails/Tweets or shared their knowledge (via blog, book, presentation or those unlucky enough to have a socially awkward meeting with me). Hopefully, I have also thanked you individually in an email/Tweet at some point. Mother monkey did not raise an ungrateful bastard - so if I have forgotten to thank you I apologise.
When I started this blog,  I was hoping it might help me connect with some other newbies. But what I've since realised is that as DFIR constantly changes,  everyone is a potential newbie and having the newbie curiosity is a pre-requisite. Perhaps that is why DFIRers seem so helpful/approachable?
Originally, it was also more of an excuse to draw crude (in every sense of the word) cartoons. But somehow it ended up becoming more technical - hopefully the picture above helps re-balance the universe somewhat.
Anyway, enough of the touchy-feely stuff - on to the forensics!

Bing Bar Parser

Based heavily on the work of others (*cough rip off *cough), I wrote a "bing-bar-parser.pl" Perl script to parse Bing Bar history files.
The Bing Bar is Microsoft's attempt at combining search, Hotmail and other stuff into an IE toolbar. I can't say how popular it is in the age of Google but just in case, here's a tool. This script is intended to provide a Perl based method to view a user's Bing Bar search terms as stored in "searchhs.dat". By writing it in Perl I hoped to both improve my Perl programming and provide a cross-platform tool for fellow SIFT/Linux users.

Specifically, the script was based on the following sources:
- Mari DeGrazia's comprehensive blog post on interpreting the Bing Bar's "searchhs.dat" file.
- The "processBingToolbar" function from the "sep-history-viewer" C# Google code project by "ForensicGeek InTheCorner"
- The SystemTime 128 bit data structure as defined by Microsoft (see the sketch after this list).
- Mark Stosberg's blog on Percent-encoding URIs in Perl.
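Here's a hedged Python illustration of that SystemTime structure - it's just 8 consecutive little-endian 16 bit values (year, month, day of week, day, hour, minute, second, millisecond):

import struct

def parse_systemtime(raw16):
    # Unpack a 128 bit (16 byte) Microsoft SYSTEMTIME structure
    (year, month, dayofweek, day,
     hour, minute, second, millisecond) = struct.unpack("<8H", raw16)
    return "%04d-%02d-%02d %02d:%02d:%02d" % (year, month, day, hour, minute, second)

# Example: pack a known date/time and parse it back
raw = struct.pack("<8H", 2012, 12, 5, 14, 16, 6, 36, 0)
print(parse_systemtime(raw))   # 2012-12-14 16:06:36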

Note: According to the "sep-history-viewer" page, the Bing Bar's "searchhs.dat" file is typically located for XP in:
\Documents and Settings\<user>\Local Settings\Application Data\Microsoft\Search Enhancement Pack\Search Box Extension

For Vista / Win7 it is located in:
\Users\<user>\AppData\LocalLow\Microsoft\Search Enhancement Pack\Search Box Extension

Also note that for a Win7 system, an additional searchhs.dat file (with URI encoding) has been found in:
\Users\<user>\AppData\Local\Microsoft\BingBar\Apps\Search_6f21d9007fa34bc78d94309126de58f5\VersionIndependent

You can use the "bing-bar-parser.pl" script's -d option to decode the URI encoding to something more readable (eg convert %20 to a space character). For the help hints, type "bing-bar-parser.pl -h".

Here's a fictional example script output (fields are pipe separated):

.\bing-bar-parser.pl -f /cases/searchhs.dat -d

Running bing-bar-parser.pl v2012-11-10

File Header field check (should be "FACADE0"): FACADE0
Bing Bar File Version: 1
Record ID for next new record: 341
Number of Records in this file: 335

Record id: 1 | Search Count: 2 | Last Search Time (UTC) = 2012-12-14 Fri 16:06:36 | Term: monkeys
... [ bunch of records edited out ]
Record id: 340 | Search Count: 7 | Last Search Time (UTC) = 2012-12-15 Sat 01:26:39 | Term: monkeys on typewriters

bing-bar-parser.pl v2012-11-10 Finished!

Notice that the "Record ID for next new record" does not reflect the actual number of records in the file.

Limited testing has been done on SIFT v2.14 and on 64 bit Windows 7 with ActiveState Perl v5.16. The script seems to parse data in accordance with Mari's blog post.

For more information, please read the comments in the code (you can download it from my google code page here).
Special Thank-yous to Mari, "ForensicGeek InTheCorner" and Mark Stosberg - they saved me a bunch of time!

Viewing previous versions of Websites

 Sometimes forensicators might need to look at previous versions of a website.
The Wayback Machine takes historical snapshots of selected websites and allows users to view them in their former glory.

Google also caches pages. You can view the latest cached page using the URL:
http://webcache.googleusercontent.com/search?q=cache:http://nfl.com

where http://nfl.com is the URL in question.
Images/animations may not load so you might have to click on the "Text-only version" link provided.
Incidentally, this cached page is what is loaded when you hover over a Google search result.
The Google cache might be handy if the page has been recently taken offline and has not made it into the Wayback Machine.

Steam Profiles

One of my practice cases involved investigating a PC for communications between a particular user who had "run away" and another party.
I noticed various references to a Steam profile in the Internet History.
Steam is used to download/play online games (eg Call of Duty, Left for Dead) and keeps various statistics on:
- if a user is currently online
- a user's playing time (eg 5 hours in the last 2 weeks)
- a user's game achievements
- a user's friends

So I thought it might be worth looking up this user's Steam profile. It's easy enough - go to: http://steamcommunity.com/apps
and type in the user's Steam profile name. If their profile is public, it should be returned by the search.

In my case, I was able to view the profile and saw that they had recently played and had also included their current location in their profile information. Of course, the user could have typed anything in for their current location but at least there was some indication that the user was still active.

Extracting from an SQLite Database

Say you want to extract the information from an SQLite database in CSV format for re-importing into another program. Here's one way of doing it courtesy of The Definitive Guide to SQLite (2 ed. 2010) by Allen and Owens ...

1. If required, install the latest "sqlite" command line shell from here.
Note: SIFT already has a version of "sqlite" installed.

2. Run the "sqlite" command with the input sqlite file as an argument. For example type:
"sqlite cookies.sqlite"

3. (Optional) Type ".schema" to list the tables, indexes etc. contained in the given sqlite file.

4. Type ".separator ," to change the default separator to a comma.

5. Type ".output ffcookies.csv" to specify an output file (eg "ffcookies.csv").

6. Run your query for the data that you're interested in. For example type:
"select * from moz_cookies;"

7. Type ".quit" to quit the "sqlite" command line shell.

The query results should now be CSV formatted in the nominated output file.
Alternatively, you can use the SQLiteManager Firefox plugin to open an sqlite file and export it as CSV.
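
If you find yourself doing this a lot, you could also script it. Here's a rough Perl sketch (not the book's method) which assumes the DBI and DBD::SQLite modules are installed (eg via "sudo cpan DBI DBD::SQLite") and re-uses the cookie filenames from the example above:

#!/usr/bin/perl -w
# Rough sketch only - dumps a table to a (naively formatted) CSV file
# Assumes the DBI and DBD::SQLite Perl modules are installed
# Database/table/output filenames are just the examples from above - adjust to suit

use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=cookies.sqlite", "", "", { RaiseError => 1 });
my $sth = $dbh->prepare("SELECT * FROM moz_cookies");
$sth->execute();

open(my $out, ">", "ffcookies.csv") or die "Can't open output CSV file!";
print $out join(",", @{$sth->{NAME}}) . "\n"; # header row of column names

while (my @row = $sth->fetchrow_array())
{
    # Naive CSV output - values containing commas/quotes would need proper escaping (eg via Text::CSV)
    print $out join(",", map { defined($_) ? $_ : "" } @row) . "\n";
}

close($out);
$dbh->disconnect();

As with the shell method, just change the database filename, table name and output filename to suit your data.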

So that's all folks! Thanks for reading/commenting over the past year and here's wishing you a Happy Holiday Season (and a Happy New Year too if I don't post anything in the near future).

Tuesday, 10 April 2012

I Thought I Saw A Twittering TweetDeck Monkey


Girl, Unallocated recently asked why I wasn't on Twitter. I'll admit that when Twitter first came out I was pretty skeptical - "Who would want to read about my over-ripe bananas?" and "140 characters isn't nearly enough to insult someone (creatively)". Not having a Twitter capable mobile device was just another nail in the coffin. However, after GU's recent questioning, I've since discovered that a lot of forensicky update notifications occur on Twitter and it's also a great way to stalk meet new people (I am obviously joking ... I don't really care about forensicky updates ;).
Anyhoo, I started looking into a couple of PC Twitter clients called TweetDeck and MetroTwit. A recent YouTube demo video showed MetroTwit using 600 MB of RAM just in normal operation. When you consider that the ancient WinXP PC I'm using has 2 GB RAM and I also like to run SIFT on VMWare, that made the decision pretty easy. Hello, TweetDeck! *I also noticed that Corey Harrell has used it - so I know at least one other person might find this post useful/semi-interesting ;)

I did a brief check after installation, and TweetDeck "only" uses ~100 MB. A bit steep for 140-character messages, but methinks my trusty old PC can live with that. Also, as TweetDeck has been bought by Twitter, you kind of hope they have their functionality and security all sorted. Except ... recently there was an issue with TweetDeck granting a user access to other people's TweetDeck/Facebook profiles. This user wasn't actively trying to subvert the system - he just opened up TweetDeck and there it all was. So TweetDeck shut down for a while and they (presumably) fixed the issue(s). Not particularly inspiring stuff eh?

Which brings me to the topic of this post - I was initially interested to see what artifacts (if any) TweetDeck leaves behind. And it was while monkeying around with TweetDeck that I discovered that the Twitter web search functions also return some interesting information.

TweetDeck v1.3 (on WinXP)

The first thing TweetDeck wants you to do after installing is to register an email address and password for the TweetDeck interface. Chances are, most people will use the same login/password as their Twitter account. Tsk, tsk.

After launching TweetDeck and adding my Twitter profile, I ran SysInternals TCPView and found that TweetDeck connects via TCP over port 443 (HTTPS) to both "twitter.com" and "amazon.com" domains. Presumably, Twitter has outsourced the storage of tweets/images to Amazon's cloud storage. These connections seemed to time out if nothing was being browsed.
I wasn't going to go hardcore and network capture the traffic because a shortish time ago, in a University not so far away, Mike Wilkinson told my Network Forensics class that intercepting live network traffic is best left to the cats with warrants (in this part of The Empire anyway).

So then I ran SysInternals ProcMon (v2.96) and tried to see if TweetDeck was reading/writing any account info to the Windows Registry ... No such luck! Either I missed the entries (not uncommon) or they just weren't there. I did however, notice that the following file got accessed quite a lot:

C:\Documents and Settings\USER\Local Settings\Application Data\twitter\TweetDeck\localStorage\qrc__0.localstorage

I have obviously changed the profile name to protect the guilty ;)

So firing up our trusty WinHex hex editor revealed the following file header snippet in "qrc__0.localstorage":

SQLite3 DB

Look familiar? Those sneaky TweetDeck Developers!
So I copied the "qrc__0.localstorage" over to SANS SIFT to play around with it a little more (using the SQLite Manager Firefox Plugin) and here's what I found:
- The file contains one SQLite table called "ItemTable"
- "ItemTable" consists of a "rowid", "key" and "value" column fields
- "rowid" seems to increase with use - presumably as key/value fields are updated

Most of the "value" fields appear to be JSON encoded. More information about JSON is available here. But the 1 minute summary is:

- JSON is an acronym for "JavaScript Object Notation"
- JSON is a text based encoding system based on name/value pairs
- Possible value types are: strings (delimited by ""), numbers (base 10, can be negative/decimal fractions/base 10 exponentials), objects, arrays, true, false, null
- An object is defined using { } to enclose a name separated by a colon and then the value.
eg {object_name: object_value}
- An array is defined using [ ] to enclose a series of comma separated values.
eg [array_value1, array_value2]
- It is possible to nest values. So as we will see later, you can have an array of tweet objects each of which has an array of key/value objects. This can make it difficult to sort out all the different brackets and commas.
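
To illustrate that nesting, here's a tiny fictional snippet (made up - not actual Twitter output) showing an array containing two tweet objects, each with a nested user object:

[ {"text":"First tweet", "user":{"screen_name":"Batman"}},
  {"text":"Second tweet", "user":{"screen_name":"Superman"}} ]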

So what are the actual key/value fields in the "ItemTable" database?

Normally, there are 7 key/value pairs in the "ItemTable" database table:
"tweetdeck_account" = An "email" object representing the TweetDeck user's login (eg "user@email.com").
"columns" = Suspect it represents user column arrangement preferences for the GUI.
"hoard" = An object with it's name derived from the "email" object value (eg "user@email.com"). Used to store a 312 character alphanumeric string which I suspect is a hash/encrypted string (or maybe a group of such strings).
"clients" = Several objects representing account information plus the last 4 TweetDeck search terms.
"__version__" = 2.
"_session" = A 44 character alphanumeric string which probably represents a session key. It seems to change with each launch.
"feeds" = Suspect it has to do with the user defined columns each having its own data connection.

Additional testing revealed that if "Keep Me Signed In" is ticked on the TweetDeck login page and then TweetDeck is closed, these artifacts remain readable in "ItemTable" (using SQLite Manager).
If TweetDeck is closed and "Keep Me Signed In" is NOT ticked, then "ItemTable" appears empty (in SQLite Manager). However, the data still appears to be (partially) viewable with WinHex (looking at "qrc__0.localstorage").

By chance, I copied an active/open "qrc__0.localstorage" to SIFT and noticed that there were 2 extra key/value fields in "ItemTable" when TweetDeck is running. These field names are prepended with a ":" (indicating they're in shared memory) and are called:
":CACHED_HASHTAGS"  = Had no values
":CACHED_TWITTER_USERS_V2" = Appears to be a cache of Twitter user profiles that the TweetDeck user is "following". Example format follows:
{"lowercase_handle1":["123456781", "CapitalizedHandle1", "Proper Name1", "http link to profile1 pic"], "lowercase_handle2":["123456782", "CapitalizedHandle2", "Proper Name2", "http link to profile2 pic"]}

Note: The numerical fields turn out to be Twitter user ID numbers.

I was only able to see these additional fields after copying an active "qrc__0.localstorage". The weird thing was, after closing/reopening/closing (staying logged in) TweetDeck, I was still able to see these fields in both SQLite Manager and WinHex. After I subsequently opened/logged out/closed TweetDeck, these fields were no longer visible in SQLite Manager but some of the data still appears viewable with WinHex.
Perhaps copying the active file meant the contents of shared memory were also auto-magically written to the new file copy. The subsequent opening(s) of this database file then showed the previously recorded ":" fields in the database.
And perhaps TweetDeck/SQLite doesn't delete/overwrite all the data in a table when "deleting" a table's contents (eg after logging out). Perhaps it just deletes the addressing mechanism? Hence, while you can't see the table in SQLite Manager, you can still see some data using WinHex? To quote Doris Day et al: "Perhaps, perhaps, perhaps".

OK, so we've taken a look at "ItemTable" - now what?

One of the more interesting things I noticed in "ItemTable" was that the "columns", "clients" and "feeds" fields each contain a recurring multi-digit number. Similar looking numbers were also present in ":CACHED_TWITTER_USERS_V2".
Suspecting this number to be a Twitter user ID number, I went to the Twitter Developer site and found this article regarding the Twitter REST API (API = Application Programming Interface). It lists lots of searches that you can perform from the comfort of your favourite browser. These search queries are run against the Twitter database via a Twitter web interface. The number of searches may be limited per hour and may not be all-inclusive.
So to confirm that the multi-digit number I was seeing in my "ItemTable" table was a valid Twitter user ID number, I typed the following into Firefox on SIFT.

https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX
where XXXXXXXXX = Twitter user ID number

SIFT then auto-magically launched a save dialog for the returned JSON file. I then used gedit to display it and saw a bunch of JSON text. The first key/value field was "id":XXXXXXXXX followed later by the corresponding "name" and "screen_name" values of the TweetDeck user. The JSON also showed various other profile info which is described later in the testing section below.

So, given a viable TweetDeck "qrc__0.localstorage" database file, we can derive a TweetDeck user's Twitter user ID number. We can then use the Twitter REST API with that user ID number to determine that user's handle name. Subsequently, I also ran the lookup query using the numbers found in the ":CACHED_TWITTER_USERS_V2" user profiles. These numbers were found to correspond to their surrounding handle name.
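
As an aside, if you'd rather not save and open each JSON file manually, something like the following rough Perl sketch should fetch and decode a lookup result in one go. It assumes the LWP::Simple and JSON modules are installed, uses the v1 lookup URL described above and expects you to substitute a real Twitter user ID number for XXXXXXXXX:

#!/usr/bin/perl -w
# Rough sketch only - assumes LWP::Simple and JSON are installed (eg via "sudo cpan LWP::Simple JSON")
# Note: fetching https URLs may also require the LWP::Protocol::https module

use strict;
use LWP::Simple;
use JSON;

my $url = "https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX"; # substitute a real user ID number
my $json_string = get($url);
die "Lookup failed!" unless (defined $json_string);

my $perl_scalar = JSON->new->allow_nonref->decode($json_string);

# lookup.json returns an array of user objects - print the handle of the first one
print "screen_name = " . $perl_scalar->[0]->{"screen_name"} . "\n";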


More Twitter REST API Searches

We can also use the Twitter user ID number/handle to perform other queries with the Twitter REST API:

For retrieving a Twitter user's Profile info we can use:
https://api.twitter.com/1/users/show.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/users/show.json?screen_name=TwitterHandle
Note: the "screen_name" argument is not case sensitive
For more info see here.

To list a Twitter user's "follower" ID numbers (eg who is following user_id=XXXXXXXXX?) we can use:
https://api.twitter.com/1/followers/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/followers/ids.json?screen_name=TwitterHandle
For more info see here.

To list a Twitter user's friend ID numbers (eg who is user_id=XXXXXXXXX following?) we can use:
https://api.twitter.com/1/friends/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/friends/ids.json?screen_name=TwitterHandle
For more info see here.

Given a list of Twitter user ID numbers (eg from a friend/follower lookup), you can use a comma separated list to retrieve each Twitter user's profile info via:
https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX, YYYYYYYYY, ZZZZZZZZZ
For more info see here.

You can also search Twitter by term:
http://search.twitter.com/search.json?q=TwitterHandle
This will return the last week's worth of tweets containing the given TwitterHandle (eg tweets at/from/mentioning). It might not return 100% of all possible results. For more info see here.
Alternatively, you can perform a conventional web search at:
http://twitter.com/#!/search
This will return the results within the browser. If you need to record these results, having a JSON file (via the first search method) might make documentation easier rather than a screenshot/printout for the second method.

Additional Resources (Stuff I didn't know where else to put)

This article describes how each tweet has its own unique ID number.
This website allows you to retrieve a Twitter user's handle (eg Batman) from their Twitter user ID number or vice-versa. The website also states that the numeric user ID is like a primary key (unique number) that does not change. So if a user changes their Twitter handle, they can still be found via their unique ID number.
I also found out about these similar sites (here and here) which translate from a Twitter user's handle to a Twitter user's ID number (one way only).
From my limited testing, the results from these sites appear to be consistent with each other. Given a choice however, I would trust the Twitter REST API over these webpages. All that stuff about getting it straight from the horse's mouth etc.
Whilst doing a search for TweetDeck artifacts, I came across this since superseded article. It describes the database schema for a previous version of TweetDeck. It's interesting that the developers have since changed to a more generic (and harder to understand) schema of key/JSON value pairs. An attempt at extra security through obscurity?

Enter the Twit-Monkey!

Manually parsing through all this JSON text was making me cross(eyed) and I vaguely remember Harlan Carvey mentioning he had written a Perl script that read JSONs. Just FYI, I'm going for a name drop record in this post :)
Anyhoo, I thought it would take me as long to read through some of these search results as it would to write a Perl script to print out JSONs in a more readable format (how's that for cocky?). So then I *strutted* to CPAN and found the aptly named JSON Perl package. This package has functions to both encode/decode JSONs and best of all, it has a function to "pretty print" JSONs in a more human readable format.

To install it on SIFT I typed:
"sudo cpan JSON"
and then said yes to any dependency installs.

Now we should be ready to write the script. As usual, my code is pretty hack-tacular but it seems to work OK for each of the search types listed above.

The Code

#CODE BEGINS ON LINE BELOW
#!/usr/bin/perl -w

# Perl script to pretty print JSON files

use strict;

use Getopt::Long;
use JSON;

my $version = "json-printer.pl v2012-04-08";
my $help = 0;
my $filename="";
my @jsonlist;

GetOptions('help|h' => \$help,
            'f=s' => \$filename);

if ($help || $filename eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to pretty print JSON files.\n");
    print("Example: json-printer.pl -f search-5.json\n");
    exit;
}

open(my $tfile, "<", $filename) or die "Can't Open $filename JSON File!";
@jsonlist = <$tfile>; # extract each line to a list element
chomp(@jsonlist); # get rid of newlines
close($tfile);

my $json_string = join(' ', @jsonlist); # join list elements into one big happy scalar string

my $json = JSON->new->allow_nonref; # create a new JSON object which also accepts non-reference JSON values (eg a bare string or number)
my $perl_scalar = $json->decode($json_string); # converts JSON string into Perl hash(es)

# at this point (if you wanted to) you can add code to iterate thru the hash(es) to extract/use values.

my $pretty_printed = $json->pretty->encode( $perl_scalar ); # re-encode the hash just so we can then pretty print it (hack-tacular!)
print $pretty_printed;

#CODE ENDS ON LINE ABOVE


Code Summary

This short script ("json-printer.pl") lets the JSON Perl package do all the heavy lifting. First, it reads the user specified JSON file into a scalar string ("$json_string") and then it calls "$json->decode" to convert the string into a Perl hash data structure. The script then calls "$json->pretty->encode" to convert that hash data structure into a more human readable string format. It's a round-about way of doing things but it seems to work.

Note: "$perl_scalar" is a reference to a hash data structure. If you wanted to, you could use it to iterate through the hash structure and retrieve any of the individual values.

Code Testing

You'll just have to trust me on this and/or try out the code yourself. The script above is able to read the JSON data returned by the various Twitter REST API searches listed previously.
I don't feel comfortable displaying other people's tweeting information here (eg their user id, what method they use to tweet). It's probably no more detailed than what you could get by searching www.twitter.com/search but I just don't think it's my place.

Anyhoo, as some of the REST API searches mentioned previously return information about individual tweets (eg in a user profile's "status" field), I thought it would be interesting to have a look at these returned tweet fields (without divulging real data). Each tweet in a JSON appears to have the following format:

"to_user_id" = Destination Twitter user ID number (in unquoted numeric format).
"source" = Source URL which can show how tweet was posted (eg via twitter.com, tweetdeck, tweethopper etc). Some URL arguments  may also list the device (eg iPad, Android).
"profile_image_url" = HTTP URL showing the source's avatar filename.
"profile_image_url_https" = HTTPS URL showing the source's avatar filename.
"created_at" = Date and Time of tweet creation.
"text" = Tweet text including @ addressing (eg "@Superman How do you keep your tights so clean?").
"id" = Unique tweet ID number (in "" quoted string format).
"in_reply_to_status_id" = When available, shows the tweet ID number this tweet is replying to (in "" quoted string format).
"from_user_id_str" = Source's Twitter user ID number (in "" quoted string format).
"from_user" = Source's Twitter user handle (eg "Batman").
"geo" = Source's(?) location information.
"from_user_id" = Source's Twitter user ID number (in unquoted numeric format).
"id_str" = Unique tweet ID number (in "" quoted string format).
"iso_language_code" = Presumably the language used to tweet (eg "en" = English).
"to_user_id_str" = Destination's Twitter user ID number (in "" quoted string format). Can also be null.
"to_user" = Destination's Twitter user handle string (eg "Superman"). Can also be null.
"from_user_name" = Source's Twitter user personal name string (eg "Bruce Wayne")
"to_user_name" = Destination's Twitter user personal name string (eg "Clark Kent"). Can also be null.

Note: This list is not all inclusive (eg there's some optional reply fields I've left out).
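
To put those fields in context, here's a fictional, abbreviated tweet object (values made up, several fields omitted):

{"created_at":"Sat, 14 Apr 2012 01:23:45 +0000",
 "from_user":"Batman",
 "from_user_id":123456781,
 "from_user_id_str":"123456781",
 "from_user_name":"Bruce Wayne",
 "to_user":"Superman",
 "to_user_id":123456782,
 "to_user_id_str":"123456782",
 "to_user_name":"Clark Kent",
 "id_str":"190000000000000001",
 "text":"@Superman How do you keep your tights so clean?",
 "geo":null,
 "iso_language_code":"en"}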

Twitter help uses the term "Username" to denote the addressing mechanism name (eg @Batman). It then uses "Name" to represent your personal name ID (eg "Bruce Wayne"). Got it? Now here's where it might get a little confusing ...
In the above tweet message, the "to_user_name" and "from_user_name" contain the personal ID Names (eg "Bruce Wayne"). The "to_user" and "from_user" contain the addressing mechanism names (eg "Batman").

Thanks to Frank McClain for sending me a "Mice" tweet to analyse/help confirm my findings. Quack-tacular!

Some of the REST API searches mentioned previously may return a Twitter user's Profile information. Here are some selected Profile fields I found interesting:

"friends_count" = How many Twitter users this user is following (in unquoted numeric format).
"profile_image_url" = HTTP URL string showing user's avatar image filename.
"profile_background_image_url_https" = HTTPS URL string showing user's profile background image filename.
"url" = User's website URL string (eg "www.blog.com").
"id" = User's Twitter user ID number (in unquoted numeric format).
"screen_name" = User's handle/screen name (eg "Batman").
"location" = User's location string (eg "Gotham").
"lang" = Presumably the user's tweet language (eg "en" for English).
"followers_count" = How many Twitter users are following this user (in unquoted numeric format).
"name" = User's personal name ID (eg "Bruce Wayne").
"description" = User's witty description of themselves (eg "The Dark Knight").
"status" = Object containing the user's most recent (unprotected) sent tweet (see above tweet format for more details).
"profile_image_url_https" = HTTPS URL string showing user's avatar image filename.
"created_at" = Date and Time when the user's account was created.
"utc_offset" = Appears to be the numeric timezone offset in seconds from GMT (eg -21600).
"profile_background_image_url" = HTTP URL string showing user's profile background image filename.
"protected" = Boolean presumably representing whether the user's tweets are protected (eg true/false).
"id_str" = User's Twitter user ID number (in "" quoted string format).
"time_zone" = Appears to be string based representation of time zone (eg "Central Time (US & Canada)").
"geo_enabled" = Boolean presumably representing if the user's location is disclosed.
"profile_use_background_image" = Boolean representing if the user is using their background profile picture.

Adding to the name confusion mentioned earlier, we now see the "screen_name" and "name" keys. In this case, "screen_name" seems to refer to the addressing mechanism name (eg "Batman") and "name" refers to the personal ID name (eg "Bruce Wayne"). And while not explicitly present in these fields, there's also the commonly used "handle" term (eg "Batman"). Also note that "id" in a tweet message refers to the unique tweet ID number whereas "id" in a profile represents a Twitter user's ID number ... Consistency? Pffft! We don't need no stinkin' consistency! Everyone's a critic huh?

Summary

We've installed TweetDeck and seen that it can store/leave behind the last 4 user search terms and multiple Twitter user ID number artifacts. These user ID numbers can then be used to find corresponding Twitter handles/addressing mechanisms via a Twitter REST API web search. The Twitter REST API can also be used to retrieve other information about the user's followers/leaders/profile.
We have also coded a small Perl script ("json-printer.pl") to make it easier to read the JSON formatted results returned from the Twitter REST API.
Any comments/suggestions/requests for further name droppage (the record now stands at 5 / 6 if you count Doris) are welcome.
FYI This will probably be my last post for a few weeks (month?) as I have to swap my PC for some renovation work boots. I will still be contactable via email / Twitter though. Hopefully, I'll be back with some new ideas and all of my digits in the near future ;)