
Wednesday, 17 July 2013

G is 4 cookie! (nomnomnom)

What is it?

A Linux/Unix based Perl script for parsing cached Google Analytics requests. Coded/tested on the SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script (gis4cookie.pl) can be downloaded from:
http://code.google.com/p/cheeky4n6monkey/downloads/list

The script name is pronounced "G is for cookie". The name was inspired by this ...




Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie-type information such as visiting times, page title, website hostname, referral page, any search terms used to find the website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari DeGrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "google-analytics.com/__utm.gif?" string and store any hit file offsets (a rough sketch of this search follows the list).
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytics arguments, which are printed either to the command line or to a user specified Tab Separated Value (TSV) file.
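
Step 1 boils down to calling out to the system grep binary (which the script relies on - see "Other Items of Note" below) and collecting the byte offsets it reports. Here's a minimal sketch with assumed variable names (not the script's exact code):

# A minimal sketch (assumed variable names) of step 1 - collecting the byte
# offset of every "google-analytics.com/__utm.gif?" hit via the system grep.
my @hits;
foreach my $line (`grep -oba "google-analytics.com/__utm.gif?" "$inputfile"`)
{
    push(@hits, $1) if ($line =~ /^(\d+):/); # grep -b prints "offset:match"
}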

The following Google Analytics arguments are currently parsed/printed:
utma_first_time
utma_prev_time
utma_last_time
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_last_time
utmz_sessions
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.
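
As a rough illustration of the parsing in step 3 (a sketch only - the variable names are assumed and the actual script handles many more arguments, as listed above), each extracted URL string can simply be pattern matched for the argument names of interest, with "NA" kept as the default for anything missing:

# A minimal sketch (assumed variable names) of pulling selected arguments
# out of an extracted __utm.gif request string stored in $url.
my %fields = (utmhn => "NA", utmp => "NA", utmdt => "NA");
foreach my $arg (keys %fields)
{
    $fields{$arg} = $1 if ($url =~ /[?&]$arg=([^&]*)/);
}
print join("\t", $fields{utmhn}, $fields{utmp}, $fields{utmdt}), "\n";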

To Use It

You can type something like:
./gis4cookie -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.
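
The decoding itself amounts to replacing each %XX sequence with the character it represents. A single-pass regex substitution (a sketch only - the script's implementation may differ) behaves exactly like the %2520 example described in the notes below:

# A minimal sketch of one pass of percent-decoding (not necessarily the
# script's exact implementation).
my $s = "Search%2520terms";
$s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge; # "%25" becomes "%", leaving "Search%20terms"
print "$s\n";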

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "google-analytics.com/__utm.gif?".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable.
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmr (referring URL) variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("\x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in "google-analytics.com/__utm.gif?"). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)

Thursday, 21 February 2013

Creating a Perl script to retrieve Android SMS


This script/post was inspired by Mari DeGrazia after she had to manually parse hundreds of Android SMS messages. Without her prior research and the principles she discusses in her post, there's little chance I would have attempted this script. Thanks for sharing Mari!
This post continues on from where Mari's post ended. We'll look further at an example Android SMS SQLite schema and then use it to explain how our SMS extraction script (sms-grep.pl) works. We will also walk you through how to use our script and what kind of output you can expect.

UPDATE 2014-01-23:
The code for sms-grep.pl has been revised/relocated to GitHub
It now handles Cell Header "Serial Type" values of 8 and 9 (ie header values representing the integer constants 0 and 1).




UPDATE 2013-04-25:
Like a punch in the guts, some additional issues have arisen with the previous sms-grep.pl script (v2013-02-16).
Changes were made. Some bananas may have been thrown.


Issue #1
During further testing, we noticed that the initial version of sms-grep.pl was not reading some INTEGER fields correctly.
This has been corrected in the latest version 2013-04-14 with the addition of a new function "read_payload_int".
The previous script version tried to read payload INTEGER fields as VARINTs. This seems to work with positive integers less than 128 and so went unnoticed - as integer payload data in the "sms" table is typically limited to 1s or 0s. However, the "status" field can be -1 and "thread_id" can be greater than 127, so the correction was made.

Issue #2
The script searches for a known phone number ("address" field) and then tries to go backwards a set number of fields until it hits the cell header size. Previously, it treated the payload fields prior to the "address" field (ie "thread_id") as VARINTs (like the cell header fields). As mentioned previously, this should not prove troublesome if the "thread_id" field is a single byte between 0 and 127. However, if the "thread_id" is greater than 127 or uses multiple bytes, there may be issues with ascertaining the cell header size and hence parsing the sms cell. See also the sms-cell-example-revised.png pic shown below in the original post.

The new version of the script requires the "-p" argument which represents the number of bytes between the last cell header field (VARINT) and the phone number "address" field. For our example schema, using "-p 2" means there are 2 bytes being used for the "thread_id" which sits in between the last cell header field and the "address" field.
This also means that to be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes. I decided to make it an argument rather than hard code it because the schema may change in the future. In practice, the "thread_id" will probably not exceed 2 bytes as the maximum "thread_id" value of 0xFFFF should be sufficiently large.

Changes to Configuration File Format
As a result of the changes made for Issue #2, the configuration file no longer requires the PHONE type marker for the "address" field.
Instead, the "address" field can be declared as TEXT and the "-p" argument is used to define the relationship between the "address" field and the last cell header field. The example pic of the sample configuration file format has been edited accordingly.

To run it you now type something like:

sms-grep.pl -c config.txt -f mmssms.db -p 1 -s "5555551234" -o output.tsv
and 

sms-grep.pl -c config.txt -f mmssms.db -p 2 -s "5555551234" -o output.tsv

Testing
Limited testing with Android sms test data was again performed.
The script now seems to handle multiple byte payload integers correctly with the new configuration file format.
As always, users should validate this script for themselves before relying upon the returned results (this is my CYA = Cover Your Ass section). What worked for our test data may not work for yours ...

END UPDATE 2013-04-25 (Original post below also edited/revised)

Introduction

Android stores SMS records in the "sms" table of /data/data/com.android.providers.telephony/databases/mmssms.db. SQLite can also store backups of "sms" table data in the /data/data/com.android.providers.telephony/databases/mmssms.db-journal file (in case it needs to undo a transaction). Journal files are a potential forensic gold mine because they may contain previously deleted data which is no longer visible in the current database.
As far as I'm aware, there is currently no freely available way to easily view/print the sms contents of mmssms.db-journal files.
And while you can query the mmssms.db database directly via SQLite, this will not return any older (deleted) sms entries from database pages which have since been re-purposed.
Our sms-grep.pl script seems to work well with mmssms.db and mmssms.db-journal files and also with unallocated space (although file size limited/hardware dependent).
Additionally, our script will interpret date fields and print them in a human readable format so no more hours spent manually checking/converting timestamps!
Our script is also configurable - so you should be able to use it to look at multiple Android SMS SQLite schemas without having to modify the underlying code.

But before we dive into the script - it's probably a good idea to learn about how SQLite stores data ...

The SQLite Basics

The SQLite database file format is described in detail in Richard Drinkwater's blog posts here  and here.
There's also some extra information at the official SQLite webpage.

OK, now for the lazy monkeys who couldn't be bothered reading those links ...
The basic summary is that all SQLite databases have a main header section, followed by a bunch of fixed size storage pages.
There are several different types of page but each page is the same size as declared in the header.
One type of page is the "table B-Tree" type which has a 0xD byte flag marker. This type of page is used to store field data from the whole database (ie data from all of the tables) in units called "cells". Consequently, this page type will be used to store "sms" table data and because the page format is common to both mmssms.db and mmssms.db-journal files - our carving job is potentially much simpler.
Pages can also be deleted/re-allocated for another type of page so we must also be vigilant about non-"table B-tree" pages having free space which contains old "table B-tree" cell data. Think of it like file slack except for a database.

A 0xD type (ie "table B-tree") page will look like:





Generic Layout of a 0xD page

We can see the 0xD byte is followed by:
- 2 bytes containing the 1st free cell offset (0 if full)
- 2 bytes containing the number of used data cells in page
- 2 bytes containing the 1st used cell offset
- 1 byte fragmentation marker

Then depending on the number of used data cells, there will be a series of 2 byte offsets which point to each used data cell (see the green section in the pic). The cell pointer list starts with the closest data cell first and ends with the "1st used offset" data cell. Each used data cell should correspond to a row entry in a table of the database (eg an "sms" row entry).
Following those cell pointers (green), will be any free/unallocated space (blue) followed by the actual data cells (purple). The blue area is where we might see older previously deleted "table B-tree" data.
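
In Perl terms, the fixed 8 byte page header and the 2 byte cell pointer array can be pulled apart with a couple of unpack calls. This is a minimal sketch with assumed variable names (the script's own parsing may differ):

# A minimal sketch (assumed variable names) of reading a 0xD page header and
# its big endian 2 byte cell pointer array.
my ($flag, $first_free, $num_cells, $first_cell, $frag) = unpack("C n n n C", $pagedata);
my @cell_offsets = unpack("x8 n$num_cells", $pagedata); # skip the 8 byte header, then one offset per used cell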

Breaking it down further, the general form of a data cell (from the purple section) looks like:

Generic Layout of a Cell

We can see there's a:
- Cell Size (which is the size of the cell header section + cell data section)
- Rowid (ie Primary Key of the row)
- Cell Header section (comprised of a "Cell Header Size" field + a bunch of fields used to describe each type/size of field data)
- Cell Data section (comprised of a bunch of fields containing the actual data)

You might have noticed an unfamiliar term called a "varint".
Varints are a type of encoded data and are used to save space. They can be 1 to 9 bytes and require a bit of decoding.
Basically, you read the most significant byte (data is stored big endian) and if its most significant bit is set to 1, it means there's another byte to follow/include. Then there's a bunch of dropping most significant bits and concatenating the leftovers into a single binary value.
Richard Drinkwater's got a better explanation (with example) here.
Later for our script, we will need to write a function to read these varints but for now, just know that a varint can store anywhere from 1-9 bytes (usually 1-2 bytes though) and it requires some decoding to arrive at the "original value".
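
Here's roughly what such a function could look like. This is a minimal sketch only (the script's own read function may differ, eg in how it handles signed values):

# A minimal sketch (not the script's actual function) of decoding a SQLite
# varint starting at $offset within the raw byte string $data.
# Returns the decoded (unsigned) value and the number of bytes consumed.
sub ReadVarint
{
    my ($data, $offset) = @_;
    my ($value, $count) = (0, 0);
    while ($count < 9)
    {
        my $byte = unpack("C", substr($data, $offset + $count, 1));
        $count++;
        if ($count == 9)
        {
            $value = ($value << 8) | $byte; # the 9th byte contributes all 8 bits
            last;
        }
        $value = ($value << 7) | ($byte & 0x7F); # bytes 1-8 contribute their lower 7 bits
        last unless ($byte & 0x80); # high bit clear means this was the last byte
    }
    return ($value, $count);
}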

So for our Android SMS scenario, a typical used "sms" data cell might look like:

Android SMS Cell example

You'll notice that there's a "Cell Header" section highlighted in purple and a "Cell Data" section highlighted in pink.
Think of the Cell Header section as a template that tells us how many bytes to expect for each field in the Cell Data section. The Cell Data section does not use varints to store data.
From the sms-cell-example-revised pic, we can see that most of the Cell Header field types are 0x01 - which means those fields use one byte of data in the subsequent cell data section (pink). Also please note the potential for multi-byte varints for the "thread_id" data field circled in red.
The official SQLite documentation refers to these cell header field type values as "Serial Type Codes" and there's a comprehensive definition table about halfway down the page here.

For our sms example, we can see from the purple section that the sms "Read" and "Type" fields will use 1 byte each to store their data in the Cell Data (pink) section. Looking at the pink section confirms this - the "Read" field value is 0 (0 for unread, 1 for read) and the "Type" field is 1 (1 for received, 2 for sent).
As a rule, if the value of the cell header field type (purple section) is between 0x0 and 0x4, the corresponding data field (pink) will use that many bytes (eg 0x1 means a 1 byte data field, 0x4 means 4 bytes).
If the value of a cell header field (purple section) is 0x5 (eg "Date" & "Date_sent" fields), it will take 6 bytes in the cell data (pink) section. The "Date" and "Date_sent" data fields are 6 byte Big Endian values which (for Android) contain the number of milliseconds since the Unix epoch (1 Jan 1970).
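
Converting one of those 6 byte values is straightforward in Perl. Here's a minimal sketch using a made-up example value (the actual script does this conversion internally):

# A minimal sketch (example bytes only) of turning a 6 byte big endian
# millisecond count into a human readable UTC date/time.
my $raw = "\x01\x3D\x10\x27\xBF\x28"; # hypothetical 6 byte "date" value
my ($hi, $lo) = unpack("n N", $raw); # 2 byte high part + 4 byte low part
my $ms = $hi * (2 ** 32) + $lo;
print scalar(gmtime(int($ms / 1000))), "\n"; # prints the corresponding UTC date/time
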
There's a special case for handling strings. Firstly, the cell header field type value must be odd and greater than or equal to 13. Then to calculate the number of bytes required in the data section we use this formula:

Number of bytes in string = (cell header field type value - 13)/2.

So in our sms-cell-example-revised pic, the corresponding string size for the "Address" field is (0x21 - 0xD) / 0x2 = (33 - 13) / 2 = 10 bytes. I haven't actually shown a value for the "Address" in the pink section so just use your imagination!
Similarly, we can see that the "Body" field will take (0x23 - 0xD) / 0x2 = (35 - 13) / 2 = 11 bytes.
Note: For long sms, the varint containing the "body" header field type has been observed to require 2 bytes.
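
Putting the rules above together, a small helper along these lines (a sketch only, not the script's actual code) maps a cell header serial type value to the number of bytes its data field occupies:

# A minimal sketch (not the script's actual code) of mapping a serial type
# value to the size of the corresponding cell data field.
# (Serial types 10 and 11 are reserved and not handled here.)
sub SerialTypeSize
{
    my $type = shift;
    my @fixed = (0, 1, 2, 3, 4, 6, 8, 8, 0, 0); # serial types 0-9 (8 and 9 are the constants 0 and 1)
    return $fixed[$type] if ($type <= 9);
    return ($type - 12) / 2 if ($type % 2 == 0); # even values >= 12 are BLOBs
    return ($type - 13) / 2; # odd values >= 13 are strings, eg (0x21 - 13)/2 = 10
}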

You might also have noticed that not all of the cell header fields declared in the cell header section (purple) have a matching entry in the cell data section (pink). This is because if a cell header field is marked as NULL (ie 0x00), it does not get recorded in the cell data section (eg the purple "Rowid" header field's 0x00 value means there won't be a corresponding data field in the pink section).
So if we want to retrieve data, we can't go strictly off the schema - we have to pay attention to the cell header section so we can interpret the cell data fields correctly.

So how do/did we know what cell data field was what?
It turns out that SQLite ensures that the order of cell header fields in the cell header (purple section) is the same order as the database schema field order. Consequently, the cell data section (pink) will also appear in schema order (notwithstanding any missing null fields).
We can get the schema of a database file using the sqlite3 command line exe like this:

sansforensics@SIFT-Workstation:~$ sqlite3 mmssms.db
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .headers on
sqlite> pragma table_info(sms);
cid|name|type|notnull|dflt_value|pk
0|_id|INTEGER|0||1
1|thread_id|INTEGER|0||0
2|address|TEXT|0||0
3|person|INTEGER|0||0
4|date|INTEGER|0||0
5|protocol|INTEGER|0||0
6|read|INTEGER|0|0|0
7|status|INTEGER|0|-1|0
8|type|INTEGER|0||0
9|reply_path_present|INTEGER|0||0
10|subject|TEXT|0||0
11|body|TEXT|0||0
12|service_center|TEXT|0||0
13|locked|INTEGER|0|0|0
14|error_code|INTEGER|0|0|0
15|seen|INTEGER|0|0|0
16|date_sent|INTEGER|0|0|0
sqlite>


So we can see that the "sms" table consists of 17 fields with the first being the "_id" (ie rowid) primary key and the last being the "date_sent" field. In practice, the "_id" is typically unused/set to NULL as it is just duplicating the Cell's Rowid varint (from the white section). Some fields are declared as INTEGERS and others TEXT. Notice how the "date" and "date_sent" are declared as INTEGERS? These represent milliseconds since the Unix epoch (UTC).
At this stage, I'm not 100% certain on every field's meaning. We know the "address" field is used to store phone numbers and the "body" field stores the sms text string. From Mari's research we also know that "read" is 1 for a read sms, 0 otherwise and "type" indicates sent (2) or received sms (1). That should suffice for now.

So that's the basics of the data structures we will be looking at. In the next section, we'll share some thoughts on the script. Oh goody!

The Script

At first I thought we could find each 0xD page and iterate through the data cells that way but this would miss any old sms messages contained in pages which have since been re-purposed by SQLite. That method would also miss any corrupted/partial pages containing sms in unallocated space.
So to find the sms messages, we are going to have to come up with a way of detecting/printing individual sms data cells.

The strategy we ended up using was based on the "address" field (ie the phone number).

1. We read in our schema and print flags from a configuration file.

2. We create one big string from the nominated input file.
Perl has a handy function called "index" that lets you find out if a given string is contained in a larger string. We use this "index" function to find all phone number matches and their respective file offsets (see the sketch after this list).

3. For each match's file offset, we then work backwards and try to find the cell header size field (ie the start of cell header).
Looking at the sms-cell-example-revised pic, we can see that there are 17 (purple) varint fields plus the "thread_id" (pink) data field between the "Address" cell value (in pink section) and the cell header length/size field (in purple section). The number of varint fields should be constant for a given schema but it is possible for the number of bytes required for each varint to change (eg the "thread_id" data field is typically 1-2 bytes).

4. Now that we've determined the cell header size file offset, we can read in the header field type varints (ie find out how many bytes each field requires/uses in the cell data section) and also read in/store the actual data.

5. We then repeat steps 3 and 4 until we have processed all our search hits.

6. We can then sort the data in chronological order before printing to screen/file.
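
The search loop in step 2 might look something like this (a sketch only, with assumed variable names):

# A minimal sketch (assumed variable names) of step 2 - finding every
# occurrence of the search phone number in the raw file data.
my @hits;
my $pos = index($filestring, $phonenumber, 0);
while ($pos != -1)
{
    push(@hits, $pos); # remember this match's file offset
    $pos = index($filestring, $phonenumber, $pos + 1); # look for the next hit
}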

The main sanity check of this process is checking the cell header size value range. Remember, the cell header size value should tell us the number of bytes required for the entire cell header (including itself). So for our example schema above, this value should be:
- above the 18 byte minimum (ie number of schema fields plus the size of the cell header length = 17 + 1) and
- below a certain threshold (18+5 at this time).
Most "sms" cell header sizes should be 18 bytes (most of the fields are one byte flags) but for longer "body" fields or large "thread_id" field values, multi-byte varints have been observed which would obviously increase number of bytes required for that cell header. Allowing for an extra 5 bytes seemed like a good start for now.

For more information on how the script works (eg see how painful it is to read a multi-byte varint!) you can read the comments in the code. I dare you ;)

Making it Schema Configurable
As Mari has noted, not every Android phone will have the same schema. So instead of having a different script for each schema, we'll be utilising a configuration text file. Think of the configuration file as a kind of plugin for the script. Each phone schema will have its own configuration file. This file will tell the script:
- what the schema fields are and more importantly, their order,
- which fields are DATES or STRINGS or INTEGERS and
- whether we want to print this field's values


For possible future needs, we have also declared a "c4n6mtype=android" field. This is in case we need to read an iPhone schema sometime in the future (iPhones use seconds since UTC for their DATE fields).

Here's an example of a configuration file (also provided from my GoogleCode Download page as "sms-grep-sample-config.txt"):

Sample Android Configuration File

Notice that it's very similar to the schema we got earlier from sqlite3?
The most significant differences are:

- "date" field (which is now listed as a DATE type)
- "date_sent" field (which is now listed as a DATE type)
- the configuration file uses ":" as a field separator (sqlite3 uses "|")
- the print flags (1 prints the field value, 0 does not print)

The script will ignore any blank lines and lines starting with "#".

Update 2013-04-25: The "address" field is now declared as TEXT (previously declared as PHONE).

Running the Script
The first step would be to determine the schema. The easiest way to do this is to use the sqlite3 client with the mmssms.db as previously shown. Admittedly, this requires access to a database file so if you don't have a sample to work with, you're out of luck.
Next it's time to create the configuration file - making sure to mark any DATE fields (as per the 2013-04-25 update, the "address" field is now declared as TEXT rather than PHONE). Also remember to specify which fields you wish to print.
Once that is done, we can run the script using something like:

sms-grep.pl -c config.txt -f mmssms.db -s "5555551234" -s "(555) 555-1234" -p 1  -o output.tsv

Note: Users can specify multiple phone numbers/formats to search for using -s arguments. At least one -s argument is required.
If no -o argument is specified, the results will be printed to the screen in Tab separated columns - which can get messy with lots of messages. Alternatively, an output Tab separated file (TSV) can be generated (eg using -o output.tsv).

Update 2013-04-25: To be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes.

Any extracted hits will be printed in chronological order based upon the first DATE type schema field declared in the configuration file (eg "date" field for our example configuration file). You will probably see multiple entries for the same SMS which was stored at different file offsets. The date sorting makes this situation easier to detect/filter.

Here's a fictional TSV output example based on the previously shown config file:

Fictional sample TSV output from sms-grep.pl

The arrows in the pic are used by Notepad++ to indicate TABs. We can see that only the print fields marked with a 1 in the configuration file (ie address, date, read, type, subject, body, seen, date_sent) are printed along with the file offset in hex.

Note: If you run the script on a partially complete cell (eg the cell header is truncated by the end of file so there's no corresponding cell data), the script will print out "TRUNCATED" for any strings and -999 for any integer fields. If you see these values, further manual parsing/inspection is recommended.

Testing
Limited testing of the script (initial version) has been performed with:
- 2 separate Android schemas
- Unallocated space (as retrieved by Cellebrite and exported into a new 600 MB file using X-Ways/WinHex)
- A Raw Cellebrite .bin file (1 GB size)

Note: The script failed to run with a 16 GB .bin file - we suspect this is due to a RAM deficiency in our test PC.

As I don't have an Android phone, I've relied pretty heavily on Mari for testing. We've tested it using ActiveState Perl v5.16 on 64 bit Windows 7 PCs. It should also work on *nix distributions with Perl. I have run it successfully on SIFT v2.14.

Additionally, an outputted TSV file has also been successfully imported into MS Excel for subsequent analysis.

Update 2013-04-25: Subsequent re-testing of the new version (2013-04-14) was limited to 2 different schema mmssms.db files in allocated space only. I reckon if it worked for these, it should still work for the other previous test cases.

Validation tips:
grep -oba TERM FILE
Can be used in Linux/SIFT to print the list of search hit file offsets (in decimal).
For example: grep -oba "5555551234" mmssms.db
Additionally, WinHex can be used to search for phone number strings and will also list the location of any hits in a clickable index table.  The analyst can then easily compare/check the output of sms-grep.pl.

What it will do
Find sms records to/from given phone numbers with valid formatted cell headers. This includes from both the SQLite database files (ie mmssms.db) and backup journal files (ie mmssms.db-journal). It should also find any existing sms records (with valid headers) that appear in pages which have since been re-allocated by SQLite for new data. Finally, it should also be able to find SMS from unallocated space (assuming the size isn't larger than your hardware can handle).

What it doesn't do very well
If the cell header is corrupted/missing, the script will not detect sms data.

The script does some range checking on the cell header size and prints a warning message if required. However, it is possible to get a false positive (eg a phone number string is found and there's a valid cell header size value at the expected cell header size field). This error should be obvious from the output (eg "body" field has nonsensical values). The analyst is encouraged to view any such errors in a Hex viewer to confirm the misinterpreted data.

Unallocated space has proved troublesome due to size limitations. The code reads the input file into one big string for searching purposes, so running out of memory is a possibility when running on large input data such as unallocated. However, if you break up the unallocated space into smaller chunks (eg 1 GB), the script should work OK. I have also used WinHex to copy SMS data out from unallocated and paste it into a separate smaller file. This smaller file was then parsed correctly by our script.
The big string approach seemed like the quickest/easiest way at the time. I was targeting the actual database files/database journals rather than unallocated. So consider unallocated a freebie ;)

We have seen some SQLite sms records from an iPhone 4S which does NOT include the phone numbers. There may be another field we can use instead of phone numbers (perhaps we can use a phone book id?). This requires further investigation/testing.

Final words

As always, you should validate any results you get from your tools (including this one!).

This script was originally created for the purposes of saving time (ie reducing the amount of time spent manually parsing sms entries). However, along the way we also learnt more about how SQLite stores data and how we can actually retrieve data that even SQLite doesn't know it still has (eg re-purposed pages).

The past few weeks have flown by. I originally thought we were creating a one-off script for Android but due to the number of different schemas available, we ended up with something more configurable. This flexibility should also make it easier to adapt this script for future SQLite carving use (provided we know the schema). It doesn't have to be limited to phones!

However, this scripting method relies on providing a search term and knowing what schema field that term will occur in. There's no "magic number" that marks each data cell, so if you don't know/cannot provide that keyword and schema, you are out of luck.

I would be interested to hear any comments/suggestions. It can probably be improved upon but at some point you have to stop tinkering so others can use it and hopefully suggest improvements. If you do have problems with it please let me know. Just a heads up - for serious issues, I probably won't be able to help you unless you supply some test data and the schema. 
To any SANS Instructors reading, any chance we can get a shoutout for this in SANS FOR 563? Monkey needs to build some street cred ... ;)

Wednesday, 23 May 2012

A Perl script plays Matchmaker with ExifTool and SQLite


Imagine their cute little offspring!
- Query-able metadata = The ability to sort/query by several metadata fields (not just time).
- Metadata from an well established/tested source library (ExifTool)
- SQLite cross platform compatibility
- And all available for the one low price on SIFT (free!)

Being Perl, it won't be as fast as a compiled language like C/C++.
Also, at this time only .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf and .jpg files are supported.
As a result, I'm pretty sure you wouldn't want to blindly point it at "C:\" as it could take a while/use a lot of memory.
Instead, say you found a directory of interest (eg a suspect's "naughty" directory). You could extract the metadata to the database, take a hash of the database and then send it on to another analyst for further analysis/reporting. Or you could analyse it yourself using standard SQL queries from either the "sqlite3" client or the Firefox "SQLite Manager" plugin.

As far as I know, this capability does not currently exist (at least for open source) so I think this script could prove handy for several different types of investigations. For example: fraud, e-discovery processing/culling, processing jpgs (eg exploitation).

Having a warped sense of humour, I'm calling this extraction tool "SquirrelGripper".
In Australia, to grab a bloke by his "nuts" (accidentally or on purpose or sometimes both) is also known as "the Squirrel Grip". This has been known to happen in the heat of battle/whilst tackling an opponent during a Rugby or Aussie Rules match. Thankfully, I have never had this done to me - I'd like to think I was just too quick but maybe they just couldn't find them LOL.
The idea behind the name is that "SquirrelGripper" (aka "SG") will lead analysts to the low hanging fruit right at the crux of the matter ;)
In polite company (eg client briefing), one could say that "SquirrelGripper" finds all the little nuts/nuggets of information.
By running this script an analyst will hopefully save time. Rather than having to launch the ExifTool exe multiple times for different files/metadata, they can launch this script once and then perform SQL queries for retrieving/presenting their metadata of interest.

Whilst developing this tool, I contacted some more experienced DFIR'ers for feedback. So I'd just like to say a big "ralphabetical" Thankyou to Brad Garnett, Cindy Murphy, Corey Harrell, Gerald Combs, "Girl, Unallocated" and Ken Pryor. Sorry for clogging your inboxes with my semi-organised/semi-chaotic thoughts. Having an audience for my email updates provided both structure and motivation for my thoughts.

An extra bunch of thanks to Corey Harrell (does this guy ever rest?) who was both my testing "guinea pig" and who also made some excellent suggestions. Most, if not all of his suggestions were incorporated into this latest version of SG. I can't wait to see/read how he will use it in his investigations.

Enough already! Let's get started ...

The Design

As SquirrelGripper has grown to over 1000 lines, I have decided to provide the code via my new Google Code project ("cheeky4n6monkey") and just give a general description in this blog post. Like all my scripts, it is provided "as is" and whilst I have done my best, there may still be bugs in the code. As always, perform your own validation testing before using it live.

We'll start by describing how SquirrelGripper will store metadata.
Currently, there are 9 tables in the created database - FileIndex, XLSXFiles, XLSFiles, DOCXFiles, DOCFiles, PPTXFiles, PPTFiles, PDFFiles, JPEGFiles.

FileIndex is the main index table and contains the following fields:
-"AbsFileName" (primary key) = Absolute path and filename. Used as a common key to FileIndex and other file specific tables.
-"DateTimeMetaAdded" = Date and time SquirrelGripper extracted the metadata for a file.
-"FileType" = ExifTool field indicating file type. We can use this to determine which table should contain the remaining file metadata.
-"CaseRef" = Mandatory User specified case tag string.
-"UserTag" = Optional User specified tag string. Can be used for labelling directories/files for a particular SG launch.

The 8 other tables will be used to record file type specific metadata (eg "Slides" in PPTXFiles = Number of Slides for a .pptx file). I decided to use separate tables for each file type because there are simply too many fields for a single table (over 100 in total).

Most of the metadata fields are stored as TEXT (strings) in the database. Some obvious integer fields are stored as INTs (eg Words, Slides, FileSize). GPS co-ordinates from tagged JPEGs are stored in both TEXT and REAL (decimal) format. If in doubt (eg ExifToolVersion), I made the script store the data as TEXT.
Dates are stored as TEXT in the database and I have tested that they can be processed by SQLite functions. For example, "julianday(CreateDate)" works OK. I have also been able to retrieve/sort records by date.

In the interests of everyone's sanity (including my own) I won't detail any more of the schema. You can see them yourself in the code by searching for the "CREATE TABLE" strings.

The code is structured quite similar to "exif2map.pl". We are using the File::Find package to recursively search for the handled filetypes and then we call Image::ExifTool's "GetFoundTags" and "GetInfo" functions to extract the metadata. Depending on the extracted "FileType" tag, we call the appropriate handling function (eg "ProcessPPT"). These handling functions will read selected fields and if the field is defined, the function will insert/replace the metadata into the user specified SQLite database. If the metadata does not exist, a "Not Present" string will be inserted into the table instead.
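
The overall extract-then-insert flow for a single file looks roughly like this. It's a sketch only - the "FileIndex" table and its columns come from the description above, the file path and tag values are made up, and it uses ExifTool's simpler ImageInfo call rather than the GetFoundTags/GetInfo calls the real script uses:

# A minimal sketch (illustrative only) of extracting a tag with Image::ExifTool
# and inserting a row into the FileIndex table via DBI.
use Image::ExifTool;
use DBI;

my $exiftool = Image::ExifTool->new;
my $info = $exiftool->ImageInfo("/cases/sample.docx"); # hashref of tag => value
my $filetype = defined $info->{FileType} ? $info->{FileType} : "Not Present";

my $dbh = DBI->connect("dbi:SQLite:dbname=nutz2u.sqlite", "", "", {RaiseError => 1});
$dbh->do("INSERT OR REPLACE INTO FileIndex ".
    "(AbsFileName, DateTimeMetaAdded, FileType, CaseRef, UserTag) VALUES (?, ?, ?, ?, ?)",
    undef, "/cases/sample.docx", scalar(gmtime), $filetype, "caseABC", "fraud-docs");
$dbh->disconnect;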

Selections of metadata fields were made by throwing darts ... er, by estimating which of the fields would be most likely to hold information of forensic interest.
For example, MS Office 2007+ files have a variable list of "Zip" fields which I thought was of limited value so the script doesn't store these. However, the main fields such as "FileName", "CreateDate", "FileModifyDate" etc. are all included.
Having said that, if you think SquirrelGripper should include a particular field that I've left out, please let me know.

Installing SquirrelGripper

The Image::ExifTool Perl package is already installed on SIFT. We have previously used it in our "exif2map.pl" script. However due to some internal label changes to ExifTool, we need to grab the latest version (v8.90) for our script. We can do this on SIFT by typing: "sudo cpan Image::ExifTool".

The last few posts have also detailed our travels with Perl and SQLite. We have been using the DBI Perl package to interface with SQLite databases. So if you haven't already done it, grab the latest DBI package from CPAN by typing: "sudo cpan DBI".

Next, you can download/unzip/copy "squirrelgripper.pl" (from here) to "/usr/local/bin" and make it executable (by typing "sudo chmod a+x /usr/local/bin/squirrelgripper.pl")

Now you should be ready to run SquirrelGripper on SIFT in all its glory.

To run on Windows, install ActiveState Perl and use the Perl Package Manager to download the ExifTool package (v8.90). DBI should already be installed. Next, copy the "squirrelgripper.pl" script to the directory of your choice.
You should now be able to run SG at the command prompt by typing something like:
"perl c:\squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir c:\squirrel-testdata\subdir1"
See next section for what each of the arguments mean.


Running SquirrelGripper

For this test scenario, I have various .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf files in the "/home/sansforensics/squirrel-testdata/subdir1" directory.
I have also copied various .jpg files to the "/home/sansforensics/squirrel-testdata/subdir2" directory

It is assumed that a new database will be created for each case. However, the same database can be also used for multiple iterations of the script. Just FYI - you can get a help/usage message by typing "squirrelgripper.pl -h"

The script recursively searches through sub-directories so please ensure you've pointed it at the right level before launching. It is also possible to mark different sub-directories with different case tags, eg launch the script with one directory using the case tag "2012-04-18-caseA-companyA", then launch the script a second time pointing to another directory using the case tag "2012-04-18-caseA-companyB". SG can also handle multiple -dir arguments in case you need to extract data from more than one directory (eg "-dir naughty/pics -dir naughty/docs"). If a "-tag" argument is also specified, it will apply to files from both directories.

The first example uses "-newdb" to create the "nutz2u.sqlite" database in the current directory. It also tags all "subdir1" files with the "fraud-docs" user tag (you can see the "UserTag" value in the "FileIndex" table). Currently, the "-db", "-case" and "-dir" arguments are mandatory.
Note: the -dir directory can be an absolute path or a relative one.

squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir /home/sansforensics/squirrel-testdata/subdir1/

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir1/ is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir1/

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/word2k7.docx
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into DOCXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into DOCFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into PPTFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into PPTXFiles table
sansforensics@SIFT-Workstation:~$



The second call assumes the "nutz2u.sqlite" database already exists and tags all "subdir2" files with the "fraud-pics" tag.

squirrelgripper.pl -db nutz2u.sqlite -case caseABC -tag fraud-pics -dir /home/sansforensics/squirrel-testdata/subdir2

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir2 is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir2

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Lat/Long data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg
lat = 36.1147630001389, long = -115.172811
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into JPEGFiles table
sansforensics@SIFT-Workstation:~$


SquirrelGripper has also been similarly tested using ActiveState Perl v5.14.2 on Win XP by myself and Corey Harrell has also tested it using ActiveState Perl v5.12 on Windows 7 (32/64). Please note however, that the majority of my testing was done on SIFT v2.12 and Perl 5.10.0.


Once the metadata has been extracted to the SQLite database we can use SQL queries to find specific files of interest. This can be done via the "sqlite3" client and/or more easily via the Firefox "SQLite Manager" plugin.


Extracting SquirrelGripper results

So now we have our SQLite database chock full of metadata - how can we search it?
We need some basic SQLite commands which we can learn here.
W3Schools and Wikipedia also have some useful general information on SQL queries.

For our first example query we will be finding pictures from a particular camera model and ordering the results by date/time of the original.
Building upon the previous section, we have opened the "nutz2u.sqlite" database by using the Firefox "SQLite Manager" plugin on SIFT.
To do this - open Firefox on SIFT, under the "Tools" menu, select "SQLite Manager" to launch.
Then under the "Database" menu, select "Connect Database", browse to the user specified database (eg "nutz2u.sqlite") and press the "Open" button. Click on the "FileIndex" table tree item on the left hand side and then look under the "Browse & Search" tab. You should now see something like:


nutz2u.sqlite's FileIndex Table

We can see that we've got 12 files of all the different supported file types (see "File Type" column in pic). We can also see the absolute path filename ("AbsFileName"), the Date/Time the data was extracted ("DateTimeMetaAdded"), the case tag field ("CaseRef") and the user tag field ("UserTag").
If we now click on the "JPEGFiles" table tree item and look under the "Browse & Search" tab, we can see which jpeg files had their metadata extracted:

nutz2u.sqlite's JPEGFiles Table

Note: For the file specific tables, you will probably have to use the scroll bar to see all of the fields. Not shown in the pic above are the "Model" and "DateTimeOriginal" fields. We will be using these fields in our SQL query.
To execute a query against the JPEGFiles table, we click on the "Execute SQL" tab and then enter in the following:

SELECT * FROM JPEGFiles WHERE Model='Canon PowerShot SX130 IS' ORDER BY DateTimeOriginal;


We then press the "Run SQL" button and we see that we have found two jpegs that meet our search criteria:

nutz2u.sqlite's SQL Search Results

Closer inspection of the "AbsFileName" fields shows the filenames are "/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg" and "/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg". So given a bunch of .jpgs, we have quickly found multiple pictures that were taken by the same camera model. I thought that was pretty cool, eh?

Similarly, we can run other queries against the database such as:

Finding .docs by same author and sorting by page count:
SELECT * FROM DOCFiles WHERE Author='Joe Friday' ORDER BY PageCount;

Sorting Powerpoint files by Revision Number:
SELECT * FROM PPTFiles ORDER BY RevisionNumber;

Finding all "fraud-pics" user tagged .JPEGs:
SELECT * FROM JPEGFiles, FileIndex WHERE JPEGFiles.AbsFileName=FileIndex.AbsFileName AND FileIndex.UserTag='fraud-pics';

Finding the LastModified time and the user responsible (for .xlsx files) BEFORE a given date/time:
SELECT LastModifiedBy, ModifyDate from XLSXFiles where DATETIME(ModifyDate) < DATETIME('2012-05-01 06:36:54')
Note: We are using the SQLite DATETIME function to convert our string date into a format we can perform comparisons with.

Finding the Filename and Keywords (for .pptx files) where the Keywords contains "dea". This will find keywords such as "death" "deal" "idea". The % represents a wildcard:
SELECT FileName, Keywords from PPTXFiles where Keywords LIKE '%dea%'

Finding the Filename and Template names used (for .doc files) when the Template name starts with "Normal.dot":
SELECT FileName, Template FROM DOCFiles WHERE Template LIKE 'Normal.dot%'

Find Unique Company names by performing a UNION of each table's Company fields. For more information on the UNION keyword see here.
SELECT Company FROM DOCFiles UNION SELECT Company FROM DOCXFiles UNION SELECT Company FROM XLSXFiles UNION SELECT Company FROM XLSFiles UNION SELECT Company FROM PPTXFiles UNION SELECT Company FROM PPTFiles

Finding the Filename and GPS Longitude (for .jpg files) where the GPS Longitude is less than "-115.0":
SELECT FileName, GPSLongitude FROM JPEGFiles WHERE GPSLongitude < '-115.0'

There are LOTS more ways of querying the database because there's a shedload of other fields we can use. Check out the script code (search for "CREATE TABLE") or create your own example database (using SG) for more details on what metadata fields are being extracted. Where possible, I have used the same field names as the ExifTool exe prints to the command line.


Final Words

We have written a Perl script ("squirrelgripper.pl") to extract file metadata into an SQLite Database. We have also shown how to perform some basic SQL queries against that database.

In total, for the .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf, .jpeg file types there were over 100 fields to extract. Whilst I double-checked my script, it is possible I've made some errors/overlooked something. It is also possible to process other file types (eg EXE, DLL) but I thought I would wait for further feedback/specific requests before continuing.

UPDATE: Just had this thought - If/when I do add the EXE & DLL functionality, SquirrelGripper could also be used for locating known Malware/indicators of compromise as well. I've created a monster!

I suspect there will be some commonly used SQL queries across certain types of investigations. So analysts could build up a standard set of queries to run across multiple cases of the same type. That could save analysts time / help narrow down files of interest.

If you found this script useful and/or you have any questions, please drop me a Tweet / email / leave a comment. The more I know about if/how forensicators are using SquirrelGripper - the better I can make it. Now that the basic framework is in place, it should be pretty quick to add new fields/file types so long as they are supported by the Image::ExifTool package (v8.90).

Tuesday, 3 April 2012

Creating a RegRipper Plugins Maintenance Perl Script


I recently asked Corey Harrell about any unfulfilled programming ideas he might have and he told me about his idea for a RegRipper Plugins maintenance Perl script.

He was after a script that would go into the plugins directory and then tell the user what plugins were missing from the various "Plugins" text files. Note: I am using "Plugins" to refer to the "ntuser", "software", "sam", "system", "security" and "all" text files. These contain a list of Perl (*.pl) plugins that a user can use to batch run against a particular hive.
He also mentioned that it might be handy to be able to exclude certain files (eg malware plugins) from being considered so if a user has deleted certain entries, they aren't constantly being reminded "Hey! There's a difference between what's in the "Plugins" file and what's actually available (ie the *.pl plugins)".

UPDATE: This script was not intended to be used to blindly load every possible *.pl plugin into their relevant "Plugins" text file. It was intended for use as a diagnostic tool "What (*.pl) plugins are missing from my current "Plugins" file?" (eg "ntuser"). Users can then look at the missing plugins list and determine if they should/should not include them in their "Plugins" text file.

UPDATE: Script ("PrintMissingPlugins" function) has been updated so commented-out entries (eg #ccleaner) in "Plugins" text files are not used in comparisons for missing *.pl plugins. Thanks Francesco!

UPDATE: Script ("PrintMissingPlugins" function) has been updated so blank lines in "Plugins" text files are not used in comparisons for missing *.pl plugins Have also updated the Code and Testing sections.

So let's see how we can implement Corey's idea ...

The Code

# CODE STARTS ON LINE BELOW
#!/usr/bin/perl
# Note: I removed the "-w" above because the eval/require/getHive section was generating numerous variable masking warnings
# This shouldn't affect us as we only call the plugin's getHive method.

# Perl script to list updates for RegRipper plugin files (ntuser, software, sam, system, security, all)
# Original Idea by Corey Harrell
# Coded by cheeky4n6monkey@gmail.com
# Note: Probably best if you do NOT run this script from the RegRipper Plugins directory
# as it could affect the *.pl plugin processing.

# Created (2012-04-02)
# Modified (2012-04-03) for handling "#" commented out lines in "Plugins" text files (Thanks Francesco!)
# Modified (2012-04-04) for handling blank lines in "Plugins" text files (Thanks Francesco!)
# Modified (2012-04-04) re-instated "No missing plugins message"

use strict;

use Getopt::Long;
use File::Spec::Functions qw/catfile/;

my $version = "regripper-maint.pl v2012-04-04";
my $help = 0;
my $plugindir = "";
my $excludefile = "";
my @pluginslist;

GetOptions('help|h' => \$help,
    'dir=s' => \$plugindir,
    'x=s' => \$excludefile,
    'p=s@' => \@pluginslist);

if ($help || @pluginslist == 0  || $plugindir eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to list discrepancies for RegRipper Plugin files (ntuser, software, sam, system, security, all)\n");
    print("\nUsage: regripper-maint.pl [-h|help] [-dir plugins_dir] [-x exclude_files] [-p plugins_list]\n");
    print("-h|help ............ Help (print this information). Does not run anything else.\n");
    print("-dir plugins_dir ... Plugins directory.\n");
    print("-x exclude_file .... Exclude text file which lists plugins to exclude from any checks.\n");
    print("-p plugins_list .... Plugins file(s) to be checked (eg all, ntuser).\n");
    print("\nExamples: \n");
    print("regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p sam\n");
    print("regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -x regripper-maint-exclude -p all\n\n");
    exit;
}

print "\nRunning $version\n\n";

# List of plugins derived from the default (ntuser, software, sam, system, security, all) plugins text files
my @ntuser_plugins;
my @software_plugins;
my @sam_plugins;
my @system_plugins;
my @security_plugins;
my @all_plugins;

# Directory file listing of *.pl files in user's plugins directory
my @readinplugins;
# @readinplugins list broken up by hive
my @ntuser_actual;
my @software_actual;
my @sam_actual;
my @system_actual;
my @security_actual;
my @all_actual;

my @excludelist;

# Extract entries from user nominated exclude file
if (not $excludefile eq "")
{
    open(my $xfile, "<", $excludefile) or die "Can't Open $excludefile Exclude File!";
    @excludelist = <$xfile>; # extract each line to a list element
    chomp(@excludelist); # get rid of newlines
    close($xfile);
    foreach my $ig (@excludelist)
    {
        print "Ignoring the $ig plugin for any comparisons\n";
    }
    print "\n";
}

# Read in the entries in the default Plugins text file(s)
# Plugin files have lowercase names
foreach my $plugin (@pluginslist)
{
    open(my $pluginsfile, "<", catfile($plugindir,$plugin) ) or die "Can't Open $plugin Plugins File!";
    if ($plugin =~ /ntuser/)
    {
        print "Reading the ntuser Plugins File\n";
        @ntuser_plugins = <$pluginsfile>; # extract each line to a list element
        chomp(@ntuser_plugins); # get rid of newlines
    }
    if ($plugin =~ /software/)
    {
        print "Reading the software Plugins File\n";
        @software_plugins = <$pluginsfile>;
        chomp(@software_plugins);
    }
    if ($plugin =~ /sam/)
    {
        print "Reading the sam Plugins File\n";
        @sam_plugins = <$pluginsfile>;
        chomp(@sam_plugins);
    }
    if ($plugin =~ /system/)
    {
        print "Reading the system Plugins File\n";
        @system_plugins = <$pluginsfile>;
        chomp(@system_plugins);
    }
    if ($plugin =~ /security/)
    {
        print "Reading the security Plugins File\n";
        @security_plugins = <$pluginsfile>;
        chomp(@security_plugins);
    }
    if ($plugin =~ /all/)
    {
        print "Reading the all Plugins File\n";
        @all_plugins = <$pluginsfile>;
        chomp(@all_plugins);
    }
    close $pluginsfile;
}

# This code for determining a package's hive was cut / pasted / edited from "rip.pl" lines 42-67.
# Reads in the *.pl plugin files from the plugins directory and store plugin names in hive related "actual" lists
# Note: the "all_actual" list will later be a concatenation of all of the "actual" lists populated below
opendir(DIR, $plugindir) || die "Could Not Open $plugindir: $!\n";
@readinplugins = readdir(DIR);
closedir(DIR);

foreach my $p (@readinplugins)
{
    my $hive;
    next unless ($p =~ m/\.pl$/); # gonna skip any files which don't end in .pl
    my $pkg = (split(/\./,$p,2))[0]; # extract the package name (by removing the .pl)
    $p = catfile($plugindir, $p); # catfile is used to create absolute path filename (from File::Spec::Functions)
    eval
    {
        require $p; # "require" needs to be inside an eval in order to import package functions ?
        $hive = $pkg->getHive(); # hive name should be UPPERCASE but could be mixed *sigh*
    };
    print "Error: $@\n" if ($@);

    if ($hive =~ /NTUSER/i )
    {
        push(@ntuser_actual, $pkg);
    }
    elsif ($hive =~ /SOFTWARE/i )
    {
        push(@software_actual, $pkg);
    }
    elsif ($hive =~ /SAM/i )
    {
        push(@sam_actual, $pkg);
    }
    elsif ($hive =~ /SYSTEM/i )
    {
        push(@system_actual, $pkg);
    }
    elsif ($hive =~ /SECURITY/i )
    {
        push(@security_actual, $pkg);
    }
    elsif ($hive =~ /ALL/i )
    {
        push(@all_actual, $pkg); # some .pl plugins have "All" listed as their hive
    }
}

# Calls PrintMissingPlugins to compare a Plugins text file list with an "actual" *.pl plugins list
foreach my $plugin (@pluginslist)
{
    if ($plugin =~ /ntuser/)
    {
        PrintMissingPlugins("NTUSER", "ntuser", \@ntuser_actual, \@ntuser_plugins);
    }
    if ($plugin =~ /software/)
    {
        PrintMissingPlugins("SOFTWARE", "software", \@software_actual, \@software_plugins);
    }
    if ($plugin =~ /sam/)
    {
        PrintMissingPlugins("SAM", "sam", \@sam_actual, \@sam_plugins);
    }
    if ($plugin =~ /system/)
    {
        PrintMissingPlugins("SYSTEM", "system", \@system_actual, \@system_plugins);
    }
    if ($plugin =~ /security/)
    {
        PrintMissingPlugins("SECURITY", "security", \@security_actual, \@security_plugins);
    }
    if ($plugin =~ /all/)
    {
        PrintMissingPlugins("ALL", "all", \@all_actual, \@all_plugins);
    }
}

# End Main

sub PrintMissingPlugins
{
    my $hive = shift; # hive name
    my $name = shift; # Plugins text file name
    my $actual_plugins  = shift; # reference to list of plugins derived from *.pl files
    my $listed_plugins = shift; # reference to list of plugins derived from Plugins text file

    my @missing_plugins; # list stores *.pl files which are NOT declared in a given Plugins text file
    my @missing_pl; # list stores Plugin entries which do NOT have a corresponding .pl file

    print "\nThere are ".scalar(@$actual_plugins)." $hive plugins in $plugindir\n";
    # We say "lines" because there can be accidental multiple declarations *sigh*
    print "There are ".scalar(@$listed_plugins)." plugin lines listed in the $name Plugins file\n";
    print scalar(@excludelist)." plugins are being ignored\n";

    # Handle the "all" Plugin case discrepancy
    # There's a large mixture of different hive plugins listed in the "all" Plugins text file
    # and only a handful of plugins who actually return "All" as their hive.
    # At this point, @all_actual should only contain plugins which list "All" as their hive
    # In a fit of hacktacular-ness, we'll now also add the contents from the other "actual" arrays to @all_actual.
    # Otherwise, when we compare the list of "All" hive plugins (@$actual_plugins) with
    # the lines in the "all" Plugin (@$listed_plugins), there will be a lot of differences reported.
    if ($hive eq "ALL")
    {
        push(@$actual_plugins, @ntuser_actual);
        push(@$actual_plugins, @software_actual);
        push(@$actual_plugins, @sam_actual);
        push(@$actual_plugins, @system_actual);
        push(@$actual_plugins, @security_actual);
    }
    # From here on, @all_actual / @$actual_plugins will contain a list of every processed type of .pl plugin file

    # For each *.pl plugin file, check that it has a corresponding entry in the given Plugins text file
    foreach my $pkg (@$actual_plugins)
    {
        my $res = scalar(grep(/^($pkg)$/, @$listed_plugins)); # is this .pl listed in Plugins file ?
        my $ignore = scalar(grep(/^($pkg)$/, @excludelist)); # is this .pl being ignored ?
        if ( ($res eq 0) and ($ignore eq 0) )
        {
            push(@missing_plugins, $pkg);
        }
    }
    if (@missing_plugins)
    {
        print "\nThere are ".scalar(@missing_plugins)." plugins missing from the $name Plugins file:\n";
        foreach my $miss (@missing_plugins)
        {
            print $miss."\n";
        }
    }

    # For each line in the Plugins text file, check that it has a corresponding *.pl file
    foreach my $plug (@$listed_plugins)
    {
        # If this Plugins entry has been commented out (by a preceding "#") OR if it starts with a whitespace (newline),
        # skip to next entry so we don't get told there's no corresponding .pl file
        if ( ($plug =~ /^#/) or ( not $plug =~ /^\S/ ) )
        {
            next;
        }
        my $res = scalar (grep(/^($plug)$/, @$actual_plugins)); # does this Plugin entry have a corresponding .pl file ?
        my $ignore = scalar(grep(/^($plug)$/, @excludelist)); # is this Plugin entry being ignored ?
        if ( ($res eq 0) and ($ignore eq 0) )
        {
            push(@missing_pl, $plug);
        }
    }
    if (@missing_pl)
    {
        print "\nThere are ".scalar(@missing_pl)." plugins declared in the $name Plugins file with no corresponding .pl file:\n";
        foreach my $miss (@missing_pl)
        {
            print $miss."\n";
        }
    }

    if ( (@missing_plugins eq 0) and (@missing_pl eq 0) )
    {
        print "No missing plugins detected\n";
    }

}

# CODE ENDS HERE


Code Summary

Here's a high level overview of the code:
- Parses command line arguments using "GetOptions"
- Prints out a Help message (if required)
- Extracts entries from the user nominated exclude text file (if one is specified)
- Reads in the entries in the specified "Plugins" text file(s)
- Reads in the *.pl plugin names from the plugins directory and stores them in hive related "actual" lists
- Calls the "PrintMissingPlugins" subroutine to compare a "Plugins" text file list against the corresponding "actual" *.pl plugins list. The function also allows for commented out lines (eg "# COMMENT") and blank lines in "Plugins" files.

If you have RegRipper installed, you should already have the necessary Perl packages installed (eg File::Spec::Functions).
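
If you want to double-check, a quick one-liner like the following should confirm the modules can be loaded (if one is missing, Perl will complain with a "Can't locate ..." error):

perl -MGetopt::Long -MFile::Spec::Functions -e 'print "Required modules found\n"'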

It's not the most concise/efficient code - when in doubt, I tried to make it more readable (at least for me!). I also added more comments in the code so I wouldn't have to write as much in this section. I think that's more convenient than scrolling up/down between the summary and the code.

Finally, I should mention that I kept my "regripper-maint.pl" script (and my exclusion text file "regripper-maint-exclude") in "/home/sansforensics/". I didn't want the script to parse itself when looking for .pl plugin files. I suspect I could have just as easily called "regripper-maint.pl" from "/usr/local/src/regripper/". Meh.
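
In case you're wondering, the exclude file is just a plain text file with one plugin name (minus the ".pl" extension) per line. So my "regripper-maint-exclude" file simply contains the single line:

ccleaner

You could create it with something like: echo "ccleaner" > regripper-maint-exclude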

Testing

We'll start testing (on SIFT v2.12 using the default RegRipper install) with the 2 examples given in the script's Help message.


sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p sam

Running regripper-maint.pl v2012-04-04

Reading the ntuser Plugins File
Reading the sam Plugins File

There are 98 NTUSER plugins in /usr/local/src/regripper/plugins/
There are 97 plugin lines listed in the ntuser Plugins file
0 plugins are being ignored

There are 1 plugins missing from the ntuser Plugins file:
ccleaner

There are 1 SAM plugins in /usr/local/src/regripper/plugins/
There are 1 plugin lines listed in the sam Plugins file
0 plugins are being ignored
No missing plugins detected
sansforensics@SIFT-Workstation:~$


Looks OK! I have not added my "ccleaner.pl" script to the "ntuser" Plugins file so the results make sense. We can check the number of lines in the "ntuser" Plugins file using the following:

sansforensics@SIFT-Workstation:~$ wc -l /usr/local/src/regripper/plugins/ntuser
97 /usr/local/src/regripper/plugins/ntuser
sansforensics@SIFT-Workstation:~$


Just as reported, there are 97 lines in the "ntuser" Plugins file. As for the SAM results:

sansforensics@SIFT-Workstation:~$ wc -l /usr/local/src/regripper/plugins/sam
1 /usr/local/src/regripper/plugins/sam
sansforensics@SIFT-Workstation:~$


This seems OK, as there's only one "sam" Plugins entry ("samparse").
Now let's try the second help example ...

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -x regripper-maint-exclude -p all

Running regripper-maint.pl v2012-04-04

Ignoring the ccleaner plugin for any comparisons

Reading the all Plugins File

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
1 plugins are being ignored

There are 2 plugins missing from the all Plugins file:
winlivemail
winlivemsn

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


So this example shows the script referring to our "regripper-maint-exclude" exclusion file (which has one line containing "ccleaner"). Hence, it ignores any "ccleaner" plugin comparisons.
We can also see that while only 3 "ALL" *.pl plugins were found, there are 204 lines declared in the "all" Plugins file. The "all" Plugins file is a special case in that it can contain plugins which refer to more than one type of hive. See the code comments for more info on how I dealt with this (it ain't particularly pretty).
Anyhoo, we can also see that there are 2 .pl plugins which are NOT declared in the "all" Plugins file. Open the "all" file and verify it for yourself - there is no "winlivemail" or "winlivemsn" entry.
There are also 2 entries in the "all" Plugins file which don't have corresponding .pl files (ie "port_dev.pl" and "wlm_cu.pl" do not exist). That's gonna make it a bit hard to call those plugins eh?

And here is the same example test except WITHOUT using the "regripper-maint-exclude" exception file:

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/  -p all

Running regripper-maint.pl v2012-04-04

Reading the all Plugins File

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
0 plugins are being ignored

There are 3 plugins missing from the all Plugins file:
winlivemail
winlivemsn
ccleaner

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


You can see that the "ccleaner" plugin is now included in the missing plugins.
Just to prove this isn't all smoke and mirrors, here are the results for all 6 hives (with no exclusions):

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p software -p sam -p system -p security -p all

Running regripper-maint.pl v2012-04-04

Reading the ntuser Plugins File
Reading the software Plugins File
Reading the sam Plugins File
Reading the system Plugins File
Reading the security Plugins File
Reading the all Plugins File

There are 98 NTUSER plugins in /usr/local/src/regripper/plugins/
There are 97 plugin lines listed in the ntuser Plugins file
0 plugins are being ignored

There are 1 plugins missing from the ntuser Plugins file:
ccleaner

There are 54 SOFTWARE plugins in /usr/local/src/regripper/plugins/
There are 54 plugin lines listed in the software Plugins file
0 plugins are being ignored
No missing plugins detected

There are 1 SAM plugins in /usr/local/src/regripper/plugins/
There are 1 plugin lines listed in the sam Plugins file
0 plugins are being ignored
No missing plugins detected

There are 44 SYSTEM plugins in /usr/local/src/regripper/plugins/
There are 44 plugin lines listed in the system Plugins file
0 plugins are being ignored
No missing plugins detected

There are 3 SECURITY plugins in /usr/local/src/regripper/plugins/
There are 3 plugin lines listed in the security Plugins file
0 plugins are being ignored
No missing plugins detected

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
0 plugins are being ignored

There are 3 plugins missing from the all Plugins file:
winlivemail
winlivemsn
ccleaner

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


I subsequently viewed the relevant Plugins files and confirmed that the number of plugin lines declared matched what was printed above. I could also have used the "wc -l" trick mentioned previously. Meh.

Summary

Using Corey Harrell's idea, we've coded a RegRipper maintenance script which can detect discrepancies between what's declared in the Plugins text files and which .pl plugins actually exist.
While this script doesn't help you if you don't use the Plugin text files, it has still been an interesting programming exercise. I can feel my Perl-fu growing stronger by the day ... muhahaha!

And for those that don't know, Cheeky4n6Monkey is now (intermittently) on Twitter (@Cheeky4n6Monkey) ... and co-incidentally(?) getting a lot less work done! Thank you to all those who have already extended their welcome.

As always, please leave a comment if you found this post helpful / interesting / snake oil. PS I will send a copy of this script to the RegRipper folks just in case they find it useful (Thanks Brett and Francesco!).

Friday, 30 March 2012

Using Perl to Copy AppID Data from HTML to an SQLite Database


Someday You'll Find It ... The Jumplist Connection!

So all this talk of Windows 7 Jumplists (eg here and here) got me thinking - what if you needed to look up an AppID and didn't have access to the Internet? Also, is there any way we can import a list of AppIDs so we can then access them from future scripts (eg for processing Jumplists)?
I then had my "Aha!" moment (no, nothing to do with that 80s band), and thought "SQLite!"
SQLite also has the benefit of being cross-platform - so an AppID List generated on SIFT should work on a PC etc. By using a standard database, we can also ensure that our fearless forensicator lab uses a common set of (unchanging) source data.

So how exactly can we do this (using the SIFT VM)?
1. We create an empty SQLite Database.
2. We parse the ForensicsWiki AppID HTML page for data.
3. We populate the Database.





Creating The Empty Database

Creating an (empty) SQLite AppID Database is pretty straightforward. I opted to do this part manually using SIFT's "sqlite3" command line tool.

1. At a terminal command line, I typed:
"sqlite3 WinAppIDs.sqlite"
2. At the resultant "sqlite3" prompt, I typed (without the quotes):
"CREATE TABLE IF NOT EXISTS WinAppIDs (AppID TEXT, AppDescrip TEXT, DateAdded NUMERIC, Source TEXT);" 
3. At the next "sqlite3" prompt, I typed (without the quotes):
"CREATE UNIQUE INDEX IF NOT EXISTS AppIdx ON WinAppIDs(AppID, AppDescrip);"
4. Finally, at the "sqlite3" prompt, I typed (without the quotes):
".quit"

There should now be a "WinAppIDs.sqlite" database file in the current directory (eg "/home/sansforensics"). This file contains a "WinAppIDs" table which has AppID, AppDescrip, DateAdded and Source fields (mirroring the ForensicsWiki page tables).
Additionally, I have hacked/cheated a bit and created a Unique Index called "AppIdx". The purpose of the index is to ensure that we cannot insert duplicate entries. That is, we can't insert an entry with the same AppID AND the same AppDescription as an existing one. You wouldn't think this would be a problem with the AppId tables right? Wrong! In the tables there are entries such as:

Different AppIds, Same Descriptions
b8c29862d9f95832    Microsoft Office InfoPath 2010 x86
d64d36b238c843a3    Microsoft Office InfoPath 2010 x86

Same AppIDs, Different Descriptions
23646679aaccfae0    Adobe Reader 9.
23646679aaccfae0    Adobe Reader 9 x64

So rather than calling a SELECT query to check before every INSERT (there's 400+ AppIDs to insert), we'll just make the INSERT fail (via "AppIdx") when there's already an existing entry with the same AppID and AppDescription. For more on SQLite indexing read here. Specifically it says, "If the UNIQUE keyword appears between CREATE and INDEX then duplicate index entries are not allowed. Any attempt to insert a duplicate entry will result in an error."
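
As an aside, if you'd rather not type the statements into "sqlite3" manually, the same empty database/table/index could be created with a short Perl DBI sketch (assuming the DBD::SQLite driver is installed - this is just an alternative, not what I actually did):

#!/usr/bin/perl -w
use strict;
use DBI;

# Connecting to a non-existent SQLite file will create it
my $db = DBI->connect("dbi:SQLite:dbname=WinAppIDs.sqlite","","", {RaiseError => 1});
$db->do("CREATE TABLE IF NOT EXISTS WinAppIDs (AppID TEXT, AppDescrip TEXT, DateAdded NUMERIC, Source TEXT)");
$db->do("CREATE UNIQUE INDEX IF NOT EXISTS AppIdx ON WinAppIDs(AppID, AppDescrip)");
$db->disconnect;
print "WinAppIDs.sqlite created\n";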


Parsing the ForensicsWiki AppID HTML page for data and Populating the Database

Parsing/extracting the AppIDs from HTML is easy once you find out somebody's already written a CPAN package for that exact purpose! It's called "HTML::TableExtract" and you can read all about it here.
Additionally, the folks at ForensicsWiki have also included a "Source" link for each AppID. To extract the link we can use "HTML::LinkExtor". You can read about it here.
As we have seen in previous posts, we can use the Perl "DBI" package to interface with an SQLite Database. So knowing all this, we should now be able to insert our newly extracted data.
To simplify this exercise, we will extract the data and then enter it into the database from one script ("4n6wiki-parser.pl").
Incidentally, the StackOverflow website is a great searchable resource for Perl questions - this was where I not only heard about the previously mentioned packages but also saw example code of their use. Awesome!

Finally(!), we should write a short query script ("4n6wiki-query.pl") so that we can search our SQLite database by either AppID or AppDescrip.

Now let's take a look at what this monkey passes off for code ...

"4n6wiki-parser.pl" Code

#CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# Perl script to parse http://www.forensicswiki.org/wiki/List_of_Jump_List_IDs and import it into an existing SQLite Database

use strict;

use Getopt::Long;
use HTML::TableExtract;
use HTML::LinkExtor;
use DBI;

my $version = "4n6wiki-parser.pl v2012-03-27";
my $help = 0;
my $filename = "";
my $database = "";

GetOptions('help|h' => \$help,
    'file=s' => \$filename,
    'db=s' => \$database);

if ($help || $filename eq "" || $database eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to parse http://www.forensicswiki.org/wiki/List_of_Jump_List_IDs and import it into an existing SQLite Database\n");
    print("\nUsage: 4n6wiki-parser.pl [-h|help] [-file filename] [-db database]\n");
    print("-h|help .......... Help (print this information). Does not run anything else.\n");
    print("-file filename ... Copy of HTML file to be parsed.\n");
    print("-db database ..... Target SQLite Database.\n");
    print("\nExample: 4n6wiki-parser.pl -file List_of_Jump_List_IDs.html -db WinAppIDs.sqlite\n");
    exit;
}

print "\nRunning $version\n\n";

open(my $htmlfile, "<".$filename) || die("Unable to open $filename for parsing\n");
my @lines = <$htmlfile>; # extract each line to a list element
my $html_string = join(' ', @lines); # join list elements into one big happy scalar string
close($htmlfile);

my $tblextract = HTML::TableExtract->new(keep_html => 1, headers => [qw(AppID Application Date Source)] );
$tblextract->parse($html_string);

my $lnkextr = HTML::LinkExtor->new();

my @bigarray; # will be a all table cells merged into one [row][cell] type array
my $rowcount = 0; # running count of rows in all tables

foreach my $tbl ($tblextract->tables)
{
    my @rows = $tbl->rows;
    my @cols = $tbl->columns;
    print "Extracted Table Size = ".scalar(@rows)." x ".scalar(@cols)."\n";

    for my $rownum (0 .. (@rows - 1))
    {
        # Ass-ume always 4 columns, last one (3) contains link info, col 2 contains date eg 8/22/2011
        for my $colnum (0 .. (@cols - 1))
        {
            if ($colnum == (@cols - 2) )
            {
                #reformat date into form yyyy-mm-dd
                my $date = $tbl->cell($rownum, $colnum);
                $date =~ /(\d+)\/(\d+)\/(\d+)/;
                my $year = $3;
                my $month = $1;
                my $day = $2;
                if ($day < 10)
                {
                    $day = "0".$day;
                }
                if ($month < 10)
                {
                    $month = "0".$month;
                }
                my $newdate = $year."-".$month."-".$day;
                $bigarray[$rowcount][$colnum] = $newdate;
            }
            elsif ($colnum == (@cols - 1))
            {       
                # Extract a link entry eg "http://social.technet.microsoft.com/Forums/" for col 3
                $lnkextr->parse($tbl->cell($rownum, $colnum));
                for my $link_tag ( $lnkextr->links )
                {
                    my ($tag, %links) = @$link_tag;
                    foreach my $key (keys %links)
                    {
                        $bigarray[$rowcount][$colnum] = $links{$key};
                        last; # only do it for one link then bail out
                    }
                }
            }
            else
            {
                #Record each of the other column fields for this row ie col 0, 1, 2
                $bigarray[$rowcount][$colnum] = $tbl->cell($rownum, $colnum);
            }
        }
        $rowcount = $rowcount + 1;       
    }
}

print "Number of Rows Extracted from HTML = $rowcount\n";

my $db = DBI->connect("dbi:SQLite:dbname=$database","","") || die( "Unable to connect to database\n" );
my $sth;

for my $currow (0 .. ($rowcount-1))
{
    my $id = $bigarray[$currow][0];
    my $descrip = $bigarray[$currow][1];
    my $date = $bigarray[$currow][2];
    my $source = $bigarray[$currow][3];
    $sth =  $db->prepare_cached("INSERT INTO WinAppIDs (AppID, AppDescrip, DateAdded, Source) VALUES (?, ?, ?, ?)"); #or die "Couldn't prepare statement: ".$db->errstr;
    $sth->execute($id, $descrip, $date, $source); #or die "Couldn't execute statement: ".$sth->errstr;
}
$sth->finish;
$db->disconnect;
print "$database Update complete ...\n";

# END CODE


"4n6wiki-parser.pl" Code Summary

Before you use this script, ensure that you have installed the following Perl packages: HTML::TableExtract, HTML::LinkExtor, DBI (eg type "sudo cpan HTML::TableExtract" etc.).

The first section calls "GetOptions" to parse the user's input arguments and then the following section is the Help printout.

We then read in the user specified HTML file into one long scalar variable ("$html_string").

Next we create a new HTML::TableExtract object to parse the table data. We use the argument "keep_html => 1" so we can later parse any web links in the table (using HTML::LinkExtor). The other argument, "headers => [qw(AppID Application Date Source)]" tells HTML::TableExtract the column headings so it can figure out how to extract data.
Note: I just used the first word of each column heading and it seemed to work OK. I don't think you can specify something like "Application Description" for a heading because the whitespace will confuse HTML::TableExtract. If they ever change the column headings to use the same first word (eg "AppID" and "AppID Description"), we're in trouble.
Anyhow, next we call the "parse" function of HTML::TableExtract to do all the heavy lifting and extract the data.
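
To give you a feel for how little code that takes, here's a minimal standalone sketch (using a made-up one row table rather than the actual ForensicsWiki page):

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;

my $html = "<table><tr><th>AppID</th><th>Application</th></tr>".
           "<tr><td>b8c29862d9f95832</td><td>Microsoft Office InfoPath 2010 x86</td></tr></table>";
# Tell HTML::TableExtract which column headings to look for
my $te = HTML::TableExtract->new(headers => [qw(AppID Application)]);
$te->parse($html);
foreach my $tbl ($te->tables)
{
    foreach my $row ($tbl->rows)
    {
        print join(" | ", @$row)."\n"; # prints "b8c29862d9f95832 | Microsoft Office InfoPath 2010 x86"
    }
}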

Now we create a new HTML::LinkExtor object to extract any HTML links in the table. We then proceed to loop through each row/column of each table that HTML::TableExtract has found (11 tables in the copy of the page used for the Testing section below). We store the result in a huge array of arrays ("bigarray[rownum][colnum]").
For half of the columns (ie "AppId" and "Application Description"), we will just copy the data straight into "bigarray". The other 2 columns ("Date Added" and "Source") require a bit of massaging before entry (heheh).
ForensicsWiki lists the dates in the U.S. format of month/day/year (I DARE GU to say it's a better format! LOL) with no preceding zeros for single digits. SQLite uses the yyyy-mm-dd format in its time calculations. So, just in case we want to sort later by date, our script has to re-arrange the date before storing it in "bigarray". We use a regular expression to extract the various parts of the date and then zero pad them if required before re-assembling them into the SQLite format. The regular expression code is:

"$date =~ /(\d+)\/(\d+)\/(\d+)/;"

In the line before this, "$date" was assigned the value from the date cell (eg "8/22/2011"). The "\d+"s each indicate a field of one or more digits. The surrounding "( )"s indicate we are extracting the various values (in order) to $1, $2, $3. We also need to escape (with "\") the date separators ("/") so they don't get confused with the regular expression delimiters. More information on Perl regular expressions is available here.
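
Just as a side note, the zero padding could also be done in one hit with "sprintf" - a rough equivalent of the script's date handling would be something like:

my $date = "8/22/2011";
if ($date =~ /(\d+)\/(\d+)\/(\d+)/)
{
    my $newdate = sprintf("%04d-%02d-%02d", $3, $1, $2); # gives "2011-08-22"
}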

I found the basic code for the Link Extraction here. However, we call the HTML::LinkExtor's "parse" function with the table cell content (not the whole row). I found it quite difficult to initially understand this snippet, so I will try to explain using our actual code below.

$lnkextr->parse($tbl->cell($rownum, $colnum));
for my $link_tag ( $lnkextr->links )
{
    my ($tag, %links) = @$link_tag;
    foreach my $key (keys %links)
    {
        $bigarray[$rowcount][$colnum] = $links{$key};
        last; # only do it for one link then bail out
    }
}


To get the results of the HTML::LinkExtor's "parse" call, we have to call the "$lnkextr->links" method. According to the CPAN documentation, this returns data in the form:
"[$tag, $attr => $url1, $attr2 => $url2,...]".
The enclosing "[ ]"s denote an array containing the "$tag" scalar variable and what looks like a hash list ("$attr => $url1, $attr2 => $url2").
Theoretically, there should only be one "$link_tag" result for our case and therefore only one iteration of the "for my $link_tag" loop.

In the loop, we now see: "my ($tag, %links) = @$link_tag;"
The right-hand side, "@$link_tag", is saying treat this "$link_tag" scalar as a reference to an array (hence the "@"). This array will be the one containing the "$tag" and the hash list mentioned previously.
The left-hand side assigns values for the various parts of the array. So we end up with "$tag" and a hash list array (now called "%links").
The next step is to iterate through "%links" and, for the first key/URL it finds, store that link in "bigarray".
For more information on Perl References see here.
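
To make that dereferencing a bit more concrete, here's a small standalone sketch (using a made-up table cell string) showing what HTML::LinkExtor hands back:

#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;

my $cell = '<a href="http://social.technet.microsoft.com/Forums/">Source</a>';
my $lnkextr = HTML::LinkExtor->new();
$lnkextr->parse($cell);
for my $link_tag ($lnkextr->links)
{
    my ($tag, %links) = @$link_tag; # $tag is "a", %links is (href => the URL)
    foreach my $key (keys %links)
    {
        print "$tag $key = ".$links{$key}."\n"; # prints "a href = http://social.technet.microsoft.com/Forums/"
        last; # only do it for one link then bail out
    }
}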

OK, now we should have all the data we need in "bigarray" and we also have an SQLite database file ("WinAppIDs.sqlite") containing an empty table ("WinAppIDs"). Time to feed the machine!
As we've seen in previous posts, we first call DBI's "connect" to interface with the Database. If it doesn't work we quit and nothing gets inserted.
Next we loop through each "row" of "bigarray" and store the column values in temporary scalars ($id, $descrip, $date, $source). We then prepare our SQLite INSERT statement:

$sth =  $db->prepare_cached("INSERT INTO WinAppIDs (AppID, AppDescrip, DateAdded, Source) VALUES (?, ?, ?, ?)"); #or die "Couldn't prepare statement: ".$db->errstr;

We're using "prepare_cached" because it apparently minimises resources for our repeated INSERTs (seriously?!) compared to a normal "prepare".
The "?" arguments are placeholders for our temporary values which get assigned in the next line.

$sth->execute($id, $descrip, $date, $source); #or die "Couldn't execute statement: ".$sth->errstr;

Note: I've commented out the ("die") error handling parts for "prepare_cached" and "execute" because if we fail to INSERT, it could be due to a duplicate. We should keep looping though "bigarray" because there might be new non-duplicate information later in "bigarray" (eg we might be updating an existing database).

After we've finished looping through "bigarray", we call "finish" and "disconnect" and print out an "Update Complete" message.
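
Incidentally, if you wanted to hide the expected duplicate warnings (but still catch genuine problems), one possible tweak - an untested idea rather than what the script above does - is to turn off DBI's "PrintError" when connecting and then check the "execute" return value yourself:

my $db = DBI->connect("dbi:SQLite:dbname=$database","","", {PrintError => 0}) || die("Unable to connect to database\n");
my $sth = $db->prepare_cached("INSERT INTO WinAppIDs (AppID, AppDescrip, DateAdded, Source) VALUES (?, ?, ?, ?)");
for my $currow (0 .. ($rowcount-1))
{
    my $ok = $sth->execute($bigarray[$currow][0], $bigarray[$currow][1], $bigarray[$currow][2], $bigarray[$currow][3]);
    if ((not $ok) and ($sth->errstr !~ /not unique/i))
    {
        print "Unexpected INSERT error: ".$sth->errstr."\n"; # report anything that is NOT a duplicate
    }
}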


"4n6wiki-query.pl" Code

# CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# Perl script to query an existing SQLite Database for Windows AppID info
# Intended to be used in conjunction with the 4n6wiki-parser.pl script.

use strict;

use Getopt::Long;
use DBI;

my $version = "4n6wiki-query.pl v2012-03-27";
my $help = 0;
my $appid = "";
my $descrip = "";
my $database = "";

GetOptions('help|h' => \$help,
    'appid=s' => \$appid,
    'descrip=s' => \$descrip,
    'db=s' => \$database);

if ($help || $database eq "" || ($descrip eq "" && $appid eq "") || ($descrip ne "" && $appid ne "") )
{
    print("\nHelp for $version\n\n");
    print("Perl script to query an existing SQLite Database for Windows AppID info\n");
    print("Intended to be used in conjunction with the 4n6wiki-parser.pl script.\n"); 
    print("\nUsage: 4n6wiki-query.pl [-h|help] [-appid APPID] [-descrip Description] [-db database]\n");
    print("-h|help .......... Help (print this information). Does not run anything else.\n");
    print("-appid APPID ..... Search Database for match(es) for this APPID.\n");
    print("-descrip Description ... Search Database for match(es) \"like\" this Description.\n");
    print("-db database ..... Target SQLite Database.\n");
    print("\nExamples: \n");
    print("4n6wiki-query.pl -appid 3dc02b55e44d6697 -db WinAppIDs.sqlite\n");
    print("4n6wiki-query.pl -descrip \"Adobe Flash\" -db WinAppIDs.sqlite\n");
    print("Note: Listing BOTH -appid and -descrip will not work / prints this message\n\n");
    exit;
}

print "\nRunning $version\n\n";

my $db = DBI->connect("dbi:SQLite:dbname=$database","","") || die( "Unable to connect to database\n" );
my $sth;
if ($appid ne "")
{
    $sth =  $db->prepare("SELECT AppID, AppDescrip FROM WinAppIDs WHERE AppID=?") or die "Couldn't prepare statement: ".$db->errstr;
    $sth->execute($appid) or die "Couldn't execute statement: ".$sth->errstr;
    PrintHeadings($sth);
    PrintResults($sth);
    if ($sth->rows == 0)
    {
        print "No Matching AppIDs found!\n";
    }
    else
    {   
        print $sth->rows." Matches returned\n";
    }
}
elsif ($descrip ne "")
{
    my $likestr = "%".$descrip."%";
    $sth =  $db->prepare("SELECT AppID, AppDescrip FROM WinAppIDs WHERE AppDescrip LIKE ?") or die "Couldn't prepare statement: ".$db->errstr;
    $sth->execute($likestr) or die "Couldn't execute statement: ".$sth->errstr;
    PrintHeadings($sth);
    PrintResults($sth);
    if ($sth->rows == 0)
    {
        print "No Matching Descriptions found!\n";
    }
    else
    {   
        print $sth->rows." Matches returned\n";
    }
}

$sth->finish;
$db->disconnect;

# End Main

sub PrintHeadings
{
    my $sth = shift;

    # Print field headings
    for (my $i = 0; $i <= $sth->{NUM_OF_FIELDS}-1; $i++)
    {
        if ($i == $sth->{NUM_OF_FIELDS} - 1)
        {
            print $sth->{NAME}->[$i]."\n"; #last item adds a newline char
        }
        else
        {   
            print $sth->{NAME}->[$i]." | ";
        }
    }
}

sub PrintResults
{
    my $sth = shift;
    my @rowarray;

    # Prints row by row / field by field
    while (@rowarray = $sth->fetchrow_array() )
    {
        for (my $i = 0; $i <= $sth->{NUM_OF_FIELDS}-1; $i++)
        {
            if ($i == $sth->{NUM_OF_FIELDS} - 1 )
            {
                print $rowarray[$i]."\n"; #last field in row adds newline
            }
            else
            {
                if ($rowarray[$i])
                {
                    print $rowarray[$i]." | ";
                }
                else
                {
                    print " | "; # field returned could be UNDEFINED, just print separator
                }
            }
        }
    }
}

# CODE ENDS


"4n6wiki-query.pl" Code Summary

This script is very similar to our previous "ffparser.pl" script. It will use DBI to run SELECT queries against a user nominated SQLite Database. There will be 2 types of queries - a search by AppID and a search for LIKE terms in AppDescrip. We won't use an exact match for AppDescrip because the analyst might not know the exact description (eg they just know it's for CCleaner).
As usual, it starts off with "GetOptions" and the Help printout sections.
We then call DBI's "connect" and one of the versions of "prepare" (depending on the user's input arguments).
Next we call "PrintHeadings" and "PrintResults" (we cut and pasted these functions from "ffparser.pl").
Finally, we call "finish" and "disconnect".
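
As mentioned above, the description search wraps the user's term in "%" wildcards, so a search for (say) "CCleaner" effectively runs a query of the form:

SELECT AppID, AppDescrip FROM WinAppIDs WHERE AppDescrip LIKE '%CCleaner%';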

Testing

To test our scripts we will save a copy of the current ForensicsWiki AppID page as "List_of_Jump_List_IDs.html". We will then edit/remove the entry for "Win32 cmd.exe" in the last row of the last table and re-save it as "List_of_Jump_List_IDs-withoutCMD.html".
Now we can simulate an update by first running "4n6wiki-parser.pl" with "List_of_Jump_List_IDs-withoutCMD.html" and then again with "List_of_Jump_List_IDs.html". After our first run, our SQLite Database (created previously) should not contain our "Win32 cmd.exe" entry and then after the second run, it should contain our "Win32 cmd.exe" entry.

OK, here we run "4n6wiki-parser.pl" with "List_of_Jump_List_IDs-withoutCMD.html". (ie no "Win32 cmd.exe" entry)

sansforensics@SIFT-Workstation:~$ ./4n6wiki-parser.pl -file List_of_Jump_List_IDs-withoutCMD.html -db WinAppIDs.sqlite


Running 4n6wiki-parser.pl v2012-03-27


Extracted Table Size = 53 x 4
Extracted Table Size = 31 x 4
Extracted Table Size = 10 x 4
Extracted Table Size = 87 x 4
Extracted Table Size = 50 x 4
Extracted Table Size = 16 x 4
Extracted Table Size = 66 x 4
Extracted Table Size = 30 x 4
Extracted Table Size = 2 x 4
Extracted Table Size = 55 x 4
Extracted Table Size = 7 x 4
Number of Rows Extracted from HTML = 407
WinAppIDs.sqlite Update complete ...
sansforensics@SIFT-Workstation:~$


OK now we run a search for the "cmd" term in our "WinAppIDs.sqlite" database.

sansforensics@SIFT-Workstation:~$ ./4n6wiki-query.pl -descrip "cmd" -db WinAppIDs.sqlite


Running 4n6wiki-query.pl v2012-03-27


AppID | AppDescrip
6728dd69a3088f97 | Windows Command Processor - cmd.exe (64-bit)
1 Matches returned
sansforensics@SIFT-Workstation:~$

As expected, it only returns the value for the 64bit cmd.exe.
Now let's run "4n6wiki-parser.pl" with "List_of_Jump_List_IDs.html" (ie includes "Win32 cmd.exe" entry).

sansforensics@SIFT-Workstation:~$ ./4n6wiki-parser.pl -file List_of_Jump_List_IDs.html -db WinAppIDs.sqlite


Running 4n6wiki-parser.pl v2012-03-27


Extracted Table Size = 53 x 4
Extracted Table Size = 31 x 4
Extracted Table Size = 10 x 4
Extracted Table Size = 87 x 4
Extracted Table Size = 50 x 4
Extracted Table Size = 16 x 4
Extracted Table Size = 66 x 4
Extracted Table Size = 30 x 4
Extracted Table Size = 2 x 4
Extracted Table Size = 55 x 4
Extracted Table Size = 8 x 4
Number of Rows Extracted from HTML = 408
DBD::SQLite::st execute failed: columns AppID, AppDescrip are not unique at ./4n6wiki-parser.pl line 114.
DBD::SQLite::st execute failed: columns AppID, AppDescrip are not unique at ./4n6wiki-parser.pl line 114.
DBD::SQLite::st execute failed: columns AppID, AppDescrip are not unique at ./4n6wiki-parser.pl line 114.
[... Lines Edited Out]
DBD::SQLite::st execute failed: columns AppID, AppDescrip are not unique at ./4n6wiki-parser.pl line 114.
WinAppIDs.sqlite Update complete ...
sansforensics@SIFT-Workstation:~$

OK, so we are getting some errors about duplicate entries, but the last table seems to have an extra entry - 8 x 4 instead of the previous 7 x 4. This looks promising!
We'll run a query and check ...

sansforensics@SIFT-Workstation:~$ ./4n6wiki-query.pl -descrip "cmd" -db WinAppIDs.sqlite


Running 4n6wiki-query.pl v2012-03-27


AppID | AppDescrip
6728dd69a3088f97 | Windows Command Processor - cmd.exe (64-bit)
bc0c37e84e063727 | Windows Command Processor - cmd.exe (32-bit)
2 Matches returned
sansforensics@SIFT-Workstation:~$


YAY! The update worked!

Let's run a quick query using the bc0c37e84e063727 AppId (for "Win32 cmd.exe") and see what results we get:

sansforensics@SIFT-Workstation:~$ ./4n6wiki-query.pl -appid bc0c37e84e063727 -db WinAppIDs.sqlite

Running 4n6wiki-query.pl v2012-03-27

AppID | AppDescrip
bc0c37e84e063727 | Windows Command Processor - cmd.exe (32-bit)
1 Matches returned
sansforensics@SIFT-Workstation:~$


OK we found our latest entry.

Just for fun, let's check how many entries we have via "sqlite3".

sansforensics@SIFT-Workstation:~$ sqlite3 WinAppIDs.sqlite
SQLite version 3.6.16
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
sqlite> select count (AppId) from WinAppIDs;
408
sqlite>

Now let's see if that corresponds with the SQLite Manager Firefox plugin.

Opening our Database with the "SQLite Manager" Firefox Plugin

Note(1): The HTML links have been extracted into the table Source column.
Note(2): There are 408 records reported ie the same as what our previous "sqlite3" query reported. The latest record is our "Win32 cmd.exe" record.

Our last check (with "sqlite3") is to see if our database's DateAdded field is usable by SQLite (we'll only grab the first 3 records). So here are those records with our input date format:

sqlite> select AppId, AppDescrip, DateAdded from WinAppIDs limit 3;
65009083bfa6a094|(app launched via XPMode)|2011-08-22
469e4a7982cea4d4|? (.job)|2011-08-22
b0459de4674aab56|(.vmcx)|2011-08-22
sqlite>

And here's the same results with the date field transformed into "number of days since now":

sqlite> select AppId, AppDescrip, julianday('now') - julianday(DateAdded) from WinAppIDs limit 3;
65009083bfa6a094|(app launched via XPMode)|220.419525995385
469e4a7982cea4d4|? (.job)|220.419525995385
b0459de4674aab56|(.vmcx)|220.419525995385
sqlite>

So we can see that the period from 22 August 2011 until (now) 29 March 2012 has been calculated as ~220 days.
A manual check: 9 remaining days of August + 30 Sept days + 31 Oct days + 30 Nov days + 31 Dec days + 31 Jan days + 29 Feb days + 29 Mar days = 220
Coolio! Our date field is usable by SQLite.
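
It also means we can sort/filter on DateAdded directly in SQL if we ever need to - for example, something like the following should list the most recently added AppIDs first (not that I've needed it yet):

sqlite> SELECT AppID, AppDescrip, DateAdded FROM WinAppIDs ORDER BY DateAdded DESC LIMIT 5;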

Summary

OK so everything seems to work. We have extracted various fields from the ForensicsWiki AppId web page and then entered that data into an SQLite Database. We have then successfully queried that database for some information.
The code from "4n6wiki-query.pl" would be easy to cut and paste into existing scripts that currently only output the AppID, thus possibly saving analysts some time.
I'm not sure if this will be of use to any forensicators but it was interesting from a Perl programming perspective. If you think you might find it useful, please leave a comment.