Thursday, 17 July 2014

Android Has Some Words With Monkey

Be advised ... Here thar be Squirrels!

The recent NIST Mobile Forensics Webcast and SANS FOR585 poster got monkey thinking about using the Android emulator for application artefact research. By using an emulator, we don't need to "root" an Android device in order to access artefacts from the protected data storage area (eg "/data/data/"). As an added bonus, the emulator comes as part of the FREE Android Software Development Kit (SDK). Hopefully, this post will help encourage further forensic research/scripts for Android based apps.
So now we just need a target app to investigate ... "Words With Friends" (WWF) is a popular scrabble type game with chat functionality. For this post, we'll be focusing on using an Android emulator to retrieve in-game chat artefacts and then create a script to parse them (""). It's a fairly long post so you might want to take that potty break now before we begin ;)

0. Installation / Setup

On an Ubuntu 12.04 LTS (32 bit) Virtual Machine (using VMware Player), I installed the following:
- Android SDK bundle including the "eclipse" IDE (from
- dex2jar tool to convert .dex byte code into the .jar Java archive format (from
- JD-GUI Java decompiler to display the source code from a .jar file (from

Android has an official SDK install guide here which also continues on here.
Installation was as simple as unzipping the downloaded archives and launching the relevant executable.
Lazy monkey just unzipped the archives to his home directory (ie "/home/cheeky/").
Here's a quick guide:
- Go to and download the 32 bit linux ADT bundle (includes both the eclipse IDE and Android SDK tools)
- Double click the zip file and use the Ubuntu Archive Manager to unzip the bundle to "/home/*username*" (eg unzips to "/home/cheeky/adt-bundle-linux-x86-20140702/")
- Use the Nautilus File Exporer GUI to navigate to the eclipse sub-directory (eg "/home/*username*/eclipse/")
- Double click on "eclipse" icon to launch it (or you could launch it from the command line via "/home/*username*/eclipse/eclipse")
- Go to the "Window" ... "Android SDK Manager" drop menu item and launch it. Some packages are installed by default but if you want to run an emulator with a specific/previous version of Android you need to download/install that specific SDK Platform (eg 4.2.2 SDK platform) and a corresponding Hardware System Image (eg ARM for a Nexus 7 tablet).
- Unzip the downloaded dex2jar zip file contents to "/home/*username*" (eg "/home/cheeky/dex2jar-")
- Unzip the downloaded jd-gui zip file to "/home/*username*" (eg "/home/cheeky/"). Note: we only need to extract the "jd-gui" exe.

Also installed was the Bless Hex Editor (via Ubuntu Software Center) and the Firefox SQLite Manager extension (via the Firefox Add-ons Manager).

To make things a bit easier, I also setup a soft link (ie alias) so we can just type "adb" without the preceding path info to launch the Android Debug Bridge.
cheeky@ubuntu:~$ sudo ln -s /home/cheeky/adt-bundle-linux-x86-20140702/sdk/platform-tools/adb /usr/bin/adb

1. Getting the .apk app install file

Android .apk install files are zip archives. You can download them from the GooglePlay store by using a Chrome plugin or via the apk-downloader website. For this experiment however, I wanted to test the specific version from my Nexus 7 tablet (WWF 7.1.4), so I decided to use the Android Debug Bridge (adb) method.
Excellent adb instructions are available from the official Android Dev site here.

To prepare my Nexus 7 (1st gen c.2012) for the .apk file transfer, I attached it to my PC via USB cable. I then enabled the tablet's "Developer options" from the "Settings" menu by tapping the "About tablet" ... "Build number" several times. Next, I went into "Developer options" and enabled the "USB Debugging" and "Stay awake" options.

From our Ubuntu VM we can now check for connected devices/emulators ...
cheeky@ubuntu:~$ adb devices
List of devices attached
*serialnumber_of_device*    device


Note: I have redacted the serial number of my Nexus 7. Just imagine a 16 digit hex value in place of *serialnumber_of_device" ...

So now we know that the adb has recognized our physical device, let's try connecting to it ...
cheeky@ubuntu:~$ adb -s *serialnumber_of_device* shell

For squirrels and giggles, lets try to list the files in the protected "/data/data/" directory ...
127|shell@grouper:/ $ ls /data/data
opendir failed, Permission denied
1|shell@grouper:/ $

We also can't do a directory listing of "/data/app/" (where the .apk install files are located) ...
shell@grouper:/ $ ls /data/app
opendir failed, Permission denied
1|shell@grouper:/ $

Fortunately, we CAN list the installed 3rd party packages and associated .apk file by typing "pm list packages -f -3"
1|shell@grouper:/ $ pm list packages -f -3
shell@grouper:/ $

At this point, we type "exit" to logout from the physical device.
From the output of the "pm list packages -f -3" command, we know that the WWF .apk file is "/data/app/com.zynga.words-1.apk".
So we can use the "adb pull" command to copy it to our local Ubuntu VM.
cheeky@ubuntu:~$ adb -s *serialnumber_of_device* pull /data/app/com.zynga.words-1.apk wwf.apk
1252 KB/s (23648401 bytes in 18.442s)

The above command copies "/data/app/com.zynga.words-1.apk" to the current directory and names it as "wwf.apk".
I didn't feel like typing the whole long .apk filename each time (lazy monkey!) so just called it "wwf.apk".
Anyhow, a copy of the WWF apk is now stored as "/home/cheeky/wwf.apk".

2. Create/Launch emulator

Now we create an Android emulator and fire it up ...
The Android website has some detailed instructions about creating/running the emulator here, here and here.
Here's the quick version ...
- Assuming you still have the eclipse IDE open, go to the "Window" ... "Android Virtual Device (AVD) Manager" menu item and create a new device similar to the following:

Test emulator device specs
Note: Be sure to tick the "Use Host GPU" checkbox to improve emulator speed. It also helps to ensure your VM has plenty of RAM.
- Start the device emulator by selecting the "testtab" device AVD and clicking "start". Alternatively, you can launch the AVD Manager GUI from the command line instead of via eclipse ...
cheeky@ubuntu:~$ /home/cheeky/adt-bundle-linux-x86-20140702/sdk/tools/android avd

The emulator can take a minute to boot but eventually you should see something like this:

Emulator at startup

Now we can see if our emulator is recognized by typing "adb devices". Note: Our physical Nexus 7 device has been disconnected from the PC so it doesn't appear now.
cheeky@ubuntu:~$ adb devices
List of devices attached
emulator-5554    device


Update 2014-07-26:
To research Google product artefacts such as GoogleMaps and Hangouts, you can use an emulator with a Google APIs target set (eg "Google APIs - API Level 19") instead of an Android target as shown previously.

According to the official documentation, Google Play services can only be installed on an emulator with an AVD that runs a Google APIs platform based on Android 4.2.2 or higher. To be able to use a Google API in the emulator, you must also first install the target Google API system image (eg "Google APIs - API Level 19") from the SDK Manager.

3. Installing an .apk on the emulator

To install the "wwf.apk" we previously pulled off our Nexus 7 device, we type the following:
cheeky@ubuntu:~$ adb -s emulator-5554 install wwf.apk
2929 KB/s (23648401 bytes in 7.883s)
    pkg: /data/local/tmp/wwf.apk

There should now be a WWF icon in the emulator's "App" screen which we can launch by clicking on it.

WWF is now installed!

We should now see a login screen for WWF where we can provide an email address and start playing games / chatting with others (ie create lots of juicy artefacts!).
By default, the emulator retains data between emulator launches so you shouldn't lose much/any app data if/when the emulator closes (eg after a crash).

4. Capture artefact data (via adb pull and DDMS)

For squirrels and giggles, let's connect to the emulator and see what our access privileges are (remembering that they were limited on the Nexus 7 device) ...
cheeky@ubuntu:~$ adb -s emulator-5554 shell
root@android:/ #

Now lets try viewing the "/data/data/" directory ...
root@android:/ # ls /data/data
root@android:/ #

Note: We are logged into our emulator as "root" so that's why we can now see the contents of "/data/data/" :)
Let's double-check that WWF was installed OK ...

root@android:/ # pm list packages -f -3
root@android:/ #

Now we can search for WWF chat artefacts in "/data/data/" ...
Note: Because the Nexus 7 doesn't have a removable SD card, we don't have to worry about checking "/mnt/sdcard/" for app artefacts. But just keep it in mind for other devices which may support app data storage on the SD card.

Let's do a file listing of the WWF directory ...
root@android:/ # ls /data/data/com.zynga.words/                               
root@android:/ #

Looking closer at the "databases" directory ...
root@android:/ # ls /data/data/com.zynga.words/databases/                     
root@android:/ #

Hmmm ... it looks like this could contain some interesting info.

Update 2014-07-26:
By using the "lsof" command on the emulator whilst our target app is running, we can see a list of currently open files .
This can then be used to locate application artefact files (eg databases). For example, we can type:
lsof | grep com.zynga.wordswhich should also lead us to various files open in "/data/data/com.zynga.words/databases/".

We type "exit" to logout for now.
Now we can "pull" these files of interest from the emulator for further analysis. For example, I'm now going to skip ahead and pull the file where I eventually found the WWF chat artefacts ...
cheeky@ubuntu:~$ adb -s emulator-5554 pull /data/data/com.zynga.words/databases/WordsFramework
1111 KB/s (159744 bytes in 0.140s)

Alternatively, we can use the eclipse IDE and the Dalvik Debug Monitor Server (DDMS) tool to "pull" files. If our emulator is running (first, you better go catch it!), DDMS should connect to it automagically.
To launch the DDMS tool, use the following eclipse drop down menu - "Window" ... "Open Perspective" ... "DDMS"
The DDMS tool allows us to a bunch of cool things such as:
- pull/push files to the emulator
- dump process heap memory
- spoof phone calls (logs connections only / not capable of voice transmission/reception)
- send SMS text messages to the emulator
- set the GPS location of the phone
More information on DDMS is available here.

OK you should see the emulator on the LHS under "Devices" and the "File Explorer" tab on the RHS.

Dalvik Debug Monitor Server running in eclipse

Under the "File Explorer" Tab, browse to "/data/data/com.zynga.words/databases"
Then select the "WordsFramework" file and click the floppy disk icon to "pull" the file onto your Ubuntu box.

For squirrels and giggles, we can also dump the WWF process heap memory and later search it for interesting strings.
To do this, on the LHS, select the "com.zynga.words" process, toggle the "Update Heap" button (making DDMS continuously monitor the heap) and then click on the "Dump HPROF file" button (looks like cylinder with red arrow pointing down).

Dumping the process heap memory via DDMS

Next select the "Leak Suspects Report" and wait ... FYI we're not interested in the report as much as the accompanying HPROF dump.
After a while, the title bar changes to reflect that the .hprof is stored with a ridiculously long numeric filename in "/tmp/".

Obtaining the HPROF file

We can now run "strings" against this file and review ...
For example, I checked the validity of the word "twerk" using the emulator's WWF "Word Check" and then dumped the process heap. I then ran the "strings" command and piped the output to a separate text file ("8058l.txt") for easier analysis/searching.
cheeky@ubuntu:~$ strings --encoding=l /tmp/android1924397721207258058.hprof > 8058l.txt
From the text file output, I was able to observe a bunch of little endian UTF-16 strings (possibly a dictionary update) contained in memory. Opening the .hprof file separately in a hex editor also confirms this.

Viewing the HPROF in a Hex Editor shows something squirrelly ...

Entering some selected words from this list into the WWF "Word Check" confirmed that they are acceptable words.
I think we just found us a squirrel! Who's a good monkey eh? Purr ...

DDMS also has a "Network Statistics" tab view available but this was not working with my emulator. It apparently can be hit and miss according to this Google Android Issue Tracker notice.
Anyway, there's more than one way to get network stats ...
First, we login to the emulator (without WWF running) and take a baseline list of network connections via "netstat"
cheeky@ubuntu:~$ adb -s emulator-5554 shell
root@android:/ #
root@android:/ # netstat
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0*              LISTEN
 tcp       0      0 *              LISTEN
 tcp       0      0         ESTABLISHED
tcp6       0      1 ::ffff: ::ffff: CLOSE_WAIT
root@android:/ # 

According to our emulator's "Settings" ... "About Phone" ... "Status" our emulator's IP address is which we can see under the "Local Address" columns.
Now we launch WWF and then take another "netstat" ...
root@android:/ # netstat                                                      
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0*              LISTEN
 tcp       0      0 *              LISTEN
 tcp       0      0      ESTABLISHED
 tcp       0      0     ESTABLISHED
 tcp       0      0     ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp     256      0         ESTABLISHED
 udp       0      0*              CLOSE
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      1 ::ffff: ::ffff: CLOSE_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
root@android:/ # 

Lastly, we start an in-game chat, send a message and take a "netstat" ...
root@android:/ # netstat                                                      
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0*              LISTEN
 tcp       0      0 *              LISTEN
 tcp       0      0      ESTABLISHED
 tcp       0      0     ESTABLISHED
 tcp       0      0     ESTABLISHED
 tcp       0      0       TIME_WAIT
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      TIME_WAIT
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp       0      0         ESTABLISHED
 tcp       0      0        ESTABLISHED
 tcp       0      0      ESTABLISHED
 tcp     103      0         ESTABLISHED
 tcp       0      0      ESTABLISHED
 udp       0      0*              CLOSE
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      1 ::ffff: ::ffff: CLOSE_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
tcp6       0      0 ::ffff: ::ffff: TIME_WAIT
root@android:/ #

Obviously ESTABLISHED means theres a connection between our local host and a remote host.
TIME_WAIT means there WAS a connection but it is in the process of being closed. Thus you can also sometimes see previous TCP connections.
These netstat listings are snapshots at a particular time so you have to be quick and/or take several samples.

Now we can do a "whois" lookup on these IP addresses and find out a bit more info. Admittedly, these connections could be from ANY process on the emulator but as WWF is the only installed app, it's likely that it's the process responsible for these communications.
Without getting into too much detail, some of these remote IP addresses resolved back to companies such as Google, / Twitter, Yahoo, Amazon, Facebook, Softlayer Technologies and Akamai.

You could also fire up Wireshark or tcpdump and capture packets for analysis but depending on your jurisdiction this may be illegal (ie intercepting communications without consent). Anyhow, monkey won't be doing that - his delicate simian features wouldn't survive prison!

To use the DDMS Call / SMS / GPS functionality ...
Select "Window" ... "Show View" ... "Other" ... and then start typing "Emulator Control".
This should bring up a new tab where you can send SMS / simulate a voice call / set the emulator's GPS location.
Making voice call logs and sending SMS worked well with my emulator. However, I have not been able to verify that the GPS functionality works. Launching Googlemaps in the emulator's web browser and clicking on the crosshairs icon results in Googlemaps reporting "Your location could not be determined" ... *shrug*
FYI A couple of times, using the emulator control functionality also made the emulator laggy / caused a crash.

DDMS also allows users to grab screenshots from the emulator via the camera button.
However, we can also do screenshots from our emulator window via the Alt-PrintScreen button combo in a VMware environment.

5. Viewing the source code

- Open the .apk file using Ubuntu Archive Manager (or similar unzip app)
- Extract "classes.dex" (to minimize confusion I extracted it as "words-classes.dex" to "/home/cheeky/")
- Run "d2j-dex2jar.jar" from the install directory (eg "/home/cheeky/dex2jar-")
cheeky@ubuntu:~/dex2jar-$ ./ /home/cheeky/words-classes.dex
dex2jar /home/cheeky/words-classes.dex -> words-classes-dex2jar.jar

There will now be a "words-classes-dex2jar.jar" in the current direcotry (eg "/home/cheeky/dex2jar-")
- Run JD-GUI (either from command line or by double-clicking the extracted exe) and open the "words-classes-dex2jar.jar" file.
Be aware that not all of the code may be easily understandable due to obfuscation (eg class "a" has non-descriptive method/function names such as "a", "b" etc)
Anyhoo, now you can have fun basking in all things Java ... smells wonderful huh? ;)

6. Other .apk activities

We can use the "aapt" SDK tool to determine the Android permissions for the "wwf.apk" we pulled earlier ...
Note: the "aapt" tool is installed by default with the bundle and located in the latest android version sub-directory (android-4.4W in our case).
cheeky@ubuntu:~$ /home/cheeky/adt-bundle-linux-x86-20140702/sdk/build-tools/android-4.4W/aapt dump permissions wwf.apk
package: com.zynga.words
uses-permission: android.permission.INTERNET
uses-permission: android.permission.ACCESS_WIFI_STATE
uses-permission: android.permission.ACCESS_NETWORK_STATE
uses-permission: android.permission.READ_PHONE_STATE
uses-permission: android.permission.READ_CONTACTS
uses-permission: android.permission.SEND_SMS
uses-permission: android.permission.WRITE_EXTERNAL_STORAGE
uses-permission: android.permission.VIBRATE
uses-permission: android.permission.RECEIVE_BOOT_COMPLETED
uses-permission: android.permission.GET_ACCOUNTS
uses-permission: android.permission.AUTHENTICATE_ACCOUNTS
uses-permission: android.permission.USE_CREDENTIALS
uses-permission: android.permission.NFC
uses-permission: android.permission.ACCESS_FINE_LOCATION
permission: com.zynga.words.permission.C2D_MESSAGE
uses-permission: com.zynga.words.permission.C2D_MESSAGE

For a listing of possible Android permissions, see here.
For more details on the "aapt" tool, see here.

It can also be handy to explore what other files are included in an .apk.
For example, the "res" directory holds resources such as pics, sounds.
See here and here for further details on the apk archive structure and the apk building process.

Opening the "wwf.apk" file using Ubuntu Archive Manager, we also note that under the "res/raw/" directory there exists the "dict" file - that sounds pretty squirrelly eh?
Unfortunately, opening it in a Hex editor shows that it's encoded somehow :'(
So no free WWF word list for you! Hey, it was worth a shot ...

7. Creating a Chat Extraction script

OK we're both losing the will to go on, so I'll start finishing up by mentioning the WWF chat artefacts and describing the accompanying extraction script.

Where are the chat artefacts stored?
Under "/data/data/com.zynga.words/databases/" there is a "WordsFramework" file.

I discovered this by "pulling" all the files from the "databases" directory and looking at them using the Firefox SQLite Manager or the Bless hex editor.
We can also use the Linux "file" command to figure out what type of file it is.
cheeky@ubuntu:~$ file WordsFramework
wwf/WordsFramework: SQLite 3.x database, user version 220

Opening it up in Firefox SQLite Manager, we can see that theres 2 tables of interest - "users" and "chat_messages"
Here's a diagram showing how the 2 tables go together ... well, it's been abbreviated down to what monkey considers the important fields anyway ...

WWF chat schema

From our previous adventures in Python SQLite (eg Facebook Messenger post), we know how to query/extract this kind of stuff. Here's the query that gets us the chat artefacts ...
SELECT chat.chat_message_id, chat.game_id, chat.created_at,, chat.message, chat.user_id, users.email_address, users.phone_number, users.facebook_id, users.facebook_name, users.zynga_account_id
FROM chat_messages as chat, users
WHERE users.user_id = chat.user_id ORDER BY chat.created_at;

The script ("") opens the specified "WordsFramework" file, runs the above query and prints out the chat messages in chronological order.
If there's multiple game conversations going on, the analyst can (manually) use the "game_id" to filter out conversations from the TSV output.

It has been developed/tested for Python 2.7.3 on a 32-bit Ubuntu 12.04 LTS VM.
It can be downloaded from Github here.

Making the script executable (via "sudo chmod a+x") and running it with no arguments shows the help text.
cheeky@ubuntu:~/wwf$ ./
Running wwf-chat-parser v2014-07-11
Usage: -d wordsframework_db -o chat_output.tsv

  -h, --help    show this help message and exit
  -d FRAMEWKDB  WordsFramework database input file
  -o OUTPUTTSV  Chat output in Tab Separated format

Here's what the command line output looks like when using the "WordsFramework" file from our emulated chat ...
cheeky@ubuntu:~/wwf$ ./ -d /home/cheeky/WordsFramework -o wwf-output.tsv
Running wwf-chat-parser v2014-07-11

Extracted 9 chat records


With that many columns returned by the query, outputting to the command line just looked too crappy/confusing.
So instead, the script creates a Tab Separated (TSV) file for the output instead.
Here's what the TSV output file would look like if imported into an LibreOffice Calc spreadsheet ...

TSV output of ""

Some comments/observations:
- Data is fictional and is included just to illustrate which fields are extracted. Any id numbers and names have been changed to protect the Simians.
- The "created_at" times appear to be referenced to GMT (Don't trust me ... verify it for yourself).
- The "zynga_account" and "email_address" fields are only populated for the device owner (ie "emulator-monkey"). The opponent's corresponding details aren't populated.
- The "phone_number" field does not appear to be populated at all (FYI the emulator's "Settings" ... "About phone" ... "My phone number" is 1-555-521-5554).
- The highlighted rows 5-6 are from a chat between "emulator-monkey" and "3rd-party-monkey" and show how it is possible to use the "game_id" value to determine conversation threads.
- Monkey does not use Facebook so the Facebook ID and name are included for completeness but not tested. It is apparently possible to use your Facebook login to login in to WWF.
- This script relies on allocated chats (ie chats from active games). There might be chat strings still present in the "WordsFramework" database from previous completed/expired games but I haven't had time to research this area. Running "strings" on the WordsFramework file and/or opening it in a Hex editor might help in that regard.

8. Resources

Here are some resources that I found useful while researching for this post ...
The Official Android SDK documentation (eg how to install the SDK, run the emulator, install the app etc.)

Cindy Murphy's (@CindyMurph) webcast/slides on reversing Android malware

Pau Oliva Fora's (@pof) RSA presentation on reverse engineering Android Apps

Thomas Cannon's (@thomas_cannon) blog post on reverse engineering Android
and his "Gaining access to Android" presentation from DEFCON 20

Lee Reiber's (@Celldet) Forensic Focus webcast on malware detection

9. Final Words

We have been able to use the increased privileges of the Android emulator to uncover and harvest Android application artefacts.
This can be used by researchers to develop forensic extraction scripts without requiring actual rooted physical devices (or expensive commercial forensic tools).

Hopefully, this post will help to address the mobile device "app gap" that currently exists between commercial forensic tools and the sheer number of apps available on the GooglePlay market.
But if not, at least we got to chase some squirrels and learn some new things about WWF chats.

Some initial research shows that Microsoft's Windows Phone emulator cannot currently be used in a similar manner because the MS emulator does not allow you to load Windows marketplace apps onto it. It seems like it's mainly for testing apps that you write yourself. It has trust issues apparently.
As for iOS, monkey doesn't have access to OS X or any Apple devices (shocking!) so can't say whether the iOS emulator supports loading apps from the store and/or is "jailbroken" by default.

OK due to time/space constraints and a tired monkey ("What do you mean Red Bull doesn't come in Banana flavour?!"), we're gonna stop here ... There are probably a few squirrels that escaped but at this point, something is better than nothing eh?

Friday, 13 June 2014

Monkeying around with Windows Phone 8.0

Ah, the wonders of Windows Phone 8.0 ... Failing eyesight, Frustration and Squirrel chasing

Currently, there is not much freely available documentation on how Windows Phone 8.0 stores data so it is hoped that the information provided in this post can be used as a stepping stone for further research / possible scripting. Hopefully, analysts will also be able to use this post to help validate any future tool results.

Special Thanks to Detective Cindy Murphy (@CindyMurph), Lieutenant Jennifer Krueger Favour (@rednogn) and the Madison Police Department ("Forensicate Like A Champion!") for providing the opportunity and encouragement for this research.
Unfortunately, due to time contraints and a limited test data set, I wasn't able to write an all-singing/all-dancing script. Instead, some one-off scripts were created to extract/sort the relevant data a lot quicker than it would have taken to do manually. Rather than releasing scripts that are customized for a limited set of test data (which I don't have easy access to any more) - this post will be limited to documenting the data sources/structures.

OK, so no free tool and you're still here reading huh? In Yoda voice: "The nerd runs strong in this one" ;)

Thanks to Maggie Gaffney from Massachusetts State Patrol / Teel Technologies, the initial test data (.bin file) was sourced via JTAG from a Nokia 520 Windows 8.0 phone - a "cheap" smart phone common to prepaid plans. The .bin file was then opened in X-Ways Forensics to parse the 28(!) file system partitions and to export out files of interest. The exported files were then viewed in hex view using Cellebrite Physical Analyzer (love the data interpretation and colour coded bookmarking!). Later, we were also able to get our paws on some test data from a HTC PM23300 Windows Phone 8.0 phone courtesy of JoAnn Gibb from the Ohio Attorney Generals Office. It's awesome knowing people that know people!

Note: The Nokia 520 does not display the full SMS timestamp info (threaded messages display date only).
So while we can potentially re-create the order of threaded messages as per the test phone, we can't easily validate the exact time an SMS message was sent/received. There's a good chance that other Windows Phone 8.0 phones will use the same timestamp mechanism and hopefully they will display the full timestamp.

So where's the data?!

The SMS content, MMS file attachment info and Contacts information are stored (via the 28th Partition) in:


Various .dat files containing MMS content are also stored in sub-directories of:


The Call log is stored in:


The "store.vol" and "Phone" files seem to be ESE Databases (see explanantions here and here) with the magic number of "xEF xCD xAB x89" present at bytes 4-8. Consequently, we tried opening "store.vol" using Nirsoft's ESE Database viewer but had limited success - the SMS message texts were not viewable however other data was. This suggests that maybe the "store.vol" file differs in some way from the ESE specification and/or the tool had issues reading the file.
Joachim Metz has also both documented (here  and here) and written a C library "libesedb" to extract ESE databases. Unfortunately, I didn't discover Joachim's library until after we started poking around .. Anyway, it was a pretty masochistic interesting exercise trying to reverse engineer the "store.vol" file. One possible benefit of this data diving is that it *might* also reveal unallocated/partially overwritten data records that might be ignored by libraries which read the amount of data declared (vs reading all the data present). This is pure speculation though as I don't know if old records are overwritten or just marked as invalid.

Viewing "store.vol" using Cellebrite Physical Analyzer, relevant data was observed for text strings (eg phone numbers, SMS text strings) encoded in UTF-16 LE throughout the file.
As a database file there will be tables. Each table will have columns of values (eg time, text content, flags). A single (table row) record will thus have data stored for each column.
Table data will be organized within the file somehow (eg multiple SMS records organized into page blocks). So it is likely that finding a hit for a specific SMS will lead you to the contents of other SMS messages (potentially around the same timeframe).

The Nokia 520 was actually locked with a 4 digit PIN when we started investigating. Without access to the phone, any manual inspection/validation would have been impossible. It was unknown if the phone would have been wiped if too many incorrect PINs were entered. So any guesses would have to be documented and carefully chosen. It wasn't looking good ... until a combination of thinking outside the box and a touch of luck lead us to an SMS text message (in "store.vol") with the required 4 digit code. Open sesame!

Some things we tried with the data ...

To find specific SMS records we searched for unique/known strings from the SMS text (eg "Look! A Squirrel!"). A single record was found per SMS in "store.vol" and each record also contained a UTF-16-LE string set to "SMStext".

To find contact information, we searched for known phone number strings (eg +16085551234, 123456, 1234). Some numbers were observed in "store.vol" in close proximity to "SMStext" strings while other instances were located close to what appeared to be contact information (eg contact names).

To search for field markers and flags, we compared separate SMS text records and looked for patterns/commonalities in the hex. Sometimes the pattern was obvious (eg "SMStext" occurs in each SMS message) and sometimes it wasn't so obvious (sometimes there is no discernible pattern!).

Figuring out the timestamp format being used was HUGE. Without it, we could not have figured out the order messages were sent/received. Using Cellebrite Physical Analyzer to view the "store.vol" hex, Eagle-eyed Cindy noticed that there were 8 byte groupings occurring before/after the SMS text content. These 8 bytes were usually around the same value range (eg in LE xFF03D2315FE1C701). Which is what you'd expect within a single message. Subsequent messages usually had larger values - which corresponds to messages sent/received at a later time.
Like most hex viewers, Cellebrite Physical Analyzer can interpret a predefined number of bytes from the current cursor position and print a human friendly version. Using this, Calculon Cindy showed an otherwise oblivious monkey that these 8 byte groupings could be interpreted as MS FILETIME timestamps! To be honest, I was expecting smaller 4 byte timestamps - Silly monkey!
By comparing the 8 byte values surrounding a specific SMS text message (eg "Look! A Squirrel!") with the date displayed on the phone for that message, we theorized that our mysterious timestamps were *probably* MS FILETIME timestamps (No. of 100 ns increments since 1 January 1601 in UTC). For example, xFF03D2315FE1C701 = Sat, 18 August 2007 06:15:37 UTC. As the phone did not display the exact time for each SMS, we could only use the order of threaded messages and the date displayed to somewhat confirm our theory. Various SMS sent/received dates on the phone were spot checked against a corresponding "store.vol" entry timestamp date and the date values consistently matched.

What the data looks like

After some hex ray vision induced cross-eyedness (who knew that looking at hex is almost like a curse!), we think we've figured out some general data structures for SMS, MMS, Contacts and Call log records. There's still some unknowns/grey areas but it's a start.

- On the data structure diagrams below, "?" is used to denote varying/unknown number of bytes.
- FILETIMEs are LE 8 byte integers representing the number of 100 ns intervals since 1 JAN 1601.
- In general, strings are null terminated and UTF-16-LE encoded (ie 2 bytes per character).

Sent / Received SMS records

There are two types of SMS data structures which are mixed together. Each type of SMS structure contains a UTF-16-LE encoded string for "SMStext". However, one type contains phone number strings and the other does not.
For later ease of understanding, we'll say these "SMStext" records occur in "Area 1".
Initially, monkey was confused about why some SMS records had phone numbers and some didn't. However, by inspecting the unlocked phone, we were able to confirm that the SMS message records with no number corresponded to sent SMS.

Sent "SMStext" record (from Area 1 in "store.vol")

Note 1: Note the lack of Phone number information. From test data, FILETIME values (in red and pink) seemed a little inconsistent. Sometimes FILETIMEs within the same record matched each other and other times they varied by seconds/minutes.
Note 2: The Sent Text string (in yellow) is null terminated and encoded in UTF-16-LE.

Received "SMStext" record (from Area 1 in "store.vol")

Note 1: Received SMS have multiple source phone number strings listed (in orange). These seem to remain constant within a given record (eg PHONE1 = PHONE2 = PHONE3)
Note 2: Similar to Sent "SMStext" records, the FILETIMEs (in red and pink) within a record might/might not vary.
Note 3: The Received Text string (in yellow) is null terminated and encoded in UTF-16-LE.

To find out the destination phone number for a sent SMS we can make use of the factoid observed by searching "store.vol" for the FILETIMEs from a specific Sent "SMStext" record.
It appears that FILETIMEs 1, 3 & 4 (in pink) from a given Sent "SMStext" record usually occur once in the entire "store.vol". The FILETIME2 value (in red) however, also appears in a second area ("Area 2"). This area has a bunch of different looking data records each containing the null terminated UTF-16-LE encoded string for "SMS". Also contained in each data record is a phone number string. The "Area 2" SMS records look like:

"SMS" record (from Area 2 in "store.vol")

Note 1: Each "SMS" record contains a UTF-16-LE encoded string for "SMS".
Note 2: From both sets of test data, there seems to be a consistent number of bytes between:
- The FILETIMEX (in red) and "SMS" string (in kermit green) and
- The "SMS" string (in kermit green) and the Phone number string (in orange).

So, each sent "SMStext" FILETIME2 value (from Area 1) should have a corresponding match with an "SMS" record's FILETIMEX value (in Area 2). In this way, we can match a sent "SMStext" message with the destination phone number via the FILETIME2 value. Sounds a little crazy right? But the test data seems to confirm this. Purrr!


Contact information is also located in "store.vol". There were 2 observed data structure types - both contained phone number and name information however, one data type had an extra 10 digit number string. It was later discovered via phone inspection that the records with the extra 10 digit strings corresponded with "Hotmail" address book entries. It would be interesting to see if the 10 digit number corresponded to a unique hotmail user ID of some kind.
The second type of contacts structure was a "Phonebook" entry - presumably these contact types were entered into the phone by the user rather than slurped up from a Hotmail account.
Common to both contact records were multiple occurrences of the same contact name and phone number. OCD phonebook, OCD phonebook, OCD phone book ... ;)

"Hotmail" Contacts record (from "store.vol")

"Phonebook" Contacts record (from "store.vol")

Note 1: The flag value (in red) which can be used to determine if the contact record is a "Hotmail" or "Phonebook" entry.
Note 2: The potential 6 byte magic number (0xFFFFFF2A2A00) for Contact records should make it easier to find each entry. This was discovered by Sharp-eyed Cindy on the last day (by which time monkey had lost the will to live).
Note 3: The 10 digit string (in pink) could be a potential Hotmail ID.

MMS data

Further research is required for MMS records (eg linking timestamps and phone numbers to sent files). But here's what we've learned so far ...
Various .dat files containing MMS content (eg there was a .dat file containing a sent JPEG and another .dat file containing the accompanying text) are stored in:


under 3 sub-directories: "0", "2" and "7". These folders might correspond to Sent, Received and Draft???
There were multiple .dat files with similar names each seemingly containing info for different parts of the same MMS.

In "store.vol", there are records containing the UTF-16-LE encoded string for "MMS". These records also contain 3 filename strings and a filetype string (possibly the MIME type eg "image/jpeg"). From my jet-lagged memory, I want to say that the filename strings were pointing to the same filename and there were multiple "MMS" entries for a single MMS message (ie each MMS message has three separate files associated with it). But you should probably should check it out for yourself ...

MMS record (from "store.vol")

Call log

The Call log information is located in the "Phone" file. Each Call log record contains a flag (in blue) to mark whether a call record is Missed / Incoming / Outgoing. The flag values were confirmed via inspection of the phone and corresponding Call log record. There's also Start and Stop FILETIMEs, repeated contact names and repeated phone numbers.
Of potential interest is a 10 digit ASCII encoded string (in grey) and what looks to be a GUID (in light purple). Each call record had the same GUID string value enclosed by "{}". Perhaps this GUID represents the phone device or the calling application??? I wonder if it would be consistent between different model phones...

Call log record (from "Phone")


So there you have it - we started off knowing very little about Windows Phone 8.0 data storage and now we know a considerable amount more especially regarding SMS records.
Due to time constraints, it was not possible to investigate the non-SMS related data areas (ie MMS, Call log, Contacts) with the same level of detail. However, it's probably better to share what we've discovered now as I don't know when I'll be able to perform further research.
The observations in this post may not be consistent for Windows 8.1 and/or on other models of Windows phones but hopefully this post can still serve as a starting point. As always, check that the underlying data matches your expectations!

It was really awesome having someone else to bounce ideas off when hex-diving. I'm pretty sure I would have missed some important details (eg the FILETIME timestamp) had it not been for another set of eyes. Of course, that's not always going to be possible so I also appreciated the other opportunities to work automonously / with minimal supervision. Someday monkey might have to do this on his lonesome! :o
Initially, it was easy to tie my idea of success with the "I have to code a solution for every scenario/data set". It would have been awesome if I could have done that but the fact was - we didn't have any SMS messages from "store.vol" at the start and after running the one-off SMS script, we had 5000+ messages sorted in chronological order with their associated phone numbers. Success doesn't have to be black and white. It sounds cliche but focusing on little wins each day made it easier to start eating the metaphorical elephant. Now please excuse me, while I adjust my pants ...

Thursday, 23 January 2014

Android SMS script update and a bit of light housekeeping

Knock, Knock ...

During recent research into Android SQLite databases (eg sms), Mari DeGrazia discovered a bug in the script.
Mari's test data was from a Samsung Galaxy S II. It turns out the script wasn't handling Cell Header "Serial Type" values of 8 or 9.
These Cell Header values are respectively used to represent "0" and "1" integer constants and eliminate the need for a corresponding 0x0/0x1 byte value in the Cell Data field section.
So this meant that some fields were being interpreted as "0" when they were actually set to "1". DOH!

The previous Android test data I used did not utilize these particular cell header values which is why it escaped my monkey like attention to detail. Banana? Where?!

Anyway, there's an updated version of the script available from GitHub here.

Pictures speak louder than words so lets look at a simplified example of an SQLite cell record:

SQLite Cell Record Structure

From the diagram above, we can see the usual SQLite record format. A Cell Size, Rowid and Cell Header Size followed by the rest of the Cell Header and the Cell Data sections.
Notice how HeaderField-B = 0x8? This means there will be no corresponding value written in the Cell Data section (ie there is no DataField-B).
When read, the extracted value of DataField-B will be set (to 0) based on the HeaderField-B type (0x8).
Alternatively, if the HeaderField-B type value was 0x9, the extracted value of DataField-B would be set to 1.

Additionally, since the previous post here - both Mari and I have used to carve sms messages from a cellphone's free space.
Here's how it played out:
- Cellebrite UFED was used to generate the .bin physical image file(s) from an Android phone.
- Then the .bin file(s) were added to a new X-Ways Forensics case.
- A keyword search for various phone numbers turned up multiple hits in the 1 GB+ "Free Space" file (ie unallocated space) which was then exported/copied to SIFT v2.14.
- The script's schema config file was adjusted to match the sms table schema.
- After trying the script with a 1GB+ file, we were consistently getting out of memory errors (even after increasing the SIFT VM RAM to 3 GB).
So the Linux "split" command was used to split the 1GB+ file into 3 smaller 500 MB files.
This ran error free although it meant running the script a few more times. Meh, still better than doing it by hand!
As mentioned in a previous post, this script can potentially be used with non-sms SQLite databases especially if the search term field appears near the start of the cell data section.

From now on, all of my scripts will be hosted at GitHub. I'm not sure how much longer GoogleCode will keep my existing scripts so I have also transferred most of those to GitHub.
Because I can no longer update on GoogleCode, I have removed the previous version to minimize further confusion.

Apologies for any inconvenience caused by this script oversight and Special Thanks to Mari for both spotting and letting me know about the error!

Monday, 13 January 2014

Facebook / Facebook Messenger Android App Parser Script

Poorly drawn parody of the Faceoff movie poster

Not satisfied with how your forensic tools are currently presenting Facebook (v3.3 for Android) / Facebook Messenger (v2.5.3 for Android) messages and contacts?
Would you also like a GoogleMaps URL that plots each message using available geographic metadata?
Or maybe you're just curious about how Facebook / Facebook Messenger stores contacts/messages on Android devices?
If so, read on! If not, then there's nothing to see here ... move along.

This Python script is the brainchild of Shafik Punja (@qubytelogic).  I recently contacted him regarding mobile device script ideas and he quickly got back to me with sample Android test databases and cell phone screenshots (for validation). Without his assistance and feedback, this script would not be here today. So, Thankyou Shafik! The Commonwealth of Forensic Monkeys salutes you!

It's also pretty fortunate timing because Heather Mahalik (@heathermahalik) recently gave an awesomely detailed SANS webcast about Data Retention on Android/iOS devices. In her talk she covered where to look for various application data artefacts and also mentioned a few fun Facebook facts.

BTW I don't use Facebook / Facebook Messenger (monkey has no social life) and no one in their right mind would volunteer their personal data for this blog. So just for you ingrates, I made up a test scenario involving 3 muppets and several Facebook messages.
Due to time contraints, I have only fabricated the script relevant data fields just so we had something to print out.
Any id's I use (hopefully) do not correspond to valid Facebook accounts.
Your own data will probably have more populated fields/longer field lengths. Meh.

The script has been developed/tested on SANS SIFT v2.14 running Python 2.6.4. It has also been run successfully on Win7x64 running Python 2.7.6. You can download it from my Google Code page here.

UPDATE 2014-02-05 Two potential issues have arisen with this code.
1. emoji/non-printable characters in the message text may cause the script to crash. Currently, I do not have the Android test data to verify this.
2. If the keywords "to" or "from" are in the message text and it is subsequently used in a GoogleMaps URL, the URL will not plot properly. This is because GoogleMaps interprets these keywords as routing instructions. I can't think of a way around this without changing the text for the GoogleMaps plot.

Data, Data ... where's the data?

For sanity's sake, I am limiting the scope of this post to actual message content and contacts information. There's a crapload of databases/tables that Facebook uses so I had to draw the line somewhere (even if it's in crayon!). From Shafik's test data, there are 3 tables ("contacts", "threads" and "messages") that we are going to extract data from. These tables are stored in 2 separate SQLite database files ("contacts_db2" and "threads_db2").

The "contacts" table
The Facebook app for Android (katana) stores it's "contacts" table data in:

Notice there's no file extension but it's actually SQLite.

Similarly, the Facebook Messenger app for Android (orca) stores it's "contacts" table data in:

Notice how the filenames are the same?
If you compare their table schemas, you will find that they are identical.
Using the SQLite Manager plugin for Firefox on SIFT v2.14, I opened both "contacts_db2" files and checked the "contacts" table schema (found under the Structure tab).

Facebook App (katana) / Facebook Messenger App (orca) "contacts" table schema:

The "data" column is actually a JSON encoded series of key/value pairs. JSON (JavaScript Object Notation) is just another way of exchanging information. See here for further details.

Using the SQLite Manager Firefox plugin, our fictional muppet test scenario "contacts" table looks like:

Muppet test data "contacts" table

Note: If you hover your cursor over the data cell you're interested in, it brings up the neat yellow box displaying the whole string. So you don't have to waste time re-sizing/scrolling.
If it makes it easier, you can also copy the data cell string and use the JSON validator here to pretty print/validate the string.

In addition to the "contact_id" column, the script extracts/outputs the following JSON fields from the "data" column:

The "PictureUrl" values were usually observed to be based at the "" domain.
The "timelineCoverPhoto" values were usually observed to be based at the "" domain.
For the muppet test scenario data, I've just used picture URLs from wikipedia.

The "threads" table

The Facebook app for Android (katana) stores it's "messages" and "threads" table data in:

Similarly, the Facebook Messenger app for Android (orca) stores it's "messages" and "threads" table data in:

For the "threads" table, the Facebook / Facebook Messenger schemas are identical.

Facebook App (katana) / Facebook Messenger App (orca) "threads" table schema :
CREATE TABLE threads (thread_id TEXT PRIMARY KEY, thread_fbid TEXT, action_id INTEGER, refetch_action_id INTEGER, last_visible_action_id INTEGER, name TEXT, participants TEXT, former_participants TEXT, object_participants TEXT, senders TEXT, single_recipient_thread INTEGER, single_recipient_user_key TEXT, snippet TEXT, snippet_sender TEXT, admin_snippet TEXT, timestamp_ms INTEGER, last_fetch_time_ms INTEGER, unread INTEGER, pic_hash TEXT, pic TEXT, can_reply_to INTEGER, mute_until INTEGER, is_subscribed INTEGER, folder TEXT, draft TEXT )

For the "threads" table, we are only interested in the "thread_id" and "participants" columns.
The "thread_id" can be used to group all the messages from a particular conversation.
Later, we will use the "thread_id" to link "messages" table entries with the "participants" of that thread.
The "participants" column is formatted in JSON and looks something like:
[{"email":"","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog","mute":0,"lastReadReceiptTimestampMs":0},{"email":"","user_key":"FACEBOOK:1087654322","name":"Rowlf","mute":0,"lastReadReceiptTimestampMs":0}]

The script currently only extracts/prints the "name" data field. It is currently left to the analyst to match these "name" values with the "displayName" fields extracted from the "contacts" table mentioned previously.

Here's a screenshot of our fictional muppet "threads" table:

Muppet test data"threads" table

The "messages" table
The Facebook / Facebook Messenger apps' "messages" table schemas differ by one column - Facebook Messenger's "messages" table has an extra column called "auto_retry_count". We're not going to extract this field anyway so our idea of using one extraction script for both apps is still viable. Phew!

Facebook App (katana) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING )

Facebook Messenger App (orca) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING, auto_retry_count INTEGER )

For our test scenario, we will be using the Facebook Messenger App schema (orca) for the "messages" table. It should not matter either way.
Our fictional muppet test scenario "messages" table looks like this:

Muppet test data "messages" table

Note: This screenshot does not show all of the column values, just the script relevant ones (ie "msg_id", "thread_id", "text", "sender" (JSON formatted), "timestamp_ms", "coordinates" (JSON formatted), "source").

The "msg_id" is a unique identifier for each message stored in the table.
The "thread_id" is used to group messages from the same conversation thread.
The "text" string stores the message's text. Note: For formatting reasons, the script converts any "/r/n" and "/n" to spaces.
The "sender" column JSON looks like:
{"email":"","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog"}
From testing observations, this "name" field should correspond to a "displayName" JSON field from the "contacts" table.

The "timestamp_ms" column seems to be the ms since 1JAN1970 in UTC/GMT. It was verified by comparing the message timestamps with screenshots taken from the test Android phone. The test phone displayed the local time of this timestamp.

The "coordinates" column JSON looks like:
Sometimes this column was blank, while other times there were only values defined for latitude/longitude/accuracy.

The "source" column values have been observed to be "messenger", "chat", "web", "mobile".
At this time, I don't know what all of the values indicate. Further testing is required as the Messaging Help from Facebook does not mention this "source" field. Boo!
It's probably safe to say that "messenger" indicates the Facebook Messenger app (not sure if this includes the mobile versions).
The "chat" probably indicates the source being the chat sidebar from within a browser.
The "mobile" possibly indicates the source is a Facebook app running on mobile device (eg Android/iPhone).
The "web" could mean a "Facebook message" was sent from a browser logged into Facebook?
There is also a Firefox add-on for sending Facebook messages but it's unknown which category this would fall under.
BTW if you know what all these values stand for please let us know via the comments section!

So putting it all together ... here's our script relevant data all nicely mapped out for you :)

Facebook messaging schema

Note: The JSON encoded data fields are highlighted in blue ("Columbia Blue" according to Wikipedia for all you interior decorator types :). The remaining uncoloured fields are either text or numeric in nature.

From the diagram above, we can use the "thread_id" to match "participants" (senders/receivers) to a particular thread (collection of messages). See the red link in the diagram.
As mentioned earlier, we can also link the "messages" table's "sender" column ("name") back to an entry in the "contacts" table's "data" column ("displayName"). See the yellowy-orange link in the diagram above.

Due to "sender" columns sometimes being blank, the script does not currently do this automagically (Drats!). Instead, it is suggested that the analyst manually matches each participant "name" from the extracted contacts output using the contacts "displayName" field.
From the test data supplied, these two fields seem to correspond. Future versions of the script could also print the "user_key" field in case there are multiple contacts with the same "displayName"s.

How the script works ...

OK, enough about the data. Let's see what the script does eh?

The script connects to the given "threads_db2" and "contacts_db2" SQLite files and runs queries to extract the stored contacts and messages.
It then sorts/outputs these values to the command line and optionally to the nominated Tab Separated Variable files.
The script converts the "timestamp_ms" column values into the form YYYY-MM-DDThh:mm:ss.
If a message has latitude/longitude data, it will also provide a plot of the position via a GoogleMaps URL. The message text and timestamp are also included on the plot.

In case you were wondering about the SQLite query the script uses to extract the messages ...
select messages.msg_id, messages.thread_id, messages.text, messages.sender, threads.participants, messages.timestamp_ms, messages.source, messages.coordinates from messages, threads where messages.thread_id=threads.thread_id order by messages.thread_id, messages.timestamp_ms;

And for the contacts ...
select contact_id, data from contacts;

To make things easier, Python has some existing libraries we can use:
sqlite3 (for querying the SQLite files)
json (for converting the JSON strings to a Python object we can parse)
datetime (for converting "the timestamp_ms" field into a readable date/time string)
urllib (used to ensure our GoogleMaps URL doesn't contain any illegal characters ... it makes 'em an offer they can't refuse!)

On SIFT v2.14, I've used the following command to make the script directly executable (ie no need to type "python" before the script name):
sudo chmod a+x

Here's the help text ...

sansforensics@SIFT-Workstation:~$ ./
Running fbmsg-extractor v2014-01-08 Initial Version
Usage: -t threads_db -c contacts_db -x contacts.tsv -z messages.tsv

  -h, --help      show this help message and exit
  -t THREADSDB    threads_db2 input file
  -c CONTACTSDB   contacts_db2 input file
  -x CONTACTSTSV  (Optional) Contacts Tab Separated Output Filename
  -z MESSAGESTSV  (Optional) Messages Tab Separated Output Filename

And here's what happens when we run it with our muppet test data ...

sansforensics@SIFT-Workstation:~$ ./ -t facebook/test/threads_db2 -c facebook/test/contacts_db2 -x muppet-contacts.txt -z muppet-messages.txt
Running fbmsg-extractor v2014-01-08 Initial Version

Extracted CONTACTS Data

contact_id    profileFbid    displayName    displayNumber    universalNumber    smallPictureUrl    bigPictureUrl    hugePictureUrl    timelineCoverPhoto

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMwo=    1087654323    Fozzie Bear    (555) 555-0003    +15555550003

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTAwMDAwOTg3NjU0MzIxCg==    100000987654321    Kermit The Frog    NA    NA

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=    1087654321    Miss Piggy    (555) 555-0001    +15555550001

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMgo=    1087654322    Rowlf    (555) 555-0002    +15555550002

Extracted MESSAGES Data

msg_id    thread_id    text    sender    participants    timestamp_ms    source    latitude    longitude    accuracy    heading    speed    altitude    googlemaps
m_id.123456789012345678    t_1234567890abcdefghijk1    Hi-ho! You coming to the show?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:45:03    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28Hi-ho%21+You+coming+to+the+show%3F+%402014-01-03T23%3A45%3A03%29&iwloc=A&hl=en

m_id.123456789012345679    t_1234567890abcdefghijk1    Yes Kermie! Just powdering my nose ...    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:49:05    mobile    33.802399    -117.914954    1500.0    NA    NA    NA,+-117.914954+%28Yes+Kermie%21+Just+powdering+my+nose+...+%402014-01-03T23%3A49%3A05%29&iwloc=A&hl=en

m_id.123456789012345680    t_1234567890abcdefghijk1    So ... At IHOP again huh?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:50:05    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28So+...+At+IHOP+again+huh%3F+%402014-01-03T23%3A50%3A05%29&iwloc=A&hl=en

m_id.123456789012345683    t_1234567890abcdefghijk1    More Pork Rolls for you to love!    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:50:45    mobile    33.802399    -117.914954    1500.0    NA    NA    NA,+-117.914954+%28More+Pork+Rolls+for+you+to+love%21+%402014-01-03T23%3A50%3A45%29&iwloc=A&hl=en

m_id.123456789012345689    t_1234567890abcdefghijk2    Yo Fozzie! Where u at?    Kermit The Frog    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:13    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0,+-117.918157+%28Yo+Fozzie%21+Where+u+at%3F+%402014-01-03T23%3A47%3A13%29&iwloc=A&hl=en

m_id.123456789012345690    t_1234567890abcdefghijk2    Hey Kermie! I'm almost BEAR ! Wokka!Wokka!Wokka!    Fozzie Bear    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:43    mobile    33.808227    -117.918948    12.0    90.0    1.0    0.0,+-117.918948+%28Hey+Kermie%21+I%27m+almost+BEAR+%21+Wokka%21Wokka%21Wokka%21+%402014-01-03T23%3A47%3A43%29&iwloc=A&hl=en

4 contacts were processed

6 messages were processed


As you can see, the script prints out the contacts information first followed by the message data. Columns are tab separated but when dealing with large numbers of contacts/messages, the command line quickly becomes unreadable. It is HIGHLY recommended that analysts utilize the output to TSV functionality.

Here's what the outputted TSV data looks like after being imported into a spreadsheet program:

Script's Output TSV for Muppet test data contacts

Script's Output TSV for Muppet test data messages

Contacts are sorted alphabetically by "displayName".
Messages are sorted first by thread, then in chronological order (using the "timestamp_ms" value).

Not all messages will have defined geodata. Some may be blank or only have lat/long/accuracy with no speed/heading/altitude.
CAUTION: Not sure what the units are for the accuracy/speed/heading/altitude
In general, the script outputs the string "NA" if there is no defined value.

Just for shiggles, lets plot Miss Piggy's position for her first reply back to Kermit (ie "Yes Kermie! Just powdering my nose ...") using the GoogleMaps URL from the messages TSV.

Where's Piggy?

From the example screenshot, we can see the message text and timestamp plotted along with her position in GoogleMaps. Somebody's telling porkies eh?

Some Other Trivia ...

Format of "contact_id"
The funky looking "contact_id" (eg "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=") from the "contacts" table is actually base64 encoded. Looking closer at the letters and numbers comprising the "contact_id", we can see an "=" character.
I remembered seeing similar strings in base64 encoded emails ... so just for shiggles, I tried decoding it via the "base64" command.
Here's a fictional demo example:

sansforensics@SIFT-Workstation:~$ echo 'Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=' | base64 --decode

The decoded format appears to be "contact:XXX:YYY"
XXX = remains constant for each "contact_id" and corresponds to /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml's "/auth/user_data/fb_uid" value. It's believed to be used as a unique user id for the Facebook account user.
YYY = Seems to be a user id field for the stored contact (ie equals their "profileFbid" value).

Don't believe everything in the "contacts" database however. Heather Mahalik mentioned in her SANS webcast that Facebook can add contact entries when the app suggests potential friends. Consequently, stored messages should be used to indicate whether a contact entry is someone the account holder has communicated with.

XML files
Be sure to also check out the various XML files containing potentially relevant info (such as username, times, account info). You can find these under:

Even though they share the same filename, /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml differs from the /data/data/com.facebook.katana/shared_prefs/com.facebook.orca_preferences.xml.
In addition to having a different order of declarations, the katana version mentions what appears to be the user's email address.

Other databases
Also check out the "prefs" database tables for username, times, account info. This can be found under /data/data/com.facebook.katana/databases/prefs_db and /data/data/com.facebook.orca/databases/prefs_db.

Facebook Messaging Documentation
Just in case you're as semi-oblivious to Facebook messaging as I am, here's some messaging help I found from the Facebook website.

When you send someone a message, it gets delivered to the person’s Facebook Messages.
If the person you messaged has turned chat on, your message will appear as a chat. If they have chat off, the message will appear in their message inbox and they will receive a notification

Chat and message histories are threaded together — you can think of them as one and the same. When you open a conversation, you’ll see a conversation that includes all your messages along with your entire chat history. If you send a chat message to a friend who has turned chat off, the chat message will be sent directly to their message inbox.

Can I message my mobile contacts if we’re not Facebook friends?
Yes. Confirming your phone number when you first sign in helps ensure that your contacts will be able to find you. Messenger works similar to texting or other mobile messaging apps, and you can add people to your Messenger contacts by entering their phone number.
To allow people who have your phone number to reach you in Messenger, the app will ask you to set the control called "Who can look you up by the phone number you provided?" to Public.

Can I message friends who aren’t using the Facebook Messenger mobile app?
Yes. People who don't have the Facebook Messenger app on their phone will receive chats and messages you send whenever they log into Facebook.

How does location work with the Messenger mobile app?
When you send a message from the Messenger app, your location is included by default. You can turn this feature off by tapping before you send a message, which turns the arrow from blue (on) to gray (off). Location remains off for that conversation until you tap the arrow again.
In order to share your location in messages, you'll need to turn on location services in the main settings of your smartphone.

Who can see my location when I share it in a conversation in Messenger?
Your location is only visible to the people in that conversation.

Does Facebook store my location when I include it in a message?
When you add your location to a message, the location becomes a permanent part of the message history.
When you send a message to a friend with your location, that friend can see it as a pin on a map when they tap on your message. Your location won't appear anywhere outside of the message.

Final Thoughts

This script has been tested with a limited amount of test data. It is possible that some non-defined/blank fields might cause the script to fall over in a screaming heap. If so, let me know and I will try to fix it although some test data may be required to locate the bug(s) quicker. Knowing this monkey's luck, Facebook will probably change their database schema next week anyway LOL.

In the end, it took just as long to write this blog article as it did to write the script. Once you familarize yourself with the relevant libraries / Google for what you want the code to do (Thankyou Stack Overflow !) - it's pretty straight forward. Even a monkey can do it! And while this functionality is possibly already implemented in a number of forensic tools, writing this script provided me with a deeper understanding of the data involved whilst also allowing me to improve my Python programming skills. So I'd say it was well worth the effort.

It seems GoogleCode is stopping uploads from mid-January so this will probably be the last script I put on there. I'll have to find a new (free) home for future scripts. Anyone have suggestions? I am thinking of sharing a GoogleDrive / DropBox folder but that might not allow for easily viewed release notes. Not that anyone reads the release notes anyway LOL.

As usual, please feel free to leave comments/suggestions in the comments section below.