Thursday, 21 February 2013

Creating a Perl script to retrieve Android SMS

This script/post was inspired by Mari DeGrazia after she had to manually parse hundreds of Android SMS messages. Without her prior research and the principles she discusses in her post, there's little chance I would have attempted this script. Thanks for sharing Mari!
This post continues on from where Mari's post ended. We'll look further at an example Android SMS SQLite schema and then use it to explain how our SMS extraction script ( works. We will also walk you through how to use our script and what kind of output you can expect.

UPDATE 2014-01-23:
The code for has been revised/relocated to GitHub
It now handles Cell Header "Serial Type" values of 8 and 9 (ie Header values associated with 0 and 1).

UPDATE 2013-04-25:
Like a punch in the guts, some additional issues have arisen with the previous script (v2013-02-16).
Changes were made. Some bananas may have been thrown.

Issue #1
During further testing, we noticed that the initial version of was not reading some INTEGER fields correctly.
This has been corrected in the latest version 2013-04-14 with the addition of a new function "read_payload_int".
The previous script version tried to read payload INTEGER fields as VARINTs. This seems to work with positive integers less than 128 and so went unnoticed - as integer payload data in the "sms" table is typically limited to 1's or 0's. However, the "status" fields can read -1 and "thread_id" can be greater than 127 so the correction was made.

Issue #2
The script searches for a known phone number ("address" field) and then tries to go backwards a set number of fields until it hits the cell header size. Previously, it treated the payload fields prior to the "address" field (ie "thread_id") as VARINTs (like the cell header fields). As mentioned previously, this should not prove troublesome if the "thread_id" field is a single byte between 0 and 127. However, if the "thread_id" is greater than 127 or uses multiple bytes, there may be issues with ascertaining the cell header size and hence parsing the sms cell. See also the sms-cell-example-revised.png pic shown below in the original post.

The new version of the script requires the "-p" argument which represents the number of bytes between the last cell header field (VARINT) and the phone number "address" field. For our example schema, using "-p 2" means there's 2 bytes being used for the "thread_id" which sits in between the last cell header field and the "address" field.
This also means that to be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes. I decided to make it an argument rather than hard code it because the schema may change in the future. In practice, the "thread_id" will probably not exceed 2 bytes as the maximum "thread_id" value of 0xFFFF should be sufficiently large.

Changes to Configuration File FormatAs a result of the changes made for Issue #2, the configuration file no longer requires the PHONE type marker for the "address" field.
Instead, the "address" field can be declared as TEXT and the "-p" argument is used to define the relationship between the "address" field and the last cell header field. The example pic of the sample configuration file format has been edited accordingly.

To run it you now type something like: -c config.txt -f mmssms.db -p 1 -s "5555551234" -o output.tsv
and -c config.txt -f mmssms.db -p 2 -s "5555551234" -o output.tsv

Limited testing with Android sms test data was again performed.
The script now seems to handle multiple byte payload integers correctly with the new configuration file format.
As always, users should validate this script for themselves before relying upon the returned results (this is my CYA = Cover Your Ass section). What worked for our test data may not work for your's ...

2013-04-25 (Original post below also edited/revised)


Android stores SMS records in the "sms" table of /data/data/ SQLite can also store backups of "sms" table data in the /data/data/ file (in case it needs to undo a transaction). Journal files are a potential forensic gold mine because they may contain previously deleted data which is no longer visible in the current database.
As far as I'm aware, there is currently no freely available way to easily view/print the sms contents of mmssms.db-journal files.
And while you can query the mmssms.db database directly via SQLite, this will not return any older (deleted) sms entries from database pages which have been since been re-purposed.
Our script seems to work well with mmssms.db and mmssms.db-journal files and also with unallocated space (although file size limited/hardware dependent).
Additionally, our script will interpret date fields and print them in a human readable format so no more hours spent manually checking/converting timestamps!
Our script is also configurable - so you should be able to use it to look at multiple Android SMS SQLite schemas without having to modify the underlying code.

But before we dive into the script - it's probably a good idea to learn about how SQLite stores data ...

The SQLite Basics

The SQLite database file format is described in detail in Richard Drinkwater's blog posts here  and here.
There's also some extra information at the official SQLite webpage.

OK, now for the lazy monkeys who couldn't be bothered reading those links ...
The basic summary is that all SQLite databases have a main header section, followed by a bunch of fixed size storage pages.
There are several different types of page but each page is the same size as declared in the header.
One type of page is the "table B-Tree" type which has a 0xD byte flag marker. This type of page is used to store field data from the whole database (ie data from all of the tables) in units called "cells". Consequently, this page type will be used to store "sms" table data and because the page format is common to both mmssms.db and mmssms.db-journal files - our carving job is potentially much simpler.
Pages can also be deleted/re-allocated for another type of page so we must also be vigilant about non-"table B-tree" pages having free space which contains old "table B-tree" cell data. Think of it like file slack except for a database.

A 0xD type (ie "table B-tree") page will look like:

Generic Layout of a 0xD page

We can see the 0xD byte is followed by:
- 2 bytes containing the 1st free cell offset (0 if full)
- 2 bytes containing the number of used data cells in page
- 2 bytes containing the 1st used cell offset
- 1 byte fragmentation marker

Then depending on the number of used data cells, there will be a series of 2 byte offsets which point to each used data cell (see the green section in the pic). The cell pointer list starts with the closest data cell first and ends with the "1st used offset" data cell. Each used data cell should correspond to a row entry in a table of the database (eg an "sms" row entry).
Following those cell pointers (green), will be any free/unallocated space (blue) followed by the actual data cells (purple). The blue area is where we might see older previously deleted "table B-tree" data.

Breaking it down further, the general form of a data cell (from the purple section) looks like:

Generic Layout of a Cell

We can see there's a:
- Cell Size (which is the size of the cell header section + cell data section)
- Rowid (ie Primary Key of the row)
- Cell Header section (compromised of a "Cell Header Size" field + a bunch of fields used to describe each type/size of field data)
- Cell Data section (compromised of a bunch of fields containing the actual data)

You might have noticed an unfamiliar term called a "varint".
Varints are type of encoded data and are used to save space. They can be 1 to 9 bytes and require a bit of decoding.
Basically, you read the most significant byte (data is stored big endian) and if it's most significant bit is set to 1, it means there's another byte to follow/include. Then there's a bunch of droppping most significant bits and concatenating the leftovers into a single binary value.
Richard Drinkwater's got a better explanation (with example) here.
Later for our script, we will need to write a function to read these varints but for now, just know that a varint can store anywhere from 1-9 bytes (usually 1-2 bytes though) and it requires some decoding to arrive at the "original value".

So for our Android SMS scenario, a typical used "sms" data cell might look like:

Android SMS Cell example

You'll notice that there's a "Cell Header" section highlighted in purple and a "Cell Data" section highlighted in pink.
Think of the Cell Header section as a template that tells us how many bytes to expect for each field in the Cell Data section. The Cell Data section does not use varints to store data.
From the sms-cell-example-revised pic, we can see that most of the Cell Header field types are 0x01 - which means those fields use one byte of data in the subsequent cell data section (pink). Also please note the potential for multi-byte varints for the "thread_id" data field circled in red.
The official SQLite documentation refers to these cell header field type values as "Serial Type Codes" and there's a comprehensive definition table about halfway down the page here.

For our sms example, we can see from the purple section that the sms "Read" and "Type" fields will use 1 byte each to store their data in the Cell Data (pink) section. Looking at the pink section confirms this - the "Read" field value is 0 (0 for unread, 1 for read) and the "Type" field is 1 (1 for received, 2 for sent).
As a rule, if the value of the cell header field type (purple section) is between 0x0 and 0x4, the corresponding data field (pink) will use that many bytes (eg 0x1 means 1 byte data field, 0x4 means 4 bytes)
If the value of a cell header field (purple section) is 0x5 (eg "Date" & "Date_sent" fields), it will take 6 bytes in the cell data (pink) section. The "Date" and "Date_sent" data fields are 6 byte Big Endian values which (for Android) contain the number of milliseconds since the Unix epoch (1 Jan 1970).
There's a special case for handling strings. Firstly, the cell header field type value must be odd and greater than or equal to 13. Then to calculate the number of bytes required in the data section we use this formula:

Number of bytes in string = (cell header field type value - 13)/2.

So in our sms-cell-example-revised pic, the corresponding string size for the "Address" field is (0x21 - 0xD) / 0x2 = (33 - 13) / 2 = 10 bytes. I haven't actually shown a value for the "Address" in the pink section so just use your imagination!
Similarly, we can see that the "Body" field will take (0x23 - D) / 0x2 = (35 - 13) / 2 = 11 bytes.
Note: For long sms, the varint containing the "body" header field type has been observed to require 2 bytes.

You might also have noticed that not all of the cell header fields declared in the cell header section (purple) have a matching entry in the cell data section (pink). This is because if a cell header field is marked as NULL (ie 0x00), it does not get recorded in the cell data section (eg the purple "Rowid" header field's 0x00 value means there won't be a corresponding data field in the pink section).
So if we want to retrieve data, we can't go strictly off the schema - we have to pay attention to the cell header section so we can interpret the cell data fields correctly.

So how do/did we know what cell data field was what?
It turns out that SQLite ensures that the order of cell header fields in the cell header (purple section) is the same order as the database schema field order. Consequently, the cell data section (pink) will also appear in schema order (notwithstanding any missing null fields).
We can get the schema of a database file using the sqlite3 command line exe like this:

sansforensics@SIFT-Workstation:~$ sqlite3 mmssms.db
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .headers on
sqlite> pragma table_info(sms);

So we can see that the "sms" table consists of 17 fields with the first being the "_id" (ie rowid) primary key and the last being the "date_sent" field. In practice, the "_id" is typically unused/set to NULL as it is just duplicating the Cell's Rowid varint (from the white section). Some fields are declared as INTEGERS and others TEXT. Notice how the "date" and "date_sent" are declared as INTEGERS? These represent ms since UTC.
At this stage, I'm not 100% certain on every field's meaning. We know the "address" field is used to store phone numbers and the "body" field stores the sms text string. From Mari's research we also know that "read" is 1 for a read sms, 0 otherwise and "type" indicates sent (2) or recieved sms (1). That should suffice for now.

So that's the basics of the data structures we will be looking at. In the next section, we'll share some thoughts on the script. Oh goody!

The Script

At first I thought we could find each 0xD page and iterate through the data cells that way but this would miss any old sms messages contained in pages which have since been re-purposed by SQLite. That method would also miss any corrupted/partial pages containing sms in unallocated space.
So to find the sms messages, we are going to have to come up with a way of detecting/printing individual sms data cells.

The strategy we ended up using was based on the "address" field (ie the phone number).

1. We read in our schema and print flags from a configuration file.

2. We create one big string from the nominated input file.
Perl has a handy function called "index" that lets you find out if a given string is contained in a larger string. We use this "index" function to find all phone number matches and their respective file offsets.

3. For each match's file offset, we then work backwards and try to find the cell header size field (ie the start of cell header).
Looking at the sms-cell-example-revised pic, we can see that there are 17 (purple) varint fields plus the "thread_id" (pink) data field between the "Address" cell value (in pink section) and the cell header length/size field (in purple section). The number of varint fields should be constant for a given schema but it is possible for the number of bytes required for each varint to change (eg the "thread_id" data field to is typically 1-2 bytes)..

4. Now that we've determined the cell header size file offset, we can read in the header field type varints (ie find out how many bytes each field requires/uses in the cell data section) and also read in/store the actual data.

5. We then repeat steps 3 and 4 until we have processed all our search hits.

6. We can then sort the data in chronological order before printing to screen/file.

The main sanity check of this process is checking the cell header size value range. Remember, the cell header size value should tell us the number of bytes required for the entire cell header (including itself). So for our example schema above, this value should be:
- above the 18 byte minimum (ie number of schema fields plus the size of the cell header length = 17 + 1) and
- below a certain threshold (18+5 at this time).
Most "sms" cell header sizes should be 18 bytes (most of the fields are one byte flags) but for longer "body" fields or large "thread_id" field values, multi-byte varints have been observed which would obviously increase number of bytes required for that cell header. Allowing for an extra 5 bytes seemed like a good start for now.

For more information on how the script works (eg see how painful it is to read a multi-byte varint!) you can read the comments in the code. I dare you ;)

Making it Schema Configurable
As Mari has noted, not every Android phone will have the same schema. So instead of having a different script for each schema, we'll be utilising a configuration text file. Think of the configuration file as a kind of plugin for the script. Each phone schema will have it's own configuration file. This file will tell the script:
- what the schema fields are and more importantly, their order,
- which fields are DATES or STRINGS or INTEGERS and
- whether we want to print this field's values

For possible future needs, we have also declared a "c4n6mtype=android" field. This is in case we need to read an iPhone schema sometime in the future (iPhones use seconds since UTC for their DATE fields).

Here's an example of a configuration file (also provided from my GoogleCode Download page as "sms-grep-sample-config.txt"):

Sample Android Configuration File

Notice that it's very similar to the schema we got earlier from sqlite3?
The most significant differences are:

- "date" field (which is now listed as a DATE type)
- "date_sent" field (which is now listed as a DATE type)
- the configuration file uses ":" as a field separator (sqlite3 uses "|")
- the print flags (1 prints the field value, 0 does not print)

The script will ignore any blank lines and lines starting with "#".

Update 2013-04-25: The "address" field is now declared as TEXT (previously declared as PHONE).

Running the Script
The first step would be to determine the schema. The easiest way to do this is to use the sqlite3 client with the mmssms.db as previously shown. Admittedly, this requires access to a database file so if your don't have a sample to work with, your out of luck.
Next it's time to create the configuration file - making sure to mark the PHONE field and any DATE fields. Also remember to specify which fields you wish to print.
Once that is done, we can run the script using something like: -c config.txt -f mmssms.db -s "5555551234" -s "(555) 555-1234" -p 1  -o output.tsv

Note: Users can specify multiple phone numbers/formats to search for using -s arguments. At least one -s argument is required.
If no -o argument is specified, the results will be printed to the screen in Tab separated columns - which can get messy with lots of messages. Alternatively, an output Tab separated file (TSV) can be generated (eg using -o output.tsv).

Update 2013-04-25: To be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes.

Any extracted hits will be printed in chronological order based upon the first DATE type schema field declared in the configuration file (eg "date" field for our example configuration file). You will probably see multiple entries for the same SMS which was stored at different file offsets. The date sorting makes this situation easier to detect/filter.

Here's a fictional TSV output example based on the previously shown config file:

Fictional sample TSV output from

The arrows in the pic are used by Notepad++ to indicate TABs. We can see that only the print fields marked with a 1 in the configuration file (ie address, date, read, type, subject, body, seen, date_sent) are printed along with the file offset in hex.

Note: If you run the script on a partially complete cell (eg the cell header is truncated by the end of file so there's no corresponding cell data), the script will print out "TRUNCATED" for any strings and -999 for any integer fields. If you see these values, further manual parsing/inspection is recommended.

Limited testing of the script (initial version) has been performed with:
- 2 separate Android schemas
- Unallocated space (as retrieved by Cellebrite and exported into a new 600 MB file using X-Ways/WinHex)
- A Raw Cellebrite .bin file (1 GB size)

Note: The script failed to run with a 16 GB .bin file - we suspect this is due to a RAM deficiency in our test PC.

As I don't have an Android phone, I've relied pretty heavily on Mari for testing. We've tested it using ActiveState Perl v5.16 on 64 bit Windows 7 PCs. It should also work on *nix distributions with Perl. I have run it succesfully on SIFT v2.14.

Additionally, an outputted TSV file has also been successfully imported into MS Excel for subsequent analysis.

Update 2013-04-25: Subsequent re-testing of the new version (2013-04-14) was limited to 2 different schema mmssms.db files in allocated space only. I reckon if it worked for these, it should still work for the other previous test cases.

Validation tips:
grep -oba TERM FILE
Can be used in Linux/SIFT to print the list of search hit file offsets (in decimal).
For example: grep -oba "5555551234" mmssms.db
Additionally, WinHex can be used to search for phone number strings and will also list the location of any hits in a clickable index table.  The analyst can then easily compare/check the output of

What it will do
Find sms records to/from given phone numbers with valid formatted cell headers. This includes from both the SQLite database files (ie mmssms.db) and backup journal files (ie mmssms.db-journal). It should also find any existing sms records (with valid headers) that appear in pages which have been since re-allocated by SQLite for new data. Finally, it should also be able to find SMS from unallocated space (assuming the size isn't larger than your hardware can handle).

What it doesn't do very well
If the cell header is corrupted/missing, the script will not detect sms data.

The script does some range checking on the cell header size and prints a warning message if required. However, it is possible to get a false positive (eg phone number string is found and theres a valid cell header size value at the expected cell header size field). This error should be obvious from the output (eg "body" field has nonsensical values). The analyst is encouraged to view any such errors in a Hex viewer to confirm the misinterpreted data.

Unallocated space has proved troublesome due to size limitations. The code reads the input file into one big string for searching purposes, so running out of memory is a possibility when running on large input data such as unallocated. However, if you break up the unallocated space into smaller chunks (eg 1 GB), the script should work OK. I have also used WinHex to copy SMS data out from unallocated and paste it into a separate smaller file. This smaller file was then parsed correctly by our script.
The big string approach seemed like the quickest/easiest way at the time. I was targeting the actual database files/database journals rather than unallocated. So consider unallocated a freebie ;)

We have seen some SQLite sms records from an iPhone 4S which does NOT include the phone numbers. There may be another field we can use instead of phone numbers (perhaps we can use a phone book id?). This requires further investigation/testing.

Final words

As always, you should validate any results you get from your tools (including this one!).

This script was originally created for the purposes of saving time (ie reduce the amount time spent manual parsing sms entries). However, along the way we also learnt more about how SQLite stores data and how we can actually retrieve data that even SQLite doesn't know it still has (eg re-purposed pages).

The past few weeks have flown by. I originally thought we were creating a one-off script for Android but due to the amount of different schemas available, we ended up with something more configurable. This flexibility should also make it easier to adapt this script for future SQLite carving use (provided we know the schema). It doesn't have to be limited to phones!

However, this scripting method relies on providing a search term and knowing what schema field that term will occur in. There's no "magic number" that marks each data cell, so if you don't know/cannot provide that keyword and schema, you are out of luck.

I would be interested to hear any comments/suggestions. It can probably be improved upon but at some point you have to stop tinkering so others can use it and hopefully suggest improvements. If you do have problems with it please let me know. Just a heads up - for serious issues, I probably won't be able to help you unless you supply some test data and the schema. 
To any SANS Instructors reading, any chance we can get a shoutout for this in SANS FOR 563? Monkey needs to build some street cred ... ;)


  1. Could you please clarify one point about "Unallocated Space"
    Do you mean within the .db file or from the file system itself.
    If the latter, then what would be your recommendation to actually 'get' that ?

    thanks a lot, great post

    1. Hi there!

      The script should work on both unallocated sms entries from the file system and also on any sms entries from within the .db file.

      For our testing, we/Mari first used Cellebrite to extract an image of (file system) unallocated. Then she used X-Ways to export "Free Space" into another file which she then ran the script on.

      I'll see if I can get her to elaborate further in another comment below ...

      Thanks for reading

  2. Hello, to clarify I used Cellebrite to create a physical image of the phone. There are then two ways you can use the script to get messages from "unallocated" space. You can run the script across the .bin file created by Cellebrite. This method will make no distinction between unallocated, the SMS or the journal file. This will also include the deleted entries in the sms.db file.

    The other method focuses on just unallocated space. Using X-Ways, I loaded the .bin fin. I then exported out unallocated space and ran the script across this file.

    If you run the script across just the sms.db file it will pull all the messages and make no distinction between existing and deleted messages. I have not tested the script on just the sms.db file.

    I would be interested to hear how this script works with other tools used to acquire cell phones.....


    1. thanks a lot.
      I was thinking of using "dd if= of=" commands to achieve that, but at this points not quite sure what the 'if' should be.

    2. replying to myself:
      so what do you think about this

      1. adb shell "mount" or "cat /proc/partitions" to figure out what /data is mounted to.

      2. adb shell dd if=/dev/block/XXXXXXXXX of=/sdcard/dataimage.img

      3. adb pull /sdcard/dataimage.img

  3. Hello,
    I've run the script but keep getting errors even though hits are found.

    The schema looks like below and I have changed config.txt accordingly.

    But whatever number I put I get this error message even though I can clearly see the associated SMS's in Oxygen sqlite viewer. For some numbers I don't see anything at all in the viewer but obviously the script does but won't process them either.

    40 total hits found
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x423A)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x404FD)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x4065D)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x83780)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x83DC4)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x83F61)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x840CC)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x8418D)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x841E0)
    ### Cell header size is out of range - Not a valid sms! Skipping hit (0x8B31D)


    >sqlite3.exe mmssms.db
    sqlite> .headers on
    sqlite> pragma table_info(sms);





    thanks in advance for the advice


    1. Hmmm ... Two immediate possibilities come to mind:
      1. The data is being read/extracted wrong by the script (eg incorrect schema or script bug)
      2. The script needs it's FUDGE variable increased.

      During testing we were seeing minimum values of 0x12 for the cell header sizes with our example schema.
      With the schema you listed above, there should be a minimum value of 0x13.
      However, the maximum size can be adjusted by changing the FUDGE variable. It is currently set to 5 on/around line 330. Try changing it to 10 and see if it makes a difference.

      If this doesn't work ...
      On/around line 353 of the script can you remove the "#" sign so it prints out the hdrsizeval it's retrieving?
      ie Change
      #printf ("hdrsizeval = $hdrsizeval at 0x%x\n", $hdrsizeoffset);
      printf ("hdrsizeval = $hdrsizeval at 0x%x\n", $hdrsizeoffset);

      That will show us if the header size value is sensible value.


    2. Mari has informed me that the script can act a bit funky if the phone number search terms are not exact matches with whats stored in the db. This might be what you are seeing as well.
      For example, say the number stored in the database is "(777) 555-1234".
      If you use a partial search term such as "555-1234", the script will find the hit but it's offset calculations will be off because it will count backwards from the first "5" and arrive at the wrong header offset.

      So if you haven't done this already, I'd recommend you open up the SQLite db and use the exact phone numbers (including any "(", ")" or space characters).

      In our tests, we noticed the area code set on sent messages but not on received messages. This might be model specific - we had limited test data. Anyhow, the script picked out both area code and non-area code versions in our testing provided the search terms were exactly as per the database.

    3. Just wondering what the resolution on this was? I am having the same issue.

    4. I have not heard anything back from the previous poster so I assume they either gave up or it worked?

      As mentioned above, one potential culprit is a search term mismatch. But if you've confirmed that the search term is exactly the same as viewing through a SQLite viewer than you could try playing around with the -p offset argument (new to the latest version).

      But without additional test data/output, I can't really help much more.

  4. Hi Adam,

    Both yours and Maria's postings on this subject are very interesting.

    Quick question - how difficult would it be to write a python script to do the same job running within the Cellebrite PA application?


    Richard Drinkwater

    1. Hi Richard,

      Thanks for reading/commenting.
      I have found your blog posts very helpful as well.

      I have limited Python experience (all those tab indents give the irrits) but I haven't relied on any Perl specific libraries - just used standard Perl functions.
      So I think it should be fairly straight forward to write something similar.
      Perhaps Cellebrite is already working on it?
      Their CellebriteUSA Twitter account has started following me since this post :)

  5. So does this require search terms to search for text messages? I guess it does since it is using the found phone numbers themselves to define the text message records.

    I've been extracting text messages out of a bunch of different phones' raw data and every darn one is different. I have my code compartmentalized enough where I can just add in code for identifying a text message record and then extracting all fields of the record each time I come across a new format (SQLite schema), but that is still a fair amount of work. Eventually it would be great to have a template system so just a few parameters and a template would be required instead of new code each time, but that could prove difficult from what I've seen especially with all the variable length fields. I don't think it is as easy as what you have here.

    1. Hi HikingMike,

      Yeah this script was developed with the assumption that you know the phone number beforehand. If you have an allocated SQLite db you can query it and then use any/all of the outputted numbers with this script.

      It was because of all the different schemas out there that I added in the schema configuration file rather than have a different script version for each schema.

      Regarding your last sentence ...
      Are you saying that I am over-simplifying the extraction/missing something? Or am I just mis-interpreting your comment?
      If you could provide further details, perhaps I can improve the script.

    2. Hi, thanks for the reply!

      Well, let me first say that I don't know Perl so I'm not totally sure what's going on here... Some further description from you might help me. (Maybe I should read the SQLite documentation you linked also :)
      But I was not seeing how it handles the parsing of the message records to identify what is each item.

      I'm taking another look at it now with fresher eyes. Ok, it finds the matches on the search terms and saves those offsets. Then it calls process_sms on those. In the process_sms, it calls calculate_varint_bytes_from_lsb which now is more involved from what I was seeing before. I didn't dig in enough. Good code comments there. I'll have to look that over more to figure out what's going on but it seems that somehow you are figuring out how long a varint field is by the value of the byte. I definitely didn't expect that to be possible and I would love to learn how that works. Then you can work backwards if it's possible to figure out how long each field is.

      Now I have seen some text message records that have data between two known items, like between a phone number and the timestamp, and sometimes that data is 2 bytes, 1 byte, or sometimes 0 bytes - the phone number is directly adjacent to the timestamp. This means one or two fields can be zero bytes or more. I don't know how you can work backwards with that situation.

      Also, what if the phone number were to come after the body? You wouldn't know how long the body was and wouldn't be able to work backwards to the header (unless you tried something rougher like checking until the plain text bytes ended).

      This is a tricky problem and it looks like you've made some great work for solving it. I should have taken a few more looks before I said the last sentence before, but I do still want to learn more.


    3. To use this script properly you must know the schema and declare it in the configuration file before running the search. That way the script knows how to interpret the header/raw data sections.

      If you have an existing SQLite database - this can be done as I have shown in the blog post.

      The schema is also required because SQLite does not write anything for NULL raw data values. However, SQLite does write a header section before the raw data. The header section is like a template telling you what is coming up in the following data section.
      So when the script sees a "00" in the header field type section it knows that field will not have any data written in the data section and can retrieve the raw data accordingly.

  6. Ok that makes sense with the NULL and no bytes being used for those values. I thought you didn't have the location of the header at first though since it searched on the given phone numbers... then it worked backwards somehow to the header (step 3 above). That's the part I'm not understanding. I know you have the schema but some fields might not have any "visible" data in a record so you can't just count fields/bytes backwards.

    I'll have to look it over more.


    1. This is where the updated -p argument comes in.
      The "-p" argument which represents the number of bytes between the last cell header field (VARINT) and the phone number "address" field. For our example schema, using "-p 2" means there's 2 bytes being used for the "thread_id" which sits in between the last cell header field and the "address" field.

      So given an offset to a phone number within an sms record, the script will skip back "-p" bytes first. Then it skips the total number of schema header fields (ie VARINTs) until it gets to the cell header size. Once it knows the header size, the script can start reading forwards and recovering data.

    2. Oooooh ok, I missed that argument. So that hopefully is a constant number of bytes, or else you'll need to run it additional times for different "-p" values. Again, thanks for your replies and help understanding your script.