Sunday, 28 June 2015

How u like Base(64)?




Monkey was having such a great time, no one had the heart to tell him he had the wrong type of base ...


A recent blog post by Heather Mahalik (@HeatherMahalik) mentioned that a multiple Base64 decoding tool would be useful for mobile application analysis. What is Base64? Basically, it converts bytes into a printable 64 character set. This encoding is typically used when sending email and/or transferring/obfuscating data. Check out the Wikipedia page for more gory details.
There are already several existing tools we can use to perform Base64 decoding. For example, *nix systems have the "base64" command and recently Monkey found that Notepad++ (v6.7.9.2) will handle multiple Base64 encodes/decodes.
However, as most mobile apps use SQLite databases for storage, it would be pretty painful to first query the database and then manually perform each Base64 decode, especially if the field was Base64 encoded multiple times ... Unless of course, you had your own army of monkey interns!

Thankfully, we have previously used Python to interface with SQLite databases and after some quick Googling, we also found that Python has baked in Base64 encode/decode functionality.
So a scripted solution seems like the way to go (Sorry, intern monkey army!).

You can download the script (sqlite-base64-decode.py) from my GitHub page.

The Script

The user has to provide the script with the database filename, the table name, the Base64 encoded field's name and the number of iterations to run the Base64 decode.
The script will then query the database and then print each row's column values and the respective Base64 decode result in tab separated format.

Each app's database will have its own schema so we first need to run a "pragma table_info" query to find out how the database is laid out. 
Specifically, we want to find out:
- the table's Primary Key name (for ordering the main query by),
- the table column names (for printing) and
- the index (column number) of the Base64 encoded column (the user provided the encoded field's name but we also need to know the index)

Once we have this info, we can then run our main query which will be the equivalent of:
SELECT * FROM tablename ORDER BY primarykeyname;
We then iterate through each returned row, run the base64.decodestring function the requested number of times and print both the returned row data and the decoded result.
On a decode error, the script prints "*** UNKNOWN ***" for the decoded value.
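
For the curious, the steps above can be sketched in a few lines of Python. This is a minimal sketch and not the actual sqlite-base64-decode.py code: the function names (decode_n_times, decode_table) are made up for illustration, it takes an already opened sqlite3 connection, it assumes a single-column Primary Key and it assumes the table name can be trusted (it is pasted straight into the query). It also uses base64.b64decode with strict validation because base64.decodestring was removed in Python 3.9, so its error behaviour may differ slightly from the script's.

```python
import base64
import binascii
import sqlite3


def decode_n_times(value, count):
    """Base64 decode value `count` times; return None on any decode error."""
    try:
        data = value.encode("ascii") if isinstance(value, str) else value
        for _ in range(count):
            # validate=True rejects characters outside the Base64 alphabet
            data = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError, TypeError):
        return None
    return data


def decode_table(con, table, b64field, count):
    """Return (column names, rows with the decoded value appended)."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk)
    info = con.execute('PRAGMA table_info("%s")' % table).fetchall()
    names = [col[1] for col in info]
    pk_names = [col[1] for col in info if col[5]]
    b64index = names.index(b64field)  # column number of the encoded field

    query = 'SELECT * FROM "%s"' % table
    if pk_names:
        query += ' ORDER BY "%s"' % pk_names[0]

    rows = []
    for row in con.execute(query):
        rows.append(list(row) + [decode_n_times(row[b64index], count)])
    return names, rows
```

A caller would then print each row tab separated, substituting "*** UNKNOWN ***" wherever the decoded value is None.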

Here's the help text:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py -h
Running sqlite-base64-decode v2015-06-27
usage: sqlite-base64-decode.py [-h] db table b64field b64count

Extracts/decodes a base64 field from a SQLite DB

positional arguments:
  db          Sqlite DB filename
  table       Sqlite DB table name containing b64field
  b64field    Suspected Sqlite Base64 encoded column name
  b64count    Number of times to run base64 decoding on b64field

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Future work might have the script sample each column's data to figure out which is Base64 encoded.
Base64 encoded data is typically limited to the following characters:
A-Z
a-z
0-9
+
/
=


Because the = sign is used for padding, it is usually a good indicator of Base64 encoding (especially at the end of the encoded string).
Base64 encoding usually takes 3 binary bytes (24 bits) and turns them into 4 printable bytes (32 bits). So the final encoding should be a multiple of 4 bytes.
Additionally, the more times you encode in Base64, the longer the resultant string.
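
As a rough idea of how that column sampling might work, here is a hedged sketch of a Base64 "sniff test" based on the character set and multiple-of-4 observations above. The names looks_like_base64 and encoded_length are hypothetical (they are not in the script) and the check is only a heuristic: short plain words such as "None" also pass it.

```python
import re

# Base64 alphabet followed by at most two '=' padding characters
B64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(text):
    """Heuristic only: right character set and a length that is a multiple of 4."""
    return len(text) % 4 == 0 and B64_RE.fullmatch(text) is not None


def encoded_length(nbytes):
    """Length of the padded Base64 encoding of nbytes input bytes."""
    return 4 * ((nbytes + 2) // 3)
```

For example, the 11 byte 'Hey Monkey!' grows to 16 characters after one encode and 24 characters after two, which matches the test values used later in this post.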

Testing

For testing, we added the "base64enc" column to our previous post's testsms.sqlite database (specifically, the "sms" table). The test data looked like this:

Modified "sms" table with "base64enc" column added

The values for "base64enc" correspond to 2 x Base64 encoding the "message" value.
To obtain the 2 x Base64 encoded value, on Ubuntu we can do this:

cheeky@ubuntu:~$ echo -n 'Hey Monkey!' | base64
SGV5IE1vbmtleSE=
cheeky@ubuntu:~$

cheeky@ubuntu:~$ echo -n 'SGV5IE1vbmtleSE=' | base64
U0dWNUlFMXZibXRsZVNFPQ==
cheeky@ubuntu:~$ 


Note: The "-n" removes the newline character added by the "echo" command

So we can see our last encoding result corresponds to our "sms" table pic above.
ie 2 x Base64 encoding of 'Hey Monkey!' is U0dWNUlFMXZibXRsZVNFPQ==
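
If you would rather stay in Python, the same multiple encoding can be sketched as below. The encode_n_times helper is a made-up name for illustration and assumes the message is plain text that can be encoded as UTF-8.

```python
import base64


def encode_n_times(text, count):
    """Base64 encode text `count` times and return the result as a string."""
    data = text.encode("utf-8")
    for _ in range(count):
        data = base64.b64encode(data)
    return data.decode("ascii")


print(encode_n_times("Hey Monkey!", 1))  # SGV5IE1vbmtleSE=
print(encode_n_times("Hey Monkey!", 2))  # U0dWNUlFMXZibXRsZVNFPQ==
```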

Similarly, we can also use Notepad++ to do the encoding via "Plugins ... MIME Tools ... Base64 Encode".



As we see in the pic above, I used Notepad++ to 2 x Base64 encode the various "message" values and then inserted those values into the "sms" table's "base64enc" field using the SQLite Manager Firefox Plugin.

Now we run our script on our newly modified testsms.sqlite file ...
For shiggles, let's initially specify a 1 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 1
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    SGV5IE1vbmtleSE=
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    SGV5IFN0cmFuZ2VyIQ==
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    UCBpcyBmb3IgUEFHRURVTVAh
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    SSB3b25kZXIgd2hhdCBwZW9wbGUgd2l0aCBhIGxpZmUgYXJlIGRvaW5nIHJpZ2h0IG5vdyAuLi4=
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    VGhpcyBpcyBzbyBleGNpdGluZyEgSXQgcmVtaW5kcyBtZSBvZiBvbmUgdGltZSAuLi4gYXQgQmFuZCBDYW1wIC4uLg==

Exiting ...
cheeky@ubuntu:~$


No real surprises here. We can see the "B64Decoded" fields are still Base64 encoded. Also, apologies for the crappy layout ...
Now let's try a 2 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 2
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    Hey Monkey!
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    Hey Stranger!
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    P is for PAGEDUMP!
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    I wonder what people with a life are doing right now ...
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    This is so exciting! It reminds me of one time ... at Band Camp ...

Exiting ...
cheeky@ubuntu:~$


Note: The "message" and "B64Decoded" fields are the same - we have found our original message! :)
Finally, let's try a 3 x Base64 decode to see if the script falls into a screaming heap:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 3
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    *** UNKNOWN ***
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    *** UNKNOWN ***
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    *** UNKNOWN ***
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    *** UNKNOWN ***
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    *** UNKNOWN ***

Exiting ...
cheeky@ubuntu:~$ 


Note: The "*** UNKNOWN ***" values indicate that a decoding error has occurred (from testing this is usually due to a padding error).

We also ran these tests on a Windows 7x64 PC running Python 2.7.6 with the same results.

Final Thoughts

Special Thanks to Heather Mahalik for mentioning the need for the script. One of the great things about getting script ideas from Rockstar practitioners in the field is that it's not going to be some banana-in-the-sky idea that no one uses. This script might actually be useful LOL.

The script ass-umes only one field is Base64 encoded and that the Primary Key only uses one field.
The script has only been tested with Monkey's own funky data - it will be interesting to see how it goes against some real life user data.

The "pragma table_info" query is something Monkey will probably re-use in the future because it allows us to discover a database table's schema rather than hard-coding a bunch of assumptions about the table.

Deleted table data is not addressed by this script.

Monkey's recent blue period of posts might be drawing to a close. Oh well, it was fun while it lasted. Maybe I can now get a life ... yeah, right ;)

Tuesday, 23 June 2015

Deleted SQLite Parser Script Update (Now With Added DFIR Rockstar!)


Monkey says: "Knowing DFIR Rockstars has its privileges!" (Mari's picture courtesy of her Google+ Profile)


This post aims to build upon Mari DeGrazia's sqlparse Python script which harvests data from unallocated and free blocks in SQLite databases. It is also available as a Windows command line exe and/or a Windows GUI exe here.
Further details regarding her initial script can be found here. Mari's script has proven so useful that it's referred to in the SANS585 Advanced Smartphone Forensics course and by at least 2 books on mobile forensics (Practical Mobile Forensics by Bommisetty, Tamma and Mahalik (2014) and Learning iOS Forensics by Epifani and Stirparo (2015)).
Mari's impressive DFIR research in a variety of areas has also led her to attain her well deserved DFIR Rockstar status as attested to by her fellow DFIR Rockstar, Heather Mahalik.
That's a pretty impressive introduction eh? Mari - my promotions check is in the mail right? Right? ;)

OK, so what's Monkey got to do with it?
I was spinning my paws looking at deleted SMS from an Android (circa 4.1.2) LG-E425 phone (aka LG Optimus L3 E430) when I remembered Mari's script and thought of some minor modifications that would allow analysts to recover additional string data from re-purposed SQLite pages.
A commercial mobile forensics tool was reporting a different number of deleted SMS on consecutive reads via flasher box. Specifically, the parsing of read 1 was producing X deleted SMS while the parsing of read 2 was producing X-1 deleted SMS.
Admittedly, the flasher box was initiating a reboot after each acquisition - so any unused pages in the phone's memory could have been freed/reused, thus affecting the number of recoverable deleted SMS.
However, as Monkey actually gets paid to do this for a living (pinch me, I must be dreaming!), a closer inspection was carried out.
While the total number of deleted SMS varied by one, there were two deleted SMS in report 1 that weren't in report 2. Additionally, there was one deleted SMS in report 2 that wasn't in report 1.
So while the net difference was one less SMS, there was a bit more going on behind the scenes.
Fortunately, the commercial forensic tool also showed the image offset where these "deleted" SMS entries were found so we had a good starting point ...

OK Monkey, put your floaties on. It's time for some SQLite diving!


An SQLite database is comprised of a number of fixed sized pages. The number of pages and page size are declared in the file header. According to the official documentation, there are 4 types of page. The first byte of each page (occurring after the file header) tells us what type of page it is. The actual row data from a database table lives in a "Leaf Table B-Tree" page type. This has the flag value of 0xD (13 decimal). In the interests of readability / reducing carpal tunnel syndrome, we shall now refer to these pages as LTBT pages.
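
To make the page layout a bit more concrete, here is a small hedged sketch that reads the page size from the file header and then lists the first byte found at the start of each page. It is not taken from sqlparse.py; the list_page_flags name is hypothetical and it works on the raw bytes of a database file that has already been read into memory. It relies on the documented header layout (the "SQLite format 3" magic string followed by a big-endian 16 bit page size at byte offset 16, where a value of 1 means 65536).

```python
import struct

# Page type flags listed in the SQLite file format documentation
PAGE_TYPES = {
    0x02: "Interior Index B-tree",
    0x05: "Interior Table B-tree",
    0x0A: "Leaf Index B-tree",
    0x0D: "Leaf Table B-tree",
}


def list_page_flags(data):
    """Return (page_size, [(file_offset, first_byte), ...]) for raw DB bytes."""
    if data[:16] != b"SQLite format 3\x00":
        raise ValueError("not an SQLite 3 database")
    (page_size,) = struct.unpack(">H", data[16:18])
    if page_size == 1:  # special case: 1 means 65536
        page_size = 65536
    # Note: page 1 begins with the file header, so its "flag" is really 'S' (83)
    flags = [(offset, data[offset]) for offset in range(0, len(data), page_size)]
    return page_size, flags
```

Any first byte that is not a key in PAGE_TYPES (such as 83 for page 1 or 0 for a zeroed page) is not a regular b-tree page flag.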

A typical LTBT page looks like this:

An 0xD (LTBT) page with unallocated, allocated cells and free blocks

Back to Monkey's problem (well, one he actually has a chance of solving!) ... I observed that some of those "deleted" SMS were appearing in non-LTBT pages. The commercial mobile forensic tool was then finding/listing some of these entries but not all of them.
To accurately carve an entire SQLite record, you need to know the record's schema (order and type of column data) before reading the actual data. Any pages with overwritten cell headers (eg repurposed pages) may be difficult to accurately carve for all records. However, if we narrow our record recovery to any string content within a page, it becomes a lot easier. See this previous blog post for further details on carving SQLite records.

Within our LG phone data, it appeared that a page previously employed as an LTBT page was then re-purposed as another type (flag = 5, an "Interior Table B-tree" page). However, as this new page only used the first (and last) few bytes of the page, it still had previous record data leftover in the Unallocated region (see picture below).


The Unallocated region in non-LTBT pages can contain previous record data!


This previous data included some SMS records - some of which were being reported by the tool as deleted, while others were not.
This reporting discrepancy might have been because some of these SMS records also existed in allocated LTBT pages elsewhere or maybe it was due to the method the commercial tool was using to carve for SMS records. Due to the closed source nature of commercial tools, we can only speculate.
So rather than try to reverse engineer a proprietary tool, Monkey remembered Mari's sqlparse Python script and thought it might be easier/more beneficial to extend Mari's script to print the strings from all non-LTBT pages. By doing this, we can find which non-LTBT pages have previous row data in them (assuming the row data contained printable ASCII strings like SMS records). This will allow us to hunt for deleted records more efficiently (versus running strings over the whole file and having to figure out which strings are allocated / not allocated).
Because Mari had written her code in a logical / easy to understand manner (and commented it well!), it didn't take long to modify initially and only required about 10 extra lines of code.

You can download the updated software (command line Python, command line Windows exe, Windows GUI exe) from Mari's Github page. She is also writing an accompanying blog post which you can find at her blog here.

The Script

From my GitHub account, I "forked" (created my own copy of) Mari's SQLite-Deleted-Records-Parser project, made my changes and then committed them to my own branch. Then I submitted a "pull" request to Mari so she could review and accept the changes. Mari then found an interoperability bug regarding the new code and the existing raw mode which she then also fixed. Thanks Mari!

At the start of the script, I added code to parse the optional -p flag (which is stored as the "options.printpages" boolean) so the script knows when to print the non-LTBT page printable characters to the user specified output file.
Next, I added an "elif" (else if) to handle non-LTBT pages (ie pages where the flag does not equal 13). This is where I stuffed up as I did not allow for the user specifying -r for raw mode (dumps deleted binary data) at the same time as the -p option. Mari fixed it so that in raw + printpages mode, the printable strings are now dumped from non-LTBT pages and deleted content is dumped from LTBT pages (as before).

Here's our cross-bred "elif" code (as of version 1.3):

    elif (options.printpages):
        # read block into one big string, filter unprintables, then print
        pagestring = f.read(pagesize-1) # we've already read the flag byte
        printable_pagestring = remove_ascii_non_printable(pagestring)
      
        if options.raw == True:
            output.write("Non-Leaf-Table-Btree-Type_"+ str(flag) + ", Offset " + str(offset) + ", Length " + str(pagesize) + "\n")
            output.write("Data: (ONLY PRINTABLE STRINGS ARE SHOWN HERE. FOR RAW DATA, CHECK FILE IN HEX VIEWER AT ABOVE LISTED OFFSET):\n\n")
            output.write(printable_pagestring)
            output.write( "\n\n")
        else:
            output.write("Non-Leaf-Table-Btree-Type_" + str(flag) + "\t" +  str(offset) + "\t" + str(pagesize) + "\t" + printable_pagestring + "\n" )


The code above is called for each page.
Depending on whether we are in raw mode, the output is written as binary (raw mode) or tab separated text (not raw mode) to the user specified output file.
Depending on the number of non-LTBT pages and their string content, the output file might be considerably larger if you run the script with the -p argument versus without the -p argument.

In both the raw and not raw mode output files there are some common output field names.
The "Non-Leaf-Table-Btree-Type_Z" field shows what type of page is being output, where Z is the flag type of the non-LTBT page (eg 2, 5, 10, 0 etc).
The offset field represents the file offset for that page (should be a multiple of the page size).
No prizes for guessing what the page size field represents (this should be constant).
The last field will be the actual printable text. Because it's removing unprintable characters, the output string should not be too large, which should make it easier to spot any strings of interest.
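
To give a feel for the filtering step, here is a hedged stand-in for it. This is NOT Mari's remove_ascii_non_printable function; keep_printable and format_page_line are hypothetical names, and the sketch assumes that "printable" means the ASCII range from space (32) to tilde (126). The real function may treat whitespace differently.

```python
def keep_printable(page_bytes):
    """Keep only ASCII characters from space (32) to tilde (126)."""
    return "".join(chr(b) for b in page_bytes if 32 <= b < 127)


def format_page_line(flag, offset, page_size, page_bytes):
    """Build a tab separated line shaped like the not raw mode output."""
    return "Non-Leaf-Table-Btree-Type_%d\t%d\t%d\t%s\n" % (
        flag, offset, page_size, keep_printable(page_bytes))
```

Dropping tabs and newlines matters here because a stray tab inside the page data would otherwise break the tab separated layout.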

Here's the help text:

cheeky@ubuntu:~$ python ./sqlparse.py -h
Usage: Parse deleted records from an SQLite file into a TSV File or text file
Examples:
-f /home/sanforensics/smsmms.db -o report.tsv
-f /home/sanforensics/smssms.db -r -o report.txt


Options:
  -h, --help            show this help message and exit
  -f smsmms.db, --file=smsmms.db
                        sqlite database file
  -o output.tsv, --output=output.tsv
                        Output to a tsv file. Strips white space, tabs and
                        non-printable characters from data field
  -r, --raw             Optional. Will out put data field in a raw format and
                        text file.
  -p, --printpages      Optional. Will print any printable non-whitespace
                        chars from all non-leaf b-tree pages (in case page has
                        been re-purposed). WARNING: May output a lot of string
                        data.
cheeky@ubuntu:~$


Testing

I tested the new script with an existing test Android mmssms.db and it seemed to work OK as I was able to see non-LTBT string content for various pages.
To show you that the new code doesn't fall in a screaming heap, on an Ubuntu 14.04 64-bit VM running Python 2.7, we're going to use the SQLite Manager Firefox plugin to create a test database (testsms.sqlite) with a test table (sms). Then we'll populate the table with some semi-amusing test data and then use a hex editor to manually add a test string ("OMG! Such WOW!!!") into a non-LTBT page (freelist page).

Here's the test "sms" table row data:

One time ... at Band Camp ...

Here's the relevant database header info screenshot:

Note: The page size is 1024 bytes and there are 3 total pages. The last page is on the freelist (unused).

To create a non-LTBT page (page with a non 0xD flag value), I added another test table, added some rows and then dropped (deleted) that table. The database's auto-vacuum was not set. This resulted in the third page being created and then having its type flag set to 0 (along with any row data it seems). This suggests that pages on the free list have their first byte set to zero and it also may not be possible to recover strings from zeroed freelist pages. At any rate, we now have a non-LTBT page we can manually modify and then parse with our new code.

Here's the gory page by page breakdown of our testsms.sqlite file ...

Page 1 starts with the "SQLite format 3" string and not a flag type.

Page 2 contains the test "sms" table data (ie LTBT page).

Page 3 contains the freelist page (non-LTBT) and our test string.

After using WinHex to add our test string to an arbitrary offset in the last page, we then ran our script (without the -p) and checked the contents of the output file.

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv
cheeky@ubuntu:~$


Here's what the testoutput.tsv file looked like:


As expected, our test string in a non-LTBT page was not extracted.
Then we re-ran the script with the -p argument ...

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p
cheeky@ubuntu:~$

Here's the output file:


The new version of the script has successfully extracted string content from both of the non-LTBT pages (ie page 1 and page 3).

OMG! Such WOW!!! Indeed ...

You might have noticed the first entry (at offset 0) being "Non-Leaf-Table-Btree-Type_83". This is because the very first page in an SQLite database starts with the string "SQLite format 3". There is no flag as such. "S" in ASCII is decimal 83 so that's why it's declaring the type as 83. You can also see the rest of the string ("QLite format 3") following on with the rest of the printable string data in the Data column.

OK now we try adding the -r (raw) mode argument:

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p -r
cheeky@ubuntu:~$


Because there's now binary content in the output file, Ubuntu's gedit spits the dummy when viewing it. So we use the Bless Hex editor to view the output file instead.

Raw Mode + PrintPages Mode Output

Notice how the first page's string content is shown (look for the "QLite format 3" string towards the top of the pic). Remember, the first page is not considered an LTBT page, so its printable strings are retrieved.
There's also a bunch of Unallocated bytes retrieved (values set to zero) from offset 1042 which corresponds to Page 2's Unallocated area. Remember, Page 2 is an LTBT page - so the script only extracts the unallocated and free blocks (if present).
And finally, circled in red is our test string from Page 3 (a non-LTBT page type).
Cool! It looks like everything works!

Similarly, I re-ran the script on a Windows 7 Pro 64-bit PC running Python 2.7 with the same results.

Final Thoughts

Special Thanks again to Mari for releasing her sqlparse script and also for her prompt help and patience in rolling out the new updates.
I contacted her on Thursday about the idea and just a few short days later we were (well, she was) already releasing this solution ... Awesome!
A lesson (re)learned was that even if you're only adding a few lines of code, be aware of how it fits into the existing code structure. None of this, "I'm not touching that part of the code, so it should be OK *crosses fingers*". Especially because it was someone else's baby, Monkey should have re-tested all of the existing functionality before requesting a merge. Thankfully in this case, the original author was available to quickly fix the problem and the solution was relatively straightforward.
During the initial investigation of the LG mmssms.db database, I checked the file header and there were no freelist (unused) pages allocated. The official documentation says:

A database file might contain one or more pages that are not in active use. Unused pages can come about, for example, when information is deleted from the database. Unused pages are stored on the freelist and are reused when additional pages are required.

The lack of freelist pages might have been because of the vacuum settings.
Anyhoo, with the -p option enabled, this new version of the script will process freelist pages and print any strings from there too (just in case).

Also, don't forget to check for rollback journal files (have "-journal" appended to DB filename) and write-ahead logs (have "-wal" appended to DB filename) as other potential sources of row data. They should be small enough to quickly browse with a hex editor. When using an SQLite reader to read a database file, be careful not to open it with those journal/log files in the same directory as that could result in the addition/removal of table data.
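
A quick way to spot those companion files before opening anything is sketched below. The find_sidecar_files name is hypothetical and the "exists" parameter is only there so the check can be swapped out; by default it just asks the file system.

```python
import os


def find_sidecar_files(db_path, exists=os.path.exists):
    """Return any rollback journal / write-ahead log files next to db_path."""
    candidates = [db_path + "-journal", db_path + "-wal"]
    return [name for name in candidates if exists(name)]
```

If this returns anything, it is probably safer to work on a copy of the database file placed in a separate directory without those journal/log files.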

Be wary of the difference between documentation and implementation (eg the official SQLite documentation didn't mention pages with flag values of zero). Reading the available documentation is just one facet of research. Simulating / seeing real-world data is another. Reading the available source code is yet another. But for practical analysis, nothing beats having real-world data to compare/analyze.
After all, it was real-world data (and some well-timed curiosity) that led us to add this new functionality.

In keeping with the Rockstar theme ... Monkey, OUT! (drops microphone)

Saturday, 13 June 2015

Android APK Permissions Script

In this issue ... We take a look at Android Perms... So hawt!

An Android app install file (.apk) declares its required permissions in its AndroidManifest.xml binary file.
While there is limited official documentation about this file format, we can use tools such as the aapt Android developer tool and/or the AndroGuard Python tool to interrogate .apks directly. As these tools require a bit of effort to download/install (eg dependencies), lazy monkey here thought that a Python script (print_apk_perms.py) to read/print Android permissions from multiple .apks might be useful. It is hoped this script can be used to quickly determine which apps have permission XYZ. eg For those cases where the suspect/victim claims "It wasn't me! It was the app that did it!"
You can download it from my GitHub page.

The official Android developer documentation describes Android Permissions as:

A permission is a restriction limiting access to a part of the code or to data on the device. The limitation is imposed to protect critical data and code that could be misused to distort or damage the user experience.

Each permission is identified by a unique label.

If an application needs access to a feature protected by a permission, it must declare that it requires that permission with a <uses-permission> element in the manifest. Then, when the application is installed on the device, the installer determines whether or not to grant the requested permission by checking the authorities that signed the application's certificates and, in some cases, asking the user. If the permission is granted, the application is able to use the protected features. If not, its attempts to access those features will simply fail without any notification to the user.
There are numerous manifest permissions which are listed here.
Basically, each permission has a corresponding string which is *usually* prefixed by "android.permission."
eg "android.permission.CAMERA"

Notice how we said "usually"? Monkey had just completed an initial version that searched for "android.permission." prefixed strings when he noticed the following permission names in the previous link:
com.android.voicemail.permission.ADD_VOICEMAIL
com.android.launcher.permission.INSTALL_SHORTCUT
com.android.browser.permission.READ_HISTORY_BOOKMARKS
com.android.voicemail.permission.READ_VOICEMAIL
com.android.alarm.permission.SET_ALARM
com.android.browser.permission.WRITE_HISTORY_BOOKMARKS
com.android.voicemail.permission.WRITE_VOICEMAIL


Additionally, an Adobe Reader .apk had a permission string like:
com.android.vending.BILLING
(ie "permission" is not even in the permission string!)

Argh! There may have been some subsequent poo flinging on our way back to the drawing board ...

Thankfully, during the research phase, Monkey and Dr Google found these very helpful links ...
Olaf Dietsche's Blog on Exploring Android's binary XML format
and
AndroidSec's 2 blog posts on the binary Android Manifest XML file.
Part 1
Part 2

Basically, despite the .xml extension, the AndroidManifest.xml file is not human readable and relies on declaring XML fields via binary "chunk" types. Strings are stored in a common pool area and are stored only once so as to minimize file size. Our permission strings should be stored in this common pool area.

To get to the AndroidManifest.xml file, we have to unzip the .apk and then use a hex editor to open the AndroidManifest.xml from the archive's root directory.
The AndroidManifest.xml file starts with a 64 bit ResXMLTree_header. This is an alias for a ResChunk_header data structure consisting of:
- an unsigned LE 16 bit "type",
- an unsigned LE 16 bit "headerSize",
- an unsigned LE 32 bit "size"

From our observations, there are two "type" values that are relevant for our script:
- the RES_XML_TYPE (0x0003) and
- the RES_STRING_POOL_TYPE (0x0001)

The first ResXMLTree_header / ResChunk_header in the file should have a "type" equal to RES_XML_TYPE (0x0003).
After the first ResXMLTree_header / ResChunk_header, there's another section containing the common string pool. This section consists of a ResStringPool_header and a bunch of string offsets.
The ResStringPool_header consists of:
- a 64 bit ResChunk_header with "type" equal to RES_STRING_POOL_TYPE (0x0001).
- an unsigned LE 32 bit "stringCount" (number of strings declared in pool)
- an unsigned LE 32 bit "styleCount"
- an unsigned LE 32 bit "flags"
- an unsigned LE 32 bit "stringsStart" (offset from the start of the "ResStringPool_header" to the first string size)
- an unsigned LE 32 bit "stylesStart"

Next, there are "stringCount" instances of unsigned LE 32 bit offsets. Each offset leads us to a string size, followed by the actual UTF16 LE encoded string.
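
The header layout described above maps fairly directly onto Python's struct module. The following is a minimal sketch (not the print_apk_perms.py code) and parse_manifest_headers is a hypothetical name. It assumes the string pool chunk starts immediately after the first header, ie at the offset given by the first header's "headerSize".

```python
import struct

RES_STRING_POOL_TYPE = 0x0001
RES_XML_TYPE = 0x0003


def parse_manifest_headers(data):
    """Sanity check both headers and return the string pool fields as a dict."""
    # ResChunk_header: LE 16 bit type, LE 16 bit headerSize, LE 32 bit size
    xml_type, xml_header_size, _size = struct.unpack_from("<HHI", data, 0)
    if xml_type != RES_XML_TYPE:
        raise ValueError("first chunk is not RES_XML_TYPE")

    pool_offset = xml_header_size  # assumption: pool follows the first header
    pool_type, pool_header_size, _size = struct.unpack_from("<HHI", data, pool_offset)
    if pool_type != RES_STRING_POOL_TYPE:
        raise ValueError("second chunk is not RES_STRING_POOL_TYPE")

    # Five LE 32 bit fields follow the pool's ResChunk_header
    names = ("stringCount", "styleCount", "flags", "stringsStart", "stylesStart")
    header = dict(zip(names, struct.unpack_from("<5I", data, pool_offset + 8)))
    header["poolOffset"] = pool_offset
    header["headerSize"] = pool_header_size
    return header
```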

OK, to summarize all that crap above, the beginning of an AndroidManifest.xml file should look like:

AndroidManifest.xml File Layout

Putting it all together ... our script is going to read the common string pool, extract any strings containing ".permission" or "com.android.", then print them out. Additionally, let's give our script the ability to recursively process directories of .apks so we don't have to call it separately for each .apk. And to make .apk comparisons easier, we'll allow for printing permission strings in ralph-abetical order. Doesn't sound so hard right? :)

Script

Similar to our last post, we will use the Python zipfile library to unzip and peek into .apk files (an .apk is a zipped archive). Once we find the AndroidManifest.xml, we search for any string containing ".permission" or "com.android." and then print the .apk name, the file offsets and then the permission strings. If the sort argument (-s) is specified, it prints the permission strings in alphabetical order otherwise it prints the permission strings ordered by file offset. There is also a debug (-d) argument to print all strings from the string pool so the user can see if a permission string has been missed.

Also like our last post, the script tests if the input argument is a directory and if it isn't, it ass-umes the argument to be a single file. If it is a directory, the script walks thru each file and sub-directory and calls the "parse_apk_perms" function for each file. This is the function that searches for/prints the permission strings.
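
The directory versus single file handling could look something like the sketch below. It is only an illustration: iter_manifests is a hypothetical name (the real script calls "parse_apk_perms" per file) and it quietly skips anything that is not a zip archive containing an AndroidManifest.xml in its root directory.

```python
import os
import zipfile


def iter_manifests(path):
    """Yield (filename, manifest bytes) for each .apk found at or under path."""
    if os.path.isdir(path):
        # Walk through each file in the directory and its sub-directories
        filenames = [os.path.join(root, name)
                     for root, _dirs, names in os.walk(path)
                     for name in names]
    else:
        filenames = [path]  # assume the argument is a single file

    for filename in filenames:
        if not zipfile.is_zipfile(filename):
            continue
        with zipfile.ZipFile(filename) as apk:
            if "AndroidManifest.xml" in apk.namelist():
                yield filename, apk.read("AndroidManifest.xml")
```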

The AndroidManifest.xml file relies on the concept of a string pool and declaring XML relationships by referring back to other binary data "chunks". The benefit of just searching the string pool for permission strings is that the script only prints each permission once (regardless of how many times that permission string is used/declared). See the testing section later for an example of how much easier it is to determine permissions when there are no duplicate strings.

To find the permission strings in the "parse_apk_perms" function, we first use zipfile.open to open the manifest file and then we call the file "read" function to get the contents into one large string object.
After sanity checks of the ResXMLTree_header and ResStringPool_header "type" fields, the script extracts the "stringCount" and "stringsStart" fields.
It then extracts "stringCount" x string offsets into a list. These offsets are relative to the "stringsStart" offset (which itself is relative to the start of the ResStringPool_header).
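
As a rough illustration of the first step (getting the manifest contents out of the .apk), here is a small sketch. The "read_manifest" name is hypothetical and the demo builds a tiny in-memory zip instead of using a real .apk, so treat it as a sketch of the zipfile calls rather than the script's own code.

```python
import io
import zipfile

def read_manifest(apk_file):
    """apk_file: a filename or file-like object. Returns bytes or None."""
    if not zipfile.is_zipfile(apk_file):
        return None
    with zipfile.ZipFile(apk_file, "r") as z:
        if "AndroidManifest.xml" not in z.namelist():
            return None
        # zipfile.open returns a file-like object; read() pulls in everything
        with z.open("AndroidManifest.xml") as manifest:
            return manifest.read()

# Demo: an in-memory zip holding a 4 byte dummy "manifest"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("AndroidManifest.xml", b"\x03\x00\x08\x00")
print(len(read_manifest(buf)))
```

Returning None for anything that isn't a zip (or has no manifest) lets a directory walk skip over non-.apk files without falling over.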

For example, the file offset address for String 0 = starting address of "ResStringPool_header" + "stringsStart" offset + "String 0 offset".
This actually points to the unsigned LE 16 bit integer containing String 0's number of UTF16 LE encoded characters (not including the NULL terminator).
After the string size integer comes the actual UTF16LE encoded string.
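
Here is a minimal sketch of that offset arithmetic and the string read. It only handles the UTF16 LE layout described above, the "read_pool_string" name is made up for illustration, and the test buffer is synthetic (8 filler bytes followed by a pretend pool whose strings start 4 bytes in).

```python
import struct

def read_pool_string(manifest, pool_start, strings_start, string_offset):
    # File offset of the string size field
    size_offset = pool_start + strings_start + string_offset
    # Unsigned LE 16 bit count of UTF16 characters (NULL terminator excluded)
    (num_chars,) = struct.unpack_from("<H", manifest, size_offset)
    # Each UTF16 LE character takes 2 bytes
    raw = manifest[size_offset + 2 : size_offset + 2 + num_chars * 2]
    return size_offset, raw.decode("utf-16-le")

# Synthetic data: size field, UTF16 LE string, then the NULL terminator
text = u"android.permission.INTERNET"
pool = b"\x00" * 4 + struct.pack("<H", len(text)) + text.encode("utf-16-le") + b"\x00\x00"
manifest = b"\xff" * 8 + pool
print(read_pool_string(manifest, 8, 4, 0))
```

The first value returned is the file offset of the string's size field (12 in this synthetic case), followed by the decoded string.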

Once we have our string value, we can check it for ".permission" or "com.android." (which indicates it's a permission string).
If it contains either, we use the file offset as the key to store that permission string in a Python dictionary (called "permsdict").
Then depending on the sorting order required, we sort a list of dictionary keys based on file offset (default) or by permission name.
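
The filtering step can be sketched like this. The "collect_permissions" helper and the "pool_strings" input dictionary are hypothetical (the real script may well build "permsdict" as it reads each string), and the 0x1a0 offset is invented for the demo.

```python
def collect_permissions(strings_by_offset):
    """strings_by_offset: dict mapping file offset -> string pool value."""
    permsdict = {}
    for offset, value in strings_by_offset.items():
        # The same two substring tests described above
        if ".permission" in value or "com.android." in value:
            permsdict[offset] = value
    return permsdict

pool_strings = {
    0x1a0: "versionCode",                  # invented offset, not a permission
    0x696: "android.permission.INTERNET",
    0x778: "com.android.vending.BILLING",
}
permsdict = collect_permissions(pool_strings)
print(sorted(hex(offset) for offset in permsdict))
```

Because this is a plain substring match, it is deliberately broad. For example, a string such as "com.android.contacts" (which shows up in the Twitter output in the testing section) is kept as well.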

In order to perform the sorting, we use the Python "sorted" function and combine it with a "lambda" inline function.
There's a helpful explanation of lambda functions here.

Just FYI, here's the "parse_apk_perms" sorting code for sorting by permission name:
    sorted_by_perm_keys = sorted(permsdict, key = lambda x : permsdict[x])

The "sorted" function returns a sorted list of dictionary keys using the "key" argument to specify that we want to sort the output list by the "permsdict" dictionary value. ie x is the file offset key, permsdict[x] is the corresponding permission string.
Once we have the sorted list (now called "sorted_by_perm_keys"), we can iterate thru it and print the filename, file offset and permission string.
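
For anyone who wants to see the two sort orders side by side, here is a small self-contained demo. It borrows the four Adobe Reader offsets/strings from the testing output in this post. The "sorted_by_offset_keys" name is our own label for the default ordering and may not match the script's variable name.

```python
permsdict = {
    0x696: "android.permission.INTERNET",
    0x6d0: "android.permission.WRITE_EXTERNAL_STORAGE",
    0x726: "android.permission.ACCESS_NETWORK_STATE",
    0x778: "com.android.vending.BILLING",
}

# Default: sorting a dictionary sorts its keys, ie the file offsets
sorted_by_offset_keys = sorted(permsdict)

# -s: still a list of keys, but ordered by each key's permission string
sorted_by_perm_keys = sorted(permsdict, key=lambda x: permsdict[x])

for offset in sorted_by_perm_keys:
    print("%s\t%s" % (hex(offset), permsdict[offset]))
```

Sorted by permission name, the offsets come out as 0x726, 0x696, 0x6d0, 0x778, which matches the -s ordering shown for adobe-reader.apk in the testing section.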

Here's the script's help output:
cheeky-android@cheekydroid:~$ python ./print_apk_perms.py -h
usage: print_apk_perms.py [-h] [-s] [-d] target

Print Android Manifest permission strings from an .apk file/directory
containing .apk files

positional arguments:
  target      Target .apk / directory containing .apks

optional arguments:
  -h, --help  show this help message and exit
  -s          Print permissions sorted by name (default is sorted by offset)
  -d          Prints ALL strings for debugging (default is OFF)
cheeky-android@cheekydroid:~$

Testing

The script was tested on Ubuntu x64 with Python 2.7 and .apks from Android 4.4.2 and 5.1.1 devices.

A previous post showed how we can download the Android SDK and use dev tools like the Android emulator. It also showed how to use the "aapt" and "adb" tools to investigate .apks. ie Monkey isn't going to repeat himself (this time!) so go read the post if any of the following sounds like a barrel of monkeys ...

For this post, we will only need the aapt and adb tools. We will "adb pull" .apks from an Android 4.4 device and a 5.1 device. This will require first enabling USB debugging and trusting the connected PC from the Android devices.
In normal forensic practice, we would usually acquire the .apks via a commercial mobile forensic imaging tool and/or via JTAG/Flasher box download (no, the Flasher box is NOT what you're thinking ... pervert!).
Anyhoo, as long as you are able to copy over an .apk or a directory of .apks (eg from /data/app or /system/app or /mnt/asec), you can then run this script. See here for more details on possible .apk install locations.

OK, returning back to our scheduled programming ... we copy our test .apks into a test directory structure like this:


Test Directory Structure

Note: The "testsubdir4" sub-directory contains firefox4.apk and the "testsubdir5" sub-directory contains firefox.apk.
This will demonstrate the script's sub-directory traversing functionality.

Now we run the script on the "4.4.2-apks" directory using the default sort order (ie sorted by file offset):

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/4.4.2-apks

Running print_apk_perms.py v2015-06-13
Source file = test-apks/4.4.2-apks
Output will be ordered by AndroidManifest.xml file offset

Attempting to parse test-apks/4.4.2-apks/adobe-reader4.apk
Input apk file test-apks/4.4.2-apks/adobe-reader4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x696    android.permission.INTERNET
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x6d0    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x726    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x778    com.android.vending.BILLING

Attempting to parse test-apks/4.4.2-apks/twitter4.apk
Input apk file test-apks/4.4.2-apks/twitter4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xb18    com.twitter.android.permission.READ_DATA
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xb6c    android.permission-group.PERSONAL_INFO
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xbbc    com.twitter.android.permission.MAPS_RECEIVE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xc16    com.twitter.android.permission.C2D_MESSAGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xc6e    com.twitter.android.permission.RESTRICTED
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xcc4    com.twitter.android.permission.AUTH_APP
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xd38    android.permission.INTERNET
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xd72    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xdc4    android.permission.VIBRATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xdfc    android.permission.READ_PROFILE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xe3e    android.permission.READ_CONTACTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xe82    android.permission.RECEIVE_SMS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xec2    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xf04    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xf4c    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xfa0    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xfee    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x103e    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1090    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x10d8    android.permission.SYSTEM_ALERT_WINDOW
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1128    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1164    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x11ba    com.google.android.c2dm.permission.RECEIVE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1212    com.google.android.providers.gsf.permission.READ_GSERVICES
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x128a    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x12ee    android.permission.READ_PHONE_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1338    com.sonyericsson.home.permission.BROADCAST_BADGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x139c    com.sec.android.provider.badge.permission.READ
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x13fc    com.sec.android.provider.badge.permission.WRITE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x145e    android.permission.CAMERA
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1494    android.permission.ACCESS_WIFI_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x2dfc    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x32f0    com.android.contacts
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x569a    android.permission.GLOBAL_SEARCH
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x5ab4    com.google.android.c2dm.permission.SEND
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x6166    android.permission.BIND_REMOTEVIEWS

Attempting to parse test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Input apk file test-apks/4.4.2-apks/testsubdir4/firefox4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbba    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbfc    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc4e    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc96    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xcde    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd32    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd82    android.permission.WRITE_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xdc8    android.permission.READ_SYNC_STATS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe10    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe5e    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xed4    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf2a    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf92    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xffe    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1050    android.permission.INTERNET
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x108a    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x10e0    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1144    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x11ac    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x121a    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1256    android.permission.VIBRATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x128e    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x12f4    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1358    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1490    android.permission.NFC
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x14ec    android.permission.RECORD_AUDIO
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x15ea    android.permission.CAMERA
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5a4a    com.android.internal.app.ResolverActivity
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5c28    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6428    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6e1a    android.permission.GLOBAL_SEARCH

Parsed 3 .apk files
cheeky-android@cheekydroid:~$



Note: Sorry about the funky formatting, Blogger is having line wrap issues with the long strings :(. Each field should be TAB separated.
Next, we try it on the "5.1.1-apks" directory using the -s argument (sorting by permission string name):

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/5.1.1-apks/ -s

Running print_apk_perms.py v2015-06-13
Source file = test-apks/5.1.1-apks/
Output will be ordered by Permission string

Attempting to parse test-apks/5.1.1-apks/adobe-reader.apk
Input apk file test-apks/5.1.1-apks/adobe-reader.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x726    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x696    android.permission.INTERNET
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x6d0    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x778    com.android.vending.BILLING

Attempting to parse test-apks/5.1.1-apks/camera.apk
Input apk file test-apks/5.1.1-apks/camera.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================

Attempting to parse test-apks/5.1.1-apks/malwarebytes.apk
Input apk file test-apks/5.1.1-apks/malwarebytes.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x68a    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x5a0    android.permission.GET_TASKS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x618    android.permission.INTERNET
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x77c    android.permission.KILL_BACKGROUND_PROCESSES
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x7d8    android.permission.NFC
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x6dc    android.permission.READ_PHONE_STATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x84e    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x912    android.permission.RECEIVE_SMS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x652    android.permission.VIBRATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x5dc    android.permission.WAKE_LOCK
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x726    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x808    android.permission.WRITE_SETTINGS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x8a4    com.android.browser.permission.READ_HISTORY_BOOKMARKS

Attempting to parse test-apks/5.1.1-apks/testsubdir5/firefox.apk
Input apk file test-apks/5.1.1-apks/testsubdir5/firefox.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1134    android.permission.ACCESS_FINE_LOCATION
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xd32    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x120c    android.permission.ACCESS_WIFI_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xe14    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x181c    android.permission.CAMERA
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x11c0    android.permission.CHANGE_WIFI_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1592    android.permission.DOWNLOAD_WITHOUT_NOTIFICATION
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xcf0    android.permission.GET_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x73c8    android.permission.GLOBAL_SEARCH
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1186    android.permission.INTERNET
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xd84    android.permission.MANAGE_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x16c2    android.permission.NFC
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xf46    android.permission.READ_SYNC_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xefe    android.permission.READ_SYNC_STATS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x100a    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x171e    android.permission.RECORD_AUDIO
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xdcc    android.permission.USE_CREDENTIALS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1424    android.permission.VIBRATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x13e8    android.permission.WAKE_LOCK
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1258    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xeb8    android.permission.WRITE_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xe68    android.permission.WRITE_SYNC_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x137a    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x5cd4    com.android.internal.app.ResolverActivity
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x12ae    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1312    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x5eb2    com.android.vending.INSTALL_REFERRER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1060    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x14c2    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1526    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x67c2    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x145c    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xf94    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x10c8    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE

Parsed 4 .apk files
cheeky-android@cheekydroid:~$


Note: Permissions are now printed in alphabetical order.
Also note, the camera.apk did not have any android.permission strings declared.
This result can be confirmed by running "aapt" against the camera.apk:

cheeky-android@cheekydroid:~$ /home/cheeky-android/Android/Sdk/build-tools/22.0.1/aapt dump permissions /home/cheeky-android/test-apks/5.1.1-apks/camera.apk
package: com.modaco.cameralauncher
cheeky-android@cheekydroid:~$


For a more typical comparison, here's the output of the "aapt" dev tool against the "4.4.2-apks/testsubdir4/firefox4.apk":

cheeky-android@cheekydroid:~$ /home/cheeky-android/Android/Sdk/build-tools/22.0.1/aapt dump permissions /home/cheeky-android/test-apks/4.4.2-apks/testsubdir4/firefox4.apk
package: org.mozilla.firefox
uses-permission: name='android.permission.GET_ACCOUNTS'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.MANAGE_ACCOUNTS'
uses-permission: name='android.permission.USE_CREDENTIALS'
uses-permission: name='android.permission.AUTHENTICATE_ACCOUNTS'
uses-permission: name='android.permission.WRITE_SYNC_SETTINGS'
uses-permission: name='android.permission.WRITE_SETTINGS'
uses-permission: name='android.permission.READ_SYNC_STATS'
uses-permission: name='android.permission.READ_SYNC_SETTINGS'
permission: org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
uses-permission: name='org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE'
uses-permission: name='android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission: name='org.mozilla.firefox.permission.PER_ANDROID_PACKAGE'
uses-permission: name='android.permission.GET_ACCOUNTS'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.MANAGE_ACCOUNTS'
uses-permission: name='android.permission.USE_CREDENTIALS'
uses-permission: name='android.permission.AUTHENTICATE_ACCOUNTS'
uses-permission: name='android.permission.WRITE_SYNC_SETTINGS'
uses-permission: name='android.permission.WRITE_SETTINGS'
uses-permission: name='android.permission.READ_SYNC_STATS'
uses-permission: name='android.permission.READ_SYNC_SETTINGS'
permission: org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
uses-permission: name='org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE'
permission: org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
uses-permission: name='android.permission.ACCESS_FINE_LOCATION'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.INTERNET'
uses-permission: name='android.permission.WRITE_EXTERNAL_STORAGE'
uses-permission: name='com.android.launcher.permission.INSTALL_SHORTCUT'
uses-permission: name='com.android.launcher.permission.UNINSTALL_SHORTCUT'
uses-permission: name='com.android.browser.permission.READ_HISTORY_BOOKMARKS'
uses-permission: name='android.permission.WAKE_LOCK'
uses-permission: name='android.permission.VIBRATE'
uses-permission: name='org.mozilla.firefox.permissions.PASSWORD_PROVIDER'
uses-permission: name='org.mozilla.firefox.permissions.BROWSER_PROVIDER'
uses-permission: name='org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER'
uses-permission: name='android.permission.NFC'
uses-permission: name='android.permission.RECORD_AUDIO'
uses-permission: name='android.permission.CAMERA'
permission: org.mozilla.firefox.permissions.BROWSER_PROVIDER
permission: org.mozilla.firefox.permissions.PASSWORD_PROVIDER
permission: org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
cheeky-android@cheekydroid:~$


Note: Some permission strings are repeated (eg android.permission.ACCESS_NETWORK_STATE).

And here is our script's output for the same firefox4.apk:

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/4.4.2-apks/testsubdir4/firefox4.apk

Running print_apk_perms.py v2015-06-13
Source file = test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Output will be ordered by AndroidManifest.xml file offset

Attempting to open single file test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Input apk file test-apks/4.4.2-apks/testsubdir4/firefox4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbba    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbfc    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc4e    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc96    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xcde    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd32    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd82    android.permission.WRITE_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xdc8    android.permission.READ_SYNC_STATS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe10    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe5e    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xed4    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf2a    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf92    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xffe    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1050    android.permission.INTERNET
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x108a    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x10e0    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1144    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x11ac    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x121a    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1256    android.permission.VIBRATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x128e    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x12f4    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1358    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1490    android.permission.NFC
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x14ec    android.permission.RECORD_AUDIO
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x15ea    android.permission.CAMERA
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5a4a    com.android.internal.app.ResolverActivity
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5c28    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6428    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6e1a    android.permission.GLOBAL_SEARCH
cheeky-android@cheekydroid:~$


Note: Our script only prints each permission string once compared to "aapt" printing some permission strings multiple times.

Not shown (because we've lost the will to go on): We validated the script's output for each .apk against the "aapt" tool.
The 3 Android 4.4.2 .apks tested were for Twitter, Firefox and Adobe Reader.
The 4 Android 5.1.1 .apks tested were for Firefox, Adobe Reader, Camera, MalwareBytes.
All permissions found by the "aapt" tool were found by our script. As expected, our script only listed each permission once.
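
If you would rather not eyeball the comparison, something like the following sketch could reduce the "aapt dump permissions" text to a set of unique permission names and check it against our script's strings. This is not how we actually validated the results: the "unique_aapt_permissions" helper is hypothetical and the regular expression only assumes the two line formats shown in the aapt output above.

```python
import re

# Matches "uses-permission: name='X'" and "permission: X" lines
AAPT_LINE = re.compile(r"^(?:uses-)?permission: (?:name=')?([^']+)'?$")

def unique_aapt_permissions(aapt_output):
    perms = set()
    for line in aapt_output.splitlines():
        m = AAPT_LINE.match(line.strip())
        if m:
            perms.add(m.group(1))
    return perms

aapt_text = """package: org.mozilla.firefox
uses-permission: name='android.permission.GET_ACCOUNTS'
permission: org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
uses-permission: name='android.permission.GET_ACCOUNTS'
"""
script_perms = set(["android.permission.GET_ACCOUNTS",
                    "org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE",
                    "android.permission.GLOBAL_SEARCH"])

# Anything aapt reports that our script did not print
missing = unique_aapt_permissions(aapt_text) - script_perms
print(sorted(missing))
```

An empty "missing" list means every permission reported by aapt was also found by the script. The reverse difference would show the extra strings that only the script printed.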

Final Thoughts

Our print_apk_perms.py script only prints strings containing ".permission" or "com.android.". If there's a permission string that does not contain either of those strings, the script will not print it. If you experience this, you can run the same command with a "-d" to print all .apk pool strings to double check. You can also use the "aapt" dev tool to manually interrogate the .apk of interest as we did previously in the testing section.

Testing was done using English language based Android devices. It is unknown if/how the script will work with non-English device .apks.
It was also tested on a limited number of .apks but as long as Android App Developers create consistent AndroidManifest.xml files, the script *should* work OK. *nervous giggle*

Monkey was thinking of a similar app permission script for iOS but he doesn't have any test devices/data. Also, I suspect copying app files directly from an unrooted iOS device is not getting any easier these days (besides performing an iOS backup).

PS I may be showing my age with the "perms" reference ...


Friday, 5 June 2015

Extracting Pictures from MS Office (2007)

It extracts the pictures or it gets the hose! Er, Sorry about that ... Python can be a little unco-operative at times ;)


An MS Office (2007) document is comprised of a group of files zipped together into one archive file. Pictures are stored in a "media" subfolder and are linked to the document via relationships declared in various XML files. A quick Google did not find an existing Python script to extract MS Office (2007) pictures, so this post intends to show how we can create a basic image extraction Python script (msoffice-pic-extractor.py). You can download it from my GitHub page.

This post was inspired after Jared Greenhill (@jared703) retweeted a David Koepi (@davidkoepi) tweet containing this link.

So thanks to them, monkey had a reason to get off the couch ... and sit in front of a PC instead :)

We begin by unzipping the content of the various MS Office files (.docx, .xlsx, .pptx) and noting how they are arranged. You can use 7-zip (in Windows) or the Archive Manager (in Ubuntu) to view an MS Office document's component files/sub-directories.

MS Word 2007

Word images are stored under the zip archive's word/media directory and are named generically. eg image1.jpeg

Word images are stored under the word/media directory


Image metadata is stored in word/document.xml using the <wp:docPr> XML element tag.

Word image metadata is stored in word/document.xml

This metadata includes the source picture's filename under the "descr" attribute. For example:
<wp:docPr id="1" name="Picture 0" descr="Hex-and-BADCOFFEE.png"/>

MS Powerpoint 2007

Powerpoint images are stored under the ppt/media directory and are named generically. eg image1.jpeg

Powerpoint images are stored under the ppt/media directory



Image metadata is stored (per slide) under the ppt/slides/ directory. Each slide's XML file is named generically. eg slide1.xml, slide2.xml

Powerpoint image metadata is stored per slide in ppt/slides/


Image metadata for slides is stored using the <p:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<p:cNvPr id="4" name="Picture 3" descr="Hex-and-BADCOFFEE.png"/>

Note: Both "name" and "descr" were set to string values for pictures. Other (non-picture) instances of the <p:cNvPr> element may also exist but they will not typically set both the "name" and "descr" attributes. So this gives us a tentative way of identifying picture metadata.
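
Here's a minimal sketch of that "both attributes set" check, written for Python 3. The slide fragment is made up for illustration (real slide XML is much larger), and the namespace URI is the standard PresentationML one, which is assumed to be what a real slide1.xml declares for the "p" prefix.

```python
import xml.etree.ElementTree as ET

# Made-up slide fragment for illustration; real slide XML is much larger.
SLIDE_XML = """<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <p:cNvPr id="1" name=""/>
  <p:cNvPr id="2" name="Title 1"/>
  <p:cNvPr id="4" name="Picture 3" descr="Hex-and-BADCOFFEE.png"/>
</p:sld>"""

namespace = {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"}
root = ET.fromstring(SLIDE_XML)

pictures = []
for elem in root.findall(".//p:cNvPr", namespace):
    name = elem.get("name")
    descr = elem.get("descr")
    # Keep only elements where both attributes are present
    if name is not None and descr is not None:
        pictures.append((name, descr))

print(pictures)
```

Only the third element carries both "name" and "descr", so it is the only one kept. As noted, this is a tentative test: a non-picture element that happens to set both attributes would also get through.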

MS Excel 2007

Excel images are stored under the xl/media directory and are named generically. eg image1.jpeg

Excel images are stored under the xl/media directory


Image metadata is stored (per worksheet) under the xl/drawings/ directory. Each worksheet's XML file is named generically. eg drawing1.xml, drawing2.xml

Excel image metadata is stored per worksheet in xl/drawings/

Image metadata for worksheets is stored using the <xdr:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<xdr:cNvPr id="2" name="Picture 1" descr="Hex-and-BADCOFFEE.png"/>

Other Observations

It was observed that pictures inserted from source .jpg's were then stored in the zip file's media directory as .jpeg.
Pictures inserted from source .bmp and/or .png were stored as .png.
Pictures inserted from clipart .wmf were stored as .wmf. Clipart also had the path to the Clipart source file written to the "descr" attribute. eg descr = "C:\Program Files (x86)\Microsoft Office\MEDIA\CAGCAT10\j0216724.wmf"


The Script

When first researching/writing any extraction script, Google is your friend :)
Some helpful Python tips were found at StackOverflow by searching for "Python", "zip" and "namespace XML".
This post showed how we can read the files from a zipfile and extract/output selected files.
This post showed how to handle XML namespaces in an XML file. This is relevant because the element tags containing the source picture's filename are declared using XML namespaces.
So for .docx files, the <wp:docPr> tag is used for picture metadata. The "wp" represents the namespace and the "docPr" is the element name. Namespaces are used so that you can have multiple elements with the same name so long as they are in different namespaces. eg domain1:petmonkey_name, domain2:petmonkey_name.
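
To make the namespace idea a bit more concrete, here is a small Python 3 sketch using the petmonkey_name example. The XML and the example.com namespace URIs are invented purely for the demonstration.

```python
import xml.etree.ElementTree as ET

# Made-up XML with two namespaces that both define "petmonkey_name".
XML = """<zoo xmlns:domain1="http://example.com/domain1"
             xmlns:domain2="http://example.com/domain2">
  <domain1:petmonkey_name>Cheeky</domain1:petmonkey_name>
  <domain2:petmonkey_name>Bubbles</domain2:petmonkey_name>
</zoo>"""

ns = {
    "domain1": "http://example.com/domain1",
    "domain2": "http://example.com/domain2",
}
root = ET.fromstring(XML)

# The prefix in the search path selects which namespace to match.
first = root.find("domain1:petmonkey_name", ns).text
second = root.find("domain2:petmonkey_name", ns).text
print(first, second)

# Internally, ElementTree expands the prefix to "{uri}localname".
print(root[0].tag)
```

The last line shows why the prefix-to-URI dictionary is needed: ElementTree stores the full URI in the tag, so a search for plain "petmonkey_name" would find nothing.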

The msoffice-pic-extractor.py script takes two arguments:
- the target filename of the MS Office 2007 file (or it can be the name of a single level directory containing multiple MS Office 2007 files)
- the destination directory for extracting the pictures to. The pictures are extracted to a sub-directory with the same name as the source MS Office file. The extracted files will be labelled like image1.jpeg etc.

Here's the script's help text:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py -h
usage: msoffice-pic-extractor.py [-h] target destdir

Extracts pics from given MS Office document

positional arguments:
  target      MS Office document / directory of Office documents to be searched
  destdir     output dir

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$

The script tries to detect whether the "target" argument is a directory. If it's not detected as a directory, it is assumed to be a single file. The file extension is then checked and the parse_docx / parse_xlsx / parse_pptx functions are called as required.
If the "target" is a directory, then the script walks through the files in the directory and calls the appropriate parse functions based on the file extension.

Note: The script does not currently handle nested subdirectories - it ass-umes all files are contained in the root of the directory specified.
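
The file/directory handling described above might be sketched like this. It is not the script's actual code: the parse functions here are placeholder stubs, the dispatch and process_target names are made up, and lower-casing the extension before matching is an assumption.

```python
import os

# Placeholder parsers; the real script's functions do the actual work.
def parse_docx(path, destdir):
    return "docx"

def parse_xlsx(path, destdir):
    return "xlsx"

def parse_pptx(path, destdir):
    return "pptx"

PARSERS = {".docx": parse_docx, ".xlsx": parse_xlsx, ".pptx": parse_pptx}

def dispatch(path, destdir):
    """Call the parser matching the file extension; return None if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    parser = PARSERS.get(ext)
    return parser(path, destdir) if parser else None

def process_target(target, destdir):
    """Handle a single file, or every file in the top level of a directory."""
    if os.path.isdir(target):
        results = []
        for name in sorted(os.listdir(target)):
            full = os.path.join(target, name)
            if os.path.isfile(full):  # subdirectories are skipped, not walked
                results.append(dispatch(full, destdir))
        return results
    return [dispatch(target, destdir)]
```

Using a dictionary keyed on extension keeps the single-file and directory cases on the same code path.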

The parse functions are all very similar - we probably could have had one function and passed it different arguments to indicate the filetype but for initial testing/debugging, it was quicker/simpler to have separate parse functions.
Anyhoo, each parse function follows this basic pattern:
- Checks that the file is a valid zip file using the zipfile.is_zipfile() function
- Creates a zipfile object via the zipfile.ZipFile() function
- Uses zipfile.infolist() to list the file contents of the zip file. It then checks for the picture metadata XML file (eg word/document.xml) and prints out the relevant metadata. For any pictures stored in the media directory, it also calls zipfile.read() to retrieve the contents and then writes the contents to a new file in the "destdir" directory.
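
The steps above can be sketched as a stripped-down Python 3 example that only handles the picture extraction part. The extract_media name and media_prefix parameter are hypothetical, and the test archive is a tiny in-memory stand-in rather than a real .docx.

```python
import io
import os
import tempfile
import zipfile

def extract_media(src, destdir, media_prefix="word/media/"):
    """Copy every file under media_prefix out of the zip archive into destdir."""
    if not zipfile.is_zipfile(src):
        return []
    extracted = []
    with zipfile.ZipFile(src, "r") as z:
        for info in z.infolist():
            if info.filename.startswith(media_prefix) and not info.filename.endswith("/"):
                outname = os.path.join(destdir, os.path.basename(info.filename))
                with open(outname, "wb") as out:
                    out.write(z.read(info.filename))
                extracted.append(os.path.basename(info.filename))
    return extracted

# Build a tiny stand-in archive in memory (not a real .docx) to try it out.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<doc/>")
    z.writestr("word/media/image1.png", b"fake image bytes")
buf.seek(0)

destdir = tempfile.mkdtemp()
print(extract_media(buf, destdir))
```

Passing "ppt/media/" or "xl/media/" as media_prefix would cover the other two file types in the same way.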

Checking for the picture metadata involves calling ElementTree.parse to parse the appropriate XML file and then extracting/printing out any picture elements. Looking at the .docx parsing code, we need to extract the "name" and "descr" attributes from any "wp:docPr" elements.

So the relevant code looks like this:
docdata = z.open(j.filename) # opens the picture metadata xml file using the zipfile library's open function
tree = ET.parse(docdata) # parses the XML file to get to the root/top node
root = tree.getroot()
We then specify that "wp" represents an XML namespace via the following line:
namespace = {"wp" : "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
Now we call the findall function to return a list (called picdatas) of all "<wp:docPr>" elements:
picdatas = root.findall(".//wp:docPr", namespace)
Then we can iterate through each item in the list and print the "name" and "descr" attributes (if they are both set):

for picdata in picdatas: #id="4" name="Picture 3" descr="Penguins.jpg"
    name = picdata.get("name")
    descr = picdata.get("descr")
    if (name is not None) and (descr is not None):
        print(filename + " : " + j.filename + ", name = " + name + ", descr = " + descr)
For more information on parsing XML trees, see my previous post.

Testing

The script has been tested on Win7 x64 & Ubuntu 14.04 x64 with Python 2.7 and MS Office 2007 .docx, .xlsx, .pptx files.

For testing, we created a .docx with pictures inserted in the following order:
Hex-and-BADCOFFEE.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


Script use example for single .docx (testdoc.docx):
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc.docx testdocop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc.docx
Output dir = testdocop

Attempting to open single file testdoc.docx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop/testdoc.docx
Extracting picture image4.jpeg to testdocop/testdoc.docx
Extracting picture image2.png to testdocop/testdoc.docx
Extracting picture image3.png to testdocop/testdoc.docx
cheeky@ubuntu:~$

Note: You can see the "name" attribute gives a general indication of the order in which the pictures were inserted into a .docx file. Also note how the "descr" values show the source image's filename.

Here's the script's output directory contents:

Extracted pictures for the first .docx version


Note: Extracted picture file types may differ from the original source file types

Later, we inserted "winphone-washer.png" after the first picture, so the order became:
Hex-and-BADCOFFEE.png
winphone-washer.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


We then ran the script on the new file (testdoc2.docx) ...
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc2.docx testdocop2

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc2.docx
Output dir = testdocop2
Creating destination directory ...

Attempting to open single file testdoc2.docx

Attempting to parse docx = testdoc2.docx
Input MS Office file testdoc2.docx checked OK!
Processing word/document.xml for picture metadata
testdoc2.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc2.docx : word/document.xml, name = Picture 4, descr = winphone-washer.png
testdoc2.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc2.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc2.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop2/testdoc2.docx
Extracting picture image5.jpeg to testdocop2/testdoc2.docx
Extracting picture image4.png to testdocop2/testdoc2.docx
Extracting picture image3.png to testdocop2/testdoc2.docx
Extracting picture image2.png to testdocop2/testdoc2.docx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the second .docx version (added winphone-washer.png)

We can see from the "name" values that Picture 4 (winphone-washer.png) was added after Pictures 0 to 3.

Script use example for single .pptx (testppt.pptx):
For testing, we created a .pptx with the pictures in the following order -
Hex-and-BADCOFFEE.png (slide1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on slide2)
wp-app-trawling-blk.jpg (slide3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testppt.pptx testppop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testppt.pptx
Output dir = testppop
Creating destination directory ...

Attempting to open single file testppt.pptx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testppop/testppt.pptx
Extracting picture image2.png to testppop/testppt.pptx
Extracting picture image1.png to testppop/testppt.pptx
Extracting picture image4.jpeg to testppop/testppt.pptx
cheeky@ubuntu:~$

In contrast to the .docx file, the "name" values seem to vary depending on the source file type (or perhaps the position on the slide? eg title vs body) so we can't ascertain the order in which they were added. The output file names seem to confirm the order of appearance however. Also note how the "descr" values show the source image's filename.

The output directory looked like:

Extracted pictures for the test .pptx file



Script use example for single .xlsx (testxl.xlsx):
For testing, we created a .xlsx with the pictures in the following order:
Hex-and-BADCOFFEE.png (sheet1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on sheet2)
wp-app-trawling-blk.jpg (sheet3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testxl.xlsx testxlop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testxl.xlsx
Output dir = testxlop
Creating destination directory ...

Attempting to open single file testxl.xlsx

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testxlop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testxlop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testxlop/testxl.xlsx
Extracting picture image3.png to testxlop/testxl.xlsx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the test .xlsx file


The "name" values appear to be reset per Excel worksheet/ XML drawing file but the numbering seems consistent with the order in which they appear. eg "Picture 1" appears before "Picture 2" on worksheet / XML drawing 2. Also note how the "descr" values show the source image's filename.

And now for the bonus party trick - processing all three file types from the same source directory with one command:

Here's what the source directory looked like:


All 3 MS Office file types in the same source directory


Running the script looks like:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testgroup testgroupop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testgroup
Output dir = testgroupop
Creating destination directory ...

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testgroupop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testgroupop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testgroupop/testxl.xlsx
Extracting picture image3.png to testgroupop/testxl.xlsx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testgroupop/testdoc.docx
Extracting picture image4.jpeg to testgroupop/testdoc.docx
Extracting picture image2.png to testgroupop/testdoc.docx
Extracting picture image3.png to testgroupop/testdoc.docx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testgroupop/testppt.pptx
Extracting picture image2.png to testgroupop/testppt.pptx
Extracting picture image1.png to testgroupop/testppt.pptx
Extracting picture image4.jpeg to testgroupop/testppt.pptx

Parsed 3 MS Office files
cheeky@ubuntu:~$

Here are the output files:

Output files after group processing

For giggles, we created a Libre Office Writer document in Ubuntu, saved it as a Word 2007/2010/2013 .docx and then ran the script. The script extracted the pictures OK but the "descr" and "name" fields did not contain the same level of detail as observed for an official MS Office 2007 .docx. The "name" attribute was consistently set to "Picture" and the "descr" attribute was blank/empty. So while we may not be able to retrieve the source picture's filename, we can still extract the images.
This may also indicate that Word 2010/2013 uses the same file structure as Word 2007. So our script might be able to extract pictures from MS Office 2010/2013 documents. Meh.

Final Thoughts

Currently the msoffice-pic-extractor.py script handles either individual files or multiple MS Office files located under a single level "target" directory. Nested sub-directories are not processed.
Resolving this issue would probably require incorporating the path into the output filename so that 2 MS Office files with the same filename but under different directories could be processed OK. Seemed a little over-complicated for such a quick script. Or maybe I was just feeling like a lazy monkey (again!).
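
For the curious, one possible way of folding the path into the output name is sketched below. This is not part of the script. The function names are hypothetical and replacing path separators with underscores is just one assumed naming scheme.

```python
import os

OFFICE_EXTS = (".docx", ".xlsx", ".pptx")

def output_subdir_name(root_dir, file_path):
    """Build an output sub-directory name from the file's relative path."""
    rel = os.path.relpath(file_path, root_dir)
    # Replace path separators so the result is a single directory name
    return rel.replace(os.sep, "_")

def find_office_files(root_dir):
    """Walk root_dir recursively and yield (file_path, output_subdir_name) pairs."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.lower().endswith(OFFICE_EXTS):
                full = os.path.join(dirpath, name)
                yield full, output_subdir_name(root_dir, full)
```

With this scheme, a/test.docx and b/test.docx would end up in "a_test.docx" and "b_test.docx" respectively. It is not bulletproof: a_b/test.docx and a/b_test.docx would still collide, so a real fix would need a less ambiguous separator or mirrored sub-directories.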

Because MS Office can convert the inserted source pictures into a different file type for storage, any original EXIF data (eg GPS co-ordinates, camera model) will not be retained (apart from the source filename).

While the forensic uses for this project are somewhat limited (eg possible IP theft / illicit image storage), the project still provided a good learning exercise showing how we can use Python to read zip files and parse XML.
It wasn't overly-complicated (he says thanking StackOverflow profusely) but as with learning any language, practice makes perfect. The alternative view is - throw enough crap on the wall, and some of it is bound to stick to you :)