Saturday, 14 October 2017

Monkey takes a .heic

The hills are alive ... with the compression of H.265!

With iOS 11 and macOS High Sierra (10.13), Apple has introduced a file container format called High Efficiency Image File Format (aka HEIF - apparently its pronounced "heef"). Apple is using HEIF to store camera/video/Apple "Live Photos". HEIF is based on multiple standards such as:
- ISO Base Media File Format ISO (14496-12) for structuring data sections within the file container
- ISO/IEC 23008-12 MPEG-H Part 2 / ITU H.265 for compressing the actual still picture and video data. Also referred to as High Efficiency Video Coding (HEVC). Theoretically, HEIF could use other compression algorithms but Apple is using it exclusively with HEVC / H.265.

Some benefits of HEIF are:
- It approximately halves the file size for a given image/video quality.
- It allows for a single file to contain multiple media (eg multiple animated still pictures AND sound e.g. an Apple "Live Photo").

Apple HEIF images will have a .heic file extension. Apple HEVC encoded movies will have the familar Quicktime .mov extension but internally they will use HEVC / H.265 compression. The ISO Base Media File Format ISO (14496-12) is based upon the Quicktime file structure and so it will apply to both .heic images and HEVC .mov files. 

Because it uses a more complex compression algorithm than previous standards (eg H.264 and JPEG), only recent model Apple devices have the required hardware to create HEVC content.
According to Apple's 2017 WWDC presentation 503 "Introducing HEIF and HEVC", to create HEVC pictures/video you need (at least) an iPhone 7 / iPad Pro (A10 Fusion chip) running iOS 11 or a 6th generation Intel Core processor running macOS 10.13 High Sierra.
Software decoding support is apparently available for all Apple devices (presumably running iOS 11 / High Sierra) but playback performance will probably suffer on older hardware.

For the rest of this post we will discuss:
- how to view .heic and HEVC .mov files
- the file format for .heic files
- the file format for HEVC .mov files

We won't be discussing how HEVC / H.265 compression works. For a quick overview on some basic concepts and the difference between H.264 and H.265, please watch this video.

And before we dive any deeper ...
Special Thanks to Maggie Gaffney from Teeltech USA for providing us with iPhone 8 Plus test media files.
We also used sample .heic files from an Ars Technica review (iPad Pro) and sample files provided to the FFmpeg forum (iPhone 7 Plus).

Viewing & Compatibility Issues

Here is an article showing how to set up an iOS 11 device to save/transfer .heic files in their original format (Camera set to "High Efficiency" and "Photos - Transfer to Mac or PC" set to "Keep Originals"). Apple can auto-magically convert .heic files to .jpg files (and h.265 .mov to h.264 .mov) when transferring to non-compatible devices/destinations (eg PC or emails). So if you're not receiving .heic files, check those iOS settings.

Apart from viewing them natively on iOS or High Sierra (eg using Apple Photos or Preview), we found the easiest way to view .heic files was using this free Windows HEIF utility by @liuziangexit.
Note: there are two versions - Chinese and English. Being the uncultured lapdog monkey that we are, we downloaded the English version. Be sure to read the readme file included. Running it on Windows 10 (VM) also required installing the signed Microsoft C++ Redistributable package which was conveniently included in the download zip file.



There is also a website that converts .heic to .jpeg but this may not be appropriate for sensitive photos.

For playing HEVC/H.265 encoded .mov files, we found that IrfanView and VLC player worked OK (IrfanView seemed to have better performance than VLC when viewing high resolution videos).

FFmpeg (v3.3.3) can also be used to screen grab frames  (1 per sec) from a H.265 .mov. The command is:
ffmpeg.exe -i sourcemovie.mov -vf fps=1 outputframe_%d.png
This will result in "outputframe_1.png", "outputframe_2.png" etc. being generated to the current directory.

For more compatible playback, we can convert an H.265 .mov into an H.264 .mov. The command is:
ffmpeg.exe -i source265movie.mov -map 0 -c copy -c:v libx264 outputmovie264.mov
This copies all other streams (eg audio, subtitles) to the new output file and re-encodes/outputs the source video stream to H.264. See here for details on using the FFmpeg map argument.

We found the easiest way to send a test .heic from an iPhone to a PC was to upload it to Dropbox which has been updated to support .heic and H.265 encoded .mov files. You can view both .heic and .mov files from the Dropbox.com website. Unfortunately, it appears that Dropbox might rename the files upon upload. We were expecting to see something like "IMG_4479.heic" but the filename on Dropbox was something like "Photo Oct 08, 10 20 05.heic". Consequently, a hash compare of the source/destination files may be required to verify exact copies.

Exiftool (v10.63) added improved support for HEIF and it will display the EXIF data from an Apple generated .heic or H.265 .mov file. It has not been confirmed if iOS created .heic / HEVC .mov files will retain ALL of their original EXIF metadata after being auto-magically converted to .jpg / H.264 files.

We have not been able to find a non-Apple viewer for HEVC encoded "Live Photos". Trying to transfer them via Dropbox resulted in a "Live Photo" .heic file containing a single image (no sound or other animation). Sorry, no "Live Photos" for you!

File Structures

Now that we know how we can view iOS created HEIF images and videos, lets take a closer look at the actual file formats.
This will be a (reasonably?) short overview - we aren't going to become "data masochists" and delve into every field or the compression side of things. Maybe in a future post (especially if you've been a bad, bad, dirty, dirty monkey and feel the need to be punished LOL) ...

Apple created .heic and .mov files are BIG Endian.

Both .heic and .mov files are based on the ISO Base Media File Format. This means a .heic or .mov file container is divided into dozens of functional "boxes" of data. The start of each box will be marked with a 4 byte box size (typically) and a four byte box type string (eg. 'ftyp', 'mdat', 'meta'). Within the box, there will be other data fields which may consist of other boxes and/or a structured pattern of bytes. So there is a complicated hierarchy of boxes within boxes thing happening which makes it difficult to quickly understand every detail. The majority of the bytes (ie compressed data) will be stored in an "mdat" box. Other boxes will be used to store meta data about how to access/treat the data in those "mdat" boxes.
For further details on how these boxes are structured, please refer to the ISO Base Media File Format standard. Both it and the Quicktime movie format document will be your best friends for this section. FYI the ISO Base Media File Format is also used for .mp4 and .3gp files - so learning about this format will aid in understanding multiple types of media files.

Other handy references include:
- Chapter 3 of Lasse Heikkila's HEIF implementation thesis
- the Nokiatech HEIF Github site
- the 2017 Apple WWDC HEIF presentations (follow the transcript and slide PDF links) for the HEIC File Format and the Intro to HEIF amd HEVC.

For an Apple iPhone 8 Plus .heic file (containing a single 4032 x 3024 image) the file structure can look like this:

ftyp (size=0x18, majorbrand = 'heic', minorversion = 0, compatiblebrands = mif1, heic)
meta (size = 0xF74)
    hdlr (size = 0x22, handler_type is "pict" i.e. file is an image)
    dinf (size = 0x24)
    pitm (size = 0xE, item_ID = 0x31 = primary item)
    iinf (size 0x43D, entry_count = 0x33 = number of items stored)
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x1, item_type = hvc1, item_name = ""  [Tile 1]
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x2, item_type = hvc1, item_name = ""  [Tile 2]
        ...
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x30, item_type = hvc1, item_name = "" [Tile 30]
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x31, item_type = grid, item_name = ""  [derived image from all tiles]
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x32, item_type = hvc1, item_name = ""  [thumbnail]
        infe = ItemInfoEntry, size = 0x15, version = 2, item_ID = 0x33, item_type = Exif, item_name = ""  [EXIF]

    iref (size = 0x94, version = 0, contains array of SingleItemTypeReferenceBox)
        dimg (size = 0x6C, from_item_ID = 0x31, reference_count = 0x30, to_item_ID = 0x1, 0x2 ... 0x30) [derived image]
        thmb (size = 0xE, from_item_ID = 0x32, reference_count = 0x1, to_item_ID = 0x31) [thumbnail]
        cdsc (size = 0xE, from_item_ID = 0x33, reference_count = 0x1, to_item_ID = 0x31) [content description ref / exif]

    iprp (size = 0x6F3)
        ipco (size = 0x5AD) = ItemPropertyContainerBox = property data*
            colr (size = 0x230) = Colour Information 1
            hvcC (size = 0x70) = decoder configuration 1
            ispe (size = 0x14) = spatial extent 1-1
            ispe (size = 0x14) = spatial extent 1-2
            irot (size = 0x9) = Image rotation 1
            pixi (size = 0x10) = Pixel information 1
            colr (size = 0x230) = Colour Information 2
            hvcC (size = 0x70) = decoder configuration 2
            ispe (size = 0x14) = spatial extent 2-1
            pixi (size = 0x10) = Pixel information 2
        ipma (size = 0x13E) = Item Property Association = connects property data in ipco to item numbers*

            List of 0x32 items. Each item has the structure [item number(2 bytes), size (1 byte), data (size bytes)]
    idat (size = 0x10)
    iloc (size = 0x340, version = 1, offset_size = 4, length_size = 4, base_offset_size = 0, index_size = 0, item_count = 0x33 )
        [item_id = 0x1, file offsets used, base_offset = 0, extent_count = 0x1, extent_offset = X1, extent_length = Y1]
        [item_id = 0x2, file offsets used, base_offset = 0, extent_count = 0x1, extent_offset = X2, extent_length = Y2]
        ...
        [item_id = 0x33, file offsets used, base_offset = 0, extent_count = 0x1, extent_offset = X33, extent_length = Y33]

mdat (size = variable, contains data on EXIF / thumbnail / image data)

It looks a little daunting (and this doesn't even show all of the boxes/fields!) but once you figure out which fields are relevant, its not too bad.  We've color coded some sections to make it more followable/wake up those weary eyes.

The ftyp section declares the 'majorbrand' (i.e. file type) as "heic".
The meta section declares how to interpret the raw data stored in the mdat section. Notable sub-boxes include:
    hdlr = The 'handler_type' is set to "pict" which means this is an image (as opposed to a video).
    pitm = Specifies the Primary Item number (eg item_ID 0x31)
    iinf = Contains a list of ItemInfoEntrys. The number of items and sizes will change with resolution/shape (eg camera specs, square photo).  From the 2017 WWDC 513 presentation and actual iOS samples we've observed, images are divided/stored as tiles.
            For a 4032 x 3024 resolution image, there were 0x33 items declared in each .heic file. These consisted of:
            0x30 items with each item_type = 'hvc1'. Each item corresponds to a 512x512 tile.
            1 'grid' item represents the full derived image
            1 'hvc1' item is used for the 320x240 thumbnail
            1 'Exif' item is used for storing EXIF data
    iref = contains array of SingleItemTypeReferenceBox items. From this section we can see that item_ID = 31 is a derived image ('dimg') that refers to item_IDs 0x1 to 0x30 (tiles). There are also references to the thumbnail and exif items.
    iprp = connects item_IDs in the 'ipma' sub-section to properties in the 'ipco' sub-section. *We were unable to find much public documentation on how this is implemented (apart from the Nokia HEIF Github source code).
    iloc = contains file offsets for each item_ID section. e.g. For EXIF (item_ID = 0x33), the extent_offset = 0x000043DB,  extent_length = 0x000007F8. So if we go to the file offset at 0x000043DB, we will see the EXIF item data.
The mdat section contains the raw image data, thumbnail and Exif information.

Due to the tiling, full file recovery could be a bastard a lot more complicated compared to recovering a jpeg (where you can carve everything between the 0xFFD8 and 0xFFD9 markers).
As iOS 11 also uses file based encryption, it *should* be impossible to carve & recover .heic files anyway.
However, if those .heic files were also copied to a separate non-encrypted device (eg PC) and then corrupted/deleted, it *may* be possible to repair or recover some/all of the tiles (theoretically!).

OK, suck it up buttercups ...because there's more!

Here's the file structure for a 6.73 second 7.3 MB Apple HEVC / H.265 encoded .mov taken with an iPhone 8 Plus:

ftyp (size = 0x14, majorbrand = 'qt  ')

wide (size = 0x8)
mdat (size = 0x00746120, contains HEVC / H.265 video data)
moov (size = 0x0028FA)
    mvhd (size = 0x6C, version = 0, creation_time = 0xD5FFE81E (secs since 1JAN1904), modification_time = 0xD5FFE825, ­

                timescale = 0x0000258 = 600 dec. units per sec, duration = 0x00000FC7 = 4039 dec units => 4039/600 = 6.73 secs, 
                next_track_ID = 0x5)
    trak (size = 0x0FE6)
        tkhd (size = 0x5C, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, track_ID = 0x1, 
                  duration = 0xFC7, width = 0x07800000 => 0x780 = 1920 decimal, height = 0x04380000 => 0x438 = 1080 decimal)
        tapt (size = 0x44)
        edts (size = 0x24)
        mdia (size = 0xF1A) = media box
            mdhd (size = 0x20, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, timescale = 0x258, 
                        duration = 0xFC7)
            hdlr (size = 0x31, component type = mhlr = media handler, component subtype = vide, component manufacturer = appl,

                     component name = "Core Media Video"
            minf (size = 0xEC1) = contains file offsets to samples/chunks of samples
    trak (size = 0x07B4)
        tkhd (size = 0x5C, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, track_ID = 0x2, 
                  duration = 0xFC7, width = 0, height = 0)
        edts (size = 0x24)
        mdia (size = 0x72C)
            mdhd (size = 0x20, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, timescale = 0x000AC44 =
                        44100 samples/sec, duration = 0x00049000 = 299008 samples = 6.78 sec)
            hdlr (size = 0x31, component type = mhlr = media handler, component subtype = soun, component manufacturer = appl,

                     component name = "Core Media Audio"
            minf (size = 0x6D3) = contains file offsets to samples/chunks of samples
    trak (size = 0x042E)
        tkhd (size = 0x5C, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, track_ID = 0x3, 
                  duration = 0xFC7, width = 0, height = 0)
        edts (size = 0x24)
        tref (size = 0x20)
        mdia (size = 0x386)
            mdhd (size = 0x20, version = 0, creation_time =
0xD5FFE81E, modification_time = D5FFE825, timescale = 0x258, 
                        duration = 0xFC7)
            hdlr (size = 0x34, component type = mhlr = media handler, component subtype = meta, component manufacturer = appl,

                     component name = "Core Media Metadata"
            minf (size = 0x32A) = contains file offsets to samples/chunks of samples
    trak (size = 0x0271)
        tkhd (size = 0x5C, version = 0, creation_time = 0xD5FFE81E, modification_time = 0xD5FFE825, track_ID = 0x4, 

                  duration = 0xFC7, width = 0, height = 0)
        edts (size = 0x24)
        tref (size = 0x20)
        mdia (size = 0x1C9)
            mdhd (size = 0x20, version = 0, creation_time =
0xD5FFE81E, modification_time = 0xD5FFE825, timescale = 0x258, 
                        duration = 0xFC7)
            hdlr (size = 0x34, component type = mhlr = media handler, component subtype = meta, component manufacturer = appl,

                     component name = "Core Media Metadata"
            minf (size = 0x16D) => contains file offsets to samples/chunks of samples

udta (size= 0x08)
free (size = 0x400)
meta (size = 0x5BD)
    hdlr (size = 0x22, component subtype = mdta)
    keys (size = 0xC9) => contains various metadata field names
    ilst (size = 0xCA) => contains various metadata field values
    free (size = 0x400)

free (size = 0x88)

We can see some familiar 4 letter strings (reckon you might be spouting some others of your own by now ...) and the offset information is now contained in the 'moov' section (recall that offset info is stored in the 'meta' / 'iloc' section for a .heic).
Also, instead of utilising "items" like .heic, the movie is organised into traks (eg video trak, sound trak). These 'trak' boxes include file offsets to the 'mdat' section (via 'trak' / 'mdia' / 'minf').
The 'moov' box has a movie header atom labelled 'mvhd'. This shows the length of the movie and creation/modified dates (amongst other things).
There were 4 traks recorded - one for video (track_ID=1), one for sound (track_ID=2) and two for meta data (track_ID=3 and 4). The second (smaller) metadata trak (track_ID=4) may be extending the first (track_ID=3) metadata trak (due to space limitations?) as the metadata strings are different but seem related.
'free' marks boxes that can be ignored/skipped
'meta' marks a box containing metadata however, the 'meta' structure from a .mov *will not* match the 'meta' data structure from a .heic image. Presumably Exiftool will grab metadata from both 'trak' and 'meta' boxes.

In other observed H.265 .mov files (both smaller and larger), multiple 'mdat' sections were observed. This may be related to the existence of a 'hoov' box which we couldn't find any documentation on. The 'hoov' box appeared at an lower (earlier) file offset than the 'moov' and also contained 'mvhd' and 'trak' boxes etc.
Perhaps the 'hoov' box was a previous 'moov' box that had its name modified so its data can be overwritten? eg as file grows in size, data gets re-written? #SpeculatorMonkey


Final Thoughts

Oh, my aching lederhosen!
And this post has only just scratched the arse surface of the HEIF-y beast.
There are a lot more possibilities with HEIF than what Apple has currently implemented. The Nokiatech Github site demonstrates a bunch of different image file possibilities (eg single images, sequences of images, HD movies, combined images/video).

Although we weren't able to capture a native Apple "Live Photo" for examination, we *suspect* it will use a sequences .heics file and have a 'moov' box in addition to 'ftyp', 'mdat' and 'meta' boxes. This was kinda shown on slide 60 of the 2017 Apple WWDC slides for High Efficiency Image File Format.

This post by macrumors.com states that Apple "Live Photos" initially consisted of a 12 MP jpeg with 45 frames of H.264 .mov at 15 fps (i.e. 3 secs video = 1.5 secs before/after button press).
This anandtech forum article states that "Live Photos" are:
"1440x1080 HEVC on certain devices, albeit paired with HEIC images now instead of JPEG. There is also the option of leaving it has 1440x1080 h.264 with JPEG though."

Anyhoo, if you are able to catch a "Live Photos" unicorn file, we'd be very interested to hear about its file structure (leave a comment?).

UPDATE 15OCT2017:
For "Live Photos" we tried directly connecting an iPhone 8 running iOS 11.0.3 to a Windows 7 PC and was able to see the DCIM folders. The iPhone 8 was set to the default "High Efficiency" Photos with Mac/PC transfer set to "Keep Originals". However, after copying the files over, when we looked at the transferred .heic file structures on the PC they were single images.
There were no 'moov' or 'trak' items. So it looks like iOS is not openly exposing their "Live Photo" file structure to non-Apple devices. Boo! :'(

Finally, if you know of any other easy to install/use .heic viewers or have any thoughts/suggestions, please leave a note in the comments.

Monday, 21 August 2017

Monkey Unpacks Some Python

UNPACK-ing Python .. Now with added monkey!
Some forensic folks have suggested that a Python tutorial on how to read/print binary data types might be helpful to budding Python programmers in the community.
So in this post, we will simulate reverse engineering a fictional contact file format and then write a Python script to extract/print out the values.
For brevity, this post ass-umes the reader has a basic knowledge of Python (i.e. they can launch a script and know about functions/assigning variables etc.). There are plenty of introductory tutorials online - if you are a beginner, you might want to check out the Google Developer Python course before proceeding.

The script (unpack-tute.py) has been tested with both Python v2.7.12 and Python 3.4.1 on a Win7x64 PC.
Historically, Python 2 had more supported 3rd party libraries. Consequently, it was the first version of Python that this monkey learned and we are actually more familiar with Python 2. Python 2's End Of Life is currently scheduled for April 2020 so there's a few years left. However, as this script does not rely on 3rd party libraries, we have adapted it to run on both Python 2 and 3.
The main difference affecting this script was that Python 3 treats strings as Unicode by default so we had to add an encode('utf-8') call when searching through our data file.

There is more than one way to code a solution. We have tried to make this code easy to follow instead of making it "Pythonic" (whatever that even means) or by adding lots of error checking code (if you write a script, you should know how it to use it!).
The Python script (unpack-tute.py) and sample binary file (testctx.bin) will be posted to my brand-monkey-spanking-new GitHub Python Tutorials folder.

So, here's a screenshot of the "testctx.bin" file we want to read:
Screenshot of "testctx.bin" (brought to us courtesy of WinHex!)

Note: The first contact record is highlighted and offsets are listed in decimal. Curious George is ... curious?

Using some reverse engineering strategies that we previously wrote about here we can make a few observations regarding the structure of each Contact record ...

  • We can see there's a repeated "ctx!" string before each Contact record.
  • After each record's "ctx!" field, there is a Little Endian 2 byte field that seems to increase with each subsequent record (eg 0x0100 at decimal offset 68, 0x02FF at decimal offset 516, 0xFFFF at decimal offset 804). For initial classification, we will say its an index record number.
  • Each record has a UTF16LE (ie 2 bytes per character) string that contains a name (eg George).
  • Each record has a UTF8/ASCII (ie 1 byte per character) string that contains a phone number (eg 5551234).
  • Before each of the strings, there is a one byte integer corresponding to the string size in bytes.
  • The last field seems to be a 4 byte field. By observing which bytes vary and which bytes remain constant(ie the left most bytes change more rapidly than the rightmost bytes), we suspect the last field is a Little Endian timestamp field. Feeding in the first record's last 4 bytes (ie 0x26CDDB56) into DCode results in a valid date/time for a Unix 32 bit Little Endian timestamp
DCoding the Contact timestamp


So here's our contact record format:
Contact record data structure


And here's a summary of what we want the script to do:

1. Open "testctx.bin" file (read only)
2. Store file contents
3. Search file contents for ctx! markers
4. For each hit:
    4a. Print hit offset
    4b. Extract Index Number field and print
    4c. Extract Name Length field and print
    4d. Extract Name String (UTF16LE) field and print
    4e. Extract Phone Length field and print
    4f. Extract Phone String (UTF8) field and print
    4g. Extract Unix Timestamp field and print (in ISO format)

5. Close file

Simples!

The Script

OK now that we know what we want to do, here's how we implement each step in code ...

Steps 1 & 2 Open file and store file contents (See "unpack-tute.py" lines 25-33):
1. Open "testctx.bin" file (read only)
2. Store file contents


For step 1, we open the "textctx.bin" file in read-only binary mode (what the "rb" stands for):
    fb = open(filename, "rb")

We chose read-only mode because we don't want to change the file contents and we chose binary mode because we are interpreting the file as raw bytes (not text).
Then to read/store the file contents, we call:
    filecontent = fb.read()

So the "filecontent" variable will now contain every byte from the "testctx.bin" file and individual bytes can be accessed directly using the "slice" notation.
For example, filecontent[0:3] is 3 bytes long and includes the bytes at offsets 0, 1 and 2. It does NOT include the byte at offset 3.
If we replace the start/end locations of our slice example with a variable called startoffset, we get:
    filecontent[startoffset:(startoffset+3)]
This will include the 3 bytes at start, start+1, start+2 only.
The reader might want to remember that little notation nugget as monkey has the feeling it will be popping up again later ... (Hehe, Poo jokes are still floating around in 2017!)

Step 3: Search file contents for "ctx!" markers (See "unpack-tute.py" lines 35-49):
Knowing that "ctx!" encoded in ASCII/UTF8 is x63 x74 x78 x21, we can use a variable "searchstring" to represent our search term in hex:
    searchstring = "\x63\x74\x78\x21"

We now consider the "filecontent" variable as one big string of bytes ...
Python string types have a find() method which searches the parent string for a substring. The find() method returns -1 if the substring is not found otherwise, it returns the first offset where it found the substring. The find() method can also take an starting offset argument so we can use a while loop to repeatedly call find() with an incrementing starting offset until we get no more hits. Thus we can find an offset for each substring hit in the parent string, which we can then store in a Python list called "hitlist".
Here's the code:
    nexthit = filecontent.find(searchstring.encode('utf-8'), 0)
    hitlist = []
    while nexthit >= 0:
        hitlist.append(nexthit)
        nexthit = filecontent.find(searchstring.encode(), nexthit + 1)

We use searchstring.encode('utf-8') because of Python 3 compatibility issues. Python 3 treats all strings as Unicode by default, where as we need to search in UTF8 (ie byte by byte). So we have to encode the searchstring as UTF8 before running the search.
Default Python 2 strings are represented as sequences of raw bytes so calling searchstring.encode('utf-8') in Python 2 has no real effect - we could have used Python 2 lines such as:
    nexthit = filecontent.find(searchstring, 0)
and
    nexthit = filecontent.find(searchstring, nexthit + 1)
This was the only major script change required for Python2 and Python 3 compatibility.

Step 4: Looping through each hit (See "unpack-tute.py" lines 50-88):
Now we have our hitlist of offsets to "ctx!" markers and we know how each contact record is structured, so we can iterate through the filecontent variable using a for loop and extract/print the data we need using the slice notation we discussed previously.

4a. We print out each hit offset in both decimal and hexadecimal.
    print("\nHit found at offset: " + str(hit) + " decimal = " + hex(hit) + " hex")

We use the str() function to convert the "hit" offset variable into a decimal string for printing and the hex() function to convert the hit offset variable into a hexadecimal string.

4b. The first field ("Index Number") after the "ctx!" marker will start 4 bytes after the hit offset. To calculate the offset, we can use code like:
    indexnum_offset = hit + 4 
As we have already read the entire file into filecontent, we can access the 2 byte "Index Number" field and interpret it as a Little Endian 2 byte integer as follows:
    indexnum = struct.unpack("<H", filecontent[indexnum_offset:(indexnum_offset+2)])[0]

We are using the struct module's "unpack" function on the given filecontent slice to interpret the slice as a LE 2 byte integer and store it in the "indexnum" variable.
The "<H" argument tells unpack how to interpret the raw bytes i.e. "<" for Little Endian, "H" for unsigned 2 byte integer.
The unpack function returns a tuple (kinda like a sequence of variables) so we specify the "[0]" at the end to retrieve the first converted value. It seems a bit weird until you find out that you can chain types together in the same unpack call. For example, "<HH" specifies 2 consecutive LE unsigned 2 byte integers. Unfortunately, we cannot use chaining here due to the variable length of Name/Phone strings in the contact record.
There's a bunch of other unpack types defined in the Python help documentation (search for "pack unpack").

We can now print out our interpreted "indexnum" value but we need to use the str() function to convert our Index Number integer into a printable string. We can use code such as:
    print("indexnum = " + str(indexnum))


We can re-use a similar pattern of code for the remaining fields in the record.
That is, we calculate the offset of field X, interpret those slice bytes and then print.
Because we know the record field sizes (or can read them e.g. via "Name Length" size byte), calculating the offsets becomes an exercise in adding field sizes to previous field offsets to get to the next offset address.

4c. So for the second field ("Name Length") we can use:
    namelength_offset = indexnum_offset + 2
    print("namelength_offset = " + str(namelength_offset))
    namelength = struct.unpack("B", filecontent[namelength_offset:(namelength_offset+1)])[0]
    print("namelength = " + str(namelength))

For the "Name Length" field (one byte long), we use a starting offset ("namelength_offset") which is 2 bytes past the "Index Number" offset ("Index Number" field is 2 bytes long).
We use unpack with the "B" argument as we are interpreting the 1 byte at filecontent[name_length_offset:(namelength_offset+1)] as an unsigned 1 byte integer and storing it in the "namelength" variable.

4d. For the third field ("Name String") we can use:
    namestring_offset = namelength_offset + 1
    namestring = filecontent[namestring_offset:(namestring_offset+namelength)].decode('utf-16-le')
    print("namestring = " + namestring)

After calculating the "Name String" field offset (should be one byte past the "Name Length" field), we can use the string.decode('utf-16-le') method to interpret the filecontent[namestring_offset:(namestring_offset+namelength)] slice as a UTF16LE string and store it in the "namestring" variable.

4e. For the fourth field ("Phone Length") we can use:
    phonelength_offset = namestring_offset + namelength
    phonelength = struct.unpack("B", filecontent[phonelength_offset:(phonelength_offset+1)])[0]
    print("phonelength = " + str(phonelength))

After calculating the "Phone Length" field offset (should be "Name Length" bytes past the "Name String" offset), we use unpack with the "B" argument as we are interpreting the 1 byte at filecontent[phonelength_offset:(phonelength_offset+1)] as an unsigned 1 byte integer and storing it in the "phonelength" variable.

4f. For the fifth field ("Phone String") we can use:
    phonestring_offset = phonelength_offset + 1
    print("phonestring_offset = " + str(phonestring_offset))
    phonestring = filecontent[phonestring_offset:(phonestring_offset+phonelength)].decode('utf-8')
    print("phonestring = " + phonestring)

After calculating the "Phone String" field offset (should be one byte past the "Phone Length" field), we can use the string.decode('utf-8') method to interpret the filecontent[phonestring_offset:(phonestring_offset+phonelength)] slice as a UTF8 string and store it in the "phonestring" variable.

4g. For the sixth and last field ("Unix Timestamp") we can use:
    timestamp_offset = phonestring_offset + phonelength
    print("timestamp_offset = " + str(timestamp_offset))
    timestamp = struct.unpack("<I", filecontent[timestamp_offset:(timestamp_offset+4)])[0]
    print("raw timestamp decimal value = " + str(timestamp))
    timestring = datetime.datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%dT%H:%M:%S")
    print("timestring = " + timestring)

We calculate the timestamp offset as being "Phone Length" bytes past the "Phone String" field and print the timestamp offset to help with debugging.
We use unpack with the "<I" argument to interpret the 4 byte filecontent[timestamp_offset:(timestamp_offset+4)] slice as a LE unsigned 4 byte integer and then store the integer value in the "timestamp" variable.
eg interprets 0x26CDDB56 LE as 0x56DBCD26 BE = 1457245478 decimal = number of seconds since 1JAN1970.
We then call the datetime.datetime.utcfromtimestamp() method to create a Python "datetime" object using the number of seconds since 1JAN1970. The returned datetime object has a "strftime" method we can call to obtain a human readable ISO format string. The "%Y-%m-%dT%H:%M:%S" argument to strftime() specifies that we want a datetime string formatted as Year-Month-DayTHour:Minute:Second.

Step 5: After we process all of the "ctx!" hits, we close the file (See "unpack-tute.py" line 89):
    fb.close()

For shiggles, we also print out the number of hits in the hitlist on line 91 before the script finishes.
    print("\nProcessed " + str(len(hitlist)) + " ctx! hits. Exiting ...\n")

Running the script

For Python v2.7.12:
In a Win7x64 command terminal window with "unpack-tute.py" and "testctx.bin" copied to "c:\":

C:\>c:\Python27\python.exe unpack-tute.py
Running unpack-tute.py v2017-08-19


Hit found at offset: 64 decimal = 0x40 hex
indexnum = 1
namelength_offset = 70
namelength = 12
namestring = George
phonelength = 7
phonestring_offset = 84
phonestring = 5551234
timestamp_offset = 91
raw timestamp decimal value = 1457245478
timestring = 2016-03-06T06:24:38

Hit found at offset: 512 decimal = 0x200 hex
indexnum = 65282
namelength_offset = 518
namelength = 18
namestring = King Kong
phonelength = 9
phonestring_offset = 538
phonestring = +15554321
timestamp_offset = 547
raw timestamp decimal value = 1457245695
timestring = 2016-03-06T06:28:15

Hit found at offset: 800 decimal = 0x320 hex
indexnum = 65535
namelength_offset = 806
namelength = 30
namestring = Magilla Gorilla
phonelength = 10
phonestring_offset = 838
phonestring = +445552468
timestamp_offset = 848
raw timestamp decimal value = 1457258495
timestring = 2016-03-06T10:01:35

Processed 3 ctx! hits. Exiting ...


C:\>

For Python 3.4.1:
 In a Win7x64 command terminal window with "unpack-tute.py" and "testctx.bin" copied to "c:\":

C:\>c:\Python34\python.exe unpack-tute.py
Running unpack-tute.py v2017-08-19


Hit found at offset: 64 decimal = 0x40 hex
indexnum = 1
namelength_offset = 70
namelength = 12
namestring = George
phonelength = 7
phonestring_offset = 84
phonestring = 5551234
timestamp_offset = 91
raw timestamp decimal value = 1457245478
timestring = 2016-03-06T06:24:38

Hit found at offset: 512 decimal = 0x200 hex
indexnum = 65282
namelength_offset = 518
namelength = 18
namestring = King Kong
phonelength = 9
phonestring_offset = 538
phonestring = +15554321
timestamp_offset = 547
raw timestamp decimal value = 1457245695
timestring = 2016-03-06T06:28:15

Hit found at offset: 800 decimal = 0x320 hex
indexnum = 65535
namelength_offset = 806
namelength = 30
namestring = Magilla Gorilla
phonelength = 10
phonestring_offset = 838
phonestring = +445552468
timestamp_offset = 848
raw timestamp decimal value = 1457258495
timestring = 2016-03-06T10:01:35

Processed 3 ctx! hits. Exiting ...


C:\>

We can see that all of the name and phone strings are complete / as shown in the Hex view picture.
We also verified that each "timestring" value corresponded to it's raw LE hex value using Dcode.

Final Thoughts

After you know the basics of a language, programming is a skill best sharpened by working on actual projects (not reading books or blog posts).
Google and StackOverflow are your friends when researching how to code common tasks in Python.
Which makes print statements your No-BS-tell-it-like-it-is best friend when debugging (e.g. print offset addresses and/or values to debug). A well placed print statement can be the easiest way of finding out that your fifth cola/coffee didn't do you any favours.

The code in this script is intended for use with files that can fit into memory (ie 0 MB to *maybe* hundreds of MB).
Larger files may require breaking up the file into chunks before reading/processing.

In writing this script, we used Notepad++ (v6.7.9.2) with the Language set to Python to get the funky syntax highlighting (eg comments in green, auto-indenting). The TAB size was set to 4 spaces via the Settings, Preferences, Tab Settings menu. We disabled "Word Wrap" (under View menu) and enabled line numbers (under Settings, Preferences, Editing menu) so if/when you get a runtime error, you can find the relevant line more readily.

If you are in the forensic community and found this post helpful or you're in the forensic community and had some questions/thoughts about the code, please leave a comment or send me an email (No, I will not do your homework/assignment! But if its for a new artifact for a case, monkey might be convinced ;).


Wednesday, 4 January 2017

Monkey Plays (LAN) Turtle

OMG! Sooo Turtle-y!

The Hak5 LAN Turtle recently plodded across our desk so we decided to poke it with a stick and see how effective it is in capturing Windows (7) credentials.
From the LAN Turtle wiki:
The LAN Turtle is a covert Systems Administration and Penetration Testing tool providing stealth remote access, network intelligence gathering, and man-in-the-middle monitoring capabilities.
Housed within a generic "USB Ethernet Adapter" case, the LAN Turtle’s covert appearance allows it to blend into many IT environments.
It costs about U$50 and looks like this:




It consists of a System-On-Chip running an openwrt (Linux) based OS. Amongst other things, it can act like a network bridge/router between:
- a USB Ethernet interface which you plug into your target PC. This interface can also be ssh'd into via its static IP address 172.16.84.1 (for initial configuration and copying off creds).
- a 10/100 Mbps Ethernet port which you can use to connect the Turtle to the Internet (providing remote shell access and allowing the install of modules/updates from LANTurtle.com). It is not required to capture creds during normal operation.

It also has 16 MB on board Flash memory and can be configured to run a bunch of different modules via a Module Manager.

By using the Turtle's USB Ethernet interface to create a new network connection and then sending the appropriate responses, the Turtle is able to capture a logged in user's Windows credentials. Apparently Windows will send credentials over a network whether the screen is locked or not (a user must be logged in).

We will be using the QuickCreds module written by Darren Kitchen which was based on the research of Rob "Mubix" Fuller.
To send the appropriate network responses, QuickCreds calls Laurent Gaffie's Responder Python script and saves credentials (eg NTLMv2 for our Win 7 test case) to numbered directories in /root/loot. The amber Ethernet LED will blink rapidly while QuickCreds is running. When finished capturing (~30 secs to a few minutes), the amber LED is supposed to remain lit.

But wait - there's more! The turtle can also offer remote shell/netcat/meterpreter access, DNS spoofing, man-in-the-middle browser attacks, nmap scans and so much more via various downloadable modules. Alas, we only have enough time/sanity/Turtle food to look at the QuickCreds module.

Setup

We will be both configuring and testing the Turtle on a single laptop running Windows 7 Pro x64 with SP1. Realistically, you would configure it on one PC and then plug it into a separate target PC.
 
We begin setup by plugging the Turtle into the configuration PC and using PuTTY to ssh as root to 172.16.84.1. For proper menu display, be sure to adjust the PuTTY Configuration's Windows, Translation, Remote character set to "Latin-1, Western Eur".

The default root Turtle password is sh3llz. Upon first login, the user is then prompted to change the root password.
Ensure an Internet providing Ethernet cable is plugged in to the Turtle's Ethernet port to provide access to LANTurtle.com updates.

Note: The Turtle may also require Windows to install the "Realtek USB FE Family Controller" Network Adapter driver before you can communicate with it.

Upon entering/confirming the new root password, you should see something like:

LAN Turtle Main Menu via PuTTY session


Under Modules, Module Manager, go to Configure, then Directory to select the QuickCreds module for download. You can select/check a module for download via the arrow/spacebar keys.

Return back to Modules, select the QuickCreds module, then Configure (this will take a few minutes to download/install/configure the dependencies from the Internet). Remember to have an Internet providing Ethernet cable plugged into the Turtle.

Select the QuickCreds Enable option so QuickCreds is launched whenever the Turtle is plugged into a USB port.
(Optional) You can also select the Start option to start the QuickCreds module now and it should collect your current Windows login creds.

We are now ready to remove the Turtle from our config PC and place it into a target PC's USB port.

If you're having issues getting the Turtle working, try to manually reset the Turtle following the "Manually Upgrading" wiki procedure at the bottom of this page.

There's also a Hak5 Turtle/QuickCreds demo and explanation video by Darren Kitchen and Shannon Morse thats well worth a view.

Capturing Creds

Insert the Turtle into the (locked) target PC and wait for the creds to be captured. Our Turtle's amber Ethernet light followed this pattern on insertion:
- ON/OFF
- OFF (10 secs)
- Blinking at 1 Hz (15 secs)
- OFF (1-2 secs)
- Rapid Blinking > 1 Hz (indefinitely or until we launch PuTTY when it remains ON)

From testing, once we see the rapid blinking, the creds have been captured.

If you have an Internet cable plugged in to the Turtle when capturing creds, you can also remote SSH into the Turtle to retrieve the captured creds.This is not in the scope of this post however.

For our testing, we will keep it simple and use PuTTY's scp to retrieve the stored creds (eg capture creds, retrieve Turtle, take Turtle back to base for creds retrieval):
We remove the Turtle from the target PC and re-insert it into our config PC. For our testing on a single laptop this meant - we removed the Turtle, unlocked the laptop and then re-inserted the Turtle.
Note: Due to the auto enable, the Turtle will also capture the config PC's creds upon insertion.

Now PuTTY in to the Turtle, then choose Exit to get to the Turtle command prompt/shell (shell ... Get it? hyuk, hyuk).

To find the latest saved creds we can type something like:

ls -alt /root/loot

which shows us the latest creds (corresponding to our current config PC) is stored under /root/loot/12/

root@turtle:~# ls -alt /root/loot/
-rw-r--r--    1 root     root           319 Jan  2 11:14 responder.log
drwxr-xr-x    2 root     root             0 Jan  2 11:13 12
drwxr-xr-x   14 root     root             0 Jan  2 11:11 .
drwxr-xr-x    2 root     root             0 Jan  2 11:01 11
drwxr-xr-x    2 root     root             0 Jan  2 11:00 10
drwxr-xr-x    2 root     root             0 Jan  2 10:46 9
drwxr-xr-x    2 root     root             0 Jan  2 08:58 8
drwxr-xr-x    2 root     root             0 Jan  2 08:49 7
drwxr-xr-x    2 root     root             0 Jan  2 08:46 6
drwxr-xr-x    2 root     root             0 Jan  2 08:35 5
drwxr-xr-x    2 root     root             0 Jan  2 08:34 4
drwxr-xr-x    2 root     root             0 Jan  2 08:26 3
drwxr-xr-x    2 root     root             0 Jan  2 08:21 2
drwxr-xr-x    2 root     root             0 Jan  2 08:20 1
drwxr-xr-x    1 root     root             0 Jan  2 08:20 ..
root@turtle:~#

So looking further at /root/loot/11/ (ie the creds from when we plugged the Turtle into the locked laptop) shows us a few log files and a text file containing our captured creds (ie HTTP-NTLMv2-172.16.84.182.txt).

root@turtle:~# ls /root/loot/11/
Analyzer-Session.log           Poisoners-Session.log
Config-Responder.log           Responder-Session.log
HTTP-NTLMv2-172.16.84.182.txt
root@turtle:~#


Our creds should be stored in HTTP-NTLMv2-172.16.84.182.txt and we can use the following command to check that the file contents look OK:

more /root/loot/HTTP-NTLMv2-172.16.84.182.txt

which should return something like:

admin::N46iSNekpT:08ca45b7d7ea58ee:88dcbe4446168966a153a0064958dac6:5c7830315c7830310000000000000b45c67103d07d7b95acd12ffa11230e0000000052920b85f78d013c31cdb3b92f5d765c783030

Where admin is the login name and the second field (eg N46iSNekpT) corresponds to the domain.
Note: This is an NTLMv2 example sourced from hashcat.

Once we have found the appropriate file containing the creds we want, we can use PuTTY pscp.exe to copy the files from the Turtle to our config PC.
From our Windows config PC we can use something like:
pscp root@172.16.84.1:/root/loot/11/HTTP-NTLMv2-172.16.84.182.txt .

to copy out the creds file. Note the final . to copy the creds file into the current directory on the config PC.

We can then feed this (file or individual entries) into hashcat to crack the user password. This is an exercise left for the reader.

Turtle Artifacts?

Now that we have our creds, lets see if we can find any fresh Turtle scat er, artifacts.

Starting with the Turtle plugged in to an unlocked PC, we look under the Windows Device Manager and find the Network adapter driver for the Turtle - ie the "Realtek USB FE Family Controller"

Turtle Network Adapter Driver Properties


The Details Tab from the Properties screen yields a "Device Instance Path" of:
USB\VID_0BDA&PID_8152\00E04C36150A

Similarly, the "Hardware Ids" listed were "USB\VID_0BDA&PID_8152" and "USB\VID_0BDA&PID_8152&REV_2000".

The HardwareId string ("VID_0BDA&PID_8152") implies that the driver was communicating with a Realtek 8152 USB Ethernet controller. Note: 0BDA is the vendor id for Realtek Semiconductor (see https://usb-ids.gowdy.us/read/UD/0bda) and the Turtle Wiki specs confirm the Turtle uses a "USB Ethernet Port - Realtek RTL8152".

We then used FTK Imager (v3.4.2.2) to grab the Registry hives so we can check them for artifacts.

Searching the SYSTEM hive for part of the "Device Instance Path" string (ie "VID_0BDA&PID_8152") yields an entry in SYSTEM\ControlSet001\Enum\USB\VID_0BDA&PID_8152

Potential First Turtle Insertion Time

The Last Written Time appears to match the first time the Turtle was inserted into the PC (21DEC2016 @ 21:15:54 UTC).

Another hit occurs in SYSTEM\ControlSet001\Enum\USB\VID_0BDA&PID_8152\00E04C36150A

Potential Most Recent Turtle Insertion Time


The Last Written Time appears to match the most recent time the Turtle was inserted (2JAN2017 @ 11:45:01 UTC).


The Turtle's 172.16.84.1 address appears in the Windows SYSTEM Registry hive as a "DhcpServer" value under SYSTEM\ControlSet001\services\Tcpip\Parameters\Interfaces\{59C1F0C4-66A7-42C8-B25E-6007F3C40925}.

Turtle's DHCP Address and Timestamp

Additionally under that same key, we can see a "LeaseObtainedTime" value which appears to be in seconds since Unix epoch (1JAN1970).
Using DCode to translate gives us:

Turtle DHCP LeaseObtainedTime Conversion


ie 2 JAN 2017 @ 11:24:37
This time occurs between the first time the Turtle was inserted (21DEC2016) and the most recent time the Turtle was inserted (2JAN2017 @ 11:45:01). This is plausible as the Turtle was plugged in multiple times during testing on the 2 JAN 2017. It is estimated that the Turtle was first plugged in on 2 JAN 2017 around the same time as the "LeaseObtainedTime". 

These timestamps potentially enable us to give a timeframe for Turtle use. We say potentially because it is possible that another device using the "Realtek USB FE Family Controller" driver may have also been used. However, the specific IP address (172.16.84.1) can help us point the flipper at a rogue Turtle.

The "Realtek USB FE Family Controller" string also appears in the "Description" value under the SOFTWARE hive:
SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkCards\17

NetworkCards entry potentially pointing to Turtle

Note: The NetworkCards number entry will vary (probably will not be 17 in all cases)

There are probably more artifacts to be found but these Registry entries were the ones that were the most obvious to find. The Windows Event logs did not seem to log anything Turtle-y definitive.

However, based on the artifacts above, we can only say that a Turtle was probably plugged in. We don't have enough (yet?) to state which modules (if any) were run.

Final Thoughts

Anecdotally from the Hak5 Turtle Forums, capturing Windows credentials with the LAN Turtle seems to be hit and miss.
From our testing, the Turtle QuickCreds module worked for a Win7 laptop but failed to capture creds for a Win10 VM running on the same laptop. Once the Turtle was plugged in to the laptop, it captured the creds for the host Win7 OS but upon connecting the Turtle to the Win10 VM via the "Removable Devices" VMware 12 Player menu, the amber LED remains solidly lit and the Win10 creds were not captured.
Interestingly, not all of the Win7 Registry artifacts listed previously were observed in the Win10 VM's Registry:
Both SYSTEM\ControlSet001\Enum\USB\VID_0BDA&PID_8152 and SYSTEM\ControlSet001\Enum\USB\VID_0BDA&PID_8152\00E04C36150A were present in the Win10 SYSTEM registry.
However, no hits were observed for "172.16.84.1" in SYSTEM.
There were various hits for "Realtek USB FE Family Controller" in SYSTEM.
The "Realtek USB FE Family Controller" string also appears in the "Description" value under the Win10 SOFTWARE hive:
SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkCards\5
The lack of Win10 Registry DHCP artifacts probably indicates that while the Realtek USB Ethernet driver was installed, the Turtle was unable to assign the 172.16.84.1 IP address within the WIN10 VM (possibly because the Win7 still has it reserved?).

Fortunately, Jackk has recorded a helpful YouTube video demonstrating the LAN Turtle running QuickCreds successfully against a Win10 laptop (not VM). So it is possible on Win10 ... Jackk also shows how to use the Turtle's sshfs module to copy off the cred files via a FileZilla client (instead of using pscp).

Any comments/suggestions are turtle-y welcome in the comments section below.

Saturday, 13 August 2016

Google S2 Mapping Scripts

Sorry Monkey - there is just no point to mapping jokes ...

Cindy Murphy's recent forensic forays into Pokemon Go (here and here)
 have inspired further monkey research into the Google S2 Mapping library. The S2 library is also used by Uber, Foursquare and Google (presumably) for mapping location data. So it would probably be useful to recognise and/or translate any S2 encoded location artifacts we might come across in our forensic travels eh? *repeats in whispered voice* travels ...
After a brief introduction to how S2 represents lat/long data, we will demonstrate a couple of multi-platform (Windows/Linux) Python conversion scripts for decoding/encoding S2 locations (using sidewalklabs' s2sphere library).

S2 Mapping (In Theory)

The main resources for this section were:
The TLDR - its possible to convert a latitude/longitude from a sphere into a 64 bit integer. The resultant 64 bit integer is known as a cellid.
But how do we calculate this 64 bit cellid? This is achieved by projecting our spherical point (lat, long) onto one of 6 faces of an enclosing cube and then using a Hilbert curve function to specify a grid location for a specified cell size. Points along the Hilbert curve that are close to each other in value, are also spatially close to one another. This diagram from Christian's blog post better illustrates the point:

Points close to each other on the Hilbert curve have similar values. Source: Christian Perone's Blog

In the above diagram, the Hilbert curve (darker gray line) goes from the bottom LHS to the bottom RHS (or vice versa). Each grid box contains one of those Y-shaped patterns. The scale at the bottom represents the curve if it was straightened out like a piece of string.
If you now imagine the grid becoming finer/smaller but still requiring one Y-shape per box, you can see how a smaller cell grid size requires a finer resolution for points on the curve. So the smaller the cell/grid size, the higher the number of bits required to store the position. For example, a level 2 cell size only needs 4 bits where as the maximum level 30 requires 60 bits. Level 30 cell sizes are approximately 0.48-0.93 square centimetres in size depending on the lat/long.

Fun fact: Uber apparently uses a level 12 cell size (approx. 3.3 to 6.4 square kilometres per cell). 
Second fun fact: The Metric system has been around for over 100 years so stop whining about all the metric measurements already *looks over sternly at the last of the Imperial Guards in the United States, Myanmar and Liberia*.

Ahem ... so here's what a level 30 cellid looks like:

Level 30 cellid structure. Source: Octavian Procopiuc's GoogleDoc

The first 3 bits represent which face of the enclosing cube to use and the remaining 60 bits are used to store the position on the Hilbert curve. Note: The last bit is set to 1 to mark the end of the Hilbert positioning bits.

When cellids are converted into hexadecimal and have their least significant zeroes removed (if present), they are in their shortened "token" form.
eg1 cellid = 10743750136202470315 (decimal) has a token id = 0x951977D377E723AB
eg2 cellid = 9801614157982728192 (decimal) = 0x8806542540000000. The 16 hex digits can then be shortened to a token value of "880654254". To convert back to the original hex number, we keep adding least significant zeroes to "880654254" until its 16 digits long (ie 64 bits).
Analysts should anticipate seeing either cellids or token ids. These might be in plaintext (eg JSON) or may be in an SQLite database.

Note: Windows Calculator sucks at handling large unsigned 64 bit numbers. According to this, its limited between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. So a number like 10,743,750,136,202,470,315 made it return an incorrect hex representation after conversion.
This monkey spun his paws for a while trying to figure out why the token conversions didn't seem to make sense. The FFs Solution - use the Ubuntu Calculator for hex conversions of 64 bit integers instead.

The Scripts

Two Python 2.7+ scripts were written to handle S2 conversions and are available from my GitHub here. They have been tested on both Windows 7 running Python 2.7.12 and Ubuntu x64 14.04 running Python 2.7.6.

s2-latlong2cellid.py converts lat, long and cellid level to a 64 bit Google S2 cellid.

s2-cellid2latlong.py converts a 64 bit Google S2 cellid to a lat, long and S2 cellid level.

IMPORTANT: These scripts rely on the third-party s2sphere Python library. Users can install it via:
pip install s2sphere
(on Windows) and:
sudo pip install s2sphere
(on Ubuntu)

Here's the help text for s2-latlong2cellid.py:
python s2-latlong2cellid.py -h
Running s2-latlong2cellid.py v2016-08-12

usage: s2-latlong2cellid.py [-h] llat llong level

Converts lat, long and cellid level to a 64 bit Google S2 cellid

positional arguments:
  llat        Latitude in decimal degrees
  llong       Latitude in decimal degrees
  level       S2 cell level

optional arguments:
  -h, --help  show this help message and exit

Here's the help text for s2-cellid2latlong.py:
python s2-cellid2latlong.py -h
Running s2-cellid2latlong.py v2016-08-12

usage: s2-cellid2latlong.py [-h] cellid

Convert a 64 bit Google S2 cellid to a lat, long and S2 cellid level

positional arguments:
  cellid      Google S2 cellid

optional arguments:
  -h, --help  show this help message and exit


Testing

A handy online S2map visualizer was written by David Blackman - the Lead Geo Engineer at Foursquare (and formerly of Google ie he knows his maps). See also the Readme here. S2map also has several other mapping dropbox options besides the default "OSM Light". These include "Mapbox Satellite" which projects the cellid onto an aerial view.

We start our tests by going to GoogleMaps and noting the lat, long of an intersection in Las Vegas.

Pick a spot! Any spot!

We then specify that lat, long as input into the s2-latlong2cellid.py script (with level set to 24):
python s2-latlong2cellid.py 36.114574 -115.180628 24
Running s2-latlong2cellid.py v2016-08-12

S2 cellid = 9279882692622716928

We then put that cellid into s2map.com:

Level 24 test cellid plotted on s2map.com.


Note: The red arrow was added by monkey to better show the plotted cellid (its tiny).
So we can see that our s2-latlong2cellid.py gets us pretty close to where we originally specified on GoogleMaps.

What happens if we keep the same lat, long coordinates but decrease the level of the cellid from 24 to 12?
python s2-latlong2cellid.py 36.114574 -115.180628 12
Running s2-latlong2cellid.py v2016-08-12

S2 cellid = 9279882742634381312

Obviously this is a different cellid because its set at a different level, but just how far away is the plotted level 12 cellid now?

Level 12 test cellid plotted on s2map.com.

Whoa! The cell accuracy has just decreased a bunch. It appears the center of this cellid is completely different to the position we originally set in GoogleMaps. It is now centred on the Bellagio instead of the intersection. This is presumably because the cell size is now larger and the center point of the cell has moved accordingly.

To confirm these findings, we take our level 24 cellid 9279882692622716928 and use it with s2-cellid2latlong.py.

python s2-cellid2latlong.py 9279882692622716928
Running s2-cellid2latlong.py v2016-08-12

S2 Level = 24
36.114574473973924 , -115.18062802526205

We then plot those coordinates on GoogleMaps ...

Level 24 test cellid 9279882692622716928 plotted on GoogleMaps via s2-cellid2latlong.py


ie Our s2-cellid2latlong.py script seems to work OK for level 24.

Here's what it looks like when we use the level 12 cellid 9279882742634381312:
python s2-cellid2latlong.py 9279882742634381312
Running s2-cellid2latlong.py v2016-08-12

S2 Level = 12
36.11195989469266 , -115.17705862118852

Level 12 test cellid 9279882742634381312 plotted on GoogleMaps via s2-cellid2latlong.py


This seems to confirm the results from s2map.com. For the same lat, long, changing the cellid level can significantly affect the returned (centre) lat, long.

We also tested our scripts against s2map.com with a handful of other cellids and lat/long/levels and they seemed consistent. Obviously time contraints will not let us test every possible point.

Final Thoughts

Using the s2sphere library, we were able to create a Python script to convert a lat, long and level to an S2 cellid (s2-latlong2cellid.py). We also created another script to convert a S2 cellid to a lat, long and level (s2-cellid2latlong.py).
The higher the cellid level, the more accurate the location. You can find the cellid level by using the s2-cellid2latlong.py script.
Plotting a cellid with s2map.com is the easiest way of visualizing the cellid boundary on a map. Higher levels (>24) become effectively invisible however.

To locate potential S2 cellids we can use search terms like "cellid" or variations such as "cellid=". If its stored in plaintext (eg JSON), those search terms should find it. No such luck if its encrypted or stored as a binary integer though.

While there are other S2 Python libraries, this Monkey decided to use sidewalklabs s2sphere library based on its available documentation and pain-free cross platform support (pip install is supported).

Other Google S2 Python libraries include:
https://github.com/micolous/s2-geometry-library
(As demonstrated in Christian's blog and also used in the Gillware Pokemon script. This seems to be Linux only)
and
https://github.com/qedus/sphere
(Has the comment: "Needs to be packaged properly for use with PIP")

Some other interesting background stuff ...
Interview article with David Blackman (Foursquare)

Matt Ranney's (Chief Systems Architect at Uber) video presentation on "Scaling Uber's Real-time Market Platform" (see 18:15 mark for S2 content)

Uber uses drivers phone as a backup data source (claims its encrypted)


In the end, creating the Python conversion scripts was surprisingly straight forward and only required a few lines of code.
It will be interesting to see how many apps leave Google S2 cellid artifacts behind (hopefully ones with a high cellid level). Hopefully these scripts will prove useful when looking for location artifacts. eg when an analyst finds a cellid/token and wants to map it to a lat/long or when an analyst wants to calculate a cellid/token for a specific lat, long, level.