Sunday 15 February 2015

Reversing Monkey

Reversing may also drive you bananas ...

When trying to recover/carve deleted data, some reverse engineering of the file format may be required. Without knowing how the data is stored, we cannot recover the data of interest - be it timestamps, messages, images, video or another type of data. This quick blog post is intended to give some basic tips that have been observed during monkey's latest travels into reverse engineering of file formats. It was done partly as a memory aid/thinking exercise but hopefully other monkeys will find it useful. This post assumes there's no obfuscation/encryption applied to the file and it does not cover reverse engineering malware exes (which is another kettle of bananas).  

Collect as much Background Information as possible

If you find yourself trying to reverse a file format, you probably have an idea of what type of target data it contains (eg text, image, picture). Familarising yourself with how your target file is organised at a conceptual level will help connect what your seeing at the hex level. The more you can find out about the file (eg encoding methods, typical file sizes, indexing arangements), the more "pointers" you will have.
Here's a handy reference for high level overviews of various common file formats:
http://www.digitalpreservation.gov/formats/fdd/descriptions.shtml

How much do you need / Scope

You may not have to reverse the whole file if you are only after a subset of information (eg just the message field). Knowing what type of encoding was used beforehand will help. For example, for a message field, you can perform an ASCII string search for your target string.

Hex Editor

Reversing a file will require wading into the hex so make sure you're comfortable with your chosen Hex editor. Something that shows offsets in both hex/decimal (BE/LE) and can also interpret byte/word/multi-word values will help when the file has embedded offset addresses. Some people can think in hex exclusively, I am not one of them (yet?) - so why not let the computer do the conversions?
WinHex and HexWorkshop are a couple of editors that I have used in the past. Other people have mentioned using the 010 binary editor. Forensic guru and Faux-Photoshopper extroadinaire Brian Moran swears by HexWorkshop (he may also swear about other things but that's for another conversation). Some nice features about HexWorkshop are that it can detect binary differences between files as well as allow you to define your own templates so different fields can be color coded. It also allows for some statistical analysis - it will show you how many times a given byte value occurs in your sample set which is great for finding those repeated pesky 1 byte field markers (or suspected TrueCrypt containers).

Patience / experimentation

Be prepared to spend lots of time on it. Reversing doesn't appear to be a "one process fits all, I'll have it done in X hours" kinda thing (especially when you're starting out). Always work from a sample copy of the data so (if you have to) you can modify your working copy to confirm/deny your crazy theories (eg I think this is a timestamp field ... lets change it and see what is read/displayed back). Just don't forget that you've modified the file!

Sample data

The more varied, the merrier. Being able to compare multiple sets of data can help you confirm your suspicions about a particular set of bytes. For example, is it really a static field or is it a timestamp?
Alternatively, which fields stay the same and which fields change between sets of data? This is where having a hex editor capable of showing the differences between files can help.

Endian-ness

Is the file written/used on a Big Endian (BE eg 0x12 0x34) or Little Endian (LE eg 0x34 0x12) system. If its running on Intel hardware, then it's probably Little Endian.

Signatures

The file signature is the "magic number"/series of hex values which lets the reading software know it's "their kind of file". Gary Kessler keeps a handy index of file signatures here. Chances are, if it's a known file container format it will be in that listing.
Notice how I said container? With video files especially, there are various container formats (eg AVI, MP4) but these can contain encoded data (eg MJPG, H.264) which have their own rules/format.
Most files will have multiple bytes dedicated for the file signature. However, internal field markers may only use one or two bytes which will result in a lot of false hits when searching for those field markers amongst random looking data.

Byte boundaries

Are the files grouping data at the bit, byte, word etc level?
Knowing if your fields are grouped along particular size boundaries means that you can minimize wild goose/geese chases. For example, once you know that integers are written as 4 byte LE, it can make it easier to keep track of what is padding and what is data.

Padding / file slack

A bunch of zeros (or xFFs) can be a potential indicator that some padding has taken place so the data can fit into a certain (even/odd) number of bytes. If the file was written on the fly, it probably reserved more space than it needed for future use. If the file was not "closed" properly, you might then see these reserved/pad bytes with no easily discernible end of file marker.

Regular sized blocks of data or variable?

Detecting fixed sized blocks of data will be aided by comparing multiple data sets.
For variable sized data blocks, the length will probably be declared *somewhere* before the data so the reading software knows how much to read.
Alternatively, there may some sort of begin/end of data marker. For example, "0xFF 0xD8" marks the beginning of JPEG data and "0xFF xD9" marks the end. You are more likely to get trailers when the data size is not declared/known beforehand.

Strings

Are they Unicode (eg 2 bytes per character like UTF-16BE/LE) or ASCII (1 byte per character) encoded? Are they null terminated? If they are not null terminated, expect to see a size of string type field either directly before or *somewhere* before the actual string - again, the reading program needs to know how much to read before it calls for the read.

Timestamps

These are likely to exist in most file formats. Note: We're talking about internal timestamps here not filesystem ones. For carving, being able to ascertain a file's time period will help narrow down the search (assuming you know the relevant time period).
Becoming familar with the multitude of timestamp formats will help - Paul Sanderson's blog post on timestamps is a great starting point . From my travels, 4 byte integers listing the number of seconds since a given point (eg since 1JAN1970) are pretty common for anything non-Windows based (eg Android, iPhone devices). So searching your file for the 3 most significant bytes of a desired date range might lead you to some timestamps within your file. Digital Detective's DCode is a great free tool for calculating selected potential (Windows, *nix, Mac) timestamp values.

Offsets

I am using this term to refer to the internal addressing mechanisms used to point the reading software to a certain point/byte in the file.
These can be:
- Relative to a certain point (eg go forward 100 bytes from this byte) or
- Absolute (eg 215 bytes from the start of the file).

Complicating matters are nested collections of offsets - so you might have a table of offsets referrring to more tables of offsets etc. Eventually, you should be able to follow the trail to find the relevant/target data. Hopefully, your eyesight and sanity are both still intact ;)
Matthew Ekenstedt has offered some great tips regarding offsets on his website. To paraphrase, he reckons the larger hex values you see are potential byte offsets relative to the beginning of the file. Smaller hex values could be relative offsets from a particular point (eg field headers). The smallest hex values (eg 1-2 bytes) will probably correspond to lengths of data fields.
So how big is too big for an offset? Knowing your file size will help you decide if a potential offset is realistic or not. For example, you're not likely to find a greater than 4 byte offset for a file (4 bytes = xFFFFFFFF = 4 GB).

Indexes (for want of a better description)

Some files (eg video) may not append a table of offsets until it is actually exported (eg user explicitly saves video). So when carving for un-exported (ie user has not chosen to save but the file was still written), this may result in finding files which do not have their final indexes recorded. Boo!
Don't let this deter you from trying to play the unexported file back though - if the recording software can read it, there must be sufficient indexing available to retrieve data. Which leads us to our last point ...

Windows file formats

In some cases, the file might come bundled in a Windows exe for playback (eg exported video) or it might use a Windows exe to read it. Because of this, we can use Sysinternals Process Monitor to show us how the file is being read (eg the order of file offsets as the file is read and the associated length of the reads). Note: Process Monitor outputs the file offsets in decimal so you'll have to convert it into hex before searching your file for those offsets/read lengths. Knowing how a file is being read can lead us to how the data is indexed/stored (eg an offset table refers to another offset table which contains the actual start offsets for certain data runs).

Final Words

Hopefully these tips were helpful. If you have any other tips that you'd like to share, please leave a comment below :)
And now that you know what I know about reverse engineering file formats, there really isn't anything else that I can suggest - so please don't ask me to reverse your funky file format :).
And to finish things off, here's an interesting paper which shows the value of all this hex diving - "Forensic analysis of video file formats"  by Thomas Gloe et al. 2014. Specifically, it shows how looking at the arrangement of image/video data fields can show if an image/video has been edited by software.

Good Luck and Happy Reversing!