Wednesday 17 July 2013

G is 4 cookie! (nomnomnom)

What is it?

A Linux/Unix based Perl script for parsing cached Google Analytic requests. Coded/tested on SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script (gis4cookie.pl) can be downloaded from:
http://code.google.com/p/cheeky4n6monkey/downloads/list

The script name is pronounced "G is for cookie". The name was inspired by this ...




Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie type information such as the visiting times, page title, website hostname, referral page, any search terms used to find website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari Degrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "google-analytics.com/__utm.gif?" string and store any hit file offsets.
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytic arguments which are printed either to the command line or to a user specified Tab Separated Variable file.

The following Google Analytic arguments are currently parsed/printed:
utma_first_time
utma_prev_time
utma_last_time
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_last_time
utmz_sessions
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.

To Use It

You can type something like:
./gis4cookie -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "google-analytics.com/__utm.gif?".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable.
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmz's utmr variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in "google-analytics.com/__utm.gif?"). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)