Sunday, 27 February 2022

Monkey Attempts To Digest Some Google Takeout (DetectedActivitys)

 

Careful What You Eat, Monkey!

One of Monkey's co-workers (Troy) was able to provide investigators with a location of interest by looking at the device owner's Google Takeout "Location History.json".

Specifically, Troy looked at the DetectedActivity classification strings which Google (Play Services) uses to estimate whether the device was still, in a vehicle, on foot etc. There was a transition from IN_VEHICLE to STILL which indicated the vehicle had stopped at a location (and not just passed through).

Being part of Google Play services means this DetectedActivity data is likely only available for Google Accounts involving Android devices. We have yet to see any DetectedActivity in iOS device related Google Takeouts.


Special Thanks to: 

- Troy for sharing his findings 

- Rasmus Riis Kristensen, Mike Lacombe, Heather Mahalik and Lee Crognale for checking their Google Takeout data and script testing.

- Josh Hickman for providing more test Android Takeout data and additional feedback regarding the new Takeout format & velocity / heading / altitude fields.


The main DetectedActivity categories are defined as:

IN_VEHICLE The device is in a vehicle, such as a car.

ON_BICYCLE The device is on a bicycle.

ON_FOOT The device is on a user who is walking or running.

RUNNING The device is on a user who is running.

STILL The device is still (not moving).

TILTING The device angle relative to gravity changed significantly.

UNKNOWN Unable to detect the current activity.

WALKING The device is on a user who is walking.

For further details see here.

Note: We have also observed undocumented activity types such as IN_ROAD_VEHICLE, IN_FOUR_WHEELER_VEHICLE, IN_CAR, IN_RAIL_VEHICLE

While some forensic tools already display Google Takeout information, highlighting/filtering by DetectedActivity was not easily done/not possible with those tools.

For Troy's Takeout (and in Josh Hickman's Android 12 Google Takeout), there's a "Location History" folder and in the root of that folder is the "Location History.json" file.
There's also some sub folders but they are subject to account owner editing. Apparently the "Location History.json" is raw and unaffected by user deleted locations/trips.
This blog post by Ross Donnelly has some more details.

Initially, Monkey wrote a protoype Python3 script for the "Location History.json" but then noticed that recent test Takeouts (done in Jan-Feb 2022) no longer had this JSON file.

UPDATE 28FEB2022: The prototype "gLocationHistoryActivity.py" script for "Location History.json" is now available from GitHub here. Note: This has been tested with small 15-30 MB files only and does not use the "ijson" library.

Instead, Google seems to have replaced it with a file called "Records.json". The newer file is also JSON formatted but has some slight differences eg timestamp format, extra fields.
Monkey could not find any documentation on this file format.
Checking in with other DF folks confirmed their Takeouts also had "Records.json" instead of "Location History.json".

It is currently unknown if this change was implemented at the Google server end or whether the Android/Google Play version on the device affects which .json file gets created.

UPDATE 28FEB2022: Josh Hickman performed another Takeout on his test Android 12 data and got a "Records.json" instead of a "Location History.json".
So it appears the change was implemented at the Google server end.

So ass-uming any Takeouts will now export to "Records.json", Monkey wrote a script to process the "Records.json" file which can be described as follows: 

  • 1 locations list which can store many location element records.
  • Each location element has the following fields:

    •     "source" (usually "UNKNOWN" but have also seen "CELL" here)
    •     "deviceTag" (device identifier)
    •     "platformType" (usually "ANDROID")
    •     "formFactor" (eg "PHONE")
    •     "serverTimestamp" (eg "2022-02-04T04:40:17.685Z", not always present, can be same for multiple location elements)
    •     "deviceTimestamp" (eg "2022-02-04T04:40:16.214Z", not always present, can be same for multiple location elements)
    •     "timestamp" (eg "2022-02-02T00:55:06.311Z", changes with each element record, to avoid confusion Monkey is calling this the "element timestamp")
    •     "latitudeE7" (in degrees scaled by 10 000 000)
    •     "longitudeE7" (in degrees scaled by 10 000 000)
    •     "altitude" (not always present, units are in metres per Josh Hickman)
    •     "heading" (not always present, units are degrees clockwise from True North per Josh Hickman eg 90 = East, 180 = South)
    •     "velocity" (not always present, units are metres per second per Josh Hickman)
    •     "accuracy" (units unknown, suspected to be in m)
    •     "verticalAccuracy" (not always present, units unknown, suspected to be in m)
  • Each location element may/may not have an "activity" list.
  • Each activity list has an activity "timestamp" and can store 1 or more "subactivitys". 
    • Note: "subactivity" is a monkey term used to label any child activitys which are listed in the activity list (see example below).
  • Each "subactivity" can have multiple type and confidence (percentage) pairs (eg type = ON_FOOT, confidence = 3 and type = STILL, confidence = 89).


Here's an example of a parent activity declaration in a "Records.json":

"activity": [{
      "activity": [{
        "type": "STILL",
        "confidence": 89
      }, {
        "type": "ON_FOOT",
        "confidence": 3
      }, {
        "type": "WALKING",
        "confidence": 3
      }, {
        "type": "UNKNOWN",
        "confidence": 3
      }, {
        "type": "IN_VEHICLE",
        "confidence": 2
      }, {
        "type": "ON_BICYCLE",
        "confidence": 2
      }, {
        "type": "IN_RAIL_VEHICLE",
        "confidence": 2
      }, {
        "type": "IN_ROAD_VEHICLE",
        "confidence": 1
      }, {
        "type": "IN_FOUR_WHEELER_VEHICLE",
        "confidence": 1
      }, {
        "type": "IN_CAR",
        "confidence": 1
      }],
      "timestamp": "2022-02-04T05:52:43.071Z"
    }]

In the above example, we can see 1 parent activity list [highlighted in red] which has an activity timestamp ("2022-02-04T05:52:43.071Z" in yellow) and 1 subactivity child [in blue]. 
The subactivity has multiple type/confidence fields [in green] which classify what activity Google thinks the device is doing (eg 89% STILL).
We have observed that more than 1 subactivity can be listed per parent activity.

Scripting

Initially, Monkey used the standard Python json library to load the whole "Records.json" into memory. While this worked for .json files in the MB size range, this caused a memory error when a large 1.3GB dataset (~ 10 years of data) was processed. 
After Rasmus helpfully pointed Monkey to the following articles:

Monkey used the third party "ijson" library  to successfully iteratively load the 1.3 GB  "Records.json" file.

The revised script (gRecordsActivity_ijson_date.py) searches through all location elements for elements with an "activity". It then processes/stores each DetectedActivity per element timestamp into a Python dictionary. There can be more than one activity per element timestamp.
The DetectedActivity data for each element timestamp date is then output to a Tab Separated Variable (TSV) file and a Keyhole Markup Language (KML) file for analysis.

Here is how to use the script (gRecordsActivity_ijson_date.py) which is available from GitHub here.

First step - install the ijson library via pip. On Ubuntu 20.04 LTS, you can use this command:
pip3 install ijson

You can now run the script by pointing it to the Takeout Records.json and an output directory.

Here is the usage help:
python3 gRecordsActivity_ijson_date.py -h
usage:  gRecordsActivity_ijson_date.py [-i input_file -o output_dir -a start_isodate -b end_isodate]

Extracts/parses "Detected Activity" data from Google Takeout "Records.json" (large files) and outputs TSV and KML files to given output dir

optional arguments:
  -h, --help  show this help message and exit
  -i INPUT    Input Records filename
  -o OUTPUT   Output KML/TSV directory
  -a START    Filter FROM (inclusive) Start ISO date (YYYY-MM-DD)
  -b END      Filter BEFORE (inclusive) End ISO date (YYYY-MM-DD)


Here is an abbreviated example command output for some test data:

python3 gRecordsActivity_ijson_date.py -i Records-4.json -o records4_output
Running gRecordsActivity_ijson_date.py v2022-02-26


ACTIVITY => 
[{'activity': [{'type': 'STILL', 'confidence': 99}, {'type': 'UNKNOWN', 'confidence': 1}], 'timestamp': '2022-02-04T03:49:57.246Z'}]
Element timestamp: 2022-02-04T03:49:55.263Z
Element serverTimestamp: 2022-02-04T04:22:27.214Z
Element deviceTimestamp: 2022-02-04T04:22:25.715Z
No. (sub)Activitys => 1
(sub)Activity #1 timestamp = 2022-02-04T03:49:57.246Z
No. (sub)Activity types: 2

ACTIVITY => 
[{'activity': [{'type': 'STILL', 'confidence': 99}, {'type': 'UNKNOWN', 'confidence': 1}], 'timestamp': '2022-02-04T03:51:01.973Z'}]
Element timestamp: 2022-02-04T03:50:48.614Z
Element serverTimestamp: 2022-02-04T04:22:27.214Z
Element deviceTimestamp: 2022-02-04T04:22:25.715Z
No. (sub)Activitys => 1
(sub)Activity #1 timestamp = 2022-02-04T03:51:01.973Z
No. (sub)Activity types: 2

[Output Redacted for brevity ...]

ACTIVITY => 
[{'activity': [{'type': 'TILTING', 'confidence': 100}], 'timestamp': '2022-02-04T04:24:24.209Z'}, {'activity': [{'type': 'STILL', 'confidence': 99}, {'type': 'UNKNOWN', 'confidence': 1}], 'timestamp': '2022-02-04T04:26:35.525Z'}, {'activity': [{'type': 'UNKNOWN', 'confidence': 40}, {'type': 'IN_VEHICLE', 'confidence': 10}, {'type': 'ON_BICYCLE', 'confidence': 10}, {'type': 'ON_FOOT', 'confidence': 10}, {'type': 'WALKING', 'confidence': 10}, {'type': 'RUNNING', 'confidence': 10}, {'type': 'STILL', 'confidence': 10}, {'type': 'IN_ROAD_VEHICLE', 'confidence': 10}, {'type': 'IN_RAIL_VEHICLE', 'confidence': 10}, {'type': 'IN_FOUR_WHEELER_VEHICLE', 'confidence': 10}, {'type': 'IN_CAR', 'confidence': 10}], 'timestamp': '2022-02-04T04:30:11.757Z'}]
Element timestamp: 2022-02-04T04:30:36.434Z
Element serverTimestamp: 2022-02-04T04:40:17.685Z
Element deviceTimestamp: 2022-02-04T04:40:16.214Z
No. (sub)Activitys => 3
(sub)Activity #1 timestamp = 2022-02-04T04:24:24.209Z
No. (sub)Activity types: 1
(sub)Activity #2 timestamp = 2022-02-04T04:26:35.525Z
No. (sub)Activity types: 2
(sub)Activity #3 timestamp = 2022-02-04T04:30:11.757Z
No. (sub)Activity types: 11

[Output Redacted for brevity ...]

ACTIVITY => 
[{'activity': [{'type': 'STILL', 'confidence': 99}, {'type': 'UNKNOWN', 'confidence': 1}], 'timestamp': '2022-02-04T10:00:45.110Z'}]
Element timestamp: 2022-02-04T10:00:32.756Z
Element serverTimestamp: 2022-02-04T10:45:20.367Z
Element deviceTimestamp: 2022-02-04T10:45:18.854Z
No. (sub)Activitys => 1
(sub)Activity #1 timestamp = 2022-02-04T10:00:45.110Z
No. (sub)Activity types: 2


Total no. of elements with at least one Activity = 36
No. of elements with multiple Activitys = 2

Processing Activitys ... Number of days = 1
Processing 2022-02-04 = 39 entries

Processed/Wrote 39 Total Activity entries to: records4_output

Exiting ...

This resulted in the following files being created in the records4_output directory:
2022-02-04.kml
2022-02-04.tsv

While the previous command will extract all dates, the script can also accept date ranges in ISO format.
The date range filter arguments are:
-a start isodate_string (eg 2021-01-31)
-b end isodate_string

The date args can be used singly or together. 
eg everything before xxx would use "-b xxx", everything after zzz would use "-a zzz"
To specify a date range, use both args ie. "-a zzz -b xxx"

For example:
python3 gRecordsActivity_ijson_date.py -i Records.json -o outputdir -a 2022-01-01 -b 2022-02-03

will extract dates between 2022-01-01 and 2022-02-03 (inclusive) into the outputdir given.
This can be helpful if there are years of data in the Takeout but only certain days are of interest.

Once you have the output directory containing the per day .KML and .TSVs, you can load the TSV for the day of interest into a spreadsheet app (eg Excel, OpenOffice Calc) and search for DetectedActivity transitions etc.

Example TSV Output

Note1: Rows are sorted by "element_timestamp". In the picture, there are 3 rows shown selected with the same "element_timestamp" value but they have 3 distinct "activity_timestamp"s. This is an example of an element with multiple (sub)activitys. Each subactivity has its own timestamp.
Note2: The "detected_activity" column can be used to more readily detect transitions between DetectedActivitys.
Note3: The "num_subactivity_types" column shows the number of DetectedActivitys listed for each (sub)activity.

Once you have found a DetectedActivity entry of interest, you can open the corresponding .KML file in Google Earth Desktop to plot the location and view selected attributes.

Redacted Example KML output viewed in Google Earth Desktop

Note1: The map has been redacted to protect Monkey's jungle gym. 
Note2: The label of each point is comprised of the "element timestamp" and the number of (sub)activity types. eg "2022-02-04T04:30:36.434Z, num_subactivity_types = 11"
Note3: The "accuracy" field can indicate how much a location plot should be trusted.
Note4: Selecting the checkbox next to the parent ISO date folder name (eg "2022-02-04") will plot all of the points in the folder. All points are not plotted on screen by default to improve Google Earth Desktop performance.

Final Thoughts

A script has been written to parse the Google Takeout "Records.json" for DetectedActivitys.
By using this script, an analyst can minimize the clutter / maximize Google Earth performance and still visualize DetectedActivity locations and transitions (eg IN_VEHICLE to STILL).
Hopefully, this script will prove useful when analyzing a user's pattern of life / looking for anomalies.