How to anonymise a GPS file?
Imagine you have a GPS trace that you want to make anonymous, in the legal sense. How would you do this? Is snapping to the nearest x distance and stripping out the time enough? Are there internationally agreed standards on this? Has anyone already written an algorithm to do this? If not, I plan to write a function in my evolving stplanr package to do it.
Reproducible example (using awesome rotation function from @geospacedman) from my own 'Identifiable' data:
    library(rgdal)
    library(tmap)
    downloader::download("https://www.openstreetmap.org/trace/1619756/data", "test.gpx")
    r <- readOGR(dsn = "test.gpx", layer = "tracks")
    r <- spTransform(r, CRS("+init=epsg:27700"))
    # rotateProj() is @geospacedman's rotation function, defined elsewhere
    rproj <- rotateProj(r, 90) # rotate projection for plotting
    r <- spTransform(r, rproj)
    rs <- rgeos::gSimplify(r, 1000) # simplify with a 1 km tolerance
    qtm(r) + qtm(rs, line.col = "red") + tm_layout(draw.frame = FALSE) + tm_scale_bar()
The result is shown above. In summary: is the red route 'identifiable' and is there a better way?
I'm working with our local cycling group to anonymise GPX files on two criteria (primarily for security). I've never come across a standard way of anonymising data, but this approach addresses two concerns of our members while preserving accuracy along roads and speed information:
- Personal locations: removing 'private' areas for individuals;
- Timestamps: obscuring them so that travel data cannot be used to identify individual movements.
GPSBabel can do both of these from the command line. For example, to shift the times in a GPX file by +123450 seconds and remove all trackpoints within 0.5 km of a landmark in northern Tanzania:
    gpsbabel -t -i gpx -f infile.gpx -x transform,wpt=trk,del -x track,move=123450s -x radius,distance=0.5K,lat=-3.368,lon=36.624,nosort,exclude -x transform,trk=wpt,del -o gpx -F infile_rand.gpx
-t: process tracks only;
-i, -f: input file type (gpx) and input filename;
-x: a filter argument; four are chained here, including the timeshift (track,move) and the removal around a point (radius,exclude);
-o, -F: output file type (gpx) and output filename.
This command chains together several filters - first transforming the trackpoints into waypoints, then filtering, then transforming back to trackpoints.
Note that reducing the number of decimal places in the coordinates of the landmark / privacy area is VERY important, as it obscures the exact centre of the privacy area. 3 decimal places = ~110 m accuracy in this case.
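As a rough check of that figure, here is a minimal base R sketch. The helper name `round_centre` and the unrounded example coordinates are made up for illustration, and 111,320 m per degree of latitude is the usual spherical approximation rather than an exact value.

```r
# Approximate metres per degree of latitude (spherical Earth assumption)
m_per_deg_lat <- 111320

# Hypothetical helper: round the centre of a privacy area to `digits` decimals
round_centre <- function(lat, lon, digits = 3) {
  c(lat = round(lat, digits), lon = round(lon, digits))
}

# Grid spacing implied by rounding to 3 decimal places
step_lat_m <- 10^-3 * m_per_deg_lat                           # about 111 m north-south
step_lon_m <- 10^-3 * m_per_deg_lat * cos(-3.368 * pi / 180)  # about 111 m east-west this close to the equator

# Made-up "true" centre, rounded before it is passed to the radius filter
centre <- round_centre(-3.36812, 36.62449)
```

The east-west spacing shrinks with the cosine of the latitude, so at higher latitudes three decimal places correspond to less than 110 m in that direction.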
I usually call GPSBabel from R, writing a new GPX file with filters applied, including a random timeshift +/- 2 weeks. This would be better as a bash or python script but a lot of the other work I do is in R and I'm lazy…
    # Get the correct location for GPSBabel
    GB <- Sys.which("gpsbabel")
    # dist (km), lat, lon and gpx_file are assumed to be defined beforehand
    # Set up the filters
    shift <- round(runif(1, 0, 2600000) - 1300000, 0) # roughly +/- 2 weeks in secs
    filter <- " -x transform,wpt=trk,del"
    filter <- paste0(filter, " -x track,move=", shift, "s")
    filter <- paste0(filter, " -x radius,distance=", dist, "K,lat=", lat, ",lon=", lon, ",nosort,exclude")
    filter <- paste0(filter, " -x transform,trk=wpt,del")
    # Pass the complete command to the system
    system(paste0(GB, " -t -i gpx -f ", gpx_file, filter, " -o gpx -F ",
                  gsub(".gpx", replacement = "_rand.gpx", x = gpx_file, fixed = TRUE)),
           intern = TRUE)
You are out of luck: this is tremendously hard to do! If you are serious about it, you should read about differential privacy, because this is probably what you are after.
When you think about this problem, consider the case of a reclusive person living at the end of a long, isolated road. Do you really think you can do something to their GPS coordinates and not reveal anything about that particular person? The side information here is that it can easily be discovered that only one person lives there.
Stripping the user ID and the time, and adding noise to the data points, is a good place to start. But the problem is that all the data points are heavily correlated, so if you add independent random noise to each point, the noise will largely cancel out and someone will be able to derive the likely trajectories. So the noise would have to be resistant to this attack, for instance by being constant over a trajectory. But then trajectories can probably be matched quite easily with likely routes based on roads, etc.
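To illustrate the cancelling-out argument, here is a minimal base R sketch on a made-up straight track. The noise level (50 m), the window size (25 points) and the moving-average "attack" are all assumptions chosen for illustration, not a model of a real adversary.

```r
set.seed(1)
n <- 500
y <- rep(0, n)  # made-up track: true cross-track position is 0 m everywhere

# Independent noise on every point (sd = 50 m)
y_noisy <- y + rnorm(n, sd = 50)

# An attacker smooths the noisy track with a centred moving average
k <- 25
y_smooth <- stats::filter(y_noisy, rep(1 / k, k), sides = 2)

# Root-mean-square error before and after smoothing
err_noisy  <- sqrt(mean((y_noisy - y)^2))
err_smooth <- sqrt(mean((y_smooth - y)^2, na.rm = TRUE))

# One constant offset for the whole trajectory is untouched by smoothing
offset <- rnorm(1, sd = 50)
y_const_smooth <- stats::filter(y + offset, rep(1 / k, k), sides = 2)
err_const <- sqrt(mean((y_const_smooth - y)^2, na.rm = TRUE))
```

Averaging 25 independent errors shrinks the noise by roughly a factor of five, whereas the constant offset keeps its full size; it remains exposed to map matching, though, because the shape of the route is unchanged.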
I am not sure whether the data you end up with will still be usable for whatever you want to do with it, but at least it is a fascinating field.
PS: I don't know about "legally acceptable"; I would expect that to be a moving target and country-specific, whereas the mathematical definition of differential privacy is the most robust you can get.
Adjust the X and Y coordinates of each point by a random distance between a chosen minimum and maximum offset, and make the direction of each offset (plus or minus) a random selection too. Include in the randomisation that some points may have no adjustment to one or both parts of a coordinate pair.
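A minimal base R sketch of this idea, assuming projected coordinates in metres. The function name `jitter_track`, its default bounds and the 10% chance of leaving an ordinate untouched are all hypothetical choices.

```r
# Hypothetical helper: randomly offset projected coordinates (metres).
# min_off, max_off: bounds on the offset size; p_none: chance an ordinate is left as-is
jitter_track <- function(x, y, min_off = 20, max_off = 100, p_none = 0.1) {
  one_axis <- function(v) {
    n <- length(v)
    size <- runif(n, min_off, max_off)                # random distance within the bounds
    direction <- sample(c(-1, 1), n, replace = TRUE)  # random plus or minus
    untouched <- runif(n) < p_none                    # some ordinates get no adjustment
    v + ifelse(untouched, 0, direction * size)
  }
  # X and Y are drawn independently, so a point may move in one, both or neither
  data.frame(x = one_axis(x), y = one_axis(y))
}

set.seed(42)
track <- data.frame(x = seq(0, 900, by = 100), y = rep(0, 10))  # made-up track
jittered <- jitter_track(track$x, track$y)
```

Because each point is offset independently, this kind of noise can be smoothed away by averaging neighbouring points, so it is best treated as one ingredient rather than a complete anonymisation.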