My Problem
I have a data stream coming from a program that connects to a GPS device and an inclinometer (they are both standalone devices, not a cellphone) and logs the data while the user drives around in a car. The essential data that I receive are:
- Latitude/Longitude - from GPS, with a resolution of about ±5 feet,
- Vehicle land-speed - from GPS, in knots, which I convert to MPH,
- Sequential record index - from the database, it's an auto-incrementing integer and nothing ever gets deleted,
- some other stuff that isn't pertinent to my current problem.
This data gets stored in a database and read back from the database into an array. From start to finish, the recording order is properly maintained. The timestamp recorded from the GPS device has only 1-second precision and we sample at 5 Hz, but the absolute value of the time is of no interest, so the insertion order suffices.
In order to aid in analyzing the data, a user performs a very basic data input task of selecting the "start" and "end" of curves on the road from the collected path data. I get a map image from Google and I draw the curve data on top of it. The user zooms into a curve of interest, based on their own knowledge of the area, and clicks two points on the map. Google is actually very nice and reports where the user clicked in Latitude/Longitude rather than me having to try to backtrack it from pixel values, so the issue of where the user clicked in relation to the data is covered.
The zooming in on the curve clips the data: I only retrieve data that falls in the Lat/Lng window defined by the zoom level. Most of the time, I'm dealing with fewer than 300 data points, when a single driving session could result in over 100k data points.
I need to find the subsegment of the curve data that falls between those two click points.
What I've Tried
Originally, I took the two points that are closest to each click point and the curve was anything that fell between them. That worked until we started letting the drivers make multiple passes over the road. Typically, a driver will make 2 back-and-forth runs over an interesting piece of road, giving us 4 total passes. If you take the two closest points to the two click points, then you might end up with the first point corresponding to a datum on one pass, and the second point corresponding to a datum on a completely different pass. The points in the sequence between these two points would then extend far beyond the curve. And, even if you got lucky and all the data points found were both on the same pass, that would only give you one of the passes, and we need to collect all passes.
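For reference, that first approach amounted to something like the sketch below. It is a simplification, not my production code: `TrackPoint` is a stand-in for my real point type (it uses planar coordinates in feet instead of latitude/longitude), and `NaiveClip.Between` is a name made up for this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the real point type (assumption): planar coordinates in feet.
public sealed class TrackPoint
{
    public int RecordID { get; }
    public double X { get; }
    public double Y { get; }
    public TrackPoint(int recordID, double x, double y) { RecordID = recordID; X = x; Y = y; }
    public double Distance(TrackPoint o) =>
        Math.Sqrt((X - o.X) * (X - o.X) + (Y - o.Y) * (Y - o.Y));
}

public static class NaiveClip
{
    // Returns everything recorded between the data point nearest the
    // "start" click and the data point nearest the "end" click.
    public static List<TrackPoint> Between(List<TrackPoint> data, TrackPoint start, TrackPoint end)
    {
        var a = data.OrderBy(p => p.Distance(start)).First();
        var b = data.OrderBy(p => p.Distance(end)).First();
        int lo = Math.Min(a.RecordID, b.RecordID);
        int hi = Math.Max(a.RecordID, b.RecordID);
        return data.Where(p => p.RecordID >= lo && p.RecordID <= hi).ToList();
    }
}
```

With two passes over the same road, the nearest point to one click can land on the first pass and the nearest point to the other click on the second pass, so the returned range runs through the turnaround instead of covering a single pass.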
For a while, I had a solution that worked much better. I calculated two new sequences representing the distance from each data point to each of the click points, then looked at the sign of successive differences in that distance to find where it stops decreasing and starts increasing (I call these inflection points below, though strictly speaking they are turning points of the distance). I reasoned that such a point meant that the points before it were getting closer to the click point and the points after it were getting further away from the click point. Doing this iteratively over the data points, I could group the curves as I came to them.
Perhaps some code is in order (this is C#, but don't worry about replying in kind, I'm capable of reading most languages):
static List<List<LatLngPoint>> GroupCurveSegments(List<LatLngPoint> dataPoints, LatLngPoint start, LatLngPoint end)
{
    var withDistances = dataPoints.Select(p => new
    {
        ToStart = p.Distance(start),
        ToEnd = p.Distance(end),
        DataPoint = p
    }).ToArray();
    var set = new List<List<LatLngPoint>>();
    var currentSegment = new List<LatLngPoint>();
    for (int i = 0; i < withDistances.Length - 2; ++i)
    {
        var a = withDistances[i];
        var b = withDistances[i + 1];
        var c = withDistances[i + 2];
        // the edge of the map can clip the data, so the continuity of
        // the data is not exactly mapped to the continuity of the array.
        var ab = b.DataPoint.RecordID - a.DataPoint.RecordID;
        var bc = c.DataPoint.RecordID - b.DataPoint.RecordID;
        var inflectStart = Math.Sign(a.ToStart - b.ToStart) * Math.Sign(b.ToStart - c.ToStart);
        var inflectEnd = Math.Sign(a.ToEnd - b.ToEnd) * Math.Sign(b.ToEnd - c.ToEnd);
        // if we haven't started a segment yet and we aren't obviously between segments
        if ((currentSegment.Count == 0 && (inflectStart == -1 || inflectEnd == -1)
            // if we have started a segment but we haven't changed directions away from it
            || currentSegment.Count > 0 && (inflectStart == 1 && inflectEnd == 1))
            // and we're continuous on the data collection path
            && ab == 1
            && bc == 1)
        {
            // extend the segment
            currentSegment.Add(b.DataPoint);
        }
        else if (
            // if we have a segment collected
            currentSegment.Count > 0
            // and we changed directions away from one of the points
            && (inflectStart == -1
                || inflectEnd == -1
                // or we lost data continuity
                || ab > 1
                || bc > 1))
        {
            // clip the segment and start a new one
            set.Add(currentSegment);
            currentSegment = new List<LatLngPoint>();
        }
    }
    return set;
}
This worked great until we started advising the drivers to drive around 15MPH through turns (supposedly, it helps reduce sensor error. I'm personally not entirely convinced that what we're seeing at higher speeds is error, but I'm probably not going to win that argument). A car traveling at 15MPH covers 22 feet per second. Sampling at 5 Hz means that consecutive data points are about four and a half feet apart. However, our GPS unit's precision is only about 5 feet. So the jitter of the GPS data alone could cause an inflection point in the data at such low speeds and high sample rates (technically, at this sample rate, you'd have to go at least 35MPH to avoid this problem, but it seems to work okay at 25MPH in practice).
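To make that arithmetic concrete, here is a small sketch. The names `SampleSpacing`, `FeetBetweenSamples` and `MinMphForSpacing` are made up for this post. The 10-foot spacing used in the note after the code (5 feet of jitter on each of two neighboring points) is an assumption about how much separation is needed, not a measured figure.

```csharp
using System;

public static class SampleSpacing
{
    const double FeetPerMile = 5280.0;
    const double SecondsPerHour = 3600.0;

    // Distance traveled between two consecutive samples, in feet.
    public static double FeetBetweenSamples(double mph, double sampleHz)
        => mph * FeetPerMile / SecondsPerHour / sampleHz;

    // Speed needed for consecutive samples to be at least spacingFeet apart.
    public static double MinMphForSpacing(double spacingFeet, double sampleHz)
        => spacingFeet * sampleHz * SecondsPerHour / FeetPerMile;
}
```

`SampleSpacing.FeetBetweenSamples(15, 5)` gives 4.4 feet, which is less than the GPS precision, and `SampleSpacing.MinMphForSpacing(10, 5)` gives roughly 34MPH under the 10-foot assumption.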
Also, we're probably bumping up sampling rate to 10 - 15 Hz pretty soon. You'd need to drive at about 45MPH to avoid my inflection problem, which isn't safe on most of the curves of interest. My current procedure ends up splitting the data into dozens of subsegments, over road sections that I know had only 4 passes. One section that only had 300 data points came out to 35 subsegments. The rendering of the indication of the start and end of each pass (a small icon) indicated quite clearly that each real pass was getting chopped up into several pieces.
Where I'm Thinking of Going
1. Find the minimum distance from any data point to each of the start and end click points.
2. Find all points that are within +10 feet of that minimum distance.
3. Group each set of points by data continuity, i.e. each group should be continuous in the database, because more than one point on a particular pass could fall within the distance radius.
4. Take the data mid-point of each of those groups for each click point as the representative start and end for each pass.
5. Pair up the representative "start" and "end" points so as to minimize the record index distance between each "start" and its "end".
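Roughly, the steps above could be sketched as follows. This is a hedged outline, not tested against real data: `TrackPoint` is a simplified stand-in for my real point type (planar feet instead of latitude/longitude), `PassFinder`, `Representatives` and `Pair` are made-up names, and the pairing is a simple greedy match, which is not guaranteed to minimize the total record index distance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the real point type (assumption): planar coordinates in feet.
public sealed class TrackPoint
{
    public int RecordID { get; }
    public double X { get; }
    public double Y { get; }
    public TrackPoint(int recordID, double x, double y) { RecordID = recordID; X = x; Y = y; }
    public double Distance(TrackPoint o) =>
        Math.Sqrt((X - o.X) * (X - o.X) + (Y - o.Y) * (Y - o.Y));
}

public static class PassFinder
{
    // Steps 1 to 4 for a single click point: one representative per pass.
    public static List<TrackPoint> Representatives(List<TrackPoint> data, TrackPoint click, double toleranceFeet)
    {
        double min = data.Min(p => p.Distance(click));                 // step 1
        var near = data.Where(p => p.Distance(click) <= min + toleranceFeet)
                       .OrderBy(p => p.RecordID)
                       .ToList();                                      // step 2
        var reps = new List<TrackPoint>();
        var run = new List<TrackPoint>();
        foreach (var p in near)                                        // step 3
        {
            // a gap in RecordID means we have moved on to another pass
            if (run.Count > 0 && p.RecordID != run[run.Count - 1].RecordID + 1)
            {
                reps.Add(run[run.Count / 2]);                          // step 4
                run = new List<TrackPoint>();
            }
            run.Add(p);
        }
        if (run.Count > 0) reps.Add(run[run.Count / 2]);
        return reps;
    }

    // Step 5 (greedy): give each start the unused end with the closest RecordID.
    public static List<(TrackPoint Start, TrackPoint End)> Pair(List<TrackPoint> starts, List<TrackPoint> ends)
    {
        var pairs = new List<(TrackPoint Start, TrackPoint End)>();
        var free = new List<TrackPoint>(ends);
        foreach (var s in starts)
        {
            if (free.Count == 0) break;
            var e = free.OrderBy(x => Math.Abs(x.RecordID - s.RecordID)).First();
            free.Remove(e);
            pairs.Add((s, e));
        }
        return pairs;
    }
}
```

Calling `Representatives` once per click point and passing both results to `Pair` yields one (start, end) pair per pass, with the end before the start in record order on return passes.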
Halp?!
But I had tried this once before and it didn't work very well. Step #2 can return an unreasonably large number of points if the user doesn't click particularly close to where they intend. It can return too few points if the user clicks very close to where they intend. I'm not sure just how computationally intensive step #3 will be. And step #5 will fail if the driver were to drive over a particularly long curve and immediately turn around just after the start and end to perform the subsequent passes. We might be able to train the drivers not to do this, but I don't like taking chances on such things. So I could use some help figuring out how to clip and group this path that doubles back over itself into subsegments for passes over the curve.