OpenStreetMap (OSM) is a global project formed around a geographic information database which is being filled by all comers — both enthusiasts and interested companies. Anybody can contribute, but the openness has its downside: incorrect edits often get into the database. Hence plenty of validators of OSM data have been written which allow to maintain the data quality at an acceptable level.
Since 2016 there exists an open source subway preprocessor that validates (generates error reports) rapid transit routes in OSM for completeness and logical/topological errors, and converts them into formats that are suitable for routing and rendering, e.g. GTFS. Besides OSM data it takes a list of public transport (PT) networks which contains the checking information about the number of lines, stations etc. per a PT network. The preprocessor has successfully proven itself in the preparation of PT data for applications such as Maps.me and Organic Maps.
In this article, I would like to share an approach to detecting one of the types of errors that occur quite often in OSM data and automatic detection of which is somewhat challenging. It's an accidental loss of a station from a route. The source code of the validator and the described algorithm are open source. But first, let's define the concepts used to represent PT data in OpenStreetMap.
Basic concepts
At the first reading, you can do with the layman's idea of a station (stop), although in OSM at least four different types of objects can correspond to it:
railway=halt (e.g. Elsholz halt),
railway=station (e.g. Green Park tube station),
public_transport=stop_position (e.g. stop position at Green Park tube station),
public_transport=stop_area (e.g. stop area at Green Park tube station).
Note. The links to OSM objects may become obsolete. One can obtain actual ones by running the following query in Overpass Turbo while positioned near some station:(
nwr[railway=station]({{bbox}});
rel[public_transport=stop_area]({{bbox}});
);
(._;>>;);
out;
Schematically, these concepts are illustrated by the image (created for OSM Wiki under CC SA-4.0 license):
In case you want to delve into the program code of the algorithm on GitHub, I note that the role of the station in it is performed by the StopArea class based on the OSM public_transport=stop_area relation which combines the railway station, platforms and entrances to the station at a certain PT stop location.
Note. We will talk about high-speed rail transport, although algorithms and conclusions can be extended to overground transport: buses, trams, trolleybuses.
A transfer / interchange is two or more stations that can be crossed on foot in a relatively short time. In OSM, they are designated by a special object — relation with public_transport=stop_area_group tag. An example is the transfer at Green Park station between Piccadilly Line, Jubilee Line and Victoria Line. Sometimes what looks like a single station to a passenger can be represented in OSM as several stations connected by a transfer.
A route is a sequence of stops, usually with its own identifier, which is periodically followed by PT vehicles. An example of a real route in OSM: Piccadilly line: Uxbridge → Cockfosters.
Consider a hypothetical route "R1: Reading → Maryland". It follows from the definition that "R2: Maryland → Reading" is a different route, as well as the shortened variant "R4: Reading → Slough" or the express route "R3: Maryland - Slough - Reading" making only one intermediate stop.
A route master is a combination of all route variants within the general direction, including forward and backward routes, shortened routes, express routes accelerated by skipping some stations, as well as variants with other differences. In our example, all the listed routes which we designate as R1-R5, can be combined into one route master. An example of a real route master is Piccadilly Line.
Note. Here and further, the stations and routes in the pictures are for illustrative purposes only and may not correspond to reality.
More about the problem being solved
When the validator processes OSM data, it compares the number of stations found on a route with the number stored in the control values file. While storing and maintaining the number of stations on all route masters in all rapid transit networks in the world seems possible, then for all route variants it is not an option. Therefore, if a station in one of the directions is accidentally deleted from the OSM data, the number of stations in the route master will not change and the validator will not notice, for example, an error where Langley station is missing from the backward direction:
The algorithm
Search for twin routes
Let's turn to the set of routes R1-R5 one again:
Let's call two routes in a route master potential twins if the beginning of the first route is the end of the other, and vice versa. For the route R1 potential twins are R2 and R3. Among the potential twins, we will choose the closest one — in terms of the number of common stations. Thus, in our case, we will find two pairs of twin routes: (R1, R2) and (R4, R5). For each pair, we will run the algorithm to find differences, because it is very likely that the difference in the twins is the result of errors in the data.
Finding differences in routes
Determining whether one route contains all stations from another route is not a difficult task. However, there are many nuances here: a station can be missing from both routes, or several stations are missing at once, going successively or not. Furthermore, a station in one route may not correspond to the same station in another, that is, it may be mistakenly replaced by a duplicate or a completely different station. All this is very reminiscent of... Levenshtein edit distance!
Let's look at routes T1 and T2 again:
Let's imagine a route as a word, where the letters are stations (in general, transfer hubs). According to the first letters of the stations, the word RTBSLHM corresponds to the route T1, and the word MHSBTR corresponds to the backward route T2. Let's compare the word for T1 with the inverted word for T2:
RTBSLHM
RTBSHM
For ideal twins, the words should match, but we have a "typo" in T2 — the letter L is missing. Thus, the search for station inconsistencies on a pair of routes is reduced to searching for the edit distance between a pair of words.
Note. The first letters are used here for illustration purposes only. The real "letters" are the objects of stations/transfers!
The Wagner-Fischer algorithm allows, using dynamic programming, to find the shortest sequence of steps to convert one word into another by adding, deleting or replacing one letter in one step. The result will be an editorial prescription — the places in which additions/deletions/replacements are needed.
Thus, the Wagner-Fischer algorithm adapted to routes will give out all the differences in stations for a pair of twin routes. Let's illustrate this with another example with more "typos" on routes, where, in addition to the loss of stations in the source data, there are two different stations where there should be one common station.
The adapted Wagner-Fischer algorithm, when running on a pair of routes (S1, S2), would provide the following warnings:
Taplow station is missing from route S1;
Langley station is missing from route S2;
Different Slough stations on routes S1 and S2.
False positives
In reality, every detected absence/mismatch of a station may be a false positive and requires a separate investigation, however, in the described version, the algorithm will give too many false positives in cases where the forward and backward routes really diverge at some point along the way, for example:
The algorithm will say that the stations Woodstock and Esplanade, Salt River and Mutual do not correspond to each other, and Ndabeni is missing from the U1 route. To cope with this, it is worth noting that in twin routes the rails usually run quite close, and within the station the landing points are not more than a couple of dozen meters apart. But even taking into account large railway stations with many tracks and platforms, this figure should not exceed 100 m. We can safely overestimate this figure, because it is better to get a false positive than to miss an error. Moreover, if the paths of the railway routes really diverge, then usually much more than by 100 m.
Thus, to detect divergence of routes, we formulate the following criterion: if the distance from the station, defined by the algorithm as missed or inappropriate, to the rails of the backward route exceeds a certain threshold (~ 100 m), then we believe that the rails diverge, and we do not generate error messages.
Practical application
The mentioned metro validator has been working for many years in OSM, it contains versatile checks, and the errors it issues are regularly corrected by several people. At the same time, this is not the only PT validator. Nevertheless, when the above algorithm was run on 300+ metro and other high-speed railway transport networks, it produced 144 missed/mismatched stations, merely 17 of which turned out to be false positives.
That's how this beautiful example of the pervasive essence of the theory of algorithms worked, where text processing methods have found application in the field of public transport — the realm of graph theory.
The algorithm is recommended for implementation in such widely used validators as: