
How to put the whole world into a regular laptop: PostgreSQL and OpenStreetMap

Original author: Igor Suhorukov

When someone used to claim that he controlled the whole world, he was usually placed in the room next to Napoleon Bonaparte. I hope those times are in the past, and now anyone can analyze the geodata of the entire Earth and get answers to their global questions in minutes. I have published Openstreetmap_h3, my project that lets you perform geoanalytics on OpenStreetMap data in PostGIS or in any query engine that can work with Apache Arrow / Parquet.

First of all, hello to the haters and skeptics. What I developed is genuinely unique: it solves the problem of transforming and analyzing geodata with the usual, familiar tools available to every analyst and data scientist, without Big Data clusters, GPGPU, or FPGA. What now looks easy to use and to code is my personal project, into which I have invested vacations, weekends, sleepless nights, and a great deal of personal time over the past three years. Perhaps I will share the backstory of the project and the pitfalls I stepped on along the way, but first I will describe the end result.

This first post does not pretend to be a monograph; I will start with a brief overview. I plan to share my experience at conferences and in publications, and although the topic is niche, it will be useful for many projects working with geodata from OSM. For programmers and users who are not familiar with the OSM toolkit and data model, many things may seem a little strange; for now I will not delve into those basics. I will only note that, despite the criticism it receives, OpenStreetMap itself and its services keep developing, and the amount of data grows year by year. In some regions of the world, the completeness and quality of OSM data are better than those of other projects. So we simply accept the data model as it is: flexible and unrestrictive in its markup, which complicates analytics and data visualization because of how the data structures are designed.

There are specialized columnar, distributed open-source databases that I also experimented with, but PostgreSQL + PostGIS still turned out to be the most reliable and battle-tested solution available at the moment. And the icing on the cake: the same data schema can be loaded into Citus columnar storage inside the same PostgreSQL instance with just one more command-line parameter.

You encounter geodata analysis every day

When you use a taxi in a small town, calling a car by phone, there is a high probability that your trip is billed by a program based on OSM data. For billing, one of the routing packages is used. Through this use case, taxi company employees put house numbers and street names on buildings and contribute not only to their business, but also to OpenStreetMap.

Data analytics scenarios also include deciding where best to place a retail outlet so that customers will come to it. Again, data on walking distance and the population of the surroundings can be extracted from geodata. You can estimate the value of real estate based on many factors related to the location of a property and its surroundings.

Scientists can build models to predict epidemics, study urban evolution, and plan recreational areas and the development of existing territories based on open geodata.

And you can answer any geography question that comes to mind: calculate the area of cities and buildings or the length of roads, and extract the names of cities, regions, and islands. You can, for example, become a champion at the word game "Cities" or launch a new electric scooter rental service. Everything is limited only by your imagination.

Why this project is needed, and what did not suit me about existing solutions

Attempts to load the planet dump into a PostgreSQL database used to be long and painful. The Osmosis utility is written in Java, has rich functionality for working with OSM data, and can load this data into a PostgreSQL database with the pgSnapshot schema for analytics. The project is no longer developed in terms of functionality and is in maintenance mode. It can convert data from PBF to TSV files for use in a database, and there are scripts to create the database schema. The pgSnapshot schema is lossless: it lets you analyze all available information, build routes, and so on. Its main disadvantage is the bloated nodes table, which can reach hundreds of gigabytes; this prevents loading data at high speed while preserving the physical order of records, makes building the huge indexes slow, and makes a full scan over such a giant table expensive.

Osm2pgsql and Imposm can load data into PostGIS, but the import happens through a mapping that loses part of the data; they are not suitable for routing and are more oriented toward rendering tiles or working with a transformed subset of the data in user-defined tables. Osmium export is faster than Osmosis, but its PostGIS export functionality is limited and its data model is more primitive than in Osmosis.

When the workhorse fails, it's time to harness the elephant (PostgreSQL+PostGIS)

I wanted the data in PostgreSQL in a format as close as possible to pgSnapshot, with a fast mechanism for loading data from PBF and the option of parallel loading into the database. Ideally, the data would be partitioned so that objects located close to each other end up in the same partition, and the partition grid would have approximately the same radius at the equator and at the poles of the planet. Among geopartitioning schemes, I considered division by administrative borders/countries, partitioning by a rectangular grid, and the hierarchical systems S2 and H3. The choice fell on the latter: the hierarchical H3 indexing system preserves distances well (cells have approximately the same radius anywhere in the world), it is a great fit for routing tasks, and the H3 libraries are well supported, constantly evolving, and available for PostgreSQL/JVM.

As a result, I developed a Java program and designed a pgSnapshot-compatible database schema that uses PostgreSQL declarative partitioning based on ranges of the object's H3 coordinate index. The utility automatically creates scripts for the database schema, prepares data from PBF for loading into the tables, and loads it into the database in multithreaded fashion.
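To illustrate what "declarative partitioning based on H3 ranges" means, here is a minimal sketch of generating the range-partition DDL over an h3_3 column. This is not the project's actual generator, and the column list and boundary values are hypothetical: the real utility derives the boundaries from the H3 distribution statistics of the input PBF.

```python
# Sketch: emit PostgreSQL declarative RANGE partitions over an h3_3 column.
# Table columns and partition bounds are illustrative, not the real schema.

def partition_ddl(table: str, bounds: list) -> list:
    """Build CREATE TABLE statements: one parent plus one child per range."""
    ddl = [f"CREATE TABLE {table} (id bigint, h3_3 smallint, tags hstore) "
           f"PARTITION BY RANGE (h3_3);"]
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        ddl.append(
            f"CREATE TABLE {table}_{i:03d} PARTITION OF {table} "
            f"FOR VALUES FROM ({lo}) TO ({hi});")
    return ddl

for stmt in partition_ddl("ways", [0, 120, 260, 410]):
    print(stmt)
```

With boundaries tuned so that each partition holds a comparable amount of data, PostgreSQL can prune partitions by h3_3 and scan them in parallel, which is exactly what the query plan further below shows.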

It was a path of numerous experiments, starting with modifications of the Osmosis and osm-parquetizer code. At first I collected the coordinates of way points in Java using a hundred-gigabyte memory-mapped file, only to throw that code away and reuse the add-locations-to-ways functionality from osmium, the most optimized solution available, which does the same thing much faster. In addition, osmium can assemble multipolygons, which is much faster and more reliable than doing it at the database level; here, too, I decided to rely on a solution tested by the community.

If you are still drawn to Big Data, Apache Arrow comes to the rescue

We now have geodata partitioned by H3 index in PostgreSQL, with all the reliability and quality of this ecosystem. But if we need to transfer the same partitioned data to the Apache Spark or Hadoop ecosystem, we would like to get the data from PBF in the Arrow/Parquet format directly, without the intermediate PostgreSQL. Drawing on my previous experience with osm-parquetizer optimizations, I added the ability to export geodata through the Apache Arrow serialization format to my utility. To get Parquet files partitioned by H3 index, an additional post-processing step over the resulting set of files is needed, due to a flaw in the Arrow Java API (for now I do this with an additional Python post-processing script). Among the advantages of my utility over osm-parquetizer are greater parallelism in transforming the PBF data, collection of coordinates for lines, and the presence of an H3 geoindex for nodes and ways. One design choice for geoanalytics in Parquet is how to store the lines. Some alternative solutions save them as an array of points, but for now I decided to encode the lines in WKB format.
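For readers unfamiliar with WKB: a line is stored as a small binary blob with a fixed header followed by coordinate doubles, so any GIS tool can decode it without a custom schema. A minimal sketch of encoding a way's point list as a little-endian WKB LineString (byte order marker 0x01, geometry type 2, point count, then (x, y) doubles):

```python
import struct

def linestring_wkb(points):
    """Encode a list of (lon, lat) pairs as a little-endian WKB LineString."""
    # Header: byte order (1 = little-endian), type (2 = LineString), count.
    wkb = struct.pack("<BII", 1, 2, len(points))
    for lon, lat in points:
        wkb += struct.pack("<dd", lon, lat)
    return wkb

# A tiny two-point line: 9-byte header plus 2 * 16 bytes of coordinates.
blob = linestring_wkb([(100.50, 13.75), (100.51, 13.76)])
```

Compared with storing an array of points in a Parquet-native nested type, a WKB column is opaque to Parquet statistics but is understood out of the box by PostGIS, GDAL, and most geo libraries.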

Openstreetmap_h3 is developed in Java

As I mentioned above, this project evolved from Osmosis and osm-parquetizer and was rewritten from scratch, taking into account all the issues I had collected in those Java projects. The project is built with Maven, and the application itself is a command-line utility.

All this will make it easy in the future to move processing to Apache Spark, which will further increase the parallelism of processing PBF dump sections in a distributed environment. But again, I need to finish the prototype and measure performance first: is the game worth the candle? In my opinion, this approach would make sense for generating Parquet files, but hardly for PostgreSQL TSV files.

First steps with Openstreetmap_h3

So, all we need to get started is a PostgreSQL 14 database, OpenStreetMap project data, and a regular laptop. In my case it is a Dell Latitude 7320: i7-1165G7@2.80GHz, 16 GB RAM, 2 TB NVMe drive. The project works both with a dump of the whole world in PBF format and with regional extracts from Geofabrik:

Clone and build the project:

git clone https://github.com/igor-suhorukov/openstreetmap_h3.git
cd openstreetmap_h3
mvn install

The project requires Docker to work. On startup, a list of available options and their descriptions is displayed:

~/dev/projects/oss_contrib/openstreetmap_h3$ java -jar target/osm-to-pgsnapshot-schema-ng-1.0-SNAPSHOT.jar
The following option is required: [-source_pbf]
Usage: <main class> [options]
  Options:
  * -source_pbf
      Source path for OpenStreetMap data in PBF format
    -collect_only_statistics
      Collect only statistics from data - partition distribution
      Default: false
    -columnar_storage
      Use columnar storage in PostgreSql tables for nodes/ways/multipolygon
      Default: false
    -data_partition_ratio
      Filling ratio from maximum size of partition. This parameter change 
      PostgreSQL partitions count
      Default: 0.48
    -help
      Information about command line parameters
    -pg_script_count
      Script count for PostgreSQL parallel COPY
      Default: 4
    -result_in_arrow
      Save data transforming and enrichment in Apache Arrow format
      Default: false
    -result_in_tsv
      Save result data in TabSeparatedValue format for PostgreSQL COPY
      Default: true
    -scale_approx_calc
      Approximate scale calculation. Value 'false' - distance in meter
      Default: false
    -skip_buildings
      Skip any ways with 'building' tag
      Default: false
    -worker_threads
      Worker threads count for data processing
      Default: 4

After this step, download the OSM PBF dump of the planet or of any single country from Geofabrik and run the conversion:

java -jar target/osm-to-pgsnapshot-schema-ng-1.0-SNAPSHOT.jar -source_pbf /home/acc/dev/map/thailand/thailand-latest.osm.pbf

As a result, we get a set of scripts for parallel loading into the database.
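The -pg_script_count option above controls how many COPY scripts are generated, so that each can be fed to its own psql session at the same time. A minimal sketch of such a parallel runner (the psql command line and script names in the comment are illustrative, not the generated ones):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_parallel(commands, workers=4):
    """Run shell commands concurrently; return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda cmd: subprocess.run(cmd, shell=True).returncode, commands))

# Hypothetical invocation: one psql session per generated COPY script, e.g.
# run_parallel([f"psql -h 127.0.0.1 -U postgres -d osmworld -f part_{i}.sql"
#               for i in range(4)])
```

Because each script targets its own set of partitions, the sessions do not contend for the same tables, and the load scales with the number of workers until the NVMe drive saturates.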

I use my own PostgreSQL Docker image (the Dockerfile is in the project repository) with pre-installed extensions and a database initialization script that supports multi-threaded data loading:

docker run --name postgis-thailand --memory=12g --memory-swap=12g --memory-swappiness 0 --shm-size=1g -v /home/acc/dev/map/database/thailand:/var/lib/postgresql/data -v /home/acc/dev/map/thailand/thailand-latest_loc_ways:/input -e POSTGRES_PASSWORD=osmworld -e LD_LIBRARY_PATH=/usr/lib/jvm/java-11-openjdk-amd64/lib/server/ -d -p 5432:5432 -p 5005:5005 5d411c3be57f -c checkpoint_timeout='15 min' -c checkpoint_completion_target=0.9 -c shared_buffers='4096 MB' -c wal_buffers=-1 -c bgwriter_delay=200ms -c bgwriter_lru_maxpages=100 -c bgwriter_lru_multiplier=2.0 -c bgwriter_flush_after=0 -c max_wal_size='32768 MB' -c min_wal_size='16384 MB'

We launch the database container with OSM data for the whole world and connect to the database with any convenient client:

docker start postgis-planet-220704
psql -h 127.0.0.1 -p 5432 -U postgres -d osmworld

And admire the statistics of the database of the whole world (587 GB):

         Table          | Rows  |     Total Size     |     Table Size     |  Index(es) Size  |     TOAST Size     
------------------------+-------+--------------------+--------------------+------------------+--------------------
 *** TOTAL ***          | ~1B   | 587 GB (100.00%)   | 501 GB (100.00%)   | 54 GB (100.00%)  | 32 GB (100.00%)
                        |       |                    |                    |                  | 
 ways_051               | ~21M  | 12 GB (2.02%)      | 11 GB (2.14%)      | 1070 MB (1.94%)  | 73 MB (0.22%)
 relation_members       | ~113M | 9523 MB (1.58%)    | 6136 MB (1.20%)    | 3386 MB (6.15%)  | 8192 bytes (0.00%)
 ways_043               | ~14M  | 8102 MB (1.35%)    | 7334 MB (1.43%)    | 737 MB (1.34%)   | 31 MB (0.09%)
 ways_052               | ~14M  | 8013 MB (1.33%)    | 7200 MB (1.40%)    | 729 MB (1.32%)   | 84 MB (0.26%)
 ways_002               | ~17M  | 7923 MB (1.32%)    | 7013 MB (1.37%)    | 870 MB (1.58%)   | 40 MB (0.12%)
 ways_001               | ~15M  | 7360 MB (1.22%)    | 6485 MB (1.26%)    | 788 MB (1.43%)   | 87 MB (0.26%)
 ways_081               | ~14M  | 6963 MB (1.16%)    | 6127 MB (1.19%)    | 716 MB (1.30%)   | 120 MB (0.37%)
 ways_004               | ~13M  | 6886 MB (1.15%)    | 6108 MB (1.19%)    | 691 MB (1.25%)   | 87 MB (0.27%)
 ways_029               | ~11M  | 6824 MB (1.14%)    | 6197 MB (1.21%)    | 567 MB (1.03%)   | 60 MB (0.18%)
 ways_047               | ~11M  | 6750 MB (1.12%)    | 6101 MB (1.19%)    | 577 MB (1.05%)   | 71 MB (0.22%)
 ways_093               | ~13M  | 6522 MB (1.09%)    | 5749 MB (1.12%)    | 644 MB (1.17%)   | 128 MB (0.39%)
 ways_095               | ~13M  | 6451 MB (1.07%)    | 5689 MB (1.11%)    | 671 MB (1.22%)   | 91 MB (0.28%)
 ways_090               | ~12M  | 6405 MB (1.07%)    | 5708 MB (1.11%)    | 620 MB (1.13%)   | 77 MB (0.23%)
 ways_003               | ~12M  | 6401 MB (1.07%)    | 5553 MB (1.08%)    | 641 MB (1.16%)   | 206 MB (0.63%)
 multipolygon_32767     | ~187k | 6366 MB (1.06%)    | 338 MB (0.07%)     | 27 MB (0.05%)    | 6002 MB (18.32%)
 ways_005               | ~12M  | 6317 MB (1.05%)    | 5590 MB (1.09%)    | 601 MB (1.09%)   | 126 MB (0.38%)
 ways_092               | ~12M  | 6227 MB (1.04%)    | 5511 MB (1.07%)    | 610 MB (1.11%)   | 107 MB (0.33%)
 ways_074               | ~11M  | 6208 MB (1.03%)    | 5507 MB (1.07%)    | 587 MB (1.07%)   | 114 MB (0.35%)
 ways_091               | ~11M  | 6158 MB (1.02%)    | 5425 MB (1.06%)    | 578 MB (1.05%)   | 156 MB (0.47%)
 ways_098               | ~12M  | 6154 MB (1.02%)    | 5421 MB (1.06%)    | 595 MB (1.08%)   | 138 MB (0.42%)
 ways_089               | ~11M  | 6128 MB (1.02%)    | 5418 MB (1.06%)    | 576 MB (1.05%)   | 134 MB (0.41%)
 ways_080               | ~11M  | 5973 MB (0.99%)    | 5206 MB (1.01%)    | 549 MB (1.00%)   | 217 MB (0.66%)
 ways_097               | ~11M  | 5940 MB (0.99%)    | 5298 MB (1.03%)    | 547 MB (0.99%)   | 95 MB (0.29%)
 ways_045               | ~11M  | 5877 MB (0.98%)    | 5261 MB (1.03%)    | 546 MB (0.99%)   | 70 MB (0.21%)
 ways_053               | ~9M   | 5810 MB (0.97%)    | 5289 MB (1.03%)    | 487 MB (0.88%)   | 34 MB (0.10%)
 ways_019               | ~10M  | 5794 MB (0.96%)    | 5124 MB (1.00%)    | 517 MB (0.94%)   | 153 MB (0.47%)
 ways_006               | ~10M  | 5774 MB (0.96%)    | 5033 MB (0.98%)    | 510 MB (0.93%)   | 231 MB (0.70%)
 ways_046               | ~10M  | 5702 MB (0.95%)    | 5094 MB (0.99%)    | 505 MB (0.92%)   | 103 MB (0.31%)
 ways_042               | ~10M  | 5683 MB (0.95%)    | 5075 MB (0.99%)    | 496 MB (0.90%)   | 112 MB (0.34%)
 ways_049               | ~9M   | 5663 MB (0.94%)    | 5118 MB (1.00%)    | 456 MB (0.83%)   | 89 MB (0.27%)
 ways_037               | ~9M   | 5634 MB (0.94%)    | 5040 MB (0.98%)    | 447 MB (0.81%)   | 147 MB (0.45%)
 ways_054               | ~9M   | 5627 MB (0.94%)    | 4856 MB (0.95%)    | 485 MB (0.88%)   | 286 MB (0.87%)
 ways_017               | ~10M  | 5605 MB (0.93%)    | 4950 MB (0.97%)    | 507 MB (0.92%)   | 148 MB (0.45%)
 ways_075               | ~9M   | 5587 MB (0.93%)    | 4940 MB (0.96%)    | 488 MB (0.89%)   | 159 MB (0.49%)
 ways_026               | ~9M   | 5585 MB (0.93%)    | 5064 MB (0.99%)    | 482 MB (0.88%)   | 39 MB (0.12%)
 ways_048               | ~9M   | 5584 MB (0.93%)    | 4985 MB (0.97%)    | 484 MB (0.88%)   | 114 MB (0.35%)
 ways_082               | ~10M  | 5548 MB (0.92%)    | 4890 MB (0.95%)    | 500 MB (0.91%)   | 158 MB (0.48%)
 ways_000               | ~9M   | 5460 MB (0.91%)    | 4750 MB (0.93%)    | 461 MB (0.84%)   | 248 MB (0.76%)
 ways_094               | ~10M  | 5457 MB (0.91%)    | 4737 MB (0.92%)    | 493 MB (0.89%)   | 227 MB (0.69%)
 ways_033               | ~8M   | 5454 MB (0.91%)    | 4808 MB (0.94%)    | 403 MB (0.73%)   | 243 MB (0.74%)
 ways_087               | ~9M   | 5445 MB (0.91%)    | 4808 MB (0.94%)    | 463 MB (0.84%)   | 173 MB (0.53%)
 ways_010               | ~9M   | 5357 MB (0.89%)    | 4774 MB (0.93%)    | 440 MB (0.80%)   | 143 MB (0.44%)
 ways_040               | ~9M   | 5348 MB (0.89%)    | 4722 MB (0.92%)    | 467 MB (0.85%)   | 159 MB (0.48%)
 ways_088               | ~9M   | 5262 MB (0.88%)    | 4662 MB (0.91%)    | 455 MB (0.83%)   | 145 MB (0.44%)
 ways_060               | ~8M   | 5243 MB (0.87%)    | 4673 MB (0.91%)    | 420 MB (0.76%)   | 150 MB (0.46%)
 ways_096               | ~10M  | 5207 MB (0.87%)    | 4613 MB (0.90%)    | 495 MB (0.90%)   | 99 MB (0.30%)
 ways_024               | ~8M   | 5205 MB (0.87%)    | 4722 MB (0.92%)    | 431 MB (0.78%)   | 52 MB (0.16%)
 ways_071               | ~9M   | 5198 MB (0.87%)    | 4625 MB (0.90%)    | 469 MB (0.85%)   | 104 MB (0.32%)
 ...

Let's use the example from my previous posts and count the number of school buildings around the world, split by H3 index:

docker start postgis-planet-220704

psql -h 127.0.0.1 -p 5432 -U postgres -d osmworld

Password for user postgres: 
🧐 🐘 postgres_dba 6.0 installed. Use ":dba" to see menu
Timing is on.
psql (14.6 (Ubuntu 14.6-1.pgdg20.04+1), server 14.1 (Debian 14.1-1.pgdg110+1))
Type "help" for help.

osmworld=# \timing on
Timing is on.
osmworld=# create table school 
    as select h3_3, count(*) as "count" from ways 
        where closed and (tags->'building' = 'school' 
         or (tags->'building' is not null and tags->'amenity'='school')) 
       group by h3_3 order by 2 desc;
SELECT 6783
Time: 215154,128 ms (03:35,154)

Let's look at the query plan to see what the database spent these 3:35 minutes on and how the query was executed in parallel across all partitions:

explain create table school 
    as select h3_3, count(*) as "count" from ways 
        where closed and (tags->'building' = 'school' 
         or (tags->'building' is not null and tags->'amenity'='school')) 
       group by h3_3 order by 2 desc

QUERY PLAN                                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=13217952.21..13219300.04 rows=11052 width=10)
   Workers Planned: 6
   ->  Sort  (cost=13216952.12..13216956.72 rows=1842 width=10)
         Sort Key: (count(*)) DESC
         ->  Parallel Append  (cost=1107174.96..13216852.21 rows=1842 width=10)
               ->  GroupAggregate  (cost=1889369.47..1890603.83 rows=7 width=10)
                     Group Key: ways_51.h3_3
                     ->  Sort  (cost=1889369.47..1889780.90 rows=164572 width=2)
                           Sort Key: ways_51.h3_3
                           ->  Seq Scan on ways_051 ways_51  (cost=0.00..1873744.28 rows=164572 width=2)
                                 Filter: (closed AND (((tags -> 'building'::text) = 'school'::text) OR (((tags -> 'building'::text) IS NOT NULL) AND ((tags -> 'amenity'::text) = 'school'::text))))
               ->  HashAggregate  (cost=1278539.18..1278539.58 rows=40 width=10)
                     Group Key: ways_2.h3_3
                     ->  Seq Scan on ways_002 ways_2  (cost=0.00..1277795.15 rows=148806 width=2)
                           Filter: (closed AND (((tags -> 'building'::text) = 'school'::text) OR (((tags -> 'building'::text) IS NOT NULL) AND ((tags -> 'amenity'::text) = 'school'::text))))
               ->  GroupAggregate  (cost=1269524.73..1270332.33 rows=7 width=10)
                     Group Key: ways_43.h3_3
                     ->  Sort  (cost=1269524.73..1269793.90 rows=107671 width=2)
                           Sort Key: ways_43.h3_3
                           ->  Seq Scan on ways_043 ways_43  (cost=0.00..1260525.44 rows=107671 width=2)
                                 Filter: (closed AND (((tags -> 'building'::text) = 'school'::text) OR (((tags -> 'building'::text) IS NOT NULL) AND ((tags -> 'amenity'::text) = 'school'::text))))
               ->  GroupAggregate  (cost=1248285.94..1249040.42 rows=7 width=10)
                     Group Key: ways_52.h3_3
                     ->  Sort  (cost=1248285.94..1248537.41 rows=100588 width=2)
                           Sort Key: ways_52.h3_3
                           ->  Seq Scan on ways_052 ways_52  (cost=0.00..1239928.03 rows=100588 width=2)
                                 Filter: (closed AND (((tags -> 'building'::text) = 'school'::text) OR (((tags -> 'building'::text) IS NOT NULL) AND ((tags -> 'amenity'::text) = 'school'::text))))
 ...

 JIT:
   Functions: 806
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(612 rows)

Let's visualize the result in QGIS:

select 
   h3_to_geo_boundary_geometry(h3_3::h3index), 
   count 
  from school 
      where count>100

PostgreSQL is supported by most data visualization software; complex queries with window functions and subqueries also work. Since this database can run on almost any hardware, geoanalytics becomes available everywhere, with only some performance penalty.

See also

  1. Divide and Conquer for OpenStreetMap world inside PostgreSQL

  2. Two troubles: roads and completeness of data. We consider the length of the roads of the homeland according to OpenStreetMap (RU)

  3. Column database in PostgreSQL 15 and facts about the territory of Russia according to OpenStreetMap (RU)

  4. Roads and building density in North America. 100GB geodata processing OSM data in PostgreSQL

Welcome to the world of geodata analytics!

The uniqueness of the OpenStreetMap project is that everyone on the planet can use its data, but given the volume of that data, doing so has not been easy. Now you no longer need a powerful, specialized high-performance server to run the database, and I hope that my approach to OSM data partitioning and the ability to use PostgreSQL for geospatial analytics will let more people query this unique project and develop new services that take into account distances and real-world objects and improve the world around us.
