Processing large OpenStreetMap datasets for research

Introduction

This practical was developed for the OpenGeoHub summer school 2023.

As outlined in the session abstract, the session will cover:

  • How and where to download OSM data
  • How to process small amounts of OSM data using the osmdata R package
  • How to process large OSM ‘extracts’ data with the osmextract R package
  • Other command line tools for working with OSM data, including the mature and widely used osmium tool, the pyrosm Python package and the osm2streets web application and Rust codebase

Finally, the session will outline ideas for using OSM data to support the fast and fair decarbonisation of the global economy.

Prerequisites

To participate in the session, all you need is R and the following packages:

pkgs = c(
    "tidyverse",
    "osmdata",
    "osmextract"
)
remotes::install_cran(pkgs)

You may want to install the following packages for Section 6:

We will get data representing case study areas:

How and where to download OSM data

There are two main ways to download OSM data:

  • In small amounts, which you can get from osm.org directly, and from services that host the Overpass API and provide free access.
  • In large amounts, which you can get from a smaller number of extract providers. Because the focus of this session is on large amounts of data, and because providers of small uncompressed datasets cannot scale to national or global datasets, we will focus on this second way of getting data, with key providers described below.

Uncompressed OSM providers

Queries to Overpass API providers can be made directly from the Overpass Turbo web application, which has a convenient web interface that is ideal for exploring the data and writing queries iteratively.

Overpass uses the Overpass QL query language, an example of which is provided below. You can see the results of an example query in Overpass Turbo at: https://overpass-turbo.eu/?Q=%28%0A+++node%2851.249%2C7.148%2C51.251%2C7.152%29%3B%0A+++%3C%3B%0A%29%3B%0Aout+meta%3B

This can be written in Overpass QL as:

(
  node(51.249,7.148,51.251,7.152);
  <;
);
out meta;

After saving this query as a file (e.g. called query.txt), you can download the data using the curl command line tool as follows:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d @query.txt https://overpass-api.de/api/interpreter > data.osm
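The same kind of bounding-box query can also be assembled in R before it is sent to an Overpass endpoint. The sketch below uses only base R. Note that overpass_bbox_query() is a hypothetical helper written for illustration (it is not part of any package), and the bounding box order (south, west, north, east) is taken from the query shown above.

```r
# Hypothetical helper: build an Overpass QL query for all nodes in a bounding
# box, plus the ways and relations that reference them (the `<` statement)
overpass_bbox_query = function(south, west, north, east) {
  bbox = paste(c(south, west, north, east), collapse = ",")
  paste0("(\n  node(", bbox, ");\n  <;\n);\nout meta;")
}

query = overpass_bbox_query(51.249, 7.148, 51.251, 7.152)
cat(query)

# Save the query to a file so it can be sent with curl, as in the command above
query_file = file.path(tempdir(), "query.txt")
writeLines(query, query_file)

# Percent-encode the query, e.g. for use in a URL
query_encoded = utils::URLencode(query, reserved = TRUE)
```

Writing the query from a function like this makes it easier to reuse the same query template for different study areas.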

As outlined in the providers_comparison vignette in the osmextract package, there are several providers of OSM data. The main ones that provide regular extracts without need for logins are:

library(osmextract)
Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright.
Check the package website, https://docs.ropensci.org/osmextract/, for more details.
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

The zones covered by each provider are plotted with the code shown in the sections below:

Geofabrik

geofabrik is a company that provides map-based services and free downloads of OSM extracts that are updated daily. These extracts are based on a division of the world into different regions, at 4 different levels. Zones in level 1 each cover a whole continent (plus the Russian Federation):

par(mar = rep(0, 4))
plot(geofabrik_zones[geofabrik_zones$level == 1, "name"], key.pos = NULL, main = NULL)

Level 2 contains polygons representing several countries all around the world:

plot(geofabrik_zones[geofabrik_zones$level == 2, "name"], key.pos = NULL, main = NULL)

Geofabrik also defines several special zones, such as Alps, Britain and Ireland, Germany, Austria and Switzerland, US Midwest, US Northeast, US Pacific, US South and US West (level 3). Moreover, it contains extracts for some administrative subregions, mainly in Europe, Russia, Canada and South America:

plot(geofabrik_zones[geofabrik_zones$level == 3, "name"], key.pos = NULL, main = NULL)

Check ?geofabrik_zones and the provider’s webpage for more details.
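As a complement to the maps, the zones can also be summarised as a table. This is a minimal sketch that assumes the osmextract package (and sf, which it depends on) is installed; it only uses the id, name, parent and level columns that appear elsewhere in this document.

```r
library(osmextract)

# Number of zones at each level of the Geofabrik hierarchy
table(geofabrik_zones$level)

# Identifier, name and parent of the first few level 2 zones
level_2 = geofabrik_zones[geofabrik_zones$level == 2, ]
head(sf::st_drop_geometry(level_2)[, c("id", "name", "parent")])
```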

Openstreetmap.fr

openstreetmap_fr extracts are taken from http://download.openstreetmap.fr/, a web-service that provides OSM data updated every few minutes. The extracts are based on several regions, such as the continents (level 1):

# Russian federation is considered as a level 1 zone
plot(openstreetmap_fr_zones[openstreetmap_fr_zones$level == 1, "name"], key.pos = NULL, main = NULL)

or some countries around the world (fewer than geofabrik’s level 2 zones):

plot(openstreetmap_fr_zones[openstreetmap_fr_zones$level == 2, "name"], key.pos = NULL, main = NULL)

The provider also offers more detailed extracts for subregions of some countries, such as India,

plot(openstreetmap_fr_zones[openstreetmap_fr_zones$parent == "india", "name"], key.pos = NULL, main = NULL)

France,

ids_2 = openstreetmap_fr_zones$parent %in% "france"
ids_3 = openstreetmap_fr_zones$parent %in% openstreetmap_fr_zones$id[ids_2]

plot(openstreetmap_fr_zones[ids_2 | ids_3, "name"], key.pos = NULL, main = NULL)

and Brazil

ids_2 = openstreetmap_fr_zones$parent %in% "brazil"
ids_3 = openstreetmap_fr_zones$parent %in% openstreetmap_fr_zones$id[ids_2]

plot(openstreetmap_fr_zones[ids_2 | ids_3, "name"], key.pos = NULL, main = NULL)

BBBike

The bbbike provider is based on https://download.bbbike.org/osm/bbbike/. It is quite different from any other provider supported in osmextract since it contains OSM data for more than 200 cities worldwide.

The bbbike provider is the safest choice if you are looking for OSM data for a particular city.

How to process small amounts of OSM data using the osmdata R package

The osmdata package is a mature and widely used tool for working with OSM data in R. It is designed to work with small amounts of data, such as the area around a city or a country. It is not designed to work with large amounts of data, such as the whole of Europe or the world. For that, we need a different approach, which is covered in the next section.
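A minimal sketch of an osmdata query is shown below, assuming the osmdata package is installed. The bounding box is the one used in the Overpass example above, reordered to (xmin, ymin, xmax, ymax), and the choice of the highway=cycleway tag is only for illustration. The query is built and inspected locally; the line that actually sends it to an Overpass server is left commented out because it needs internet access.

```r
library(osmdata)

# Bounding box in the order (xmin, ymin, xmax, ymax)
bb = c(7.148, 51.249, 7.152, 51.251)

# Build (but do not yet send) a query for cycleways in the bounding box
q = opq(bbox = bb) |>
  add_osm_feature(key = "highway", value = "cycleway")

# Inspect the Overpass QL generated by osmdata
cat(opq_string(q))

# Send the query and convert the result to sf objects (needs internet access):
# cycleways = osmdata_sf(q)
# cycleways$osm_lines
```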

How to process large OSM ‘extracts’ data with the osmextract R package

The quickest way to get large OSM datasets in R (and possibly in any data analysis framework) is to use the osmextract package.

The package makes your life easy by automating many parts of the OSM extract identification, download, and processing pipeline, so you can focus on the analysis and high-impact research!

Finding an extract to download

Let’s see how it works for the city of Poznan:

First, we geocode Poznan, Poland to get its coordinates

poznan = tmaptools::geocode_OSM("Poznan, Poland")$coords
# poznan = c(x = 16.933, y = 52.408)

and look for a match in the OSM extracts using oe_match():

oe_match(poznan, provider = "geofabrik")
$url
[1] "https://download.geofabrik.de/europe/poland/wielkopolskie-latest.osm.pbf"

$file_size
[1] 1.26e+08
oe_match(poznan, provider = "bbbike")
$url
[1] "https://download.bbbike.org/osm/bbbike/Poznan/Poznan.osm.pbf"

$file_size
[1] 25856507
oe_match(poznan, provider = "openstreetmap_fr")
$url
[1] "http://download.openstreetmap.fr/extracts/europe/poland/wielkopolskie-latest.osm.pbf"

$file_size
[1] 143404382

As shown above, bbbike is the only provider with a city-level match for Poznan (the other two match the larger Wielkopolskie region, resulting in much larger files).
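To make the comparison concrete, the file sizes reported by oe_match() can be converted from bytes to MB. The values below are copied from the output above; the object names are chosen for illustration only.

```r
# File sizes in bytes, copied from the oe_match() output above
file_sizes = c(
  geofabrik = 1.26e8,
  bbbike = 25856507,
  openstreetmap_fr = 143404382
)

# Convert to MB and identify the provider with the smallest extract
sizes_mb = round(file_sizes / 1e6, 1)
sizes_mb
smallest = names(which.min(file_sizes))
smallest
```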

Downloading the extract

We can download the extract using oe_get(). By default it reads the lines layer; use layer = "points" to get the points layer of the OSM data instead:

poznan = oe_get("Poznan", provider = "bbbike", force_vectortranslate = TRUE)

Find where the file is as follows:

poznan_file = oe_find("Poznan", provider = "bbbike", return_gpkg = FALSE)
# [1] "/home/robin/data/osm/bbbike_Poznan.osm.pbf"

Note: you can check the location where datasets will be downloaded and translated using the oe_download_directory() function (see ?oe_download_directory for details).
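If you want downloads to persist between R sessions, one option is to set the OSMEXT_DOWNLOAD_DIRECTORY environment variable described in the osmextract documentation. The sketch below uses a placeholder directory under tempdir() so that it is safe to run anywhere; replace it with a permanent location of your choice.

```r
# Placeholder directory for illustration; replace with a permanent path
osm_dir = file.path(tempdir(), "osm")
dir.create(osm_dir, showWarnings = FALSE, recursive = TRUE)

# Tell osmextract where to download and translate extracts in this session
Sys.setenv(OSMEXT_DOWNLOAD_DIRECTORY = osm_dir)
Sys.getenv("OSMEXT_DOWNLOAD_DIRECTORY")
```

To make the setting permanent, the same variable can be added to your .Renviron file.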

We uploaded data for Poznan to github.com/Robinlovelace/opengeohub2023/releases. For future reference, you can access the 25 MB PBF file from https://github.com/Robinlovelace/opengeohub2023/releases/download/v2/bbbike_Poznan.osm.pbf

file.copy(poznan_file, ".")
msg_upload_github = glue::glue("gh release create v2 {poznan_file}")
system(msg_upload_github)

Note: the Poznan extract takes quite a while to download, so we will use a smaller extract for the rest of the session:

monaco_osm_points = oe_get("monaco", provider = "bbbike", layer = "points")
No exact match found for place = monaco and provider = bbbike. Best match is Moscow. 
Checking the other providers.
An exact string match was found using provider = geofabrik.
Downloading the OSM extract:

File downloaded!
Starting with the vectortranslate operations on the input file!
0...10...20...30...40...50...60...70...80...90...100 - done.
Finished the vectortranslate operations on the input file!
Reading layer `points' from data source 
  `/tmp/RtmpCveZdk/geofabrik_monaco-latest.gpkg' using driver `GPKG'
Simple feature collection with 3056 features and 10 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 7.408202 ymin: 43.51654 xmax: 7.500245 ymax: 43.75175
Geodetic CRS:  WGS 84
# ?oe_get
monaco_osm_lines = oe_get("monaco", provider = "bbbike", layer = "lines")
monaco_osm_mlines = oe_get("monaco", provider = "bbbike", layer = "multilinestrings")
monaco_osm_polygons = oe_get("monaco", provider = "bbbike", layer = "multipolygons")
monaco_osm_other = oe_get("monaco", provider = "bbbike", layer = "other_relations")

You can look at the files downloaded as follows:

f = list.files(oe_download_directory(), pattern = "monaco", full.names = TRUE)
f
[1] "/tmp/RtmpCveZdk/geofabrik_monaco-latest.gpkg"   
[2] "/tmp/RtmpCveZdk/geofabrik_monaco-latest.osm.pbf"
monaco = sf::read_sf(f[1])
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
automatically selected the first layer in a data source containing more than
one.
monaco = sf::read_sf(f[2], layer = "lines")

Let’s take a look at the size of each layer, in units of MB:

layers = list(monaco_osm_points, monaco_osm_lines, monaco_osm_mlines, monaco_osm_polygons, monaco_osm_other)
layer_names = c("points", "lines", "multilinestrings", "multipolygons", "other_relations")
# Size of each layer in memory, in MB
sizes_mb = sapply(layers, function(x) {
    round(object.size(x) / 1e6, 1)
})
# Number of features in each layer
n_features = sapply(layers, nrow)
size_df = data.frame(
  layer = layer_names,
  size_mb = sizes_mb,
  n_features = n_features,
  kb_per_feature = sizes_mb / n_features * 1e3
)
knitr::kable(size_df)
layer             size_mb  n_features  kb_per_feature
points                2.0        3056       0.6544503
lines                 2.7        3066       0.8806262
multilinestrings      1.1          61      18.0327869
multipolygons         2.2        1729       1.2724118
other_relations       0.3         100       3.0000000

Exercises

  • Create maps of the different layers using {tmap}, {ggplot2} or a mapping package of your choice.
  • Which layer is most interesting for your research?
  • Are there any phenomena that are represented in more than one layer and, if so, how might you combine them?
  • Take a look at the demo.Rmd file in the sfnetworks/OGH2023 repo and try to reproduce the results for Poznan.

Clipping an area of interest

Rather than reading in the entire extract, you can read in a subset…

pois_buffer_simple = sf::st_read("https://github.com/Robinlovelace/opengeohub2023/raw/main/pois_buffer_simple.geojson")
Reading layer `pois_buffer_simple' from data source 
  `https://github.com/Robinlovelace/opengeohub2023/raw/main/pois_buffer_simple.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 0 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 16.92972 ymin: 52.45306 xmax: 16.96174 ymax: 52.47165
Geodetic CRS:  WGS 84
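One way to read in only the features inside this polygon is to pass it to the boundary argument of oe_get(). The following is a sketch rather than a definitive recipe: it assumes a version of osmextract that supports the boundary and boundary_type arguments, needs internet access (or a previously downloaded Poznan extract), and the object name poznan_lines_clipped is chosen for illustration.

```r
library(osmextract)

# Boundary polygon (repeated here so the example is self-contained)
pois_buffer_simple = sf::read_sf("https://github.com/Robinlovelace/opengeohub2023/raw/main/pois_buffer_simple.geojson")

# Read only the lines inside the boundary: "clipsrc" clips features at the
# boundary, while the default "spat" keeps every feature that intersects it
poznan_lines_clipped = oe_get(
  "Poznan",
  provider = "bbbike",
  layer = "lines",
  boundary = pois_buffer_simple,
  boundary_type = "clipsrc"
)
plot(sf::st_geometry(poznan_lines_clipped))
```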

Other command line tools for working with OSM data

pyrosm

Install the Python package pyrosm as follows:

pip install pyrosm

Search for Poznan in extracts available from pyrosm as follows (note: this fails for me currently as documented in github.com/HTenkanen/pyrosm/issues/217):

import pyrosm
from pyrosm import OSM
import geopandas as gpd

poznan_file = pyrosm.get_data("Poznan")
osm = OSM(poznan_file)
poznan_cycling = osm.get_network(network_type="cycling")
poznan_cycling.plot()

osmnx

Install the Python package osmnx as follows:

pip install osmnx

Load the packages and geocode the boundary of Poznan:

import osmnx as ox
import pandas as pd
import geopandas as gpd
# Get the boundary polygon for Poznan
poznan_polygon = ox.geocode_to_gdf("Poznan, Poland")
poznan_polygon.plot();

That is quite a big area, so let’s read a smaller polygon from GitHub and use that to get a smaller network:

# Get data from https://github.com/Robinlovelace/opengeohub2023/raw/main/pois_buffer_simple.geojson:
poznan_small = gpd.read_file("https://github.com/Robinlovelace/opengeohub2023/raw/main/pois_buffer_simple.geojson")
poznan_small.plot();

Download the cycling network as follows:

G_cycle = ox.graph_from_polygon(poznan_small.geometry[0], network_type="bike")

Plot the results:

ox.plot_graph(G_cycle)

Get basic stats as follows:

area = ox.project_gdf(poznan_small).unary_union.area
stats = ox.basic_stats(G_cycle, area=area)
pd.Series(stats)
n                                                                             568
m                                                                            1467
k_avg                                                                    5.165493
edge_length_total                                                      101250.782
edge_length_avg                                                         69.018938
streets_per_node_avg                                                     2.707746
streets_per_node_counts                 {0: 0, 1: 126, 2: 1, 3: 355, 4: 85, 5: 1}
streets_per_node_proportions    {0: 0.0, 1: 0.22183098591549297, 2: 0.00176056...
intersection_count                                                            442
street_length_total                                                     51276.381
street_segment_count                                                          749
street_length_avg                                                       68.459788
circuity_avg                                                             1.041883
self_loop_proportion                                                      0.00267
node_density_km                                                         178.71231
intersection_density_km                                                139.068382
edge_density_km                                                      31856.973767
street_density_km                                                    16133.310698
dtype: object
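Several of these statistics can be cross-checked with simple arithmetic, shown here in R for consistency with the rest of the session. The relationships used (average degree as 2m/n, and densities expressed per square kilometre of the area passed to basic_stats()) are our reading of the output rather than a statement of how osmnx is implemented; they reproduce the printed values to the precision shown.

```r
# Values copied from the ox.basic_stats() output above
n = 568                         # number of nodes
m = 1467                        # number of edges
edge_length_total = 101250.782  # metres
node_density_km = 178.71231     # nodes per square km

# Average node degree and average edge length
k_avg = 2 * m / n
edge_length_avg = edge_length_total / m

# Area implied by the node density, and the resulting edge density
area_km2 = n / node_density_km
edge_density_km = edge_length_total / area_km2

c(k_avg = k_avg, edge_length_avg = edge_length_avg,
  area_km2 = area_km2, edge_density_km = edge_density_km)
```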

We can convert the graph into ‘GeoDataFrames’ of nodes and edges as follows:

# graph_to_gdfs() returns a tuple of (nodes, edges) by default
nodes_gdf, edges_gdf = ox.graph_to_gdfs(G_cycle)
edges_gdf.plot();

osmium

Download the bbbike_Poznan.osm.pbf file, e.g. as follows:

if [ ! -f bbbike_Poznan.osm.pbf ]; then
    wget https://github.com/Robinlovelace/opengeohub2023/releases/download/v2/bbbike_Poznan.osm.pbf
fi

Extract only ways that are tagged as highway=cycleway as follows:

osmium tags-filter bbbike_Poznan.osm.pbf w/highway=cycleway -o poznan_cycleways.osm.pbf --overwrite

We can check that the operation worked with R or Python, e.g. as follows:

poznan_cycleways = sf::read_sf("poznan_cycleways.osm.pbf", layer = "lines")
plot(poznan_cycleways$geometry)
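The osmium step can also be run from within R, which keeps the whole workflow in one script. This is a minimal sketch using base R only: it assumes osmium is installed and on the PATH and that the input file is in the working directory, and it skips the step with a message otherwise. The object names pbf_in and pbf_out are chosen for illustration.

```r
pbf_in = "bbbike_Poznan.osm.pbf"
pbf_out = "poznan_cycleways.osm.pbf"

# Sys.which() returns an empty string if osmium is not on the PATH
osmium_path = Sys.which("osmium")

if (nzchar(osmium_path) && file.exists(pbf_in)) {
  # Same command as above, with each argument passed as a separate string
  system2(osmium_path, c("tags-filter", pbf_in, "w/highway=cycleway", "-o", pbf_out, "--overwrite"))
} else {
  message("osmium or the input file was not found; skipping the filter step")
}
```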

Exercises

  1. Use the query argument of oe_get() to download only cycleways from the bbbike extract of Poznan (note: you may need to use force_vectortranslate = TRUE to update the data in the .gpkg file).
  2. Develop a more sophisticated query to get an active travel network for Poznan
    • Bonus: add highways that are not footways/cycleways etc to ‘fill the gaps’ in the network generated in the previous step.
  3. Bonus: generate a simple measure of walkability or cyclability, perhaps with reference to the academic literature on ‘LTS’, the results shown in acteng/cyclability, or other source.
    • Bonus bonus: where would you prioritise new active travel infrastructure in Poznan based on these results?

Ideas for using OSM data

Take a look at the a-b-street/osm2streets repo and web app. Download .geojson files representing 2D extrusion of the network and explore in R, Python or a tool of your choice.

Test out the A/B Street game.