Exploring Large-scale Raster Population Data

Image by the author.

Visualizing geospatial population data across multiple scales using Python: global, country, and urban-level data

I have often seen beautiful population maps circulating online; however, I usually got stuck at some technical parts, like visualizing other map segments than shown in the tutorial or turning the large-scale raster data into more computation-friendly vector formats. I overcome some of these shortcomings in this article with a hands-on guide to two primary global population data sources.

It is also important to note that besides their aesthetic value, population data and maps showing them are amongst the most basic information and valuable information one can gather and incorporate for any urban development or location intelligence task. They come in particularly handy in use cases such as planning new amenities, site selection and catchment analysis, estimating the scale of urban products, or profiling different neighborhoods, just to mention a few.

1. Data sources

I am relying on the following two fine-grained population estimate data sources, which you can download the files via the attached links (at the date of the publication):

2. Visualizing the Global Human Settlement Layer

2.1. Import the data!

I first came across with this dataset in the Datashader tutorial of the Architecture Performance. After reproducing their visualization, I hit a few roadblocks by extending it to a global map, which initiated this work, so now I show you the workarounds I found on how to do that!

First, parse the raster file using the xarray package.

import rioxarray

file_path = "GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0/GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0.tif"

data_array = rioxarray.open_rasterio(file_path, chunks=True, lock=False)
data_array

The output of this cell is a detailed description of the data set:

2.2. Visualizing segments of the data

We can already see this is a pretty challenging amount of data for most of the standard laptops. Anyway, let’s try to visualize it using Datashader, a really handy tool for geospatial datasets of this magnitude:

# WARNING: THIS CODE BLOCK WILL MOST LIKELY CAUSE A MEMORY OVERFLOW ERROR

import datashader as ds
import xarray as xr
from colorcet import palette
from datashader import transfer_functions as tf

# prepare to plot
data_array_p = xr.DataArray(data_array)[0]
data_array_p = data_array_p.where(data_array_p > 0)
data_array_p = data_array_p.compute()

# get the image size
size = 1200
asp = data_array_p.shape[0] / data_array_p.shape[1]

# create the data shader canvas
cvs = ds.Canvas(plot_width=size, plot_height=int(asp*size))
raster = cvs.raster(data_array_p)

# draw the image
cmap = palette["fire"]
img = tf.shade(raster, how="eq_hist", cmap=cmap)
img

While this code looks technically OK, my 2021 M1 Macbook Pro with 16GB of RAM gives a miserable memory overflow error. So, let’s crop the image to look into the data! For this, I follow the Architecture Performance and focus on Europe, which I follow for the time being, as it certainly works.

However, the main question I will answer later is how can we visualize the entire globe’s data despite such memory limitations but still using our local machine? Wait for it!

import datashader as ds
import xarray as xr
from colorcet import palette
from datashader import transfer_functions as tf
import numpy as np

# crop the data array
data_array_c = data_array.rio.clip_box(minx=-1_000_000.0, miny=4_250_000.0, maxx=2_500_000.0, maxy=7_750_000.0)
data_array_c = xr.DataArray(data_array_c)

# prepare to plot
data_array_c = xr.DataArray(data_array_c)[0]
data_array_c = data_array_c.where(data_array_c > 0)
data_array_c = data_array_c.compute()
data_array_c = np.flip(data_array_c, 0)

# get the image size
size = 1200
asp = data_array_c.shape[0] / data_array_c.shape[1]

# create the data shader canvas
cvs = ds.Canvas(plot_width=size, plot_height=int(asp*size))
raster = cvs.raster(data_array_c)

# draw the image
cmap = palette["fire"]
img = tf.shade(raster, how="eq_hist", cmap=cmap)
img = tf.set_background(img, "black")

img

This code block outputs the following visual:

Population distribution in Europe. Image by the author.

Using the ‘fire’ colormap seems like an industry standard for a good reason; however, if you wish to mix things up, you may find other coloring schemes here and applied below:

# create the data shader canvas
cvs = ds.Canvas(plot_width=size, plot_height=int(asp*size))
raster = cvs.raster(data_array_c)

# draw the image
cmap = palette["bmw"]
img = tf.shade(raster, how="eq_hist", cmap=cmap)
img = tf.set_background(img, "black")

img

This code block outputs the following visual:

Population distribution in Europe. Image by the author.

2.3. Visualizing the entire globe

So the data is there, but what if you have a regular computer at hand and still want to visualize the whole world at a resolution of 100m? The workaround I will show you here is relatively simple — I split the entire raster image into about a hundred smaller tiles so that my computer handles them well one by one, then use some image manipulation tricks to merge them into one image file.

However, before moving on — a quick note. There is also an option to downsample XArray arrays in the following way — however, I couldn’t find an appropriate downscaling that could have handled the whole data set. Additionally, I didn’t want to lose accuracy and wanted to see the entire data set in full glory.

# a quick way to down-sample the data
downsampling_factor = 20

downsampled_data_array = data_array.coarsen(x=downsampling_factor, y=downsampling_factor).mean()
downsampled_data_array

And the output — which is worth contracting to the previously plotted data_array:

To split the whole raster image into grid segments, first, get its boundaries and define N as the step size. Then, create the list of image segment boundaries.

minx = float(data_array.x.min().values)
maxx = float(data_array.x.max().values)
miny = float(data_array.y.min().values)
maxy = float(data_array.y.max().values)

N = 10
xstep = (maxx-minx) / N
ystep = (maxy-miny) / N

xsteps = list(np.arange(minx, maxx, xstep))
ysteps = list(np.arange(miny, maxy, ystep))

Now, iterate over each x and y step and create each image segment, where each image file is named after its position in the original grid. This loop may take a while.

import os
foldout = 'world_map_image_segments'
if not os.path.exists(foldout):
os.makedirs(foldout)

for idx_x, x_coord in enumerate(xsteps):
for idx_y, y_coord in enumerate(ysteps):

if not os.path.exists(foldout+'/'+str(idx_x)+'_'+str(idx_y)+'.png'):

data_array_c = data_array.rio.clip_box( minx=x_coord, miny=y_coord, maxx=x_coord+xstep, maxy=y_coord+ystep)
data_array_c = xr.DataArray(data_array_c)[0]
data_array_c = data_array_c.fillna(0)
data_array_c = data_array_c.where(data_array_c > 0)
data_array_c = data_array_c.compute()
data_array_c = np.flip(data_array_c, 0)

size = 2000
asp = data_array_c.shape[0] / data_array_c.shape[1]

cvs = ds.Canvas(plot_width=size, plot_height=int(asp*size))
raster = cvs.raster(data_array_c)

cmap = palette["fire"]
img = tf.shade(raster, how="eq_hist", cmap=cmap)
img = tf.set_background(img, "black")

pil_image = img.to_pil()
pil_image.save(foldout+'/'+str(idx_x)+'_'+str(idx_y)+ '.png')
print('SAVED: ', x_coord, y_coord, y_coord+xstep,y_coord+ystep)

Finally, if we have all the image segments, we can assemble them quickly with the following function. For this piece of code, I also asked for some hints from ChatGPT to speed things up, but as usual, it also needed some manual tweaks.

from PIL import Image


def find_dimensions(image_dir):
max_x = 0
max_y = 0

for filename in os.listdir(image_dir):
if filename.endswith(".png"):
x, y = map(int, os.path.splitext(filename)[0].split("_"))
max_x = max(max_x, x)
max_y = max(max_y, y)

return max_x + 1, max_y + 1


image_dir = foldout
segment_width = size
segment_height = int(asp*size)


# Determine the dimensions of the large image
large_image_width, large_image_height = find_dimensions(image_dir)

# Create an empty large image (white background)
large_image = Image.new("RGB", (large_image_width * segment_width, large_image_height * segment_height), "black")

# Loop through the individual image segments and paste them into the large image
for filename in sorted(os.listdir(image_dir)):
if filename.endswith(".png"):
x, y = map(int, os.path.splitext(filename)[0].split("_"))
segment_image = Image.open(os.path.join(image_dir, filename))
x_offset = x * segment_width
y_offset = large_image_height * segment_height-1*y * segment_height
large_image.paste(segment_image, (x_offset, y_offset))

# Save the merged large image
large_image.save("global_population_map.png")

And the final result here, the entire globe mapped out:

Global opulation distribution. Image by the author.

3. Visualizing and transforming WorldPop data

The second data source I would like to show you is the WorldPop population database, which has continents and countries separately at various resolutions. In this example, complementing the previous section’s continent and global levels, here I shoot for the level of countries and cities. For instance, I go with Germany and a resolution of 100m curated in 2020, and also show you how to carve out a city from the whole country and turn it into an easy-to-use vector format using GeoPandas.

3.1. Visualizing WorldPop data

Using the previous methods, we can again quickly visualize this raster file:

# parse the data
data_file = 'deu_ppp_2020_constrained.tif'
data_array = rioxarray.open_rasterio(data_file, chunks=True, lock=False)

# prepare the data
data_array = xr.DataArray(data_array)[0]
data_array = data_array.where(data_array > 0)
data_array = data_array.compute()
data_array = np.flip(data_array, 0)

# get the image size
size = 1200
asp = data_array.shape[0] / data_array.shape[1]

# create the data shader canvas
cvs = ds.Canvas(plot_width=size, plot_height=int(asp*size))
raster = cvs.raster(data_array)

# draw the image
cmap = palette["fire"]
img = tf.shade(raster, how="eq_hist", cmap=cmap)
img = tf.set_background(img, "black")
img

This code block outputs the following visual:

Population distribution of Germany. Image by the author.

3.2. Transforming WorldPop data

After visualizing the entire planet, the continent of Europe, and the country of Germany, I would like to get more hands-on with the city of Berlin and show you how to transform such raster data into vector format and manipulate it easily with GeoPandas. For this, I access the administrative boundaries of Berlin in a geojson format here.

This admin file contains the boroughs of Berlin, so first, I merge them into the city as a whole.

from shapely.ops import cascaded_union
import geopandas as gpd



admin = gpd.read_file('tufts-berlin-bezirke-boroughs01-geojson.json')
admin = gpd.GeoDataFrame(cascaded_union(admin.geometry.to_list()), columns = ['geometry']).head(1)

admin.plot()

This code block outputs the following visual:

The admin boundaries of Berlin. Image by the author.

Now turn the xarray into a Pandas DataFrame, extract the geometry information, and build a GeoPandas GeoDataFrame. One way to do this:

import pandas as pd

df_berlin = pd.DataFrame(data_array.to_series(), columns = ['population']).dropna()

Now, build a GeoDataFrame from this, focusing on Berlin:

from shapely.geometry import Point

# find the limiting bounding box for easier coodinate-selection
minx, miny, maxx, maxy = admin.bounds.T[0].to_list()

points = []
population = df_berlin.population.to_list()
indicies = list(df_berlin.index)

# create Point geometries from the points falling into this bounding box
geodata = []
for ijk, (lon, lat) in enumerate(indicies):
if minx <= lat <= maxx and miny <= lon <= maxy:
geodata.append({'geometry' : Point(lat, lon), 'population' : population[ijk]})

# build a GeoDataFrame
gdf_berlin = gpd.GeoDataFrame(geodata)
gdf_berlin = gpd.overlay(gdf_berlin, admin)

Then, visualize the population as vector data:

import matplotlib.pyplot as plt


f, ax = plt.subplots(1,1,figsize=(15,15))

admin.plot(ax=ax, color = 'k', edgecolor = 'orange', linewidth = 3)

gdf_berlin.plot(column = 'population',
cmap = 'inferno',
ax=ax,
alpha = 0.9,
markersize = 0.25)

ax.axis('off')
f.patch.set_facecolor('black')

This code block outputs the following visual:

Population distribution of Berlin. Image by the author.

Finally, here we have a standard GeoDataFrame with 100m resolution population levels assigned to each point geometry corresponding to each pixel in the raster file.

Summary

In this article, I explored two global population data sets based on various, which estimate population levels by combining various approximation, measurement, and modeling approaches at a remarkable spatial resolution of 100m using raster grids. This type of information is highly valuable for a wide range of applications in urban development and location intelligence, such as infrastructure planning, site selection, neighborhood profiling, and more. From the technical side, I showed examples at three spatial levels, covering the entire globe, then zooming in to countries, and finally, cities. While the methodology can handle even smaller resolutions, this all happened on a single laptop using powerful Python libraries like Xarray, DataShader, and GeoPandas.


Exploring Large-scale Raster Population Data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Related Articles

CycleGAN: Unpaired Image-to-Image Translation (Part 3)

Table of Contents CycleGAN: Unpaired Image-to-Image Translation (Part 3) Configuring Your Development Environment Need Help Configuring Your Development Environment? Project Structure Implementing CycleGAN Training Implementing Training Callback Implementing Data Pipeline and Model Training Perform Image-to-Image Translation Summary Citation Information CycleGAN:…
The post CycleGAN: Unpaired Image-to-Image Translation (Part 3) appeared first on PyImageSearch.

Interfaces for Explaining Transformer Language Models

Interfaces for exploring transformer language models by looking at input saliency and neuron activation. Explorable #1: Input saliency of a list of countries generated by a language model Tap or hover over the output tokens: Explorable #2: Neuron activation analysis reveals four groups of neurons, each is associated with generating a certain type of token Tap or hover over the sparklines on the left to isolate a certain factor: The Transformer architecture has been powering a number of the recent advances in NLP. A breakdown of this architecture is provided here . Pre-trained language models based on the architecture, in both its auto-regressive (models that use their own output as input to next time-steps and that process tokens from left-to-right, like GPT2) and denoising (models trained by corrupting/masking the input and that process tokens bidirectionally, like BERT) variants continue to push the envelope in various tasks in NLP and, more recently, in computer vision. Our understanding of why these models work so well, however, still lags behind these developments. This exposition series continues the pursuit to interpret and visualize the inner-workings of transformer-based language models. We illustrate how some key interpretability methods apply to transformer-based language models. This article focuses on auto-regressive models, but these methods are applicable to other architectures and tasks as well. This is the first article in the series. In it, we present explorables and visualizations aiding the intuition of: Input Saliency methods that score input tokens importance to generating a token. Neuron Activations and how individual and groups of model neurons spike in response to inputs and to produce outputs. The next article addresses Hidden State Evolution across the layers of the model and what it may tell us about each layer’s role.

Technology Roundtable

Technology Roundtable is an opportunity for technology architects in the technology industry to learn, innovate and collaborate with their peers. Roundtable members work together on industry priorities and general topics of interest and concern related to open source technology initiatives.