import json
import numpy as np
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import dendrogram
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
Guide
This notebook gives a basic overview of how to interact with the data product. See Data for an overview of its structure.
Required dependencies
A list of Python packages required to run this notebook:
numpy
pandas
scipy
geopandas
matplotlib
pyarrow
Set the path to the files. This assumes a local copy of the data product. Zenodo allows direct access to cloud-optimized files like these, but geopandas.read_parquet currently does not support reading from HTTPS.
data_product_dir = './data_product'
Plotting building data
The results from the clustering are assigned to buildings and are stored in {NUTS1}.parquet. The data has the following columns:
- geometry: the geometry of the building.
- morphotope_id: the morphotope the building falls into.
- initially_noise: whether the morphotope and the building were initially noise and assigned to a cluster later.
- level_1_label: the level 1 classification of the building, corresponding to a cut of the hierarchy at 2500.
- level_2_label: the level 2 classification of the building, corresponding to a cut of the hierarchy at 1600.
- level_3_label: the level 3 classification of the building, corresponding to a cut of the hierarchy at 1000.
- level_4_label: the level 4 classification of the building, corresponding to a cut of the hierarchy at 700.
- level_5_label: the level 5 classification of the building, corresponding to a cut of the hierarchy at 500.
- level_6_label: the level 6 classification of the building, corresponding to a cut of the hierarchy at 300.
- level_7_label: the level 7 classification of the building, corresponding to a cut of the hierarchy at 200.
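The correspondence between label columns and cut heights listed above can be kept in a small lookup table. The following is a minimal sketch; `LEVEL_CUTS` and `column_for_cut` are hypothetical names introduced here for illustration and are not part of the data product.

```python
# Cut heights of the hierarchy for each label column, as listed above.
LEVEL_CUTS = {
    "level_1_label": 2500,
    "level_2_label": 1600,
    "level_3_label": 1000,
    "level_4_label": 700,
    "level_5_label": 500,
    "level_6_label": 300,
    "level_7_label": 200,
}


def column_for_cut(cut):
    """Return the label column stored for a given cut height, or None."""
    for column, height in LEVEL_CUTS.items():
        if height == cut:
            return column
    return None


print(column_for_cut(1000))  # level_3_label
print(column_for_cut(1200))  # None, this cut is not stored in the files
```

A cut that is not in the table has to be derived from the taxonomic tree instead.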
You can quickly read a spatial subset of the file by passing a bounding box in the EPSG:3035 projection.
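A bounding box is a tuple of (minx, miny, maxx, maxy) in the units of the projection, which for EPSG:3035 are metres. As a rough sketch, a square box around a point of interest could be built as follows; `bbox_around` is a hypothetical helper and the coordinates are illustrative only.

```python
def bbox_around(x, y, half_size):
    """Return (minx, miny, maxx, maxy) for a square centred on (x, y).

    All values are in metres, the unit of the EPSG:3035 projection.
    """
    return (x - half_size, y - half_size, x + half_size, y + half_size)


# Illustrative point; substitute coordinates of your own area of interest.
bbox = bbox_around(4635000, 3008000, 1000)
print(bbox)  # (4634000, 3007000, 4636000, 3009000)
```

The resulting tuple can be passed as the bbox argument when reading the file.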
%%time
buildings_subset = gpd.read_parquet(f'{data_product_dir}/cz0.parquet', bbox=(4633248, 3007112, 4636735, 3009478))
CPU times: user 749 ms, sys: 485 ms, total: 1.23 s
Wall time: 407 ms
This read a portion of the data from the CZ0 NUTS1 region (Czechia).
buildings_subset.head(2)
 | geometry | morphotope_id | initially_noise | level_1_label | level_2_label | level_3_label | level_4_label | level_5_label | level_6_label | level_7_label |
---|---|---|---|---|---|---|---|---|---|---|
69333_174961 | POLYGON ((4633241.84 3007139.255, 4633252.999 ... | 69333_849_709 | False | 1 | 1 | 1 | 1 | 2 | 4 | 6 |
69333_174926 | POLYGON ((4633264.399 3007127.003, 4633266.132... | 69333_849_709 | False | 1 | 1 | 1 | 1 | 2 | 4 | 6 |
See what it looks like on a map, using the first bifurcation of the taxonomy.
buildings_subset.plot(column='level_1_label', categorical=True)
Plotting morphotope delineation
The results from the morphotope delineation are stored in {NUTS1}_morphotopes.parquet. The data has the following columns:
- geometry: the geometry of the morphotope.
- morphotope_id: the identifier of the morphotope.
- initially_noise: whether the morphotope was initially noise and assigned to a cluster later.
- level_1_label: the level 1 classification of the morphotope, corresponding to a cut of the hierarchy at 2500.
- level_2_label: the level 2 classification of the morphotope, corresponding to a cut of the hierarchy at 1600.
- level_3_label: the level 3 classification of the morphotope, corresponding to a cut of the hierarchy at 1000.
- level_4_label: the level 4 classification of the morphotope, corresponding to a cut of the hierarchy at 700.
- level_5_label: the level 5 classification of the morphotope, corresponding to a cut of the hierarchy at 500.
- level_6_label: the level 6 classification of the morphotope, corresponding to a cut of the hierarchy at 300.
- level_7_label: the level 7 classification of the morphotope, corresponding to a cut of the hierarchy at 200.
You can quickly read a spatial subset of the file by passing a bounding box in the EPSG:3035 projection.
%%time
morphotope_subset = gpd.read_parquet(f'{data_product_dir}/cz0_morphotopes.parquet', bbox=(4633248, 3007112, 4636735, 3009478))
CPU times: user 1.09 s, sys: 298 ms, total: 1.39 s
Wall time: 1.29 s
This read the same spatial portion of Czech data.
morphotope_subset.head(2)
 | geometry | morphotope_id | initially_noise | level_1_label | level_2_label | level_3_label | level_4_label | level_5_label | level_6_label | level_7_label |
---|---|---|---|---|---|---|---|---|---|---|
2690 | MULTIPOLYGON (((4632562.899 3007004.802, 46325... | 69333_849_709 | False | 1 | 1 | 1 | 1 | 2 | 4 | 6 |
4461 | MULTIPOLYGON (((4633261.396 3007177.453, 46332... | 69333_849_-1621 | True | 2 | 4 | 8 | 17 | 24 | 55 | 101 |
These align with the labels allocated to buildings.
ax = morphotope_subset.plot(column='level_6_label', categorical=True, alpha=.5)
buildings_subset.plot(column='level_6_label', categorical=True, ax=ax)
Data
The data product also contains the building-level morphometric attributes. The Parquet files can be linked via their index and should be aligned.
Building-level data
data = pd.read_parquet(f"{data_product_dir}/cz0_data.parquet")
data.head(2)
 | sdbAre | sdbPer | sdbCoA | ssbCCo | ssbCor | ssbSqu | ssbERI | ssbElo | ssbCCM | ssbCCD | ... | mibElo | mibERI | mibCCo | mibLAL | mibFR | mibSCo | micBAD | licBAD | misBAD | midBAD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
51289_956483 | 124.380966 | 45.522812 | 0.0 | 0.513258 | 5.0 | 18.040710 | 1.012931 | 0.596085 | 7.803086 | 1.393862 | ... | 0.356651 | 1.002488 | 0.404339 | 27.314951 | 3.396414 | 3.396414 | 48.194645 | 42.634842 | 167.245316 | 122.123451 |
51289_956636 | 11.252568 | 17.124670 | 0.0 | 0.272640 | 3.0 | 30.028931 | 0.855039 | 0.428266 | 3.320516 | 0.805084 | ... | 0.356651 | 1.002488 | 0.404339 | 27.314951 | 3.396414 | 3.396414 | 79.993858 | 42.634842 | 167.245316 | 122.123451 |
2 rows × 76 columns
Once the data is joined, you can easily visualise it.
joined = buildings_subset.join(data, how='left')
joined.plot('mtbSWR')
Morphotope-level data
For each non-noise morphotope, we release aggregated morphometric values for a direct comparison.
morphotope_data = pd.read_parquet(f'{data_product_dir}/morphotope_data.parquet')
morphotope_data.head(2)
 | sdbAre | sdbPer | sdbCoA | ssbCCo | ssbSqu | ssbERI | ssbElo | libNCo | ldbPWL | ltcBuA | ... | mibElo | mibERI | mibCCo | mibLAL | mibFR | mibSCo | micBAD | midBAD | limLPS | limAre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4_0_1 | 0.987137 | 1.433775 | -0.003262 | -1.949620 | -0.424706 | 0.483494 | -2.019151 | -0.08234 | 0.208525 | 0.081287 | ... | -1.315770 | 0.462303 | -0.613538 | 0.493512 | 1.521773 | 1.521773 | 3.654263 | 4.850005 | 0.149898 | 3.887056 |
4_0_10 | 0.999498 | 1.196410 | -0.003262 | -0.789406 | -0.342916 | 0.486163 | -1.535531 | -0.08234 | 0.182541 | -0.233554 | ... | -0.153381 | 0.395069 | 0.056334 | 0.486932 | 1.216988 | 1.216988 | 1.178967 | 1.099998 | 0.149898 | 0.893068 |
2 rows × 58 columns
This table can be joined to the morphotope geometry by matching the index of the data with the morphotope_id column. The same key can be used to join the data to buildings.
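As a minimal sketch of such a join, the example below uses small toy tables in place of the real files. The IDs and values are made up for illustration; only the column name morphotope_id and the use of the data index as the key follow the description above.

```python
import pandas as pd

# Toy stand-in for the morphotope (or building) table; real tables hold geometries.
morphotopes = pd.DataFrame(
    {"morphotope_id": ["4_0_1", "4_0_10", "4_0_-5"], "level_1_label": [1, 1, 2]}
)
# Toy stand-in for the morphotope-level data, indexed by morphotope ID.
values = pd.DataFrame({"sdbAre": [0.99, 1.00]}, index=["4_0_1", "4_0_10"])

# Join the morphotope_id column of the caller against the index of the data.
joined = morphotopes.join(values, on="morphotope_id", how="left")
print(joined)
```

Rows whose ID has no entry in the data, such as noise morphotopes, keep missing values in the joined columns.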
Taxonomic tree
The data for the taxonomic tree (ward_linkage_10.npy) is in the scipy.cluster.hierarchy.linkage format.
The input data for the taxonomy is a subset of morphotopes and their characteristics. Only extreme outliers were dropped from the full morphotope list when producing the taxonomy.
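For orientation, a linkage matrix in this format has one row per merge and four columns: the two merged cluster indices, the merge distance, and the number of observations in the new cluster. The sketch below builds one from random toy data; the Ward method is an assumption based on the name of the released file.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy observations standing in for morphotope characteristics: two tight groups.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])

# Ward linkage, assumed here from the file name of the released matrix.
Z = linkage(toy, method="ward")

# One row per merge: [cluster_a, cluster_b, merge_distance, n_observations].
print(Z.shape)  # (9, 4): n - 1 merges for n = 10 observations
print(Z[-1])    # the final merge joins all 10 observations

# Cutting at half the final merge distance separates the two groups.
labels = fcluster(Z, t=Z[-1, 2] / 2, criterion="distance")
print(np.unique(labels))
```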
linkage_matrix = np.load(f'{data_product_dir}/ward_linkage.npy')
Since the linkage matrix is large, covering around 500,000 observations, we do not plot labels.
fig, ax = plt.subplots(figsize=(10, 10))
_ = dendrogram(linkage_matrix, ax=ax, no_labels=True)
You can use fcluster to cut the tree at arbitrary heights.
cutoffs = [1000, 1600, 2500]

cluster_frame = []
for c in cutoffs:
    clusters = fcluster(linkage_matrix, t=c, criterion='distance')
    cluster_frame.append(pd.Series(clusters))
cluster_frame = pd.concat(cluster_frame, axis=1)
cluster_frame.columns = cutoffs
cluster_frame.index = morphotope_data.index
cluster_frame
 | 1000 | 1600 | 2500 |
---|---|---|---|
4_0_1 | 1 | 1 | 1 |
4_0_10 | 1 | 1 | 1 |
4_0_100 | 7 | 4 | 2 |
4_0_101 | 7 | 4 | 2 |
4_0_102 | 1 | 1 | 1 |
... | ... | ... | ... |
152511_809_0 | 4 | 2 | 1 |
152511_8_0 | 3 | 2 | 1 |
152511_8_1 | 5 | 2 | 1 |
152511_8_2 | 5 | 2 | 1 |
152511_901_0 | 4 | 2 | 1 |
519259 rows × 3 columns
Linking the new cuts to the buildings
To plot the new clusters, you have to link them back to the buildings using the morphotope ID. Some buildings are not part of morphotopes, or are part of outlier morphotopes. They come out as NA in the mapping and have to be handled separately.
new_clusters = buildings_subset.morphotope_id.map(cluster_frame[1000].to_dict())
buildings_subset.plot(column=new_clusters.fillna(-1), categorical=True)
In this case, the cut at 1000 already exists in the data as level_3_label, so we can verify the results.
from pandas.testing import assert_series_equal
# drop outliers and noise and compare only morphotopes that are part of the clustering
new_assignment = new_clusters.dropna().astype(int)
old_assignment = buildings_subset.loc[~buildings_subset.initially_noise, 'level_3_label']

assert_series_equal(new_assignment, old_assignment, check_names=False)
Handling noise and outliers
The easiest way to deal with outliers is to use the buildings Parquet file, its initially_noise column, and the most detailed existing level (level_7_label) as a bridge to the custom cut.
mapping_level = 'level_7_label'
Create a mapping from the old assignments at the specified level to the new cluster values.
new_assignment_mapping = new_assignment.drop_duplicates()
new_assignment_mapping = pd.Series(new_assignment_mapping.values, buildings_subset.loc[new_assignment_mapping.index, mapping_level])
new_assignment_mapping
level_7_label
6 1
77 7
102 8
73 6
18 2
47 5
40 4
dtype: int64
new_noise = new_clusters[new_clusters.isna()].index
new_clusters.loc[new_noise] = buildings_subset.loc[new_noise, mapping_level].map(new_assignment_mapping)
buildings_subset.plot(column=new_clusters, categorical=True)
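The logic of this step can be traced on a toy table. The sketch below uses made-up IDs and labels, and it assumes the custom cut is coarser than level 7 (a height above 200), so that every level 7 label falls into exactly one custom cluster.

```python
import pandas as pd

# Toy buildings: two belong to morphotopes in the tree, one was initially noise.
buildings = pd.DataFrame(
    {
        "morphotope_id": ["a_0_1", "a_0_2", "a_0_-1"],
        "level_7_label": [6, 18, 6],
    },
    index=["b1", "b2", "b3"],
)
# Custom cut, available only for morphotopes that are part of the tree.
custom_cut = {"a_0_1": 1, "a_0_2": 2}

new_clusters = buildings["morphotope_id"].map(custom_cut)  # b3 becomes NaN

# Bridge: which custom cluster does each level 7 label belong to?
known = new_clusters.dropna().astype(int)
bridge = pd.Series(known.values, index=buildings.loc[known.index, "level_7_label"])
bridge = bridge[~bridge.index.duplicated()]

# Fill the missing buildings through their level 7 label.
missing = new_clusters[new_clusters.isna()].index
new_clusters.loc[missing] = buildings.loc[missing, "level_7_label"].map(bridge)
print(new_clusters.astype(int).to_dict())  # {'b1': 1, 'b2': 2, 'b3': 1}
```

Building b3 inherits cluster 1 because it shares level 7 label 6 with b1.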
Naming clusters & branches
The data product also contains full names and pen portraits for the first three levels of the taxonomy.
with open(f"{data_product_dir}/label_name.json", 'r') as f:
    cluster_names = json.load(f)
cluster_names
{'1': {'1': 'Incoherent Fabric', '2': 'Coherent Fabric'},
'2': {'1': 'Incoherent Large-Scale Fabric',
'2': 'Incoherent Small-Scale Fabric',
'3': 'Coherent Interconnected Fabric',
'4': 'Coherent Dense Fabric'},
'3': {'1': 'Incoherent Large-Scale Homogeneous Fabric',
'2': 'Incoherent Large-Scale Heterogeneous Fabric',
'3': 'Incoherent Small-Scale Linear Fabric',
'4': 'Incoherent Small-Scale Sparse Fabric',
'5': 'Incoherent Small-Scale Compact Fabric',
'6': 'Coherent Interconnected Fabric',
'7': 'Coherent Dense Disjoint Fabric',
'8': 'Coherent Dense Adjacent Fabric'}}
The first-level keys identify the level of the taxonomy; the second-level keys map each label at that level to its name.
with open(f"{data_product_dir}/pen_portraits.json", 'r') as f:
    cluster_descriptions = json.load(f)
cluster_descriptions
{'1': {'Incoherent Fabric': 'Incoherent fabric covers a wide morphological variety, with a common theme of partial or complete breakage of the traditional structural roles of streets, plots, and buildings. Common for modernist period, post-modern, and industrial developments, this branch has less typically less connected street networks and may showcase buildings facing open spaces and internal parts of blocks rather than streets. At the same time, it contains less defined village developments.',
'Coherent Fabric': 'In coherent fabric, all streets, plots, and buildings take their traditional structural roles in defining the spatial arrangement of the urban form. It is common for traditional European development with densely connected street networks and legible plot structure, facilitating direct relation between buildings and streets.'},
'2': {'Incoherent Large-Scale Fabric': 'Incoherent large-scale fabric captures typically urban development composed of buildings larger than the average, that may or may not be far from each other, creating large open spaces. Streets tend to be of an utilitarian use, rather than a structural one, typical for modernist housing estates or industrial zones.',
'Incoherent Small-Scale Fabric': 'Incoherent small-scale fabric is mostly non-urban development capturing various kinds of villages and small towns, which show high variation of morphological properties. Buildings tend to be smaller, but distances between them vary, as well as the relations between buildings and streets.',
'Coherent Interconnected Fabric': 'Coherent interconnected fabric is typical for historical city and town centres, where buildings form intensive development. In this branch, the built-up density and local street connectivity are high, while inter-building distances remain relatively small. Buildings frequently share walls, forming larger structures with courtyards along relatively short and narrow streets.',
'Coherent Dense Fabric': 'Coherent dense fabric captures morphology typical for urban residential areas with lower density, where blocks are defined by streets more than buildings. The street networks are well defined and connected with buildings being either adjacent (e.g. row houses) or disjoint (e.g. urban villas).'},
'3': {'Incoherent Large-Scale Homogeneous Fabric': 'Incoherent large-scale homogeneous fabric consists of the large buildings with moderate variations in size and shape, as well as low to moderate street connectivity and wide streets. The resulting environment is spacious, with significant open areas between structures, typical of modernist housing, with areas showing a relatively high degree of homogeneity caused by underlying planning principles.',
'Incoherent Large-Scale Heterogeneous Fabric': 'Incoherent large-scale heterogeneous fabric consists of the largest buildings with notable variations in size and shape, as well as low to moderate street connectivity and wide streets. The design does not emphasise sunlight exposure, creating broad but less refined configurations, typical of industrial and other service areas.',
'Incoherent Small-Scale Linear Fabric': 'Incoherent small-scale linear fabric has a moderate built-up area and low local street connectivity, typically forming long linear villages. Its streets are long, linear, wide, and there are minimal shared walls between structures.',
'Incoherent Small-Scale Sparse Fabric': 'Incoherent small-scale sparse fabric is characterised by low built-up density, low street connectivity, large distances between buildings, few shared walls, and large open spaces around buildings. The streets are few, open, and wide. The buildings are small to moderate in size, and their layout is more typical of rural areas.',
'Incoherent Small-Scale Compact Fabric': 'Incoherent small-scale compact fabric has low to moderate built-up area and street connectivity. Buildings exhibit a consistent alignment among themselves and also along streets of varying length, width, and linearity. There is also a significant number of shared walls between structures, typical for more traditional villages.',
'Coherent Interconnected Fabric': 'Coherent interconnected fabric is typical for historical city and town centres, where buildings form intensive development. In this branch, the built-up density and local street connectivity are high, while inter-building distances remain relatively small. Buildings frequently share walls, forming larger structures with courtyards along relatively short and narrow streets.',
'Coherent Dense Disjoint Fabric': 'The coherent dense disjoint fabric has moderate to high built-up density and local street connectivity, with longer and wider streets compared to other dense developments. Shared walls between buildings are less common, and distances within buildings are moderate, reflecting a pattern of standalone structures within a robust street network.',
'Coherent Dense Adjacent Fabric': 'In coherent dense adjacent fabric, the built-up density and local street connectivity are high, while inter-building distances remain relatively small. Buildings frequently share walls, forming larger structures along relatively short and narrow streets.'}}
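The two dictionaries can be chained to go from a numeric label to its pen portrait. The sketch below works on a shortened excerpt of the structures shown above, with the descriptions truncated; `describe` is a hypothetical helper, not part of the data product.

```python
# Shortened excerpt of the two JSON structures shown above.
cluster_names = {"1": {"1": "Incoherent Fabric", "2": "Coherent Fabric"}}
cluster_descriptions = {
    "1": {
        "Incoherent Fabric": "Incoherent fabric covers a wide morphological variety, ...",
        "Coherent Fabric": "In coherent fabric, all streets, plots, and buildings ...",
    }
}


def describe(level, label):
    """Return (name, pen portrait) for a numeric label at a given level."""
    # JSON object keys are always strings, so cast the integer labels first.
    name = cluster_names[str(level)][str(label)]
    return name, cluster_descriptions[str(level)][name]


name, portrait = describe(1, 2)
print(name)  # Coherent Fabric
```

The cast to string matters because the label columns in the Parquet files hold integers, while the JSON keys are strings.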