The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated
The saying goes that 80% of the data collected, stored, and maintained by governments can be associated with geographical locations. Although never empirically proven, the claim illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common big data compute engines, originally designed to scale for textual data, need adaptation to work efficiently with geospatial data — think of geographical indexes, partitioning, and operators. Here, I present and illustrate how to utilize the Microsoft Fabric Spark compute engine, with its natively integrated ESRI GeoAnalytics engine#, for geospatial big data processing and analytics.
The optional GeoAnalytics capabilities within Fabric enable the processing and analytics of vector-type geospatial data, where vector-type refers to points, lines, and polygons. These capabilities include more than 150 spatial functions to create geometries and to test and select on spatial relationships. As it extends Spark, the GeoAnalytics functions can be called from Python, SQL, or Scala. The spatial operations automatically apply spatial indexing, which makes the Spark compute engine efficient for this type of data as well. On top of the natively supported Spark data source formats, it can load and save data in 10 additional common spatial data formats. This blog post focuses on the scalable geospatial compute engines as introduced in my post about geospatial in the age of AI.
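To give a feel for the API before the demonstration: the sketch below creates point geometries from coordinate columns, buffers them, and tests a spatial relationship. This is a minimal, hedged example; the dataframe and column names are made up, and it is only meant to illustrate the general calling pattern of the geoanalytics_fabric Python module, not a verbatim excerpt from the documentation.
# Minimal sketch of calling spatial functions from Python, assuming a Spark
# session with the GeoAnalytics library enabled (as in a Fabric notebook)
from geoanalytics_fabric.sql import functions as ST
# Hypothetical dataframe with x,y coordinate columns
df = spark.createDataFrame([(5.12, 52.09), (4.89, 52.37)], ["x", "y"])
# Create point geometries, buffer them (units follow the geometry's spatial
# reference), and test a spatial relationship
df = (df.withColumn("point", ST.make_point("x", "y"))
        .withColumn("zone", ST.buffer("point", 100))
        .withColumn("inside", ST.contains("zone", "point")))
df.show()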
Demonstration explained
Here, I demonstrate some of these spatial capabilities through the data manipulation and analytics steps on a large dataset. Point cloud data grows enormous quickly: a handful of tiles already forms a very large dataset while still covering a relatively small area. The open Dutch AHN dataset, a national digital elevation and surface model, is currently in its fifth update cycle and spans a period of nearly 30 years. Here, the data from the second, third, and fourth acquisitions is used, as these hold full national coverage; the first version did not include a point cloud release.
Another Dutch open dataset, the BAG building data, is used to illustrate spatial selection. This dataset contains building footprints as polygons and currently holds more than 11 million buildings. To test the spatial functions, I use only 4 AHN tiles per AHN version: 12 tiles in total, each of 5 x 6.25 km, amounting to more than 3.5 billion points within an area of 125 square kilometers. The chosen area covers the municipality of Loppersum, an area prone to land subsidence due to gas extraction.
The steps taken include selecting the buildings within the area of Loppersum and selecting the x,y,z-points from the roofs of those buildings. Then the 3 datasets are brought into one dataframe for a further analysis: a spatial regression to predict the expected height of a building based on its own height history as well as that of the buildings in its direct surroundings. This is not necessarily the best analysis to perform on this data to come to actual predictions*, but it serves the purpose of demonstrating the spatial processing capabilities of Fabric's ESRI GeoAnalytics. All code snippets below are also available as notebooks on GitHub.
Step 1: Read data
Spatial data can come in many different formats; for further processing I conform to the geoparquet format. The BAG building data, both the footprints and the accompanying municipality boundaries, already comes in geoparquet format. The point cloud AHN data (versions 2, 3, and 4), however, comes as LAZ files, a compressed industry-standard format for point clouds. As I have not found a Spark library to read LAZ, I first created a text file from it, separately, with LAStools+.
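Such a conversion can be scripted. The sketch below shows one plausible way to call LAStools' las2txt from Python; the file paths are placeholders, and the exact invocation should be checked against the LAStools documentation.
# Sketch: convert a LAZ tile to a plain-text x,y,z file with LAStools' las2txt
# Paths are placeholders; las2txt must be installed and on the PATH
import subprocess
subprocess.run(
    ["las2txt", "-i", "ahn4_tile.laz", "-o", "ahn4_tile.txt", "-parse", "xyz"],
    check=True,  # raise if the conversion fails
)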
# ESRI - FABRIC reference: /
# Import the required modules
import geoanalytics_fabric
from geoanalytics_fabric.sql import functions as ST
from geoanalytics_fabric import extensions

# Read AHN file from OneLake
# AHN lidar data source: /
ahn_csv_path = "Files/AHN lidar/AHN4_csv"

# Note: the read options and the x,y,z column names below are reconstructed
# assumptions; the original arguments were lost in publishing
lidar_df = spark.read.options(delimiter=" ", header=False).csv(ahn_csv_path)
lidar_df = lidar_df.selectExpr("_c0 as x", "_c1 as y", "_c2 as z")

lidar_df.printSchema()
lidar_df.show()
lidar_df.count()
The above code snippet& provides the below results:
Now, with the spatial functions make_point and srid, the x,y,z columns are transformed into a point geometry, which is then set to the specific Dutch coordinate system (EPSG:28992); see the below code snippet&:
# Create point geometry from x,y,z columns and set the spatial reference system
# Note: reconstructed snippet; 28992 is the Dutch RD New system used by AHN
lidar_df = lidar_df.select(ST.make_point("x", "y", "z").alias("geometry"))
lidar_df = lidar_df.withColumn("geometry", ST.srid("geometry", 28992))

lidar_df.printSchema()
lidar_df.show()
Building and municipality data can be read with the extended spark.read function for geoparquet; see the code snippet&:
# Read building polygon data
path_building = "Files/BAG NL/BAG_pand_202504.parquet"
df_buildings = spark.read.format("geoparquet").load(path_building)

# Read woonplaats (place name) data
path_woonplaats = "Files/BAG NL/BAG_woonplaats_202504.parquet"
df_woonplaats = spark.read.format("geoparquet").load(path_woonplaats)

# Filter the DataFrame where the "woonplaats" column contains the string "Loppersum"
# Note: the column name is an assumption based on the BAG schema
df_loppersum = df_woonplaats.filter(df_woonplaats.woonplaats.contains("Loppersum"))
Step 2: Make selections
In the accompanying notebooks, I read and write intermediate results to geoparquet. To make sure the data is read back correctly as dataframes, see the following code snippet:
# Read building polygon data
path_building = "Files/BAG NL/BAG_pand_202504.parquet"
df_buildings = spark.read.format("geoparquet").load(path_building)

# Read woonplaats (place name) data
path_woonplaats = "Files/BAG NL/BAG_woonplaats_202504.parquet"
df_woonplaats = spark.read.format("geoparquet").load(path_woonplaats)

# Filter the DataFrame where the "woonplaats" column contains the string "Loppersum"
df_loppersum = df_woonplaats.filter(df_woonplaats.woonplaats.contains("Loppersum"))
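The corresponding write side is not shown in this post; as a minimal sketch, assuming a OneLake Files path, it could look like this:
# Sketch: persist a dataframe with geometries to geoparquet
# (the path and save mode are placeholders)
df_buildings.write.format("geoparquet").mode("overwrite").save(
    "Files/BAG NL/BAG_pand_202504_roi.parquet"
)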
With all data in dataframes, spatial selections become a simple step. The following code snippet& shows how to select the buildings within the boundaries of the Loppersum municipality, and separately selects the buildings that existed throughout the whole period. This resulted in 1196 buildings, out of the 2492 buildings that exist there today.
# Clip the BAG buildings to the gemeente Loppersum boundary
# Note: reconstructed call, following the geoanalytics_fabric.tools API
from geoanalytics_fabric.tools import Clip

df_buildings_roi = Clip().run(input_dataframe=df_buildings, clip_dataframe=df_loppersum)

# Select only buildings older than the AHN data (built in or before 2009)
# and with a status in use; column names and values are assumptions
# based on the BAG schema
df_buildings_roi_select = df_buildings_roi.where(
    (df_buildings_roi.bouwjaar <= 2009) & (df_buildings_roi.status == "Pand in gebruik")
)
The three AHN versions used, further referred to as T1, T2, and T3 respectively, are then clipped based on the selected building data. The AggregatePoints function can be utilized to calculate statistics from the height (z) values, in this case the mean per roof, the standard deviation, and the number of z-values each mean is based upon; see the code snippet:
# Select and aggregate lidar points from buildings within the ROI
# Note: reconstructed calls; the summary field names, statistics, and aliases
# are assumptions. A point count per polygon is included in the output by default.
from geoanalytics_fabric.tools import AggregatePoints

df_ahn2_result = AggregatePoints() \
    .setPolygons(df_buildings_roi_select) \
    .addSummaryField(summary_field="z", statistic="Mean", alias="z_mean_ahn2") \
    .addSummaryField(summary_field="z", statistic="StdDev", alias="z_stdev_ahn2") \
    .run(lidar_df_ahn2)

df_ahn3_result = AggregatePoints() \
    .setPolygons(df_buildings_roi_select) \
    .addSummaryField(summary_field="z", statistic="Mean", alias="z_mean_ahn3") \
    .addSummaryField(summary_field="z", statistic="StdDev", alias="z_stdev_ahn3") \
    .run(lidar_df_ahn3)

df_ahn4_result = AggregatePoints() \
    .setPolygons(df_buildings_roi_select) \
    .addSummaryField(summary_field="z", statistic="Mean", alias="z_mean_ahn4") \
    .addSummaryField(summary_field="z", statistic="StdDev", alias="z_stdev_ahn4") \
    .run(lidar_df_ahn4)
Step 3: Aggregate and Regress
As the GeoAnalytics function Geographically Weighted Regression (GWR) can only work on point data, the centroid of each building polygon is extracted with the centroid function. The 3 dataframes are joined into one (see also the notebook), and then the GWR function can be performed. In this instance, it predicts the height for T3 based on local regression functions.
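The centroid extraction and the join live in the notebook; a minimal sketch, assuming a shared BAG building identifier column (here "identificatie") and the aliased height columns from the previous step, could be:
# Sketch: reduce building polygons to centroid points and join the three epochs
# ("identificatie" and the z_mean_* column names are assumptions)
df_ahn4_pts = df_ahn4_result.withColumn("geometry", ST.centroid("geometry"))

df_joined = (
    df_ahn4_pts
    .join(df_ahn3_result.select("identificatie", "z_mean_ahn3"), on="identificatie")
    .join(df_ahn2_result.select("identificatie", "z_mean_ahn2"), on="identificatie")
)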
# Import the required modules
from geoanalytics_fabric.tools import GWR

# Run the GWR tool to predict AHN4 (T3) height values for buildings at Loppersum
# Note: reconstructed call; the variable names and parameter values are assumptions
resultGWR = GWR() \
    .setExplanatoryVariables("z_mean_ahn2", "z_mean_ahn3") \
    .setDependentVariable("z_mean_ahn4") \
    .setLocalWeightingScheme("Bisquare") \
    .setNumNeighbors(10) \
    .runIncludeDiagnostics(df_joined)
The model diagnostics can be consulted for the predicted z-value; in this case, the following results were generated. Note, again, that these results cannot be used for real-world applications, as the data and methodology might not best fit the purpose of subsidence modelling — this merely demonstrates the Fabric GeoAnalytics functionality.
R2: 0.994
AdjR2: 0.981
AICc: 1509
Sigma2: 0.046
EDoF: 378
Step 4: Visualize results
With the spatial function plot, results can be visualized as maps within the notebook (available with the Python API in Spark only). First, a visualization of all buildings within the municipality of Loppersum.
# Visualize Loppersum buildings
# Note: reconstructed call; the original plot arguments were not preserved
df_buildings.st.plot()
Here is a visualization of the height difference between T3 and T3 as predicted.
# Visualize difference of predicted height and actual measured height,
# Loppersum area and buildings
# Note: reconstructed snippet; the ax keyword and styling are assumptions
axes = df_loppersum.st.plot(facecolor="none", alpha=0)
# axes.set(xlim=..., ylim=...)  # original extent values were not preserved
df_buildings.st.plot(ax=axes)  # , color='xkcd:sea blue'
df_with_difference.st.plot(ax=axes)
Summary
This blog post discusses the significance of geographical data. It highlights the challenges that increasing data volumes pose for geospatial data systems and argues that traditional big data engines must adapt to handle geospatial data efficiently. As an example, it presents how to use the Microsoft Fabric Spark compute engine and its integration with the ESRI GeoAnalytics engine for effective geospatial big data processing and analytics.
Opinions here are mine.
Footnotes
# in preview
* for modelling land subsidence with much higher accuracy and temporal frequency, other approaches and data can be utilized, such as the satellite InSAR methodology
+ LAStools is used here separately; it would be fun to test the usage of Fabric User data functions, or to utilize an Azure Function, for this purpose
& code snippets here are set up for readability, not necessarily for efficiency. Multiple data processing steps could be chained.
References
GitHub repo with notebooks: delange/Fabric_GeoAnalytics
Microsoft Fabric: Microsoft Fabric documentation – Microsoft Fabric | Microsoft Learn
ESRI GeoAnalytics for Fabric: Overview | ArcGIS GeoAnalytics for Microsoft Fabric | ArcGIS Developers
AHN: Home | AHN
BAG: Over BAG – Basisregistratie Adressen en Gebouwen – Kadaster.nl zakelijk
Lastools: LAStools: converting, filtering, viewing, processing, and compressing LIDAR data in LAS and LAZ format
Surface and Object Motion Map: Bodemdalingskaart