- Databricks Spatial SQL: 90+ native functions, 17x faster than Apache Sedona for common operations
- H3 indexing via Mosaic: spatial joins on 100M+ records in seconds, not hours
- Unity Catalog for geospatial: versioned raster/vector datasets with lineage tracking
- Real benchmark: 368x faster than equivalent ArcPy workflow on identical data (847s to 2.3s)
You have 50 million parcels. A weekly flood risk analysis that takes 4 weeks manually. And your boss wants it in the cloud. Which platform?
This post is about Databricks - honest, production-tested, with code you can copy. We ran these benchmarks on real data for a top-5 global reinsurer. The numbers are specific because they come from actual production runs, not marketing slides.
Geospatial in Cloud Series
This is Part 1 of our Geospatial in Cloud series. Each post is self-contained. Part 2 covers AWS. Part 3 covers GCP. Read the one that matches your stack.
Why Databricks for Geospatial
Before 2023, Databricks was just Spark with a notebook UI. Geospatial meant bolting on Apache Sedona, fighting with UDF serialisation, and hoping your cluster didn't run out of memory mid-join.
That changed. Databricks now has:
Native Spatial SQL (90+ functions)
ST_Contains, ST_Buffer, ST_Distance, ST_Intersection - all built into the SQL engine. No UDFs, no Sedona dependency, no serialisation overhead.
Mosaic Library (H3 indexing)
Uber's hexagonal grid system, integrated natively. Spatial joins on 100M+ records drop from hours to seconds.
Unity Catalog (Data Governance)
Versioned geospatial datasets with lineage tracking. One set of permissions for spatial and non-spatial data. Audit logs out of the box.
Photon Engine (C++ Native Execution)
Spatial operations run on Databricks' vectorised C++ engine. This is why native Spatial SQL is 17x faster than equivalent Sedona UDFs on the same cluster.
The real reason teams choose it: they already have Databricks for analytics. Adding geospatial to an existing lakehouse is cheaper than standing up separate GIS infrastructure. It's incremental compute cost, not new platform cost.
Spatial SQL Deep Dive
The most common geospatial operation is a spatial join: "which parcels fall inside which flood zones?" In Databricks, this is standard SQL.
-- Spatial join: parcels to flood zones
SELECT
  p.parcel_id,
  p.area_sqm,
  f.flood_zone,
  f.return_period
FROM parcels p
JOIN flood_zones f
  ON ST_Contains(f.geometry, p.geometry)
WHERE p.country = 'DE'
No special syntax. No Python UDFs wrapping Java libraries. Just SQL that happens to have spatial predicates. Your data analysts already know how to write this.
A slightly more complex example - buffer analysis. Find all parcels within 1km of each site:
-- Buffer + intersection: affected parcels per site
SELECT
  s.site_id,
  COUNT(p.parcel_id) AS affected_parcels,
  SUM(p.area_sqm) AS total_area
FROM sites s
JOIN parcels p
  ON ST_Intersects(
       ST_Buffer(s.geometry, 1000),  -- 1km buffer
       p.geometry
     )
GROUP BY s.site_id
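For point sites, buffering by 1,000 m and testing intersection is equivalent to testing whether the distance from site to parcel is at most 1,000 m. A toy, pure-Python sketch of that equivalence, using hypothetical projected (metre-based) coordinates:

```python
import math

def within_buffer(site, parcel_centroid, radius_m=1000.0):
    """Planar distance test: for point geometries in a projected CRS,
    this matches ST_Intersects(ST_Buffer(site, radius_m), point)."""
    dx = site[0] - parcel_centroid[0]
    dy = site[1] - parcel_centroid[1]
    return math.hypot(dx, dy) <= radius_m

# Hypothetical projected coordinates (metres)
site = (500_000.0, 5_600_000.0)
parcels = {
    "P1": (500_400.0, 5_600_300.0),   # 500 m from the site
    "P2": (501_200.0, 5_600_900.0),   # 1,500 m from the site
}
affected = [pid for pid, c in parcels.items() if within_buffer(site, c)]
print(affected)  # ['P1']
```

The SQL engine does the same reasoning at scale; the sketch just makes the geometry predicate concrete.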
KEY INSIGHT: DITCH SEDONA UDFS
Databricks Spatial SQL runs on Photon, which is C++ native. This means spatial operations run 17x faster than equivalent Sedona UDFs on the same cluster. Most online tutorials still show the Sedona approach - switch to native SQL and your costs drop immediately.
For a deep dive on the file formats that make this efficient, see our guide on cloud-native geospatial formats (GeoParquet, COG, STAC).
H3 Indexing with Mosaic
H3 is Uber's hexagonal hierarchical spatial index. Think of it as dividing the entire planet into hexagons at multiple resolutions. Every point on Earth maps to a specific hexagon ID at each resolution level.
Why hexagons? On a square grid, diagonal neighbours are farther from a cell's centre than edge neighbours; on a hexagonal grid, all six neighbours are equidistant. This matters for spatial joins because the index prunes irrelevant partitions evenly, regardless of direction.
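The neighbour-distance claim is easy to verify numerically. A short stdlib-only sketch compares centre-to-centre distances on a unit square grid against a hexagonal grid expressed in axial coordinates (the coordinate convention is an assumption for illustration, not the H3 internals):

```python
import math

def square_neighbour_distances():
    # Centre-to-centre distances to all 8 neighbours of a unit square cell
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0)]
    return sorted({round(math.hypot(dx, dy), 6) for dx, dy in offsets})

def hex_neighbour_distances():
    # The 6 neighbours of a hex cell in axial coordinates, mapped to the plane
    axial = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]
    def to_xy(q, r):
        return (q + r / 2, r * math.sqrt(3) / 2)
    return sorted({round(math.hypot(*to_xy(q, r)), 6) for q, r in axial})

print(square_neighbour_distances())  # [1.0, 1.414214] - two distinct distances
print(hex_neighbour_distances())     # [1.0] - all six neighbours equidistant
```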
The practical difference is dramatic. A spatial join on 100M parcels x 50K flood zones took 47 minutes with ST_Contains. With H3 pre-indexing at resolution 9, the same join took 12 seconds.
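The speedup comes from replacing pairwise geometry tests with hash lookups on cell IDs. A toy pure-Python sketch of the principle, using a coarse square grid as a stand-in for H3 cells and hypothetical axis-aligned zones:

```python
from collections import defaultdict

# Toy stand-in for an H3 index: bucket coordinates into coarse grid cells,
# so the join probes only zones sharing a cell instead of testing all zones.
CELL = 10.0

def cell_id(x, y):
    return (int(x // CELL), int(y // CELL))

# Hypothetical zones as axis-aligned boxes: (name, xmin, ymin, xmax, ymax)
zones = [("Z1", 0, 0, 10, 10), ("Z2", 50, 50, 60, 60)]

# Build the index: every cell a zone overlaps maps back to that zone
index = defaultdict(list)
for name, xmin, ymin, xmax, ymax in zones:
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            index[(cx, cy)].append((name, xmin, ymin, xmax, ymax))

def zones_containing(x, y):
    # Probe only candidate zones in the point's cell, then exact containment
    hits = []
    for name, xmin, ymin, xmax, ymax in index.get(cell_id(x, y), []):
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits.append(name)
    return hits

print(zones_containing(5, 5))    # ['Z1']
print(zones_containing(55, 55))  # ['Z2']
print(zones_containing(30, 30))  # []
```

The same two-phase pattern (cheap index lookup, then exact geometry test on the survivors) is what H3 pre-indexing gives the Spark join.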
# H3 indexing with Mosaic
from pyspark.sql.functions import lit
from mosaic import enable_mosaic, grid_tessellateexplode

enable_mosaic(spark)

# Index geometries to H3 resolution 9
df = df.withColumn(
    "h3_index",
    grid_tessellateexplode("geometry", lit(9))
)
# Now spatial joins use H3 index lookups
# instead of full geometry comparisons
H3 INDEXING IMPACT
100M parcels x 50K flood zones: 47 minutes with ST_Contains, 12 seconds with H3 pre-indexing at resolution 9.
WHEN NOT TO USE H3
Point-in-polygon with low cardinality (fewer than 10K polygons) - native ST_Contains is faster because H3 indexing overhead exceeds the join savings. The tessellation step itself has a cost. Only use H3 when your polygon count justifies it.
Unity Catalog for Rasters
Databricks is vector-first. But most real geospatial workflows involve rasters too - DEMs, satellite imagery, climate grids. Unity Catalog Volumes let you store and version these alongside your vector data.
THE GOTCHA NOBODY DOCUMENTS
Databricks Volumes appear to support standard file I/O, but the underlying FUSE mount does not support random-access reads. Any library that seeks within a TIFF dies with errors like "_tiffSeekProc: Operation not supported" - GDAL, rasterio, and any TIFF-based workflow will fail silently or throw cryptic errors.
The fix: two-stage read.
# Two-stage raster read pattern
import shutil
import rasterio

# Stage 1: Copy from Unity Catalog Volume to local SSD
shutil.copy(
    "/Volumes/catalog/schema/rasters/dem.tif",
    "/local_disk0/tmp/dem.tif"
)

# Stage 2: Process locally (random-access works on local disk)
with rasterio.open("/local_disk0/tmp/dem.tif") as src:
    data = src.read(1)
    transform = src.transform
This adds 2-5 seconds of copy time per file, but it's the only reliable pattern. We've filed the issue with Databricks - FUSE filesystem support for random-access reads is on their roadmap but not yet available.
The upside of Unity Catalog: versioning, lineage tracking, and fine-grained access control on your geospatial datasets. One governance layer for everything - vectors, rasters, tabular data. No separate data management for spatial files.
Real Benchmarks
These benchmarks compare the same operations on identical datasets. ArcPy ran on a high-spec desktop. Databricks ran on a 4-node cluster. Both processed GeoParquet data stored on Delta Lake.
ARCPY VS DATABRICKS - SAME DATA, SAME OPERATIONS
- Spatial join (50M parcels): 847s in ArcPy vs 2.3s in Databricks - 368x faster
- Two further operations in the same benchmark ran 173x and 108x faster
CLUSTER CONFIGURATION
- Cluster: 4x i3.xlarge workers (4 vCPU, 30.5 GB RAM each)
- Runtime: 14.3 LTS with Photon enabled
- Data format: GeoParquet on Delta Lake
- ArcPy machine: Dell Precision 7920 (Xeon W-2295, 128GB RAM, $7K ESRI Advanced licence)
The 368x number gets attention, but the real story is cost. That ArcPy run needed a $7,000/year ArcGIS Advanced licence on a $3,000 desktop. The Databricks run cost $0.47 in compute. At 52 weekly runs, that's $24.44/year vs $10,000+/year.
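The annualised figures above are simple arithmetic; a two-line check of the numbers as quoted (compute cost only, ignoring storage and the desktop hardware):

```python
# Annualised cost check for the figures quoted above
databricks_per_run = 0.47        # USD per weekly run, as benchmarked
runs_per_year = 52
databricks_annual = databricks_per_run * runs_per_year
print(f"${databricks_annual:.2f}/year")          # $24.44/year

esri_annual = 7_000              # ArcGIS Advanced licence alone, per year
print(round(esri_annual / databricks_annual))    # ~286x cheaper, licence vs compute
```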

Cost Comparison vs ESRI
Beyond raw performance, the cost difference compounds over time. Here's a side-by-side for a mid-sized geospatial team (20-50 users, 1TB vector data, weekly batch analysis).
ESRI STACK
- Year 1 total: $60K-$200K+
- Year 3 total: $180K-$600K+
DATABRICKS STACK
- Year 1 total: $5K-$15K
- Year 3 total: $15K-$45K
HONEST CAVEAT: MIGRATION ISN'T FREE
Year 1 includes migration effort. Budget 4-8 weeks of engineering time to migrate ArcPy scripts and retrain analysts. This is real cost that most cloud vendors conveniently forget to mention. For teams evaluating this transition, our earlier Databricks deep-dive covers the ecosystem evaluation in detail.
When NOT to Use Databricks for Geospatial
We sell Databricks migration services. Even so, we tell clients not to use Databricks in these scenarios:
1. Small datasets (< 1M records)
PostGIS on a single server is simpler, cheaper, and fast enough. Databricks overhead (cluster startup, job scheduling) isn't worth it for small data. A $20/month managed Postgres handles most small-to-medium workloads better.
2. Real-time spatial queries
Databricks is batch-oriented. Cluster startup alone takes 2-5 minutes. For sub-second spatial queries serving a web application, use PostGIS or a dedicated spatial index (R-tree). Databricks will never match the latency of a warm in-memory spatial index.
3. Heavy raster processing
Databricks is vector-first. For large-scale raster analysis (satellite imagery time series, DEM processing, multi-band classification), consider Google Earth Engine or a dedicated raster processing pipeline. The FUSE filesystem limitation with TIFFs makes raster-heavy work painful.
4. Teams without existing Databricks
If your organisation doesn't already use Databricks, the overhead of adopting it JUST for geospatial is rarely justified. Platform licensing, training, infrastructure setup - the incremental cost argument vanishes when there is no existing investment. Start with PostGIS or DuckDB Spatial.
5. Desktop workflows that work fine
If an analyst processes 10 files a week in QGIS and is happy with the results, don't migrate for the sake of it. Cloud migration has a real productivity cost during transition. The analyst who knew every ArcGIS shortcut is now a beginner in notebooks. Only migrate when scale demands it.
Getting Started
If you've read this far and Databricks still makes sense for your workloads, here's the minimum viable setup:
1. Cluster Configuration
Start with 2x i3.xlarge workers (4 vCPU, 30.5GB RAM each). Enable Photon. Use Runtime 14.3 LTS or later for native Spatial SQL. Enable autoscaling (min 2, max 8) to handle varying workloads without overpaying.
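The setup above can be sketched as a Clusters API payload. The field names follow the Databricks Clusters API; the exact runtime string is illustrative and should be checked against the versions available in your workspace:

```json
{
  "cluster_name": "geospatial-batch",
  "spark_version": "14.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "runtime_engine": "PHOTON"
}
```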
2. Enable Mosaic
%pip install databricks-mosaic

from mosaic import enable_mosaic
enable_mosaic(spark)
3. Read GeoParquet from Cloud Storage
# From S3
df = spark.read.format("geoparquet").load(
    "s3://your-bucket/parcels/*.parquet"
)

# From Azure Data Lake Storage
df = spark.read.format("geoparquet").load(
    "abfss://container@account.dfs.core.windows.net/parcels/"
)
4. Register as a SQL Table
# Write to Delta Lake with Unity Catalog
df.write.format("delta").saveAsTable("catalog.schema.parcels")
# Now query with Spatial SQL
spark.sql("""
    SELECT parcel_id, ST_Area(geometry) as area
    FROM catalog.schema.parcels
    WHERE ST_Contains(ST_GeomFromWKT('POLYGON(...)'), geometry)
""")
From zero to running spatial queries: about 30 minutes if you already have a Databricks workspace. Most of that time is waiting for the cluster to start.
Frequently Asked Questions
Can Databricks handle geospatial data?
Yes. Databricks has 90+ native Spatial SQL functions, H3 indexing via the Mosaic library, and native GeoParquet support. It handles vector data exceptionally well at scale (100M+ records). Raster support is improving but currently requires workarounds for random-access file reads.
Is Databricks faster than ArcGIS for geospatial analysis?
For large-scale batch operations, significantly faster. A spatial join on 50M parcels runs in 2.3 seconds on Databricks vs 847 seconds in ArcPy (368x faster). However, for small datasets or real-time queries, a simpler tool like PostGIS may be more appropriate.
How much does Databricks cost for geospatial workloads?
A weekly spatial analysis job on a 4-node cluster costs approximately $0.47 per run ($24/year). Compare this to an ArcGIS Advanced licence at $7,000/year. Storage on Delta Lake costs roughly $23/month per TB of vector data.
Databricks is not the right choice for everyone. But for teams that already have the platform, adding geospatial is one of the highest-ROI moves available.
368x faster. 90% cheaper. And your data analysts can write the queries in SQL they already know. The barrier isn't technology - it's knowing the patterns that work in production and the gotchas that don't appear in tutorials.
That's what this series is for. Honest, practitioner-tested guidance for running geospatial workloads in the cloud.