- Runtime 17.1+ brings 90+ native Spatial SQL functions (Public Preview) - no Sedona dependency, no UDF overhead. Databricks reports 17x faster spatial joins than Sedona on identical clusters
- Two spatial approaches now matter: Native Spatial SQL (fastest, simplest, recommended for new projects) and native H3 functions (40 built-in since Runtime 11.2, Photon-only). Mosaic is no longer in active development
- H3 indexing converts expensive geometry joins into integer equality joins. On large datasets (100M+ records), this changes query time from minutes to seconds - but the speedup depends heavily on data distribution and resolution choice
- Delta Lake is the storage advantage: Z-ordering by spatial columns gives 10-50x faster spatial queries, liquid clustering (Runtime 13.3+) automates this entirely, and time travel provides built-in audit trails for regulatory compliance
- Asset Bundles replace manual notebook deployment: one databricks.yml file defines jobs, clusters, and schedules across dev/staging/prod. CI/CD validates before deploy
- Cost comparison: a 4-node cluster running a weekly spatial join costs roughly $0.47 in compute per run. But factor in platform licensing, storage, and cluster idle time for the real picture
You have 50 million parcels. A weekly flood risk analysis that takes 4 weeks manually. And your boss wants it in the cloud. Which platform?
This post is about Databricks - production-tested, with patterns you can copy. These benchmarks come from real production pipelines processing reinsurance exposure data. The numbers are specific because they come from actual runs, not marketing slides.
Geospatial in Cloud Series
This is Part 1 of our Geospatial in Cloud series. Each post is self-contained. Part 2 covers AWS. Part 3 covers GCP. Part 4 covers Snowflake. Read the one that matches your stack.
Why Databricks for Geospatial
Before 2023, Databricks was just Spark with a notebook UI. Geospatial meant bolting on Apache Sedona, fighting with UDF serialisation, and hoping your cluster didn't run out of memory mid-join. That era is over.
Runtime 17.1 changed everything. Databricks now ships with 90+ native spatial SQL functions plus 40 H3 functions baked directly into the query engine. No external libraries to install, no UDFs to register, no JAR dependencies to manage. Spatial joins, buffer analysis, distance calculations, intersection tests - all expressed as standard SQL that your data analysts already know how to write. One caveat: the ST functions are still in Public Preview as of early 2026. The H3 functions (available since Runtime 11.2) are GA.
Native Spatial SQL on Photon (Runtime 17+)
ST_Contains, ST_Buffer, ST_Distance, ST_Intersection and 86 more functions run directly on Databricks' vectorised C++ engine. No serialisation overhead, no JVM interop, no UDF bottleneck. Databricks reports this as 17x faster than the same operations via Sedona on the same cluster.
Native H3 Spatial Indexing
Uber's hexagonal grid system, with 40 H3 functions built directly into the SQL engine since Runtime 11.2. When your spatial joins hit 100M+ records, H3 pre-indexing turns 47-minute operations into 12-second operations. It is the difference between overnight batch and interactive analysis.
Unity Catalog for Data Governance
Versioned geospatial datasets with lineage tracking. One governance model for vectors, rasters, and tabular data. Fine-grained access control and audit logs. No separate data management layer for spatial files.
GeoParquet as the Native Format
Read and write GeoParquet directly from cloud storage (S3, ADLS, GCS). No format conversion step, no Shapefile limitations. Columnar storage means the query engine reads only the columns referenced in your query, cutting I/O costs dramatically.
The real reason teams choose it: they already have Databricks for analytics. Adding geospatial to an existing lakehouse is incremental compute cost, not new platform cost. The spatial SQL functions work on the same tables, with the same permissions, in the same notebooks your team already uses.
Native Spatial SQL (Runtime 17+)
The most common geospatial operation is a spatial join: “which parcels fall inside which flood zones?” In Databricks Runtime 17+, this is standard SQL with spatial predicates. No special syntax, no Python wrappers, no external libraries. Your data analysts already know the language - they just need to learn a handful of spatial functions.
A spatial join uses ST_Contains to test containment. A buffer analysis uses ST_Buffer combined with ST_Intersects. Distance calculations use ST_Distance. The syntax is identical to PostGIS, making migration straightforward for teams with PostGIS experience.
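As a sketch, the two patterns above look like this in SQL. Table and column names (parcels, flood_zones, rivers, geometry) are illustrative, not from any real schema:

```sql
-- Spatial join: which parcels fall inside which flood zones?
SELECT p.parcel_id, f.zone_id
FROM parcels AS p
JOIN flood_zones AS f
  ON ST_Contains(f.geometry, p.geometry);

-- Buffer analysis: parcels within 500 units of a river centreline
-- (buffer distance follows the units of the geometry's CRS)
SELECT DISTINCT p.parcel_id
FROM parcels AS p
JOIN rivers AS r
  ON ST_Intersects(p.geometry, ST_Buffer(r.geometry, 500));
```

If you have written PostGIS, this should look familiar by design.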
Why Photon matters here: these spatial functions execute on Databricks' vectorised C++ engine. Traditional Spark would serialise geometry objects between Java and Python, losing performance at every boundary. Photon operates on columnar data natively. Databricks reports spatial joins running 17x faster than Sedona equivalents on identical cluster configurations - though independent benchmarks (SpatialBench, January 2026) show more nuanced results depending on query type.
KEY INSIGHT: MOST TUTORIALS ARE OUTDATED
Search for “Databricks geospatial” and most results still show the old Sedona approach: install the JAR, register UDFs, wrap everything in Python. With Runtime 17+, you skip all of that. Native Spatial SQL works out of the box. If you are starting a new geospatial project on Databricks today, start with native SQL. Only reach for Sedona or Mosaic when native SQL genuinely cannot handle your specific workload.
The 90+ spatial functions cover the full range of operations: spatial predicates (contains, intersects, within, touches, crosses), measurements (area, length, distance), transformations (buffer, centroid, convex hull, simplify, union), constructors (point, polygon, linestring from WKT/WKB/GeoJSON), and accessors (coordinates, SRID, envelope). For the vast majority of enterprise geospatial workloads - flood exposure, parcel analysis, proximity scoring, portfolio aggregation - native SQL covers everything.
For detail on the file formats that make this efficient, see our guide on cloud-native geospatial formats (GeoParquet, COG, STAC).
Sedona vs Mosaic vs Native SQL
The Databricks geospatial ecosystem has shifted significantly. Understanding what is current and what is legacy saves weeks of going down the wrong path.
| APPROACH | BEST FOR | STATUS | COMPLEXITY |
|---|---|---|---|
| Native Spatial SQL + H3 | All spatial operations including H3 indexing (90+ ST functions, 40 H3 functions) | Active (ST: Public Preview, Runtime 17.1+; H3: GA, Runtime 11.2+) | Lowest - just SQL |
| Apache Sedona | Complex geometry operations, spatial R-tree index, existing workflows | Active (independent project) | Medium - JAR dependencies |
| Mosaic | H3 tessellation (was the bridge before native H3) | No longer in active development | Locked to Runtime 13 |
Native Spatial SQL + H3 is the default choice. It covers 80-90% of enterprise geospatial workloads with zero setup overhead. Standard spatial joins, buffer analysis, distance calculations, area computations, and H3 indexing - all run directly on Photon with no external dependencies. Note: both the ST functions and H3 require Photon-enabled clusters (Pro or Serverless tier).
MOSAIC IS NO LONGER IN ACTIVE DEVELOPMENT
If you find tutorials recommending the Mosaic library for H3 operations, that advice is outdated. Mosaic 0.4.x only supports Runtime 13 and is locked to that version. Databricks now ships 40 native H3 functions built into the SQL engine. Use those instead. If you have existing Mosaic code, plan to migrate to native H3 functions.
Sedona remains valid for complex geometry. Operations not yet available in native SQL - Voronoi diagrams, advanced topology, custom R-tree indexes - still require Sedona. The SpatialBench benchmark from Apache Sedona (January 2026) showed Sedona delivering up to 6x better price-performance than Databricks Serverless on certain query types. Neither platform finished all benchmark queries. The reality: both have strengths depending on workload.
CRITICAL: SEDONAREGISTRATOR IS DEPRECATED
If you find tutorials telling you to call SedonaRegistrator.registerAll(), that API is deprecated in Sedona 1.5+. The modern initialisation uses SedonaContext. Better yet, skip Sedona entirely for standard operations and use native Spatial SQL. One fewer dependency to manage, one fewer JAR version to track, one fewer thing to break during runtime upgrades.
H3 Indexing (Native, Not Mosaic)
H3 is Uber's hexagonal hierarchical spatial index. Think of it as dividing the entire planet into hexagons at multiple resolutions. Every point on Earth maps to a specific hexagon ID at each resolution level. Databricks ships 40 native H3 functions (since Runtime 11.2) - no external library needed.
Why hexagons? Unlike squares, every neighbour of a hexagon sits at the same distance from its centre - a square grid has two neighbour distances (edge-adjacent and diagonal). This matters for spatial joins because the index prunes irrelevant partitions evenly, regardless of direction.
The practical difference is significant. H3 converts expensive geometry comparisons into integer equality joins. Instead of testing every geometry against every other geometry, H3 pre-filters to only compare geometries sharing the same hexagonal cell. For large datasets (100M+ records), this turns queries that take minutes into queries that take seconds. The exact speedup depends on data distribution, resolution choice, and cluster size - Databricks reports up to 90x cost reduction using H3-centric vs geometry-centric approaches.
The setup is straightforward: use the native h3_longlatash3 and h3_polyfillash3 functions to tessellate your geometries at the appropriate resolution. From that point forward, spatial joins use H3 cell lookups instead of full geometry comparisons. No Mosaic library needed - these are built into the SQL engine.
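A sketch of that tessellate-then-join flow. Table and column names are illustrative, and note that h3_longlatash3 takes (longitude, latitude, resolution), as the function name suggests:

```sql
WITH parcels_h3 AS (
  SELECT parcel_id,
         h3_longlatash3(lon, lat, 9) AS cell              -- point -> cell ID
  FROM parcels
),
zones_h3 AS (
  SELECT zone_id,
         explode(h3_polyfillash3(zone_geom_wkt, 9)) AS cell  -- polygon -> cells
  FROM flood_zones
)
-- The spatial join is now an integer equality join
SELECT p.parcel_id, z.zone_id
FROM parcels_h3 AS p
JOIN zones_h3 AS z ON p.cell = z.cell;
```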
Resolution choice matters. Resolution 9 (roughly 174m per hexagon) is optimal for urban analyses - parcel boundaries, building footprints, address-level work. Resolution 7 (roughly 1.2km per hexagon) suits regional analyses - flood zones, administrative boundaries, catchment areas. Going too fine wastes memory on index overhead; going too coarse defeats the purpose of pre-filtering. Most enterprise workloads settle on resolution 8 or 9.
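The resolution trade-off can be sketched as a small heuristic. The edge lengths for resolutions 7 and 9 come from the figures above; the resolution-8 value (~461 m) is the published H3 average; the chooser function itself is a hypothetical rule of thumb, not a Databricks API:

```python
# Approximate average H3 hexagon edge lengths in metres.
# Resolutions 7 and 9 match the figures quoted above; 8 is the
# published H3 average. All values are rough planet-wide averages.
H3_EDGE_M = {7: 1220.6, 8: 461.4, 9: 174.4}

def pick_resolution(feature_size_m: float) -> int:
    """Heuristic: pick the coarsest resolution whose hexagon edge is
    no larger than the typical feature being indexed. Coarser than
    that defeats pre-filtering; finer wastes memory on index overhead."""
    for res in sorted(H3_EDGE_M):          # coarsest first: 7, 8, 9
        if H3_EDGE_M[res] <= feature_size_m:
            return res
    return max(H3_EDGE_M)                  # tiny features: use the finest

print(pick_resolution(180))    # urban parcels (~180 m) -> 9
print(pick_resolution(1500))   # flood zones (~1.5 km) -> 7
```

Run the heuristic on a sample of your feature sizes before committing - tessellation at the wrong resolution is expensive to redo on 100M+ records.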
One critical detail on H3 coordinate order: Databricks' h3_longlatash3 takes (longitude, latitude) - as its name says - while the upstream H3 library and most of its language bindings use (latitude, longitude), and GIS tools differ again. This trips every team at least once. If your spatial joins return zero matches on data you know overlaps, check the coordinate order first.
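A cheap sanity check can catch the most blatant swaps before they surface as a silent zero-match join. This helper is a hypothetical heuristic, not part of any Databricks API, and it only catches swaps where the longitude exceeds 90 degrees:

```python
def looks_swapped(lon: float, lat: float) -> bool:
    """Flag pairs whose 'latitude' is outside [-90, 90] while the
    'longitude' would itself be a valid latitude - the classic
    symptom of swapped argument order. Cannot detect swaps where
    both values happen to fall within [-90, 90]."""
    return abs(lat) > 90 and abs(lon) <= 90

print(looks_swapped(139.7, 35.7))  # Tokyo, correct (lon, lat) -> False
print(looks_swapped(35.7, 139.7))  # Tokyo, swapped -> True
```

Running a check like this over a sample of rows before tessellating is far cheaper than debugging an empty join result.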
WHY H3 CHANGES EVERYTHING FOR LARGE JOINS
Databricks reports up to 90x cost reduction with an H3-centric approach over a geometry-centric one. The actual speedup varies by dataset, resolution, and cluster size.
WHEN NOT TO USE H3
Point-in-polygon with low cardinality (fewer than 10K polygons) - native ST_Contains is faster because H3 indexing overhead exceeds the join savings. The tessellation step itself has a cost. Only use H3 when your polygon count justifies it.
Unity Catalog for Rasters
Databricks is vector-first. But most real geospatial workflows involve rasters too - DEMs, satellite imagery, climate grids. Unity Catalog Volumes let you store and version these alongside your vector data.
THE TRAP NOBODY DOCUMENTS
Databricks Volumes appear to support standard file I/O, but any library that attempts random-access reads on a TIFF dies with `_tiffSeekProc: Operation not supported`. GDAL, rasterio, and any TIFF-based workflow will fail silently or throw cryptic errors.
The fix is a two-stage read: copy the file from the Volume to /local_disk0/tmp on the worker node, then open it with rasterio or GDAL from the local path. The /local_disk0/ path is ephemeral SSD attached to the instance - fast reads, proper seek support, wiped on cluster termination. This adds 2-5 seconds of copy time per file, but it is the only reliable pattern.
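A minimal sketch of the staging helper. Only the stdlib copy is shown as runnable code; the Volume path and the rasterio call in the usage comment are Databricks-side assumptions. Locally this stages into the system temp directory - on a cluster you would point LOCAL_TMP at /local_disk0/tmp:

```python
import shutil
import tempfile
from pathlib import Path

# On a Databricks worker, use Path("/local_disk0/tmp") instead.
LOCAL_TMP = Path(tempfile.gettempdir())

def stage_locally(volume_path: str) -> Path:
    """Copy a file from a FUSE-mounted Volume to local disk so that
    libraries needing random-access reads (GDAL, rasterio, SQLite-backed
    GeoPackage) can open it with proper seek support."""
    src = Path(volume_path)
    dst = LOCAL_TMP / src.name
    shutil.copy(src, dst)          # the 2-5 second per-file cost
    return dst

# Usage sketch on a cluster (illustrative path):
#   local = stage_locally("/Volumes/main/geo/rasters/dem.tif")
#   with rasterio.open(local) as ds:
#       band = ds.read(1)
#   local.unlink()   # keep /local_disk0/tmp from filling up
```

Deleting the staged copy after each file, as in the usage sketch, is what keeps long-running jobs from exhausting the ephemeral disk.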
THIS ALSO AFFECTS READS
GeoPackage reads fail on Volumes too. SQLite creates a -wal (Write-Ahead Log) file even for read operations. The FUSE mount cannot handle this. Stage to local disk first, read, then clean up. This catches everyone the first time - the error message gives no hint that the read path is the problem.
The same two-stage pattern applies on every cloud platform. S3 on AWS, GCS on Google Cloud - object stores do not expose POSIX file semantics, and the FUSE mounts layered over them handle random access poorly. Any format that requires seekable reads or in-place writes (GeoTIFF, GeoPackage, Shapefile) needs local staging first. We cover the AWS and GCP variants in their respective guides in this series.
For long-running jobs, clean up /local_disk0/tmp periodically. The ephemeral disk is finite and will fill up if you process thousands of rasters without clearing intermediate files. The disk is wiped on cluster termination, but during execution it is your responsibility to manage.
The upside of Unity Catalog: versioning, lineage tracking, and fine-grained access control on your geospatial datasets. One governance layer for everything - vectors, rasters, tabular data. No separate data management for spatial files.
Delta Lake for Spatial Data
Delta Lake is the storage layer that makes Databricks geospatial genuinely different from running Spark on raw Parquet files. Three capabilities matter for spatial workloads: physical data layout, time travel, and incremental processing.
Z-ordering by geospatial columns is the first thing to configure. When you run OPTIMIZE with ZORDER BY on latitude and longitude columns, Delta Lake physically reorganises the Parquet files so that spatially close records are stored together on disk. A spatial query that previously scanned 100% of files now skips 90-95% of them because the file-level statistics tell the engine those files contain no relevant geometries. For a 1TB dataset, this typically means 10-50x faster query times with zero code changes - just a one-time OPTIMIZE command.
Liquid clustering (Runtime 13.3+) is better than Z-ordering. Instead of requiring periodic manual OPTIMIZE runs, liquid clustering continuously reorganises data as it is written. Declare your clustering columns at table creation time and Delta Lake handles the rest. For geospatial tables that receive regular updates - sensor feeds, daily parcel snapshots, incremental flood model outputs - this eliminates the operational burden of scheduling OPTIMIZE jobs. The query performance is equivalent or better than Z-ordering, with none of the maintenance.
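As a sketch, the two layout options side by side (table and column names illustrative):

```sql
-- Option 1: one-time Z-order reorganisation, rerun periodically
OPTIMIZE parcels ZORDER BY (latitude, longitude);

-- Option 2: liquid clustering declared at creation (Runtime 13.3+),
-- no scheduled OPTIMIZE jobs needed
CREATE TABLE parcels_clustered (
  parcel_id BIGINT,
  latitude  DOUBLE,
  longitude DOUBLE,
  geometry  BINARY
)
CLUSTER BY (latitude, longitude);
```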
Time travel is the audit trail that regulators love. Every write to a Delta table creates a new version. You can query any previous version by number or timestamp. For regulated industries - insurance, banking, government - this means you can reproduce exactly what the spatial analysis showed on any given date. No separate versioning system, no manual snapshots, no "which version of the flood model was current on March 15th?" questions. Delta keeps every version automatically, with a configurable retention period (default 7 days, extendable for compliance).
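Time travel in practice - the table name, date, and version number here are illustrative:

```sql
-- What did the analysis show on 15 March?
SELECT * FROM flood_exposure TIMESTAMP AS OF '2025-03-15';

-- Or pin an exact version number
SELECT * FROM flood_exposure VERSION AS OF 42;

-- Who changed what, and when
DESCRIBE HISTORY flood_exposure;
```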
Change Data Feed (CDF) enables incremental processing. Enable CDF on a Delta table and you get a log of every insert, update, and delete. For spatial pipelines that run daily on datasets where only 1-2% of records change, this means processing only the changes instead of the entire dataset. A nightly parcel update that used to reprocess 50M records now processes only the 500K that changed. The cost reduction compounds: less compute, less time, less money.
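Enabling and reading the feed looks roughly like this (table name and version numbers illustrative):

```sql
-- Switch on the Change Data Feed for an existing table
ALTER TABLE parcels
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Process only what changed between two versions
SELECT * FROM table_changes('parcels', 41, 42)
WHERE _change_type IN ('insert', 'update_postimage');
```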
Deployment with Asset Bundles
The biggest operational difference between a prototype notebook and a production geospatial pipeline is deployment. Asset Bundles solve this with a single declarative YAML file - databricks.yml - that defines everything: jobs, cluster configurations, schedules, and environment-specific overrides for dev, staging, and production.
Instead of manually configuring jobs through the UI (which inevitably leads to drift between environments), you declare the entire pipeline as code. A single databricks bundle deploy command validates the configuration, checks for misconfigurations, and deploys to the target environment. The bundle validates before deploying - catching issues like non-existent cluster policies or invalid instance types before they fail at runtime.
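A minimal databricks.yml sketch - the bundle name, job name, notebook path, node type, and runtime string are all illustrative:

```yaml
bundle:
  name: flood-risk-pipeline

resources:
  jobs:
    weekly_flood_join:
      name: weekly-flood-join
      schedule:
        quartz_cron_expression: "0 0 6 ? * MON"
        timezone_id: UTC
      tasks:
        - task_key: spatial_join
          notebook_task:
            notebook_path: ./notebooks/flood_join.py
          new_cluster:
            spark_version: 17.1.x-photon-scala2.13
            node_type_id: i3.xlarge
            num_workers: 4

targets:
  dev:
    default: true
  prod:
    workspace:
      root_path: /Shared/.bundle/prod
```

From there, `databricks bundle validate` catches misconfigurations and `databricks bundle deploy -t prod` pushes to production.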
For geospatial teams migrating from ArcPy, this replaces the "copy the script to the server and hope it works" deployment model. Your spatial pipeline, its cluster configuration, its schedule, and its environment variables all live in version control. CI/CD pipelines can deploy automatically on merge. No more "it worked on my machine" debugging sessions.
MODULE CACHE: THE TRAP THAT WASTES HOURS
When you update a shared Python module during development, the Spark driver picks up the new code immediately - but the executors still have the old version cached from a previous task. Your driver and workers are running different code versions. The fix is dbutils.library.restartPython() after any module update. For production jobs, pin module versions in your cluster library configuration rather than relying on notebook-scoped imports. This eliminates the cache problem entirely.
Real Benchmarks
These benchmarks compare the same operations on identical datasets. ArcPy ran single-threaded on a high-spec desktop. Databricks ran distributed across a 4-node cluster (16 vCPU, 122 GB RAM total). This is not a fair hardware comparison - it is a comparison of workflow approaches: single-desktop GIS vs distributed cloud processing.
CONTEXT ON THESE NUMBERS
Any distributed engine (PostGIS with parallel queries, Sedona, DuckDB Spatial on equivalent hardware) would also massively outperform single-threaded ArcPy. The value of these benchmarks is not "Databricks is 368x faster than everything" - it is "moving from desktop GIS to distributed cloud processing changes the game for large datasets". The specific speedup depends on your cluster size, data volume, and query complexity.
ARCPY VS DATABRICKS - SAME DATA, SAME OPERATIONS
- Spatial join, 50M parcels against flood zones: 847 seconds in ArcPy vs 2.3 seconds on Databricks - 368x faster
- Two further operations from the same suite ran 173x and 108x faster
CLUSTER CONFIGURATION
- Cluster: 4x i3.xlarge workers (4 vCPU, 30.5 GB RAM each)
- Runtime: 17 LTS with Photon enabled (native Spatial SQL)
- Data format: GeoParquet on Delta Lake
- ArcPy machine: Dell Precision 7920 (Xeon W-2295, 128GB RAM, $7K ESRI Advanced licence)
The 368x number gets attention, but context matters. That ArcPy run was single-threaded on a desktop. Any distributed engine would be dramatically faster. The more useful comparison is cost: that ArcPy setup needed a $7,000/year ArcGIS Advanced licence on a $3,000 desktop. The Databricks run cost $0.47 in marginal compute. At 52 weekly runs, that is $24.44/year in compute alone - but add Databricks platform licensing, storage, and cluster idle time for the real total.
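The arithmetic behind that annual figure, using only the numbers quoted above:

```python
per_run_compute_usd = 0.47   # 4-node cluster, one weekly spatial join
runs_per_year = 52

compute_per_year = round(per_run_compute_usd * runs_per_year, 2)
print(compute_per_year)      # 24.44 - compute only: excludes platform
                             # licensing, storage, and cluster idle time
```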

Cost Comparison vs ESRI
Beyond raw performance, the cost difference compounds over time. Here's a side-by-side for a mid-sized geospatial team (20-50 users, 1TB vector data, weekly batch analysis).
ESRI STACK
YEAR 1 TOTAL
$60K-$200K+
YEAR 3 TOTAL
$180K-$600K+
DATABRICKS STACK
YEAR 1 TOTAL
$5K-$15K
YEAR 3 TOTAL
$15K-$45K
CAVEAT: MIGRATION ISN'T FREE
Year 1 includes migration effort. Budget 4-8 weeks of engineering time to migrate ArcPy scripts and retrain analysts. This is real cost that most cloud vendors conveniently forget to mention. For teams evaluating this transition, our ArcPy migration playbook covers the ecosystem evaluation in detail.
When NOT to Use Databricks for Geospatial
Databricks is excellent for the right workloads. But it is the wrong choice in these scenarios:
1. Small datasets (< 1M records)
PostGIS on a single server is simpler, cheaper, and fast enough. Databricks overhead (cluster startup, job scheduling) isn't worth it for small data. A $20/month managed Postgres handles most small-to-medium workloads better.
2. Real-time spatial queries
Databricks is batch-oriented. Cluster startup alone takes 2-5 minutes. For sub-second spatial queries serving a web application, use PostGIS or a dedicated spatial index (R-tree). Databricks will never match the latency of a warm in-memory spatial index.
3. Heavy raster processing
Databricks is vector-first. For large-scale raster analysis (satellite imagery time series, DEM processing, multi-band classification), consider Google Earth Engine or a dedicated raster processing pipeline. The FUSE filesystem limitation with TIFFs makes raster-heavy work painful.
4. Teams without existing Databricks
If your organisation doesn't already use Databricks, the overhead of adopting it JUST for geospatial is rarely justified. Platform licensing, training, infrastructure setup - the incremental cost argument vanishes when there is no existing investment. Start with PostGIS or DuckDB Spatial.
5. Desktop workflows that work fine
If an analyst processes 10 files a week in QGIS and is happy with the results, don't migrate for the sake of it. Cloud migration has a real productivity cost during transition. The analyst who knew every ArcGIS shortcut is now a beginner in notebooks. Only migrate when scale demands it.
Getting Started
If you've read this far and Databricks still makes sense for your workloads, here's the minimum viable setup:
1. Cluster Configuration
Start with 2x i3.xlarge workers (4 vCPU, 30.5GB RAM each). Enable Photon. Use Runtime 17 LTS for native Spatial SQL. Enable autoscaling (min 2, max 8) to handle varying workloads without overpaying. Use cluster pools to reduce startup time from 3-8 minutes to under 60 seconds - pools keep idle instances warm and ready. This configuration handles most mid-sized geospatial workloads (up to 100M records) comfortably.
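In Clusters API JSON terms, that starting configuration looks roughly like this - the cluster name and runtime version string are illustrative:

```json
{
  "cluster_name": "geospatial-shared",
  "spark_version": "17.1.x-photon-scala2.13",
  "node_type_id": "i3.xlarge",
  "runtime_engine": "PHOTON",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```

To draw from a cluster pool instead, replace node_type_id with the pool's instance_pool_id.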
2. Use Native H3 for Large Joins (optional)
The native h3_longlatash3 and h3_polyfillash3 functions ship with the runtime - nothing to install. Add an H3 indexing step only if your workloads involve 50M+ record spatial joins where pre-filtering provides a measurable benefit. For smaller workloads, native Spatial SQL alone is sufficient. (Older guides suggest installing the databricks-mosaic package here; skip it - Mosaic is no longer in active development.)
3. Load GeoParquet from Cloud Storage
Read GeoParquet files directly from S3, Azure Data Lake Storage, or GCS using Spark's native GeoParquet reader. No format conversion required. The reader handles geometry deserialisation automatically, and columnar access means you only read the columns your query references.
4. Register as a Delta Table and Query
Write your GeoParquet data to a Delta Lake table registered in Unity Catalog. From that point forward, any spatial SQL query works directly against the table - ST_Contains, ST_Buffer, ST_Distance, and the full set of 90+ spatial functions. Your analysts can query it from SQL notebooks, BI tools, or programmatic APIs.
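Steps 3 and 4 as one sketch, assuming the GeoParquet geometry column arrives as WKB; every table, path, and column name is illustrative:

```sql
-- 3) Read GeoParquet straight from object storage,
-- 4) register it as a Delta table in Unity Catalog
CREATE TABLE main.geo.parcels
USING DELTA AS
SELECT parcel_id, geometry              -- WKB binary column
FROM parquet.`s3://your-bucket/parcels/`;

-- Query with spatial predicates, building geometries from WKB
SELECT count(*) AS exposed_parcels
FROM main.geo.parcels AS p
JOIN main.geo.flood_zones AS f
  ON ST_Contains(ST_GeomFromWKB(f.geometry), ST_GeomFromWKB(p.geometry));
```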
From zero to running spatial queries: about 30 minutes if you already have a Databricks workspace. Most of that time is waiting for the cluster to start. The spatial capability itself requires no additional infrastructure.
Frequently Asked Questions
Can Databricks handle geospatial data?
Yes. Databricks has 90+ native Spatial SQL functions, 40 native H3 indexing functions, and native GeoParquet support. It handles vector data exceptionally well at scale (100M+ records). Raster support is improving but currently requires workarounds for random-access file reads.
Is Databricks faster than ArcGIS for geospatial analysis?
For large-scale batch operations, significantly faster. A spatial join on 50M parcels runs in 2.3 seconds on Databricks vs 847 seconds in ArcPy (368x faster). However, for small datasets or real-time queries, a simpler tool like PostGIS may be more appropriate.
How much does Databricks cost for geospatial workloads?
A weekly spatial analysis job on a 4-node cluster costs approximately $0.47 per run ($24/year). Compare this to an ArcGIS Advanced licence at $7,000/year. Storage on Delta Lake costs roughly $23/month per TB of vector data.
Databricks is not the right choice for everyone. But for teams that already have the platform, adding geospatial is one of the highest-ROI moves available.
368x faster. 90% cheaper. And your data analysts can write the queries in SQL they already know. The barrier isn't technology - it's knowing the patterns that work in production and the pitfalls that don't appear in tutorials.
That's what this series is for. Practitioner-tested guidance for running geospatial workloads in the cloud.