- Runtime 17.1+ brings 90+ native Spatial SQL functions (Public Preview) - no Sedona dependency, no UDF overhead. Databricks reports 17x faster spatial joins than Sedona on identical clusters
- Two spatial approaches now matter: Native Spatial SQL (fastest, simplest, recommended for new projects) and native H3 functions (40 built-in since Runtime 11.2, Photon-only). Mosaic is no longer in active development
- H3 indexing converts expensive geometry joins into integer equality joins. On large datasets (100M+ records), this changes query time from minutes to seconds - but the speedup depends heavily on data distribution and resolution choice
- Delta Lake is the storage advantage: Z-ordering by spatial columns gives 10-50x faster spatial queries, liquid clustering (Runtime 13.3+) automates this entirely, and time travel provides built-in audit trails for regulatory compliance
- Asset Bundles replace manual notebook deployment: one databricks.yml file defines jobs, clusters, and schedules across dev/staging/prod. CI/CD validates before deploy
- Cost comparison: a 4-node cluster running a weekly spatial join costs roughly $0.47 in compute per run. But factor in platform licensing, storage, and cluster idle time for the real picture
You have 50 million parcels. A weekly flood risk analysis that takes 4 weeks manually. And your boss wants it in the cloud. Which platform?
This post is about Databricks - production-tested, with patterns you can copy. These benchmarks come from real production pipelines processing reinsurance exposure data. The numbers are specific because they come from actual runs, not marketing slides.
Geospatial in Cloud Series
This is Part 1 of our Geospatial in Cloud series. Each post is self-contained. Part 2 covers AWS. Part 3 covers GCP. Part 4 covers Snowflake. Read the one that matches your stack.
Why Databricks for Geospatial
Before 2023, Databricks was just Spark with a notebook UI. Geospatial meant bolting on Apache Sedona, fighting with UDF serialisation, and hoping your cluster didn't run out of memory mid-join. That era is over.
Runtime 17.1 changed everything. Databricks now ships with 90+ native spatial SQL functions plus 40 H3 functions baked directly into the query engine. No external libraries to install, no UDFs to register, no JAR dependencies to manage. Spatial joins, buffer analysis, distance calculations, intersection tests - all expressed as standard SQL that your data analysts already know how to write. One caveat: the ST functions are still in Public Preview as of early 2026. The H3 functions (available since Runtime 11.2) are GA.
Native Spatial SQL on Photon (Runtime 17+)
ST_Contains, ST_Buffer, ST_Distance, ST_Intersection and 86 more functions run directly on Databricks' vectorised C++ engine. No serialisation overhead, no JVM interop, no UDF bottleneck. Databricks reports this as 17x faster than the same operations via Sedona on the same cluster.
Native H3 Spatial Indexing
Uber's hexagonal grid system, with 40 H3 functions built directly into the SQL engine since Runtime 11.2. When your spatial joins hit 100M+ records, H3 pre-indexing turns 47-minute operations into 12-second operations. It is the difference between overnight batch and interactive analysis.
Unity Catalog for Data Governance
Versioned geospatial datasets with lineage tracking. One governance model for vectors, rasters, and tabular data. Fine-grained access control and audit logs. No separate data management layer for spatial files.
GeoParquet as the Native Format
Read and write GeoParquet directly from cloud storage (S3, ADLS, GCS). No format conversion step, no Shapefile limitations. Columnar storage means the query engine reads only the columns referenced in your query, cutting I/O costs dramatically.
The real reason teams choose it: they already have Databricks for analytics. Adding geospatial to an existing lakehouse is incremental compute cost, not new platform cost. The spatial SQL functions work on the same tables, with the same permissions, in the same notebooks your team already uses.
Native Spatial SQL (Runtime 17+)
The most common geospatial operation is a spatial join: “which parcels fall inside which flood zones?” In Databricks Runtime 17+, this is standard SQL with spatial predicates. No special syntax, no Python wrappers, no external libraries. Your data analysts already know the language - they just need to learn a handful of spatial functions.
A spatial join uses ST_Contains to test containment. A buffer analysis uses ST_Buffer combined with ST_Intersects. Distance calculations use ST_Distance. The syntax is identical to PostGIS, making migration straightforward for teams with PostGIS experience.
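As a sketch, the two patterns above look like this in SQL. Table and column names (parcels, flood_zones, rivers, geometry) are illustrative, not from any real schema:

```sql
-- Spatial join: which parcels fall inside which flood zones?
SELECT p.parcel_id, f.zone_id
FROM parcels AS p
JOIN flood_zones AS f
  ON ST_Contains(f.geometry, p.geometry);

-- Buffer analysis: parcels within 500 units of a river centreline
-- (buffer distance follows the units of the geometry's CRS)
SELECT DISTINCT p.parcel_id
FROM parcels AS p
JOIN rivers AS r
  ON ST_Intersects(p.geometry, ST_Buffer(r.geometry, 500));
```

If you have written PostGIS, this should look familiar by design.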
Why Photon matters here: these spatial functions execute on Databricks' vectorised C++ engine. Traditional Spark would serialise geometry objects between Java and Python, losing performance at every boundary. Photon operates on columnar data natively. Databricks reports spatial joins running 17x faster than Sedona equivalents on identical cluster configurations - though independent benchmarks (SpatialBench, January 2026) show more nuanced results depending on query type.
KEY INSIGHT: MOST TUTORIALS ARE OUTDATED
Search for “Databricks geospatial” and most results still show the old Sedona approach: install the JAR, register UDFs, wrap everything in Python. With Runtime 17+, you skip all of that. Native Spatial SQL works out of the box. If you are starting a new geospatial project on Databricks today, start with native SQL. Only reach for Sedona or Mosaic when native SQL genuinely cannot handle your specific workload.
The 90+ spatial functions cover the full range of operations: spatial predicates (contains, intersects, within, touches, crosses), measurements (area, length, distance), transformations (buffer, centroid, convex hull, simplify, union), constructors (point, polygon, linestring from WKT/WKB/GeoJSON), and accessors (coordinates, SRID, envelope). For the vast majority of enterprise geospatial workloads - flood exposure, parcel analysis, proximity scoring, portfolio aggregation - native SQL covers everything.
For detail on the file formats that make this efficient, see our guide on cloud-native geospatial formats (GeoParquet, COG, STAC).
Sedona vs Mosaic vs Native SQL
The Databricks geospatial ecosystem has shifted significantly. Understanding what is current and what is legacy saves weeks of going down the wrong path.
| APPROACH | BEST FOR | STATUS | COMPLEXITY |
|---|---|---|---|
| Native Spatial SQL + H3 | All spatial operations including H3 indexing (90+ ST functions, 40 H3 functions) | Active (ST: Public Preview, Runtime 17.1+; H3: GA, Runtime 11.2+) | Lowest - just SQL |
| Apache Sedona | Complex geometry operations, spatial R-tree index, existing workflows | Active (independent project) | Medium - JAR dependencies |
| Mosaic | H3 tessellation (was the bridge before native H3) | No longer in active development | Locked to Runtime 13 |
Native Spatial SQL + H3 is the default choice. It covers 80-90% of enterprise geospatial workloads with zero setup overhead. Standard spatial joins, buffer analysis, distance calculations, area computations, and H3 indexing - all run directly on Photon with no external dependencies. Note: both the ST functions and H3 require Photon-enabled clusters (Pro or Serverless tier).
MOSAIC IS NO LONGER IN ACTIVE DEVELOPMENT
If you find tutorials recommending the Mosaic library for H3 operations, that advice is outdated. Mosaic 0.4.x only supports Runtime 13 and is locked to that version. Databricks now ships 40 native H3 functions built into the SQL engine. Use those instead. If you have existing Mosaic code, plan to migrate to native H3 functions.
Sedona remains valid for complex geometry. Operations not yet available in native SQL - Voronoi diagrams, advanced topology, custom R-tree indexes - still require Sedona. The SpatialBench benchmark from Apache Sedona (January 2026) showed Sedona delivering up to 6x better price-performance than Databricks Serverless on certain query types. Neither platform finished all benchmark queries. The reality: both have strengths depending on workload.
CRITICAL: SEDONAREGISTRATOR IS DEPRECATED
If you find tutorials telling you to call SedonaRegistrator.registerAll(), that API is deprecated in Sedona 1.5+. The modern initialisation uses SedonaContext. Better yet, skip Sedona entirely for standard operations and use native Spatial SQL. One fewer dependency to manage, one fewer JAR version to track, one fewer thing to break during runtime upgrades.
H3 Indexing (Native, Not Mosaic)
H3 is Uber's hexagonal hierarchical spatial index. Think of it as dividing the entire planet into hexagons at multiple resolutions. Every point on Earth maps to a specific hexagon ID at each resolution level. Databricks ships 40 native H3 functions (since Runtime 11.2) - no external library needed.
Why hexagons? Unlike squares, every neighbour of a hexagon sits at the same distance from its centre - a square grid has two neighbour distances (edge-adjacent and diagonal). This matters for spatial joins because the index prunes irrelevant partitions evenly, regardless of direction.
The practical difference is significant. H3 converts expensive geometry comparisons into integer equality joins. Instead of testing every geometry against every other geometry, H3 pre-filters to only compare geometries sharing the same hexagonal cell. For large datasets (100M+ records), this turns queries that take minutes into queries that take seconds. The exact speedup depends on data distribution, resolution choice, and cluster size - Databricks reports up to 90x cost reduction using H3-centric vs geometry-centric approaches.
The setup is straightforward: use the native h3_longlatash3 and h3_polyfillash3 functions to tessellate your geometries at the appropriate resolution. From that point forward, spatial joins use H3 cell lookups instead of full geometry comparisons. No Mosaic library needed - these are built into the SQL engine.
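A sketch of that tessellate-then-join flow. Table and column names are illustrative, and note that h3_longlatash3 takes (longitude, latitude, resolution), as the function name suggests:

```sql
WITH parcels_h3 AS (
  SELECT parcel_id,
         h3_longlatash3(lon, lat, 9) AS cell              -- point -> cell ID
  FROM parcels
),
zones_h3 AS (
  SELECT zone_id,
         explode(h3_polyfillash3(zone_geom_wkt, 9)) AS cell  -- polygon -> cells
  FROM flood_zones
)
-- The spatial join is now an integer equality join
SELECT p.parcel_id, z.zone_id
FROM parcels_h3 AS p
JOIN zones_h3 AS z ON p.cell = z.cell;
```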
Resolution choice matters. Resolution 9 (roughly 174m per hexagon) is optimal for urban analyses - parcel boundaries, building footprints, address-level work. Resolution 7 (roughly 1.2km per hexagon) suits regional analyses - flood zones, administrative boundaries, catchment areas. Going too fine wastes memory on index overhead; going too coarse defeats the purpose of pre-filtering. Most enterprise workloads settle on resolution 8 or 9.
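The resolution trade-off can be sketched as a small heuristic. The edge lengths for resolutions 7 and 9 come from the figures above; the resolution-8 value (~461 m) is the published H3 average; the chooser function itself is a hypothetical rule of thumb, not a Databricks API:

```python
# Approximate average H3 hexagon edge lengths in metres.
# Resolutions 7 and 9 match the figures quoted above; 8 is the
# published H3 average. All values are rough planet-wide averages.
H3_EDGE_M = {7: 1220.6, 8: 461.4, 9: 174.4}

def pick_resolution(feature_size_m: float) -> int:
    """Heuristic: pick the coarsest resolution whose hexagon edge is
    no larger than the typical feature being indexed. Coarser than
    that defeats pre-filtering; finer wastes memory on index overhead."""
    for res in sorted(H3_EDGE_M):          # coarsest first: 7, 8, 9
        if H3_EDGE_M[res] <= feature_size_m:
            return res
    return max(H3_EDGE_M)                  # tiny features: use the finest

print(pick_resolution(180))    # urban parcels (~180 m) -> 9
print(pick_resolution(1500))   # flood zones (~1.5 km) -> 7
```

Run the heuristic on a sample of your feature sizes before committing - tessellation at the wrong resolution is expensive to redo on 100M+ records.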
One critical detail on H3 coordinate order: Databricks' h3_longlatash3 takes (longitude, latitude) - as its name says - while the upstream H3 library and most of its language bindings use (latitude, longitude), and GIS tools differ again. This trips every team at least once. If your spatial joins return zero matches on data you know overlaps, check the coordinate order first.
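A cheap sanity check can catch the most blatant swaps before they surface as a silent zero-match join. This helper is a hypothetical heuristic, not part of any Databricks API, and it only catches swaps where the longitude exceeds 90 degrees:

```python
def looks_swapped(lon: float, lat: float) -> bool:
    """Flag pairs whose 'latitude' is outside [-90, 90] while the
    'longitude' would itself be a valid latitude - the classic
    symptom of swapped argument order. Cannot detect swaps where
    both values happen to fall within [-90, 90]."""
    return abs(lat) > 90 and abs(lon) <= 90

print(looks_swapped(139.7, 35.7))  # Tokyo, correct (lon, lat) -> False
print(looks_swapped(35.7, 139.7))  # Tokyo, swapped -> True
```

Running a check like this over a sample of rows before tessellating is far cheaper than debugging an empty join result.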
WHY H3 CHANGES EVERYTHING FOR LARGE JOINS
Databricks reports up to 90x cost reduction with an H3-centric approach over a geometry-centric one. The actual speedup varies by dataset, resolution, and cluster size.
WHEN NOT TO USE H3
Point-in-polygon with low cardinality (fewer than 10K polygons) - native ST_Contains is faster because H3 indexing overhead exceeds the join savings. The tessellation step itself has a cost. Only use H3 when your polygon count justifies it.
Unity Catalog for Rasters
Databricks is vector-first. But most real geospatial workflows involve rasters too - DEMs, satellite imagery, climate grids. Unity Catalog Volumes let you store and version these alongside your vector data.
THE TRAP NOBODY DOCUMENTS
Databricks Volumes appear to support standard file I/O, but any library that attempts random-access reads on a TIFF dies with `_tiffSeekProc: Operation not supported`. GDAL, rasterio, and any TIFF-based workflow will fail silently or throw cryptic errors.
The fix is a two-stage read: copy the file from the Volume to /local_disk0/tmp on the worker node, then open it with rasterio or GDAL from the local path. The /local_disk0/ path is ephemeral SSD attached to the instance - fast reads, proper seek support, wiped on cluster termination. This adds 2-5 seconds of copy time per file, but it is the only reliable pattern.
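A minimal sketch of the staging helper. Only the stdlib copy is shown as runnable code; the Volume path and the rasterio call in the usage comment are Databricks-side assumptions. Locally this stages into the system temp directory - on a cluster you would point LOCAL_TMP at /local_disk0/tmp:

```python
import shutil
import tempfile
from pathlib import Path

# On a Databricks worker, use Path("/local_disk0/tmp") instead.
LOCAL_TMP = Path(tempfile.gettempdir())

def stage_locally(volume_path: str) -> Path:
    """Copy a file from a FUSE-mounted Volume to local disk so that
    libraries needing random-access reads (GDAL, rasterio, SQLite-backed
    GeoPackage) can open it with proper seek support."""
    src = Path(volume_path)
    dst = LOCAL_TMP / src.name
    shutil.copy(src, dst)          # the 2-5 second per-file cost
    return dst

# Usage sketch on a cluster (illustrative path):
#   local = stage_locally("/Volumes/main/geo/rasters/dem.tif")
#   with rasterio.open(local) as ds:
#       band = ds.read(1)
#   local.unlink()   # keep /local_disk0/tmp from filling up
```

Deleting the staged copy after each file, as in the usage sketch, is what keeps long-running jobs from exhausting the ephemeral disk.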
THIS ALSO AFFECTS READS
GeoPackage reads fail on Volumes too. SQLite creates a -wal (Write-Ahead Log) file even for read operations. The FUSE mount cannot handle this. Stage to local disk first, read, then clean up. This catches everyone the first time - the error message gives no hint that the read path is the problem.
The same two-stage pattern applies on every cloud platform. S3 on AWS, GCS on Google Cloud - object stores do not expose POSIX file semantics, and the FUSE mounts layered over them handle random access poorly. Any format that requires seekable reads or in-place writes (GeoTIFF, GeoPackage, Shapefile) needs local staging first. We cover the AWS and GCP variants in their respective guides in this series.
For long-running jobs, clean up /local_disk0/tmp periodically. The ephemeral disk is finite and will fill up if you process thousands of rasters without clearing intermediate files. The disk is wiped on cluster termination, but during execution it is your responsibility to manage.
The upside of Unity Catalog: versioning, lineage tracking, and fine-grained access control on your geospatial datasets. One governance layer for everything - vectors, rasters, tabular data. No separate data management for spatial files.
Delta Lake for Spatial Data
Delta Lake is the storage layer that makes Databricks geospatial genuinely different from running Spark on raw Parquet files. Three capabilities matter for spatial workloads: physical data layout, time travel, and incremental processing.
Z-ordering by geospatial columns is the first thing to configure. When you run OPTIMIZE with ZORDER BY on latitude and longitude columns, Delta Lake physically reorganises the Parquet files so that spatially close records are stored together on disk. A spatial query that previously scanned 100% of files now skips 90-95% of them because the file-level statistics tell the engine those files contain no relevant geometries. For a 1TB dataset, this typically means 10-50x faster query times with zero code changes - just a one-time OPTIMIZE command.
Liquid clustering (Runtime 13.3+) is better than Z-ordering. Instead of requiring periodic manual OPTIMIZE runs, liquid clustering continuously reorganises data as it is written. Declare your clustering columns at table creation time and Delta Lake handles the rest. For geospatial tables that receive regular updates - sensor feeds, daily parcel snapshots, incremental flood model outputs - this eliminates the operational burden of scheduling OPTIMIZE jobs. The query performance is equivalent or better than Z-ordering, with none of the maintenance.
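As a sketch, the two layout options side by side (table and column names illustrative):

```sql
-- Option 1: one-time Z-order reorganisation, rerun periodically
OPTIMIZE parcels ZORDER BY (latitude, longitude);

-- Option 2: liquid clustering declared at creation (Runtime 13.3+),
-- no scheduled OPTIMIZE jobs needed
CREATE TABLE parcels_clustered (
  parcel_id BIGINT,
  latitude  DOUBLE,
  longitude DOUBLE,
  geometry  BINARY
)
CLUSTER BY (latitude, longitude);
```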
Time travel is the audit trail that regulators love. Every write to a Delta table creates a new version. You can query any previous version by number or timestamp. For regulated industries - insurance, banking, government - this means you can reproduce exactly what the spatial analysis showed on any given date. No separate versioning system, no manual snapshots, no "which version of the flood model was current on March 15th?" questions. Delta keeps every version automatically, with a configurable retention period (default 7 days, extendable for compliance).
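Time travel in practice - the table name, date, and version number here are illustrative:

```sql
-- What did the analysis show on 15 March?
SELECT * FROM flood_exposure TIMESTAMP AS OF '2025-03-15';

-- Or pin an exact version number
SELECT * FROM flood_exposure VERSION AS OF 42;

-- Who changed what, and when
DESCRIBE HISTORY flood_exposure;
```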
Change Data Feed (CDF) enables incremental processing. Enable CDF on a Delta table and you get a log of every insert, update, and delete. For spatial pipelines that run daily on datasets where only 1-2% of records change, this means processing only the changes instead of the entire dataset. A nightly parcel update that used to reprocess 50M records now processes only the 500K that changed. The cost reduction compounds: less compute, less time, less money.
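Enabling and reading the feed looks roughly like this (table name and version numbers illustrative):

```sql
-- Switch on the Change Data Feed for an existing table
ALTER TABLE parcels
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Process only what changed between two versions
SELECT * FROM table_changes('parcels', 41, 42)
WHERE _change_type IN ('insert', 'update_postimage');
```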
Deployment with Asset Bundles
The biggest operational difference between a prototype notebook and a production geospatial pipeline is deployment. Asset Bundles solve this with a single declarative YAML file - databricks.yml - that defines everything: jobs, cluster configurations, schedules, and environment-specific overrides for dev, staging, and production.
Instead of manually configuring jobs through the UI (which inevitably leads to drift between environments), you declare the entire pipeline as code. A single databricks bundle deploy command validates the configuration, checks for misconfigurations, and deploys to the target environment. The bundle validates before deploying - catching issues like non-existent cluster policies or invalid instance types before they fail at runtime.
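A minimal databricks.yml sketch - the bundle name, job name, notebook path, node type, and runtime string are all illustrative:

```yaml
bundle:
  name: flood-risk-pipeline

resources:
  jobs:
    weekly_flood_join:
      name: weekly-flood-join
      schedule:
        quartz_cron_expression: "0 0 6 ? * MON"
        timezone_id: UTC
      tasks:
        - task_key: spatial_join
          notebook_task:
            notebook_path: ./notebooks/flood_join.py
          new_cluster:
            spark_version: 17.1.x-photon-scala2.13
            node_type_id: i3.xlarge
            num_workers: 4

targets:
  dev:
    default: true
  prod:
    workspace:
      root_path: /Shared/.bundle/prod
```

From there, `databricks bundle validate` catches misconfigurations and `databricks bundle deploy -t prod` pushes to production.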
For geospatial teams migrating from ArcPy, this replaces the "copy the script to the server and hope it works" deployment model. Your spatial pipeline, its cluster configuration, its schedule, and its environment variables all live in version control. CI/CD pipelines can deploy automatically on merge. No more "it worked on my machine" debugging sessions.
MODULE CACHE: THE TRAP THAT WASTES HOURS
When you update a shared Python module during development, the Spark driver picks up the new code immediately - but the executors still have the old version cached from a previous task. Your driver and workers are running different code versions. The fix is dbutils.library.restartPython() after any module update. For production jobs, pin module versions in your cluster library configuration rather than relying on notebook-scoped imports. This eliminates the cache problem entirely.
Real Benchmarks
These benchmarks compare the same operations on identical datasets. ArcPy ran single-threaded on a high-spec desktop. Databricks ran distributed across a 4-node cluster (16 vCPU, 122 GB RAM total). This is not a fair hardware comparison - it is a comparison of workflow approaches: single-desktop GIS vs distributed cloud processing.
CONTEXT ON THESE NUMBERS
Any distributed engine (PostGIS with parallel queries, Sedona, DuckDB Spatial on equivalent hardware) would also massively outperform single-threaded ArcPy. The value of these benchmarks is not "Databricks is 368x faster than everything" - it is "moving from desktop GIS to distributed cloud processing changes the game for large datasets". The specific speedup depends on your cluster size, data volume, and query complexity.
ARCPY VS DATABRICKS - SAME DATA, SAME OPERATIONS
- Spatial join, 50M parcels against flood zones: 847 seconds in ArcPy vs 2.3 seconds on Databricks - 368x faster
- Two further operations from the same suite ran 173x and 108x faster
CLUSTER CONFIGURATION
- Cluster: 4x i3.xlarge workers (4 vCPU, 30.5 GB RAM each)
- Runtime: 17 LTS with Photon enabled (native Spatial SQL)
- Data format: GeoParquet on Delta Lake
- ArcPy machine: Dell Precision 7920 (Xeon W-2295, 128GB RAM, $7K ESRI Advanced licence)
The 368x number gets attention, but context matters. That ArcPy run was single-threaded on a desktop. Any distributed engine would be dramatically faster. The more useful comparison is cost: that ArcPy setup needed a $7,000/year ArcGIS Advanced licence on a $3,000 desktop. The Databricks run cost $0.47 in marginal compute. At 52 weekly runs, that is $24.44/year in compute alone - but add Databricks platform licensing, storage, and cluster idle time for the real total.
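The arithmetic behind that annual figure, using only the numbers quoted above:

```python
per_run_compute_usd = 0.47   # 4-node cluster, one weekly spatial join
runs_per_year = 52

compute_per_year = round(per_run_compute_usd * runs_per_year, 2)
print(compute_per_year)      # 24.44 - compute only: excludes platform
                             # licensing, storage, and cluster idle time
```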

Cost Comparison vs ESRI
Beyond raw performance, the cost difference compounds over time. Here's a side-by-side for a mid-sized geospatial team (20-50 users, 1TB vector data, weekly batch analysis).
ESRI STACK
YEAR 1 TOTAL
$60K-$200K+
YEAR 3 TOTAL
$180K-$600K+
DATABRICKS STACK
YEAR 1 TOTAL
$5K-$15K
YEAR 3 TOTAL
$15K-$45K
CAVEAT: MIGRATION ISN'T FREE
Year 1 includes migration effort. Budget 4-8 weeks of engineering time to migrate ArcPy scripts and retrain analysts. This is real cost that most cloud vendors conveniently forget to mention. For teams evaluating this transition, our ArcPy migration playbook covers the ecosystem evaluation in detail.
When NOT to Use Databricks for Geospatial
Databricks is excellent for the right workloads. But it is the wrong choice in these scenarios:
1. Small datasets (< 1M records)
PostGIS on a single server is simpler, cheaper, and fast enough. Databricks overhead (cluster startup, job scheduling) isn't worth it for small data. A $20/month managed Postgres handles most small-to-medium workloads better.
2. Real-time spatial queries
Databricks is batch-oriented. Cluster startup alone takes 2-5 minutes. For sub-second spatial queries serving a web application, use PostGIS or a dedicated spatial index (R-tree). Databricks will never match the latency of a warm in-memory spatial index.
3. Heavy raster processing
Databricks is vector-first. For large-scale raster analysis (satellite imagery time series, DEM processing, multi-band classification), consider Google Earth Engine or a dedicated raster processing pipeline. The FUSE filesystem limitation with TIFFs makes raster-heavy work painful.
4. Teams without existing Databricks
If your organisation doesn't already use Databricks, the overhead of adopting it JUST for geospatial is rarely justified. Platform licensing, training, infrastructure setup - the incremental cost argument vanishes when there is no existing investment. Start with PostGIS or DuckDB Spatial.
5. Desktop workflows that work fine
If an analyst processes 10 files a week in QGIS and is happy with the results, don't migrate for the sake of it. Cloud migration has a real productivity cost during transition. The analyst who knew every ArcGIS shortcut is now a beginner in notebooks. Only migrate when scale demands it.
Getting Started
If you've read this far and Databricks still makes sense for your workloads, here's the minimum viable setup:
1. Cluster Configuration
Start with 2x i3.xlarge workers (4 vCPU, 30.5GB RAM each). Enable Photon. Use Runtime 17 LTS for native Spatial SQL. Enable autoscaling (min 2, max 8) to handle varying workloads without overpaying. Use cluster pools to reduce startup time from 3-8 minutes to under 60 seconds - pools keep idle instances warm and ready. This configuration handles most mid-sized geospatial workloads (up to 100M records) comfortably.
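In Clusters API JSON terms, that starting configuration looks roughly like this - the cluster name and runtime version string are illustrative:

```json
{
  "cluster_name": "geospatial-shared",
  "spark_version": "17.1.x-photon-scala2.13",
  "node_type_id": "i3.xlarge",
  "runtime_engine": "PHOTON",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```

To draw from a cluster pool instead, replace node_type_id with the pool's instance_pool_id.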
2. Use Native H3 for Large Joins (optional)
The native h3_longlatash3 and h3_polyfillash3 functions ship with the runtime - nothing to install. Add an H3 indexing step only if your workloads involve 50M+ record spatial joins where pre-filtering provides a measurable benefit. For smaller workloads, native Spatial SQL alone is sufficient. (Older guides suggest installing the databricks-mosaic package here; skip it - Mosaic is no longer in active development.)
3. Load GeoParquet from Cloud Storage
Read GeoParquet files directly from S3, Azure Data Lake Storage, or GCS using Spark's native GeoParquet reader. No format conversion required. The reader handles geometry deserialisation automatically, and columnar access means you only read the columns your query references.
4. Register as a Delta Table and Query
Write your GeoParquet data to a Delta Lake table registered in Unity Catalog. From that point forward, any spatial SQL query works directly against the table - ST_Contains, ST_Buffer, ST_Distance, and the full set of 90+ spatial functions. Your analysts can query it from SQL notebooks, BI tools, or programmatic APIs.
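Steps 3 and 4 as one sketch, assuming the GeoParquet geometry column arrives as WKB; every table, path, and column name is illustrative:

```sql
-- 3) Read GeoParquet straight from object storage,
-- 4) register it as a Delta table in Unity Catalog
CREATE TABLE main.geo.parcels
USING DELTA AS
SELECT parcel_id, geometry              -- WKB binary column
FROM parquet.`s3://your-bucket/parcels/`;

-- Query with spatial predicates, building geometries from WKB
SELECT count(*) AS exposed_parcels
FROM main.geo.parcels AS p
JOIN main.geo.flood_zones AS f
  ON ST_Contains(ST_GeomFromWKB(f.geometry), ST_GeomFromWKB(p.geometry));
```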
From zero to running spatial queries: about 30 minutes if you already have a Databricks workspace. Most of that time is waiting for the cluster to start. The spatial capability itself requires no additional infrastructure.
Frequently Asked Questions
Can Databricks handle geospatial data?
Yes. Databricks has 90+ native Spatial SQL functions, 40 native H3 indexing functions, and native GeoParquet support. It handles vector data exceptionally well at scale (100M+ records). Raster support is improving but currently requires workarounds for random-access file reads.
Is Databricks faster than ArcGIS for geospatial analysis?
For large-scale batch operations, significantly faster. A spatial join on 50M parcels runs in 2.3 seconds on Databricks vs 847 seconds in ArcPy (368x faster). However, for small datasets or real-time queries, a simpler tool like PostGIS may be more appropriate.
How much does Databricks cost for geospatial workloads?
A weekly spatial analysis job on a 4-node cluster costs approximately $0.47 per run ($24/year). Compare this to an ArcGIS Advanced licence at $7,000/year. Storage on Delta Lake costs roughly $23/month per TB of vector data.
Databricks is not the right choice for everyone. But for teams that already have the platform, adding geospatial is one of the highest-ROI moves available.
368x faster. 90% cheaper. And your data analysts can write the queries in SQL they already know. The barrier isn't technology - it's knowing the patterns that work in production and the pitfalls that don't appear in tutorials.
That's what this series is for. Practitioner-tested guidance for running geospatial workloads in the cloud.