Technical Deep-Dive

Cloud-Native Geospatial Formats

A pragmatic guide to COG, GeoParquet, and STAC. When they work, when they don't, and how to evaluate whether migration makes sense for your organisation.

PUBLISHED: JAN 2025
READ TIME: 25 MIN
CATEGORY: TECHNICAL
AUTHOR: AXIS SPATIAL
[Figure: legacy server infrastructure transitioning to modern cloud-native architecture]

If you're running geospatial operations at scale (flood risk assessments, utility asset inventories, satellite imagery analysis) you're almost certainly paying a hidden tax.

This tax doesn't appear on any invoice. It manifests as analysts waiting 15 minutes for a 10GB GeoTIFF to download when they need data from a 500-metre radius. Cloud egress bills that spike every time someone runs a spatial query. Data scientists who can't use your geospatial assets because they don't integrate with Databricks or Snowflake.

At a global reinsurer we worked with, a single country-level risk assessment required downloading 47GB of data to extract features from a 2 km² area. The workflow took 3-4 weeks. After migrating to cloud-native formats, the same analysis runs in 30 minutes. (For more on the business impact of such delays, see the hidden cost of manual workflows.)


What Problems Do Cloud-Native Geospatial Formats Solve?

Cloud-native formats solve three problems: inefficient full-file downloads, analytics platform integration gaps, and data discovery challenges. Traditional formats like Shapefile and GeoTIFF require downloading entire files to access a small region. COG, GeoParquet, and STAC enable HTTP range requests, columnar analytics, and searchable metadata. Before diving into solutions, let's be precise about what we're solving.

PROBLEM 1: THE "DOWNLOAD EVERYTHING" ARCHITECTURE

Traditional raster formats (GeoTIFF, JPEG2000) and vector formats (Shapefile, File Geodatabase) were designed with an assumption: you have the file locally.

Example: Your organisation stores 50TB of aerial imagery on S3. An analyst needs to extract building footprints for a flood zone covering 0.1% of your total imagery extent. The traditional workflow:

  • Identify which files intersect area of interest
  • Download each complete file (potentially hundreds of GB)
  • Extract the relevant pixels
  • Discard 99.9% of the downloaded data

The precision access principle: extract only what you need from the whole.

PROBLEM 2: THE ANALYTICS INTEGRATION GAP

Modern data infrastructure has converged on columnar formats. Databricks, Snowflake, BigQuery - every serious data platform is optimised for Apache Parquet.

Geospatial data exists in a parallel universe. Your data engineers have built sophisticated pipelines for customer data, transaction logs, and operational metrics. But geospatial assets? Those live in a separate GIS silo, accessible only through specialised tools.

PROBLEM 3: THE DISCOVERY PROBLEM

Where is the imagery covering the northern distribution network from before the 2023 storm event? If answering that question requires emailing three colleagues, searching through nested folders, and checking multiple spreadsheets - you have a discovery problem that leads to duplicate data purchases and missed analysis opportunities.

How Does Cloud-Optimized GeoTIFF (COG) Work?

COG reorganises standard GeoTIFF with internal tiling, overviews, and header-first structure, enabling HTTP range requests to read specific pixels without downloading the full file. Reading 1 km² from a 50GB raster goes from 14 minutes (full download) to 2.3 seconds (range request). COG isn't a new format - it's a GeoTIFF organised in a specific way that enables efficient cloud access.

How COG Works

A standard GeoTIFF stores pixels in sequential strips. To read any pixel, you typically need to read from the beginning of the file. A COG reorganises this structure with three key features:

1. Internal Tiling

The raster is divided into fixed-size tiles (typically 512x512 pixels). Each tile is independently addressable.

2. Overviews (Pyramids)

Pre-computed reduced-resolution versions stored within the same file. Fast zoom-out views without reading full-resolution data.

3. HTTP Range Request Compatibility

Clients calculate exactly which bytes contain their area of interest. Cloud storage (S3, Azure Blob) supports fetching specific byte ranges.
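
In practice, a client library does the byte-range arithmetic for you. Here's a minimal sketch using rasterio, which reads remote COGs through GDAL's HTTP support; the URL and coordinates are hypothetical:

import rasterio
from rasterio.windows import from_bounds

# Hypothetical COG on cloud storage; rasterio streams it over HTTP
URL = "https://example.com/imagery/scene_cog.tif"

with rasterio.open(URL) as src:
    # Pixel window covering a 1 km x 1 km box, in the raster's CRS
    window = from_bounds(552000, 5270000, 553000, 5271000,
                         transform=src.transform)
    band = src.read(1, window=window)  # only the intersecting tiles are fetched

print(band.shape)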


Tile-based access: request only the cell you need, leave the rest untouched.

PERFORMANCE BENCHMARK: READ 1 KM² FROM A 50GB RASTER

Standard GeoTIFF: 847 sec (download entire file, extract region)

Cloud-Optimized GeoTIFF: 2.3 sec (fetch only required tiles via range request)

368x faster. Improvement scales with dataset size; for 500GB archives, the speedup can exceed 1,000x.

Creating COGs: What Actually Matters

GDAL's dedicated COG driver (available since GDAL 3.1) handles tiling, overviews, and layout in a single step:

gdal_translate input.tif output_cog.tif \
  -of COG \
  -co COMPRESS=DEFLATE \
  -co OVERVIEW_RESAMPLING=LANCZOS \
  -co BLOCKSIZE=512

Compression: DEFLATE provides good compression ratios with fast decompression. JPEG is appropriate for imagery where some quality loss is acceptable (70-80% smaller files).

Tile Size: Larger 512x512 tiles can improve performance for large-area queries, at the cost of slightly more data transfer for small queries than the smaller 256x256 alternative.

VALIDATION: DON'T SKIP THIS

A file can have internal tiles without being a valid COG. Always validate:

rio cogeo validate output_cog.tif

We've seen organisations "convert" to COG without validation, then wonder why performance didn't improve. The conversion failed silently.
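
For conversion pipelines, the same check is available programmatically. A minimal sketch, assuming rio-cogeo's Python API, where cog_validate returns a validity flag plus lists of errors and warnings:

from rio_cogeo.cogeo import cog_validate

# Returns (is_valid, errors, warnings) -- fail the pipeline on an invalid COG
is_valid, errors, warnings = cog_validate("output_cog.tif")
if not is_valid:
    raise RuntimeError(f"COG validation failed: {errors}")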

Deep Dive: GeoParquet

If COG solves the raster problem, GeoParquet addresses vector data. But it's more than "Shapefile, but faster."

The Columnar Advantage

Shapefiles and File Geodatabases store data row by row. To read any attribute of any feature, you typically scan through records sequentially. Parquet stores data column by column - optimised for analytics queries that access specific attributes, aggregate across records, or filter based on values.

Query Type                 | Shapefile | FGDB | GeoParquet | Improvement (vs FGDB)
Count all features         | 34s       | 12s  | 0.8s       | 15x
Select by attribute        | 89s       | 31s  | 1.2s       | 26x
Spatial join (1M features) | 847s      | 312s | 28s        | 11x
Load into Pandas           | 156s      | 87s  | 4.2s       | 21x

Benchmark on a 15-million-parcel dataset.


Row vs column: scattered access versus precise selection of what you need.

Creating GeoParquet: Practical Code

import geopandas as gpd

# Read source data
gdf = gpd.read_file("input.shp")

# Write to GeoParquet
gdf.to_parquet(
    "output.parquet",
    compression="snappy",  # Fast decompression
    index=False
)
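
The columnar payoff shows up on the read side: Parquet deserialises only the columns a query touches. A minimal sketch (the column names here are hypothetical):

import geopandas as gpd

# Only these two columns are read from disk; all other attributes are skipped
gdf = gpd.read_parquet("output.parquet", columns=["parcel_id", "geometry"])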

If you're transitioning from ArcGIS, our ArcPy to GeoPandas translation guide covers the complete workflow conversion process.

SPATIAL INDEXING: THE MISSING PIECE

GeoParquet doesn't include built-in spatial indexing. For large datasets with frequent spatial queries, three techniques help (a sketch combining the first two follows this list):

1. Row Group Filtering: Organise data so spatially proximate features are in the same row groups.
2. H3/S2 Cell Index: Generate cell indices as additional columns. Query by cell ID first.
3. Partitioning: Partition files by geographic region (state, grid cell).
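
A minimal sketch combining row-group locality with an H3 cell index, assuming the h3 package's v4 API; the file names, column names, and resolution are illustrative:

import geopandas as gpd
import h3

gdf = gpd.read_parquet("parcels.parquet")

# Assign each feature an H3 cell from its centroid (resolution 7, ~5 km² cells)
centroids = gdf.geometry.centroid
gdf["h3_07"] = [h3.latlng_to_cell(pt.y, pt.x, 7) for pt in centroids]

# Sorting by cell ID clusters spatially proximate features into the same
# row groups; row_group_size passes through to the underlying pyarrow writer
gdf = gdf.sort_values("h3_07")
gdf.to_parquet("parcels_indexed.parquet", row_group_size=50_000)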

Deep Dive: STAC (SpatioTemporal Asset Catalog)

COG and GeoParquet solve access and analytics problems. STAC solves discovery.

STAC is a specification for metadata, not a data format. It defines a standard JSON structure for describing geospatial assets - what they contain, where they are, when they were captured, and how to access them.

CATALOG

The root container. Points to child catalogs or collections.

COLLECTION

A logical grouping of related items (e.g., "Sentinel-2 Level-2A imagery for North America").

ITEM

A single spatiotemporal unit - one scene, one time slice, one coherent dataset.

ASSET

A specific file or resource associated with an item (the red band, the thumbnail, the metadata JSON).
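
To make the hierarchy concrete, here's a minimal sketch building one item with a single COG asset via the pystac library; the ID, footprint, and URL are hypothetical:

import pystac
from datetime import datetime, timezone

item = pystac.Item(
    id="aerial-2023-07-14-tile-042",  # hypothetical scene ID
    geometry={
        "type": "Polygon",
        "coordinates": [[[-122.5, 47.5], [-122.0, 47.5], [-122.0, 48.0],
                         [-122.5, 48.0], [-122.5, 47.5]]],
    },
    bbox=[-122.5, 47.5, -122.0, 48.0],
    datetime=datetime(2023, 7, 14, tzinfo=timezone.utc),
    properties={},
)

# An asset is just a described link to the actual file
item.add_asset(
    "visual",
    pystac.Asset(
        href="https://example.com/imagery/tile_042_cog.tif",  # hypothetical
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)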

Why This Matters at Scale

Consider: 10 years of aerial imagery, satellite data from 3 providers, LiDAR from 5 acquisition projects, and derived products like DEMs. Without STAC, finding relevant data requires knowing folder structures, understanding naming conventions, manually checking date ranges. With STAC, you query a single API:

from pystac_client import Client

catalog = Client.open("https://your-stac-api.com")

# Find all Sentinel-2 imagery for summer 2023
results = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-122.5, 47.5, -122.0, 48.0],
    datetime="2023-06-01/2023-08-31",
    query={"eo:cloud_cover": {"lt": 20}}
)

for item in results.items():
    print(f"{item.id}: {item.datetime}")

When NOT to Use These Formats

Cloud-native formats aren't universally superior. Here's when traditional approaches may be appropriate:

Small, Frequently Updated Datasets

Under 100MB with daily updates - the overhead of maintaining COG structure may exceed benefits.

Real-Time Streaming Data

COG and GeoParquet are designed for data at rest. For IoT streams, look at Kafka or streaming GIS solutions.

Desktop-Heavy Workflows

If users primarily work in desktop GIS against locally stored files, they won't see the gains: range-request benefits only manifest when data is read over HTTP.

Regulatory Format Requirements

If contracts specify Shapefile or FGDB delivery, you'll need those formats regardless of internal infrastructure.

Integration with Modern Data Platforms

The strategic value of cloud-native geospatial formats is integration with enterprise data infrastructure.

DATABRICKS

Native GeoParquet support through Spark. The Mosaic library adds spatial functions. Store in Delta Lake for ACID transactions and time travel.

# Requires a spatial Spark library (e.g. Apache Sedona or Databricks Mosaic)
# that registers the geoparquet reader and st_* functions
df = spark.read.format("geoparquet").load("s3://bucket/parcels/")
df.select(st_area("geometry"), st_centroid("geometry")).show()

SNOWFLAKE

GEOGRAPHY type handles WKT/WKB geometries with native spatial functions. Use external tables pointing to GeoParquet on cloud storage.

SELECT parcel_id, ST_AREA(geometry) as area_m2
FROM parcels_ext
WHERE ST_CONTAINS(aoi_polygon, geometry);

BIGQUERY

BigQuery GIS provides native GEOGRAPHY type with extensive spatial functions. Query GeoParquet directly. Partition tables by geography to optimise costs.


The modern stack: cloud-native formats integrate seamlessly with enterprise data infrastructure.

Migration Strategy: A Practical Roadmap

Migration isn't a weekend project. Here's a phased approach that minimises risk.


The modernisation journey: each step lighter than the last.

Phase 1: Assessment (2-4 weeks)
  • Inventory data assets by format type
  • Analyse access patterns (who queries what, how often)
  • Quantify current pain points
  • Identify pilot candidates

Phase 2: Proof of Concept (4-6 weeks)
  • Convert pilot datasets (raster -> COG, vector -> GeoParquet)
  • Create STAC catalog entries
  • Benchmark rigorously with real workflows
  • Validate platform integration

Phase 3: Pilot Production (2-3 months)
  • Migrate first production workload
  • Run parallel operations initially
  • Monitor daily, document issues
  • Build internal expertise

Phase 4: Full Rollout (3-6 months)
  • Systematic migration prioritised by ROI
  • Maintain format conversion as data is accessed
  • Archive original formats until confident
  • Decommission legacy infrastructure

Questions to Ask Before You Start

1. What specific problem are you solving?

"Cloud-native is modern" isn't a business case. Quantify the pain: hours lost, dollars spent, opportunities missed.

2. Do you have the skills?

COG and GeoParquet require different tooling than traditional GIS. GeoPandas, DuckDB, and cloud platforms may be new to your team. See our guide on training GIS teams for workflow automation.

3. What's your access pattern?

If most queries are full-dataset exports, range requests don't help. If queries target specific regions, benefits are substantial.

4. Who are your users?

Desktop GIS users see less benefit. Cloud-native analysts, data scientists, and application developers see more.

5. What's your timeline?

Migration done well takes 6-12 months for a large organisation. Rushed migration creates technical debt.

6. What's your fallback?

If migration fails, can you revert? Maintain original data until the new system is proven.

Cloud-native geospatial formats (COG, GeoParquet, STAC) aren't magic. They're engineering solutions to specific problems.

For organisations with large geospatial data volumes, cloud infrastructure, and analytical workloads, these formats can deliver 10-1000x performance improvements and integrate geospatial data into modern data platforms.

The decision isn't ideological. It's practical: quantify your current costs, estimate the improvement, and evaluate whether the investment makes sense.


NEXT STEP

Ready to Evaluate Your Geospatial Infrastructure?

Our free workflow assessment analyses your current data volumes, access patterns, and pain points to determine whether cloud-native migration makes sense for your organisation.