geoparquet-io

Fast I/O and transformation tools for GeoParquet files using PyArrow and DuckDB.

Features

  • Fast: Built on PyArrow and DuckDB for high-performance operations
  • Pipeable: Chain commands with Unix pipes using Arrow IPC streaming - no intermediate files
  • Comprehensive: Sort, extract, partition, enhance, validate, and upload GeoParquet files
  • Cloud-Native: Read from and write to S3, GCS, Azure, and HTTPS sources
  • Spatial Indexing: Add bbox, H3 hexagonal cells, KD-tree partitions, and admin divisions
  • Best Practices: Automatic optimization following GeoParquet 1.1 and 2.0 specs
  • Parquet Geo Types support: Read and write Parquet geometry and geography types
  • Flexible: CLI and Python API for any workflow
  • Tested: Extensive test suite across Python 3.10-3.13 and all platforms
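
To give a feel for what a KD-tree partition is, the sketch below splits a set of points into balanced groups by cutting at the median along alternating axes. It is a pure-Python illustration of the general technique, not geoparquet-io's implementation, and `kdtree_partition` is a hypothetical helper name invented for this example.

```python
def kdtree_partition(points, depth, axis=0):
    """Split (x, y) points into up to 2**depth groups.

    Each level sorts the points along one axis, cuts at the median,
    and recurses on both halves using the other axis.
    Illustrative only; not the library's code.
    """
    if depth == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = 1 - axis  # alternate between x (0) and y (1)
    return (
        kdtree_partition(ordered[:mid], depth - 1, next_axis)
        + kdtree_partition(ordered[mid:], depth - 1, next_axis)
    )


# A 4 x 4 grid of points split into four partitions of equal size
grid = [(x, y) for x in range(4) for y in range(4)]
parts = kdtree_partition(grid, depth=2)
```

Because every cut is at the median, the partitions hold roughly the same number of features even when the data is spatially clustered, which is the usual reason to prefer this over a fixed grid.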

Quick Example

# Install
pip install geoparquet-io

# Convert Shapefile/GeoJSON/GeoPackage/CSV to optimized GeoParquet
gpio convert input.shp output.parquet

# Inspect file structure and metadata
gpio inspect myfile.parquet

# Check file quality and best practices
gpio check all myfile.parquet

# Add bounding box column for faster queries
gpio add bbox input.parquet output.parquet

# Sort using Hilbert curve for spatial locality
gpio sort hilbert input.parquet output_sorted.parquet

# Partition into separate files by country
gpio partition admin buildings.parquet output_dir/

# Chain commands with Unix pipes - no intermediate files
gpio extract --limit 10000 input.parquet | gpio add bbox - | gpio sort hilbert - output.parquet
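
For intuition about what Hilbert sorting does, here is a minimal pure-Python sketch of the idea: map each point to a cell on a grid, compute that cell's position along a Hilbert curve, and sort by that position. This is not how geoparquet-io implements `gpio sort hilbert` (the tool is built on PyArrow and DuckDB); `hilbert_index` and `sort_points_hilbert` are hypothetical names used only for illustration.

```python
def hilbert_index(order, x, y):
    """Distance along the Hilbert curve of cell (x, y) on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical layout
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def sort_points_hilbert(points, bounds, order=16):
    """Sort (x, y) points by the Hilbert index of their grid cell.

    bounds is (xmin, ymin, xmax, ymax) of the whole dataset.
    """
    xmin, ymin, xmax, ymax = bounds
    n = 1 << order
    width = (xmax - xmin) or 1.0   # avoid division by zero for degenerate extents
    height = (ymax - ymin) or 1.0

    def key(pt):
        gx = min(n - 1, int((pt[0] - xmin) / width * n))
        gy = min(n - 1, int((pt[1] - ymin) / height * n))
        return hilbert_index(order, gx, gy)

    return sorted(points, key=key)


pts = [(0.9, 0.1), (0.1, 0.1), (0.9, 0.9), (0.1, 0.9)]
ordered = sort_points_hilbert(pts, bounds=(0.0, 0.0, 1.0, 1.0), order=1)
```

Points that are close on the curve are also close in space, so after sorting, each Parquet row group tends to cover a compact area and a spatial filter can skip most row groups.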

Why geoparquet-io?

GeoParquet is a cloud-native geospatial data format that combines the efficiency of Parquet with geospatial capabilities. This toolkit helps you:

  • Optimize file layout for cloud-native access patterns
  • Add spatial indices for faster queries and analysis
  • Validate compliance with GeoParquet best practices
  • Transform large datasets efficiently using columnar operations
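
As a rough illustration of why a bounding box column speeds up queries, the sketch below precomputes a bbox per feature and uses a cheap rectangle test to discard features that cannot intersect a query window before any geometry is decoded. The field names `xmin`, `ymin`, `xmax`, `ymax` follow the GeoParquet 1.1 bbox covering; everything else (`BBox`, `bbox_of`, `intersects`, `candidates`, the sample rows) is hypothetical and is not geoparquet-io's API.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bbox_of(coords):
    """Bounding box of a sequence of (x, y) coordinates."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return BBox(min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if two boxes overlap or touch."""
    return (a.xmin <= b.xmax and a.xmax >= b.xmin
            and a.ymin <= b.ymax and a.ymax >= b.ymin)


def candidates(rows, query):
    """Keep only rows whose stored bbox can intersect the query window."""
    return [row for row in rows if intersects(row["bbox"], query)]


# Sample rows: the bbox is computed once and stored next to the geometry
rows = [
    {"id": "a", "bbox": bbox_of([(0, 0), (1, 1)])},
    {"id": "b", "bbox": bbox_of([(5, 5), (6, 7)])},
    {"id": "c", "bbox": bbox_of([(0.5, 0.5), (3, 3)])},
]
hits = candidates(rows, BBox(0.8, 0.8, 2.0, 2.0))
```

In a Parquet file the same comparison can be evaluated against column statistics, so entire row groups are skipped without being read.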

Getting Started

New to geoparquet-io? Start here:

Command Reference

  • convert - Convert vector formats to optimized GeoParquet
  • inspect - Examine file metadata and preview data
  • meta - Deep dive into file structure and metadata
  • extract - Filter and subset GeoParquet files
  • check - Validate files and fix issues automatically
  • sort - Spatially sort using Hilbert curves
  • add - Enhance files with spatial indices
  • partition - Split files into optimized partitions
  • upload - Upload files to cloud storage (S3, GCS, Azure)
  • stac - Generate STAC metadata for datasets
  • benchmark - Compare conversion performance
  • piping - Chain commands with Unix pipes

Python API

Use gpio programmatically for the best performance and integration with Python workflows:

import geoparquet_io as gpio

# Read, add a bbox column, sort along a Hilbert curve, and write the result
(
    gpio.read('input.parquet')
    .add_bbox()
    .sort_hilbert()
    .write('output.parquet')
)

Concepts

Support

License

Apache 2.0 - See LICENSE for details.