What is GeoParquet?¶
GeoParquet is a standardized way to store geospatial vector data in Apache Parquet files. It combines Parquet's columnar storage efficiency with metadata conventions for geometry columns.
Why GeoParquet?¶
Traditional geospatial formats have limitations:
| Format | Limitation |
|---|---|
| Shapefile | 2GB size limit, separate files, limited data types |
| GeoJSON | Text-based (slow, large), no streaming |
| GeoPackage | Single-file but not cloud-optimized |
GeoParquet solves these by leveraging Parquet:
- No size limits: Handle billions of features
- Fast queries: Columnar storage, predicate pushdown
- Cloud-native: Partial reads from S3, GCS, Azure
- Type-rich: Full support for complex data types
- Compression: Efficient storage with ZSTD, LZ4, etc.
Key Concepts¶
Geometry Column¶
GeoParquet files have one or more geometry columns storing spatial data. The format supports:
- Points, LineStrings, Polygons
- Multi-geometries (MultiPoint, MultiLineString, MultiPolygon)
- GeometryCollections
Geometries are stored as Well-Known Binary (WKB) encoded data.
GeoParquet Metadata¶
GeoParquet adds a geo key to Parquet's key-value metadata containing:
{
"version": "1.1.0",
"primary_column": "geometry",
"columns": {
"geometry": {
"encoding": "WKB",
"geometry_types": ["Polygon", "MultiPolygon"],
"crs": { ... },
"covering": { ... }
}
}
}
This metadata tells tools how to interpret the geometry data.
Coordinate Reference System (CRS)¶
GeoParquet uses PROJJSON to define the coordinate reference system. Most data uses WGS84 (EPSG:4326), but any CRS is supported.
Bounding Box (Bbox) Columns¶
A key optimization for spatial queries. Bbox columns store precomputed bounding boxes:
bbox: struct<xmin: double, ymin: double, xmax: double, ymax: double>
This enables fast spatial filtering by checking bbox overlap before expensive geometry operations.
Covering Metadata¶
The covering metadata in GeoParquet 1.1+ tells query engines how to use bbox columns for filtering:
"covering": {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"]
}
}
Row Groups¶
Parquet files are divided into row groups (chunks of rows). Spatial ordering within row groups improves query performance by keeping nearby features together.
GeoParquet Versions¶
Version 1.0.0¶
- Initial stable release
- Core geometry column support
- CRS and encoding metadata
Version 1.1.0 (Current)¶
- Added
coveringmetadata for bbox columns - Better interoperability with query engines
- Recommended for new files
Version 2.0 / Parquet-Geo¶
- Native Parquet GEOMETRY and GEOGRAPHY logical types
- Geometry stored in Parquet's native format (not just WKB)
- Still emerging specification
When to Use GeoParquet¶
Use GeoParquet when: - Files are larger than a few hundred MB - You need cloud-native access patterns - Analytics and querying are primary use cases - You're building data pipelines - You need efficient compression
Consider alternatives when: - Interoperability with legacy GIS tools is critical (use Shapefile) - Data is small and simplicity matters (use GeoJSON) - Editing workflows are primary use case (use GeoPackage)
Where gpio Fits In¶
geoparquet-io helps you create and optimize GeoParquet files:
| Task | gpio Command |
|---|---|
| Convert from other formats | gpio convert |
| Add bbox columns | gpio add bbox |
| Apply spatial ordering | gpio sort hilbert |
| Validate best practices | gpio check all |
| Partition large files | gpio partition |
Learn More¶
- GeoParquet Specification - Official specification
- Best Practices Guide - Optimization techniques
- Quick Start Tutorial - Get started with gpio