Sorting Data¶
The sort command reorders GeoParquet files for optimal performance and query efficiency.
Sorting Methods¶
- Hilbert curve - Optimal spatial ordering using Hilbert space-filling curve
- Column - Sort by any column(s) for non-spatial ordering needs
Hilbert Curve Ordering¶
gpio sort hilbert input.parquet output.parquet
# From HTTPS to S3
gpio sort hilbert https://example.com/data.parquet s3://bucket/sorted.parquet --profile prod
import geoparquet_io as gpio
gpio.read('input.parquet').sort_hilbert().write('output.parquet')
# With upload to S3
gpio.read('https://example.com/data.parquet') \
.sort_hilbert() \
.upload('s3://bucket/sorted.parquet', profile='prod')
Reorders rows using a Hilbert space-filling curve, which:
- Improves spatial locality
- Increases compression ratios
- Optimizes cloud-native access patterns
- Enhances query performance
Options¶
# Specify geometry column
gpio sort hilbert input.parquet output.parquet -g geom
# Add bbox column if missing
gpio sort hilbert input.parquet output.parquet --add-bbox
# Custom compression
gpio sort hilbert input.parquet output.parquet --compression GZIP --compression-level 9
# Row group sizing
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 256
# Verbose output
gpio sort hilbert input.parquet output.parquet --verbose
Compression Options¶
Supported formats:
ZSTD(default) - Best balance, level 1-22, default 15GZIP- Wide compatibility, level 1-9, default 6BROTLI- High compression, level 1-11, default 6LZ4- FastestSNAPPY- Fast, good compressionUNCOMPRESSED- No compression
Row Group Sizing¶
Control row group sizes for optimal performance:
# Exact row count
gpio sort hilbert input.parquet output.parquet --row-group-size 100000
# Target size in MB/GB
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 256MB
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 1GB
Column Ordering¶
Sort by any column(s) for non-spatial ordering needs:
# Sort by a single column
gpio sort column input.parquet output.parquet name
# Sort by multiple columns (comma-separated)
gpio sort column input.parquet output.parquet country,city
# Sort in descending order
gpio sort column input.parquet output.parquet date --descending
import geoparquet_io as gpio
from geoparquet_io.api import ops
# Sort by a single column (fluent API)
gpio.read('input.parquet').sort_column('name').write('output.parquet')
# Sort in descending order
gpio.read('input.parquet').sort_column('date', descending=True).write('output.parquet')
# Multi-column sorting (requires ops API)
table = gpio.read('input.parquet')
sorted_arrow = ops.sort_column(table.to_arrow(), ['country', 'city'])
gpio.Table(sorted_arrow).write('output.parquet')
Multi-column sorting
Table.sort_column() accepts a single column. For multi-column sorting, use ops.sort_column() which accepts a list of column names.
Column sorting:
- Accepts one or more column names (comma-separated)
- Validates that columns exist before sorting
- Preserves all original columns and metadata
- Useful for time-series data or alphabetical ordering
Output Format¶
The output file:
- Follows GeoParquet 1.1 spec
- Preserves CRS information
- Includes bbox covering metadata
- Uses optimal row group sizes