Sorting Data

The sort command rewrites a GeoParquet file with its rows reordered to improve compression and query performance.

Sorting Methods

  • Hilbert curve - Optimal spatial ordering using a Hilbert space-filling curve
  • Column - Sort by any column(s) for non-spatial ordering needs

Hilbert Curve Ordering

gpio sort hilbert input.parquet output.parquet

# From HTTPS to S3
gpio sort hilbert https://example.com/data.parquet s3://bucket/sorted.parquet --profile prod

# Python API
import geoparquet_io as gpio

gpio.read('input.parquet').sort_hilbert().write('output.parquet')

# With upload to S3
gpio.read('https://example.com/data.parquet') \
    .sort_hilbert() \
    .upload('s3://bucket/sorted.parquet', profile='prod')

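Reorders rows along a Hilbert space-filling curve, so that features close together on the map are stored close together in the file. This typically:

  • Improves spatial locality within each row group
  • Increases compression ratios, because neighboring rows hold similar values
  • Optimizes cloud-native access patterns, since a spatial query touches fewer byte ranges
  • Enhances query performance by letting readers skip row groups outside the query area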
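
To make the ordering concrete, here is a minimal pure-Python sketch of the classic Hilbert index calculation on a square grid. It illustrates the technique only and is not gpio's implementation; the names hilbert_index and sort_points_hilbert, the 16-bit grid order, and the sample points are all assumptions made for this example.

```python
def hilbert_index(order, x, y):
    """Return the position of grid cell (x, y) along a Hilbert curve.

    The grid is 2**order cells wide; x and y must lie in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def sort_points_hilbert(points, bounds, order=16):
    """Sort (x, y) points by the Hilbert index of their grid cell.

    bounds is (xmin, ymin, xmax, ymax) for the whole dataset and is assumed
    to be non-degenerate (xmax > xmin, ymax > ymin). Hypothetical helper.
    """
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1

    def key(pt):
        # Scale each coordinate onto the integer grid before indexing.
        gx = int((pt[0] - xmin) / (xmax - xmin) * cells)
        gy = int((pt[1] - ymin) / (ymax - ymin) * cells)
        return hilbert_index(order, gx, gy)

    return sorted(points, key=key)


if __name__ == "__main__":
    # Two clusters (San Francisco and Paris), interleaved in the input.
    pts = [(-122.4, 37.8), (2.35, 48.85), (-122.3, 37.7), (2.29, 48.86)]
    for p in sort_points_hilbert(pts, bounds=(-180, -90, 180, 90)):
        print(p)
```

After sorting, the two points of each cluster sit next to each other, which is the locality property the list above relies on.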

Options

# Specify geometry column
gpio sort hilbert input.parquet output.parquet -g geom

# Add bbox column if missing
gpio sort hilbert input.parquet output.parquet --add-bbox

# Custom compression
gpio sort hilbert input.parquet output.parquet --compression GZIP --compression-level 9

# Row group sizing
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 256

# Verbose output
gpio sort hilbert input.parquet output.parquet --verbose

Compression Options

Supported formats:

  • ZSTD (default) - Best balance, level 1-22, default 15
  • GZIP - Wide compatibility, level 1-9, default 6
  • BROTLI - High compression, level 1-11, default 6
  • LZ4 - Fastest
  • SNAPPY - Fast, good compression
  • UNCOMPRESSED - No compression
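
When scripting around the CLI, it can help to check a codec and level before launching a long sort. The sketch below encodes the ranges and defaults from the list above in a small lookup. The helper name resolve_compression is hypothetical, and rejecting a level for codecs that list no range is an assumption of this example, not documented gpio behavior.

```python
# Level ranges and defaults copied from the list above: (min, max, default).
# Codecs with no documented level range map to None.
COMPRESSION_LEVELS = {
    "ZSTD": (1, 22, 15),
    "GZIP": (1, 9, 6),
    "BROTLI": (1, 11, 6),
    "LZ4": None,
    "SNAPPY": None,
    "UNCOMPRESSED": None,
}


def resolve_compression(codec="ZSTD", level=None):
    """Return (codec, level) after validating against the documented ranges."""
    name = codec.upper()
    if name not in COMPRESSION_LEVELS:
        raise ValueError(f"unsupported compression: {codec!r}")
    spec = COMPRESSION_LEVELS[name]
    if spec is None:
        # Assumption: a level is meaningless for these codecs, so reject it.
        if level is not None:
            raise ValueError(f"{name} has no documented compression level")
        return name, None
    low, high, default = spec
    if level is None:
        return name, default
    if not low <= level <= high:
        raise ValueError(f"{name} level must be between {low} and {high}")
    return name, level


if __name__ == "__main__":
    print(resolve_compression())           # default codec and level
    print(resolve_compression("gzip", 9))  # matches the GZIP example above
```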

Row Group Sizing

Set the row group size either as an exact row count or as a target size:

# Exact row count
gpio sort hilbert input.parquet output.parquet --row-group-size 100000

# Target size in MB/GB
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 256MB
gpio sort hilbert input.parquet output.parquet --row-group-size-mb 1GB
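
The two styles are related: a target size becomes a row count once the average row size is known. The following sketch shows one way to make that conversion. It is not gpio's code; parse_size and rows_per_group are hypothetical helpers, and treating a bare number as MB and using binary units (1 MB = 1024 * 1024 bytes) are assumptions of this example.

```python
import re

# Assumption: binary units. A real tool may use decimal units instead.
_UNITS = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Parse '256', '256MB' or '1GB' into bytes (bare numbers are taken as MB)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB)?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[(unit or "MB").upper()])


def rows_per_group(target_bytes, total_bytes, total_rows):
    """Estimate how many rows fit in a row group of target_bytes.

    Uses the dataset-wide average row size (total_bytes / total_rows),
    computed with integer arithmetic to avoid rounding surprises.
    """
    if total_bytes <= 0 or total_rows <= 0:
        raise ValueError("total_bytes and total_rows must be positive")
    return max(1, target_bytes * total_rows // total_bytes)


if __name__ == "__main__":
    # A 1 GB dataset with 4 million rows, split into 256 MB row groups.
    print(rows_per_group(parse_size("256MB"), parse_size("1GB"), 4_000_000))
```

Because the estimate relies on an average, row groups in files with very uneven geometry sizes will only approximate the target.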

Column Ordering

Sort by any column(s) for non-spatial ordering needs:

# Sort by a single column
gpio sort column input.parquet output.parquet name

# Sort by multiple columns (comma-separated)
gpio sort column input.parquet output.parquet country,city

# Sort in descending order
gpio sort column input.parquet output.parquet date --descending

# Python API
import geoparquet_io as gpio
from geoparquet_io.api import ops

# Sort by a single column (fluent API)
gpio.read('input.parquet').sort_column('name').write('output.parquet')

# Sort in descending order
gpio.read('input.parquet').sort_column('date', descending=True).write('output.parquet')

# Multi-column sorting (requires ops API)
table = gpio.read('input.parquet')
sorted_arrow = ops.sort_column(table.to_arrow(), ['country', 'city'])
gpio.Table(sorted_arrow).write('output.parquet')

Multi-column sorting

Table.sort_column() accepts a single column. For multi-column sorting, use ops.sort_column(), which accepts a list of column names.

Column sorting:

  • Accepts one or more column names (comma-separated on the command line)
  • Validates that columns exist before sorting
  • Preserves all original columns and metadata
  • Useful for time-series data or alphabetical ordering
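
The behavior described above can be sketched in plain Python on a list of row dictionaries: split the comma-separated names, check that every column exists, then sort by the columns in the order given. This is an illustration of the semantics only; sort_rows is a hypothetical helper, and applying the descending flag to every listed column is an assumption of this example.

```python
def sort_rows(rows, columns, descending=False):
    """Sort a list of row dicts by one or more columns.

    columns may be a comma-separated string ('country,city') or a list.
    """
    if isinstance(columns, str):
        columns = [name.strip() for name in columns.split(",")]
    if rows:
        # Validate before sorting so a typo fails fast with a clear message.
        missing = [name for name in columns if name not in rows[0]]
        if missing:
            raise ValueError(f"unknown column(s): {', '.join(missing)}")
    # Earlier columns take priority; later ones break ties.
    return sorted(rows, key=lambda row: tuple(row[name] for name in columns),
                  reverse=descending)


if __name__ == "__main__":
    data = [
        {"country": "FR", "city": "Paris"},
        {"country": "DE", "city": "Munich"},
        {"country": "FR", "city": "Lyon"},
        {"country": "DE", "city": "Berlin"},
    ]
    for row in sort_rows(data, "country,city"):
        print(row["country"], row["city"])
```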

Output Format

The output file:

  • Follows the GeoParquet 1.1 specification
  • Preserves CRS information
  • Includes bbox covering metadata
  • Uses optimal row group sizes
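
To confirm these properties on a sorted file, one option is to inspect the file's "geo" metadata, which GeoParquet stores as a JSON string in the Parquet key-value metadata. The sketch below summarizes such a string using only the standard library. The helper name summarize_geo_metadata and the SAMPLE_GEO document are made up for illustration; with pyarrow installed, the real string should be available as pyarrow.parquet.read_schema(path).metadata[b"geo"].

```python
import json

# Hand-written sample in the shape of GeoParquet 1.1 metadata (not from a real file).
SAMPLE_GEO = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon"],
            "covering": {
                "bbox": {
                    "xmin": ["bbox", "xmin"],
                    "ymin": ["bbox", "ymin"],
                    "xmax": ["bbox", "xmax"],
                    "ymax": ["bbox", "ymax"],
                }
            },
        }
    },
})


def summarize_geo_metadata(geo_json):
    """Report the version, CRS presence, and bbox covering of 'geo' metadata."""
    geo = json.loads(geo_json)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "version": geo["version"],
        "primary_column": primary,
        # An absent "crs" key means the spec default (OGC:CRS84) applies.
        "has_explicit_crs": "crs" in column,
        "has_bbox_covering": "bbox" in column.get("covering", {}),
    }


if __name__ == "__main__":
    print(summarize_geo_metadata(SAMPLE_GEO))
```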

See Also