Command Piping

gpio supports Unix-style command piping using Arrow IPC streaming. This allows you to chain multiple commands together without creating intermediate files, resulting in faster execution and reduced disk I/O.

Basic Piping

Use `-` as the input path to read from stdin. Output is auto-detected: when stdout is piped to another command, gpio streams Arrow IPC without being told to.

# Add bbox, then sort by Hilbert curve
gpio add bbox input.parquet | gpio sort hilbert - output.parquet

# Extract, add bbox, then add quadkey
gpio extract --limit 1000 input.parquet | gpio add bbox - | gpio add quadkey - output.parquet

You can also pass `-` explicitly as the output:

gpio add bbox input.parquet - | gpio sort hilbert - output.parquet

Supported Commands

All transformation commands support Arrow IPC piping:

| Command | Stdin Input | Stdout Output |
|---|---|---|
| `extract` | Yes | Yes |
| `add bbox` | Yes | Yes |
| `add quadkey` | Yes | Yes |
| `add h3` | Yes | Yes |
| `add kdtree` | Yes | Yes |
| `add admin-divisions` | Yes | Yes |
| `sort hilbert` | Yes | Yes |
| `sort quadkey` | Yes | Yes |
| `sort column` | Yes | Yes |
| `reproject` | Yes | Yes |
| `convert geojson` | Yes | No (outputs GeoJSON to stdout) |
| `partition string` | Yes | No (writes to directory) |
| `partition quadkey` | Yes | No (writes to directory) |
| `partition h3` | Yes | No (writes to directory) |
| `partition kdtree` | Yes | No (writes to directory) |
| `partition admin` | Yes | No (writes to directory) |

Performance Benefits

Piping eliminates intermediate file I/O, providing significant speedups for multi-step workflows:

| Workflow | File-based | Piped | Speedup |
|---|---|---|---|
| add bbox → add quadkey → sort hilbert | 34s | 16s | 2.1x (53% less time) |

For even better performance, use the Python API, which keeps data in memory.

Common Patterns

Transform Pipeline

Chain transformations without intermediate files:

gpio add bbox input.parquet | \
  gpio add quadkey - | \
  gpio sort hilbert - output.parquet

Extract and Transform

Filter data before applying transformations:

gpio extract --limit 10000 large_file.parquet | \
  gpio add bbox - | \
  gpio sort hilbert - subset.parquet

Spatial Filter and Partition

Filter by bounding box then partition:

gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
  gpio add quadkey - | \
  gpio partition string --column quadkey --chars 4 - output_dir/

Column Selection Through Pipe

Select columns first, then add computed columns:

gpio extract --include-cols name,address input.parquet | \
  gpio add bbox - output.parquet

Add Multiple Spatial Indices

Chain multiple add commands to add several spatial indices:

gpio add bbox input.parquet | \
  gpio add h3 --resolution 9 - | \
  gpio add quadkey - | \
  gpio sort hilbert - output.parquet

Reproject and Transform

Reproject to a different CRS before adding indices:

gpio convert reproject --dst-crs EPSG:4326 input.parquet | \
  gpio add bbox - | \
  gpio sort hilbert - output.parquet

Full Processing Pipeline

Combine extract, reproject, add indices, sort, and partition:

gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
  gpio add bbox - | \
  gpio add h3 --resolution 8 - | \
  gpio sort hilbert - | \
  gpio partition h3 --resolution 4 - output_dir/

How It Works

When output goes to stdout (either an explicit `-` or an auto-detected pipe), gpio writes data in Arrow IPC streaming format instead of Parquet. This format:

- Supports streaming (no need to buffer the entire dataset)
- Preserves schema and metadata
- Needs no re-encoding on read: record batches arrive in Arrow's in-memory layout, so the receiving process avoids a decode step
- Is compatible with any Arrow-based tool

The receiving command reads the Arrow IPC stream, processes the data, and outputs either another Arrow stream (for further piping) or a Parquet file.

Auto-Detection

gpio automatically detects when stdout is piped to another process, so you don't need to specify `-` for output:

# Output is auto-detected when piped
gpio add bbox input.parquet | gpio sort hilbert - output.parquet

# Explicit '-' also works
gpio add bbox input.parquet - | gpio sort hilbert - output.parquet

When output is omitted and stdout is piped, gpio streams Arrow IPC. When stdout is a terminal, gpio requires an explicit output path.
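
The rules above can be written as a small decision function. This is a sketch of the described behavior only, not gpio's actual code: `resolve_output` and `ARROW_STREAM` are hypothetical names, and treating an explicit `-` as a stream even on a terminal is an assumption.

```python
import sys
from typing import Optional

ARROW_STREAM = "arrow-ipc-stream"  # hypothetical marker for "stream to stdout"


def resolve_output(output: Optional[str], stdout_is_tty: bool) -> str:
    """Return a file path, or ARROW_STREAM when results should go to stdout."""
    if output == "-":
        # Explicit request to stream (assumed to apply regardless of terminal).
        return ARROW_STREAM
    if output is not None:
        # An explicit path always wins.
        return output
    if stdout_is_tty:
        # Nothing to pipe into: an output path is required.
        raise ValueError("an output path is required when stdout is a terminal")
    # Output omitted and stdout is piped: stream Arrow IPC.
    return ARROW_STREAM


if __name__ == "__main__":
    # sys.stdout.isatty() is False when stdout is a pipe or a redirected file.
    try:
        print(resolve_output(None, sys.stdout.isatty()), file=sys.stderr)
    except ValueError as exc:
        print(f"Error: {exc}", file=sys.stderr)
```

Passing the terminal check in as a parameter keeps the decision testable without a real terminal.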

Error Handling

If a command in the pipeline fails, the error is propagated:

# If the file doesn't exist, the first command fails
gpio add bbox nonexistent.parquet - | gpio sort hilbert - output.parquet
# Error: File not found: nonexistent.parquet
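
When a pipeline is driven from a script, it helps to check every stage rather than only the last one: in POSIX shells the pipeline's exit status is that of the final command unless `pipefail` is set. The sketch below shows one way to do that from Python. It uses stand-in commands instead of gpio so that it runs anywhere, and `run_pipeline` is a hypothetical helper, not part of gpio.

```python
import subprocess
import sys


def run_pipeline(stages):
    """Run commands as a pipe chain and return the exit code of every stage."""
    procs = []
    upstream = None
    for index, argv in enumerate(stages):
        is_last = index == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=upstream.stdout if upstream is not None else subprocess.DEVNULL,
            stdout=None if is_last else subprocess.PIPE,
        )
        if upstream is not None:
            # Drop our copy of the pipe so the downstream stage sees EOF
            # as soon as the upstream stage exits.
            upstream.stdout.close()
        procs.append(proc)
        upstream = proc
    return [proc.wait() for proc in procs]


# Stand-ins: the first stage fails, the second just drains its stdin.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
consumer = [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"]

exit_codes = run_pipeline([failing, consumer])
pipeline_failed = any(code != 0 for code in exit_codes)
```

Here the last stage exits cleanly even though the first one failed, which is why the check looks at all exit codes. To run a real pipeline, replace the stand-ins with argument lists such as `["gpio", "add", "bbox", "input.parquet"]`.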

For debugging, you can save intermediate results:

# Debug: save intermediate result
gpio add bbox input.parquet intermediate.parquet
gpio inspect intermediate.parquet
gpio sort hilbert intermediate.parquet output.parquet

Limitations

- Partition commands: `partition string`, `partition quadkey`, etc. can read from stdin but always write to a directory (not stdout).
- Remote output: Streaming to remote destinations (S3, HTTP) is not supported; write to a local file, then upload it with `gpio publish upload`.
- Memory: Large datasets are streamed, but some operations (like Hilbert sorting) require loading the full dataset into memory.
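
The memory limitation follows from the shape of the operation, not from the pipe. The sketch below models record batches as plain Python lists (a stand-in, not Arrow and not gpio's code) to contrast a step that can pass each batch along immediately with a sort, which cannot emit anything until it has seen every row. The function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Batch = List[Dict[str, int]]  # plain-Python stand-in for an Arrow record batch


def add_doubled(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Streaming step: each batch is transformed and handed on immediately."""
    for batch in batches:
        yield [{**row, "doubled": row["key"] * 2} for row in batch]


def sort_by_key(batches: Iterable[Batch], batch_size: int) -> Iterator[Batch]:
    """Blocking step: all rows must arrive before the first output batch."""
    rows = [row for batch in batches for row in batch]  # full materialization
    rows.sort(key=lambda row: row["key"])
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


source = [[{"key": 5}, {"key": 1}], [{"key": 4}, {"key": 2}]]
sorted_batches = list(sort_by_key(add_doubled(source), batch_size=2))
```

The smallest key could be in the very last input batch, so the sort has to hold all rows at once, while `add_doubled` only ever holds one batch.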

Python API Alternative

For maximum performance, use the Python API, which keeps data in memory:

import geoparquet_io as gpio

# Equivalent to:
# gpio extract --bbox "..." input.parquet | gpio add bbox - | gpio sort hilbert - output.parquet

gpio.read('input.parquet') \
    .extract(bbox=(-122.5, 37.5, -122.0, 38.0)) \
    .add_bbox() \
    .sort_hilbert() \
    .write('output.parquet')

Performance Comparison

| Approach | Time (75MB, 400K rows) | Notes |
|---|---|---|
| CLI (file-based) | 34s | Each command writes intermediate file |
| CLI (piped) | 16s | Arrow IPC streaming between commands |
| Python API | 7s | In-memory, no I/O overhead |
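
The ratios implied by these timings can be checked with a few lines of arithmetic. The snippet only restates the numbers from the table, with the file-based CLI as the baseline; it does not measure anything itself.

```python
# Timings in seconds, copied from the table above.
timings_s = {"CLI (file-based)": 34, "CLI (piped)": 16, "Python API": 7}
baseline = timings_s["CLI (file-based)"]

# Speedup factor relative to the file-based workflow.
speedups = {name: baseline / seconds for name, seconds in timings_s.items()}

# Share of the baseline runtime that each approach saves.
time_saved_pct = {
    name: (1 - seconds / baseline) * 100 for name, seconds in timings_s.items()
}

for name in timings_s:
    print(f"{name}: {speedups[name]:.1f}x, {time_saved_pct[name]:.0f}% less time")
```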

The Python API is about 5x faster than file-based CLI operations in this benchmark (7s vs. 34s) because data stays in memory as Arrow tables. Use the Python API when:

- You're building Python applications
- Performance is critical
- You need to integrate with other Python tools
- You're working in Jupyter notebooks

See Python API Reference for full documentation.