CLI vs Python API¶
geoparquet-io offers two ways to work with GeoParquet files: the gpio command-line tool and a Python API. This guide helps you choose between them for your use case.
Quick Comparison¶
| Feature | CLI | Python API |
|---|---|---|
| Performance | Good (with piping) | Best (in-memory) |
| Ease of use | Simple commands | Fluent chaining |
| Integration | Shell scripts, CI/CD | Python applications |
| Interactivity | Terminal | Jupyter notebooks |
| Remote files | Full support | Partial support |
When to Use the CLI¶
One-off File Operations¶
Quick transformations without writing code:
# Add bbox and sort
gpio add bbox input.parquet | gpio sort hilbert - output.parquet
# Check file quality
gpio check all myfile.parquet
# Inspect metadata
gpio inspect myfile.parquet --stats
Shell Scripts and Automation¶
CI/CD pipelines, cron jobs, and data processing scripts:
#!/bin/bash
# Optimize every Parquet file in the current directory
for file in *.parquet; do
    gpio add bbox "$file" | gpio sort hilbert - "optimized/$file"
done
Remote File Processing¶
Read from and write to cloud storage:
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet --profile my-aws
gpio sort hilbert https://example.com/data.parquet s3://bucket/sorted.parquet
Piping Multiple Commands¶
Chain operations with Unix pipes:
gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
gpio add bbox - | \
gpio add h3 --resolution 9 - | \
gpio sort hilbert - output.parquet
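When the stages have to be assembled by a script rather than typed by hand, the same pipe structure can be built with Python's `subprocess` module. The following is a minimal sketch, not part of geoparquet-io: `run_pipeline` is a hypothetical helper, and the Python interpreter stands in for the gpio stages so the example runs without gpio installed.

```python
import subprocess
import sys


def run_pipeline(stages: list[list[str]]) -> bytes:
    """Run commands as a pipe: each stage's stdout feeds the next stage's stdin."""
    if not stages:
        raise ValueError("at least one stage is required")
    procs = []
    upstream = None  # stdout of the previous stage, if any
    for cmd in stages:
        stdin = subprocess.DEVNULL if upstream is None else upstream
        proc = subprocess.Popen(cmd, stdin=stdin, stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()  # parent drops its copy so only the stages hold the pipe
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()  # read everything the final stage writes
    for cmd, proc in zip(stages, procs):
        if proc.wait() != 0:
            raise RuntimeError(f"stage failed: {cmd}")
    return output


# Stand-in stages (hypothetical): one emits two lines, the next sorts them.
emit = [sys.executable, "-c", "print('b'); print('a')"]
sort_lines = [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

result = run_pipeline([emit, sort_lines])
print(result.decode().split())  # ['a', 'b']
```

In practice each stage would be one of the gpio argument lists from the shell example above, with the final stage writing to its output file instead of stdout.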
When to Use the Python API¶
Python Applications¶
Integrate with existing Python code:
import geoparquet_io as gpio
def process_data(input_path: str, output_path: str):
    gpio.read(input_path) \
        .add_bbox() \
        .sort_hilbert() \
        .write(output_path)
Maximum Performance¶
In the benchmark under Performance Comparison, the Python API ran about 5x faster than the file-based CLI:
import geoparquet_io as gpio
# Data stays in memory - no intermediate file I/O
result = gpio.read('input.parquet') \
    .extract(limit=10000) \
    .add_bbox() \
    .add_h3(resolution=9) \
    .sort_hilbert()
result.write('output.parquet')
Jupyter Notebooks¶
Interactive exploration and analysis:
import geoparquet_io as gpio
# Read and explore
table = gpio.read('data.parquet')
table.info()
# Transform and inspect
result = table.add_bbox().sort_hilbert()
print(f"Processed {result.num_rows} rows")
print(f"Bounds: {result.bounds}")
Conditional Processing¶
Apply different operations based on data characteristics:
import geoparquet_io as gpio
table = gpio.read('input.parquet')
# Apply different processing based on size
if table.num_rows > 1_000_000:
    # Large file: add H3 for later partitioning
    result = table.add_bbox().add_h3(resolution=9).sort_hilbert()
else:
    # Small file: just optimize
    result = table.add_bbox().sort_hilbert()
result.write('output.parquet')
Integration with PyArrow¶
Combine with other Arrow-based tools:
import pyarrow.parquet as pq
import geoparquet_io as gpio
from geoparquet_io.api import Table
# Read with PyArrow
arrow_table = pq.read_table('input.parquet')
# Process with gpio
table = Table(arrow_table)
result = table.add_bbox().sort_hilbert()
# Continue with PyArrow or other tools
arrow_result = result.to_arrow()
Reusable Pipelines¶
Define and apply standard processing pipelines:
from geoparquet_io.api import pipe, read
# Define reusable pipeline
optimize = pipe(
    lambda t: t.add_bbox(),
    lambda t: t.add_h3(resolution=9),
    lambda t: t.sort_hilbert(),
)
# Apply to any file
result = optimize(read('input.parquet'))
result.write('output.parquet')
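To make the composition idea concrete, here is a minimal sketch of how a pipe-style helper can be written with `functools.reduce`. It is not geoparquet-io's implementation: `compose_steps` is a hypothetical name, and plain lists stand in for tables so the example runs without the library.

```python
from functools import reduce
from typing import Any, Callable

Step = Callable[[Any], Any]


def compose_steps(*steps: Step) -> Step:
    """Return a function that applies each step in order, left to right."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Plain lists stand in for tables in this sketch.
dedupe_and_sort = compose_steps(
    lambda rows: list(dict.fromkeys(rows)),  # drop duplicates, keep first occurrence
    sorted,                                  # then sort ascending
)

print(dedupe_and_sort([3, 1, 3, 2]))  # [1, 2, 3]
```

Because the composed pipeline is an ordinary function, it can be stored, passed around, and applied to any number of inputs, which is what makes this pattern reusable.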
Performance Comparison¶
Benchmark on a 75 MB file with 400K rows (add bbox + add quadkey + sort hilbert):
| Approach | Time | Relative Speed |
|---|---|---|
| CLI (file-based) | 34s | 1x (baseline) |
| CLI (piped) | 16s | 2x faster |
| Python API | 7s | 5x faster |
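The Relative Speed column is the file-based baseline time divided by each approach's time, rounded to a whole number. The arithmetic, using only the timings from the table:

```python
# Timings in seconds, copied from the benchmark table.
timings = {
    "CLI (file-based)": 34,
    "CLI (piped)": 16,
    "Python API": 7,
}

baseline = timings["CLI (file-based)"]

# Speedup = baseline time / approach time.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# CLI (file-based): 1.0x
# CLI (piped): 2.1x
# Python API: 4.9x
```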
The Python API is faster because:
- Data stays in memory as Arrow tables
- No intermediate file I/O
- Zero-copy operations where possible
Mixing CLI and Python¶
You can use both together:
import subprocess
import geoparquet_io as gpio
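# Use the CLI to extract a subset of the remote file to local disk
# check=True raises CalledProcessError if gpio exits with a non-zero status
subprocess.run([
    'gpio', 'extract', '--limit', '10000',
    's3://bucket/huge.parquet', 'local_subset.parquet'
], check=True)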
# Use Python API for processing
gpio.read('local_subset.parquet') \
    .add_bbox() \
    .sort_hilbert() \
    .write('processed.parquet')
Equivalent Commands¶
Here are common operations in both styles:
Add Bbox and Sort¶
# CLI
gpio add bbox input.parquet | gpio sort hilbert - output.parquet
# Python
gpio.read('input.parquet').add_bbox().sort_hilbert().write('output.parquet')
Filter by Bounding Box¶
# CLI
gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet output.parquet
# Python
gpio.read('input.parquet').extract(bbox=(-122.5, 37.5, -122.0, 38.0)).write('output.parquet')
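The CLI takes the box as a comma-separated string, while the Python call takes a tuple. The values above are consistent with an (xmin, ymin, xmax, ymax) order, which is assumed here rather than confirmed by this guide. A minimal sketch of converting one form to the other with basic validation, where `parse_bbox` is a hypothetical helper and not part of geoparquet-io:

```python
def parse_bbox(text: str) -> tuple[float, float, float, float]:
    """Parse 'xmin,ymin,xmax,ymax' into a tuple of floats (order is an assumption)."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    xmin, ymin, xmax, ymax = parts
    # This sketch rejects degenerate boxes and boxes that cross the antimeridian.
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("bbox must satisfy xmin < xmax and ymin < ymax")
    return (xmin, ymin, xmax, ymax)


bbox = parse_bbox("-122.5,37.5,-122.0,38.0")
print(bbox)  # (-122.5, 37.5, -122.0, 38.0)
```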
Add Multiple Indices¶
# CLI
gpio add bbox input.parquet | gpio add h3 --resolution 9 - | gpio add quadkey - output.parquet
# Python
gpio.read('input.parquet').add_bbox().add_h3(resolution=9).add_quadkey().write('output.parquet')
Partition Data¶
# CLI
gpio partition h3 input.parquet output_dir/ --resolution 6
# Python
gpio.read('input.parquet').add_h3(resolution=9).partition_by_h3('output_dir/', resolution=6)
Summary¶
| Use Case | Recommendation |
|---|---|
| Quick one-off transformations | CLI |
| Shell scripts and CI/CD | CLI with piping |
| Remote file processing | CLI |
| Python applications | Python API |
| Jupyter notebooks | Python API |
| Maximum performance | Python API |
| Conditional processing | Python API |
| Integration with PyArrow | Python API |
See Also¶
- Quick Start Tutorial - Get started with both approaches
- Command Piping - CLI piping guide
- Python API Reference - Full Python API documentation