CLI vs Python API

geoparquet-io offers two ways to work with GeoParquet files. This guide helps you choose the right approach for your use case.

Quick Comparison

| Feature       | CLI                  | Python API          |
|---------------|----------------------|---------------------|
| Performance   | Good (with piping)   | Best (in-memory)    |
| Ease of use   | Simple commands      | Fluent chaining     |
| Integration   | Shell scripts, CI/CD | Python applications |
| Interactivity | Terminal             | Jupyter notebooks   |
| Remote files  | Full support         | Partial support     |

When to Use the CLI

One-off File Operations

Quick transformations without writing code. In these examples, `-` in place of a file path reads from stdin or writes to stdout:

# Add bbox and sort
gpio add bbox input.parquet | gpio sort hilbert - output.parquet

# Check file quality
gpio check all myfile.parquet

# Inspect metadata
gpio inspect myfile.parquet --stats

Shell Scripts and Automation

CI/CD pipelines, cron jobs, and data processing scripts:

#!/bin/bash
mkdir -p optimized
for file in *.parquet; do
    gpio add bbox "$file" | gpio sort hilbert - "optimized/$file"
done
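
The same commands also work from cron for scheduled jobs. A hypothetical crontab entry (the install path and file locations are placeholders; cron runs with a minimal PATH, so the binary is referenced absolutely) that optimizes a file nightly at 2:00 AM:

# m h dom mon dow  command
0 2 * * * /usr/local/bin/gpio add bbox /data/daily.parquet | /usr/local/bin/gpio sort hilbert - /data/optimized/daily.parquet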

Remote File Processing

Read from and write to cloud storage:

gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet --profile my-aws
gpio sort hilbert https://example.com/data.parquet s3://bucket/sorted.parquet

Piping Multiple Commands

Chain operations with Unix pipes:

gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
    gpio add bbox - | \
    gpio add h3 --resolution 9 - | \
    gpio sort hilbert - output.parquet
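
When a pipeline like this runs inside a script, remember that a plain shell reports only the exit status of the last command. Enabling strict mode makes the script abort if any stage fails; this is standard shell practice rather than anything gpio-specific:

#!/bin/bash
set -euo pipefail  # abort on errors, unset variables, or a failed pipeline stage

gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
    gpio add bbox - | \
    gpio sort hilbert - output.parquet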

When to Use the Python API

Python Applications

Integrate with existing Python code:

import geoparquet_io as gpio

def process_data(input_path: str, output_path: str):
    gpio.read(input_path) \
        .add_bbox() \
        .sort_hilbert() \
        .write(output_path)

Maximum Performance

The Python API is up to 5x faster than the CLI:

import geoparquet_io as gpio

# Data stays in memory - no intermediate file I/O
result = gpio.read('input.parquet') \
    .extract(limit=10000) \
    .add_bbox() \
    .add_h3(resolution=9) \
    .sort_hilbert()

result.write('output.parquet')

Jupyter Notebooks

Interactive exploration and analysis:

import geoparquet_io as gpio

# Read and explore
table = gpio.read('data.parquet')
table.info()

# Transform and inspect
result = table.add_bbox().sort_hilbert()
print(f"Processed {result.num_rows} rows")
print(f"Bounds: {result.bounds}")

Conditional Processing

Apply different operations based on data characteristics:

import geoparquet_io as gpio

table = gpio.read('input.parquet')

# Apply different processing based on size
if table.num_rows > 1_000_000:
    # Large file: add H3 for later partitioning
    result = table.add_bbox().add_h3(resolution=9).sort_hilbert()
else:
    # Small file: just optimize
    result = table.add_bbox().sort_hilbert()

result.write('output.parquet')

Integration with PyArrow

Combine with other Arrow-based tools:

import pyarrow.parquet as pq
import geoparquet_io as gpio
from geoparquet_io.api import Table

# Read with PyArrow
arrow_table = pq.read_table('input.parquet')

# Process with gpio
table = Table(arrow_table)
result = table.add_bbox().sort_hilbert()

# Continue with PyArrow or other tools
arrow_result = result.to_arrow()
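
From there, any Arrow-compatible tool can take over. Continuing the example above (assuming to_arrow() returns a standard pyarrow.Table, as the snippet implies):

# Write with PyArrow's own writer, e.g. to control compression
pq.write_table(arrow_result, 'final.parquet', compression='zstd')

# Or hand off to pandas for analysis
df = arrow_result.to_pandas()
print(df.head())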

Reusable Pipelines

Define and apply standard processing pipelines:

from geoparquet_io.api import pipe, read

# Define reusable pipeline
optimize = pipe(
    lambda t: t.add_bbox(),
    lambda t: t.add_h3(resolution=9),
    lambda t: t.sort_hilbert(),
)

# Apply to any file
result = optimize(read('input.parquet'))
result.write('output.parquet')
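
Because the pipeline is an ordinary function, it can also be mapped over many files. A minimal sketch using the optimize pipeline and read import from above (the directory layout is illustrative):

from pathlib import Path

out_dir = Path('optimized')
out_dir.mkdir(exist_ok=True)

for path in Path('.').glob('*.parquet'):
    # Apply the same pipeline to every file in the directory
    optimize(read(str(path))).write(str(out_dir / path.name))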

Performance Comparison

Benchmark on a 75MB file with 400K rows (add bbox + add quadkey + sort hilbert):

| Approach         | Time | Relative speed |
|------------------|------|----------------|
| CLI (file-based) | 34s  | 1x (baseline)  |
| CLI (piped)      | 16s  | 2x faster      |
| Python API       | 7s   | 5x faster      |

The Python API is faster because:

- Data stays in memory as Arrow tables
- No intermediate file I/O
- Zero-copy operations where possible
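
To reproduce a comparison like this on your own data, a rough timing harness is enough. A sketch (file names are placeholders; absolute numbers depend on hardware, file size, and the operations applied):

import subprocess
import time

import geoparquet_io as gpio

# Time the piped CLI path (shell=True so the pipe is interpreted by the shell)
start = time.perf_counter()
subprocess.run(
    'gpio add bbox input.parquet | gpio sort hilbert - cli_out.parquet',
    shell=True, check=True,
)
print(f'CLI (piped): {time.perf_counter() - start:.1f}s')

# Time the in-memory Python path
start = time.perf_counter()
gpio.read('input.parquet').add_bbox().sort_hilbert().write('api_out.parquet')
print(f'Python API:  {time.perf_counter() - start:.1f}s')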

Mixing CLI and Python

You can use both together:

import subprocess
import geoparquet_io as gpio

# Use the CLI to extract a subset of a remote file
subprocess.run([
    'gpio', 'extract', '--limit', '10000',
    's3://bucket/huge.parquet', 'local_subset.parquet'
], check=True)

# Use Python API for processing
gpio.read('local_subset.parquet') \
    .add_bbox() \
    .sort_hilbert() \
    .write('processed.parquet')

Equivalent Commands

Here are common operations in both styles:

Add Bbox and Sort

# CLI
gpio add bbox input.parquet | gpio sort hilbert - output.parquet
# Python
gpio.read('input.parquet').add_bbox().sort_hilbert().write('output.parquet')

Filter by Bounding Box

# CLI
gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet output.parquet
# Python
gpio.read('input.parquet').extract(bbox=(-122.5, 37.5, -122.0, 38.0)).write('output.parquet')

Add Multiple Indices

# CLI
gpio add bbox input.parquet | gpio add h3 --resolution 9 - | gpio add quadkey - output.parquet
# Python
gpio.read('input.parquet').add_bbox().add_h3(resolution=9).add_quadkey().write('output.parquet')

Partition Data

# CLI
gpio partition h3 input.parquet output_dir/ --resolution 6
# Python
gpio.read('input.parquet').add_h3(resolution=9).partition_by_h3('output_dir/', resolution=6)

Summary

| Use case                      | Recommendation  |
|-------------------------------|-----------------|
| Quick one-off transformations | CLI             |
| Shell scripts and CI/CD       | CLI with piping |
| Remote file processing        | CLI             |
| Python applications           | Python API      |
| Jupyter notebooks             | Python API      |
| Maximum performance           | Python API      |
| Conditional processing        | Python API      |
| Integration with PyArrow      | Python API      |

See Also