
Uploading to Cloud Storage

The upload command uploads GeoParquet files to cloud object storage (S3, GCS, Azure) with parallel transfers and progress tracking.

Basic Usage

# Single file to S3
gpio publish upload input.parquet s3://bucket/path/output.parquet --profile my-profile

# Directory to S3
gpio publish upload data/ s3://bucket/dataset/ --profile my-profile

The same operations are available from the Python API:

import geoparquet_io as gpio

# Upload to S3 with transform
gpio.read('input.parquet') \
    .sort_hilbert() \
    .upload('s3://bucket/path/output.parquet', profile='my-profile')

# Upload with S3-compatible endpoint (MinIO, etc)
gpio.read('input.parquet') \
    .upload(
        's3://bucket/path/output.parquet',
        s3_endpoint='minio.example.com:9000',
        s3_use_ssl=False
    )

Supported Destinations

Provider support via URL scheme:

  • AWS S3 - s3://bucket/path/
  • Google Cloud Storage - gs://bucket/path/
  • Azure Blob Storage - az://account/container/path/
  • HTTP stores - https://...
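The upload() call accepts the same URL schemes as destinations. A minimal sketch, assuming credentials are already configured as described under Authentication below; the bucket, account, and container names are placeholders:

import geoparquet_io as gpio

# Google Cloud Storage destination (assumes GOOGLE_APPLICATION_CREDENTIALS is set)
gpio.read('input.parquet').upload('gs://my-bucket/dataset/output.parquet')

# Azure Blob Storage destination (assumes AZURE_STORAGE_ACCOUNT_* variables are set)
gpio.read('input.parquet').upload('az://myaccount/mycontainer/output.parquet')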

Authentication

For authentication setup, see the Remote Files guide.

Quick reference:

  • AWS S3 - use the --profile flag or set the AWS_PROFILE environment variable
  • Google Cloud Storage - set GOOGLE_APPLICATION_CREDENTIALS
  • Azure - set AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
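In a script, the same credentials can be supplied through environment variables before calling upload(). A minimal sketch, assuming the variables are read when upload() runs; the paths and credential values are placeholders:

import os
import geoparquet_io as gpio

# AWS: either pass profile='my-profile' to upload(), or set AWS_PROFILE
os.environ['AWS_PROFILE'] = 'my-profile'

# GCS: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/key.json'

# Azure: storage account name and key
os.environ['AZURE_STORAGE_ACCOUNT_NAME'] = 'myaccount'
os.environ['AZURE_STORAGE_ACCOUNT_KEY'] = '<account-key>'

gpio.read('input.parquet').upload('s3://bucket/path/output.parquet')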

Options

Pattern Filtering

Upload only specific file types:

# Only JSON files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.json"

# Only Parquet files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.parquet"
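The Python examples in this guide upload one file per call, so an equivalent of --pattern from a script is to glob the directory yourself. A sketch under that assumption; the directory layout, bucket, and profile name are placeholders:

from pathlib import Path
import geoparquet_io as gpio

# Upload only the Parquet files under data/, mirroring --pattern "*.parquet"
for path in sorted(Path('data').rglob('*.parquet')):
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    gpio.read(str(path)).upload(dest, profile='my-profile')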

Parallel Uploads

Control concurrency for directory uploads:

# Upload 8 files in parallel (default: 4)
gpio publish upload data/ s3://bucket/dataset/ --max-files 8

Trade-off: higher parallelism gives faster uploads but uses more bandwidth and memory.
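From Python, a similar degree of parallelism can be approximated with a thread pool around per-file uploads. A minimal sketch, assuming upload() is safe to call from multiple threads (the CLI handles this internally via --max-files); names and paths are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import geoparquet_io as gpio

def upload_one(path: Path) -> str:
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    gpio.read(str(path)).upload(dest, profile='my-profile')
    return dest

files = sorted(Path('data').rglob('*.parquet'))

# Roughly equivalent to --max-files 8: at most 8 files in flight at once
with ThreadPoolExecutor(max_workers=8) as pool:
    for dest in pool.map(upload_one, files):
        print('uploaded', dest)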

Chunk Concurrency

Control concurrent chunks within each file:

# More concurrent chunks per file (default: 12)
gpio publish upload large.parquet s3://bucket/file.parquet --chunk-concurrency 20

Custom Chunk Size

Override default multipart upload chunk size:

# 10MB chunks instead of default 5MB
gpio publish upload data.parquet s3://bucket/file.parquet --chunk-size 10485760
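Chunk size, chunk concurrency, and file parallelism together bound how much upload data can be buffered at once. A back-of-the-envelope estimate using the documented defaults (an upper bound, assuming every chunk slot is filled):

# Rough peak buffer estimate: chunk size x chunks per file x files in parallel
chunk_size = 5 * 1024 * 1024      # default 5MB chunks
chunk_concurrency = 12            # default concurrent chunks per file
max_files = 4                     # default files uploaded in parallel

peak_bytes = chunk_size * chunk_concurrency * max_files
print(f"~{peak_bytes / (1024 * 1024):.0f} MB of chunks in flight")  # ~240 MB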

Error Handling

By default, the command continues uploading the remaining files if one fails:

# Stop immediately on first error
gpio publish upload data/ s3://bucket/dataset/ --fail-fast
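In a script, the same two behaviors can be reproduced around per-file uploads: collect failures and report them at the end, or re-raise on the first error (the --fail-fast equivalent). A sketch under the same per-file assumption as above:

from pathlib import Path
import geoparquet_io as gpio

fail_fast = False   # set True to stop on the first error, like --fail-fast
failures = []

for path in sorted(Path('data').rglob('*.parquet')):
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    try:
        gpio.read(str(path)).upload(dest, profile='my-profile')
    except Exception as exc:
        if fail_fast:
            raise
        failures.append((path, exc))

if failures:
    print(f"{len(failures)} file(s) failed to upload")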

Dry Run

Preview what would be uploaded without actually uploading:

gpio publish upload data/ s3://bucket/dataset/ --dry-run

Shows:

  • Files that would be uploaded
  • Total size
  • Destination paths
  • AWS profile (if specified)

S3-Compatible Storage

Upload to MinIO, Ceph, or other S3-compatible storage:

# MinIO without SSL
gpio publish upload data.parquet s3://bucket/file.parquet \
  --s3-endpoint minio.example.com:9000 \
  --s3-no-ssl

# Custom endpoint with specific region
gpio publish upload data/ s3://bucket/dataset/ \
  --s3-endpoint storage.example.com \
  --s3-region eu-west-1

The Python API accepts the same options as keyword arguments:

import geoparquet_io as gpio

# MinIO without SSL
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='minio.example.com:9000',
    s3_use_ssl=False
)

# Custom endpoint with specific region
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='storage.example.com',
    s3_region='eu-west-1'
)

Options:

  • --s3-endpoint / s3_endpoint - custom endpoint hostname and optional port
  • --s3-region / s3_region - override the region (defaults to us-east-1 for custom endpoints)
  • --s3-no-ssl / s3_use_ssl=False - use HTTP instead of HTTPS

Directory Structure

When uploading directories, the structure is preserved:

# Input structure:
data/
  ├── region1/
  │   ├── file1.parquet
  │   └── file2.parquet
  └── region2/
      └── file3.parquet

# After upload to s3://bucket/dataset/:
s3://bucket/dataset/region1/file1.parquet
s3://bucket/dataset/region1/file2.parquet
s3://bucket/dataset/region2/file3.parquet
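The mapping from local paths to destination keys is simply the path relative to the source directory appended to the destination prefix. A small illustration of that rule; the bucket name is a placeholder:

from pathlib import Path

source = Path('data')
dest_prefix = 's3://bucket/dataset/'

# Derive the destination key for each local file
for path in sorted(source.rglob('*.parquet')):
    key = dest_prefix + path.relative_to(source).as_posix()
    print(path, '->', key)
# data/region1/file1.parquet -> s3://bucket/dataset/region1/file1.parquet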

See Also