
Uploading to Cloud Storage

The upload command uploads GeoParquet files to cloud object storage (S3, GCS, Azure) with parallel transfers and progress tracking.

Basic Usage

# Single file to S3
gpio publish upload input.parquet s3://bucket/path/output.parquet --profile my-profile

# Directory to S3
gpio publish upload data/ s3://bucket/dataset/ --profile my-profile

The same operations are available from the Python API:

import geoparquet_io as gpio

# Upload to S3 with transform
gpio.read('input.parquet') \
    .sort_hilbert() \
    .upload('s3://bucket/path/output.parquet', profile='my-profile')

# Upload with S3-compatible endpoint (MinIO, etc)
gpio.read('input.parquet') \
    .upload(
        's3://bucket/path/output.parquet',
        s3_endpoint='minio.example.com:9000',
        s3_use_ssl=False
    )

Supported Destinations

Provider support via URL scheme:

  • AWS S3 - s3://bucket/path/
  • Google Cloud Storage - gs://bucket/path/
  • Azure Blob Storage - az://account/container/path/
  • HTTP stores - https://...
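The upload() call accepts the same URL schemes as destinations. A minimal sketch, assuming credentials are already configured as described under Authentication below; the bucket, account, and container names are placeholders:

import geoparquet_io as gpio

# Google Cloud Storage destination (assumes GOOGLE_APPLICATION_CREDENTIALS is set)
gpio.read('input.parquet').upload('gs://my-bucket/dataset/output.parquet')

# Azure Blob Storage destination (assumes AZURE_STORAGE_ACCOUNT_* variables are set)
gpio.read('input.parquet').upload('az://myaccount/mycontainer/output.parquet')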

Authentication

For authentication setup, see the Remote Files guide.

Quick reference:

  • AWS S3 - use the --profile flag or set the AWS_PROFILE environment variable
  • Google Cloud Storage - set GOOGLE_APPLICATION_CREDENTIALS
  • Azure - set AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
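In a script, the same credentials can be supplied through environment variables before calling upload(). A minimal sketch, assuming the variables are read when upload() runs; the paths and credential values are placeholders:

import os
import geoparquet_io as gpio

# AWS: either pass profile='my-profile' to upload(), or set AWS_PROFILE
os.environ['AWS_PROFILE'] = 'my-profile'

# GCS: point GOOGLE_APPLICATION_CREDENTIALS at a service-account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/key.json'

# Azure: storage account name and key
os.environ['AZURE_STORAGE_ACCOUNT_NAME'] = 'myaccount'
os.environ['AZURE_STORAGE_ACCOUNT_KEY'] = '<account-key>'

gpio.read('input.parquet').upload('s3://bucket/path/output.parquet')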

Options

Pattern Filtering

Upload only specific file types:

# Only JSON files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.json"

# Only Parquet files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.parquet"
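The Python examples in this guide upload one file per call, so an equivalent of --pattern from a script is to glob the directory yourself. A sketch under that assumption; the directory layout, bucket, and profile name are placeholders:

from pathlib import Path
import geoparquet_io as gpio

# Upload only the Parquet files under data/, mirroring --pattern "*.parquet"
for path in sorted(Path('data').rglob('*.parquet')):
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    gpio.read(str(path)).upload(dest, profile='my-profile')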

Parallel Uploads

Control concurrency for directory uploads:

# Upload 8 files in parallel (default: 4)
gpio publish upload data/ s3://bucket/dataset/ --max-files 8

Trade-off: higher parallelism gives faster uploads but uses more bandwidth and memory.
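From Python, a similar degree of parallelism can be approximated with a thread pool around per-file uploads. A minimal sketch, assuming upload() is safe to call from multiple threads (the CLI handles this internally via --max-files); names and paths are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import geoparquet_io as gpio

def upload_one(path: Path) -> str:
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    gpio.read(str(path)).upload(dest, profile='my-profile')
    return dest

files = sorted(Path('data').rglob('*.parquet'))

# Roughly equivalent to --max-files 8: at most 8 files in flight at once
with ThreadPoolExecutor(max_workers=8) as pool:
    for dest in pool.map(upload_one, files):
        print('uploaded', dest)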

Chunk Concurrency

Control concurrent chunks within each file:

# More concurrent chunks per file (default: 12)
gpio publish upload large.parquet s3://bucket/file.parquet --chunk-concurrency 20

Custom Chunk Size

Override default multipart upload chunk size:

# 10MB chunks instead of default 5MB
gpio publish upload data.parquet s3://bucket/file.parquet --chunk-size 10485760
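Chunk size, chunk concurrency, and file parallelism together bound how much upload data can be buffered at once. A back-of-the-envelope estimate using the documented defaults (an upper bound, assuming every chunk slot is filled):

# Rough peak buffer estimate: chunk size x chunks per file x files in parallel
chunk_size = 5 * 1024 * 1024      # default 5MB chunks
chunk_concurrency = 12            # default concurrent chunks per file
max_files = 4                     # default files uploaded in parallel

peak_bytes = chunk_size * chunk_concurrency * max_files
print(f"~{peak_bytes / (1024 * 1024):.0f} MB of chunks in flight")  # ~240 MB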

Error Handling

By default, the command continues uploading the remaining files if one fails:

# Stop immediately on first error
gpio publish upload data/ s3://bucket/dataset/ --fail-fast
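In a script, the same two behaviors can be reproduced around per-file uploads: collect failures and report them at the end, or re-raise on the first error (the --fail-fast equivalent). A sketch under the same per-file assumption as above:

from pathlib import Path
import geoparquet_io as gpio

fail_fast = False   # set True to stop on the first error, like --fail-fast
failures = []

for path in sorted(Path('data').rglob('*.parquet')):
    dest = f"s3://bucket/dataset/{path.relative_to('data').as_posix()}"
    try:
        gpio.read(str(path)).upload(dest, profile='my-profile')
    except Exception as exc:
        if fail_fast:
            raise
        failures.append((path, exc))

if failures:
    print(f"{len(failures)} file(s) failed to upload")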

Dry Run

Preview what would be uploaded without actually uploading:

gpio publish upload data/ s3://bucket/dataset/ --dry-run

Shows:

  • Files that would be uploaded
  • Total size
  • Destination paths
  • AWS profile (if specified)

S3-Compatible Storage

Upload to MinIO, Ceph, or other S3-compatible storage:

# MinIO without SSL
gpio publish upload data.parquet s3://bucket/file.parquet \
  --s3-endpoint minio.example.com:9000 \
  --s3-no-ssl

# Custom endpoint with specific region
gpio publish upload data/ s3://bucket/dataset/ \
  --s3-endpoint storage.example.com \
  --s3-region eu-west-1

The Python API accepts the same options as keyword arguments:

import geoparquet_io as gpio

# MinIO without SSL
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='minio.example.com:9000',
    s3_use_ssl=False
)

# Custom endpoint with specific region
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='storage.example.com',
    s3_region='eu-west-1'
)

Options:

  • --s3-endpoint / s3_endpoint - custom endpoint hostname and optional port
  • --s3-region / s3_region - override the region (defaults to us-east-1 for custom endpoints)
  • --s3-no-ssl / s3_use_ssl=False - use HTTP instead of HTTPS

Directory Structure

When uploading directories, the structure is preserved:

# Input structure:
data/
  ├── region1/
  │   ├── file1.parquet
  │   └── file2.parquet
  └── region2/
      └── file3.parquet

# After upload to s3://bucket/dataset/:
s3://bucket/dataset/region1/file1.parquet
s3://bucket/dataset/region1/file2.parquet
s3://bucket/dataset/region2/file3.parquet
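The mapping from local paths to destination keys is simply the path relative to the source directory appended to the destination prefix. A small illustration of that rule; the bucket name is a placeholder:

from pathlib import Path

source = Path('data')
dest_prefix = 's3://bucket/dataset/'

# Derive the destination key for each local file
for path in sorted(source.rglob('*.parquet')):
    key = dest_prefix + path.relative_to(source).as_posix()
    print(path, '->', key)
# data/region1/file1.parquet -> s3://bucket/dataset/region1/file1.parquet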

See Also