Uploading to Cloud Storage¶
The upload command uploads GeoParquet files to cloud object storage (S3, GCS, Azure) with parallel transfers and progress tracking.
Basic Usage¶
# Single file to S3
gpio publish upload input.parquet s3://bucket/path/output.parquet --profile my-profile
# Directory to S3
gpio publish upload data/ s3://bucket/dataset/ --profile my-profile
import geoparquet_io as gpio
# Upload to S3 with transform
gpio.read('input.parquet') \
.sort_hilbert() \
.upload('s3://bucket/path/output.parquet', profile='my-profile')
# Upload with S3-compatible endpoint (MinIO, etc)
gpio.read('input.parquet') \
.upload(
's3://bucket/path/output.parquet',
s3_endpoint='minio.example.com:9000',
s3_use_ssl=False
)
Supported Destinations¶
Provider support via URL scheme:
- AWS S3 -
s3://bucket/path/ - Google Cloud Storage -
gs://bucket/path/ - Azure Blob Storage -
az://account/container/path/ - HTTP stores -
https://...
Authentication¶
For authentication setup, see the Remote Files guide.
Quick reference:
- AWS S3: Use --profile flag or set AWS_PROFILE env var
- Google Cloud Storage: Set GOOGLE_APPLICATION_CREDENTIALS
- Azure: Set AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
Options¶
Pattern Filtering¶
Upload only specific file types:
# Only JSON files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.json"
# Only Parquet files
gpio publish upload data/ s3://bucket/dataset/ --pattern "*.parquet"
Parallel Uploads¶
Control concurrency for directory uploads:
# Upload 8 files in parallel (default: 4)
gpio publish upload data/ s3://bucket/dataset/ --max-files 8
Trade-off: Higher parallelism = faster uploads but more bandwidth/memory usage.
Chunk Concurrency¶
Control concurrent chunks within each file:
# More concurrent chunks per file (default: 12)
gpio publish upload large.parquet s3://bucket/file.parquet --chunk-concurrency 20
Custom Chunk Size¶
Override default multipart upload chunk size:
# 10MB chunks instead of default 5MB
gpio publish upload data.parquet s3://bucket/file.parquet --chunk-size 10485760
Error Handling¶
By default, continues uploading remaining files if one fails:
# Stop immediately on first error
gpio publish upload data/ s3://bucket/dataset/ --fail-fast
Dry Run¶
Preview what would be uploaded without actually uploading:
gpio publish upload data/ s3://bucket/dataset/ --dry-run
Shows: - Files that would be uploaded - Total size - Destination paths - AWS profile (if specified)
S3-Compatible Storage¶
Upload to MinIO, Ceph, or other S3-compatible storage:
# MinIO without SSL
gpio publish upload data.parquet s3://bucket/file.parquet \
--s3-endpoint minio.example.com:9000 \
--s3-no-ssl
# Custom endpoint with specific region
gpio publish upload data/ s3://bucket/dataset/ \
--s3-endpoint storage.example.com \
--s3-region eu-west-1
import geoparquet_io as gpio
# MinIO without SSL
gpio.read('data.parquet').upload(
's3://bucket/file.parquet',
s3_endpoint='minio.example.com:9000',
s3_use_ssl=False
)
# Custom endpoint with specific region
gpio.read('data.parquet').upload(
's3://bucket/file.parquet',
s3_endpoint='storage.example.com',
s3_region='eu-west-1'
)
Options:
- --s3-endpoint / s3_endpoint - Custom endpoint hostname and optional port
- --s3-region / s3_region - Override region (defaults to us-east-1 for custom endpoints)
- --s3-no-ssl / s3_use_ssl=False - Use HTTP instead of HTTPS
Directory Structure¶
When uploading directories, the structure is preserved:
# Input structure:
data/
├── region1/
│ ├── file1.parquet
│ └── file2.parquet
└── region2/
└── file3.parquet
# After upload to s3://bucket/dataset/:
s3://bucket/dataset/region1/file1.parquet
s3://bucket/dataset/region1/file2.parquet
s3://bucket/dataset/region2/file3.parquet
See Also¶
- convert command - Convert vector formats to GeoParquet
- check command - Validate and fix GeoParquet files
- partition command - Partition GeoParquet files