# Core Functions Reference
Auto-generated API reference for core functions.
## Adding Columns
### add_bbox_column

`add_bbox_column(input_parquet, output_parquet=None, bbox_column_name='bbox', dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, force=False, geoparquet_version=None)`
Add a bbox struct column to a GeoParquet file.
Supports Arrow IPC streaming:

- Input `"-"` reads from stdin
- Output `"-"` or `None` (with piped stdout) streams to stdout
Checks for existing bbox columns before adding. If a bbox column already exists:

- With covering metadata: informs the user and exits successfully (no action needed)
- Without metadata: suggests using the `gpio add bbox-metadata` command
- With `--force`: replaces the existing bbox column
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| `output_parquet` | `str \| None` | Path to output file, "-" for stdout, or None for auto-detect | `None` |
| `bbox_column_name` | `str` | Name for the bbox column (default: 'bbox') | `'bbox'` |
| `dry_run` | `bool` | Whether to print SQL commands without executing them | `False` |
| `verbose` | `bool` | Whether to print verbose output | `False` |
| `compression` | `str` | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | `'ZSTD'` |
| `compression_level` | `int \| None` | Compression level (varies by format) | `None` |
| `row_group_size_mb` | `float \| None` | Target row group size in MB | `None` |
| `row_group_rows` | `int \| None` | Exact number of rows per row group | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `force` | `bool` | Whether to replace an existing bbox column | `False` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | `None` |
Note: Bbox covering metadata is automatically added when the file is written.
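A minimal usage sketch; the `gpio.core` import path and file names are illustrative assumptions, not confirmed by this reference:

```python
# Hypothetical import path -- adjust to wherever these functions live in your installation.
from gpio.core import add_bbox_column

# Add a bbox struct column and write a new file next to the input.
add_bbox_column("places.parquet", output_parquet="places_bbox.parquet", verbose=True)

# Replace an existing bbox column and target ~128 MB row groups.
add_bbox_column(
    "places.parquet",
    output_parquet="places_bbox.parquet",
    force=True,
    row_group_size_mb=128,
)
```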
### add_h3_column

`add_h3_column(input_parquet, output_parquet=None, h3_column_name=DEFAULT_H3_COLUMN_NAME, h3_resolution=9, dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, geoparquet_version=None)`
Add an H3 cell ID column to a GeoParquet file.
Supports Arrow IPC streaming:

- Input `"-"` reads from stdin
- Output `"-"` or `None` (with piped stdout) streams to stdout
Computes H3 cell IDs based on geometry centroids using the H3 hierarchical hexagonal grid system. The cell ID is stored as a VARCHAR (string) for maximum portability.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| `output_parquet` | `str \| None` | Path to output file, "-" for stdout, or None for auto-detect | `None` |
| `h3_column_name` | `str` | Name for the H3 column (default: 'h3_cell') | `DEFAULT_H3_COLUMN_NAME` |
| `h3_resolution` | `int` | H3 resolution level (0-15). Res 7: ~5 km², Res 9: ~0.1 km², Res 11: ~1,770 m², Res 13: ~44 m², Res 15: ~0.9 m². Default: 9 (good balance for most use cases) | `9` |
| `dry_run` | `bool` | Whether to print SQL commands without executing them | `False` |
| `verbose` | `bool` | Whether to print verbose output | `False` |
| `compression` | `str` | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | `'ZSTD'` |
| `compression_level` | `int \| None` | Compression level (varies by format) | `None` |
| `row_group_size_mb` | `float \| None` | Target row group size in MB | `None` |
| `row_group_rows` | `int \| None` | Exact number of rows per row group | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | `None` |
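A short sketch of adding H3 cells at the default and a coarser resolution (same assumed import path; file and column names are illustrative):

```python
from gpio.core import add_h3_column  # hypothetical import path

# Default resolution 9 (~0.1 km² cells) and the default column name ('h3_cell').
add_h3_column("buildings.parquet", output_parquet="buildings_h3.parquet")

# Coarser cells for regional rollups, with a custom column name.
add_h3_column(
    "buildings.parquet",
    output_parquet="buildings_h3_r7.parquet",
    h3_column_name="h3_r7",
    h3_resolution=7,
)
```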
### add_kdtree_column

`add_kdtree_column(input_parquet, output_parquet=None, kdtree_column_name='kdtree_cell', iterations=9, dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, force=False, sample_size=100000, auto_target_rows=None, profile=None, geoparquet_version=None)`
Add a KD-tree cell ID column to a GeoParquet file.
Supports Arrow IPC streaming:

- Input `"-"` reads from stdin
- Output `"-"` or `None` (with piped stdout) streams to stdout
Creates balanced spatial partitions by recursively splitting at the median, alternating between the X and Y dimensions.
By default, uses approximate computation: partition boundaries are computed on a sample and then applied to the full dataset in a single pass.
Performance Note: Approximate mode is O(n), exact mode is O(n × iterations).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| `output_parquet` | `str \| None` | Path to output file, "-" for stdout, or None for auto-detect | `None` |
| `kdtree_column_name` | `str` | Name for the KD-tree column (default: 'kdtree_cell') | `'kdtree_cell'` |
| `iterations` | `int` | Number of recursive splits (1-20). Determines partition count: 2^iterations. If None, will be auto-computed based on auto_target_rows. | `9` |
| `dry_run` | `bool` | Whether to print SQL commands without executing them | `False` |
| `verbose` | `bool` | Whether to print verbose output | `False` |
| `compression` | `str` | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | `'ZSTD'` |
| `compression_level` | `int \| None` | Compression level (varies by format) | `None` |
| `row_group_size_mb` | `float \| None` | Target row group size in MB | `None` |
| `row_group_rows` | `int \| None` | Exact number of rows per row group | `None` |
| `force` | `bool` | Force operation even on large datasets (not recommended) | `False` |
| `sample_size` | `int` | Number of points to sample for computing boundaries. None for exact mode (default: 100000) | `100000` |
| `auto_target_rows` | `int \| None` | If set, auto-compute iterations to target this many rows per partition | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | `None` |
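A sketch of the two documented ways to size the KD-tree, under the same import-path assumption:

```python
from gpio.core import add_kdtree_column  # hypothetical import path

# 2^9 = 512 balanced partitions, using the default approximate (sampled) mode.
add_kdtree_column("roads.parquet", output_parquet="roads_kd.parquet", iterations=9)

# Auto-compute the number of iterations so each partition targets ~500k rows.
add_kdtree_column(
    "roads.parquet",
    output_parquet="roads_kd_auto.parquet",
    iterations=None,
    auto_target_rows=500_000,
)
```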
## Spatial Operations
### hilbert_order

`hilbert_order(input_parquet, output_parquet=None, geometry_column='geometry', add_bbox_flag=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, geoparquet_version=None)`
Reorder a GeoParquet file using Hilbert curve ordering.
Supports Arrow IPC streaming:

- Input `"-"` reads from stdin
- Output `"-"` or `None` (with piped stdout) streams to stdout
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Path to input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| `output_parquet` | `str \| None` | Path to output file, "-" for stdout, or None for auto-detect | `None` |
| `geometry_column` | `str` | Name of geometry column (default: 'geometry') | `'geometry'` |
| `add_bbox_flag` | `bool` | Add bbox column before sorting if not present | `False` |
| `verbose` | `bool` | Print verbose output | `False` |
| `compression` | `str` | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | `'ZSTD'` |
| `compression_level` | `int \| None` | Compression level (varies by format) | `None` |
| `row_group_size_mb` | `float \| None` | Target row group size in MB | `None` |
| `row_group_rows` | `int \| None` | Exact number of rows per row group | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | `None` |
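A hedged example of spatially sorting a file (import path and file names are assumptions):

```python
from gpio.core import hilbert_order  # hypothetical import path

# Reorder rows along a Hilbert curve; add a bbox column first if the file lacks one.
hilbert_order(
    "countries.parquet",
    output_parquet="countries_hilbert.parquet",
    add_bbox_flag=True,
    verbose=True,
)
```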
### add_bbox_metadata

`add_bbox_metadata(parquet_file, verbose=False)`
Add bbox covering metadata to a GeoParquet file.
Updates the GeoParquet metadata to include bbox covering information, which enables spatial filtering optimizations in readers that support it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parquet_file` |  | Path to the parquet file (will be modified in place) | required |
| `verbose` |  | Print verbose output | `False` |
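A minimal in-place example (assumed import path and file name):

```python
from gpio.core import add_bbox_metadata  # hypothetical import path

# Modifies the file in place, adding bbox covering metadata.
add_bbox_metadata("places_bbox.parquet", verbose=True)
```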
## Partitioning
### partition_by_string

`partition_by_string(input_parquet, output_folder, column, chars=None, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, force=False, skip_analysis=False, filename_prefix=None, profile=None, geoparquet_version=None)`
Partition a GeoParquet file by string column values or prefixes.
Supports Arrow IPC streaming for input:

- Input `"-"` reads from stdin (output is always a directory)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| `output_folder` | `str` | Output directory (always writes to directory, no stdout support) | required |
| `column` | `str` | Column name to partition by (required) | required |
| `chars` | `int \| None` | Optional number of characters to use (prefix length) | `None` |
| `hive` | `bool` | Use Hive-style partitioning | `False` |
| `overwrite` | `bool` | Overwrite existing files | `False` |
| `preview` | `bool` | Show preview of partitions without creating files | `False` |
| `preview_limit` | `int` | Maximum number of partitions to show in preview (default: 15) | `15` |
| `verbose` | `bool` | Verbose output | `False` |
| `force` | `bool` | Force partitioning even if analysis detects issues | `False` |
| `skip_analysis` | `bool` | Skip partition strategy analysis (for performance) | `False` |
| `filename_prefix` | `str \| None` | Optional prefix for partition filenames | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write | `None` |
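A sketch showing a preview run followed by a Hive-style write; the import path, file, and column names are assumptions:

```python
from gpio.core import partition_by_string  # hypothetical import path

# Preview how rows would split on the first two characters of an (illustrative) code column.
partition_by_string(
    "addresses.parquet",
    "addresses_by_code/",
    column="postcode",
    chars=2,
    preview=True,
)

# Write Hive-style partitions (column=value directories) once the preview looks reasonable.
partition_by_string(
    "addresses.parquet",
    "addresses_by_code/",
    column="postcode",
    chars=2,
    hive=True,
    overwrite=True,
)
```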
### partition_by_h3

`partition_by_h3(input_parquet, output_folder, h3_column_name=DEFAULT_H3_COLUMN_NAME, resolution=9, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, keep_h3_column=None, force=False, skip_analysis=False, filename_prefix=None, profile=None, geoparquet_version=None)`
Partition a GeoParquet file by H3 cells at specified resolution.
Supports Arrow IPC streaming for input:

- Input `"-"` reads from stdin (output is always a directory)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| `output_folder` | `str` | Output directory (always writes to directory, no stdout support) | required |
| `h3_column_name` | `str` | Name of the H3 column (default: 'h3_cell') | `DEFAULT_H3_COLUMN_NAME` |
| `resolution` | `int` | H3 resolution level (0-15). Default: 9 | `9` |
| `hive` | `bool` | Use Hive-style partitioning (column=value directories) | `False` |
| `overwrite` | `bool` | Overwrite existing output directory | `False` |
| `preview` | `bool` | Preview partition distribution without writing | `False` |
| `preview_limit` | `int` | Max number of partitions to show in preview | `15` |
| `verbose` | `bool` | Print verbose output | `False` |
| `keep_h3_column` | `bool \| None` | Keep H3 column in output partitions | `None` |
| `force` | `bool` | Force operation even if analysis suggests issues | `False` |
| `skip_analysis` | `bool` | Skip partition analysis | `False` |
| `filename_prefix` | `str \| None` | Prefix for output filenames | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write | `None` |
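A short sketch of H3-based partitioning at a coarse resolution (assumed import path and file names):

```python
from gpio.core import partition_by_h3  # hypothetical import path

# Partition on H3 cells at a coarse resolution and drop the helper column from the outputs.
partition_by_h3(
    "sensors.parquet",
    "sensors_by_h3/",
    resolution=5,
    keep_h3_column=False,
    overwrite=True,
)
```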
### partition_by_kdtree

`partition_by_kdtree(input_parquet, output_folder, kdtree_column_name='kdtree_cell', iterations=None, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, keep_kdtree_column=None, force=False, skip_analysis=False, sample_size=100000, auto_target_rows=None, filename_prefix=None, profile=None, geoparquet_version=None)`
Partition a GeoParquet file by KD-tree cells.
Supports Arrow IPC streaming for input:

- Input `"-"` reads from stdin (output is always a directory)
If the KD-tree column doesn't exist, it will be automatically added before partitioning.
Performance Note: Approximate mode is O(n), exact mode is O(n × iterations).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_parquet` | `str` | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| `output_folder` | `str` | Output directory | required |
| `kdtree_column_name` | `str` | Name of KD-tree column (default: 'kdtree_cell') | `'kdtree_cell'` |
| `iterations` | `int \| None` | Number of recursive splits (1-20, default: 9) | `None` |
| `hive` | `bool` | Use Hive-style partitioning | `False` |
| `overwrite` | `bool` | Overwrite existing files | `False` |
| `preview` | `bool` | Show preview of partitions without creating files | `False` |
| `preview_limit` | `int` | Maximum number of partitions to show in preview (default: 15) | `15` |
| `verbose` | `bool` | Verbose output | `False` |
| `keep_kdtree_column` | `bool \| None` | Whether to keep KD-tree column in output files | `None` |
| `force` | `bool` | Force partitioning even if analysis detects issues | `False` |
| `skip_analysis` | `bool` | Skip partition strategy analysis (for performance) | `False` |
| `sample_size` | `int` | Number of points to sample for computing boundaries | `100000` |
| `auto_target_rows` | `int \| None` | If set, auto-compute iterations to target this many rows per partition | `None` |
| `filename_prefix` | `str \| None` | Optional prefix for partition filenames | `None` |
| `profile` | `str \| None` | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` | `str \| None` | GeoParquet version to write | `None` |
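A sketch that relies on the automatic KD-tree column creation described above (import path and file names assumed):

```python
from gpio.core import partition_by_kdtree  # hypothetical import path

# The KD-tree column is added automatically when missing; aim for ~1M rows per partition.
partition_by_kdtree(
    "parcels.parquet",
    "parcels_kd/",
    auto_target_rows=1_000_000,
    overwrite=True,
    verbose=True,
)
```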
## Validation
### check_all

`check_all(parquet_file, verbose=False, return_results=False, quiet=False)`
Run all structure checks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parquet_file` |  | Path to parquet file | required |
| `verbose` |  | Print additional information | `False` |
| `return_results` |  | If True, return aggregated results dict | `False` |
| `quiet` |  | If True, suppress all output (for multi-file batch mode) | `False` |
Returns:

| Type | Description |
|---|---|
|  | `dict` if `return_results=True`, containing results from all checks |
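A minimal sketch of collecting the aggregated results (assumed import path; the structure of the returned dict depends on the individual checks):

```python
from gpio.core import check_all  # hypothetical import path

# Run every structure check and collect the aggregated results quietly.
results = check_all("places.parquet", return_results=True, quiet=True)
for check_name, outcome in results.items():  # exact keys depend on the individual checks
    print(check_name, outcome)
```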
### check_spatial_order

`check_spatial_order(parquet_file, random_sample_size, limit_rows, verbose, return_results=False, quiet=False)`
Check if a GeoParquet file is spatially ordered.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parquet_file` |  | Path to parquet file | required |
| `random_sample_size` |  | Number of rows in each random sample | required |
| `limit_rows` |  | Max number of rows to analyze | required |
| `verbose` |  | Print additional information | required |
| `return_results` |  | If True, return structured results dict | `False` |
| `quiet` |  | If True, suppress all output (for multi-file batch mode) | `False` |
Returns:

| Type | Description |
|---|---|
|  | ratio (float) if `return_results=False`, or dict if `return_results=True` |
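A sketch with illustrative sample and row-limit values; note that the first four arguments are positional (assumed import path):

```python
from gpio.core import check_spatial_order  # hypothetical import path

# parquet_file, random_sample_size, limit_rows, and verbose are all positional here.
ratio = check_spatial_order("places.parquet", 1000, 100_000, False)
print(f"spatial order ratio: {ratio:.3f}")
```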
## Common Utilities
### get_dataset_bounds

`get_dataset_bounds(parquet_file, geometry_column=None, verbose=False)`
Calculate the bounding box of the entire dataset.
Uses the bbox column if available for fast calculation; otherwise calculates from the geometry column (slower).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parquet_file` |  | Path to the parquet file | required |
| `geometry_column` |  | Geometry column name (if None, will auto-detect) | `None` |
| `verbose` |  | Whether to print verbose output | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `tuple` |  | (xmin, ymin, xmax, ymax) or None if error |
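A short sketch (assumed import path and file name):

```python
from gpio.core import get_dataset_bounds  # hypothetical import path

bounds = get_dataset_bounds("places.parquet", verbose=True)
if bounds is not None:
    xmin, ymin, xmax, ymax = bounds
    print(f"dataset extent: ({xmin}, {ymin}) to ({xmax}, {ymax})")
```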
### find_primary_geometry_column

`find_primary_geometry_column(parquet_file, verbose=False)`
Find the primary geometry column from GeoParquet metadata.
Looks up the geometry column name from GeoParquet metadata. Falls back to 'geometry' if no metadata is present or if the primary column is not specified.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parquet_file` |  | Path to the parquet file (local or remote URL) | required |
| `verbose` |  | Print verbose output | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` |  | Name of the primary geometry column (defaults to 'geometry') |
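A minimal sketch (assumed import path):

```python
from gpio.core import find_primary_geometry_column  # hypothetical import path

geom_col = find_primary_geometry_column("places.parquet")
print(f"primary geometry column: {geom_col}")  # 'geometry' when metadata is absent
```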
### write_parquet_with_metadata

`write_parquet_with_metadata(con, query, output_file, original_metadata=None, compression='ZSTD', compression_level=15, row_group_size_mb=None, row_group_rows=None, custom_metadata=None, verbose=False, show_sql=False, profile=None, geoparquet_version=None, input_crs=None)`
Write a parquet file with proper compression and metadata handling.
Uses Arrow as the internal transfer format for efficiency: DuckDB query results are fetched directly as an Arrow table, metadata is applied in memory, and the file is written once to disk (no intermediate file rewrites).
Supports both local and remote outputs (S3, GCS, Azure). Remote outputs are written to a temporary local file, then uploaded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `con` |  | DuckDB connection | required |
| `query` |  | SQL query to execute | required |
| `output_file` |  | Path to output file (local path or remote URL) | required |
| `original_metadata` |  | Original metadata from source file | `None` |
| `compression` |  | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | `'ZSTD'` |
| `compression_level` |  | Compression level (varies by format) | `15` |
| `row_group_size_mb` |  | Target row group size in MB | `None` |
| `row_group_rows` |  | Exact number of rows per row group | `None` |
| `custom_metadata` |  | Optional dict with custom metadata (e.g., H3 info) | `None` |
| `verbose` |  | Whether to print verbose output | `False` |
| `show_sql` |  | Whether to print SQL statements before execution | `False` |
| `profile` |  | AWS profile name (S3 only, optional) | `None` |
| `geoparquet_version` |  | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | `None` |
| `input_crs` |  | PROJJSON dict with CRS from input file | `None` |
Returns:

| Type | Description |
|---|---|
|  | None |
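A hedged sketch wiring a DuckDB connection to this writer; the import path, source file, and the column in the WHERE clause are assumptions:

```python
import duckdb

from gpio.core import write_parquet_with_metadata  # hypothetical import path

con = duckdb.connect()

# Write the result of a DuckDB query with explicit compression settings.
# The source file and filter column are illustrative only.
write_parquet_with_metadata(
    con,
    "SELECT * FROM read_parquet('places.parquet') WHERE population > 10000",
    "big_places.parquet",
    compression="ZSTD",
    compression_level=15,
    verbose=True,
)
```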