Core Functions Reference

Auto-generated API reference for core functions.

Adding Columns

add_bbox_column

add_bbox_column(input_parquet, output_parquet=None, bbox_column_name='bbox', dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, force=False, geoparquet_version=None)

Add a bbox struct column to a GeoParquet file.

Supports Arrow IPC streaming:

  • Input "-" reads from stdin
  • Output "-" or None (with piped stdout) streams to stdout

Checks for existing bbox columns before adding. If a bbox column already exists:

  • With covering metadata: Informs user and exits successfully (no action needed)
  • Without metadata: Suggests using gpio add bbox-metadata command
  • With --force: Replaces the existing bbox column

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| output_parquet | str \| None | Path to output file, "-" for stdout, or None for auto-detect | None |
| bbox_column_name | str | Name for the bbox column (default: 'bbox') | 'bbox' |
| dry_run | bool | Whether to print SQL commands without executing them | False |
| verbose | bool | Whether to print verbose output | False |
| compression | str | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | 'ZSTD' |
| compression_level | int \| None | Compression level (varies by format) | None |
| row_group_size_mb | float \| None | Target row group size in MB | None |
| row_group_rows | int \| None | Exact number of rows per row group | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| force | bool | Whether to replace an existing bbox column | False |
| geoparquet_version | str \| None | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | None |

Note

Bbox covering metadata is automatically added when the file is written.
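
A minimal usage sketch; the import path and file names below are placeholders, since this reference does not state the package's module layout:

```python
# Placeholder import: adjust to the actual module path of your installation.
from gpio import add_bbox_column

# Add a bbox struct column and write the result to a new file.
add_bbox_column(
    "buildings.parquet",
    output_parquet="buildings_bbox.parquet",
    compression="ZSTD",
    verbose=True,
)
```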

add_h3_column

add_h3_column(input_parquet, output_parquet=None, h3_column_name=DEFAULT_H3_COLUMN_NAME, h3_resolution=9, dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, geoparquet_version=None)

Add an H3 cell ID column to a GeoParquet file.

Supports Arrow IPC streaming:

  • Input "-" reads from stdin
  • Output "-" or None (with piped stdout) streams to stdout

Computes H3 cell IDs based on geometry centroids using the H3 hierarchical hexagonal grid system. The cell ID is stored as a VARCHAR (string) for maximum portability.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| output_parquet | str \| None | Path to output file, "-" for stdout, or None for auto-detect | None |
| h3_column_name | str | Name for the H3 column (default: 'h3_cell') | DEFAULT_H3_COLUMN_NAME |
| h3_resolution | int | H3 resolution level (0-15). Res 7: ~5 km², Res 9: ~0.1 km², Res 11: ~1,770 m², Res 13: ~44 m², Res 15: ~0.9 m². Default: 9 (a good balance for most use cases) | 9 |
| dry_run | bool | Whether to print SQL commands without executing them | False |
| verbose | bool | Whether to print verbose output | False |
| compression | str | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | 'ZSTD' |
| compression_level | int \| None | Compression level (varies by format) | None |
| row_group_size_mb | float \| None | Target row group size in MB | None |
| row_group_rows | int \| None | Exact number of rows per row group | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | None |
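
A usage sketch (placeholder import and file names; resolution 11 is chosen purely for illustration):

```python
from gpio import add_h3_column  # placeholder import path

# Compute H3 cell IDs (stored as strings) from geometry centroids.
add_h3_column(
    "points.parquet",
    output_parquet="points_h3.parquet",
    h3_resolution=11,  # finer than the default of 9
)
```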

add_kdtree_column

add_kdtree_column(input_parquet, output_parquet=None, kdtree_column_name='kdtree_cell', iterations=9, dry_run=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, force=False, sample_size=100000, auto_target_rows=None, profile=None, geoparquet_version=None)

Add a KD-tree cell ID column to a GeoParquet file.

Supports Arrow IPC streaming:

  • Input "-" reads from stdin
  • Output "-" or None (with piped stdout) streams to stdout

Creates balanced spatial partitions using recursive splits alternating between X and Y dimensions at medians.

By default, uses approximate computation: computes partition boundaries on a sample, then applies to full dataset in a single pass.

Performance Note: Approximate mode is O(n), exact mode is O(n × iterations).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Path to the input parquet file (local, remote URL, or "-" for stdin) | required |
| output_parquet | str \| None | Path to output file, "-" for stdout, or None for auto-detect | None |
| kdtree_column_name | str | Name for the KD-tree column (default: 'kdtree_cell') | 'kdtree_cell' |
| iterations | int | Number of recursive splits (1-20). Determines partition count: 2^iterations. If None, will be auto-computed based on auto_target_rows | 9 |
| dry_run | bool | Whether to print SQL commands without executing them | False |
| verbose | bool | Whether to print verbose output | False |
| compression | str | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | 'ZSTD' |
| compression_level | int \| None | Compression level (varies by format) | None |
| row_group_size_mb | float \| None | Target row group size in MB | None |
| row_group_rows | int \| None | Exact number of rows per row group | None |
| force | bool | Force operation even on large datasets (not recommended) | False |
| sample_size | int | Number of points to sample for computing boundaries; None for exact mode (default: 100000) | 100000 |
| auto_target_rows | int \| None | If set, auto-compute iterations to target this many rows per partition | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | None |
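
A usage sketch under the same placeholder-import assumption, showing the default approximate mode:

```python
from gpio import add_kdtree_column  # placeholder import path

# Approximate mode: partition boundaries are computed on a sample of
# points, then applied to the full dataset in a single pass.
add_kdtree_column(
    "roads.parquet",
    output_parquet="roads_kdtree.parquet",
    iterations=8,  # 2^8 = 256 partitions
)
```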

Spatial Operations

hilbert_order

hilbert_order(input_parquet, output_parquet=None, geometry_column='geometry', add_bbox_flag=False, verbose=False, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, profile=None, geoparquet_version=None)

Reorder a GeoParquet file using Hilbert curve ordering.

Supports Arrow IPC streaming:

  • Input "-" reads from stdin
  • Output "-" or None (with piped stdout) streams to stdout

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Path to input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| output_parquet | str \| None | Path to output file, "-" for stdout, or None for auto-detect | None |
| geometry_column | str | Name of geometry column (default: 'geometry') | 'geometry' |
| add_bbox_flag | bool | Add bbox column before sorting if not present | False |
| verbose | bool | Print verbose output | False |
| compression | str | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | 'ZSTD' |
| compression_level | int \| None | Compression level (varies by format) | None |
| row_group_size_mb | float \| None | Target row group size in MB | None |
| row_group_rows | int \| None | Exact number of rows per row group | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | None |
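
A usage sketch (placeholder import and file names):

```python
from gpio import hilbert_order  # placeholder import path

# Rewrite the file with rows sorted along a Hilbert curve; add a bbox
# column first if the input does not already have one.
hilbert_order(
    "parcels.parquet",
    output_parquet="parcels_sorted.parquet",
    add_bbox_flag=True,
)
```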

add_bbox_metadata

add_bbox_metadata(parquet_file, verbose=False)

Add bbox covering metadata to a GeoParquet file.

Updates the GeoParquet metadata to include bbox covering information, which enables spatial filtering optimizations in readers that support it.

Parameters:

| Name | Description | Default |
|---|---|---|
| parquet_file | Path to the parquet file (will be modified in place) | required |
| verbose | Print verbose output | False |
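
A usage sketch (placeholder import; the file is modified in place, so no output path is needed):

```python
from gpio import add_bbox_metadata  # placeholder import path

# Add bbox covering metadata directly to the existing file.
add_bbox_metadata("buildings_bbox.parquet", verbose=True)
```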

Partitioning

partition_by_string

partition_by_string(input_parquet, output_folder, column, chars=None, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, force=False, skip_analysis=False, filename_prefix=None, profile=None, geoparquet_version=None)

Partition a GeoParquet file by string column values or prefixes.

Supports Arrow IPC streaming for input: "-" reads from stdin (output is always a directory).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| output_folder | str | Output directory (always writes to a directory; no stdout support) | required |
| column | str | Column name to partition by | required |
| chars | int \| None | Optional number of characters to use (prefix length) | None |
| hive | bool | Use Hive-style partitioning | False |
| overwrite | bool | Overwrite existing files | False |
| preview | bool | Show preview of partitions without creating files | False |
| preview_limit | int | Maximum number of partitions to show in preview (default: 15) | 15 |
| verbose | bool | Verbose output | False |
| force | bool | Force partitioning even if analysis detects issues | False |
| skip_analysis | bool | Skip partition strategy analysis (for performance) | False |
| filename_prefix | str \| None | Optional prefix for partition filenames | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write | None |
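
A usage sketch; the import path, file names, and the country_code column are illustrative assumptions:

```python
from gpio import partition_by_string  # placeholder import path

# Partition by the first two characters of a (hypothetical) country_code
# column, using Hive-style directories (country_code=XX/).
partition_by_string(
    "places.parquet",
    "places_by_country/",
    column="country_code",
    chars=2,
    hive=True,
)
```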

partition_by_h3

partition_by_h3(input_parquet, output_folder, h3_column_name=DEFAULT_H3_COLUMN_NAME, resolution=9, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, keep_h3_column=None, force=False, skip_analysis=False, filename_prefix=None, profile=None, geoparquet_version=None)

Partition a GeoParquet file by H3 cells at specified resolution.

Supports Arrow IPC streaming for input: "-" reads from stdin (output is always a directory).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| output_folder | str | Output directory (always writes to a directory; no stdout support) | required |
| h3_column_name | str | Name of the H3 column (default: 'h3_cell') | DEFAULT_H3_COLUMN_NAME |
| resolution | int | H3 resolution level (0-15). Default: 9 | 9 |
| hive | bool | Use Hive-style partitioning (column=value directories) | False |
| overwrite | bool | Overwrite existing output directory | False |
| preview | bool | Preview partition distribution without writing | False |
| preview_limit | int | Max number of partitions to show in preview | 15 |
| verbose | bool | Print verbose output | False |
| keep_h3_column | bool \| None | Keep H3 column in output partitions | None |
| force | bool | Force operation even if analysis suggests issues | False |
| skip_analysis | bool | Skip partition analysis | False |
| filename_prefix | str \| None | Prefix for output filenames | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write | None |
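
A usage sketch (placeholder import and file names) that previews the partition layout before writing anything:

```python
from gpio import partition_by_h3  # placeholder import path

# Inspect how rows would be distributed across H3 cells at resolution 7.
partition_by_h3(
    "points_h3.parquet",
    "points_by_h3/",
    resolution=7,
    preview=True,
)
```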

partition_by_kdtree

partition_by_kdtree(input_parquet, output_folder, kdtree_column_name='kdtree_cell', iterations=None, hive=False, overwrite=False, preview=False, preview_limit=15, verbose=False, keep_kdtree_column=None, force=False, skip_analysis=False, sample_size=100000, auto_target_rows=None, filename_prefix=None, profile=None, geoparquet_version=None)

Partition a GeoParquet file by KD-tree cells.

Supports Arrow IPC streaming for input: "-" reads from stdin (output is always a directory).

If the KD-tree column doesn't exist, it will be automatically added before partitioning.

Performance Note: Approximate mode is O(n), exact mode is O(n × iterations).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_parquet | str | Input GeoParquet file (local, remote URL, or "-" for stdin) | required |
| output_folder | str | Output directory | required |
| kdtree_column_name | str | Name of KD-tree column (default: 'kdtree_cell') | 'kdtree_cell' |
| iterations | int \| None | Number of recursive splits (1-20, default: 9) | None |
| hive | bool | Use Hive-style partitioning | False |
| overwrite | bool | Overwrite existing files | False |
| preview | bool | Show preview of partitions without creating files | False |
| preview_limit | int | Maximum number of partitions to show in preview (default: 15) | 15 |
| verbose | bool | Verbose output | False |
| keep_kdtree_column | bool \| None | Whether to keep KD-tree column in output files | None |
| force | bool | Force partitioning even if analysis detects issues | False |
| skip_analysis | bool | Skip partition strategy analysis (for performance) | False |
| sample_size | int | Number of points to sample for computing boundaries | 100000 |
| auto_target_rows | int \| None | If set, auto-compute iterations to target this many rows per partition | None |
| filename_prefix | str \| None | Prefix for output filenames | None |
| profile | str \| None | AWS profile name (S3 only, optional) | None |
| geoparquet_version | str \| None | GeoParquet version to write | None |
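
A usage sketch (placeholder import and file names) that lets the function pick the iteration count from a target partition size:

```python
from gpio import partition_by_kdtree  # placeholder import path

# The KD-tree column is added automatically if it is missing; target
# roughly 500k rows per output partition instead of fixing iterations.
partition_by_kdtree(
    "roads.parquet",
    "roads_by_kdtree/",
    auto_target_rows=500_000,
)
```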

Validation

check_all

check_all(parquet_file, verbose=False, return_results=False, quiet=False)

Run all structure checks.

Parameters:

| Name | Description | Default |
|---|---|---|
| parquet_file | Path to parquet file | required |
| verbose | Print additional information | False |
| return_results | If True, return aggregated results dict | False |
| quiet | If True, suppress all output (for multi-file batch mode) | False |

Returns:

| Type | Description |
|---|---|
| dict | Aggregated results from all checks, returned only if return_results=True |
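
A usage sketch (placeholder import and file name):

```python
from gpio import check_all  # placeholder import path

# Run every structure check and capture the aggregated results.
results = check_all("buildings.parquet", return_results=True, quiet=True)
print(results)
```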

check_spatial_order

check_spatial_order(parquet_file, random_sample_size, limit_rows, verbose, return_results=False, quiet=False)

Check if a GeoParquet file is spatially ordered.

Parameters:

| Name | Description | Default |
|---|---|---|
| parquet_file | Path to parquet file | required |
| random_sample_size | Number of rows in each random sample | required |
| limit_rows | Max number of rows to analyze | required |
| verbose | Print additional information | required |
| return_results | If True, return structured results dict | False |
| quiet | If True, suppress all output (for multi-file batch mode) | False |

Returns:

| Type | Description |
|---|---|
| float \| dict | The spatial-order ratio (float) if return_results=False, or a structured results dict if return_results=True |
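
A usage sketch; the import path, file name, and sampling values are illustrative:

```python
from gpio import check_spatial_order  # placeholder import path

# Estimate how spatially ordered the file is.
ratio = check_spatial_order(
    "parcels_sorted.parquet",
    random_sample_size=100,
    limit_rows=100_000,
    verbose=False,
)
print(f"spatial order ratio: {ratio}")
```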

Common Utilities

get_dataset_bounds

get_dataset_bounds(parquet_file, geometry_column=None, verbose=False)

Calculate the bounding box of the entire dataset.

Uses the bbox column if available for fast calculation; otherwise calculates from the geometry column (slower).

Parameters:

| Name | Description | Default |
|---|---|---|
| parquet_file | Path to the parquet file | required |
| geometry_column | Geometry column name (if None, will auto-detect) | None |
| verbose | Whether to print verbose output | False |

Returns:

| Type | Description |
|---|---|
| tuple | (xmin, ymin, xmax, ymax), or None if an error occurs |
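
A usage sketch (placeholder import and file name):

```python
from gpio import get_dataset_bounds  # placeholder import path

# Compute the dataset extent; None indicates an error.
bounds = get_dataset_bounds("buildings.parquet")
if bounds is not None:
    xmin, ymin, xmax, ymax = bounds
    print(xmin, ymin, xmax, ymax)
```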

find_primary_geometry_column

find_primary_geometry_column(parquet_file, verbose=False)

Find the primary geometry column from GeoParquet metadata.

Looks up the geometry column name from GeoParquet metadata. Falls back to 'geometry' if no metadata is present or if the primary column is not specified.

Parameters:

| Name | Description | Default |
|---|---|---|
| parquet_file | Path to the parquet file (local or remote URL) | required |
| verbose | Print verbose output | False |

Returns:

| Type | Description |
|---|---|
| str | Name of the primary geometry column (defaults to 'geometry') |
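
A usage sketch (placeholder import and file name):

```python
from gpio import find_primary_geometry_column  # placeholder import path

# Look up the geometry column name from the GeoParquet metadata.
geom_col = find_primary_geometry_column("buildings.parquet")
print(geom_col)  # 'geometry' unless the metadata names a different column
```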

write_parquet_with_metadata

write_parquet_with_metadata(con, query, output_file, original_metadata=None, compression='ZSTD', compression_level=15, row_group_size_mb=None, row_group_rows=None, custom_metadata=None, verbose=False, show_sql=False, profile=None, geoparquet_version=None, input_crs=None)

Write a parquet file with proper compression and metadata handling.

Uses Arrow as the internal transfer format for efficiency: fetches DuckDB query results directly as an Arrow table, applies metadata in memory, and writes once to disk (no intermediate file rewrites).

Supports both local and remote outputs (S3, GCS, Azure). Remote outputs are written to a temporary local file, then uploaded.

Parameters:

| Name | Description | Default |
|---|---|---|
| con | DuckDB connection | required |
| query | SQL query to execute | required |
| output_file | Path to output file (local path or remote URL) | required |
| original_metadata | Original metadata from source file | None |
| compression | Compression type (ZSTD, GZIP, BROTLI, LZ4, SNAPPY, UNCOMPRESSED) | 'ZSTD' |
| compression_level | Compression level (varies by format) | 15 |
| row_group_size_mb | Target row group size in MB | None |
| row_group_rows | Exact number of rows per row group | None |
| custom_metadata | Optional dict with custom metadata (e.g., H3 info) | None |
| verbose | Whether to print verbose output | False |
| show_sql | Whether to print SQL statements before execution | False |
| profile | AWS profile name (S3 only, optional) | None |
| geoparquet_version | GeoParquet version to write (1.0, 1.1, 2.0, parquet-geo-only) | None |
| input_crs | PROJJSON dict with CRS from input file | None |

Returns: None.
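
A usage sketch; the import path, file names, and query are illustrative, and the example assumes the duckdb Python package is installed:

```python
import duckdb

from gpio import write_parquet_with_metadata  # placeholder import path

con = duckdb.connect()

# Write a query result to a new Parquet file with ZSTD compression;
# original_metadata is omitted here, so no source metadata is carried over.
write_parquet_with_metadata(
    con,
    "SELECT * FROM read_parquet('buildings.parquet') LIMIT 1000",
    "subset.parquet",
    compression="ZSTD",
    compression_level=15,
    verbose=True,
)
```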