aggregate

Functions to aggregate per-process-unit intermediate files into final save-level output files.

Functions

`join_nonnull_strings`(x)	Join non-null values of x as strings with ' + '; None when all null.
`aggregate_rows`(→ pandas.DataFrame | None)	Aggregate rows of df using per-column functions from the attribute registry.
`aggregate_files`(recipe, admin_level[, output_dir, ...])	Aggregate per-process-unit intermediate files into save-level files.
`aggregate_to_admin_level`(recipe[, ...])	Aggregate per-process-unit intermediate files into save-level files.
`read_partition_coverage`(→ set[str])	Return the partition ids recorded in an aggregated parquet's footer.
`read_file_metadata`(→ dict[str, str])	Return a parquet file's footer key-value metadata as a string dict.
`aggregate_partitions`(recipe[, by, single_file, how, ...])	Roll up per-partition output files into a coarser-grained file.

Module Contents

openplaces.io.aggregate.join_nonnull_strings(x)

Join non-null values of x as strings with ' + '; return None when all values are null.
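
The behaviour described above can be illustrated with a minimal standalone sketch. This is not the library's implementation: the name `join_nonnull_strings_sketch` is hypothetical, and treating None and float NaN as "null" is an assumption based on the usual pandas convention.

```python
import math


def join_nonnull_strings_sketch(values):
    """Join non-null values as strings with ' + '; return None when all are null."""

    def is_null(value):
        # Assumption: "null" means None or a float NaN.
        return value is None or (isinstance(value, float) and math.isnan(value))

    parts = [str(value) for value in values if not is_null(value)]
    return " + ".join(parts) if parts else None


print(join_nonnull_strings_sketch(["a", None, "b"]))    # a + b
print(join_nonnull_strings_sketch([None, float("nan")]))  # None
```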

openplaces.io.aggregate.aggregate_rows(df: pandas.DataFrame, by: str | list[str], aggregation_function=None, sort_by: str | None = None, list_columns: list[str] | None = None) → pandas.DataFrame | None

Aggregate rows of df using per-column functions from the attribute registry.

Parameters:

df (pd.DataFrame) – Input DataFrame. Columns that appear in the attribute registry with a non-null aggregation function are included in the result.
by (str or list of str) – Column(s) to group by.
aggregation_function (None, callable, or dict, optional) – Controls which aggregation function is applied to each column:

None – Use the function recorded in the attribute registry for each column.
callable – Apply this single function to all aggregatable columns.
dict – Map column names to callables; columns absent from the dict fall back to the registry default.

sort_by (str, optional) – Column by which df is sorted in descending order before grouping. When sort_by is omitted or the column is absent, and df is a GeoDataFrame, rows are sorted by geometry area in descending order instead.
list_columns (list of str, optional) – Column names for which an additional {col}_list column is added to the output, collecting all values per group into a Python list. The normal scalar aggregation for each column is still applied; these are extra columns alongside the registry-aggregated ones.

Returns:

Aggregated DataFrame with by as the index, or None when no aggregatable columns are found in df.

Return type:

pd.DataFrame or None

Raises:

ValueError – When aggregation_function is not None, a callable, or a dict.
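
The three accepted forms of aggregation_function can be pictured as a small dispatch step that resolves one callable per column. The sketch below is illustrative only: `resolve_aggregators` and the dict-based `registry` are hypothetical stand-ins for the attribute registry, whose real structure is not described here.

```python
def resolve_aggregators(columns, registry, aggregation_function=None):
    """Return {column: callable} for every aggregatable column.

    `registry` is a hypothetical stand-in for the attribute registry:
    it maps a column name to its default callable, or to None when the
    column should not be aggregated.
    """
    # Only columns with a non-null registry function take part.
    aggregatable = [col for col in columns if registry.get(col) is not None]

    if aggregation_function is None:
        # Use the registry default for each column.
        return {col: registry[col] for col in aggregatable}
    if callable(aggregation_function):
        # One function applied to all aggregatable columns.
        return {col: aggregation_function for col in aggregatable}
    if isinstance(aggregation_function, dict):
        # Per-column overrides; missing columns fall back to the registry.
        return {col: aggregation_function.get(col, registry[col]) for col in aggregatable}
    raise ValueError("aggregation_function must be None, a callable, or a dict")


registry = {"height": max, "name": min, "geometry": None}
resolved = resolve_aggregators(["height", "name", "geometry"], registry, {"height": sum})
print(sorted(resolved))  # ['height', 'name']
```

In this sketch `geometry` drops out because its registry entry is None, which mirrors the rule that only columns with a non-null aggregation function appear in the result.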

openplaces.io.aggregate.aggregate_files(recipe, admin_level, output_dir=None, admin_ids_to_save=None, admin_ids_to_aggregate=None, keep_original=False, combined=False, verbose=False)

Aggregate per-process-unit intermediate files into save-level files.

Used when the desired output level is coarser than the level at which files were written (process_by.admin_level). Reads the intermediate parquet files, concatenates them into one file per save-level unit, and deletes the originals (unless keep_original is True).

Parameters:

recipe (str or dict) – Recipe ID string (e.g. 'US_footprint-spine-2026') or a pre-loaded recipe dict.
admin_level (int) – Target admin level for output files (e.g. 2 for state-level). Required explicitly because the recipe’s own save_to.admin_level may differ from the intended aggregation target.
output_dir (str, optional) – Directory for the aggregated output files (e.g. 'share'). Does not affect where intermediate input files are looked up — those are always resolved from the recipe’s original save_to.data_dir. Uses the recipe default if omitted.
admin_ids_to_save (str, AdminId, or list, optional) – Save-level admin ID(s) for which to write output files. Accepts a single value or a list. Defaults to all admin IDs at admin_level that are children of recipe['admin_id'].
admin_ids_to_aggregate (str, AdminId, or list, optional) – Process-level admin ID(s) whose intermediate files should be included as input. Accepts a single value or a list. Defaults to all process-level children of each admin_ids_to_save entry.
keep_original (bool) – If True, do not delete the intermediate files after aggregation.
combined (bool) – If True, write the aggregated output as a single geoparquet file (attributes and geometry together) rather than the default split layout of an attribute table plus a _geo sidecar. Passed through to save_parquet().
verbose (bool) – If True, print a summary line for each aggregated file.

openplaces.io.aggregate.aggregate_to_admin_level(recipe, admin_ids_to_process=None, keep_intermediates=False, combined=False, verbose=False)

Aggregate per-process-unit intermediate files into save-level files.

Wrapper around aggregate_files() that reads the save level from the recipe’s save_to.admin_level field.

Parameters:

recipe (dict) – Loaded recipe dictionary. Must have an explicit save_to: admin_level that is coarser than the process_by / download_by level.
admin_ids_to_process (list of str, optional) – Admin IDs at the process level whose intermediate files should be read. Defaults to all IDs at the process level under the recipe’s save-level admin IDs.
keep_intermediates (bool) – If True, do not delete the intermediate files after aggregation.
combined (bool) – If True, write the aggregated output as a single geoparquet file. Passed through to aggregate_files().
verbose (bool) – If True, print a summary line for each aggregated file.

openplaces.io.aggregate.read_partition_coverage(path) → set[str]

Return the partition ids recorded in an aggregated parquet’s footer.

Reads the openplaces:partitions key from the Parquet file-level (footer) metadata written by aggregate_partitions() — no rows are scanned. Returns an empty set when the file is missing or carries no such key (e.g. files written before this metadata was introduced).

Parameters:

path (str or pathlib.Path) – Aggregated parquet path.

Returns:

Partition ids (e.g. year-months) contained in the file.

Return type:

set of str
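
One way a caller might use the returned set is to filter out partitions that are already covered before a re-run. The helper below is a hypothetical sketch, not part of the module, and the 'YYYY-MM' id format is an assumption based on the "year-months" example above.

```python
def pending_partitions(requested_ids, covered_ids):
    """Return the requested partition ids that are not yet covered, in request order."""
    covered = set(covered_ids)
    return [pid for pid in requested_ids if pid not in covered]


# In real use, `covered` would come from read_partition_coverage(path);
# a literal set is used here so the sketch runs on its own.
covered = {"2021-01", "2021-02"}
print(pending_partitions(["2021-01", "2021-02", "2021-03"], covered))  # ['2021-03']
```

Because a missing file or missing footer key yields an empty set, every requested partition is treated as pending in that case.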

openplaces.io.aggregate.read_file_metadata(path) → dict[str, str]

Return a parquet file’s footer key-value metadata as a string dict.

Reads only the footer (no rows are scanned). Returns an empty dict when the file is missing or carries no key-value metadata. Pyarrow-internal keys (e.g. the pandas schema) are included as-is.

Parameters:

path (str or pathlib.Path) – Parquet file path.
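
A footer-only read of this kind can be sketched with pyarrow's `pyarrow.parquet.read_metadata`, which parses the footer without loading row data. This is an approximation under stated assumptions, not the module's code: the function name is hypothetical, and decoding keys and values as UTF-8 is an assumption.

```python
from pathlib import Path


def read_footer_metadata_sketch(path):
    """Return a parquet footer's key-value metadata as a str -> str dict."""
    path = Path(path)
    if not path.is_file():
        # Missing file: empty dict, matching the documented behaviour.
        return {}

    # Imported lazily so the missing-file branch does not require pyarrow.
    import pyarrow.parquet as pq

    # FileMetaData.metadata is a dict of bytes -> bytes, or None when absent.
    raw = pq.read_metadata(str(path)).metadata
    if not raw:
        return {}
    # Assumption: keys and values are UTF-8 encoded text.
    return {
        key.decode("utf-8", errors="replace"): value.decode("utf-8", errors="replace")
        for key, value in raw.items()
    }


print(read_footer_metadata_sketch("does_not_exist.parquet"))  # {}
```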

openplaces.io.aggregate.aggregate_partitions(recipe, by='year', single_file=False, how='union', admin_ids=None, partition_ids=None, keep_original=False, combined=False, verbose=False)

Roll up per-partition output files into a coarser-grained file.

For a recipe partitioned by year_month: when single_file is True, the monthly output files are concatenated into one dataset-wide ..._all.parquet (the non-redundant option); otherwise one file is written per group (for example, by='year' produces ..._2021.parquet). The aggregated file’s footer records the partition ids it contains (readable via read_partition_coverage()), so re-runs can skip already-ingested partitions without keeping the per-partition files. Reuses _aggregate_to_file().
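
The grouping step can be pictured as mapping each partition id to an output suffix. The sketch below is a hypothetical illustration: `rollup_groups` is not a module function, and it assumes year-month ids of the form 'YYYY-MM', which this page does not specify.

```python
from collections import defaultdict


def rollup_groups(partition_ids, single_file=False):
    """Map an output-file suffix to the sorted partition ids rolled into it."""
    ordered = sorted(partition_ids)
    if single_file:
        # One dataset-wide file, named with the 'all' suffix.
        return {"all": ordered}

    groups = defaultdict(list)
    for pid in ordered:
        # Assumption: ids look like 'YYYY-MM', so the year precedes the first '-'.
        year = pid.split("-", 1)[0]
        groups[year].append(pid)
    return dict(groups)


print(rollup_groups(["2021-02", "2020-12", "2021-01"]))
# {'2020': ['2020-12'], '2021': ['2021-01', '2021-02']}
print(rollup_groups(["2021-02", "2021-01"], single_file=True))
# {'all': ['2021-01', '2021-02']}
```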

Parameters:

recipe (str or dict) – Recipe ID string or loaded recipe dict.
by (str) – Roll-up granularity for the partition dimension when single_file is False. Currently 'year'.
single_file (bool) – If True, write only one file combining every partition, named with the 'all' partition suffix (not the bare path, which the Ingester reserves for its “already-ingested” check), and no per-group files.
how ({'union', 'replace'}) – Merge policy when the aggregated file already exists (named after geopandas.overlay). 'union' (default) integrates the new partitions, de-duplicating by full row against the existing file. 'replace' overwrites the file with only the partitions in this run (use with a full-range reprocess, or it narrows the file to those partitions). Both raise if the new batch has internal duplicate rows.
admin_ids (str, AdminId, or list, optional) – Save-level admin ID(s) to aggregate. Defaults to all save-level IDs under recipe['admin_id'].
partition_ids (list, optional) – Partition IDs to consider. Defaults to all of the recipe’s partitions; only those whose per-partition output exists are aggregated. For recipes without a declared partition range (e.g. scraped checkpoint partitions), the default falls back to the partition files found next to each admin unit’s output path.
keep_original (bool) – If False (default), delete the per-partition (monthly) output parquets after the roll-up. The downloaded source files are untouched, so a re-run rebuilds them. Deletion happens last, after the file is written.
combined (bool) – Passed to save_parquet (single geoparquet vs. split layout).
verbose (bool) – Print a summary line per written file.
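
The two merge policies of how can be sketched on plain Python rows. This is a simplified model under stated assumptions rather than the module's code: rows are represented as hashable tuples, `merge_rows` is a hypothetical name, and the real function operates on parquet files.

```python
def merge_rows(existing, new, how="union"):
    """Merge a new batch of rows into existing rows under the 'union' or 'replace' policy."""
    # Both policies reject a new batch that contains internal duplicate rows.
    if len(set(new)) != len(new):
        raise ValueError("new batch contains duplicate rows")

    if how == "replace":
        # Keep only the rows from this run.
        return list(new)
    if how == "union":
        # Add new rows, skipping any that match an existing row in full.
        seen = set(existing)
        return list(existing) + [row for row in new if row not in seen]
    raise ValueError("how must be 'union' or 'replace'")


existing = [("a", 1), ("b", 2)]
new = [("b", 2), ("c", 3)]
print(merge_rows(existing, new, how="union"))    # [('a', 1), ('b', 2), ('c', 3)]
print(merge_rows(existing, new, how="replace"))  # [('b', 2), ('c', 3)]
```

The 'replace' result shows why that policy narrows the file when the run covers only part of the partition range.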