aggregate

Functions to aggregate per-process-unit intermediate files into final save-level output files.

Functions

join_nonnull_strings(x)

Join non-null values of x as strings with ' + '; None when all null.

aggregate_rows(df, by[, aggregation_function, ...]) → pandas.DataFrame | None

Aggregate rows of df using per-column functions from the attribute registry.

aggregate_files(recipe, admin_level[, output_dir, ...])

Aggregate per-process-unit intermediate files into save-level files.

aggregate_to_admin_level(recipe[, ...])

Aggregate per-process-unit intermediate files into save-level files.

Module Contents

openplaces.io.aggregate.join_nonnull_strings(x)

Join non-null values of x as strings with ' + '; None when all null.
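The behavior described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the name join_nonnull_strings_sketch is hypothetical, and treating only None and float NaN as null is an assumption.

```python
import math


def join_nonnull_strings_sketch(x):
    """Join the non-null items of an iterable with ' + '; None if all are null."""
    def is_null(value):
        # Assumption: "null" means None or a float NaN.
        return value is None or (isinstance(value, float) and math.isnan(value))

    parts = [str(value) for value in x if not is_null(value)]
    return " + ".join(parts) if parts else None


joined = join_nonnull_strings_sketch(["osm", None, "survey"])
all_null = join_nonnull_strings_sketch([None, float("nan")])
```

Here joined is the string 'osm + survey', and all_null is None because every input value is null.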

openplaces.io.aggregate.aggregate_rows(df: pandas.DataFrame, by: str | list[str], aggregation_function=None, sort_by: str | None = None, list_columns: list[str] | None = None) → pandas.DataFrame | None

Aggregate rows of df using per-column functions from the attribute registry.

Parameters:
  • df (pd.DataFrame) – Input DataFrame. Columns that appear in the attribute registry with a non-null aggregation function are included in the result.

  • by (str or list of str) – Column(s) to group by.

  • aggregation_function (None, callable, or dict, optional) –

    Controls which aggregation function is applied to each column.

    None

    Use the function recorded in the attribute registry for each column.

    callable

    Apply this single function to all aggregatable columns.

    dict

    Map column names to callables; columns absent from the dict fall back to the registry default.

  • sort_by (str, optional) – Column by which df is sorted in descending order before grouping. If sort_by is omitted or names a column that is absent from df, and df is a GeoDataFrame, rows are sorted by geometry area in descending order instead.

  • list_columns (list of str, optional) – Column names for which an additional {col}_list column is added to the output, collecting all values per group into a Python list. The normal scalar aggregation for each column is still applied; these are extra columns alongside the registry-aggregated ones.
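The three accepted forms of aggregation_function can be read as a small dispatch over the aggregatable columns. The sketch below is one possible reading, not the library's code: resolve_aggregators and the registry dict are hypothetical stand-ins for the attribute registry, and ignoring dict keys that are not aggregatable columns is an assumption. The ValueError for any other type follows the documented behavior.

```python
def resolve_aggregators(columns, registry, aggregation_function=None):
    """Return {column: callable} for the aggregatable columns.

    `registry` is a hypothetical stand-in for the attribute registry: it maps
    a column name to its default callable, or to None when the column should
    not be aggregated.
    """
    # Only columns with a non-null registry function are aggregatable.
    defaults = {
        col: registry[col] for col in columns if registry.get(col) is not None
    }
    if aggregation_function is None:
        return defaults
    if callable(aggregation_function):
        # One function for every aggregatable column.
        return {col: aggregation_function for col in defaults}
    if isinstance(aggregation_function, dict):
        # Per-column override; missing columns fall back to the registry.
        return {
            col: aggregation_function.get(col, default)
            for col, default in defaults.items()
        }
    raise ValueError("aggregation_function must be None, a callable, or a dict")


registry = {"population": sum, "height": max, "geometry": None}
columns = ["population", "height", "geometry", "notes"]

from_registry = resolve_aggregators(columns, registry)
single = resolve_aggregators(columns, registry, min)
mixed = resolve_aggregators(columns, registry, {"height": min})
```

In this sketch, "geometry" (registry value None) and "notes" (not in the registry) never appear in the result, whichever form is used.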

Returns:

Aggregated DataFrame with by as the index, or None when no aggregatable columns are found in df.

Return type:

pd.DataFrame or None

Raises:

ValueError – When aggregation_function is not None, a callable, or a dict.
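To make the shape of the result concrete, the following sketch reproduces the documented behavior with plain pandas rather than by calling aggregate_rows itself. The column names, the choice of max for "height", and the join_nonnull helper are illustrative assumptions; the real function takes its per-column functions from the attribute registry.

```python
import pandas as pd


def join_nonnull(series):
    # Illustrative stand-in for a registry function on a string column.
    values = [str(v) for v in series.dropna()]
    return " + ".join(values) if values else None


df = pd.DataFrame(
    {
        "region": ["A", "A", "B"],
        "height": [3.0, 9.0, 5.0],
        "source": ["osm", "msft", None],
    }
)

# Equivalent of sort_by="height": sort descending before grouping.
df_sorted = df.sort_values("height", ascending=False)
grouped = df_sorted.groupby("region")

# One scalar aggregation per column; the group key becomes the index.
result = grouped.agg(
    height=("height", "max"),
    source=("source", join_nonnull),
)

# Equivalent of list_columns=["source"]: an extra {col}_list column that
# collects every value of the group alongside the scalar aggregation.
result["source_list"] = grouped["source"].agg(list)
```

Because the rows are sorted before grouping, the values inside each list follow the descending sort order, so group "A" lists "msft" before "osm".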

openplaces.io.aggregate.aggregate_files(recipe, admin_level, output_dir=None, admin_ids_to_save=None, admin_ids_to_aggregate=None, keep_original=False, combined=False, verbose=False)

Aggregate per-process-unit intermediate files into save-level files.

Used when the desired output level is coarser than the level at which files were written (process_by.admin_level). Reads the intermediate parquet files, concatenates them into one file per save-level unit, and deletes the originals (unless keep_original is True).

Parameters:
  • recipe (str or dict) – Recipe ID string (e.g. 'US_footprint-cheer-2026') or a pre-loaded recipe dict.

  • admin_level (int) – Target admin level for output files (e.g. 2 for state-level). Required explicitly because the recipe’s own save_to.admin_level may differ from the intended aggregation target.

  • output_dir (str, optional) – Directory for the aggregated output files (e.g. 'share'). Uses the recipe default if omitted. This setting does not affect where intermediate input files are looked up; those are always resolved from the recipe’s original save_to.data_dir.

  • admin_ids_to_save (str, AdminId, or list, optional) – Save-level admin ID(s) for which to write output files. Accepts a single value or a list. Defaults to all admin IDs at admin_level that are children of recipe['admin_id'].

  • admin_ids_to_aggregate (str, AdminId, or list, optional) – Process-level admin ID(s) whose intermediate files should be included as input. Accepts a single value or a list. Defaults to all process-level children of each admin_ids_to_save entry.

  • keep_original (bool) – If True, do not delete the intermediate files after aggregation.

  • combined (bool) – If True, write the aggregated output as a single geoparquet file (attributes and geometry together) rather than the default split layout of an attribute table plus a _geo sidecar. Passed through to save_parquet().

  • verbose (bool) – If True, print a summary line for each aggregated file.

openplaces.io.aggregate.aggregate_to_admin_level(recipe, admin_ids_to_process=None, keep_intermediates=False, combined=False, verbose=False)

Aggregate per-process-unit intermediate files into save-level files.

Wrapper around aggregate_files() that reads the save level from the recipe’s save_to.admin_level field.

Parameters:
  • recipe (dict) – Loaded recipe dictionary. Must have an explicit save_to.admin_level that is coarser than the process_by / download_by level.

  • admin_ids_to_process (list of str, optional) – Admin IDs at the process level whose intermediate files should be read. Defaults to all IDs at the process level under the recipe’s save-level admin IDs.

  • keep_intermediates (bool) – If True, do not delete the intermediate files after aggregation.

  • combined (bool) – If True, write the aggregated output as a single geoparquet file. Passed through to aggregate_files().

  • verbose (bool) – If True, print a summary line for each aggregated file.
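The precondition on recipe can be checked before calling the wrapper. The sketch below is an illustration under stated assumptions: check_save_level is a hypothetical helper, the recipe is assumed to be a nested dict with save_to and process_by sections, only the process_by level is compared (not download_by), and a smaller admin_level number is assumed to mean a coarser unit.

```python
def check_save_level(recipe):
    """Return the save level if it is explicit and coarser than the process level."""
    save_level = recipe.get("save_to", {}).get("admin_level")
    if save_level is None:
        raise ValueError("recipe needs an explicit save_to.admin_level")

    process_level = recipe["process_by"]["admin_level"]
    # Assumption: smaller numbers denote coarser administrative units.
    if save_level >= process_level:
        raise ValueError(
            "save_to.admin_level must be coarser than process_by.admin_level"
        )
    return save_level


# Hypothetical recipe fragment: process at level 3, save at level 2.
recipe = {
    "save_to": {"admin_level": 2},
    "process_by": {"admin_level": 3},
}
save_level = check_save_level(recipe)
```

A recipe whose save and process levels are equal would need no aggregation, which is why this sketch rejects it along with the case where the save level is finer.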