transform

Transformation engine for applying variable transformations to DataFrames.

This module provides a flexible system for transforming variables based on YAML recipe specifications. It supports:
  • Unary operations (log, arcsinh, power, etc.)

  • Binary operations (arithmetic on two columns)

  • Aggregate operations (sum, min, max across multiple columns)

  • String remapping and reclassification

  • Conditional transformations

  • Date/time extractions

  • Complex expressions

  • Pattern-based transformations
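As a rough illustration of the first three operation families, the sketch below writes them as plain pandas expressions. It does not use this module or its recipe format; the column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up example data; the column names are illustrative only.
df = pd.DataFrame(
    {"income": [0.0, 10.0, 100.0], "pop": [1, 2, 4], "area": [2.0, 4.0, 8.0]}
)

# Unary: one input column, one output column.
df["income_asinh"] = np.arcsinh(df["income"])

# Binary: arithmetic on two columns.
df["density"] = df["pop"] / df["area"]

# Aggregate: reduce across several columns, row by row.
df["largest"] = df[["pop", "area"]].max(axis=1)
```

A recipe presumably describes steps like these declaratively, so the same transformations can be reapplied to new data without editing code.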

Functions

apply_transformations(...)

Apply transformations from recipe to dataframe.

apply_transformation(...)

Apply a single transformation based on configuration.

apply_transformation_pattern(...)

Apply pattern-based transformation to multiple columns.

get_crosswalk(crosswalk_dict[, flip])

Get a crosswalk (Series of default keys -> source keys).

remap(df, recipe_id)

Remap values in a dataframe column using a recipe table.

add_unique_suffix(s)

Make string Series unique by appending unique integer suffixes.

make_index_unique(...) → pandas.DataFrame

Return a copy of a DataFrame / GeoDataFrame with a unique string index.

Module Contents

openplaces.io.transform.apply_transformations(df: pandas.DataFrame | geopandas.GeoDataFrame, recipe: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply transformations from recipe to dataframe.

Parameters:
  • df (DataFrame or GeoDataFrame) – Input data to transform

  • recipe (dict) – Recipe dictionary containing ‘transformations’ and optionally ‘transformation_patterns’ keys

  • silent (bool, default False) – If True, suppress warnings

Returns:

Transformed dataframe with new columns added

Return type:

DataFrame or GeoDataFrame
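The following toy dispatcher sketches how a recipe-driven transformer of this kind could work. It is not the openplaces implementation. Only the ‘transformations’ key comes from the description above; the step keys `name`, `op` and `source`, the `UNARY_OPS` table, and the function name `apply_recipe_sketch` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical operation table; the real module's operation names may differ.
UNARY_OPS = {"log": np.log, "arcsinh": np.arcsinh}


def apply_recipe_sketch(df, recipe):
    """Toy stand-in for a recipe-driven transformer (not the openplaces code)."""
    out = df.copy()  # leave the caller's frame untouched
    for step in recipe.get("transformations", []):
        # "name", "op" and "source" are illustrative keys, not a documented schema.
        func = UNARY_OPS[step["op"]]
        out[step["name"]] = func(out[step["source"]])
    return out


frame = pd.DataFrame({"gdp": [1.0, 10.0, 100.0]})
recipe = {"transformations": [{"name": "gdp_log", "op": "log", "source": "gdp"}]}
result = apply_recipe_sketch(frame, recipe)
```

Copying the frame before adding columns is a choice made in this sketch. Whether the real function copies or modifies its input is not stated here.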

openplaces.io.transform.apply_transformation(df: pandas.DataFrame | geopandas.GeoDataFrame, config: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply a single transformation based on configuration.

openplaces.io.transform.apply_transformation_pattern(df: pandas.DataFrame | geopandas.GeoDataFrame, config: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply pattern-based transformation to multiple columns.

openplaces.io.transform.get_crosswalk(crosswalk_dict, flip=False)

Get a crosswalk (Series of default keys -> source keys).

Parameters:
  • crosswalk_dict (dict) – Dictionary with crosswalk arguments

  • flip (bool) – Flips keys (index) and value column (usually for joining)
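To make the shape of a crosswalk concrete, the sketch below builds one by hand as a pandas Series and then flips it. It does not call get_crosswalk, and the contents of crosswalk_dict are not documented here, so the mapping, the Series names, and the rename step are all assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative mapping of default keys to source keys; the values are made up.
mapping = {"country": "ADM0_NAME", "region": "ADM1_NAME"}

crosswalk = pd.Series(mapping, name="source_key")
crosswalk.index.name = "default_key"

# "Flipped" form: source keys become the index, default keys the values.
flipped = pd.Series(
    list(crosswalk.index), index=list(crosswalk.values), name="default_key"
)

# One possible use of the flipped form: rename source columns to default names.
raw = pd.DataFrame({"ADM0_NAME": ["Kenya"], "ADM1_NAME": ["Nairobi"]})
renamed = raw.rename(columns=flipped.to_dict())
```

With the source keys as the index, the flipped Series can be looked up or joined directly against data keyed by the source's own names.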

openplaces.io.transform.remap(df, recipe_id)

Remap values in a dataframe column using a recipe table.

Parameters:
  • df (DataFrame or GeoDataFrame) – Data

  • recipe_id (str) – ID of recipe table that contains the remapping
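The general idea of table-based remapping can be sketched in plain pandas. This does not call remap or load a recipe table. The dictionary stands in for such a table, and keeping unmapped values unchanged is an assumption of the sketch, not documented behaviour.

```python
import pandas as pd

df = pd.DataFrame({"landuse": ["res", "com", "ind", "res"]})

# Stand-in for a recipe table: old value -> new value (made-up data).
table = {"res": "residential", "com": "commercial"}

# map() yields NaN for values missing from the table, so fall back to the original.
df["landuse"] = df["landuse"].map(table).fillna(df["landuse"])
```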

openplaces.io.transform.add_unique_suffix(s)

Make string Series unique by appending unique integer suffixes.

All duplicate occurrences are suffixed (-1, -2, …), including the first one. Use make_index_unique when operating on a DataFrame index and the first (or largest) occurrence should keep the unsuffixed value.

Parameters:

s (pd.Series) – String Series containing duplicate entries
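The behaviour described above, where every member of a duplicated group receives a suffix, can be approximated as follows. This is a minimal sketch and not the module's code. The name `suffix_all_duplicates` and its `separator` argument are hypothetical, and the sketch assumes the Series contains no missing values.

```python
import pandas as pd


def suffix_all_duplicates(s, separator="-"):
    """Sketch: suffix every member of a duplicated group, first one included."""
    is_dup = s.duplicated(keep=False)      # True for all members of a duplicate group
    counter = s.groupby(s).cumcount() + 1  # 1, 2, ... within each group of equal values
    out = s.copy()
    out[is_dup] = s[is_dup] + separator + counter[is_dup].astype(str)
    return out


names = pd.Series(["Springfield", "Salem", "Springfield", "Springfield"])
unique_names = suffix_all_duplicates(names)
# unique_names: ["Springfield-1", "Salem", "Springfield-2", "Springfield-3"]
```

Values that occur only once, such as "Salem", are left untouched.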

openplaces.io.transform.make_index_unique(df: pandas.DataFrame, sort_by: str | None = None, ascending: bool = False, separator: str = '-', *, sort_duplicates_by_area: bool = False, area_crs: str = 'EPSG:6933') → pandas.DataFrame

Return a copy of a DataFrame / GeoDataFrame with a unique string index.

Duplicate index values are resolved so that the first occurrence keeps the original index value and later duplicates receive suffixes -1, -2, … Sorting controls which occurrence counts as “first”.

Unlike add_unique_suffix, which operates on a Series and suffixes every duplicate (including the first), this function preserves the unsuffixed value for the winning row.

Parameters:
  • df (pd.DataFrame or gpd.GeoDataFrame) – Input frame whose index will be made unique.

  • sort_by (str, optional) – Column to sort the entire frame by before resolving duplicates.

  • ascending (bool) – Sort direction. Default False so larger values sort first.

  • separator (str) – String inserted between the original index value and the counter.

  • sort_duplicates_by_area (bool) – If True, and df is a GeoDataFrame, compute equal-area geometry area for rows with duplicated index values and sort within each group so the largest polygon keeps the unsuffixed index.

  • area_crs (str) – Equal-area CRS used for area calculation. Default: EPSG:6933.
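The "first occurrence wins" rule can be sketched for an ordinary DataFrame as below. This is an approximation under stated assumptions, not the module's implementation. The function name `make_index_unique_sketch` is hypothetical, the area-based options are left out, and returning the frame in sorted order is a choice of the sketch.

```python
import pandas as pd


def make_index_unique_sketch(df, sort_by=None, ascending=False, separator="-"):
    """Sketch: first occurrence keeps its label, later ones get -1, -2, ..."""
    out = df.copy()
    if sort_by is not None:
        # A stable sort keeps the original order among ties.
        out = out.sort_values(sort_by, ascending=ascending, kind="stable")
    labels = pd.Series([str(label) for label in out.index])
    counter = labels.groupby(labels).cumcount()  # 0 for the first occurrence
    later = counter > 0
    labels[later] = labels[later] + separator + counter[later].astype(str)
    out.index = pd.Index(labels.to_numpy(), name=df.index.name)
    return out


cities = pd.DataFrame(
    {"population": [30_000, 170_000, 60_000]},
    index=["Springfield", "Springfield", "Salem"],
)
result = make_index_unique_sketch(cities, sort_by="population")
# result.index: ["Springfield", "Salem", "Springfield-1"]
```

Because the frame is sorted by population in descending order first, the larger Springfield keeps the unsuffixed label and the smaller one becomes "Springfield-1".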