transform

Transformation engine for applying variable transformations to DataFrames.

This module provides a flexible system for transforming variables based on YAML recipe specifications. It supports:

- Unary operations (log, arcsinh, power, etc.)
- Binary operations (arithmetic on two columns)
- Aggregate operations (sum, min, max across multiple columns)
- String remapping and reclassification
- Conditional transformations
- Date/time extractions
- Complex expressions
- Pattern-based transformations

Functions

`apply_transformations`(...)	Apply transformations from recipe to dataframe.
`apply_legacy_columns`(...)	Rename legacy columns declared by a recipe, merging when both exist.
`apply_transformation`(...)	Apply a single transformation based on configuration.
`apply_transformation_pattern`(...)	Apply pattern-based transformation to multiple columns.
`get_crosswalk`(crosswalk_dict[, flip])	Get a crosswalk (Series of default keys -> source keys)
`remap`(df, recipe_id)	Remap values in dataframe column using recipe table
`add_unique_suffix`(s)	Make string Series unique by appending unique integer suffixes.
`make_index_unique`(...)	Return a copy of a DataFrame / GeoDataFrame with a unique string index.

Module Contents

openplaces.io.transform.apply_transformations(df: pandas.DataFrame | geopandas.GeoDataFrame, recipe: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply transformations from recipe to dataframe.

Parameters:

df (DataFrame or GeoDataFrame) – Input data to transform
recipe (dict) – Recipe dictionary containing ‘transformations’ and optionally ‘transformation_patterns’ keys
silent (bool, default False) – If True, suppress warnings

Returns:

Transformed dataframe with new columns added

Return type:

DataFrame or GeoDataFrame
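
The recipe schema is not documented on this page, so the following is only a rough sketch of how a recipe-driven loop over the `transformations` key might add new columns. The keys `name`, `op`, and `sources`, the operation tables, and the function `apply_recipe_sketch` are all hypothetical and are not the actual `openplaces` API.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical operation tables; the real module's operation names may differ.
UNARY_OPS = {"log": np.log, "arcsinh": np.arcsinh}
BINARY_OPS = {"add": operator.add, "divide": operator.truediv}
AGGREGATE_OPS = {"sum", "min", "max"}


def apply_recipe_sketch(df: pd.DataFrame, recipe: dict) -> pd.DataFrame:
    """Add one new column per entry in recipe['transformations'] (illustrative only)."""
    out = df.copy()  # leave the caller's frame untouched
    for spec in recipe.get("transformations", []):
        # 'name', 'op' and 'sources' are assumed keys, not the documented schema.
        name, op, sources = spec["name"], spec["op"], spec["sources"]
        if op in UNARY_OPS:
            out[name] = UNARY_OPS[op](out[sources[0]])
        elif op in BINARY_OPS:
            out[name] = BINARY_OPS[op](out[sources[0]], out[sources[1]])
        elif op in AGGREGATE_OPS:
            # Row-wise aggregate across several columns.
            out[name] = getattr(out[sources], op)(axis=1)
        else:
            raise ValueError(f"Unknown operation: {op!r}")
    return out


places = pd.DataFrame(
    {
        "population": [1, 10],
        "area_km2": [2.0, 5.0],
        "jobs_retail": [3, 4],
        "jobs_office": [5, 6],
    }
)
recipe = {
    "transformations": [
        {"name": "log_population", "op": "log", "sources": ["population"]},
        {"name": "density", "op": "divide", "sources": ["population", "area_km2"]},
        {"name": "jobs_total", "op": "sum", "sources": ["jobs_retail", "jobs_office"]},
    ]
}
transformed = apply_recipe_sketch(places, recipe)
```

The sketch returns a copy with the new columns appended, which matches the documented return value ("Transformed dataframe with new columns added"). Whether the real function copies or mutates its input is not stated here.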

openplaces.io.transform.apply_legacy_columns(df: pandas.DataFrame | geopandas.GeoDataFrame, recipe: dict[str, Any]) → pandas.DataFrame | geopandas.GeoDataFrame

Rename legacy columns declared by a recipe, merging when both exist.
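
To illustrate what "merging when both exist" could mean, here is a minimal sketch in plain pandas. It assumes the legacy declaration reduces to an `{old_name: new_name}` mapping and that a merge keeps the current column's values and fills its gaps from the legacy column. Both the mapping shape and the merge rule are assumptions, and `rename_legacy_sketch` is a hypothetical name.

```python
import pandas as pd


def rename_legacy_sketch(df: pd.DataFrame, legacy_map: dict) -> pd.DataFrame:
    """Rename old columns to new names; merge when both are present (assumed rule)."""
    out = df.copy()
    for old, new in legacy_map.items():
        if old not in out.columns:
            continue  # nothing to rename
        if new in out.columns:
            # Both exist: prefer the current column, fill missing values from the legacy one.
            out[new] = out[new].combine_first(out[old])
            out = out.drop(columns=old)
        else:
            out = out.rename(columns={old: new})
    return out


frame = pd.DataFrame(
    {
        "pop_total": [1.0, None],   # legacy column
        "population": [None, 2.0],  # current column with a gap
        "nm": ["a", "b"],           # legacy column with no current counterpart
    }
)
renamed = rename_legacy_sketch(frame, {"pop_total": "population", "nm": "name"})
```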

openplaces.io.transform.apply_transformation(df: pandas.DataFrame | geopandas.GeoDataFrame, config: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply a single transformation based on configuration.

openplaces.io.transform.apply_transformation_pattern(df: pandas.DataFrame | geopandas.GeoDataFrame, config: dict[str, Any], silent: bool = False) → pandas.DataFrame | geopandas.GeoDataFrame

Apply pattern-based transformation to multiple columns.

openplaces.io.transform.get_crosswalk(crosswalk_dict, flip=False)

Get a crosswalk (Series of default keys -> source keys)

Parameters:

crosswalk_dict (dict) – Dictionary with crosswalk arguments
flip (bool) – Flips keys (index) and value column (usually for joining)
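
The contents of `crosswalk_dict` are not described here, but the effect of `flip` can be sketched with a hand-built Series. The example below assumes a crosswalk indexed by default keys with source keys as values; the key names and the `landuse` column are made up for illustration.

```python
import pandas as pd

# Hypothetical crosswalk: default keys (index) -> source keys (values).
crosswalk = pd.Series({"residential": "res", "commercial": "com"}, name="source_key")
crosswalk.index.name = "default_key"

# Flipped: source keys become the index, which is the shape needed to join onto source data.
flipped = pd.Series(
    crosswalk.index.to_numpy(),
    index=pd.Index(crosswalk.to_numpy(), name=crosswalk.name),
    name=crosswalk.index.name,
)

source = pd.DataFrame({"landuse": ["com", "res", "com"]})
joined = source.join(flipped, on="landuse")
```

After the join, each source row carries its default key in a `default_key` column, which is why the flipped orientation is the convenient one for joining.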

openplaces.io.transform.remap(df, recipe_id)

Remap values in dataframe column using recipe table

Parameters:

df (DataFrame or GeoDataFrame) – Data
recipe_id (str) – ID of recipe table that contains the remapping

openplaces.io.transform.add_unique_suffix(s)

Make string Series unique by appending unique integer suffixes.

All duplicate occurrences are suffixed (-1, -2, …), including the first one. Use make_index_unique when operating on a DataFrame index and the first (or largest) occurrence should keep the unsuffixed value.

Parameters:

s (pd.Series) – String Series containing duplicate entries
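
The documented behavior (every member of a duplicate group is suffixed, including the first) can be reproduced in plain pandas roughly as follows. This is a sketch of the described semantics, not the module's implementation; `suffix_all_duplicates` is a hypothetical name.

```python
import pandas as pd


def suffix_all_duplicates(s: pd.Series) -> pd.Series:
    """Append -1, -2, ... to every member of a duplicate group, including the first."""
    is_dup = s.duplicated(keep=False)       # True for all members of a duplicate group
    counter = s.groupby(s).cumcount() + 1   # 1-based position within each group
    out = s.copy()
    out[is_dup] = s[is_dup] + "-" + counter[is_dup].astype(str)
    return out


names = pd.Series(["park", "lake", "park", "park"])
unique_names = suffix_all_duplicates(names)
```

Here the three `park` entries become `park-1`, `park-2`, and `park-3`, while the non-duplicated `lake` is left unchanged.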

openplaces.io.transform.make_index_unique(df: pandas.DataFrame, sort_by: str | None = None, ascending: bool = False, separator: str = '-', *, sort_duplicates_by_area: bool = False, area_crs: str = 'EPSG:6933') → pandas.DataFrame

Return a copy of a DataFrame / GeoDataFrame with a unique string index.

Duplicate index values are resolved so that the first occurrence keeps the original index value and later duplicates receive suffixes -1, -2, … Sorting controls which occurrence counts as “first”.

Unlike add_unique_suffix, which operates on a Series and suffixes every duplicate (including the first), this function preserves the unsuffixed value for the winning row.

Parameters:

df (pd.DataFrame or gpd.GeoDataFrame) – Input frame whose index will be made unique.
sort_by (str, optional) – Column to sort the entire frame by before resolving duplicates.
ascending (bool) – Sort direction. Default False so larger values sort first.
separator (str) – String inserted between the original index value and the counter.
sort_duplicates_by_area (bool) – If True, and df is a GeoDataFrame, compute equal-area geometry area for rows with duplicated index values and sort within each group so the largest polygon keeps the unsuffixed index.
area_crs (str) – Equal-area CRS used for area calculation. Default: EPSG:6933.
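
As a rough illustration of the "first occurrence keeps the original value" rule and of how `sort_by` decides which row wins, here is a minimal pandas-only sketch. It mirrors the documented `sort_by`, `ascending`, and `separator` parameters but omits the area-based sorting, and it does not guard against a generated label colliding with an existing one. `make_index_unique_sketch` is a hypothetical stand-in, not the library function.

```python
import pandas as pd


def make_index_unique_sketch(df, sort_by=None, ascending=False, separator="-"):
    """Return a copy whose index is unique; the first occurrence stays unsuffixed."""
    out = df.copy()
    if sort_by is not None:
        # A stable sort keeps the original order among ties.
        out = out.sort_values(sort_by, ascending=ascending, kind="stable")
    labels = pd.Series(out.index.astype(str))
    counter = labels.groupby(labels).cumcount()  # 0 for the first occurrence, then 1, 2, ...
    new_labels = labels.where(counter == 0, labels + separator + counter.astype(str))
    out.index = pd.Index(new_labels.to_numpy(), name=df.index.name)
    return out


towns = pd.DataFrame(
    {"population": [100, 500, 50]},
    index=["springfield", "springfield", "shelbyville"],
)
unsorted = make_index_unique_sketch(towns)
by_population = make_index_unique_sketch(towns, sort_by="population")
```

Without sorting, the first `springfield` row (population 100) keeps the plain label. With `sort_by="population"` and the default descending order, the larger row (population 500) sorts first and keeps it instead.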