Writing recipes#

Recipes are the instructions that allow openplaces to ingest new data into its data structure.

By contributing recipes, you enable others to reproduce your work.

Recipes can be found in src/openplaces/recipes.

Data ingestion recipes#

These recipes manage data ingestion: file handling, data formatting, and light preprocessing (no major edits to the data).

Dataset description and source#

admin_id#

Admin ID of the (top-level) administrative unit for which the dataset provides data. Examples:

NULL

Global

US

United States

US-MA

Massachusetts, United States

US-MA-MI

Middlesex County, Massachusetts, United States

US-MA-MI-SO

City of Somerville, Middlesex County, Massachusetts, United States

entity#

Entity of the dataset: the “thing” that every row of the dataset table refers to.

Use entity when each row in the data is a fundamental building block of openplaces — a parcel, building, admin unit, etc.

Use dataset instead when the data is not itself an entity type (see dataset below).

entity_type#

Type of entity, e.g. admin, parcel, property, building, transaction, tile.

source#

Source of the data.

portal_url#

URL of the landing page of the portal that offers and explains the data.

download_url#

URL to download the data directly.

If download_by is also set, download_url can include placeholders to be resolved, e.g.:

https://nsi.sec.usace.army.mil/downloads/nsi_2022/nsi_2022_{admin2_id_admin1}.gpkg.zip
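
As a rough illustration of how such a placeholder might be resolved, the sketch below fills the template with Python's `str.format`. The helper name `resolve_url` and the partition value `"25"` are assumptions made for this example; the substitution logic inside openplaces may differ.

```python
# Hypothetical helper: fill placeholders in a download_url template.
DOWNLOAD_URL = (
    "https://nsi.sec.usace.army.mil/downloads/nsi_2022/"
    "nsi_2022_{admin2_id_admin1}.gpkg.zip"
)


def resolve_url(template: str, partition_values: dict) -> str:
    """Return the template with each {placeholder} replaced by its value."""
    return template.format(**partition_values)


# "25" is an assumed example value for the admin2_id_admin1 placeholder.
url = resolve_url(DOWNLOAD_URL, {"admin2_id_admin1": "25"})
```

One URL would be produced per partition value, so a state-level download_by would yield one file per state.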

download_url_source#

URL of the page from which the download URL can be extracted.

download_url_source_regex#

String pattern (regular expression) that extracts the download URL from the HTML content of download_url_source.

Placeholders in the pattern (e.g. {admin3_name}) are substituted with the current partition value before matching.
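
The two steps above (substitute, then match) can be sketched as follows. The pattern, the HTML snippet, the helper name `extract_download_url`, and the choice to return the first capture group are all assumptions for illustration, not taken from an actual recipe or from the openplaces code.

```python
import re
from typing import Optional

# Assumed example inputs: neither the pattern nor the HTML comes from a real recipe.
PATTERN = r'href="(https://example\.org/data/{admin3_name}_parcels_\d{4}\.zip)"'
HTML = '<a href="https://example.org/data/Durham_parcels_2024.zip">Durham</a>'


def extract_download_url(pattern: str, html: str, partition: dict) -> Optional[str]:
    """Substitute placeholders into the pattern, then search the page content."""
    for key, value in partition.items():
        # str.replace leaves regex quantifiers such as \d{4} untouched,
        # which str.format would misread as placeholders.
        pattern = pattern.replace("{" + key + "}", re.escape(value))
    match = re.search(pattern, html)
    return match.group(1) if match else None


found = extract_download_url(PATTERN, HTML, {"admin3_name": "Durham"})
```

Escaping the partition value with `re.escape` keeps names that contain regex metacharacters (a dot or parenthesis, say) from changing the meaning of the pattern.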

version#

Version of the dataset (year, date, or version number).

dataset#

Datasets provide attributes that will be linked to entities.

Datasets can come in many formats: tables, vectors, rasters, XML. What distinguishes them from entities is that datasets are not yet organized by entity: the rows don’t refer to the entity yet (and therefore require some linkage algorithm).

theme#

Short descriptor of the data theme, e.g. landcover-annual.

source

Same sub-attributes as the source of entities.

version

Version of the dataset (year, date, or version number).

is_raster#

Set to True for raster datasets. By default, the file is treated as a table or vector layer.

File handling#

download_by#

Instructions for downloading data that is provided in partitioned chunks (by administrative subdivision, tile, or year). Not needed if the data is provided as a single file.

admin_level#

Level of the administrative breakdown for download partitions:

2 for state-level files
3 for county-level files

admin_key_transform#

Rules to transform admin key values before substituting them into the download URL (e.g. remove spaces from county names).

Example (NC building recipe, which uses the county name in the URL):

admin_key_transform:
  admin3_name: remove_spaces

partition#

Type of non-admin partition. Supported values:

year

Download one file per year. Requires first and last.

tile_id

Download one file per tile. Requires tile_recipe_id.

first#

First year to download when partition is year.

last#

Last year to download when partition is year.

tile_recipe_id#

Recipe ID for the tile index when partition is tile_id.

Example: tile-obm-2025.

Example (raster dataset partitioned by year):

download_by:
  partition: year
  first: 1986
  last: 2024

Example (building footprints partitioned by tile):

download_by:
  partition: tile_id
  tile_recipe_id: tile-obm-2025
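
To make the year partition concrete, the sketch below expands a download_by block into the list of years it would cover. It assumes that `last` is inclusive, which the wording "last year to download" suggests but does not state outright; the function name `partition_values` is hypothetical.

```python
# Hypothetical expansion of a year-partitioned download_by block.
download_by = {"partition": "year", "first": 1986, "last": 2024}


def partition_values(spec: dict) -> list:
    """List the partition values to download, assuming `last` is inclusive."""
    if spec["partition"] != "year":
        raise ValueError("this sketch only handles partition: year")
    return list(range(spec["first"], spec["last"] + 1))


years = partition_values(download_by)
```

Under that assumption the example block above covers 39 annual files, 1986 through 2024.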

compressed_file_name#

Filename of the compressed file (usually in the external folder; see Directory structure).

Providing this skips re-downloading when the file is already present.

Can include placeholders substituted by download_by (e.g. {admin2_id_leaf}) and wildcards (the ingester will search for a matching file).
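
A minimal sketch of how placeholders and wildcards could combine when looking for an already-downloaded file is shown below. It uses shell-style matching from the standard `fnmatch` module; the file listing, the pattern, and the helper `find_matching_file` are assumptions, and the ingester's actual search may work differently.

```python
import fnmatch
from typing import Optional

# Assumed file listing and pattern, for illustration only.
FILES = ["nsi_2022_25.gpkg.zip", "nsi_2022_37.gpkg.zip", "readme.txt"]


def find_matching_file(name_pattern: str, files: list, partition: dict) -> Optional[str]:
    """Fill placeholders, then return the first file matching the wildcard pattern."""
    pattern = name_pattern.format(**partition)
    matches = sorted(fnmatch.filter(files, pattern))
    return matches[0] if matches else None


hit = find_matching_file("nsi_*_{admin2_id_leaf}.gpkg.zip", FILES, {"admin2_id_leaf": "37"})
```

Placeholders are resolved first so that the wildcard only has to cover the part of the name that is genuinely unknown, such as a release year.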

uncompressed_file_name#

Filename of the uncompressed file, ready to be read.

If compressed_file_name is set, the uncompressed file is expected in the heap folder after extraction.

If there is no compressed_file_name, the uncompressed file is treated as the original download in the external folder.

Can include placeholders and wildcards.

encoding#

Character encoding of the source file, passed directly to the underlying reader (e.g. latin-1).

Only needed when the file is not UTF-8.

process_by#

Instructions to process a large data file in smaller chunks, typically by administrative subdivision (e.g. reading a state-wide geodatabase county by county to avoid loading it entirely into memory).

admin_level

Level of the administrative breakdown for processing chunks.

admin_id_column#

Name of the column in the input data that identifies the processing chunk (commonly a county FIPS code or similar admin identifier).

admin_id_transformation#

Optional transformation to apply to the values in admin_id_column before building the crosswalk.

Useful when the source column contains a partial identifier that needs to be modified to match openplaces Admin IDs (e.g. prefixing a 3-digit county code with a state FIPS code to produce a 5-digit FIPS).

Uses the same transformation dict format as transformations.

Example (Wisconsin parcels, which stores a 3-digit county code that needs a 55 prefix):

admin_id_transformation:
  output: admin3_id_admin1
  type: string
  operation: add_prefix
  args:
    prefix: "55"
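
The effect of this transformation can be sketched in a few lines of Python. The function below is a hypothetical stand-in for the `add_prefix` operation, not the openplaces implementation, and the county codes are sample values.

```python
# Hypothetical re-implementation of an add_prefix string operation.
def add_prefix(values: list, prefix: str) -> list:
    """Prepend `prefix` to every value, treating values as strings."""
    return [prefix + str(value) for value in values]


county_codes = ["001", "025", "079"]           # 3-digit county codes
fips_codes = add_prefix(county_codes, "55")    # 55 = Wisconsin state FIPS
```

Keeping the codes as strings matters here: converting them to integers would drop the leading zeros that make up the 5-digit FIPS code.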

admin_id_crosswalk#

Instructions to map values from the source admin column (after any admin_id_transformation) to openplaces Admin IDs, so that rows can be filtered by admin chunk.

Two forms are supported:

Recipe shorthand — points to a prebuilt crosswalk table:

admin_id_crosswalk:
  recipe_id: "US-MA_parcel-massgis-2025_admin4-crosswalk"

Dynamic crosswalk — built at runtime from an admin recipe:

admin_id_crosswalk:
  admin_level: 3
  admin_id_column: admin3_id_admin1
  admin_recipe_id: "US_admin-census-2021_admin3"

In the dynamic form:

admin_level

Administrative level of the crosswalk target, e.g. 3 for counties.

admin_id_column

Column name in the input data (or produced by admin_id_transformation) to match against the crosswalk.

admin_recipe_id

Recipe from which to derive the crosswalk, e.g. "US_admin-census-2021_admin3".
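
Conceptually, the crosswalk is a lookup from source identifiers to Admin IDs that decides which rows belong to the chunk being processed. The sketch below shows that idea with plain dictionaries; the crosswalk contents, the Admin ID `US-MA-SU`, and the helper `rows_for_chunk` are illustrative assumptions rather than actual openplaces data or code.

```python
# Assumed crosswalk: source county FIPS -> openplaces Admin ID (IDs are illustrative).
crosswalk = {"25017": "US-MA-MI", "25025": "US-MA-SU"}

rows = [
    {"admin3_id_admin1": "25017", "parcel": "A"},
    {"admin3_id_admin1": "25025", "parcel": "B"},
    {"admin3_id_admin1": "25017", "parcel": "C"},
]


def rows_for_chunk(rows: list, column: str, crosswalk: dict, admin_id: str) -> list:
    """Keep the rows whose source identifier maps to the requested Admin ID."""
    return [row for row in rows if crosswalk.get(row[column]) == admin_id]


middlesex = rows_for_chunk(rows, "admin3_id_admin1", crosswalk, "US-MA-MI")
```

Rows whose identifier is missing from the crosswalk match no chunk in this sketch; how the ingester treats such rows is not covered here.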

use_spatial_index#

Set to True to use a spatial index on admin geometries to determine which rows fall within the current processing chunk, instead of using a crosswalk column.

Requires admin geometries to be loaded (admin level and recipe are read from this same process_by block).

use_spatial_mask#

Set to True to clip the data to the geometry of the current admin chunk instead of just filtering rows by a crosswalk column.

Useful when source data does not have a reliable admin ID column and rows must be assigned by spatial containment.

Reading#

layer#

Layer to read from the file.

Geodatabases (.gdb) and GeoPackages (.gpkg) can contain multiple layers. Use this to identify which one to read.

Also used as a cache key when additional_layers references the same source file.

Columns#

columns#

Dictionary of columns to keep from the source data, with optional renaming.

Provided as openplaces_column_name: original_column_name pairs.

Only the listed columns are retained (plus any columns generated by index creation or admin attribution); use keep_unnamed_columns to retain additional columns.

Example (from the Global Administrative Database):

columns:
  name: NAME_1
  type: ENGTYPE_1
  admin1_name: COUNTRY

keep_unnamed_columns#

Set to True to retain all columns not listed in columns.

Also retains columns generated during index creation (e.g. by create_index).
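
The interplay between columns and keep_unnamed_columns can be sketched on a single row represented as a dictionary. The helper `select_columns` and the sample row (including the `GID_1` column and its value) are assumptions for illustration; the real ingester operates on whole tables.

```python
def select_columns(row: dict, columns: dict, keep_unnamed: bool = False) -> dict:
    """Rename source columns to openplaces names; optionally keep the rest."""
    renamed = {new: row[old] for new, old in columns.items()}
    if keep_unnamed:
        listed = set(columns.values())
        renamed.update({k: v for k, v in row.items() if k not in listed})
    return renamed


COLUMNS = {"name": "NAME_1", "type": "ENGTYPE_1", "admin1_name": "COUNTRY"}
row = {
    "NAME_1": "Massachusetts",
    "ENGTYPE_1": "State",
    "COUNTRY": "United States",
    "GID_1": "USA.22_1",  # assumed extra column, not listed in COLUMNS
}
kept = select_columns(row, COLUMNS)
```

Note the direction of the mapping: the key is the openplaces name and the value is the original column name.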

Data cleaning#

null_value_strings#

List of strings in the source data that should be interpreted as missing values (None).

Applied to all non-ID columns before any other preprocessing.

Example: ["NA", "NOT APPLICABLE", "NOT PROVIDED"]
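
A minimal sketch of this cleaning step on one row follows. It assumes exact, case-sensitive string matching and takes the set of ID columns as an explicit argument; both choices, and the helper name `clean_row`, are assumptions rather than documented behavior.

```python
NULL_VALUE_STRINGS = ["NA", "NOT APPLICABLE", "NOT PROVIDED"]


def clean_row(row: dict, id_columns: set) -> dict:
    """Replace listed strings with None in every non-ID column."""
    return {
        key: None if key not in id_columns and value in NULL_VALUE_STRINGS else value
        for key, value in row.items()
    }


cleaned = clean_row(
    {"parcel_id": "NA", "owner": "NOT PROVIDED", "use": "Retail"},
    id_columns={"parcel_id"},
)
```

The ID column keeps its literal "NA" value, which illustrates why ID columns are excluded: an identifier that happens to equal a null marker should not be erased.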

query#

Pandas query string to filter rows.

Passed to pd.DataFrame.query(). Use to remove unwanted geometries before further processing.

Example: "type != 'Water body'" (used in GADM to drop ocean polygons).

Space saving#

columns_to_categorical#

Columns to cast to the pandas Categorical dtype, reducing memory and on-disk size.

Each entry is either a plain column name (unordered categorical) or a single-key dict with an explicit category list (ordered categorical, from lowest to highest):

columns_to_categorical:
  - city                           # unordered
  - zip                            # unordered
  - height_source: [H, HBET, HHT]  # ordered: H < HBET < HHT

Ordered categoricals enable prefer_higher in keep_overlapping_polygons.

Overlap handling#

keep_overlapping_polygons#

Policy for resolving pairs of polygons that overlap substantially (intersection-over-union > 0.5).

If omitted, overlapping polygons are reported as a warning but both are kept.

prefer_higher#

Name of an (ordered) categorical column. When two polygons overlap, the one with the higher category value is kept. Useful for preferring footprints derived from better data sources.

Example (OBM buildings, preferring height sources that are more reliable):

keep_overlapping_polygons:
  prefer_higher: height_source
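
To illustrate the overlap rule, the sketch below computes intersection-over-union for two features and applies prefer_higher when the 0.5 threshold is exceeded. Axis-aligned boxes stand in for real polygons to keep the example dependency-free, and the `resolve` helper is hypothetical; the openplaces implementation works on actual geometries.

```python
# Axis-aligned boxes (xmin, ymin, xmax, ymax) stand in for real polygons.
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    intersection = max(width, 0.0) * max(height, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)


ORDER = ["H", "HBET", "HHT"]  # ordered categories, lowest to highest


def resolve(first: dict, second: dict, threshold: float = 0.5) -> list:
    """Keep both features unless IoU exceeds the threshold; then keep the higher category."""
    if iou(first["box"], second["box"]) <= threshold:
        return [first, second]
    return [max(first, second, key=lambda f: ORDER.index(f["height_source"]))]


a = {"box": (0, 0, 10, 10), "height_source": "H"}
b = {"box": (1, 0, 11, 10), "height_source": "HHT"}
kept = resolve(a, b)
```

Here the two boxes share 90 of 110 units of combined area (IoU of about 0.82), so only the feature with the higher-ranked `height_source` survives.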

Data transformations#

transformations#

List of operations to derive new columns from existing ones.

Each transformation is a dict with output, type, operation, input, and optionally args.

Refer to openplaces.io.transform for available operations.

Example (extract the first five characters of a census block ID to get the county FIPS code):

transformations:
  - output: admin3_id_admin1
    type: string
    operation: substring
    input: census_block_id
    args:
      start: 0
      end: 5
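
In Python terms, this transformation amounts to a string slice. The sketch assumes that `end` is exclusive, as in Python slicing, which is consistent with start 0 and end 5 yielding the first five characters; the function and the sample block ID are illustrative.

```python
def substring(value: str, start: int, end: int) -> str:
    """Return value[start:end]; `end` is assumed exclusive, as in Python slices."""
    return value[start:end]


census_block_id = "250173501001000"   # illustrative 15-digit block ID
county_fips = substring(census_block_id, start=0, end=5)
```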

Indexing#

set_index#

Name of an existing column to use as the index.

Raises an error if the column contains duplicates.

create_index#

Define an index using a function and optional arguments.

function#

Dotted path to an openplaces function that adds an index to the GeoDataFrame.

Must start with openplaces. (enforced to prevent arbitrary code execution).

Example:

create_index:
  function: "openplaces.geo.ids.add_openlocationcode_index"
  args:
    name: footprint_id

args#

Keyword arguments passed to function.
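
One way the dotted-path restriction could be enforced is sketched below: the prefix is checked before anything is imported, and only then is the function resolved with `importlib`. The helper `resolve_function` is hypothetical, and its `required_prefix` parameter exists only so the sketch can be exercised without openplaces installed.

```python
import importlib


def resolve_function(path: str, required_prefix: str = "openplaces."):
    """Import and return the callable at `path`, refusing paths outside the prefix."""
    if not path.startswith(required_prefix):
        raise ValueError(f"function must start with {required_prefix!r}: {path}")
    module_name, _, attribute = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# The check rejects anything outside the allowed namespace before importing it.
try:
    resolve_function("os.system")
    rejected = False
except ValueError:
    rejected = True
```

With a function resolved this way, the recipe's args would then be passed as keyword arguments, along the lines of `function(data, **args)`.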

sort_index#

Set to True to sort the dataset by its index after creation.

drop#

List of index values to drop after the index has been created.

Use to remove known bad rows that cannot be excluded by a query filter.

Attribution to administrative units#

Two mechanisms assign openplaces Admin IDs to rows.

Use admin_id_crosswalk when the source data already contains an admin identifier column, and overlay_admin_ids when Admin IDs must be derived from spatial geometry.

admin_id_crosswalk

Add an openplaces Admin ID column by joining from a crosswalk between source admin identifiers and openplaces Admin IDs.

Distinct from process_by.admin_id_crosswalk: that crosswalk is used to split the data during reading; this one attributes rows to admin units in the output.

admin_level

Administrative level of the target Admin IDs, e.g. 3.

admin_id_column

Column in the input data to join on, e.g. admin3_id_admin1.

admin_recipe_id

Recipe from which to derive the crosswalk, e.g. "US_admin-census-2021_admin3".

overlay_admin_ids#

Assign Admin IDs to rows by spatial overlay — intersecting geometries with admin unit boundaries.

Use when the source data has no reliable admin ID column and spatial containment must be used instead.

admin_level

Level of administrative units to assign to geometries.

admin_recipe_id

Recipe that provides the admin unit geometries.

Should use a high-resolution in-country boundary dataset for accurate allocation (e.g. building footprints require county-level precision).

Example: "US_admin-census-2021_admin3" for US counties.

Saving#

save_to#

Controls where and how output is written.

data_dir#

Name of the openplaces data directory in which to save output.

Must be one of the registered directory names (e.g. cache, core, out).

Defaults to cache if omitted.

admin_level

Admin level at which to split output into separate files.

Not needed if splitting is already determined by download_by or process_by.

Example: set to 3 to write one parquet file per county.

Note

When this level is coarser than download_by.admin_level or process_by.admin_level (e.g. saving at county level 3 while processing at town level 4), the ingester automatically aggregates the intermediate files to the coarser level and deletes the intermediates, including any empty directories that remain.

filename#

Override the auto-generated output filename stem.

The auto-generated name is derived from the reference (admin ID, entity, source, version). Use this when the default is ambiguous — for example, a recipe that produces multiple files for different admin levels, or a dataset that needs a fixed conventional name.

Example: admin2 (used in admin recipes to produce admin-gadm-4~1_admin2.parquet).

Additional layers#

additional_layers#

List of additional tables to ingest from the same source file, each producing a separate output for a different entity type.

For example, a Massachusetts parcel geodatabase contains both a parcel polygon layer and a property attribute table; additional_layers ingests both in a single recipe.

Each entry in the list is a layer spec dict. The full recipe for each layer is built by merging the primary recipe with the layer spec:

  • Per-table keys are taken from the layer spec when present; otherwise they are removed from the merged recipe. These include:

    entity, layer, columns, keep_unnamed_columns, set_index, create_index, index_function, drop, query, null_value_strings, transformations, columns_to_categorical, encoding, save_to, overlay_admin_ids

  • Shared keys are always inherited from the primary recipe. These include:

    admin_id, download_by, compressed_file_name, uncompressed_file_name

  • process_by is inherited from the primary recipe unless the layer spec sets it explicitly. Use process_by: null in the layer spec to disable chunking for that table.

  • additional_layers cannot be nested.

entity is required in every layer spec.
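
The merge rules above can be sketched as a small dictionary operation. This is an illustration under stated assumptions, not the openplaces implementation: `build_layer_recipe` is a hypothetical name, and `PER_TABLE_KEYS` lists only a subset of the per-table keys for brevity.

```python
PER_TABLE_KEYS = {"entity", "layer", "columns", "query", "save_to"}  # subset, for brevity


def build_layer_recipe(primary: dict, layer_spec: dict) -> dict:
    """Merge a primary recipe with one additional_layers entry."""
    if "entity" not in layer_spec:
        raise ValueError("entity is required in every layer spec")
    # Start from the primary recipe without per-table keys or nested layers.
    merged = {
        key: value
        for key, value in primary.items()
        if key not in PER_TABLE_KEYS and key != "additional_layers"
    }
    # Per-table keys and an explicit process_by (even None) come from the layer spec.
    for key in PER_TABLE_KEYS | {"process_by"}:
        if key in layer_spec:
            merged[key] = layer_spec[key]
    return merged


primary = {
    "admin_id": "US-MA",
    "layer": "L3_TAXPAR_POLY",
    "process_by": {"admin_level": 4},
    "columns": {"parcel_id_admin2": "LOC_ID"},
}
spec = {"layer": "L3_ASSESS", "entity": {"entity_type": "property"}, "process_by": None}
merged = build_layer_recipe(primary, spec)
```

Testing for the presence of the `process_by` key, rather than for a non-empty value, is what lets `process_by: null` disable chunking instead of falling back to the primary recipe's setting.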

Example: Massachusetts parcels. Primary layer is parcel polygons; additional layer is the property attribute table.

layer: L3_TAXPAR_POLY
process_by:
  admin_level: 4
  admin_id_column: TOWN_ID
  admin_id_crosswalk:
    recipe_id: "US-MA_parcel-massgis-2025_admin4-crosswalk"
columns:
  parcel_id_admin2: LOC_ID
  polygon_type: POLY_TYPE

additional_layers:
  - layer: L3_ASSESS
    entity:
      entity_type: property
      source:
        source_id: massgis
      version: 2025
    columns:
      property_id_assessor: PROP_ID
      parcel_id_admin2: LOC_ID
      total_value: TOTAL_VAL
    columns_to_categorical:
      - city
      - zip

For the full recipe, see:

src/openplaces/recipes/US/MA/_all/parcel/massgis/2025/US-MA_parcel-massgis-2025.yaml