Writing recipes#
Recipes are the instructions that allow openplaces to ingest new data into its data structure.
By contributing recipes, you enable others to reproduce your work.
Recipes can be found in src/openplaces/recipes.
Data ingestion recipes#
Recipes that manage data ingestion: file handling, data formatting, and gentle preprocessing (no major edits to data).
Dataset description and source#
- admin_id#
Admin ID of the (top-level) administrative unit for which the dataset provides data.
    NULL           Global
    US             United States
    US-MA          Massachusetts, U.S.
    US-MA-MI       Middlesex county, Massachusetts, United States.
    US-MA-MI-SO    City of Somerville, Middlesex county, Massachusetts, United States.
- entity#
Entity of the dataset: the “thing” that every row of the dataset table refers to.
Use entity when each row in the data is a fundamental building block of openplaces: a parcel, building, admin unit, etc.
Use dataset instead when the data is not itself an entity type (see dataset below).
- source#
Source of the data.
- portal_url#
URL of the landing page of the portal that offers and explains the data.
- download_url#
URL to download the data directly.
If download_by is also set, download_url can include placeholders to be resolved, e.g.:
    https://nsi.sec.usace.army.mil/downloads/nsi_2022/nsi_2022_{admin2_id_admin1}.gpkg.zip
- download_url_source#
URL of page from which the download URL can be extracted.
- download_url_source_regex#
String pattern (regular expression) that extracts the download URL from the HTML content of download_url_source.
Placeholders in the pattern (e.g. {admin3_name}) are substituted with the current partition value before matching.
- version#
Version of the dataset (year, date, or version number).
- dataset#
Datasets provide attributes that will be linked to entities.
Datasets can come in many formats: tables, vectors, rasters, XML. What distinguishes them from entities is that datasets are not yet organized by entity: the rows don’t refer to the entity yet (and therefore require some linkage algorithm).
- theme#
Short descriptor of the data theme, e.g. landcover-annual.
- source
Same sub-attributes as the source of entities.
- version
Version of the dataset (year, date, or version number).
- is_raster#
Set to True for raster datasets, so the file is not treated as a table or vector layer (the default).
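The interplay between download_url_source, download_url_source_regex, and placeholders can be sketched as follows. This is a minimal illustration, not the openplaces implementation: the function name, the example page, and the pattern are all hypothetical, and it assumes the URL sits in the first capture group of the pattern.

```python
import re
from typing import Optional


def extract_download_url(html: str, pattern: str, partition: dict) -> Optional[str]:
    """Fill {placeholders} in the pattern, then search the page HTML."""
    resolved = pattern
    for key, value in partition.items():
        # str.replace (not str.format) leaves regex quantifiers such as {4} alone.
        resolved = resolved.replace("{" + key + "}", re.escape(str(value)))
    match = re.search(resolved, html)
    if match is None:
        return None
    # Assumption: the first capture group, if there is one, holds the URL.
    return match.group(1) if match.groups() else match.group(0)


# Hypothetical listing page and pattern, for illustration only.
html = (
    '<a href="https://example.org/data/Orange_2024.zip">Orange</a>'
    '<a href="https://example.org/data/Wake_2024.zip">Wake</a>'
)
pattern = r'href="(https://example\.org/data/{admin3_name}_\d{4}\.zip)"'
url = extract_download_url(html, pattern, {"admin3_name": "Wake"})
print(url)
```

Escaping the substituted value keeps names that contain regex metacharacters from changing the meaning of the pattern.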
File handling#
- download_by#
Instructions for downloading data that is provided in partitioned chunks (by administrative subdivision, tile, or year). Not needed if the data is provided as a single file.
- admin_level#
Level of the administrative breakdown for download partitions:
    2 for state-level files
    3 for county-level files
- admin_key_transform#
Rules to transform admin key values before substituting them into the download URL (e.g. remove spaces from county names).
Example (NC building recipe, which uses the county name in the URL):
    admin_key_transform:
      admin3_name: remove_spaces
- partition#
Type of non-admin partition. Supported values:
    year
    tile_id: Download one file per tile. Requires tile_recipe_id.
Example (raster dataset partitioned by year):
    download_by:
      partition: year
      first: 1986
      last: 2024
Example (building footprints partitioned by tile):
    download_by:
      partition: tile_id
      tile_recipe_id: tile-obm-2025
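To make the download_by options above more concrete, here is a rough sketch of how a year partition might expand into one download per year, and how an admin_key_transform might be applied before a URL placeholder is filled. The helper names and the example URL are hypothetical, and treating last as inclusive is an assumption.

```python
def remove_spaces(value: str) -> str:
    """Transform named in the NC example: 'New Hanover' -> 'NewHanover'."""
    return value.replace(" ", "")


# Hypothetical registry; only remove_spaces is mentioned in the text above.
KEY_TRANSFORMS = {"remove_spaces": remove_spaces}


def partition_values(download_by: dict) -> list:
    """Expand a year partition into the list of years to download."""
    if download_by.get("partition") == "year":
        # Assumption: both first and last are included.
        return list(range(download_by["first"], download_by["last"] + 1))
    raise NotImplementedError("only the year partition is sketched here")


def resolve_url(template: str, values: dict, admin_key_transform: dict) -> str:
    """Apply per-key transforms, then fill the {placeholders} in the URL."""
    transformed = {}
    for key, value in values.items():
        name = admin_key_transform.get(key)
        transformed[key] = KEY_TRANSFORMS[name](value) if name else value
    return template.format(**transformed)


years = partition_values({"partition": "year", "first": 1986, "last": 2024})
county_url = resolve_url(
    "https://example.org/buildings/{admin3_name}.zip",  # made-up URL
    {"admin3_name": "New Hanover"},
    {"admin3_name": "remove_spaces"},
)
print(len(years), county_url)
```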
- compressed_file_name#
Filename of the compressed file (usually in the external folder; see Directory structure).
Providing this skips re-downloading when the file is already present.
Can include placeholders substituted by download_by (e.g. {admin2_id_leaf}) and wildcards (the ingester will search for a matching file).
- uncompressed_file_name#
Filename of the uncompressed file, ready to be read.
If compressed_file_name is set, the uncompressed file is expected in the heap folder after extraction.
If there is no compressed_file_name, the uncompressed file is treated as the original download in the external folder.
Can include placeholders and wildcards.
- encoding#
Character encoding of the source file, passed directly to the underlying reader (e.g. latin-1).
Only needed when the file is not UTF-8.
- process_by#
Instructions to process a large data file in smaller chunks, typically by administrative subdivision (e.g. reading a state-wide geodatabase county by county to avoid loading it entirely into memory).
- admin_level
Level of the administrative breakdown for processing chunks.
- admin_id_column#
Name of the column in the input data that identifies the processing chunk (commonly a county FIPS code or similar admin identifier).
- admin_id_transformation#
Optional transformation to apply to the values in admin_id_column before building the crosswalk.
Useful when the source column contains a partial identifier that needs to be modified to match openplaces Admin IDs (e.g. prefixing a 3-digit county code with a state FIPS code to produce a 5-digit FIPS).
Uses the same transformation dict format as transformations.
Example (Wisconsin parcels, which stores a 3-digit county code that needs a 55 prefix):
    admin_id_transformation:
      output: admin3_id_admin1
      type: string
      operation: add_prefix
      args:
        prefix: "55"
- admin_id_crosswalk#
Instructions to map values from the source admin column (after any admin_id_transformation) to openplaces Admin IDs, so that rows can be filtered by admin chunk.
Two forms are supported:
Recipe shorthand, which points to a prebuilt crosswalk table:
    admin_id_crosswalk:
      recipe_id: "US-MA_parcel-massgis-2025_admin4-crosswalk"
Dynamic crosswalk, which is built at runtime from an admin recipe:
    admin_id_crosswalk:
      admin_level: 3
      admin_id_column: admin3_id_admin1
      admin_recipe_id: "US_admin-census-2021_admin3"
In the dynamic form:
- admin_level
Administrative level of the crosswalk target, e.g. 3 for counties.
- admin_id_column
Column name in the input data (or produced by admin_id_transformation) to match against the crosswalk.
- admin_recipe_id
Recipe from which to derive the crosswalk, e.g. "US_admin-census-2021_admin3".
- use_spatial_index#
Set to True to use a spatial index on admin geometries to determine which rows fall within the current processing chunk, instead of using a crosswalk column.
Requires admin geometries to be loaded (admin level and recipe are read from this same process_by block).
- use_spatial_mask#
Set to True to clip the data to the geometry of the current admin chunk instead of just filtering rows by a crosswalk column.
Useful when source data does not have a reliable admin ID column and rows must be assigned by spatial containment.
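The crosswalk-based flavour of process_by described above (transform the admin column, map it through a crosswalk, keep the rows of the current chunk) can be sketched in plain Python. Everything here is illustrative: the function names, the sample rows, and the Admin IDs in the crosswalk are made up, and the real ingester presumably works on data frames rather than lists of dicts.

```python
def add_prefix(value: str, prefix: str) -> str:
    """Mirror of the add_prefix operation: '025' -> '55025'."""
    return prefix + value


def rows_for_chunk(rows, admin_id, column, prefix, crosswalk):
    """Keep only the rows whose transformed admin key maps to admin_id."""
    selected = []
    for row in rows:
        key = add_prefix(row[column], prefix)
        if crosswalk.get(key) == admin_id:
            selected.append(row)
    return selected


# Hypothetical source rows with a 3-digit county code.
rows = [
    {"parcel": "a", "county_code": "025"},
    {"parcel": "b", "county_code": "079"},
    {"parcel": "c", "county_code": "025"},
]
# Hypothetical crosswalk from 5-digit FIPS codes to Admin IDs.
crosswalk = {"55025": "US-WI-DA", "55079": "US-WI-MI"}

chunk = rows_for_chunk(rows, "US-WI-DA", "county_code", "55", crosswalk)
print([row["parcel"] for row in chunk])
```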
Reading#
- layer#
Layer to read from the file.
Geodatabases (.gdb) and GeoPackages (.gpkg) can contain multiple layers. Use this to identify which one to read.
Also used as a cache key when additional_layers references the same source file.
Columns#
- columns#
Dictionary of columns to keep from the source data, with optional renaming.
Provided as openplaces_column_name: original_column_name pairs.
Only the listed columns are retained (plus any columns generated by index creation or admin attribution); use keep_unnamed_columns to retain additional columns.
Example (from the Global Administrative Database):
    columns:
      name: NAME_1
      type: ENGTYPE_1
      admin1_name: COUNTRY
- keep_unnamed_columns#
Set to True to retain all columns not listed in columns.
Also retains columns generated during index creation (e.g. by create_index).
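As a rough illustration of how columns and keep_unnamed_columns interact, the sketch below applies the GADM mapping from the example to a single row held as a dict. The function name and the sample row are hypothetical; the actual ingester operates on whole tables.

```python
def select_columns(row: dict, columns: dict, keep_unnamed: bool = False) -> dict:
    """Rename and keep the listed columns; optionally keep the rest as-is."""
    # columns maps openplaces_column_name -> original_column_name.
    out = {new: row[old] for new, old in columns.items()}
    if keep_unnamed:
        renamed = set(columns.values())
        out.update({k: v for k, v in row.items() if k not in renamed})
    return out


columns = {"name": "NAME_1", "type": "ENGTYPE_1", "admin1_name": "COUNTRY"}
# Hypothetical source row; GID_1 is not listed in columns.
row = {
    "NAME_1": "Massachusetts",
    "ENGTYPE_1": "State",
    "COUNTRY": "United States",
    "GID_1": "USA.22_1",
}

strict = select_columns(row, columns)
loose = select_columns(row, columns, keep_unnamed=True)
print(sorted(strict), sorted(loose))
```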
Data cleaning#
- null_value_strings#
List of strings in the source data that should be interpreted as missing values (None).
Applied to all non-ID columns before any other preprocessing.
Example:
    ["NA", "NOT APPLICABLE", "NOT PROVIDED"]
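A small sketch of the null_value_strings behaviour described above, applied to one row. How the ingester decides which columns count as ID columns is not stated here, so the id_columns parameter is a stand-in for that logic, and the function name is hypothetical.

```python
def replace_null_strings(row: dict, null_value_strings, id_columns=()) -> dict:
    """Turn listed placeholder strings into None, leaving ID columns alone."""
    nulls = set(null_value_strings)
    cleaned = {}
    for column, value in row.items():
        is_null = isinstance(value, str) and value in nulls
        # Assumption: ID columns are passed in explicitly.
        cleaned[column] = None if is_null and column not in id_columns else value
    return cleaned


null_value_strings = ["NA", "NOT APPLICABLE", "NOT PROVIDED"]
row = {"parcel_id": "NA", "owner": "NOT PROVIDED", "use": "Residential"}
cleaned = replace_null_strings(row, null_value_strings, id_columns=("parcel_id",))
print(cleaned)
```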
- query#
Pandas query string to filter rows.
Passed to pd.DataFrame.query(). Use to remove unwanted geometries before further processing.
Example: "type != 'Water body'" (used in GADM to drop ocean polygons).
Space saving#
- columns_to_categorical#
Columns to cast to the pandas Categorical dtype, reducing memory and on-disk size.
Each entry is either a plain column name (unordered categorical) or a single-key dict with an explicit category list (ordered categorical, from lowest to highest):
    columns_to_categorical:
      - city                             # unordered
      - zip                              # unordered
      - height_source: [H, HBET, HHT]    # ordered: H < HBET < HHT
Ordered categoricals enable prefer_higher in keep_overlapping_polygons.
Overlap handling#
- keep_overlapping_polygons#
Policy for resolving pairs of polygons that overlap substantially (intersection-over-union > 0.5).
If omitted, overlapping polygons are reported as a warning but both are kept.
- prefer_higher#
Name of an (ordered) categorical column. When two polygons overlap, the one with the higher category value is kept. Useful for preferring footprints derived from better data sources.
Example (OBM buildings, preferring height sources that are more reliable):
    keep_overlapping_polygons:
      prefer_higher: height_source
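The overlap rule above (intersection-over-union greater than 0.5, then keep the polygon with the higher category) can be illustrated with axis-aligned boxes standing in for real footprints. This is a simplified sketch: the function names and sample boxes are hypothetical, and the actual implementation presumably works on arbitrary polygons.

```python
def box_area(box) -> float:
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)


def box_iou(a, b) -> float:
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


def resolve_overlap(first, second, order, threshold=0.5):
    """Keep both rows unless IoU exceeds the threshold; then keep the higher."""
    if box_iou(first["box"], second["box"]) <= threshold:
        return [first, second]
    # order lists categories from lowest to highest, as in columns_to_categorical.
    return [max(first, second, key=lambda row: order.index(row["height_source"]))]


order = ["H", "HBET", "HHT"]
a = {"id": "a", "box": (0, 0, 10, 10), "height_source": "H"}
b = {"id": "b", "box": (1, 0, 11, 10), "height_source": "HHT"}
c = {"id": "c", "box": (50, 50, 60, 60), "height_source": "H"}

print([row["id"] for row in resolve_overlap(a, b, order)])
print([row["id"] for row in resolve_overlap(a, c, order)])
```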
Data transformations#
- transformations#
List of operations to derive new columns from existing ones.
Each transformation is a dict with output, type, operation, input, and optionally args.
Refer to openplaces.io.transform for available operations.
Example (extract the first five characters of a census block ID to get the county FIPS code):
    transformations:
      - output: admin3_id_admin1
        type: string
        operation: substring
        input: census_block_id
        args:
          start: 0
          end: 5
Indexing#
- set_index#
Name of an existing column to use as the index.
Raises an error if the column contains duplicates.
- create_index#
Define an index using a function and optional arguments.
- function#
Dotted path to an openplaces function that adds an index to the GeoDataFrame.
Must start with "openplaces." (enforced to prevent arbitrary code execution).
Example:
    create_index:
      function: "openplaces.geo.ids.add_openlocationcode_index"
      args:
        name: footprint_id
- sort_index#
Set to True to sort the dataset by its index after creation.
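The two safety checks mentioned in this section, rejecting create_index functions outside the openplaces namespace and rejecting duplicate values for set_index, might look roughly like this. Both helper names are hypothetical, and the required_prefix parameter exists only so the sketch can be tried with a standard-library module.

```python
import importlib
from collections import Counter


def resolve_index_function(dotted_path: str, required_prefix: str = "openplaces."):
    """Import a function by dotted path, refusing anything outside the prefix."""
    if not dotted_path.startswith(required_prefix):
        raise ValueError(f"function must start with {required_prefix!r}: {dotted_path}")
    module_name, _, attribute = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attribute)


def duplicated_values(values) -> list:
    """Return the values that occur more than once, sorted."""
    return sorted(value for value, count in Counter(values).items() if count > 1)


# A path outside the allowed namespace is rejected before any import happens.
try:
    resolve_index_function("os.system")
    rejected = False
except ValueError:
    rejected = True

print(rejected, duplicated_values(["p1", "p2", "p1"]))
```

Checking the prefix before importing means a recipe cannot trigger the import of an arbitrary module as a side effect.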
Attribution to administrative units#
Two mechanisms assign openplaces Admin IDs to rows.
Use admin_id_crosswalk when the source data already contains an admin identifier column, and overlay_admin_ids when Admin IDs must be derived from spatial geometry.
- admin_id_crosswalk
Add an openplaces Admin ID column by joining from a crosswalk between source admin identifiers and openplaces Admin IDs.
Distinct from process_by.admin_id_crosswalk: that crosswalk is used to split the data during reading; this one attributes rows to admin units in the output.
- admin_level
Administrative level of the target Admin IDs, e.g. 3.
- admin_id_column
Column in the input data to join on, e.g. admin3_id_admin1.
- admin_recipe_id
Recipe from which to derive the crosswalk, e.g. "US_admin-census-2021_admin3".
- overlay_admin_ids#
Assign Admin IDs to rows by spatial overlay — intersecting geometries with admin unit boundaries.
Use when the source data has no reliable admin ID column and spatial containment must be used instead.
- admin_level
Level of administrative units to assign to geometries.
- admin_recipe_id
Recipe that provides the admin unit geometries.
Should use a high-resolution in-country boundary dataset for accurate allocation (e.g. building footprints require county-level precision).
Example: 'US_admin-census-2021_admin3' for US counties.
Saving#
- save_to#
Controls where and how output is written.
- data_dir#
Name of the openplaces data directory in which to save output.
Must be one of the registered directory names (e.g. cache, core, out).
Defaults to cache if omitted.
- admin_level
Admin level at which to split output into separate files.
Not needed if splitting is already determined by download_by or process_by.
Example: set to 3 to write one parquet file per county.
Note: When this level is coarser than download_by.admin_level or process_by.admin_level (e.g. save at county=3 while processing by town=4), the ingester automatically aggregates the intermediate files to the coarser level and deletes the intermediates, including any empty directories that remain.
- filename#
Override the auto-generated output filename stem.
The auto-generated name is derived from the reference (admin ID, entity, source, version). Use this when the default is ambiguous — for example, a recipe that produces multiple files for different admin levels, or a dataset that needs a fixed conventional name.
Example: admin2 (used in admin recipes to produce admin-gadm-4~1_admin2.parquet).
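The aggregation described for save_to.admin_level amounts to grouping finer Admin IDs under their coarser parent. The sketch below assumes, based on the admin_id examples near the top of this page, that the number of hyphen-separated parts of an Admin ID equals its admin level; the function names and all Admin IDs other than US-MA-MI-SO are made up.

```python
from collections import defaultdict


def parent_admin_id(admin_id: str, level: int) -> str:
    """Truncate an Admin ID to the given level, e.g. US-MA-MI-SO -> US-MA-MI."""
    parts = admin_id.split("-")
    if level < 1 or level > len(parts):
        raise ValueError(f"cannot reduce {admin_id!r} to level {level}")
    return "-".join(parts[:level])


def group_for_aggregation(admin_ids, level: int) -> dict:
    """Map each coarser Admin ID to the finer IDs whose files it would absorb."""
    groups = defaultdict(list)
    for admin_id in admin_ids:
        groups[parent_admin_id(admin_id, level)].append(admin_id)
    return dict(groups)


# Town-level (4) chunks to be saved at county level (3).
towns = ["US-MA-MI-SO", "US-MA-MI-CA", "US-MA-SU-BO"]
groups = group_for_aggregation(towns, 3)
print(groups)
```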
Additional layers#
- additional_layers#
List of additional tables to ingest from the same source file, each producing a separate output for a different entity type.
For example, a Massachusetts parcel geodatabase contains both a parcel polygon layer and a property attribute table;
additional_layers ingests both in a single recipe.
Each entry in the list is a layer spec dict. The full recipe for each layer is built by merging the primary recipe with the layer spec:
Per-table keys are taken from the layer spec when present; otherwise they are removed from the merged recipe. These include: entity, layer, columns, keep_unnamed_columns, set_index, create_index, index_function, drop, query, null_value_strings, transformations, columns_to_categorical, encoding, save_to, overlay_admin_ids.
Shared keys are always inherited from the primary recipe. These include: admin_id, download_by, compressed_file_name, uncompressed_file_name.
process_by is inherited from the primary recipe unless the layer spec sets it explicitly. Use process_by: null in the layer spec to disable chunking for that table.
additional_layers cannot be nested.
entity is required in every layer spec.
Example: Massachusetts parcels. Primary layer is parcel polygons; additional layer is the property attribute table.
    layer: L3_TAXPAR_POLY
    process_by:
      admin_level: 4
      admin_id_column: TOWN_ID
      admin_id_crosswalk:
        recipe_id: "US-MA_parcel-massgis-2025_admin4-crosswalk"
    columns:
      parcel_id_admin2: LOC_ID
      polygon_type: POLY_TYPE
    additional_layers:
      - layer: L3_ASSESS
        entity:
          entity_type: property
          source:
            source_id: massgis
          version: 2025
        columns:
          property_id_assessor: PROP_ID
          parcel_id_admin2: LOC_ID
          total_value: TOTAL_VAL
        columns_to_categorical:
          - city
          - zip
For the full recipe, see:
src/openplaces/recipes/US/MA/_all/parcel/massgis/2025/US-MA_parcel-massgis-2025.yaml
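The merge rules for additional_layers can be restated as a short function over plain dicts. This is a sketch of the rules as described above rather than the actual openplaces code: the function name is hypothetical, the primary recipe is abbreviated, and its admin_id value is assumed.

```python
PER_TABLE_KEYS = {
    "entity", "layer", "columns", "keep_unnamed_columns", "set_index",
    "create_index", "index_function", "drop", "query", "null_value_strings",
    "transformations", "columns_to_categorical", "encoding", "save_to",
    "overlay_admin_ids",
}


def build_layer_recipe(primary: dict, layer_spec: dict) -> dict:
    """Merge a layer spec into the primary recipe following the rules above."""
    if "additional_layers" in layer_spec:
        raise ValueError("additional_layers cannot be nested")
    if "entity" not in layer_spec:
        raise ValueError("entity is required in every layer spec")
    # Start from the primary recipe without its per-table keys.
    merged = {
        key: value
        for key, value in primary.items()
        if key not in PER_TABLE_KEYS and key != "additional_layers"
    }
    # Per-table keys come only from the layer spec.
    merged.update({k: v for k, v in layer_spec.items() if k in PER_TABLE_KEYS})
    # process_by is inherited unless the spec overrides it; null disables it.
    if "process_by" in layer_spec:
        if layer_spec["process_by"] is None:
            merged.pop("process_by", None)
        else:
            merged["process_by"] = layer_spec["process_by"]
    return merged


layer_spec = {
    "layer": "L3_ASSESS",
    "entity": {"entity_type": "property"},
    "columns": {"property_id_assessor": "PROP_ID", "parcel_id_admin2": "LOC_ID"},
}
primary = {
    "admin_id": "US-MA",  # assumed value for illustration
    "layer": "L3_TAXPAR_POLY",
    "process_by": {"admin_level": 4, "admin_id_column": "TOWN_ID"},
    "columns": {"parcel_id_admin2": "LOC_ID", "polygon_type": "POLY_TYPE"},
    "additional_layers": [layer_spec],
}

merged = build_layer_recipe(primary, layer_spec)
unchunked = build_layer_recipe(primary, {**layer_spec, "process_by": None})
print(merged["layer"], "process_by" in unchunked)
```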