metobs_toolkit.dataset.Dataset.buddy_check_with_safetynets
- Dataset.buddy_check_with_safetynets(obstype: str = 'temp', spatial_buddy_radius: int | float = 10000, safety_net_configs: List[Dict] = None, min_sample_size: int = 4, max_sample_size: int | None = None, max_alt_diff: int | float | None = None, min_sample_spread: int | float = 1.0, min_buddy_distance: int | float = 0.0, spatial_z_threshold: int | float = 3.1, N_iter: int = 2, instantaneous_tolerance: str | Timedelta = Timedelta('0 days 00:04:00'), lapserate: float | None = None, whiteset: WhiteSet = WhiteSet(empty), use_z_robust_method: bool = True, use_mp: bool = True, min_std=None)
Spatial buddy check with configurable safety nets.
The buddy check compares an observation against its neighbors (i.e. spatial buddies). The check loops over all stations, treating each as the center of a buddy group formed by nearby stations. For each center station, the z-score is computed from the buddy sample. If the z-score exceeds spatial_z_threshold, the center station’s observation is labeled as an outlier.
Multiple iterations of this check can be done using N_iter.
Optionally, one or more safety nets can be applied. A safety net tests potential outliers against a sample of stations that share a categorical attribute (e.g., LCZ, network). If the z-value computed using the safety net sample is below the specified threshold, the outlier is “saved” and removed from the outlier set for the current iteration.
Safety nets are applied in the order they are specified in safety_net_configs, allowing for multi-level filtering (e.g., first test against LCZ buddies, then against network buddies).
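The ordering of safety nets can be illustrated with a minimal sketch. It is not the library's implementation: `apply_safety_nets` and the precomputed `z_values` mapping are hypothetical, z-values are assumed to be non-negative, and `None` stands for "too few category buddies to run the test". With more than two nets, the handling of records that were skipped by a cascading net is simplified here.

```python
from typing import Dict, List, Optional, Set


def apply_safety_nets(
    outliers: Set[str],
    safety_net_configs: List[dict],
    z_values: Dict[str, Dict[str, Optional[float]]],
) -> Set[str]:
    """Return the outliers that are still flagged after all safety nets.

    z_values[category][record] is an assumed, precomputed z-value of the
    record against its category buddies (None = insufficient buddies).
    """
    remaining = set(outliers)
    untested_by_previous: Set[str] = set()
    for config in safety_net_configs:
        candidates = set(remaining)
        if config.get("only_if_previous_had_no_buddies", False):
            # Cascading fallback: only records the previous net could not test.
            candidates &= untested_by_previous
        untested: Set[str] = set()
        for record in candidates:
            z = z_values[config["category"]].get(record)
            if z is None:
                untested.add(record)  # test not applied, record stays flagged
            elif z < config["z_threshold"]:
                remaining.discard(record)  # the outlier is "saved"
        untested_by_previous = untested
    return remaining


nets = [
    {"category": "LCZ", "buddy_radius": 40000, "z_threshold": 2.1, "min_sample_size": 4},
    {"category": "network", "buddy_radius": 50000, "z_threshold": 2.5, "min_sample_size": 3,
     "only_if_previous_had_no_buddies": True},
]
z_values = {
    "LCZ": {"A": 1.0, "B": 3.0, "C": None},
    "network": {"B": 1.0, "C": 2.0},
}
remaining = apply_safety_nets({"A", "B", "C"}, nets, z_values)
```

In this toy input, record A is saved by the LCZ net, C is saved by the network net because the LCZ net had no buddies for it, and B stays flagged because the cascading network net does not re-test records that the LCZ net already tested.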
A schematic step-by-step description of the buddy check:
A distance matrix holding all pairwise distances between the stations is constructed, using the haversine (great-circle) approximation.
Groups of spatial buddies (neighbours) are created by using the spatial_buddy_radius and min_buddy_distance. Only stations within the distance range [min_buddy_distance, spatial_buddy_radius] are considered as buddies. These groups are further filtered by:
removing stations from the groups that differ too much in altitude (based on the max_alt_diff)
removing groups of buddies that are too small (based on the min_sample_size)
Observations per group are synchronized in time (using the instantaneous_tolerance for alignment).
If a lapserate is specified, the observations are corrected for altitude differences.
The following steps are repeated for N_iter iterations:
The values of outliers flagged in a previous iteration are set to NaN, so they are excluded from all subsequent scores and samples.
For each center station:
The sample center (mean, or median when use_z_robust_method is True), spread (std or MAD depending on use_z_robust_method), and sample size are computed from the buddy stations (center station excluded).
If the spread is lower than min_sample_spread, it is replaced by min_sample_spread.
The z-score of the center station is calculated.
If the z-score exceeds spatial_z_threshold, the center station’s observation is flagged as an outlier. It will be ignored in the next iteration.
For each safety net in safety_net_configs (in order):
If only_if_previous_had_no_buddies is True for this safety net, only outlier records where the previous safety net had insufficient buddies are passed to this safety net. All other records retain their status from the previous safety net.
Category buddies (stations sharing the same category value within the specified distance range) are identified. Like spatial buddies, category buddies are filtered by distance range [min_buddy_distance, buddy_radius].
The size of the category-buddy sample is checked (it must be at least the safety net’s min_sample_size). If the condition is not met, the safety net test is not applied.
The safety net test is applied:
The sample center (mean, or median when use_z_robust_method is True) and spread (std or MAD depending on use_z_robust_method) are computed from the category-buddy sample. If the spread is smaller than min_sample_spread, the latter is used.
The z-value is computed for the target record (= flagged outlier).
If the z-value is smaller than the safety net’s z_threshold, the tested outlier is “saved” and removed from the set of outliers for the current iteration.
If whiteset is provided, any outliers that match the white-listed timestamps are removed from the outlier set for the current iteration. White-listed records participate in all buddy check and safety net calculations but are not flagged as outliers in the final results.
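The z-score step for a single center station can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `center_z_score` is a hypothetical helper, the MAD is used unscaled, the classic spread is the population standard deviation, and the z-score is taken as an absolute value.

```python
import math
import statistics


def center_z_score(center_value, buddy_values, min_sample_spread=1.0, robust=True):
    """Z-score of a center station against its buddy sample (hypothetical helper)."""
    # Values flagged in an earlier iteration are NaN and are left out.
    sample = [v for v in buddy_values if not math.isnan(v)]
    if robust:
        center = statistics.median(sample)
        # Assumption: raw median absolute deviation, without a scaling constant.
        spread = statistics.median([abs(v - center) for v in sample])
    else:
        center = statistics.mean(sample)
        # Assumption: population standard deviation.
        spread = statistics.pstdev(sample)
    # A spread below min_sample_spread is replaced by min_sample_spread.
    spread = max(spread, min_sample_spread)
    return abs(center_value - center) / spread


buddies = [14.0, 15.0, 15.5, 16.0]
z_robust = center_z_score(25.0, buddies, robust=True)
z_classic = center_z_score(25.0, buddies, robust=False)
is_outlier = z_robust > 3.1  # compare against spatial_z_threshold
```

Here both spreads fall below the default min_sample_spread of 1.0, so the floor determines the denominator. This shows how min_sample_spread prevents a very homogeneous buddy sample from producing inflated z-scores.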
- Parameters:
obstype (str, optional) – The target observation to check. Default is “temp”.
spatial_buddy_radius (int or float, optional) – The radius to define spatial neighbors in meters. Default is 10000.
safety_net_configs (list of dict, optional) –
List of safety net configurations to apply in order. Each dict must contain:
’category’: str, the metadata column name to group by (e.g., ‘LCZ’, ‘network’)
’buddy_radius’: int or float, maximum distance for category buddies (in meters)
’z_threshold’: int or float, z-value threshold for saving outliers
’min_sample_size’: int, minimum number of buddies required for the safety net test
’min_buddy_distance’: int or float (optional), minimum distance (in meters) required between a station and its category buddies. Stations closer than this distance will be excluded from the buddy group. Defaults to 0 (no minimum distance).
’max_sample_size’: int or None (optional), maximum number of category buddies to use per station. If not None, category buddies are sorted by distance and only the nearest max_sample_size are kept. Must be larger than min_sample_size when specified. Defaults to None (no limit).
’only_if_previous_had_no_buddies’: bool (optional), if True this safety net is only applied to outlier records for which the previous safety net could not be executed due to insufficient buddies. Records that were successfully tested by the previous safety net (passed or failed) are not re-tested. This enables a cascading fallback strategy. Cannot be True for the first safety net. Defaults to False.
Example:
safety_net_configs = [
    {
        "category": "LCZ",
        "buddy_radius": 40000,
        "z_threshold": 2.1,
        "min_sample_size": 4
    },
    {
        "category": "network",
        "buddy_radius": 50000,
        "z_threshold": 2.5,
        "min_sample_size": 3,
        "only_if_previous_had_no_buddies": True
    }
]
The default is None.
min_sample_size (int, optional) – The minimum sample size to calculate statistics on. Used for spatial-buddy samples. Default is 4.
max_sample_size (int or None, optional) – The maximum number of spatial buddies to use per station. If not None, the spatial buddies for each station are sorted by distance and only the nearest max_sample_size are kept. Must be larger than min_sample_size when specified. The default is None (no limit).
max_alt_diff (int or float or None, optional) – The maximum altitude difference allowed for buddies. Default is None.
min_sample_spread (int or float, optional) – The minimum sample spread for sample statistics. When use_z_robust_method is True, this is the minimum MAD to use (avoids division by near-zero). When use_z_robust_method is False, this is the minimum standard deviation. This parameter helps to represent the accuracy of the observations. It is used in both spatial and safety net samples. Default is 1.0.
min_buddy_distance (int or float, optional) – The minimum distance (in meters) required between a station and its spatial buddies. Stations closer than this distance will be excluded from the buddy group. This also affects safety net buddy selection unless overridden in the safety_net_configs. Default is 0.0 (no minimum distance).
spatial_z_threshold (int or float, optional) – The z-score threshold for flagging observations as outliers. Default is 3.1.
N_iter (int, optional) – The number of iterations to perform the buddy check. Default is 2.
instantaneous_tolerance (str or pd.Timedelta, optional) – The maximum time difference allowed for synchronizing observations. Default is pd.Timedelta(“4min”).
lapserate (float or None, optional) – The rate at which the obstype changes with altitude (per meter). If None, no altitude correction is applied. Default is None.
whiteset (WhiteSet, optional) – A WhiteSet instance containing timestamps that should be excluded from outlier detection. The WhiteSet is used to create station-specific and obstype-specific whitelists before applying the buddy check. White-listed records participate in all buddy check and safety net iterations as regular records but are not flagged as outliers in the final results. The default is an empty WhiteSet().
use_z_robust_method (bool, optional) – If True, the robust z-score method (median/MAD) is used. If False, the classic z-score method (mean/std) is used. Default is True.
use_mp (bool, optional) – Use multiprocessing to speed up the buddy check. Default is True.
- Return type:
None
Notes
This method modifies the outliers in place and does not return anything. You can use the outliersdf property to view all flagged outliers.
The altitude of the stations can be extracted from GEE by using the Dataset.get_altitude() method.
The LCZ of the stations can be extracted from GEE by using the Dataset.get_LCZ() method.
White-listed records participate in all buddy check and safety net calculations but are not flagged as outliers in the final results.
See also
buddy_check: Buddy check without safety nets.
Examples
Apply buddy check with an LCZ safety net:
>>> dataset.buddy_check_with_safetynets(
...     obstype="temp",
...     safety_net_configs=[
...         {"category": "LCZ", "buddy_radius": 40000, "z_threshold": 2.1, "min_sample_size": 4}
...     ]
... )
Apply buddy check with multiple safety nets (LCZ first, then network):
>>> dataset.buddy_check_with_safetynets(
...     obstype="temp",
...     safety_net_configs=[
...         {"category": "LCZ", "buddy_radius": 40000, "z_threshold": 2.1, "min_sample_size": 4},
...         {"category": "network", "buddy_radius": 50000, "z_threshold": 2.5, "min_sample_size": 3}
...     ]
... )
Apply cascading safety nets where the second safety net only tests records that had insufficient buddies in the first:
>>> dataset.buddy_check_with_safetynets(
...     obstype="temp",
...     safety_net_configs=[
...         {"category": "LCZ", "buddy_radius": 40000, "z_threshold": 2.1, "min_sample_size": 4},
...         {"category": "network", "buddy_radius": 50000, "z_threshold": 2.5, "min_sample_size": 3,
...          "only_if_previous_had_no_buddies": True}
...     ]
... )
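The documented constraints on safety_net_configs can be checked before calling the method. The sketch below assumes a hypothetical helper, `check_safety_net_configs`, that is not part of metobs_toolkit. It only encodes the rules stated in the parameter description, and the library's own validation may differ.

```python
REQUIRED_KEYS = ("category", "buddy_radius", "z_threshold", "min_sample_size")


def check_safety_net_configs(configs):
    """Raise ValueError when a config breaks a documented rule (hypothetical helper)."""
    for i, cfg in enumerate(configs):
        missing = [key for key in REQUIRED_KEYS if key not in cfg]
        if missing:
            raise ValueError(f"safety net {i}: missing required keys {missing}")
        max_size = cfg.get("max_sample_size")
        if max_size is not None and max_size <= cfg["min_sample_size"]:
            raise ValueError(
                f"safety net {i}: max_sample_size must be larger than min_sample_size"
            )
        if i == 0 and cfg.get("only_if_previous_had_no_buddies", False):
            raise ValueError(
                "only_if_previous_had_no_buddies cannot be True for the first safety net"
            )


configs = [
    {"category": "LCZ", "buddy_radius": 40000, "z_threshold": 2.1, "min_sample_size": 4},
    {"category": "network", "buddy_radius": 50000, "z_threshold": 2.5, "min_sample_size": 3,
     "only_if_previous_had_no_buddies": True},
]
check_safety_net_configs(configs)  # passes silently for a valid cascade
```

Reversing the two dicts would raise a ValueError, because the cascading flag would then sit on the first safety net.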
Changed in version 1.1.0: Breaking changes compared to v1.0.x:
Outlier-selection logic revised. The previous implementation evaluated groups of stations and flagged the single worst observation per group. The new implementation evaluates every station individually as the center of its buddy group, so each station is independently tested against its neighbors. This means the same dataset may produce a different set of outliers than before.
Default z-score method changed to robust (median / MAD). Statistics are now computed with the median and Median Absolute Deviation (MAD) by default (use_z_robust_method=True). To reproduce the previous mean / std behaviour, set use_z_robust_method=False.
min_std renamed to min_sample_spread. Passing min_std now raises a DeprecationWarning.
New parameter min_buddy_distance (default 0.0): excludes stations that are closer than this distance (in metres) from both spatial and safety-net buddy groups.
New parameter max_sample_size (default None): caps the number of spatial buddies per station to the nearest N stations.
New parameter use_z_robust_method: selects between the robust (median/MAD) and classic (mean/std) z-score. Set to False to reproduce the previous mean/std behaviour.
Safety net configs extended. Each safety net dict now accepts three additional optional keys: min_buddy_distance (minimum distance for category buddies), max_sample_size (caps the number of category buddies), and only_if_previous_had_no_buddies (cascading fallback: this safety net is only applied to records for which the preceding safety net had too few buddies).