metobs_toolkit.dataset.Dataset.buddy_check_with_LCZ_safety_net
- Dataset.buddy_check_with_LCZ_safety_net(target_obstype: str = 'temp', spatial_buddy_radius: int | float = 10000, LCZ_buddy_radius: int | float = 40000, min_sample_size: int = 4, max_alt_diff: int | float | None = None, min_std: int | float = 1.0, spatial_z_threshold: int | float = 3.1, safetynet_z_threshold: int | float = 2.1, N_iter: int = 2, instantaneous_tolerance: str | Timedelta = Timedelta('0 days 00:04:00'), lapserate: float | None = None, use_mp: bool = True)
Spatial buddy check with LCZ safety net.
The buddy check compares an observation against its neighbors (i.e. spatial buddies). The check loops over all groups, where a group consists of the stations within a given radius of each other. For each group, the z-value of the reference observation is computed from the sample of spatial buddies. If one or more z-values exceed the spatial_z_threshold, the most extreme (i.e. worst) observation of that group is labeled as an outlier.
Multiple iterations of this check can be performed by setting N_iter.
The (potential) outliers of each iteration are then tested against a second sample. This sample contains the LCZ buddies: stations with the same LCZ (Local Climate Zone) as the reference station that lie within a distance of LCZ_buddy_radius. If max_alt_diff is specified, an altitude-difference filter is applied to these buddies as well. If the test is successful, that is, if the z-value is smaller than the safetynet_z_threshold, the outlier is saved: it is removed from the outliers and passes on to the next iteration or to the end of this function.
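As a rough illustration of the two-stage logic described above, the sketch below tests a single observation at a single timestamp: first against its spatial buddies, then (only if flagged) against its LCZ buddies. It is not the toolkit's implementation. The helper names `z_value` and `is_outlier` are hypothetical, and the sketch assumes that the z-value is taken in absolute terms, that the samples exclude the tested observation, and that a sample of exactly min_sample_size values is large enough.

```python
from statistics import mean, stdev


def z_value(value, sample, min_std=1.0):
    """Absolute z-value of `value` against `sample`, with a floor on the std.

    Hypothetical helper; taking the absolute value is an assumption.
    """
    std = max(stdev(sample), min_std)  # std below min_std is replaced by min_std
    return abs(value - mean(sample)) / std


def is_outlier(value, spatial_buddies, lcz_buddies,
               spatial_z_threshold=3.1, safetynet_z_threshold=2.1,
               min_sample_size=4, min_std=1.0):
    """Return True if `value` stays flagged after the LCZ safety net."""
    # Too few spatial buddies: no statistics, so nothing is flagged.
    if len(spatial_buddies) < min_sample_size:
        return False
    # Stage 1: spatial buddy test.
    if z_value(value, spatial_buddies, min_std) <= spatial_z_threshold:
        return False
    # Stage 2: safety net, applied only when the LCZ sample is large enough.
    if len(lcz_buddies) >= min_sample_size:
        if z_value(value, lcz_buddies, min_std) < safetynet_z_threshold:
            return False  # the outlier is "saved"
    return True


# Illustrative (made-up) temperatures: a warm urban station among cooler neighbors.
rural_neighbors = [20.0, 20.5, 19.5, 20.2, 19.8]
urban_lcz_buddies = [24.0, 25.5, 24.5, 26.0]
saved = not is_outlier(25.0, rural_neighbors, urban_lcz_buddies)
```

In this made-up example the urban reading of 25.0 is flagged by the spatial test, but it agrees with stations of the same LCZ, so the safety net saves it and `saved` is True.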
A schematic step-by-step description of the buddy check:
1. A distance matrix is constructed for all interdistances between the stations. This is done using the haversine approximation.
2. Groups of spatial buddies (neighbours) are created by using the spatial_buddy_radius. These groups are further filtered by:
   - removing stations from the groups that differ too much in altitude (based on the max_alt_diff)
   - removing groups of buddies that are too small (based on the min_sample_size)
3. Observations per group are synchronized in time (using the instantaneous_tolerance for alignment).
4. If a lapserate is specified, the observations are corrected for altitude differences.
5. The following steps are repeated for N_iter iterations:
   1. The values of outliers flagged by a previous iteration are converted to NaNs, so they are not used in any subsequent score or sample.
   2. For each buddy group:
      - The mean, standard deviation (std), and sample size are computed.
      - If the std is lower than min_std, it is replaced by min_std.
      - Chi values are calculated for all records.
      - For each timestamp, the record with the highest Chi is tested against spatial_z_threshold. If it is larger than the threshold, that record is flagged as an outlier. It will be ignored in the next iteration.
   3. The following steps are applied to the outliers flagged by the current iteration:
      - The size of the LCZ-buddy sample is tested (the sample size must be larger than min_sample_size). If this condition is not met, the safetynet test is not applied.
      - The safetynet test is applied:
        - The mean and std of the LCZ-buddy sample are computed. If the std is smaller than min_std, then the latter is used.
        - The z-value is computed for the target record (= flagged outlier).
        - If the z-value is smaller than safetynet_z_threshold, the tested outlier is “saved”, and is removed from the set of outliers for the current iteration.
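The distance-matrix and buddy-grouping steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the toolkit's code: the function names `haversine_m` and `buddy_groups`, the station dictionary layout, and the Earth radius constant are all hypothetical, and a group of exactly min_sample_size buddies is assumed to be large enough.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumption of this sketch)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def buddy_groups(stations, radius=10000, max_alt_diff=None, min_sample_size=4):
    """Map each station name to its list of buddies.

    `stations` maps a name to (latitude, longitude, altitude in meters).
    """
    groups = {}
    for ref, (lat, lon, alt) in stations.items():
        buddies = []
        for other, (olat, olon, oalt) in stations.items():
            if other == ref:
                continue
            if haversine_m(lat, lon, olat, olon) > radius:
                continue  # outside the buddy radius
            if max_alt_diff is not None and abs(alt - oalt) > max_alt_diff:
                continue  # altitude differs too much
            buddies.append(other)
        if len(buddies) >= min_sample_size:  # drop groups that are too small
            groups[ref] = buddies
    return groups


# Made-up stations: four close together, one on a hill, one far away.
stations = {
    "A": (51.05, 3.72, 10),
    "B": (51.06, 3.72, 12),
    "C": (51.04, 3.72, 8),
    "D": (51.05, 3.73, 11),
    "E": (51.055, 3.725, 150),
    "F": (52.0, 4.5, 5),
}
groups = buddy_groups(stations, radius=10000, max_alt_diff=50, min_sample_size=3)
```

With these made-up coordinates, station E is excluded by the altitude filter and station F has no buddies within the radius, so neither of them forms a group.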
- Parameters:
target_obstype (str, optional) – The target observation to check. Default is “temp”.
spatial_buddy_radius (int | float, optional) – The radius to define spatial neighbors in meters. Default is 10000.
LCZ_buddy_radius (int | float, optional) – The maximum distance between the reference station and its LCZ buddies, in the same units as spatial_buddy_radius. Default is 40000.
min_sample_size (int, optional) – The minimum sample size to calculate statistics on. Used for both spatial-buddy samples and LCZ-safetynet samples. Default is 4.
max_alt_diff (int | float | None, optional) – The maximum altitude difference allowed for buddies. Default is None.
min_std (int | float, optional) – The minimum standard deviation for sample statistics. This is used in spatial and LCZ samples. Default is 1.0.
spatial_z_threshold (int | float, optional) – The threshold, tested with z-scores, for flagging observations as outliers. Default is 3.1.
safetynet_z_threshold (int | float, optional) – The threshold for a successful safety net test. If the z-value is less than safetynet_z_threshold, the test is successful and the outlier is “saved”. It can then proceed as a regular observation in the next iteration. Default is 2.1.
N_iter (int, optional) – The number of iterations to perform the buddy check. Default is 2.
instantaneous_tolerance (str | pd.Timedelta, optional) – The maximum time difference allowed for synchronizing observations. Default is pd.Timedelta(“4min”).
lapserate (float | None, optional) – Describes how the obstype changes with altitude (in meters). Default is None.
use_mp (bool, optional) – Use multiprocessing to speed up the buddy check. Default is True.
- Return type:
None
Notes
This method modifies the outliers in place and does not return anything. You can use the outliersdf property to view all flagged outliers.
The altitude of the stations can be extracted from GEE by using the Dataset.get_altitude() method.
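A minimal usage sketch, based only on the signature and notes above. It assumes that `dataset` is an already populated Dataset whose stations have coordinates, LCZ, and (because max_alt_diff is set) altitude metadata. The wrapper name `flag_with_buddy_check` and the max_alt_diff value of 100 are illustrative choices, not recommendations.

```python
def flag_with_buddy_check(dataset, obstype="temp"):
    """Run the buddy check with LCZ safety net and return the flagged outliers.

    Hypothetical wrapper: `dataset` is assumed to be a populated Dataset
    that already holds altitude and LCZ metadata for its stations.
    """
    dataset.buddy_check_with_LCZ_safety_net(
        target_obstype=obstype,
        spatial_buddy_radius=10000,   # documented default
        LCZ_buddy_radius=40000,       # documented default
        min_sample_size=4,
        max_alt_diff=100,             # illustrative value; the default is None
        spatial_z_threshold=3.1,
        safetynet_z_threshold=2.1,
        N_iter=2,
        use_mp=False,                 # single process, e.g. for easier debugging
    )
    # The method returns None; flagged records are exposed via outliersdf.
    return dataset.outliersdf
```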