metobs_toolkit.dataset.Dataset.buddy_check_with_LCZ_safety_net

Dataset.buddy_check_with_LCZ_safety_net(target_obstype: str = 'temp', spatial_buddy_radius: int | float = 10000, LCZ_buddy_radius: int | float = 40000, min_sample_size: int = 4, max_alt_diff: int | float | None = None, min_std: int | float = 1.0, spatial_z_threshold: int | float = 3.1, safetynet_z_threshold: int | float = 2.1, N_iter: int = 2, instantaneous_tolerance: str | Timedelta = Timedelta('0 days 00:04:00'), lapserate: float | None = None, use_mp: bool = True)

Spatial buddy check with LCZ safety net.

The buddy check compares an observation against its neighbors (i.e. spatial buddies). The check loops over all groups, where a group consists of stations within a given radius of each other. For each group, the z-value of the reference observation is computed from the sample of spatial buddies. If one or more z-values exceed the spatial_z_threshold, the most extreme (i.e. worst) observation of that group is labeled as an outlier.

Multiple iterations of this check can be performed by setting N_iter.
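As a rough illustration of the iterative flagging described above, the sketch below handles a single buddy group at a single timestamp with plain NumPy. It is not the toolkit's implementation: the function name flag_spatial_outliers and the sample values are hypothetical, and including the reference observation in its own sample is an assumption. Only the parameter names min_std, spatial_z_threshold, and N_iter are taken from the signature.

```python
import numpy as np


def flag_spatial_outliers(values, min_std=1.0, spatial_z_threshold=3.1, N_iter=2):
    """Return the indices flagged as outliers in one buddy group (hypothetical helper)."""
    values = np.asarray(values, dtype=float).copy()
    flagged = []
    for _ in range(N_iter):
        if np.all(np.isnan(values)):
            break  # nothing left to test
        # Outliers from earlier iterations are NaN and therefore ignored here.
        mean = np.nanmean(values)
        std = max(np.nanstd(values), min_std)  # never divide by a tiny std
        z = np.abs(values - mean) / std
        worst = int(np.nanargmax(z))  # only the most extreme record is tested
        if z[worst] > spatial_z_threshold:
            flagged.append(worst)
            values[worst] = np.nan
    return flagged


# Eleven consistent readings and one suspicious one (made-up values).
group = [14.8, 15.1, 15.3, 14.9, 15.0, 15.2, 14.7, 15.1, 15.0, 14.9, 15.2, 30.0]
print(flag_spatial_outliers(group))  # -> [11]
```

Note that with a small sample a single deviating value cannot reach a high z-value, because it inflates the standard deviation of its own sample. This is one reason the sample size matters when choosing spatial_z_threshold.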

The (potential) outliers of each iteration are tested against a second sample. This sample contains the LCZ-buddies: stations with the same LCZ as the reference station that lie within a maximum distance of LCZ_buddy_radius. If max_alt_diff is specified, an altitude-difference filter is applied to these buddies as well. If the test is successful, that is, if the z-value is smaller than the safetynet_z_threshold, the outlier is saved. It is removed from the outliers and passes on to the next iteration or to the end of this function.
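The safety net test can be sketched for a single flagged value as follows. This is a minimal sketch and not the toolkit's code: the function name passes_safety_net and the example values are hypothetical, and treating a sample of exactly min_sample_size buddies as large enough is an assumption.

```python
import numpy as np


def passes_safety_net(value, lcz_buddy_values, min_sample_size=4,
                      min_std=1.0, safetynet_z_threshold=2.1):
    """Return True if a flagged value is 'saved' by its LCZ buddies (hypothetical helper)."""
    sample = np.asarray(lcz_buddy_values, dtype=float)
    sample = sample[~np.isnan(sample)]  # ignore missing buddy observations
    if sample.size < min_sample_size:
        return False  # sample too small: test not applied, value stays an outlier
    std = max(sample.std(), min_std)
    z = abs(value - sample.mean()) / std
    return bool(z < safetynet_z_threshold)


# Made-up example: an urban station that looks too warm next to its rural
# spatial buddies, but agrees with stations in the same LCZ.
lcz_buddies = [19.0, 19.8, 18.7, 19.4, 20.1]
print(passes_safety_net(19.5, lcz_buddies))  # saved by the safety net
print(passes_safety_net(25.0, lcz_buddies))  # remains an outlier
```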

A schematic step-by-step description of the buddy check:

  1. A distance matrix of the pairwise distances between all stations is constructed, using the haversine approximation.

  2. Groups of spatial buddies (neighbours) are created by using the spatial_buddy_radius. These groups are further filtered by:

    • removing stations from the groups that differ too much in altitude (based on the max_alt_diff)

    • removing groups of buddies that are too small (based on the min_sample_size)

  3. Observations per group are synchronized in time (using the instantaneous_tolerance for alignment).

  4. If a lapserate is specified, the observations are corrected for altitude differences.

  5. The following steps are repeated for N_iter iterations:

    1. The values of outliers flagged by a previous iteration are converted to NaN. They are therefore not used in any subsequent score or sample.

    2. For each buddy group:

      • The mean, standard deviation (std), and sample size are computed.

      • If the std is lower than min_std, it is replaced by min_std.

      • Chi values are calculated for all records.

      • For each timestamp, the record with the highest Chi value is tested against spatial_z_threshold. If it is larger, that record is flagged as an outlier and is ignored in the next iteration.

    3. The following steps are applied to the outliers flagged in the current iteration:

      • The size of the LCZ-buddy sample is tested (the sample size must be bigger than min_sample_size). If the condition is not met, the safetynet test is not applied.

      • The safetynet test is applied:

        • The mean and std of the LCZ-buddy sample are computed. If the std is smaller than min_std, the latter is used.

        • The z-value is computed for the target record (= flagged outlier).

        • If the z-value is smaller than safetynet_z_threshold, the tested outlier is “saved”, and is removed from the set of outliers for the current iteration.
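Steps 1 and 2 of the description above (distance matrix and buddy groups) can be sketched with NumPy as shown below. This is an illustration under stated assumptions and not the toolkit's implementation: the helper names haversine_matrix and buddy_groups, the Earth radius constant, and the choice to exclude a station from its own group are all assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed for this sketch)


def haversine_matrix(lat_deg, lon_deg):
    """Pairwise great-circle distances in meters between stations."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding errors pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def buddy_groups(dist, altitude, spatial_buddy_radius=10000,
                 max_alt_diff=None, min_sample_size=4):
    """Map each station index to the indices of its spatial buddies."""
    altitude = np.asarray(altitude, dtype=float)
    groups = {}
    for i in range(dist.shape[0]):
        is_buddy = dist[i] <= spatial_buddy_radius
        is_buddy[i] = False  # assumption: a station is not its own buddy
        if max_alt_diff is not None:
            # Drop buddies that differ too much in altitude.
            is_buddy &= np.abs(altitude - altitude[i]) <= max_alt_diff
        buddies = np.flatnonzero(is_buddy)
        if buddies.size >= min_sample_size:  # drop groups that are too small
            groups[i] = buddies.tolist()
    return groups


# Four made-up stations on the equator; the last one is about 111 km away.
dist = haversine_matrix([0.0, 0.0, 0.0, 0.0], [0.0, 0.01, 0.02, 1.0])
print(buddy_groups(dist, altitude=[10, 12, 200, 15], min_sample_size=2))
```

In this example the isolated fourth station ends up without a group, so none of its observations would be tested by the spatial part of the check.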

Parameters:
  • target_obstype (str, optional) – The target observation to check. Default is “temp”.

  • spatial_buddy_radius (int | float, optional) – The radius to define spatial neighbors in meters. Default is 10000.

  • LCZ_buddy_radius (int | float, optional) – The radius to define LCZ buddies in meters. Default is 40000.

  • min_sample_size (int, optional) – The minimum sample size to calculate statistics on. Used for spatial-buddy samples and LCZ-safetynet samples. Default is 4.

  • max_alt_diff (int | float | None, optional) – The maximum altitude difference allowed for buddies. Default is None.

  • min_std (int | float, optional) – The minimum standard deviation for sample statistics. This is used in spatial and LCZ samples. Default is 1.0.

  • spatial_z_threshold (int | float, optional) – The threshold, tested with z-scores, for flagging observations as outliers. Default is 3.1.

  • safetynet_z_threshold (int | float, optional) – The threshold for a successful safety net test. If the z-value is less than safetynet_z_threshold, the test is successful and the outlier is “saved”. It can proceed as a regular observation in the next iteration. Default is 2.1.

  • N_iter (int, optional) – The number of iterations to perform the buddy check. Default is 2.

  • instantaneous_tolerance (str | pd.Timedelta, optional) – The maximum time difference allowed for synchronizing observations. Default is pd.Timedelta(“4min”).

  • lapserate (float | None, optional) – Describes how the obstype changes with altitude (in meters). Default is None.

  • use_mp (bool, optional) – Use multiprocessing to speed up the buddy check. Default is True.

Return type:

None

Notes

  • This method modifies the outliers in place and does not return anything. You can use the outliersdf property to view all flagged outliers.

  • The altitude of the stations can be extracted from GEE by using the Dataset.get_altitude() method.
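A call with all defaults written out explicitly might look like the sketch below. It assumes that dataset is an existing Dataset that already holds observations, station coordinates, and LCZ information (and altitudes if max_alt_diff or lapserate is used); how to get to that point is not covered here. The wrapper run_buddy_check is a hypothetical helper for this example, not part of the toolkit. The keyword names and defaults are copied from the signature above.

```python
def run_buddy_check(dataset, **overrides):
    """Run the check with the documented defaults, optionally overriding some.

    Hypothetical convenience wrapper; `dataset` is assumed to be a prepared
    metobs_toolkit Dataset.
    """
    settings = dict(
        target_obstype="temp",
        spatial_buddy_radius=10000,   # meters
        LCZ_buddy_radius=40000,
        min_sample_size=4,
        max_alt_diff=None,
        min_std=1.0,
        spatial_z_threshold=3.1,
        safetynet_z_threshold=2.1,
        N_iter=2,
        use_mp=True,
    )
    settings.update(overrides)
    # The method works in place and returns None ...
    dataset.buddy_check_with_LCZ_safety_net(**settings)
    # ... so the flagged records are read from the outliersdf property.
    return dataset.outliersdf


# Usage, given a prepared Dataset called `dataset`:
# outliers = run_buddy_check(dataset, N_iter=3, use_mp=False)
```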