metobs_toolkit.dataset.Dataset.buddy_check

Dataset.buddy_check(obstype: str = 'temp', spatial_buddy_radius: int | float = 10000, min_sample_size: int = 4, max_alt_diff: int | float | None = None, min_std: int | float = 1.0, spatial_z_threshold: int | float = 3.1, N_iter: int = 2, instantaneous_tolerance: str | Timedelta = Timedelta('0 days 00:04:00'), lapserate: float | None = None, whiteset: WhiteSet = WhiteSet(empty), use_mp: bool = True)

Spatial buddy check.

The buddy check compares an observation against its neighbors (i.e. spatial buddies). The check loops over all groups, where a group consists of the stations within a given radius of each other. For each group, the z-value of the reference observation is computed from the sample of its spatial buddies. If one (or more) z-values exceed the spatial_z_threshold, the most extreme observation of that group is labeled as an outlier.

Multiple iterations of this check can be run by setting N_iter.
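As a rough illustration of the scoring idea described above, the following NumPy sketch computes z-values for a single group at a single timestamp. It is not the toolkit's implementation: the helper name buddy_z_scores, the use of the population standard deviation, and the choice to include the tested observation in its own sample are all assumptions made for this example.

```python
import numpy as np


def buddy_z_scores(values, min_std=1.0):
    """Absolute z-values of each observation against the group sample.

    Hypothetical helper: the sample mean and std include every value,
    and the std is floored at min_std (an assumption for this sketch).
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    std = max(np.nanstd(values), min_std)  # avoid inflated z-values in very uniform samples
    return np.abs(values - mean) / std


# Twelve stations reporting temperature at the same moment; the last one is suspicious.
sample = [14.8, 15.1, 15.3, 14.9, 15.0, 15.2, 14.7, 15.1, 15.0, 14.9, 15.2, 24.0]
z = buddy_z_scores(sample, min_std=1.0)

worst = int(np.nanargmax(z))   # index of the most extreme observation
is_outlier = z[worst] > 3.1    # compare against spatial_z_threshold
```

Only the most extreme observation is compared against the threshold, which mirrors the "most extreme observation of that group" rule in the description.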

A schematic step-by-step description of the buddy check:

  1. A matrix of pairwise distances between all stations is constructed, using the haversine approximation.

  2. Groups of spatial buddies (neighbours) are created by using the spatial_buddy_radius. These groups are further filtered by:

    • removing stations from the groups that differ too much in altitude (based on the max_alt_diff)

    • removing groups of buddies that are too small (based on the min_sample_size)

  3. Observations per group are synchronized in time (using the instantaneous_tolerance for alignment).

  4. If a lapserate is specified, the observations are corrected for altitude differences.

  5. The following steps are repeated for N_iter iterations:

    1. The values of outliers flagged by a previous iteration are converted to NaN, so they are not used in any subsequent score or sample.

    2. For each buddy group:

      • The mean, standard deviation (std), and sample size are computed.

      • If the std is lower than min_std, it is replaced by min_std.

      • Chi values are calculated for all records.

      • For each timestamp, the record with the highest Chi value is tested against spatial_z_threshold. If it is larger than the threshold, that record is flagged as an outlier and is ignored in the next iteration.

    3. If whiteset is provided, any outliers that match the white-listed timestamps are removed from the outlier set for the current iteration. White-listed records participate in all buddy check calculations but are not flagged as outliers in the final results.
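The steps above can be sketched in plain NumPy. This is a simplified illustration, not the toolkit's code. All function names are hypothetical, the observations are assumed to be already synchronized in time, the lapse-rate correction and the whiteset handling are left out, and the Chi value is assumed to be the absolute deviation from the group mean divided by the (floored) standard deviation.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used by the haversine formula


def haversine_matrix(lat_deg, lon_deg):
    """Step 1: pairwise great-circle distances (meters) between stations."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def buddy_groups(dist, alt, radius, max_alt_diff=None, min_sample_size=4):
    """Step 2: map each station index to the indices of its buddies.

    Assumption: a station counts as a member of its own group.
    """
    alt = np.asarray(alt, dtype=float)
    groups = {}
    for i in range(dist.shape[0]):
        mask = dist[i] <= radius
        if max_alt_diff is not None:
            mask &= np.abs(alt - alt[i]) <= max_alt_diff
        members = np.flatnonzero(mask)
        if members.size >= min_sample_size:  # drop groups that are too small
            groups[i] = members
    return groups


def buddy_check_sketch(obs, groups, min_std=1.0, z_threshold=3.1, n_iter=2):
    """Step 5: iterative flagging on a (timestamps x stations) array."""
    obs = np.array(obs, dtype=float)  # work on a copy
    flagged = np.zeros(obs.shape, dtype=bool)
    for _ in range(n_iter):
        obs[flagged] = np.nan  # earlier outliers no longer contribute
        new_flags = np.zeros_like(flagged)
        for members in groups.values():
            sample = obs[:, members]
            mean = np.nanmean(sample, axis=1, keepdims=True)
            std = np.maximum(np.nanstd(sample, axis=1, keepdims=True), min_std)
            chi = np.abs(sample - mean) / std
            chi = np.where(np.isnan(chi), -np.inf, chi)  # NaN records can never be the worst
            worst = np.argmax(chi, axis=1)  # most extreme record per timestamp
            rows = np.flatnonzero(chi[np.arange(len(chi)), worst] > z_threshold)
            new_flags[rows, members[worst[rows]]] = True
        flagged |= new_flags
    return flagged
```

Because flagged values are set to NaN before the next pass, a second iteration can reveal outliers that were masked by a larger one in the first pass.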

Parameters:
  • obstype (str, optional) – The target observation to check. Default is “temp”.

  • spatial_buddy_radius (int | float, optional) – The radius to define spatial neighbors in meters. Default is 10000.

  • min_sample_size (int, optional) – The minimum sample size to calculate statistics on. Default is 4.

  • max_alt_diff (int | float | None, optional) – The maximum altitude difference allowed for buddies. Default is None.

  • min_std (int | float, optional) – The minimum standard deviation for sample statistics. Default is 1.0.

  • spatial_z_threshold (int | float, optional) – The threshold (std units) for flagging observations as outliers. Default is 3.1.

  • N_iter (int, optional) – The number of iterations to perform the buddy check. Default is 2.

  • instantaneous_tolerance (str | pd.Timedelta, optional) – The maximum time difference allowed for synchronizing observations. Default is pd.Timedelta(“4min”).

  • lapserate (int | float | None, optional) – Describes how the obstype changes with altitude (in meters). Default is None.

  • whiteset (WhiteSet, optional) – A WhiteSet instance containing timestamps that should be excluded from outlier detection. The WhiteSet is used to create station-specific and obstype-specific whitelists before applying the buddy check. White-listed records participate in all buddy check iterations as regular records but are not flagged as outliers in the final results. The default is an empty WhiteSet().

  • use_mp (bool, optional) – Use multiprocessing to speed up the buddy check. Default is True.

Return type:

None

Notes

  • This method modifies the outliers in place and does not return anything. You can use the outliersdf property to view all flagged outliers.

  • The altitude of the stations can be extracted from GEE by using the Dataset.get_altitude() method.

  • White-listed records from the WhiteSet participate in all buddy check calculations but are not flagged as outliers in the final results.
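A minimal usage sketch, using only the parameters from the signature above and the outliersdf property mentioned in the notes. The wrapper function run_buddy_check is hypothetical, the dataset argument is assumed to be an already populated Dataset with station coordinates and altitudes, and the argument values (including the lapse rate of -0.0065 per meter) are illustrative choices rather than recommendations.

```python
def run_buddy_check(dataset):
    """Run a buddy check on an existing Dataset and return the flagged outliers.

    `dataset` is assumed to be a populated metobs_toolkit Dataset.
    """
    dataset.buddy_check(
        obstype="temp",
        spatial_buddy_radius=15000,      # meters
        min_sample_size=5,
        max_alt_diff=150,                # meters, requires station altitudes
        min_std=1.0,
        spatial_z_threshold=3.1,
        N_iter=2,
        instantaneous_tolerance="4min",
        lapserate=-0.0065,               # illustrative value, obstype units per meter
        use_mp=False,                    # single process, simpler to debug
    )
    # buddy_check returns None; the results are read from the dataset itself.
    return dataset.outliersdf
```

Since the method modifies the outliers in place, calling it repeatedly on the same Dataset may not be equivalent to a single call with a higher N_iter.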