metobs_toolkit.dataset.Dataset.buddy_check

Dataset.buddy_check(target_obstype: str = 'temp', buddy_radius: int | float = 10000, min_sample_size: int = 4, max_alt_diff: int | float | None = None, min_std: int | float = 1.0, std_threshold: int | float = 3.1, N_iter: int = 2, instantanious_tolerance: str | Timedelta = Timedelta('0 days 00:04:00'), lapserate: float | None = None, use_mp: bool = True)

Spatial buddy check.

The buddy check compares an observation against its neighbors (i.e. buddies). The check loops over all groups, where a group is a set of stations within a given radius of each other. For each group, the absolute difference from the group mean, normalized by the standard deviation (with a defined minimum), is computed. If one or more of these values exceed std_threshold, the most extreme observation of that group is labeled as an outlier.

Multiple iterations of this check can be performed by setting N_iter.
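
The statistic described above can be sketched for a single group at a single timestamp. This is a minimal illustration, not the toolkit's implementation: the helper names `chi_values` and `flag_most_extreme` are hypothetical, and the sketch assumes that the tested record is itself included in the group mean and standard deviation and that the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np


def chi_values(values, min_std=1.0):
    """Absolute deviation from the group mean, in units of the (floored) std.

    Hypothetical helper: assumes the record itself contributes to the mean
    and std, and that the sample std (ddof=1) is used.
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    # Floor the std at min_std so that a very homogeneous group does not
    # turn tiny differences into large normalized deviations.
    std = max(np.nanstd(values, ddof=1), min_std)
    return np.abs(values - mean) / std


def flag_most_extreme(values, min_std=1.0, std_threshold=3.1):
    """Return the index of the single most extreme record if it exceeds
    std_threshold, otherwise None."""
    chi = chi_values(values, min_std)
    worst = int(np.nanargmax(chi))
    return worst if chi[worst] > std_threshold else None


# Nineteen stations report 15.0 and one reports 25.0.
group = [15.0] * 19 + [25.0]
print(flag_most_extreme(group))                    # index of the 25.0 record
print(flag_most_extreme([15.0, 15.2, 14.9, 15.1]))  # None: nothing exceeds 3.1
```

Only one record per group and timestamp is flagged in this sketch, which mirrors the "most extreme observation" wording above; a second pass over the remaining records corresponds to a further iteration.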

A schematic step-by-step description of the buddy check:

  1. A distance matrix of all pairwise distances between the stations is constructed, using the haversine approximation.

  2. Groups of buddies (neighbours) are created by using the buddy_radius. These groups are further filtered by:

    1. removing stations from the groups that differ too much in altitude (based on the max_alt_diff)

    2. removing groups of buddies that are too small (based on the min_sample_size)

  3. Observations per group are synchronized in time (using the instantanious_tolerance as tolerance for alignment).

  4. If a lapserate is specified, the observations are corrected for altitude differences.

  5. For each buddy group:

    1. The mean, standard deviation (std), and sample size are computed.

    2. If the std is lower than the minimum std, it is replaced by the minimum std.

    3. Chi values are calculated for all records.

    4. For each timestamp, the record with the highest Chi is tested against std_threshold. If its Chi is larger than std_threshold, that record is flagged as an outlier and is ignored in the next iteration.

    5. This is repeated N_iter times.
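
Steps 1 and 2 above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the toolkit's code: the function names `haversine_matrix` and `buddy_groups` and the Earth radius constant are hypothetical, and the sketch assumes that a station counts as a member of its own group when comparing against min_sample_size.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_matrix(lat_deg, lon_deg):
    """Pairwise great-circle distances (meters) between all stations."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding slightly outside [0, 1].
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def buddy_groups(lat_deg, lon_deg, alt_m, buddy_radius=10000,
                 max_alt_diff=None, min_sample_size=4):
    """Map each station index to the indices in its buddy group.

    Assumption: the station itself is part of its own group.
    """
    dist = haversine_matrix(lat_deg, lon_deg)
    alt = np.asarray(alt_m, dtype=float)
    groups = {}
    for i in range(len(alt)):
        members = dist[i] <= buddy_radius
        if max_alt_diff is not None:
            # Drop stations whose altitude differs too much from station i.
            members &= np.abs(alt - alt[i]) <= max_alt_diff
        idx = np.flatnonzero(members)
        if idx.size >= min_sample_size:  # drop groups that are too small
            groups[i] = idx.tolist()
    return groups


# Four stations roughly 1.1 km apart and one station far away.
lat = [51.00, 51.01, 51.02, 51.03, 52.00]
lon = [3.70, 3.70, 3.70, 3.70, 5.00]
alt = [10.0, 12.0, 11.0, 200.0, 10.0]
print(buddy_groups(lat, lon, alt))
print(buddy_groups(lat, lon, alt, max_alt_diff=50, min_sample_size=3))
```

In the second call the station at 200 m is removed from its neighbors' groups by the altitude filter, and its own group then becomes too small to keep.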

Parameters:
  • target_obstype (str, optional) – The target observation to check. Default is “temp”.

  • buddy_radius (int | float, optional) – The radius to define spatial neighbors in meters. Default is 10000.

  • min_sample_size (int, optional) – The minimum sample size to calculate statistics on. Default is 4.

  • max_alt_diff (int | float | None, optional) – The maximum altitude difference allowed for buddies. Default is None.

  • min_std (int | float, optional) – The minimum standard deviation for sample statistics. Default is 1.0.

  • std_threshold (int | float, optional) – The threshold (std units) for flagging observations as outliers. Default is 3.1.

  • N_iter (int, optional) – The number of iterations to perform the buddy check. Default is 2.

  • instantanious_tolerance (str | pd.Timedelta, optional) – The maximum time difference allowed for synchronizing observations. Default is pd.Timedelta(“4min”).

  • lapserate (int | float | None, optional) – Describes how the obstype changes with altitude (per meter). Default is None.

  • use_mp (bool, optional) – Use multiprocessing to speed up the buddy check. Default is True.
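
To illustrate what a time tolerance such as instantanious_tolerance means in practice, the sketch below aligns two stations with `pandas.merge_asof`. Whether the toolkit synchronizes records this way is an assumption; the station data and column names are made up for the example. It also shows that the string form "4min" and the signature default Timedelta('0 days 00:04:00') denote the same duration.

```python
import pandas as pd

tolerance = pd.Timedelta("4min")
# Same duration as the default shown in the signature.
assert tolerance == pd.Timedelta("0 days 00:04:00")

# Hypothetical records from two stations with slightly different timestamps.
station_a = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:10"]),
    "temp_a": [5.1, 5.3],
})
station_b = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 00:03", "2024-01-01 00:16"]),
    "temp_b": [4.8, 5.0],
})

# Pair each record of station A with the nearest record of station B,
# but only if the two timestamps differ by at most the tolerance.
synced = pd.merge_asof(
    station_a, station_b,
    on="datetime", direction="nearest", tolerance=tolerance,
)
print(synced)
```

The 00:00 record of station A is paired with the 00:03 record of station B (3 minutes apart), while the 00:10 record stays unmatched because the nearest record of station B is 6 minutes away.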

Return type:

None

Notes

  • This method modifies the outliers in place and does not return anything. You can use the outliersdf property to view all flagged outliers.

  • The altitude of the stations can be extracted from GEE by using the Dataset.get_altitude() method.
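
A call might look like the sketch below. It uses only the parameter names from the signature above and the outliersdf property mentioned in the notes; the wrapper function `run_buddy_check` is hypothetical, the `dataset` argument is assumed to be a Dataset that already holds observations and station coordinates, and the value for max_alt_diff is purely illustrative.

```python
def run_buddy_check(dataset):
    """Run the buddy check on an already populated Dataset (hypothetical wrapper).

    The method modifies the dataset in place and returns None, so the
    flagged records are read back from the outliersdf property.
    """
    dataset.buddy_check(
        target_obstype="temp",
        buddy_radius=10000,              # meters
        min_sample_size=4,
        max_alt_diff=200,                # illustrative value, not a recommendation
        min_std=1.0,
        std_threshold=3.1,
        N_iter=2,
        instantanious_tolerance="4min",  # parameter name spelled as in the signature
        lapserate=None,
        use_mp=False,                    # single process, e.g. while debugging
    )
    return dataset.outliersdf
```

Because max_alt_diff is set, the stations need altitude information before the check is run.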