Using WhiteSet for Quality Control#
The WhiteSet class allows you to protect specific observations from being flagged as outliers during quality control checks. Whitelisted records participate in all QC calculations (means, standard deviations, etc.) but are protected from being flagged as outliers in the final results.
This is useful when you know certain records are valid despite appearing anomalous, such as:
Known extreme weather events
Validated station-specific conditions
Records that have been manually verified
This notebook demonstrates how to use WhiteSet with the quality control methods of the Dataset and Station classes.
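The idea that whitelisted records still enter the statistics but are left out of the final flags can be illustrated with a small, toolkit-independent sketch. The function `zscore_flags` and the `readings` dictionary below are hypothetical and only serve as illustration; this is not how metobs_toolkit implements its checks.

```python
from statistics import mean, stdev


def zscore_flags(values, threshold, whitelist=frozenset()):
    """Return the keys whose value lies more than `threshold` standard
    deviations from the mean, minus any whitelisted keys."""
    # Every value, whitelisted or not, enters the statistics.
    mu = mean(values.values())
    sigma = stdev(values.values())
    flagged = {key for key, v in values.items() if abs(v - mu) > threshold * sigma}
    # The whitelist is applied only when the flags are collected.
    return flagged - set(whitelist)


# Hypothetical hourly temperatures with one extreme value at "t4".
readings = {
    "t0": 15.0, "t1": 15.5, "t2": 14.8, "t3": 15.2,
    "t4": 30.0, "t5": 15.1, "t6": 14.9, "t7": 15.3,
}

without_whitelist = zscore_flags(readings, threshold=2.0)
with_whitelist = zscore_flags(readings, threshold=2.0, whitelist={"t4"})
print(without_whitelist)
print(with_whitelist)
```

In this toy case the extreme value is flagged in the first call and protected in the second, while the mean and standard deviation are identical in both calls.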
Import and Setup#
First, import the toolkit and load demo data.
[1]:
import metobs_toolkit
import pandas as pd
import copy
# Import demo dataset
dataset = metobs_toolkit.Dataset()
dataset.import_data_from_file(
    input_data_file=metobs_toolkit.demo_datafile,
    input_metadata_file=metobs_toolkit.demo_metadatafile,
    template_file=metobs_toolkit.demo_template,
)
# Resample to hourly for consistent QC
dataset.resample(target_freq='1h')
print(f"Dataset contains {len(dataset.stations)} stations")
Luchtdruk is present in the datafile, but not found in the template! This column will be ignored.
Neerslagintensiteit is present in the datafile, but not found in the template! This column will be ignored.
Neerslagsom is present in the datafile, but not found in the template! This column will be ignored.
Rukwind is present in the datafile, but not found in the template! This column will be ignored.
Luchtdruk_Zeeniveau is present in the datafile, but not found in the template! This column will be ignored.
Globe Temperatuur is present in the datafile, but not found in the template! This column will be ignored.
The following columns are present in the data file, but not in the template! They are skipped!
['Rukwind', 'Neerslagintensiteit', 'Luchtdruk_Zeeniveau', 'Luchtdruk', 'Neerslagsom', 'Globe Temperatuur']
The following columns are found in the metadata, but not in the template and are therefore ignored:
['stad', 'Network', 'benaming', 'sponsor']
The present gaps are removed, new gaps are constructed for humidity data of station vlinder02..
The present gaps are removed, new gaps are constructed for wind_speed data of station vlinder02..
The present gaps are removed, new gaps are constructed for wind_direction data of station vlinder02..
The present gaps are removed, new gaps are constructed for temp data of station vlinder02..
Dataset contains 28 stations
Baseline QC without WhiteSet#
Let’s run a gross value check to identify outliers without any whitelisting.
[2]:
baseline_dataset = copy.deepcopy(dataset)
# Apply gross value check without whitelisting
baseline_dataset.gross_value_check(
    obstype='temp',
    lower_threshold=10.0,
    upper_threshold=20.0,
    use_mp=False,
)
# View outliers
baseline_dataset.make_plot(obstype='temp', colorby='label', title='Gross Value Check without Whitelisting')
[2]:
<Axes: title={'center': 'Gross Value Check without Whitelisting'}, xlabel='Timestamps (in UTC)', ylabel='temp (degree_Celsius)'>
Creating a WhiteSet#
A WhiteSet is created from a pandas Index or MultiIndex with one or more of these levels:
datetime: Specific timestamps to whitelist
name: Station names
obstype: Observation types
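As a rough sketch, the level combinations listed above can be built with plain pandas before they are handed to WhiteSet. The station names and timestamps are taken from the demo data shown in this notebook; which combinations are useful for your data is an assumption on our side. The WhiteSet call is left as a comment so the snippet runs without the toolkit installed.

```python
import pandas as pd

# name + obstype: intended to protect two observation types at two stations,
# for all timestamps.
idx_name_obstype = pd.MultiIndex.from_product(
    [["vlinder05", "vlinder06"], ["temp", "humidity"]],
    names=["name", "obstype"],
)

# name + datetime: intended to protect individual records at specific stations.
idx_name_dt = pd.MultiIndex.from_tuples(
    [
        ("vlinder02", pd.Timestamp("2022-09-04 14:00", tz="UTC")),
        ("vlinder06", pd.Timestamp("2022-09-05 08:00", tz="UTC")),
    ],
    names=["name", "datetime"],
)

# obstype only: a single-level Index.
idx_obstype = pd.Index(["temp"], name="obstype")

for idx in (idx_name_obstype, idx_name_dt, idx_obstype):
    print(list(idx.names), len(idx))
    # whiteset = metobs_toolkit.WhiteSet(idx)  # requires metobs_toolkit
```

The level names carry the meaning here, so it is worth checking that each level is named exactly `datetime`, `name`, or `obstype`.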
Let’s create different types of WhiteSets and examine them using the get_info() method.
WhiteSet with datetime only#
This whitelists specific timestamps across all stations and all observation types.
[3]:
# Specify timestamps
white_timestamps = pd.date_range(
    start='2022-09-03 01:00',
    end='2022-09-05 18:00',
    freq='1h',
    tz='UTC',
)
# Create WhiteSet with datetime-only index
whiteset_dt = metobs_toolkit.WhiteSet(
    pd.Index(white_timestamps, name='datetime')
)
# View information about this WhiteSet
whiteset_dt.get_info()
================================================================================
General info of WhiteSet
================================================================================
--- Whitelist details ---
-Total records: 66
-Index levels: datetime
-Unique timestamps: 66
-Time range: 2022-09-03 01:00:00+00:00 to 2022-09-05 18:00:00+00:00
Using WhiteSet in Dataset QC Methods#
Now let’s apply the same QC check with whitelisting and compare the results.
[4]:
# Create a fresh dataset copy
dataset_with_whitelist = copy.deepcopy(dataset)
# Apply gross value check WITH whitelisting
dataset_with_whitelist.gross_value_check(
    obstype='temp',
    lower_threshold=10.0,
    upper_threshold=20.0,
    whiteset=whiteset_dt,
    use_mp=False,
)
dataset_with_whitelist.make_plot(obstype='temp', colorby='label')
[4]:
<Axes: title={'center': 'temp data.'}, xlabel='Timestamps (in UTC)', ylabel='temp (degree_Celsius)'>
As the plot shows, records that fall within the whitelisted time window are not labeled as outliers, even when their values lie outside the thresholds.
WhiteSet with name-only index#
If the WhiteSet index contains only a ‘name’ level, every record of the listed stations is protected: no QC check that receives this WhiteSet will flag them.
[5]:
# Create a fresh dataset copy
dataset_with_whitelist = copy.deepcopy(dataset)
whiteset = metobs_toolkit.WhiteSet(
    pd.Index(['vlinder05', 'vlinder06'], name='name')
)
# Apply gross value check WITH whitelisting
dataset_with_whitelist.gross_value_check(
    obstype='temp',
    lower_threshold=10.0,
    upper_threshold=20.0,
    whiteset=whiteset,
    use_mp=False,
)
dataset_with_whitelist.make_plot(obstype='temp', colorby='label')
[5]:
<Axes: title={'center': 'temp data.'}, xlabel='Timestamps (in UTC)', ylabel='temp (degree_Celsius)'>
WhiteSet from outliersdf with name and datetime#
For finer control, specify (station name, timestamp) pairs. This whitelists specific timestamps for specific stations only.
[6]:
# 1. Create a multi-index WhiteSet with name and datetime
# Get some outliers to use as examples
baseline_outliers = baseline_dataset.outliersdf
sample_outliers = baseline_outliers.reset_index().set_index(['name', 'datetime'])
# We take only a sample as demonstration
sample_outliers = sample_outliers.sample(n=42)
# Get the index
sample_outliers = sample_outliers.index
sample_outliers
[6]:
MultiIndex([('vlinder02', '2022-09-04 14:00:00+00:00'),
('vlinder06', '2022-09-05 08:00:00+00:00'),
('vlinder22', '2022-09-13 09:00:00+00:00'),
('vlinder03', '2022-09-04 20:00:00+00:00'),
('vlinder04', '2022-09-03 10:00:00+00:00'),
('vlinder26', '2022-09-05 19:00:00+00:00'),
('vlinder27', '2022-09-04 11:00:00+00:00'),
('vlinder23', '2022-09-01 17:00:00+00:00'),
('vlinder13', '2022-09-03 12:00:00+00:00'),
('vlinder02', '2022-09-05 10:00:00+00:00'),
('vlinder28', '2022-09-12 18:00:00+00:00'),
('vlinder08', '2022-09-04 10:00:00+00:00'),
('vlinder28', '2022-09-02 16:00:00+00:00'),
('vlinder17', '2022-09-02 12:00:00+00:00'),
('vlinder25', '2022-09-13 09:00:00+00:00'),
('vlinder19', '2022-09-12 13:00:00+00:00'),
('vlinder11', '2022-09-13 16:00:00+00:00'),
('vlinder25', '2022-09-04 09:00:00+00:00'),
('vlinder28', '2022-09-03 12:00:00+00:00'),
('vlinder27', '2022-09-13 15:00:00+00:00'),
('vlinder28', '2022-09-05 17:00:00+00:00'),
('vlinder12', '2022-09-06 19:00:00+00:00'),
('vlinder01', '2022-09-01 15:00:00+00:00'),
('vlinder27', '2022-09-01 10:00:00+00:00'),
('vlinder16', '2022-09-02 09:00:00+00:00'),
('vlinder03', '2022-09-01 11:00:00+00:00'),
('vlinder10', '2022-09-05 18:00:00+00:00'),
('vlinder13', '2022-09-03 14:00:00+00:00'),
('vlinder15', '2022-09-04 12:00:00+00:00'),
('vlinder20', '2022-09-04 09:00:00+00:00'),
('vlinder01', '2022-09-02 21:00:00+00:00'),
('vlinder05', '2022-09-01 03:00:00+00:00'),
('vlinder17', '2022-09-05 12:00:00+00:00'),
('vlinder25', '2022-09-01 13:00:00+00:00'),
('vlinder25', '2022-09-11 12:00:00+00:00'),
('vlinder10', '2022-09-06 18:00:00+00:00'),
('vlinder05', '2022-09-03 10:00:00+00:00'),
('vlinder15', '2022-09-01 16:00:00+00:00'),
('vlinder23', '2022-09-03 19:00:00+00:00'),
('vlinder03', '2022-09-05 12:00:00+00:00'),
('vlinder12', '2022-09-13 12:00:00+00:00'),
('vlinder12', '2022-09-03 11:00:00+00:00')],
names=['name', 'datetime'])
[7]:
# 2. Create a WhiteSet from the index
whiteset = metobs_toolkit.WhiteSet(sample_outliers)
# 3. Apply it in QC
# Create a fresh dataset copy
dataset_with_specific_whitelist = copy.deepcopy(dataset)
# Apply gross value check WITH whitelisting
dataset_with_specific_whitelist.gross_value_check(
    obstype='temp',
    lower_threshold=10.0,
    upper_threshold=20.0,
    whiteset=whiteset,
    use_mp=False,
)
dataset_with_specific_whitelist.make_plot(obstype='temp', colorby='label')
[7]:
<Axes: title={'center': 'temp data.'}, xlabel='Timestamps (in UTC)', ylabel='temp (degree_Celsius)'>
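Besides the plot, one way to check which records were protected is to compare the outlier index of the baseline run with that of the whitelisted run. The helper `rescued_records` below is hypothetical and uses only pandas; the toy frames stand in for `baseline_dataset.outliersdf` and `dataset_with_specific_whitelist.outliersdf`, and the `label` values are placeholders rather than the toolkit's real labels.

```python
import pandas as pd


def rescued_records(baseline_outliers, whitelisted_outliers):
    """Return index entries flagged in the baseline run but absent
    from the whitelisted run."""
    return baseline_outliers.index.difference(whitelisted_outliers.index)


def _toy_outliers(pairs):
    """Build a small stand-in for an outliers frame (illustration only)."""
    index = pd.MultiIndex.from_tuples(
        [(name, pd.Timestamp(ts, tz="UTC")) for name, ts in pairs],
        names=["name", "datetime"],
    )
    return pd.DataFrame({"label": ["flagged"] * len(index)}, index=index)


baseline = _toy_outliers(
    [("vlinder02", "2022-09-04 14:00"), ("vlinder03", "2022-09-04 20:00")]
)
whitelisted = _toy_outliers([("vlinder03", "2022-09-04 20:00")])

rescued = rescued_records(baseline, whitelisted)
print(rescued)
```

The comparison works on the full index, so it assumes both frames share the same index levels in the same order.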
Special Note for Buddy Check#
Important: Take extra care when using WhiteSet with buddy_check() or buddy_check_with_safetynets(). Buddy check is an iterative algorithm that compares each station’s values against its spatial neighbors, so whitelisted records participate in every iteration and influence the statistical calculations used to evaluate other stations.
This means that a whitelisted extreme value at one station can affect whether neighboring stations get flagged as outliers, since the extreme value remains in the reference dataset throughout all buddy check iterations. Consider this carefully when whitelisting records for spatial quality control methods.
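The effect described above can be reproduced with a deliberately simplified, toolkit-independent sketch. The function `toy_buddy_check` is hypothetical: it compares each station with the mean and standard deviation of all other stations and ignores distances, safety nets, and everything else the real buddy check does. The station names and values are made up.

```python
from statistics import mean, stdev


def toy_buddy_check(values, threshold, whitelist=frozenset(), max_iter=5):
    """Iteratively flag stations that deviate from their buddies.

    Flagged stations leave the reference pool after each iteration;
    whitelisted stations are never flagged and therefore never leave it.
    """
    pool = dict(values)
    flagged = set()
    for _ in range(max_iter):
        new_flags = set()
        for name, value in pool.items():
            buddies = [v for other, v in pool.items() if other != name]
            if len(buddies) < 2:
                continue
            mu, sigma = mean(buddies), stdev(buddies)
            is_outlier = sigma > 0 and abs(value - mu) > threshold * sigma
            if is_outlier and name not in whitelist:
                new_flags.add(name)
        if not new_flags:
            break
        flagged |= new_flags
        for name in new_flags:
            del pool[name]
    return flagged


# "F" is an extreme value, "E" is moderately off, the rest agree closely.
stations = {"A": 15.0, "B": 15.2, "C": 14.9, "D": 15.1, "E": 17.0, "F": 30.0}

no_whitelist = toy_buddy_check(stations, threshold=3.0)
f_whitelisted = toy_buddy_check(stations, threshold=3.0, whitelist={"F"})
print(no_whitelist)
print(f_whitelisted)
```

Without a whitelist, "F" is removed in the first iteration, after which "E" stands out and is flagged as well. With "F" whitelisted, its value stays in the pool and inflates the spread, so "E" is no longer flagged either.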
Key Points#
WhiteSet protects records from being flagged, but they still participate in calculations
Works with all Dataset QC methods: gross_value_check, persistence_check, repetitions_check, step_check, window_variation_check, buddy_check, buddy_check_with_safetynets
Works with all Station QC methods: gross_value_check, persistence_check, repetitions_check, step_check, window_variation_check
Can whitelist by datetime, station name, obstype, or any combination
Use get_info() to inspect WhiteSet contents
Empty WhiteSet (default) means no protection