Converting MetObs Toolkit data to xarray Datasets#

This notebook demonstrates the to_xr() methods of the Station and Dataset classes from the metobs_toolkit package using the built-in demo dataset.

We show:

  1. Loading the demo dataset.

  2. Converting a single station to an xarray.Dataset.

  3. Inspecting the structure (dimensions, variables, attributes).

  4. Converting the full multi-station dataset to xarray.

  5. Exploring selections (e.g. picking observation values vs. labels).

What is xarray?#

xarray is a Python library that brings the labeled data concepts of pandas to N-dimensional arrays (NetCDF-style). It enables:

  • Named dimensions (e.g. datetime, kind, name)

  • Coordinate-based indexing and selection

  • Rich metadata via attributes

  • Easy export to formats like NetCDF / Zarr / GRIB (with plugins)

It is especially useful for structured time series, gridded data, or any multi-dimensional scientific data.

[1]:
# Imports
import metobs_toolkit
import xarray as xr
[2]:
# 1. Load the demo dataset into a Dataset object
dataset = metobs_toolkit.Dataset()
dataset.import_data_from_file(
    template_file=metobs_toolkit.demo_template,
    input_metadata_file=metobs_toolkit.demo_metadatafile,
    input_data_file=metobs_toolkit.demo_datafile,
)

print(f"Number of stations: {len(dataset.stations)}")
print("First 5 station names:", [s.name for s in dataset.stations[:5]])
Luchtdruk is present in the datafile, but not found in the template! This column will be ignored.
Neerslagintensiteit is present in the datafile, but not found in the template! This column will be ignored.
Neerslagsom is present in the datafile, but not found in the template! This column will be ignored.
Rukwind is present in the datafile, but not found in the template! This column will be ignored.
Luchtdruk_Zeeniveau is present in the datafile, but not found in the template! This column will be ignored.
Globe Temperatuur is present in the datafile, but not found in the template! This column will be ignored.
The following columns are present in the data file, but not in the template! They are skipped!
 ['Luchtdruk_Zeeniveau', 'Neerslagsom', 'Globe Temperatuur', 'Luchtdruk', 'Neerslagintensiteit', 'Rukwind']
The following columns are found in the metadata, but not in the template and are therefore ignored:
['stad', 'Network', 'benaming', 'sponsor']
Number of stations: 28
First 5 station names: ['vlinder01', 'vlinder02', 'vlinder03', 'vlinder04', 'vlinder05']
[3]:
# 2. Pick one station (e.g. 'vlinder05') and run a simple QC check to add labels
station = dataset.get_station('vlinder05')
station.repetitions_check(max_N_repetitions=200)

/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/src/metobs_toolkit/qc_collection/checks/repetitions_check.py:88: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  groups.get_group(
/home/thoverga/Documents/VLINDER_github/MetObs_toolkit/src/metobs_toolkit/sensordata.py:524: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
  self.outliers_values_bin = pd.concat([self.outliers_values_bin,
[4]:
# 3. Convert the single station to an xarray Dataset
ds_station = station.to_xr()

ds_station

[4]:
<xarray.Dataset> Size: 311kB
Dimensions:         (kind: 2, datetime: 4320)
Coordinates:
  * kind            (kind) <U5 40B 'obs' 'label'
  * datetime        (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T...
    lat             float64 8B 51.05
    lon             float64 8B 3.675
    altitude        float64 8B nan
    LCZ             float64 8B nan
    school          <U12 48B 'Sint-Barbara'
Data variables:
    humidity        (kind, datetime) float64 69kB 61.0 61.0 61.0 ... 0.0 0.0 0.0
    wind_speed      (kind, datetime) float64 69kB 1.611 1.611 1.611 ... 0.0 0.0
    temp            (kind, datetime) float64 69kB 21.1 21.1 21.1 ... 5.0 5.0 5.0
    wind_direction  (kind, datetime) float64 69kB 45.0 45.0 45.0 ... 0.0 0.0 0.0

Structure of the station-level Dataset#

For each observed variable (e.g. temp, humidity, etc.), a DataArray is created with:

  • Dimension kind: separates โ€˜obsโ€™ (values) and โ€˜labelโ€™ (QC and gap-fill labels). If modeldata is present, model is added to this dimension.

  • Dimension datetime: corresponding timestamp

Attributes on each variable include:

  • obstype_name, obstype_desc, obstype_unit

  • QC*: dictionary of applied quality control checks

  • GF*: dictionary of applied gap-fill methods

  • Label*: The mapping from the label to the numeric value.

Note: Labels are represented by numeric values in the xarray.Dataset. This is needed, so that the Dataset can be serialized (and thus saved as netCDF format).

[5]:
# 4. Inspect one variable (e.g. temperature)
ds_station['temp']

[5]:
<xarray.DataArray 'temp' (kind: 2, datetime: 4320)> Size: 69kB
array([[21.10000038, 21.10000038, 21.10000038, ..., 17.39999962,
        17.39999962, 17.39999962],
       [ 0.        ,  0.        ,  0.        , ...,  5.        ,
         5.        ,  5.        ]], shape=(2, 4320))
Coordinates:
  * kind      (kind) <U5 40B 'obs' 'label'
  * datetime  (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T23:55:00
    lat       float64 8B 51.05
    lon       float64 8B 3.675
    altitude  float64 8B nan
    LCZ       float64 8B nan
    school    <U12 48B 'Sint-Barbara'
Attributes:
    obstype_name:                      temp
    obstype_desc:                      2m - temperature
    obstype_unit:                      degree_Celsius
    MetObs toolkit version:            1.1.0a1
    QC checks:                         ['duplicated_timestamp', 'repetitions']
    QC:repetitions.max_N_repetitions:  200
    QC:repetitions.sensorwhiteset:     0 whitelisted timestamps
    GF methods:                        []
    Label:ok:                          0
    Label:repetitions outlier:         5
[6]:
# 5. Inspect the QC labels (kind='label')
labels = ds_station['temp'].sel(kind='label')
labels
[6]:
<xarray.DataArray 'temp' (datetime: 4320)> Size: 35kB
array([0., 0., 0., ..., 5., 5., 5.], shape=(4320,))
Coordinates:
  * datetime  (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T23:55:00
    kind      <U5 20B 'label'
    lat       float64 8B 51.05
    lon       float64 8B 3.675
    altitude  float64 8B nan
    LCZ       float64 8B nan
    school    <U12 48B 'Sint-Barbara'
Attributes:
    obstype_name:                      temp
    obstype_desc:                      2m - temperature
    obstype_unit:                      degree_Celsius
    MetObs toolkit version:            1.1.0a1
    QC checks:                         ['duplicated_timestamp', 'repetitions']
    QC:repetitions.max_N_repetitions:  200
    QC:repetitions.sensorwhiteset:     0 whitelisted timestamps
    GF methods:                        []
    Label:ok:                          0
    Label:repetitions outlier:         5

numeric to flags#

Here is a small example how to convert the numeric representation back to the labels. The conversion is written as a function so it can be copied over in your own scripts.

[7]:
import numpy as np

def numeric_labels_to_string_labels(da: xr.DataArray) -> xr.DataArray:
    # Construct the dict that maps labels -> numbers
    label_to_numeric_map = {f'{str(key).strip("Label:")}' : val for key, val in da.attrs.items() if key.startswith('Label:')}

    # Now we invert the map
    numeric_to_label_map = {v: k for k, v in label_to_numeric_map.items()}

    # Apply the mapping
    squarer = lambda t: numeric_to_label_map.get(t, t)
    vfunc = np.vectorize(squarer)
    da.data = vfunc(da.data)
    return da

obs_labels = numeric_labels_to_string_labels(ds_station['temp'].sel(kind='label'))
obs_labels
[7]:
<xarray.DataArray 'temp' (datetime: 4320)> Size: 328kB
array(['ok', 'ok', 'ok', ..., 'repetitions outlier',
       'repetitions outlier', 'repetitions outlier'],
      shape=(4320,), dtype='<U19')
Coordinates:
  * datetime  (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T23:55:00
    kind      <U5 20B 'label'
    lat       float64 8B 51.05
    lon       float64 8B 3.675
    altitude  float64 8B nan
    LCZ       float64 8B nan
    school    <U12 48B 'Sint-Barbara'
Attributes:
    obstype_name:                      temp
    obstype_desc:                      2m - temperature
    obstype_unit:                      degree_Celsius
    MetObs toolkit version:            1.1.0a1
    QC checks:                         ['duplicated_timestamp', 'repetitions']
    QC:repetitions.max_N_repetitions:  200
    QC:repetitions.sensorwhiteset:     0 whitelisted timestamps
    GF methods:                        []
    Label:ok:                          0
    Label:repetitions outlier:         5
[8]:
# Or the observations
records = ds_station['temp'].sel(kind='obs')
records
[8]:
<xarray.DataArray 'temp' (datetime: 4320)> Size: 35kB
array([21.10000038, 21.10000038, 21.10000038, ..., 17.39999962,
       17.39999962, 17.39999962], shape=(4320,))
Coordinates:
  * datetime  (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T23:55:00
    kind      <U5 20B 'obs'
    lat       float64 8B 51.05
    lon       float64 8B 3.675
    altitude  float64 8B nan
    LCZ       float64 8B nan
    school    <U12 48B 'Sint-Barbara'
Attributes:
    obstype_name:                      temp
    obstype_desc:                      2m - temperature
    obstype_unit:                      degree_Celsius
    MetObs toolkit version:            1.1.0a1
    QC checks:                         ['duplicated_timestamp', 'repetitions']
    QC:repetitions.max_N_repetitions:  200
    QC:repetitions.sensorwhiteset:     0 whitelisted timestamps
    GF methods:                        []
    Label:ok:                          0
    Label:repetitions outlier:         5

Converting the full Dataset#

We can also use to_xr() on a Dataset object. When doing so, an additional dimension name is added to the xarray.Dataset.

[9]:
# 6. Convert the entire collection of stations
ds_all = dataset.to_xr()

ds_all
[9]:
<xarray.Dataset> Size: 8MB
Dimensions:         (name: 28, kind: 2, datetime: 4320)
Coordinates:
  * name            (name) <U9 1kB 'vlinder01' 'vlinder02' ... 'vlinder28'
    lat             (name) float64 224B 50.98 51.02 51.32 ... 51.16 51.06 51.04
    lon             (name) float64 224B 3.816 3.71 4.952 ... 4.998 3.728 3.77
    school          (name) <U29 3kB 'UGent' 'UGent' ... 'GO! Ath.'
  * kind            (kind) <U5 40B 'obs' 'label'
  * datetime        (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T...
    altitude        float64 8B nan
    LCZ             float64 8B nan
Data variables:
    humidity        (name, kind, datetime) float64 2MB 65.0 65.0 ... 0.0 0.0
    wind_speed      (name, kind, datetime) float64 2MB 1.556 1.528 ... 0.0 0.0
    temp            (name, kind, datetime) float64 2MB 18.8 18.8 ... 0.0 0.0
    wind_direction  (name, kind, datetime) float64 2MB 65.0 75.0 ... 0.0 0.0
[10]:
# 7. Selecting a single station from the multi-station Dataset
ds_one = ds_all.sel(name='vlinder05')
ds_one['temp']

[10]:
<xarray.DataArray 'temp' (kind: 2, datetime: 4320)> Size: 69kB
array([[21.10000038, 21.10000038, 21.10000038, ..., 17.39999962,
        17.39999962, 17.39999962],
       [ 0.        ,  0.        ,  0.        , ...,  5.        ,
         5.        ,  5.        ]], shape=(2, 4320))
Coordinates:
  * kind      (kind) <U5 40B 'obs' 'label'
  * datetime  (datetime) datetime64[ns] 35kB 2022-09-01 ... 2022-09-15T23:55:00
    lat       float64 8B 51.05
    lon       float64 8B 3.675
    altitude  float64 8B nan
    LCZ       float64 8B nan
    school    <U29 116B 'Sint-Barbara'
    name      <U9 36B 'vlinder05'
Attributes:
    obstype_name:            temp
    obstype_desc:            2m - temperature
    obstype_unit:            degree_Celsius
    MetObs toolkit version:  1.1.0a1
    QC checks:               ['duplicated_timestamp']
    GF methods:              []
    Label:ok:                0

Dimension summary (multi-station)#

  • name: name of the station

  • kind: sub-type of the data (e.g. โ€˜obsโ€™, โ€˜labelโ€™, possibly โ€˜modelโ€™ if model time series added)

  • datetime: consolidated time axis (union across stations)

If model time series (e.g. ERA5) are imported, an additional internal dimension (e.g. models) appears inside the model DataArrays (stacked under kind='model').

Saving to NetCDF format#

Both Station and Dataset objects can be saved to NetCDF format using the .to_netcdf() method