metobs_toolkit.dataset.Dataset.import_data_from_file

Dataset.import_data_from_file(template_file: str | Path, input_data_file: str | Path = None, input_metadata_file: str | Path = None, freq_estimation_method: Literal['highest', 'median'] = 'median', freq_estimation_simplify_tolerance: str | Timedelta = '2min', origin_simplify_tolerance: str | Timedelta = '5min', timestamp_tolerance: str | Timedelta = '4min', kwargs_data_read: dict = {}, kwargs_metadata_read: dict = {}, templatefile_is_url: bool = False) → None

Import observational data and metadata from files.

Importing data requires a Template, which is constructed from a template file (JSON). Use metobs_toolkit.build_template_prompt() to create a template file.

If input_data_file is provided, the method reads the raw observational data (supported formats: CSV, Parquet). A basic quality control (duplicate timestamps and invalid input) is performed, and a frequency estimation is made. Based on the estimated frequency, gaps are identified if present.

The method performs the following steps:

  • Estimates the frequency of observations using the freq_estimation_method.

  • Simplifies the estimated frequency and origin timestamps based on tolerances.

  • Aligns the raw timestamps with target timestamps (defined by the origin and the frequency) using a nearest merge, within the specified timestamp tolerance.

  • Executes checks for duplicates and invalid input.

  • Identifies gaps in the data.
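The steps above can be illustrated with a small pandas-only sketch. This is not the toolkit's implementation: the sample data, the list of candidate "simple" frequencies, and the flooring rule for the origin are assumptions made for illustration, and reading 'highest' as the smallest time difference is likewise an assumption. Only the tolerances mirror the defaults in the signature.

```python
import pandas as pd

# Hypothetical raw record of one station: roughly 10-minute data with jitter and a hole.
raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:01:10",
                "2024-01-01 00:10:50",
                "2024-01-01 00:20:20",
                "2024-01-01 00:50:05",
                "2024-01-01 01:00:00",
            ]
        ),
        "temp": [2.1, 2.0, 1.8, 1.5, 1.4],
    }
)
raw["timestamp"] = raw["timestamp"].astype("datetime64[ns]")

# Tolerances, set to the default values listed in the signature.
freq_tol = pd.Timedelta("2min")
origin_tol = pd.Timedelta("5min")
timestamp_tol = pd.Timedelta("4min")

# 1. Estimate the frequency from the differences between consecutive timestamps.
diffs = raw["timestamp"].diff().dropna()
estimated_freq = diffs.median()  # 'median' method
highest_freq = diffs.min()       # assumed meaning of 'highest': smallest time step

# 2. Simplify the frequency: take the first candidate within the tolerance.
#    The candidate list is an assumption of this sketch.
freq = estimated_freq
for candidate in ("60min", "30min", "15min", "10min", "5min", "1min"):
    if abs(pd.Timedelta(candidate) - estimated_freq) <= freq_tol:
        freq = pd.Timedelta(candidate)
        break

# 3. Simplify the origin: floor the first timestamp to a multiple of the
#    frequency (counted from midnight) if that shift stays within the tolerance.
first = raw["timestamp"].iloc[0]
shift = (first - first.normalize()) % freq
origin = first - shift if shift <= origin_tol else first

# 4. Build the target timestamps and align the raw records with a nearest merge.
target = pd.DataFrame(
    {"target": pd.date_range(start=origin, end=raw["timestamp"].iloc[-1], freq=freq)}
)
target["target"] = target["target"].astype("datetime64[ns]")
aligned = pd.merge_asof(
    target,
    raw,
    left_on="target",
    right_on="timestamp",
    direction="nearest",
    tolerance=timestamp_tol,
)

# 5. Target timestamps without a raw record inside the tolerance are gap candidates.
gaps = aligned.loc[aligned["temp"].isna(), "target"]
print(aligned)
print(gaps.tolist())
```

In this sketch the record at 00:20:20 is the nearest neighbour of the 00:30 target, but it lies more than 4 minutes away, so that target stays empty and is reported as part of a gap.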

If input_metadata_file is provided, the method reads the metadata (supported formats: CSV, Parquet).

Parameters:
  • template_file (str or Path) – Path to the template (JSON) file used to interpret the raw data/metadata files.

  • input_data_file (str or Path, optional) – Path to the input data file containing observations (CSV or Parquet format). If None, no data is read.

  • input_metadata_file (str or Path, optional) – Path to the input metadata file (CSV or Parquet format). If None, no metadata is read.

  • freq_estimation_method ({'highest', 'median'}, optional) – Method to estimate the frequency of observations (per station per observation type).

  • freq_estimation_simplify_tolerance (str or pd.Timedelta, optional) – The maximum allowed error in simplifying the target frequency.

  • origin_simplify_tolerance (str or pd.Timedelta, optional) – The maximum allowed error in simplifying the origin (the first occurring timestamp) of each time series.

  • timestamp_tolerance (str or pd.Timedelta, optional) – The maximum allowed time shift tolerance for aligning timestamps to target (perfect-frequency) timestamps.

  • kwargs_data_read (dict, optional) – Additional keyword arguments to pass to the file reader (e.g., pandas.read_csv() for CSV files or pandas.read_parquet() for Parquet files) when reading the data file.

  • kwargs_metadata_read (dict, optional) – Additional keyword arguments to pass to the file reader (e.g., pandas.read_csv() for CSV files or pandas.read_parquet() for Parquet files) when reading the metadata file.

  • templatefile_is_url (bool, optional) – If True, the template_file is interpreted as a URL to an online template file. If False, it is interpreted as a local file path.

Return type:

None
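Because kwargs_data_read and kwargs_metadata_read are forwarded to the pandas reader, it can help to test the reader options directly with pandas before importing. The sketch below does this for a CSV file; the file content and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical raw file content: semicolon-separated values with decimal commas.
raw_csv = (
    "Datetime;Temperature\n"
    "2024-01-01 00:00;2,1\n"
    "2024-01-01 00:10;2,0\n"
)

# Reader options collected in a dict, the form expected by kwargs_data_read.
kwargs_data_read = {"sep": ";", "decimal": ","}

# Check that pandas parses the file as intended with these options.
df = pd.read_csv(io.StringIO(raw_csv), **kwargs_data_read)
print(df.dtypes)
```

Once the options parse the file correctly, the same dict can be passed as kwargs_data_read=kwargs_data_read in the call to import_data_from_file, together with the template and data file paths.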