metobs_toolkit.dataset.Dataset.import_data_from_file
- Dataset.import_data_from_file(template_file: str | Path, input_data_file: str | Path = None, input_metadata_file: str | Path = None, freq_estimation_method: Literal['highest', 'median'] = 'median', freq_estimation_simplify_tolerance: str | Timedelta = '2min', origin_simplify_tolerance: str | Timedelta = '5min', timestamp_tolerance: str | Timedelta = '4min', kwargs_data_read: dict = {}, kwargs_metadata_read: dict = {}, templatefile_is_url: bool = False) → None
Import observational data and metadata from files.
Importing data requires a Template, which is constructed from a template file (JSON). (Use metobs_toolkit.build_template_prompt() to create a template file.)
If input_data_file is provided, the method reads the raw observational data (supported formats: CSV, Parquet). A basic quality control (duplicate timestamps and invalid input) is performed, and a frequency estimation is made. Based on the estimated frequency, gaps are identified if present.
The method performs the following steps:
Estimates the frequency of observations using the freq_estimation_method.
Simplifies the estimated frequency and origin timestamps based on tolerances.
Aligns the raw timestamps with target timestamps (by origin, and freq) using a nearest merge, considering a specified timestamp tolerance.
Executes checks for duplicates and invalid input.
Identifies gaps in the data.
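The steps above can be sketched with plain pandas. This is a minimal illustration, not the toolkit's implementation: the sample records, the helper estimate_freq, the list of candidate frequencies, and the reading of 'highest' as "smallest time step" are all assumptions made for the example. The nearest merge is done with pandas.merge_asof.

```python
import pandas as pd

# Hypothetical raw records of one station: roughly 10-minute data with
# small clock jitter and one missing record around 00:30.
raw = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2024-01-01 00:00:40",
                "2024-01-01 00:10:10",
                "2024-01-01 00:19:50",
                "2024-01-01 00:40:20",
                "2024-01-01 00:50:00",
            ]
        ).astype("datetime64[ns]"),
        "temp": [2.1, 2.0, 1.8, 1.5, 1.4],
    }
)


def estimate_freq(timestamps, method="median", simplify_tolerance="2min"):
    """Estimate a time step and snap it to a 'simple' value (hypothetical helper)."""
    diffs = timestamps.sort_values().diff().dropna()
    # Assumption: 'highest' frequency corresponds to the smallest time step.
    estimate = diffs.median() if method == "median" else diffs.min()
    # Assumed candidate list, tried from coarse to fine.
    for candidate in ["1h", "30min", "15min", "10min", "5min", "1min"]:
        step = pd.Timedelta(candidate)
        if abs(estimate - step) <= pd.Timedelta(simplify_tolerance):
            return step
    return estimate  # no simple value within tolerance


freq = estimate_freq(raw["datetime"], method="median")

# Simplify the origin: floor the first timestamp to a multiple of the
# frequency, but only if that shifts it by no more than the tolerance.
first = raw["datetime"].min()
shift = (first - first.normalize()) % freq
origin = first - shift if shift <= pd.Timedelta("5min") else first

# Build the target (perfect-frequency) timestamps and align with a nearest merge.
timestamp_tolerance = pd.Timedelta("4min")
target = pd.DataFrame(
    {
        "datetime": pd.date_range(
            start=origin, end=raw["datetime"].max() + timestamp_tolerance, freq=freq
        ).astype("datetime64[ns]")
    }
)
aligned = pd.merge_asof(
    target,
    raw.sort_values("datetime"),
    on="datetime",
    direction="nearest",
    tolerance=timestamp_tolerance,
)

# Target timestamps without a raw record within the tolerance are gap candidates.
gaps = aligned.loc[aligned["temp"].isna(), "datetime"]
print(freq, origin, list(gaps))
```

In this sketch the record at 00:19:50 is more than four minutes away from the 00:30 target, so that target stays empty and is reported as a gap.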
If input_metadata_file is provided, the method reads the metadata (supported formats: CSV, Parquet).
- Parameters:
template_file (str or Path) – Path to the template (JSON) file used to interpret the raw data/metadata files.
input_data_file (str or Path, optional) – Path to the input data file containing observations (CSV or Parquet format). If None, no data is read.
input_metadata_file (str or Path, optional) – Path to the input metadata file (CSV or Parquet format). If None, no metadata is read.
freq_estimation_method ({'highest', 'median'}, optional) – Method to estimate the frequency of observations (per station per observation type). Default is 'median'.
freq_estimation_simplify_tolerance (str or pd.Timedelta, optional) – The maximum allowed error in simplifying the target frequency. Default is '2min'.
origin_simplify_tolerance (str or pd.Timedelta, optional) – For each time series, the origin (first occurring timestamp) is determined and then simplified. This tolerance bounds the error allowed in that simplification. Default is '5min'.
timestamp_tolerance (str or pd.Timedelta, optional) – The maximum allowed time shift when aligning raw timestamps to target (perfect-frequency) timestamps. Default is '4min'.
kwargs_data_read (dict, optional) – Additional keyword arguments to pass to the file reader (e.g., pandas.read_csv() for CSV files or pandas.read_parquet() for Parquet files) when reading the data file.
kwargs_metadata_read (dict, optional) – Additional keyword arguments to pass to the file reader (e.g., pandas.read_csv() for CSV files or pandas.read_parquet() for Parquet files) when reading the metadata file.
templatefile_is_url (bool, optional) – If True, the template_file is interpreted as a URL to an online template file. If False, it is interpreted as a local file path.
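As an illustration of the reader keyword arguments and the tolerance strings, the sketch below calls pandas directly. The file content and column names are hypothetical, and it is an assumption that the toolkit forwards the dictionary to the reader in the same way as the ** unpacking shown here.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data file that uses a decimal comma.
csv_text = (
    "station;datetime;temp\n"
    "A;2024-01-01 00:00;2,1\n"
    "A;2024-01-01 00:10;2,0\n"
)

# The kind of dictionary one could pass as kwargs_data_read.
kwargs_data_read = {"sep": ";", "decimal": ","}
df = pd.read_csv(io.StringIO(csv_text), **kwargs_data_read)
print(df["temp"].tolist())

# Tolerances can be given as strings that pandas can parse into a Timedelta.
timestamp_tolerance = pd.Timedelta("4min")
print(timestamp_tolerance == pd.Timedelta(minutes=4))
```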
- Return type:
None
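A call could look like the sketch below. The file names are hypothetical placeholders, and constructing Dataset() without arguments is an assumption; the keyword names and default values are taken from the signature above. The call is only made when the package is installed and the placeholder files exist.

```python
from importlib.util import find_spec
from pathlib import Path

# Hypothetical file locations; replace them with your own files.
template_file = Path("my_template.json")
data_file = Path("my_observations.csv")
metadata_file = Path("my_metadata.csv")

import_kwargs = {
    "template_file": template_file,
    "input_data_file": data_file,
    "input_metadata_file": metadata_file,
    "freq_estimation_method": "median",
    "freq_estimation_simplify_tolerance": "2min",
    "origin_simplify_tolerance": "5min",
    "timestamp_tolerance": "4min",
    "kwargs_data_read": {"sep": ";"},  # forwarded to the file reader
    "templatefile_is_url": False,
}

files_present = all(p.exists() for p in (template_file, data_file, metadata_file))
if find_spec("metobs_toolkit") is not None and files_present:
    from metobs_toolkit.dataset import Dataset

    dataset = Dataset()  # assumption: no constructor arguments are needed
    dataset.import_data_from_file(**import_kwargs)  # returns None
```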