Xarray engine: mono variable with remapping

This notebook demonstrates how to generate an Xarray dataset with a single DataArray that holds all the parameters from a GRIB fieldlist. This data structure is often needed for machine learning.

First, we load GRIB data containing multiple forecasts on surface and pressure levels, and select a single forecast from it by date and time.

[1]:
import earthkit.data as ekd
ds_fl = ekd.from_source("sample", "mixed_pl_sfc.grib").sel(date=20240603, time=0)

Next, we convert the GRIB fieldlist to Xarray with to_xarray(). The goal is to create a single variable in the dataset called “data”. Since we have both surface and pressure level parameters, the input data does not form a full hypercube: not every parameter is available on every level. To overcome this problem, we use the remapping option to merge the “param” and “level” metadata keys into a single “param” key. With fixed_dims we define the dimensions to use and their order, and mono_variable=True ensures that a single DataArray is created.

[2]:
ds = ds_fl.to_xarray(fixed_dims=["valid_time", "param", "number"],
                     mono_variable=True,
                     chunks={"valid_time": 1},
                     flatten_values=True,
                     add_earthkit_attrs=False,
                     remapping={"param": "{param}_{level}"}
                    )
ds
[2]:
<xarray.Dataset> Size: 362kB
Dimensions:     (valid_time: 2, param: 32, number: 1, values: 684)
Coordinates:
  * valid_time  (valid_time) datetime64[ns] 16B 2024-06-03 2024-06-03T06:00:00
  * param       (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
  * number      (number) int64 8B 0
    latitude    (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
    longitude   (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Data variables:
    data        (valid_time, param, number, values) float64 350kB dask.array<chunksize=(1, 32, 1, 684), meta=np.ndarray>
Attributes:
    paramId:      167
    class:        od
    stream:       oper
    levtype:      sfc
    type:         fc
    expver:       0001
    date:         20240603
    time:         0
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF
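
To illustrate why the remapping is needed, here is a minimal sketch in plain Python. It does not use earthkit-data. The list of dictionaries is a hypothetical stand-in for the GRIB metadata, and the remap helper is an illustrative name that only mimics the effect of the “{param}_{level}” template seen in the “param” coordinate above.

```python
from itertools import product

# Hypothetical metadata for a handful of fields (not read from the sample file).
fields = [
    {"param": "2t", "level": 0},
    {"param": "msl", "level": 0},
    {"param": "t", "level": 500},
    {"param": "t", "level": 850},
    {"param": "z", "level": 500},
    {"param": "z", "level": 850},
]

# With separate "param" and "level" dimensions, a full hypercube would need
# one field for every (param, level) combination.
params = sorted({f["param"] for f in fields})
levels = sorted({f["level"] for f in fields})
present = {(f["param"], f["level"]) for f in fields}
missing = [combo for combo in product(params, levels) if combo not in present]
print(f"{len(missing)} of {len(params) * len(levels)} combinations are missing")


def remap(field, template="{param}_{level}"):
    """Illustrative helper: build the merged key from the metadata of a field."""
    return template.format(**field)


# After merging, every field has its own label along a single dimension,
# so no combination is missing any more.
merged = [remap(f) for f in fields]
print(merged)
```

In this sketch the surface parameters exist only on level 0 and the upper-air parameters only on the pressure levels, so a (param, level) grid would have holes. The merged labels form a single complete dimension instead.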

When generating the Xarray dataset, we flattened the field values (flatten_values=True) and chose the chunking (chunks={"valid_time": 1}) so that each chunk contains all the data belonging to a given valid time.

[3]:
ds["data"]
[3]:
<xarray.DataArray 'data' (valid_time: 2, param: 32, number: 1, values: 684)> Size: 350kB
dask.array<open_dataset-data, shape=(2, 32, 1, 684), dtype=float64, chunksize=(1, 32, 1, 684), chunktype=numpy.ndarray>
Coordinates:
  * valid_time  (valid_time) datetime64[ns] 16B 2024-06-03 2024-06-03T06:00:00
  * param       (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
  * number      (number) int64 8B 0
    latitude    (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
    longitude   (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Attributes:
    standard_name:  unknown
    long_name:      2 metre temperature
    units:          K
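
The chunk layout reported above can be checked with a little arithmetic. The following is a rough sketch in plain Python, assuming the dimension sizes and the float64 dtype shown in the output. It does not call dask or Xarray, and the variable names are illustrative only.

```python
from math import prod

# Dimension sizes taken from the DataArray representation above.
shape = {"valid_time": 2, "param": 32, "number": 1, "values": 684}
# Chunking requested in to_xarray(); dimensions not listed stay in one piece.
chunks = {"valid_time": 1}
itemsize = 8  # bytes per float64 value

# Shape of one chunk: the requested size where given, else the full dimension.
chunk_shape = tuple(chunks.get(dim, size) for dim, size in shape.items())

# Number of chunks: ceiling division along each dimension, multiplied together.
n_chunks = prod(-(-size // c) for size, c in zip(shape.values(), chunk_shape))

chunk_bytes = prod(chunk_shape) * itemsize
total_bytes = prod(shape.values()) * itemsize

print(chunk_shape, n_chunks, chunk_bytes, total_bytes)
```

The chunk shape matches the chunksize=(1, 32, 1, 684) reported for “data”, and the total of about 350 kB agrees with the size shown for the DataArray, so each of the two chunks holds every parameter for one valid time.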