Xarray engine: mono variable with remapping

This notebook demonstrates how to generate an Xarray dataset with a single DataArray containing all the parameters from a GRIB fieldlist. This data structure is often needed for machine learning.

First, we get GRIB data containing multiple forecasts on surface and pressure levels, and select a single forecast from it by its base date and time.

[1]:
import earthkit.data as ekd

ds_fl = ekd.from_source("sample", "mixed_pl_sfc.grib").to_fieldlist().sel({"time.base_datetime": "2024-06-03T00"})

Next, we convert the GRIB fieldlist to Xarray with to_xarray(). The goal is to create a single variable called “data” in the dataset. Since we have both surface and pressure level parameters, the input data does not form a full hypercube. To overcome this, we use the remapping option to merge the “parameter.variable” and “vertical.level” metadata keys into a single key, “param”. With fixed_dims we define which dimensions to use and their order, while mono_variable=True ensures that a single DataArray is created.

[2]:
ds = ds_fl.to_xarray(
    fixed_dims=["time.valid_datetime", "param", "ensemble.member"],
    mono_variable=True,
    chunks={"valid_time": 1},
    flatten_values=True,
    add_earthkit_attrs=False,
    remapping={"param": "{parameter.variable}_{vertical.level}"},
)
ds
[2]:
<xarray.Dataset> Size: 362kB
Dimensions:         (valid_datetime: 2, param: 32, member: 1, values: 684)
Coordinates:
  * valid_datetime  (valid_datetime) datetime64[ns] 16B 2024-06-03 2024-06-03...
  * param           (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
  * member          (member) <U1 4B '0'
    latitude        (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
    longitude       (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Data variables:
    data            (valid_datetime, param, member, values) float64 350kB dask.array<chunksize=(2, 32, 1, 684), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF
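
To illustrate how the remapping template combines the two metadata keys into the labels seen on the param coordinate, here is a minimal pure-Python sketch. The apply_remapping helper and the metadata dictionaries are hypothetical stand-ins for illustration only; this is not how earthkit-data implements the option internally.

```python
import re


def apply_remapping(template, metadata):
    """Replace each {key} placeholder in the template with the metadata value.

    Hypothetical helper for illustration; keys may contain dots.
    """
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: str(metadata[match.group(1)]),
        template,
    )


# Made-up metadata for one surface field and two pressure level fields
fields = [
    {"parameter.variable": "2t", "vertical.level": 0},
    {"parameter.variable": "z", "vertical.level": 700},
    {"parameter.variable": "z", "vertical.level": 850},
]

template = "{parameter.variable}_{vertical.level}"
params = [apply_remapping(template, f) for f in fields]
print(params)  # ['2t_0', 'z_700', 'z_850']
```

Because every field receives a unique combined label, surface and pressure level fields can share a single param dimension instead of requiring separate variable and level dimensions.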

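When generating the Xarray we flattened the field values (flatten_values=True) and requested a chunk size of 1 along the valid time, with the intention that one chunk would contain all the data belonging to a given valid time. Note, however, that the output reports a single chunk of shape (2, 32, 1, 684). This suggests that the "valid_time" key passed to chunks did not match the valid_datetime dimension; the key presumably has to be the dimension name as it appears in the dataset.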
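
The chunk layout can be inspected through the chunks attribute. The following is a minimal sketch on a synthetic array that only mimics the dimension names and sizes from the output above (it does not use the GRIB data), and it assumes that dask is installed:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same dimension names and sizes as the output above
da = xr.DataArray(
    np.zeros((2, 32, 1, 684)),
    dims=("valid_datetime", "param", "member", "values"),
    name="data",
)

# Chunk along the dimension name used in the dataset, one valid time per chunk
rechunked = da.chunk({"valid_datetime": 1})
print(rechunked.chunks)  # ((1, 1), (32,), (1,), (684,))
```

Each of the two chunks then holds all parameters, members and grid points for one valid time.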

[3]:
ds["data"]
[3]:
<xarray.DataArray 'data' (valid_datetime: 2, param: 32, member: 1, values: 684)> Size: 350kB
dask.array<open_dataset-data, shape=(2, 32, 1, 684), dtype=float64, chunksize=(2, 32, 1, 684), chunktype=numpy.ndarray>
Coordinates:
  * valid_datetime  (valid_datetime) datetime64[ns] 16B 2024-06-03 2024-06-03...
  * param           (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
  * member          (member) <U1 4B '0'
    latitude        (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
    longitude       (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Attributes:
    standard_name:  unknown
    long_name:      2 metre temperature
    units:          kelvin
    level_type:     surface
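
Since all parameters live in one DataArray, a subset of them can be selected by label and turned into a plain NumPy array, for example as input features for a machine learning model. This is a hedged sketch on a small synthetic array with made-up values and a reduced number of grid points; the dimension names and parameter labels follow the output above, while the chosen feature layout is only an assumption.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in: 2 valid times, 4 parameters, 1 member, 3 grid points
params = ["2t_0", "msl_0", "z_700", "z_850"]
da = xr.DataArray(
    np.arange(2 * 4 * 1 * 3, dtype=float).reshape(2, 4, 1, 3),
    dims=("valid_datetime", "param", "member", "values"),
    coords={"param": params},
    name="data",
)

# Select two parameters by their remapped labels
subset = da.sel(param=["2t_0", "z_850"])
print(subset.shape)  # (2, 2, 1, 3)

# Drop the singleton member dimension and put the parameters last,
# giving an array of shape (time, grid point, feature)
features = (
    subset.squeeze("member")
    .transpose("valid_datetime", "values", "param")
    .values
)
print(features.shape)  # (2, 3, 2)
```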