Xarray engine: mono variable with remapping
This notebook demonstrates how to generate an Xarray dataset with a single DataArray containing all the parameters from a GRIB fieldlist. This data structure is often needed for machine learning.
First, we get GRIB data containing multiple forecasts at the surface and on pressure levels. We select a single forecast from it.
[1]:
import earthkit.data as ekd
ds_fl = ekd.from_source("sample", "mixed_pl_sfc.grib").sel(date=20240603, time=0)
Next, we convert the GRIB fieldlist to Xarray with to_xarray(). The goal is to create a single variable called “data” in the dataset. Since we have both surface and pressure level parameters, the input data does not form a full hypercube. To overcome this problem, we use the remapping option to merge the “param” and “level” metadata keys into a single key. With fixed_dims we define which dimensions to use and their order, and mono_variable=True ensures that a single DataArray is created.
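The hypercube problem and the effect of merging the two keys can be illustrated in plain Python, independently of earthkit-data. This is only a sketch: the `fields` list is a hypothetical stand-in for the GRIB metadata, and `str.format` is used merely to mimic the “{param}_{level}” template, not to describe how the remapping is implemented internally.

```python
# Hypothetical metadata records standing in for GRIB fields; the real
# fieldlist contains more parameters and levels.
fields = [
    {"param": "2t", "level": 0},
    {"param": "msl", "level": 0},
    {"param": "t", "level": 850},
    {"param": "t", "level": 700},
    {"param": "z", "level": 850},
    {"param": "z", "level": 700},
]

params = sorted({f["param"] for f in fields})
levels = sorted({f["level"] for f in fields})

# A full (param, level) hypercube needs every combination to be present.
present = {(f["param"], f["level"]) for f in fields}
missing = [(p, lev) for p in params for lev in levels if (p, lev) not in present]

# Merging both keys into one label gives a single, gap-free axis.
merged = sorted("{param}_{level}".format(**f) for f in fields)

print(f"{len(missing)} of {len(params) * len(levels)} combinations are missing")
print(merged)
```

With separate “param” and “level” dimensions, half of the combinations in this toy example have no field behind them, whereas the merged labels form a one-dimensional axis with exactly one entry per field.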
[2]:
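ds = ds_fl.to_xarray(
    fixed_dims=["valid_time", "param", "number"],  # dimensions and their order
    mono_variable=True,  # create a single DataArray
    chunks={"valid_time": 1},  # one chunk per valid time
    flatten_values=True,  # flatten the field values into one "values" dimension
    add_earthkit_attrs=False,
    remapping={"param": "{param}_{level}"},  # merge "param" and "level"
)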
ds
[2]:
<xarray.Dataset> Size: 362kB
Dimensions: (valid_time: 2, param: 32, number: 1, values: 684)
Coordinates:
* valid_time (valid_time) datetime64[ns] 16B 2024-06-03 2024-06-03T06:00:00
* param (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
* number (number) int64 8B 0
latitude (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
longitude (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Data variables:
data (valid_time, param, number, values) float64 350kB dask.array<chunksize=(1, 32, 1, 684), meta=np.ndarray>
Attributes:
paramId: 167
class: od
stream: oper
levtype: sfc
type: fc
expver: 0001
date: 20240603
time: 0
domain: g
Conventions: CF-1.8
institution: ECMWF
When generating the Xarray, we flattened the field values and chose the chunking so that one chunk would contain all the data belonging to a given valid time.
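The sizes reported in the output can be cross-checked with a little arithmetic. The sketch below takes the shape and chunk size from the printed representation and assumes 8 bytes per float64 value; it does not touch the data itself.

```python
import math

# Shape and chunk size as printed in the dataset representation:
# (valid_time, param, number, values)
shape = (2, 32, 1, 684)
chunks = (1, 32, 1, 684)
itemsize = 8  # bytes per float64 value

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize

# Number of chunks along each dimension, multiplied together
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(total_bytes, chunk_bytes, n_chunks)
```

The total of 350208 bytes agrees with the 350kB shown for “data”, and the array splits into two chunks of 175104 bytes, one per valid time.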
[3]:
ds["data"]
[3]:
<xarray.DataArray 'data' (valid_time: 2, param: 32, number: 1, values: 684)> Size: 350kB
dask.array<open_dataset-data, shape=(2, 32, 1, 684), dtype=float64, chunksize=(1, 32, 1, 684), chunktype=numpy.ndarray>
Coordinates:
* valid_time (valid_time) datetime64[ns] 16B 2024-06-03 2024-06-03T06:00:00
* param (param) <U6 768B '2t_0' 'msl_0' 'r_1000' ... 'z_700' 'z_850'
* number (number) int64 8B 0
latitude (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
longitude (values) float64 5kB dask.array<chunksize=(684,), meta=np.ndarray>
Dimensions without coordinates: values
Attributes:
standard_name: unknown
long_name: 2 metre temperature
units: K