Data objects

Calling from_source() returns a data object. The actual object type depends on the source parameters and the data format, but every data object is expected to implement a common set of methods and operators, some of which are only available for certain data types.

The list of common methods/operators:

Conversion to scientific Python objects

We can convert data objects into familiar scientific Python objects (including numpy arrays, pandas dataframes, xarray datasets):

ds.to_xarray()  # for field data
ds.to_pandas()  # for non-field data
ds.to_numpy()  # when the data is an n-dimensional array

Concatenation

Data objects can be concatenated with the “+” operator:

>>> import earthkit.data as ekd
>>> ds1 = ekd.from_source("file", "docs/examples/test.grib")
>>> len(ds1)
2
>>> ds2 = ekd.from_source("file", "docs/examples/test6.grib")
>>> len(ds2)
6
>>> ds = ds1 + ds2
>>> len(ds)
8

Iteration

When an earthkit-data data source or dataset provides a FieldList or message list, we can iterate through it to access each element (in a given order; see below).

In the following example we read a GRIB file from disk. During the iteration each element is a field (representing a GRIB message):

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> len(ds)
6

>>> for f in ds:
...     print(f)
...
GribField(t,1000,20180801,1200,0,0)
GribField(u,1000,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)
GribField(u,850,20180801,1200,0,0)
GribField(v,850,20180801,1200,0,0)

Iteration with .batched()

When an earthkit-data data source or dataset provides a FieldList or message list, we can iterate through it in batches of fixed size using batched(). This method also works for streams.

In the following example we read a GRIB file from disk and iterate through it in batches of 2. Each iteration step yields a FieldList of 2 fields.

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> for f in ds.batched(2):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=2 [('t', 1000), ('u', 1000)]
len=2 [('v', 1000), ('t', 850)]
len=2 [('u', 850), ('v', 850)]
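
The batching behaviour can be pictured with a plain-Python sketch. It is an illustration only, not the earthkit-data implementation: the helper name batched_sketch and the (param, level) tuples standing in for fields are assumptions made for this example. Because each batch is pulled from an iterator on demand, the same pattern works when the input is a stream whose length is not known in advance.

```python
from itertools import islice


def batched_sketch(items, n):
    """Yield consecutive lists of at most n items from any iterable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(items)
    while True:
        # Pull the next n items; fewer are returned if the input runs out
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


# Hypothetical stand-ins for the six fields of test6.grib: (param, level)
fields = [("t", 1000), ("u", 1000), ("v", 1000), ("t", 850), ("u", 850), ("v", 850)]

batches = list(batched_sketch(fields, 2))
for batch in batches:
    print(f"len={len(batch)} {batch}")
```

In this sketch the final batch is shorter than n when the number of items is not a multiple of n.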

Iteration with .group_by()

When an earthkit-data data source or dataset provides a FieldList or message list, we can iterate through it in groups defined by metadata keys using group_by(). This method also works for streams.

In the following example we read a GRIB file from disk and iterate through it in groups defined by the “level” metadata key. Each iteration step yields a FieldList containing fields with the same “level” value.

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> for f in ds.group_by("level"):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=3 [('t', 1000), ('u', 1000), ('v', 1000)]
len=3 [('t', 850), ('u', 850), ('v', 850)]
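
The grouping idea can be sketched in plain Python. This is not the earthkit-data implementation: the helper name group_by_sketch and the (param, level) tuples are assumptions made for this example, and the sketch assumes that a group is formed from consecutive elements sharing the same key value, which is consistent with the output above and with processing a stream in a single pass.

```python
from itertools import groupby


def group_by_sketch(items, key_index):
    """Yield lists of consecutive items that share the value at key_index."""
    for _, group in groupby(items, key=lambda item: item[key_index]):
        yield list(group)


# Hypothetical stand-ins for the six fields of test6.grib: (param, level)
fields = [("t", 1000), ("u", 1000), ("v", 1000), ("t", 850), ("u", 850), ("v", 850)]

groups = list(group_by_sketch(fields, 1))  # index 1 holds the level
for group in groups:
    print(f"len={len(group)} {group}")
```

With this approach a key value that reappears later in the input starts a new group, so the input order matters.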

Selection with [...]

When an earthkit-data data source or dataset provides a FieldList or message list, a subset of it can be created using the standard Python list interface based on brackets and slices. A list or ndarray of indices can also be used for the selection.

>>> import earthkit.data as ekd
>>> import numpy as np
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> len(ds)
6

>>> ds[0]
GribField(t,1000,20180801,1200,0,0)

>>> for f in ds[0:3]:
...     print(f)
...
GribField(t,1000,20180801,1200,0,0)
GribField(u,1000,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)

>>> for f in ds[0:4:2]:
...     print(f)
...
GribField(t,1000,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)

>>> ds[-1]
GribField(v,850,20180801,1200,0,0)

>>> for f in ds[-2:]:
...     print(f)
...
GribField(u,850,20180801,1200,0,0)
GribField(v,850,20180801,1200,0,0)

>>> for f in ds[[1, 3]]:
...     print(f)
...
GribField(u,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)

>>> for f in ds[np.array([1, 3])]:
...     print(f)
...
GribField(u,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)

Selection with .sel()

When an earthkit-data data source or dataset provides a FieldList or message list, the .sel() method filters this list by metadata values and returns the matching subset. .sel() returns a view of the original data, so no data is copied. The selection offers the same functionality as the original data object, so methods like .to_numpy(), .to_xarray(), etc. are all available.

For more details see: sel().

The following example demonstrates the usage of .sel(). The input data contains temperature and wind fields on various pressure levels.

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/tuv_pl.grib")

>>> len(ds)
18

>>> subset = ds.sel(param="t")
>>> len(subset)
6

>>> for f in subset:
...     print(f)
...
GribField(t,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)
GribField(t,700,20180801,1200,0,0)
GribField(t,500,20180801,1200,0,0)
GribField(t,400,20180801,1200,0,0)
GribField(t,300,20180801,1200,0,0)

>>> subset = ds.sel(param=["u", "v"], level=slice(400, 700))
>>> len(subset)
6

>>> for f in subset:
...     print(f)
...
GribField(u,700,20180801,1200,0,0)
GribField(v,700,20180801,1200,0,0)
GribField(u,500,20180801,1200,0,0)
GribField(v,500,20180801,1200,0,0)
GribField(u,400,20180801,1200,0,0)
GribField(v,400,20180801,1200,0,0)
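
The matching rules used in the example above can be sketched in plain Python. This is an illustration and not the library's code: the helper names matches and sel_sketch and the dictionaries standing in for fields are assumptions made for this example. The slice is treated as a closed value range, mirroring the level=slice(400, 700) call above, which returned the 400 and 700 levels as well as 500.

```python
def matches(value, condition):
    """Return True if value satisfies a single selection condition."""
    if isinstance(condition, slice):
        # Value range with both ends inclusive; None means unbounded
        low, high = condition.start, condition.stop
        return (low is None or value >= low) and (high is None or value <= high)
    if isinstance(condition, (list, tuple)):
        return value in condition
    return value == condition


def sel_sketch(records, **conditions):
    """Keep the records whose metadata satisfy every condition."""
    return [
        r
        for r in records
        if all(matches(r[key], cond) for key, cond in conditions.items())
    ]


# Hypothetical stand-ins for the 18 fields of tuv_pl.grib
levels = [1000, 850, 700, 500, 400, 300]
records = [{"param": p, "level": lev} for lev in levels for p in ("t", "u", "v")]

subset = sel_sketch(records, param=["u", "v"], level=slice(400, 700))
print(len(subset))
```

Unlike .sel(), this sketch builds a new list rather than a view of the original data.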

Selection with .isel()

When an earthkit-data data source or dataset provides a FieldList, the .isel() method filters this list and returns the matching subset. .isel() returns a view of the original data, so no data is copied. The selection offers the same functionality as the original data object, so methods like .to_numpy(), .to_xarray(), etc. are all available.

.isel() works similarly to .sel(), but conditions are specified by indices of metadata keys. A metadata index stores the unique, sorted values of the corresponding metadata key from all the fields in the input data.

For more details see: isel()

The following example demonstrates the usage of .isel(). The input data contains temperature and wind fields on various pressure levels.

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/tuv_pl.grib")

>>> len(ds)
18
>>> ds.indices
{'levelist': (1000, 850, 700, 500, 400, 300), 'param': ('t', 'u', 'v')}

>>> subset = ds.isel(param=0)
>>> len(subset)
6

>>> for f in subset:
...     print(f)
...
GribField(t,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)
GribField(t,700,20180801,1200,0,0)
GribField(t,500,20180801,1200,0,0)
GribField(t,400,20180801,1200,0,0)
GribField(t,300,20180801,1200,0,0)

>>> subset = ds.isel(param=[1, 2], level=slice(2, 4))
>>> len(subset)
4

>>> for f in subset:
...     print(f)
...
GribField(u,700,20180801,1200,0,0)
GribField(v,700,20180801,1200,0,0)
GribField(u,500,20180801,1200,0,0)
GribField(v,500,20180801,1200,0,0)
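
The way index positions map onto metadata values can be sketched in plain Python. This is an illustration, not the library's code: the helper name positions_to_values is an assumption made for this example, and the index values are copied from the ds.indices output above, with the key written as level to match the .isel() call. Note that the slice here is an ordinary positional Python slice, so slice(2, 4) picks positions 2 and 3.

```python
# Index values copied from the example output above
indices = {
    "level": (1000, 850, 700, 500, 400, 300),
    "param": ("t", "u", "v"),
}


def positions_to_values(index_values, selector):
    """Translate an int, a list of ints, or a slice into metadata values."""
    if isinstance(selector, slice):
        return list(index_values[selector])
    if isinstance(selector, (list, tuple)):
        return [index_values[i] for i in selector]
    return [index_values[selector]]


params = positions_to_values(indices["param"], [1, 2])
levels = positions_to_values(indices["level"], slice(2, 4))
print(params, levels)
```

The resulting values (params u and v, levels 700 and 500) correspond to the four fields listed in the example above.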

Ordering with .order_by()

When an earthkit-data data source or dataset provides a FieldList or message list, the method .order_by() allows sorting this list.

.order_by() returns a “view” so no new data is generated on disk or in memory. The resulting object offers the same functionality as the original data object, so methods like .to_numpy(), .to_xarray(), etc. are all available.

For more details see: order_by()

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> len(ds)
6

>>> for f in ds.order_by("param"):
...     print(f)
...
GribField(t,850,20180801,1200,0,0)
GribField(t,1000,20180801,1200,0,0)
GribField(u,850,20180801,1200,0,0)
GribField(u,1000,20180801,1200,0,0)
GribField(v,850,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)

>>> for f in ds.order_by(["level", "param"]):
...     print(f)
...
GribField(t,850,20180801,1200,0,0)
GribField(u,850,20180801,1200,0,0)
GribField(v,850,20180801,1200,0,0)
GribField(t,1000,20180801,1200,0,0)
GribField(u,1000,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)

>>> for f in ds.order_by(param=["u", "t", "v"]):
...     print(f)
...
GribField(u,850,20180801,1200,0,0)
GribField(u,1000,20180801,1200,0,0)
GribField(t,850,20180801,1200,0,0)
GribField(t,1000,20180801,1200,0,0)
GribField(v,850,20180801,1200,0,0)
GribField(v,1000,20180801,1200,0,0)
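
Sorting by several keys and by an explicit value order can be sketched with the built-in sorted(). This is an illustration, not the library's code: the (param, level) tuples and the variable names are assumptions made for this example. Python's sorted() is stable, so elements that tie on the sort key keep their input order in this sketch; the order of ties in the library output above may differ.

```python
# Hypothetical stand-ins for the six fields of test6.grib: (param, level)
fields = [("t", 1000), ("u", 1000), ("v", 1000), ("t", 850), ("u", 850), ("v", 850)]

# Sort by level first, then by param, both ascending
by_level_param = sorted(fields, key=lambda f: (f[1], f[0]))

# Sort by an explicit param order instead of the alphabetical one
rank = {name: i for i, name in enumerate(["u", "t", "v"])}
by_custom_param = sorted(fields, key=lambda f: rank[f[0]])

print(by_level_param)
print([f[0] for f in by_custom_param])
```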

Accessing data values

We can extract the values from data objects as an ndarray using the .to_numpy() method or the .values property.

When an earthkit-data source provides a FieldList, these methods can be called on the whole object as well as on the individual fields.

While .to_numpy(), by default, preserves the shape of the fields, .values always returns a flat array per field. By using flatten=True, we can force .to_numpy() to return a flat ndarray per field.

For more details see: to_numpy().

In the following example the input GRIB data contains 6 fields each defined on a latitude-longitude grid with a shape of (7, 12).

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("file", "docs/examples/test6.grib")

>>> ds.to_numpy().shape
(6, 7, 12)
>>> ds.to_numpy(flatten=True).shape
(6, 84)
>>> ds.values.shape
(6, 84)

>>> for f in ds:
...     f.values.shape
...
(84,)
(84,)
(84,)
(84,)
(84,)
(84,)

>>> for f in ds:
...     f.to_numpy().shape
...
(7, 12)
(7, 12)
(7, 12)
(7, 12)
(7, 12)
(7, 12)

Accessing metadata

We can extract metadata from data objects using the .metadata() method.

When an earthkit-data source provides a FieldList or message list, this method can be called on the whole object as well as on the individual fields.

For more details see: FieldList.metadata() and Field.metadata()

Inspecting contents

On certain data objects (currently only GRIB and BUFR) we can call .ls(), .head() or .tail().

For more details see: ls().