Retrieving data from FDB

[1]:

import earthkit.data

FDB (Fields DataBase) is a domain-specific object store developed at ECMWF for storing, indexing and retrieving GRIB data. For more information on FDB, please consult the following pages:

FDB
pyfdb

FDB support in earthkit-data requires both FDB and pyfdb to be installed.

This example requires the FDB_HOME environment variable to be set correctly.
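
A quick pre-flight check can catch a missing FDB_HOME before any retrieval is attempted. The sketch below uses only the standard library. The helper name fdb_home_is_set is hypothetical and not part of earthkit-data or pyfdb, and it only checks that the variable points to an existing directory, not that the FDB configuration inside it is valid.

```python
import os


def fdb_home_is_set(env=None):
    """Return True if FDB_HOME is defined and points to an existing directory.

    Hypothetical helper for illustration; not part of earthkit-data or pyfdb.
    """
    env = os.environ if env is None else env
    path = env.get("FDB_HOME", "")
    return bool(path) and os.path.isdir(path)


if not fdb_home_is_set():
    print("FDB_HOME is not set or does not point to a directory")
```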

The following request was written to retrieve data from the operational FDB at ECMWF. Please note that the date must be adjusted since FDB at ECMWF only stores the most recent dates.

[2]:

request = {
    "class": "od",
    "expver": "0001",
    "stream": "oper",
    "date": "20260423",
    "time": [0, 12],
    "domain": "g",
    "type": "an",
    "levtype": "sfc",
    "step": 0,
    "param": [151, 167, 168],
}
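
Since the date has to be adjusted to one that is still available, it can be convenient to compute it instead of hard-coding it. The following is a minimal sketch using the standard library. The helper name recent_date is hypothetical, and the assumption that data from one day back is available in the FDB you are querying must be checked for your own setup.

```python
import datetime


def recent_date(days_back=1, today=None):
    """Return a YYYYMMDD string for the date `days_back` days before `today`.

    Hypothetical helper; `today` defaults to the current local date.
    """
    today = datetime.date.today() if today is None else today
    return (today - datetime.timedelta(days=days_back)).strftime("%Y%m%d")


# Fixed reference date so that the result is reproducible
print(recent_date(days_back=1, today=datetime.date(2026, 4, 24)))  # 20260423
```

The result can then be assigned to the request with request["date"] = recent_date().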

Reading as a stream

By default, from_source() retrieves data from an FDB source as a stream.

Iteration with one field at a time in memory

When we use the default arguments in from_source(), the resulting object can only be used for iteration, and only one field is kept in memory at a time. Fields created during the iteration are deleted when they go out of scope.

[3]:

ds = earthkit.data.from_source("fdb", request=request).to_fieldlist()
for f in ds:
    print(f)

Field(msl, 2026-04-23 00:00:00, 2026-04-23 00:00:00, 0:00:00, 0, surface, 0, reduced_gg)
Field(2t, 2026-04-23 00:00:00, 2026-04-23 00:00:00, 0:00:00, 0, surface, 0, reduced_gg)
Field(2d, 2026-04-23 00:00:00, 2026-04-23 00:00:00, 0:00:00, 0, surface, 0, reduced_gg)
Field(msl, 2026-04-23 12:00:00, 2026-04-23 12:00:00, 0:00:00, 0, surface, 0, reduced_gg)
Field(2t, 2026-04-23 12:00:00, 2026-04-23 12:00:00, 0:00:00, 0, surface, 0, reduced_gg)
Field(2d, 2026-04-23 12:00:00, 2026-04-23 12:00:00, 0:00:00, 0, surface, 0, reduced_gg)

Once the iteration is completed, there is nothing left in ds.

[4]:

sum([1 for _ in ds])

[4]:

0
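
The one-pass behaviour is the same as that of a plain Python generator. The sketch below is only an analogy built with the standard library and does not use earthkit-data. The function field_stream and the list of names are made up for illustration.

```python
def field_stream(names):
    """Yield items one at a time, like a stream that cannot be rewound."""
    for name in names:
        yield name


stream = field_stream(["msl", "2t", "2d"])

first_pass = [name for name in stream]  # consumes the stream
second_pass = sum(1 for _ in stream)    # nothing is left to iterate over

print(first_pass)   # ['msl', '2t', '2d']
print(second_pass)  # 0
```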

Iteration with group_by

When we use the group_by() method, we can iterate through the stream in groups defined by metadata keys. Each iteration step results in a SimpleFieldList object, which is built by consuming GRIB messages from the stream until the values of the metadata keys change. The generated FieldList keeps its GRIB messages in memory and is deleted when it goes out of scope.

[5]:

ds = earthkit.data.from_source("fdb", request=request).to_fieldlist()
for f in ds.group_by("metadata.time"):
    print(f"len={len(f)} {f.metadata(('param', 'level'))}")

len=3 [('msl', 0), ('2t', 0), ('2d', 0)]
len=3 [('msl', 0), ('2t', 0), ('2d', 0)]
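
Grouping by "consuming messages until the key changes" works like itertools.groupby from the standard library, which only merges consecutive items. The sketch below is an analogy and not the earthkit-data implementation. The (param, time) tuples stand in for GRIB messages, and the time values 0 and 1200 are illustrative.

```python
from itertools import groupby

# Stand-ins for GRIB messages: (param, time) pairs in stream order
messages = [
    ("msl", 0), ("2t", 0), ("2d", 0),
    ("msl", 1200), ("2t", 1200), ("2d", 1200),
]

# groupby starts a new group each time the key differs from the previous
# item, so it never needs to look beyond the next item of the stream
groups = [
    [param for param, _ in group]
    for _, group in groupby(iter(messages), key=lambda m: m[1])
]

for g in groups:
    print(f"len={len(g)} {g}")
```

In this stand-in, a stream that is not ordered by the grouping key would yield several groups for the same key value, because only consecutive items are merged.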

Iteration with batched

When we use the batched() method we can iterate through the stream in batches of fixed size. In this example we create a stream and read 2 fields from it at a time.

[6]:

ds = earthkit.data.from_source("fdb", request=request).to_fieldlist()
for f in ds.batched(2):
    print(f"len={len(f)} {f.metadata(('param', 'level'))}")

len=2 [('msl', 0), ('2t', 0)]
len=2 [('2d', 0), ('msl', 0)]
len=2 [('2t', 0), ('2d', 0)]
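
The same fixed-size batching can be sketched for any one-pass iterable with itertools.islice. This is an analogy using the standard library only and is not how earthkit-data implements batched(). The function below and the list of parameter names are illustrative.

```python
from itertools import islice


def batched(iterable, n):
    """Yield lists of up to n items, reading the iterable only once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


# Stand-ins for the six fields of the stream, in arrival order
params = ["msl", "2t", "2d", "msl", "2t", "2d"]
batches = list(batched(params, 2))

for b in batches:
    print(f"len={len(b)} {b}")
```

In this stand-in, the last batch is shorter than n when the number of items is not a multiple of n.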

Storing all the fields in memory

We can load the whole stream into memory by using read_all=True in to_fieldlist(). The resulting object will be a FieldList.

[7]:

ds = earthkit.data.from_source("fdb", request=request).to_fieldlist(read_all=True)

[8]:

len(ds)

[8]:

6

[9]:

ds.ls()

[9]:

	parameter.variable	time.valid_datetime	time.base_datetime	vertical.level_type	geography.grid_type
0	msl	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
1	2t	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
2	2d	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
3	msl	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg
4	2t	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg
5	2d	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg

[10]:

ds.sel({"parameter.variable": "2t"}).ls()

[10]:

	parameter.variable	time.valid_datetime	time.base_datetime	time.step	vertical.level	vertical.level_type	ensemble.member	geography.grid_type
0	2t	2026-04-23 00:00:00	2026-04-23 00:00:00	0 days	0	surface	0	reduced_gg
1	2t	2026-04-23 12:00:00	2026-04-23 12:00:00	0 days	0	surface	0	reduced_gg

[11]:

ds.to_xarray()

[11]:

<xarray.Dataset> Size: 422MB
Dimensions:                  (forecast_reference_time: 2, values: 6599680)
Coordinates:
  * forecast_reference_time  (forecast_reference_time) datetime64[ns] 16B 202...
    latitude                 (values) float64 53MB ...
    longitude                (values) float64 53MB ...
Dimensions without coordinates: values
Data variables:
    2d                       (forecast_reference_time, values) float64 106MB ...
    2t                       (forecast_reference_time, values) float64 106MB ...
    msl                      (forecast_reference_time, values) float64 106MB ...
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF
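
The sizes reported in the dataset summary can be checked by hand, which helps to estimate the memory needed before reading a larger request into memory. The sketch below only redoes the arithmetic from the numbers shown above, assuming 8 bytes per element for both float64 and datetime64[ns] and decimal megabytes (1 MB = 10^6 bytes).

```python
n_values = 6599680   # size of the "values" dimension
n_times = 2          # size of the "forecast_reference_time" dimension
n_vars = 3           # data variables: 2d, 2t, msl
bytes_per_item = 8   # float64 and datetime64[ns] both use 8 bytes

coord_bytes = n_values * bytes_per_item           # latitude or longitude
var_bytes = n_times * n_values * bytes_per_item   # one data variable
time_bytes = n_times * bytes_per_item             # time coordinate

total_bytes = n_vars * var_bytes + 2 * coord_bytes + time_bytes

print(f"coordinate: {coord_bytes / 1e6:.0f} MB")
print(f"variable:   {var_bytes / 1e6:.0f} MB")
print(f"total:      {total_bytes / 1e6:.0f} MB")
```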

Reading into a file

With stream=False, we can retrieve data from FDB into a file, which is stored in the cache:

[12]:

ds = earthkit.data.from_source("fdb", request=request, stream=False).to_fieldlist()

[13]:

ds.ls()

[13]:

	parameter.variable	time.valid_datetime	time.base_datetime	vertical.level_type	geography.grid_type
0	msl	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
1	2t	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
2	2d	2026-04-23 00:00:00	2026-04-23 00:00:00	surface	reduced_gg
3	msl	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg
4	2t	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg
5	2d	2026-04-23 12:00:00	2026-04-23 12:00:00	surface	reduced_gg

The data is now cached. Subsequent retrievals will use the cached file directly.
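
A cache of this kind needs a way to recognise that two retrievals ask for the same data. How earthkit-data does this internally is not covered here. The sketch below is only a generic illustration of deriving a stable key from a request dictionary, and the function name cache_key is hypothetical.

```python
import hashlib
import json


def cache_key(request):
    """Return a stable hex digest for a request dictionary.

    Hypothetical helper; not the earthkit-data cache implementation.
    """
    # sort_keys makes the key independent of the order of the dict entries
    payload = json.dumps(request, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key({"class": "od", "date": "20260423", "param": [151, 167, 168]})
b = cache_key({"param": [151, 167, 168], "date": "20260423", "class": "od"})
c = cache_key({"class": "od", "date": "20260424", "param": [151, 167, 168]})

print(a == b)  # True: same content, different key order
print(a == c)  # False: the date differs
```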
