GRIB field memory management

Warning

This guide is currently under construction and may be incomplete or inaccurate.

grib is a message-based binary format, where each message is regarded as a field. For reading GRIB, earthkit-data relies on ecCodes, which, when loading a message into memory, represents it as a GRIB handle. In the low level API, the GRIB handle is the object that holds the data and metadata of a GRIB field, therefore it can use up a significant amount of memory.

Determining when a GRIB handle needs to be created and when it can be released is important for memory management. Earthkit-data provides several config options to control this behaviour depending on how we actually read the data.

Reading GRIB data as a stream iterator

We can read grib data as a stream iterator e.g. with the following code:

import earthkit.data as ekd

url = "https://sites.ecmwf.int/repository/earthkit-data/how-tos/test6.grib"
ds = ekd.from_source("url", url, stream=True)
for f in fields:
    print(f)

Here, field f is not attached to a fieldlist and only exists in the scope of the iteration (in the for loop). During its existence the field keeps the GRIB handle in memory and if used in the way shown above, only one field can exist at a time. Once the stream is consumed there is no way to access the data again (unless we read it with from_source() again).

Reading all GRIB data from a stream into memory

We can load grib data fully into memory when we read it as a stream with the read_all=True option in from_source().

import earthkit.data as ekd

url = "https://sites.ecmwf.int/repository/earthkit-data/how-tos/test6.grib"
ds = ekd.from_source("url", url, stream=True, read_all=True)

With this, the entire ds fieldlist, including all the fields and the related GRIB handles, are stored in memory.

Reading data from disk and managing its memory

When reading grib data from disk as a file source, it is represented as a fieldlist and loaded lazily. After the (fast) initial scan for field offsets and lengths, no actual fields are created and no data is read into memory. When we start using the fieldlist, e.g. by iterating over the fields, accessing data or metadata etc., the fields will be created on demand and the related GRIB handles will be loaded from disk when needed. Whether this data or part of it stays in memory depends on the following config:

grib-field-policy

Controls whether fields are kept in memory. The default is "persistent". The possible values are:

  • "persistent": fields are kept in memory until the fieldlist is deleted

  • "temporary": fields are deleted when they go out of scope and recreated on demand

The actual memory used by a field depends on whether it owns the GRIB handle of the related GRIB message. This is controlled by the grib-handle-policy config option.

A field can also cache its metadata access for performance, thus increasing memory usage. This is controlled by the use-grib-metadata-cache config option.

grib-handle-policy

Controls whether GRIB handles are kept in memory. The default is "cache". The possible values are:

  • "cache": a separate in-memory LRU cache is created for the GRIB handles in the fieldlist. The maximum number of GRIB handles kept in this cache is controlled by grib-handle-cache-size. In this mode, field objects are lightweight and only store the GRIB handle cache index, and can only access the GRIB handles via the cache.

  • "persistent": once a GRIB handle is created, a field keeps it in memory until the field is deleted

  • "temporary": for each call to data and metadata access on a field, a new GRIB handle is created and released once the access has finished.

grib-handle-cache-size

When grib-handle-policy is "cache", the config option grib-handle-cache-size (default is 1) specifies the maximum number of GRIB handles kept in an in-memory cache per fieldlist. This is an LRU cache, so when it is full, the least recently used GRIB handle is removed and a new GRIB message is loaded from disk and added to the cache.

Overriding the configuration

In addition to changing the Configuration, it is possible to override the parameters discussed above when loading a given fieldlist by passing them as keyword arguments to from_source(). The parameter names are the same but the dashes are replaced by underscores. When a parameter is not specified in from_source() or is set to None, its value is taken from the actual Configuration. E.g.:

import earthkit.data as ekd

ds = ekd.from_source(
    "file",
    "test6.grib",
    grib_field_policy="persistent",
    grib_handle_policy="temporary",
    grib_handle_cache_size=0,
)

Reading data from disk as a stream

Whilst the usual way of reading GRIB data from disk loads fields lazily (i.e. only when they are actually used), it is also possible to read all fields up-front and keep them in memory by reading it as a stream source with the read_all=True option.

import earthkit.data as ekd

f = open("test6.grib", "rb")
ds = ekd.from_source("stream", f, read_all=True)

Warning

Use this option carefully since your data might not fit into memory.

Note

The default config options are chosen to keep the memory usage low and the performance high. However, depending on the use case, the configuration can be adjusted to optimize the memory usage and performance.