Retrieving subsets from GRIB files via GribJump

This example demonstrates how the experimental gribjump source allows efficient retrieval of individual grid cells from GRIB messages stored in an FDB. The source is a thin wrapper around the Python bindings of GribJump.

[1]:
import os

import numpy as np

import earthkit.data

GribJump can retrieve ranges of grid cells for GRIB files in an FDB that were previously indexed by GribJump (e.g. using gribjump-scan). To use the gribjump source in earthkit-data, the environment must point to an FDB in addition to GribJump-specific environment variables.

⚠️ Please be aware that this source currently does not validate that the grid indices specified by the user actually correspond to the fields’ underlying grids. Make sure that every field matched by the specified FDB request uses the grid you expect. Because of this, we also need to tell GribJump to ignore any missing grid validation information via the GRIBJUMP_IGNORE_GRID environment variable.

[2]:
os.environ.setdefault("FDB_HOME", "<your fdb home directory>")
os.environ.setdefault("FDB5_CONFIG_FILE", "<your fdb5 config file>")
os.environ.setdefault("GRIBJUMP_CONFIG_FILE", "<your gribjump config file>")
os.environ.setdefault("GRIBJUMP_IGNORE_GRID", "1")
[2]:
'1'

How To Use

The gribjump source works similarly to the fdb source and receives a dictionary with an FDB request. Please note that the MARS syntax for ranges and lists using “/” is not supported. Only scalar values and Python lists are supported.
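If you have requests written with MARS-style “/” lists, they need to be turned into Python lists first. The following is a minimal sketch of one way to do that; expand_mars_lists is a hypothetical helper written for this example and is not part of earthkit-data or GribJump. It only handles plain lists and deliberately rejects the “to”/“by” range syntax.

```python
def expand_mars_lists(request):
    """Return a copy of `request` with "a/b/c" strings split into Python lists.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    expanded = {}
    for key, value in request.items():
        if isinstance(value, str) and "/" in value:
            parts = value.split("/")
            # MARS range syntax such as "0/to/24/by/6" is not handled here.
            if any(part.lower() in ("to", "by") for part in parts):
                raise ValueError(f"MARS range syntax is not handled: {key}={value!r}")
            expanded[key] = parts
        else:
            # Scalars and existing Python lists are passed through unchanged.
            expanded[key] = value
    return expanded


mars_style = {"param": "240023", "time": "0000/0600", "step": 6}
print(expand_mars_lists(mars_style))
# {'param': '240023', 'time': ['0000', '0600'], 'step': 6}
```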

The second required parameter is one of ranges, indices, or mask, which selects the grid cells to extract. For convenience, you can set the additional parameter fetch_coords_from_fdb=True. This makes one extra request directly to the FDB to retrieve the latitude and longitude of the retrieved cells and includes them in the metadata of those cells.

[ ]:
source = earthkit.data.from_source(
    "gribjump",
    {
        "class": "ce",
        "expver": "0001",
        "stream": "efcl",
        "date": "20230101",
        "model": "lisflood",
        "domain": "g",
        "origin": "ecmf",
        "step": 6,
        "type": "sfo",
        "levtype": "sfc",
        "param": "240023",
        "time": ["0000", "0600"],
        "hdate": ["20200101", "20200102"],
    },
    ranges=[(1234, 2345)],
    fetch_coords_from_fdb=True,
)
[4]:
source.ls()
Gribjump Engine: Built file map: 0.022177 second elapsed, 0.011457 second cpu
Starting 8 threads
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 0.334884 second elapsed, 0.162512 second cpu
Gribjump Engine: Repackaged results: 8e-06 second elapsed, 7e-06 second cpu
Engine::extract: 1.7e-05 second elapsed, 1.5e-05 second cpu
[4]:
param level base_datetime valid_datetime step number
0 240023 None 2020-01-01T00:00:00 2020-01-01T06:00:00 6 None
1 240023 None 2020-01-01T06:00:00 2020-01-01T12:00:00 6 None
2 240023 None 2020-01-02T00:00:00 2020-01-02T06:00:00 6 None
3 240023 None 2020-01-02T06:00:00 2020-01-02T12:00:00 6 None
[5]:
ds = source.to_xarray()
ds
[5]:
<xarray.Dataset> Size: 62kB
Dimensions:                  (forecast_reference_time: 4, index: 1111)
Coordinates:
  * forecast_reference_time  (forecast_reference_time) datetime64[ns] 32B 202...
    latitude                 (index) float64 9kB ...
    longitude                (index) float64 9kB ...
  * index                    (index) int64 9kB 1234 1235 1236 ... 2342 2343 2344
Data variables:
    240023                   (forecast_reference_time, index) float64 36kB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    stream:       efcl
    levtype:      sfc
    type:         sfo
    expver:       0001
    ...           ...
    hdate:        20200101
    time:         0000
    origin:       ecmf
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF

Selection and Groupings

The gribjump source offers limited support for the selection methods (.sel() and .isel()), the grouping method (.group_by()), and anything else implemented for a SimpleFieldList. However, please keep in mind that the only metadata available for these operations comes from the specified FDB request dictionary. Any selection value must therefore match the type of the corresponding value in the dictionary supplied by the user.
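Because selection values have to match the types used in the request dictionary, it can help to cast them before calling .sel(). The sketch below shows one possible way to do this in plain Python; coerce_like_request is a hypothetical helper for this example only, and it cannot restore zero-padding (an integer 0 would become "0", not "0000").

```python
def coerce_like_request(request, **selection):
    """Cast each selection value to the type used for that key in `request`.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    coerced = {}
    for key, value in selection.items():
        reference = request[key]
        # For list-valued keys, take the type of the first element.
        if isinstance(reference, (list, tuple)):
            reference = reference[0]
        coerced[key] = type(reference)(value)
    return coerced


request = {"hdate": ["20200101", "20200102"], "time": ["0000", "0600"], "step": 6}
print(coerce_like_request(request, hdate=20200101, step="6"))
# {'hdate': '20200101', 'step': 6}
```

The resulting dictionary could then be unpacked into the selection call, for example source.sel(**coerced).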

[6]:
groups = source.sel(hdate="20200101").group_by("time")
for group in groups:
    print(group, group.to_numpy().shape, group.metadata("base_datetime"))
data=SimpleFieldList(2) 2
SimpleFieldList(1) (1, 1111) ['2020-01-01T00:00:00']
SimpleFieldList(1) (1, 1111) ['2020-01-01T06:00:00']

Extraction Options

You can specify the extraction points through one of three options. GribJump treats all fields as flattened 1D arrays and all coordinates on the grid must assume this representation.

  • Ranges: A list of tuples (start, end) defining contiguous ranges of grid points to extract. As shown in the example above, each tuple specifies a start index (inclusive) and end index (exclusive) in the flattened 1D array representation of the grid. For example, [(0, 100), (200, 300)] would extract grid points 0-99 and 200-299.

  • Indices: A 1D numpy array or list of specific grid point indices to extract from the flattened grid. This allows for non-contiguous extraction of individual grid points. For example, np.array([5, 10, 15, 20]) would extract exactly those four grid points. This array must be sorted in ascending order.

  • Masks: A numpy boolean array where True indicates grid points to extract and False indicates points to skip. The mask must have the same length as the total number of grid points in the field. However, no such validation is performed and passing a mask with an invalid shape will silently return wrong results.
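Since the mask length is not validated, it is worth checking it yourself before making a request. The following sketch builds a mask for a rectangular block of cells on a small, made-up 2D grid and flattens it to 1D. The grid shape is hypothetical, and row-major flattening is an assumption: whether it matches the ordering of your fields depends on how the GRIB data is laid out.

```python
import numpy as np

# Hypothetical grid shape (rows, columns); replace with your field's real grid.
n_rows, n_cols = 6, 8

# Select a rectangular block of cells in 2D.
mask_2d = np.zeros((n_rows, n_cols), dtype=bool)
mask_2d[2:4, 3:6] = True

# Flatten in row-major (C) order. This ordering is an assumption.
mask = mask_2d.ravel()

# Manual check, since the source does not validate the mask length.
assert mask.shape == (n_rows * n_cols,)

# Flattened indices of the selected cells.
print(np.flatnonzero(mask))
# [19 20 21 27 28 29]
```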

Only one of these methods can be used at a time. Please also note that GribJump uses ranges internally regardless of what the user specifies. Converting the user’s chosen representation to ranges can be expensive when multiple fields are accessed simultaneously.
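To illustrate what such a conversion involves, here is a minimal NumPy sketch that merges sorted indices (or a boolean mask) into half-open (start, end) ranges. It is not GribJump's or earthkit-data's actual implementation; indices_to_ranges and mask_to_ranges are hypothetical names used only for this example.

```python
import numpy as np


def indices_to_ranges(indices):
    """Merge sorted, unique indices into half-open (start, end) ranges."""
    indices = np.asarray(indices, dtype=np.int64)
    if indices.size == 0:
        return []
    # A new range starts wherever consecutive indices are not adjacent.
    breaks = np.flatnonzero(np.diff(indices) != 1) + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks, [indices.size]))
    return [(int(indices[s]), int(indices[e - 1]) + 1) for s, e in zip(starts, ends)]


def mask_to_ranges(mask):
    """Convert a 1D boolean mask into half-open (start, end) ranges."""
    return indices_to_ranges(np.flatnonzero(mask))


print(indices_to_ranges([5, 6, 7, 10, 20, 21]))
# [(5, 8), (10, 11), (20, 22)]
print(mask_to_ranges(np.array([True, True, False, True])))
# [(0, 2), (3, 4)]
```

If you already know your selection as contiguous blocks, passing ranges directly avoids this step.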

Code Examples

[7]:
request = {
    "class": "ce",
    "expver": "0001",
    "stream": "efcl",
    "date": "20230101",
    "model": "lisflood",
    "domain": "g",
    "origin": "ecmf",
    "step": 6,
    "type": "sfo",
    "levtype": "sfc",
    "param": "240023",
    "time": "0000",
    "hdate": "20200101",
}

# Example 1: Using ranges
source_ranges = earthkit.data.from_source(
    "gribjump",
    request,
    ranges=[(1234, 2345), (3456, 4567)],
)
ds = source_ranges.to_xarray()
print("Extracted dataset (ranges):", ds)

# Example 2: Using indices to extract specific grid points
indices = np.array([10, 50, 100, 150, 200])
source_indices = earthkit.data.from_source(
    "gribjump",
    request,
    indices=indices,
)
print("Extracted dataset (indices):", source_indices.to_xarray())

# Example 3: Using a boolean mask with random selection
shape = 4530 * 2970  # Depends on your grid size
mask = np.random.choice([True, False], size=shape, p=[0.05, 0.95])

source_mask = earthkit.data.from_source(
    "gribjump",
    request,
    mask=mask,
)
print("Extracted dataset (mask):", source_mask.to_xarray())
Gribjump Engine: Built file map: 0.010474 second elapsed, 0.008713 second cpu
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 0.039335 second elapsed, 0.039178 second cpu
Gribjump Engine: Repackaged results: 6e-06 second elapsed, 5e-06 second cpu
Engine::extract: 2e-05 second elapsed, 2e-05 second cpu
Extracted dataset (ranges): <xarray.Dataset> Size: 36kB
Dimensions:  (index: 2222)
Coordinates:
  * index    (index) int64 18kB 1234 1235 1236 1237 1238 ... 4563 4564 4565 4566
Data variables:
    240023   (index) float64 18kB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    stream:       efcl
    levtype:      sfc
    type:         sfo
    expver:       0001
    ...           ...
    hdate:        20200101
    time:         0000
    origin:       ecmf
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF
Gribjump Engine: Built file map: 0.009283 second elapsed, 0.007779 second cpu
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 0.039215 second elapsed, 0.038721 second cpu
Gribjump Engine: Repackaged results: 5e-06 second elapsed, 5e-06 second cpu
Engine::extract: 2.3e-05 second elapsed, 2.2e-05 second cpu
Extracted dataset (indices): <xarray.Dataset> Size: 80B
Dimensions:  (index: 5)
Coordinates:
  * index    (index) int64 40B 10 50 100 150 200
Data variables:
    240023   (index) float64 40B ...
Attributes: (12/13)
    param:        240023
    class:        ce
    stream:       efcl
    levtype:      sfc
    type:         sfo
    expver:       0001
    ...           ...
    hdate:        20200101
    time:         0000
    origin:       ecmf
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF
Gribjump Engine: Built file map: 0.012851 second elapsed, 0.009124 second cpu
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 1 second elapsed, 1 second cpu
Gribjump Engine: Repackaged results: 6e-06 second elapsed, 6e-06 second cpu
Engine::extract: 2.7e-05 second elapsed, 2.6e-05 second cpu
Extracted dataset (mask): <xarray.Dataset> Size: 11MB
Dimensions:  (index: 672975)
Coordinates:
  * index    (index) int64 5MB 10 11 32 41 ... 13454079 13454087 13454093
Data variables:
    240023   (index) float64 5MB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    stream:       efcl
    levtype:      sfc
    type:         sfo
    expver:       0001
    ...           ...
    hdate:        20200101
    time:         0000
    origin:       ecmf
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF