Reading data parts from URLs

[1]:
import earthkit.data as ekd

This notebook demonstrates how to download only parts (byte ranges) from URLs.

First we download the whole file and inspect its contents with ls(). The “metadata.offset” key gives the byte position where each message starts within the file.

[2]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib")
ds.to_fieldlist().ls(keys="metadata.offset")
[2]:
metadata.offset
0 0.0
1 240.0
2 480.0
3 720.0
4 960.0
5 1200.0
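
The offsets above can be turned into (offset, length) pairs, one per message, by taking the difference between consecutive offsets. The following is a minimal sketch in plain Python; the helper name parts_from_offsets is hypothetical, and the total file size of 1440 bytes is an assumption (it is not shown in the listing, which only suggests that messages are 240 bytes apart).

```python
# Offsets as listed by ls(keys="metadata.offset") above, written as integers.
offsets = [0, 240, 480, 720, 960, 1200]

# Assumption: total file size in bytes. It is not part of the listing above.
file_size = 1440


def parts_from_offsets(offsets, file_size):
    """Return one (offset, length) tuple per message (hypothetical helper)."""
    # The end of each message is the start of the next one;
    # the last message ends at the end of the file.
    ends = list(offsets[1:]) + [file_size]
    return [(start, end - start) for start, end in zip(offsets, ends)]


message_parts = parts_from_offsets(offsets, file_size)
print(message_parts)
```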

Single files

The parts option in from_source() specifies the byte range(s) we want to read from a remote file. A single part is a tuple or list in the following format: (offset, length).

Using the offsets from the example above we can specify the part for the first message.

[3]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=(0, 240))
ds.to_fieldlist().ls()
[3]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll

The call above can also be written as:

[4]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240)])
ds.to_fieldlist().ls()
[4]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
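
For orientation, a byte range such as (0, 240) is commonly expressed over HTTP as a Range header with an inclusive end position. The sketch below only illustrates that conversion; the helper range_header is hypothetical, and we do not claim this is how earthkit-data issues its requests internally.

```python
def range_header(part):
    """Convert an (offset, length) part into an HTTP Range header (hypothetical helper)."""
    offset, length = part
    if offset < 0 or length <= 0:
        raise ValueError(f"invalid part: offset={offset}, length={length}")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


print(range_header((0, 240)))
```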

A part can extend past a message boundary. Here bytes 240-244 belong to the second message, which is not read because the part does not cover all of its bytes.

[5]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 245)])
ds.to_fieldlist().ls()
[5]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll

Multiple parts can be specified as a list. Here the second part, (480, 480), spans two consecutive messages.

[6]:
ds = ekd.from_source(
    "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (480, 480)]
)
ds.to_fieldlist().ls()
[6]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
1 v 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
2 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 850 pressure 0 regular_ll
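
The behaviour seen in the last two cells can be modelled with a small helper that reports which messages are fully covered by the requested parts. This is a sketch in plain Python under stated assumptions, not the library's implementation: covered_messages is a hypothetical name, and every message is assumed to be 240 bytes long (the length of the last message is not shown in the listing).

```python
# Message start positions from the "metadata.offset" listing.
offsets = [0, 240, 480, 720, 960, 1200]

# Assumption: every message, including the last one, is 240 bytes long.
lengths = [240] * len(offsets)


def covered_messages(offsets, lengths, parts):
    """Return the indices of messages lying entirely inside one of the parts."""
    selected = []
    for index, (start, length) in enumerate(zip(offsets, lengths)):
        end = start + length
        for part_offset, part_length in parts:
            # A message counts only if all of its bytes fall within the part.
            if part_offset <= start and end <= part_offset + part_length:
                selected.append(index)
                break
    return selected


print(covered_messages(offsets, lengths, [(0, 245)]))
print(covered_messages(offsets, lengths, [(0, 240), (480, 480)]))
```

Under these assumptions the first call selects only message 0 and the second selects messages 0, 2 and 3, matching the number of fields listed in the outputs above.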

Parts must be listed in ascending order and cannot overlap.

[7]:
try:
    ds = ekd.from_source(
        "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (220, 240)]
    )
except Exception as e:
    print(e)
Offsets and lengths must be in order, and not overlapping: offset=220, end of previous part=240
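
The rule can be checked before making any request. The following sketch uses a hypothetical helper, check_parts, that mirrors the condition described by the error message; the actual validation performed by earthkit-data may differ in detail.

```python
def check_parts(parts):
    """Raise ValueError if parts are out of order or overlap (hypothetical helper)."""
    previous_end = 0
    for offset, length in parts:
        # Each part must start at or after the end of the previous one.
        if offset < previous_end:
            raise ValueError(
                f"part at offset={offset} starts before the end of the previous part={previous_end}"
            )
        previous_end = offset + length
    return list(parts)


check_parts([(0, 240), (480, 480)])  # valid: ordered, with a gap between parts

try:
    check_parts([(0, 240), (220, 240)])
except ValueError as e:
    print(e)
```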

Multiple files

When using multiple URLs we can specify the part for each file with the following syntax:

[8]:
ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", (0, 526)],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()
[8]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 2t 2020-05-13 12:00:00 2020-05-13 12:00:00 0 days 0 surface 0 regular_ll
1 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
2 v 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll

When the part is None for a given file, the whole file is used.

[9]:
ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", None],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()
[9]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 2t 2020-05-13 12:00:00 2020-05-13 12:00:00 0 days 0 surface 0 regular_ll
1 msl 2020-05-13 12:00:00 2020-05-13 12:00:00 0 days 0 surface 0 regular_ll
2 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
3 v 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll

The parts keyword argument can still be used with multiple files; in this case the same parts are applied to each file in turn.

[10]:
ds = ekd.from_source(
    "url",
    [
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/tuv_pl.grib",
    ],
    parts=(0, 240),
)
ds.to_fieldlist().ls()
[10]:
parameter.variable time.valid_datetime time.base_datetime time.step vertical.level vertical.level_type ensemble.member geography.grid_type
0 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
1 t 2018-08-01 12:00:00 2018-08-01 12:00:00 0 days 1000 pressure 0 regular_ll
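
Based on the description above, sharing one parts value across several URLs should presumably select the same bytes as pairing each URL with that value explicitly. The sketch below builds the per-file form in plain Python; per_file_parts is a hypothetical helper, and the equivalence of the two forms is an assumption that we have not verified against the library.

```python
urls = [
    "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
    "https://sites.ecmwf.int/repository/earthkit-data/examples/tuv_pl.grib",
]


def per_file_parts(urls, parts):
    """Pair every URL with the same parts value (hypothetical helper)."""
    return [[url, parts] for url in urls]


spec = per_file_parts(urls, (0, 240))
for url, parts in spec:
    print(url, parts)
```

The resulting list has the same shape as the second argument passed to from_source() in the per-file examples earlier in this section.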