Reading data parts from URLs

[1]:

import earthkit.data as ekd

This notebook demonstrates how to download only parts (byte ranges) from URLs.

First we download a whole file and inspect its contents with ls(). The "metadata.offset" key gives the byte position at which each message starts within the file.

[2]:

ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib")
ds.to_fieldlist().ls(keys="metadata.offset")

[2]:

	metadata.offset
0	0.0
1	240.0
2	480.0
3	720.0
4	960.0
5	1200.0
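A byte range on a remote file is commonly requested with an HTTP Range header, whose first and last byte positions are both inclusive. The sketch below shows how a range starting at a given offset and spanning a given number of bytes could be written as such a header. range_header is a hypothetical helper written for this notebook, and whether earthkit-data issues requests in exactly this form is an assumption, not something shown here.

```python
def range_header(offset, length):
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    last_byte = offset + length - 1
    return {"Range": f"bytes={offset}-{last_byte}"}


# The first and third messages listed above start at 0 and 480;
# each is 240 bytes long (the distance to the next offset).
print(range_header(0, 240))
print(range_header(480, 240))
```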

Single files

The parts option in from_source() specifies the byte range(s) we want to read from a remote file. A single part is a tuple or list in the following format: (offset, length).

Using the offsets from the example above we can specify the part for the first message.

[3]:

ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=(0, 240))
ds.to_fieldlist().ls()

[3]:

	parameter.variable	time.valid_datetime	time.base_datetime	time.step	vertical.level	vertical.level_type	ensemble.member	geography.grid_type
0	t	2018-08-01 12:00:00	2018-08-01 12:00:00	0 days	1000	pressure	0	regular_ll

The call above can also be written as:

[4]:

ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240)])
ds.to_fieldlist().ls()

[4]:

	parameter.variable	time.valid_datetime	time.base_datetime	time.step	vertical.level	vertical.level_type	ensemble.member	geography.grid_type
0	t	2018-08-01 12:00:00	2018-08-01 12:00:00	0 days	1000	pressure	0	regular_ll
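Since each message ends where the next one starts, the offsets listed at the top of the notebook can be converted into one (offset, length) part per message. The following is a minimal sketch: parts_from_offsets is a hypothetical helper, not part of earthkit-data, and the total file size of 1440 bytes is an assumption (six messages of 240 bytes each) that is only needed to size the last message.

```python
def parts_from_offsets(offsets, total_size):
    """Return one (offset, length) pair per message.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    # ls() reported the offsets as floats, so convert them to ints first.
    starts = [int(o) for o in offsets]
    # Each message ends where the next begins; the last ends at total_size.
    ends = starts[1:] + [int(total_size)]
    return [(start, end - start) for start, end in zip(starts, ends)]


offsets = [0.0, 240.0, 480.0, 720.0, 960.0, 1200.0]
# total_size=1440 is an assumed value, see the note above.
parts = parts_from_offsets(offsets, total_size=1440)
print(parts[0])  # the part for the first message: (0, 240)
```

Any element of the resulting list has the same form as the parts value passed to from_source() in the cells above.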

A part can extend past a message boundary. Here bytes 240-244 belong to the second message, which is not read because the part does not cover all of its bytes.

[5]:

ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 245)])
ds.to_fieldlist().ls()

[5]:

	parameter.variable	time.valid_datetime	time.base_datetime	time.step	vertical.level	vertical.level_type	ensemble.member	geography.grid_type
0	t	2018-08-01 12:00:00	2018-08-01 12:00:00	0 days	1000	pressure	0	regular_ll
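The rule illustrated above, that a message is only read when the part covers all of its bytes, can be modelled with a few lines of plain Python. This is a sketch of the described behaviour only, not the library's decoding logic; fully_covered is a hypothetical helper and the message spans are taken from the offsets listed earlier, with each message assumed to be 240 bytes long.

```python
def fully_covered(message_spans, part):
    """Return the indices of messages lying entirely inside `part`.

    Hypothetical helper modelling the behaviour described in the text.
    """
    part_start, part_length = part
    part_end = part_start + part_length  # exclusive end
    return [
        index
        for index, (offset, length) in enumerate(message_spans)
        if offset >= part_start and offset + length <= part_end
    ]


# (offset, length) of the first three messages in test6.grib
messages = [(0, 240), (240, 240), (480, 240)]

print(fully_covered(messages, (0, 245)))  # only the first message
print(fully_covered(messages, (0, 480)))  # the first two messages
```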

Multiple parts can be specified as a list. Here the second part, (480, 480), spans two consecutive messages.

[6]:

ds = ekd.from_source(
    "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (480, 480)]
)
ds.to_fieldlist().ls()

[6]:

	parameter.variable	time.valid_datetime	time.base_datetime	vertical.level	vertical.level_type	geography.grid_type
0	t	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll
1	v	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll
2	t	2018-08-01 12:00:00	2018-08-01 12:00:00	850	pressure	regular_ll

Parts must be listed in increasing offset order and cannot overlap.

[7]:

try:
    ds = ekd.from_source(
        "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (220, 240)]
    )
except Exception as e:
    print(e)

Offsets and lengths must be in order, and not overlapping: offset=220, end of previous part=240
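To catch such problems before any request is made, the parts can be checked locally. The sketch below is our own check written for illustration, not the one earthkit-data performs; check_parts is a hypothetical name, and allowing a part to start exactly where the previous one ends is an assumption.

```python
def check_parts(parts):
    """Raise ValueError unless parts are in order and non-overlapping.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    previous_end = 0
    for offset, length in parts:
        if offset < 0 or length <= 0:
            raise ValueError(f"invalid part: offset={offset}, length={length}")
        # A part may start at previous_end (adjacent) but not before it.
        if offset < previous_end:
            raise ValueError(
                f"part at offset={offset} starts before "
                f"end of previous part={previous_end}"
            )
        previous_end = offset + length
    return list(parts)


print(check_parts([(0, 240), (480, 480)]))  # accepted

try:
    check_parts([(0, 240), (220, 240)])
except ValueError as e:
    print(e)
```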

Multiple files

When using multiple URLs, the parts for each file can be given by pairing each URL with its parts in a list:

[8]:

ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", (0, 526)],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()

[8]:

	parameter.variable	time.valid_datetime	time.base_datetime	vertical.level	vertical.level_type	geography.grid_type
0	2t	2020-05-13 12:00:00	2020-05-13 12:00:00	0	surface	regular_ll
1	t	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll
2	v	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll

When the parts for a given file are None, the whole file is read.

[9]:

ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", None],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()

[9]:

	parameter.variable	time.valid_datetime	time.base_datetime	vertical.level	vertical.level_type	geography.grid_type
0	2t	2020-05-13 12:00:00	2020-05-13 12:00:00	0	surface	regular_ll
1	msl	2020-05-13 12:00:00	2020-05-13 12:00:00	0	surface	regular_ll
2	t	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll
3	v	2018-08-01 12:00:00	2018-08-01 12:00:00	1000	pressure	regular_ll

The parts kwarg can still be used with multiple files; in this case the same parts are applied to each file in turn.

[10]:

ds = ekd.from_source(
    "url",
    [
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/tuv_pl.grib",
    ],
    parts=(0, 240),
)
ds.to_fieldlist().ls()

[10]:

	parameter.variable	time.valid_datetime	time.base_datetime	time.step	vertical.level	vertical.level_type	ensemble.member	geography.grid_type
0	t	2018-08-01 12:00:00	2018-08-01 12:00:00	0 days	1000	pressure	0	regular_ll
1	t	2018-08-01 12:00:00	2018-08-01 12:00:00	0 days	1000	pressure	0	regular_ll
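Because the same parts are applied to every file, a shared parts value can also be written out in the per-file form used earlier in this section. The sketch below builds that list; expand_parts is a hypothetical helper, and the equivalence of the two forms is inferred from the description above rather than verified against the library.

```python
def expand_parts(urls, parts):
    """Pair every URL with the same parts, in the per-file list form.

    Hypothetical helper for illustration; not part of earthkit-data.
    """
    return [[url, parts] for url in urls]


base = "https://sites.ecmwf.int/repository/earthkit-data/examples/"
urls = [base + "test6.grib", base + "tuv_pl.grib"]

per_file = expand_parts(urls, (0, 240))
for url, parts in per_file:
    print(url, parts)
```

The resulting list has the same shape as the second argument of from_source("url", ...) in the per-file examples, so individual entries can then be adjusted, for instance by setting one file's parts to None to read it whole.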
