Reading data parts from URLs

[1]:
import earthkit.data as ekd

This notebook demonstrates how to download only selected parts (byte ranges) of files served from URLs.

First we download a whole file and inspect its contents with ls(). Adding the “offset” key to the listing shows the byte position at which each message starts within the file.

[2]:
ds = ekd.from_source(
        "url",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib")
ds.ls(extra_keys="offset")
[2]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType offset
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll 0.0
1 ecmf u isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll 240.0
2 ecmf v isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll 480.0
3 ecmf t isobaricInhPa 850 20180801 1200 0 an 0 regular_ll 720.0
4 ecmf u isobaricInhPa 850 20180801 1200 0 an 0 regular_ll 960.0
5 ecmf v isobaricInhPa 850 20180801 1200 0 an 0 regular_ll 1200.0

Single files

The parts option in from_source() specifies the byte range(s) we want to read from a remote file. A single part is a tuple or list in the following format: (offset, length), with both values given in bytes.

Using the offsets from the example above we can specify the part for the first message.

[3]:
ds = ekd.from_source(
        "url",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        parts=(0, 240))
ds.ls()
[3]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll

The call above can also be written as:

[4]:
ds = ekd.from_source(
        "url",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        parts=[(0, 240)])
ds.ls()
[4]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
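
The (offset, length) pairs do not have to be typed by hand. The following is a minimal sketch, assuming the offsets reported by ls(extra_keys="offset") earlier and that each message ends where the next one begins. The helper name parts_from_offsets is hypothetical and not part of earthkit-data, and the file size of 1440 bytes is an assumption (it is not stated in this notebook and holds only if the last message is also 240 bytes long).

```python
def parts_from_offsets(offsets, file_size):
    """Turn message start offsets into (offset, length) parts.

    Hypothetical helper for illustration, not part of earthkit-data.
    """
    # Append the file size so that the last message also gets an end position.
    bounds = [int(o) for o in offsets] + [int(file_size)]
    return [(start, end - start) for start, end in zip(bounds, bounds[1:])]


# Offsets as listed by ls(extra_keys="offset") for test6.grib.
offsets = [0.0, 240.0, 480.0, 720.0, 960.0, 1200.0]

# ASSUMPTION: the file is 1440 bytes long, i.e. the last message is 240 bytes.
parts = parts_from_offsets(offsets, file_size=1440)
print(parts[0])
```

Any element of the resulting list, for example parts[0], has the same (offset, length) form as the value passed to the parts option above.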

A part can extend past a message boundary. Here bytes 240-244 belong to the second message, which is skipped because the part does not cover it completely.

[5]:
ds = ekd.from_source(
        "url",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        parts=[(0, 245)])
ds.ls()
[5]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
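
The behaviour seen here can be modelled with a few lines of plain Python. This is only a sketch of the rule described above (a message is returned when all of its bytes fall inside the part); the function name fully_covered is hypothetical and the code is not how earthkit-data is implemented.

```python
def fully_covered(messages, part):
    """Return the indices of messages lying entirely inside one part.

    messages: list of (offset, length), one entry per message in the file.
    part: a single (offset, length) byte range.
    Illustrative model only, not earthkit-data code.
    """
    part_start, part_length = part
    part_end = part_start + part_length
    covered = []
    for index, (start, length) in enumerate(messages):
        # A message counts only if both its first and last byte are in the part.
        if start >= part_start and start + length <= part_end:
            covered.append(index)
    return covered


# The first three messages of test6.grib, taken from the offsets listed earlier.
messages = [(0, 240), (240, 240), (480, 240)]

print(fully_covered(messages, (0, 245)))
```

With the part (0, 245) only index 0 is reported, which matches the single message listed in the output above.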

Multiple parts can be specified as a list. Here the second part, (480, 480), spans two consecutive messages.

[6]:
ds = ekd.from_source(
        "url",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        parts=[(0, 240), (480, 480)])
ds.ls()
[6]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
1 ecmf v isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
2 ecmf t isobaricInhPa 850 20180801 1200 0 an 0 regular_ll
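
When the messages we want are adjacent in the file, their individual parts can be combined into one longer part, as with (480, 480) above. A minimal sketch of such a merge follows; merge_contiguous is a hypothetical helper, not an earthkit-data function, and it assumes the input parts are already sorted by offset and do not overlap.

```python
def merge_contiguous(parts):
    """Merge parts that touch each other into single, longer parts.

    Hypothetical helper; assumes parts are sorted by offset and non-overlapping.
    """
    merged = []
    for offset, length in parts:
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This part starts exactly where the previous one ends: extend it.
            previous_offset, previous_length = merged[-1]
            merged[-1] = (previous_offset, previous_length + length)
        else:
            merged.append((offset, length))
    return merged


# Messages 0, 2 and 3 of test6.grib as individual parts.
print(merge_contiguous([(0, 240), (480, 240), (720, 240)]))
```

For this input the result is [(0, 240), (480, 480)], the same list of parts used in the cell above.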

Parts must be listed in ascending order and cannot overlap.

[7]:
try:
    ds = ekd.from_source(
            "url",
            "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
            parts=[(0, 240), (220, 240)])
except Exception as e:
    print(e)
Offsets and lengths must be in order, and not overlapping: offset=220, end of previous part=240
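
To catch such mistakes before any request is made, the parts can be checked locally. The sketch below applies the rule stated in the error message (parts in order and not overlapping). The function check_parts and its error text are hypothetical; this is not the check earthkit-data performs internally, and the rejection of negative offsets and non-positive lengths is an extra assumption.

```python
def check_parts(parts):
    """Raise ValueError if parts are out of order or overlap.

    Hypothetical pre-flight check, not part of earthkit-data.
    """
    previous_end = 0
    for offset, length in parts:
        # ASSUMPTION: negative offsets and empty parts are treated as invalid.
        if offset < 0 or length <= 0:
            raise ValueError(f"invalid part: offset={offset}, length={length}")
        if offset < previous_end:
            raise ValueError(
                f"part starts at {offset} but the previous part ends at {previous_end}"
            )
        previous_end = offset + length


check_parts([(0, 240), (480, 480)])  # in order, no overlap: passes silently

try:
    check_parts([(0, 240), (220, 240)])
except ValueError as e:
    print(e)
```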

Multiple files

When using multiple URLs we can specify the part for each file with the following syntax:

[8]:
ds = ekd.from_source(
        "url",
        [
            ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", (0, 526)],
            ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
        ])
ds.ls()
[8]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf 2t surface 0 20200513 1200 0 an 0 regular_ll
1 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
2 ecmf v isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll

When the parts for a given file are None, the whole file is read.

[9]:
ds = ekd.from_source(
    "url", [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", None],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0,240), (480, 240)]]
        ])
ds.ls()
[9]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf 2t surface 0 20200513 1200 0 an 0 regular_ll
1 ecmf msl surface 0 20200513 1200 0 an 0 regular_ll
2 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
3 ecmf v isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
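
For more than a couple of files it can be easier to build the list of [url, parts] pairs programmatically. A small sketch, using the same two files as above; the names base_url, parts_by_file and sources are chosen here for illustration only.

```python
# Common prefix of the example files used in this notebook.
base_url = "https://sites.ecmwf.int/repository/earthkit-data/examples/"

# File name -> parts; None means the whole file.
parts_by_file = {
    "test.grib": None,
    "test6.grib": [(0, 240), (480, 240)],
}

# Build the list of [url, parts] pairs in the order the files were listed.
sources = [[base_url + name, parts] for name, parts in parts_by_file.items()]

for url, parts in sources:
    print(url, parts)
```

The resulting sources list has the same structure as the second argument passed to from_source() in the cell above.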

The parts kwarg can also be used with multiple files; in this case the same parts are applied to each file in turn.

[10]:
ds = ekd.from_source(
        "url",
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
         "https://sites.ecmwf.int/repository/earthkit-data/examples/tuv_pl.grib"],
        parts=(0,240))
ds.ls()
[10]:
centre shortName typeOfLevel level dataDate dataTime stepRange dataType number gridType
0 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll
1 ecmf t isobaricInhPa 1000 20180801 1200 0 an 0 regular_ll