Reading data parts from URLs
[1]:
import earthkit.data as ekd
This notebook demonstrates how to download only parts (byte ranges) of files from URLs.
We first download a whole file and inspect its contents with ls(). The "metadata.offset" key gives the byte position where each message starts within the file.
[2]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib")
ds.to_fieldlist().ls(keys="metadata.offset")
[2]:
| | metadata.offset |
|---|---|
| 0 | 0.0 |
| 1 | 240.0 |
| 2 | 480.0 |
| 3 | 720.0 |
| 4 | 960.0 |
| 5 | 1200.0 |
Single files
The parts option in from_source() specifies the byte range(s) we want to read from a remote file. A single part is a tuple or list in the following format: (offset, length).
Using the offsets from the example above, we can specify the part for the first message.
[3]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=(0, 240))
ds.to_fieldlist().ls()
[3]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
The call above can also be written as:
[4]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240)])
ds.to_fieldlist().ls()
[4]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
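The (offset, length) pairs do not have to be typed by hand; they can be derived from the offsets listed earlier. The following is a minimal sketch, assuming the messages are stored back to back with no gaps, so that a message's length is the distance to the next offset. The helper name parts_for_messages is hypothetical and not part of earthkit-data.

```python
# Message start offsets as listed by ls(keys="metadata.offset") above.
offsets = [0, 240, 480, 720, 960, 1200]


def parts_for_messages(offsets, indices):
    """Return (offset, length) parts for the selected message indices.

    Hypothetical helper. Assumes messages are contiguous, so the length of
    message i is offsets[i + 1] - offsets[i]. The last message cannot be
    resolved this way because its end is not known from the offsets alone.
    """
    parts = []
    for i in sorted(set(indices)):
        if not 0 <= i < len(offsets) - 1:
            raise ValueError(f"cannot derive a length for message index {i}")
        parts.append((offsets[i], offsets[i + 1] - offsets[i]))
    return parts


parts = parts_for_messages(offsets, [0, 2])
print(parts)  # [(0, 240), (480, 240)]
```

The resulting list can be passed as the parts option of from_source() in the same way as the literal tuples used in this notebook.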
A part can extend past a message boundary. Here bytes 240-244 belong to the second message, which is not read because the part does not cover all of its bytes.
[5]:
ds = ekd.from_source("url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 245)])
ds.to_fieldlist().ls()
[5]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
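To reason about which messages a given part yields, the rule shown above (only messages whose bytes are all inside the part are read) can be modelled in a few lines. This is only an illustration of the behaviour seen in this example, not earthkit-data's implementation; the function name complete_messages is hypothetical, and the lengths are taken from the differences between the offsets listed earlier.

```python
def complete_messages(offsets, lengths, part):
    """Return the indices of messages lying entirely within part.

    part is an (offset, length) tuple; a message counts only if its first
    and last bytes are both inside the requested byte range.
    """
    start, length = part
    end = start + length  # exclusive end of the requested range
    return [
        i
        for i, (off, n) in enumerate(zip(offsets, lengths))
        if off >= start and off + n <= end
    ]


# First three messages of the example file: 240 bytes each.
offsets = [0, 240, 480]
lengths = [240, 240, 240]

covered = complete_messages(offsets, lengths, (0, 245))
print(covered)  # [0]
```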
Multiple parts can be specified as a list. Here the second part, (480, 480), spans two consecutive messages.
[6]:
ds = ekd.from_source(
    "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (480, 480)]
)
ds.to_fieldlist().ls()
[6]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
| 1 | v | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
| 2 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 850 | pressure | 0 | regular_ll |
Parts must be listed in ascending order of offset and cannot overlap.
[7]:
try:
    ds = ekd.from_source(
        "url", "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", parts=[(0, 240), (220, 240)]
    )
except Exception as e:
    print(e)
Offsets and lengths must be in order, and not overlapping: offset=220, end of previous part=240
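If the parts come from some other computation and may overlap or be out of order, one option is to normalise them before calling from_source(). The sketch below sorts the parts and merges any that overlap or touch. The helper merge_parts is hypothetical and not part of earthkit-data. Note that merging changes the requested byte ranges, so it is only appropriate when reading the union of the ranges is what is wanted.

```python
def merge_parts(parts):
    """Sort (offset, length) parts and merge those that overlap or touch."""
    merged = []
    for offset, length in sorted(parts):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # This part starts inside (or right at the end of) the previous
            # one: extend the previous part to cover both.
            prev_offset, prev_length = merged[-1]
            end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, end - prev_offset)
        else:
            merged.append((offset, length))
    return merged


merged = merge_parts([(0, 240), (220, 240)])
print(merged)  # [(0, 460)]
```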
Multiple files
When using multiple URLs, we can specify the parts for each file by pairing each URL with its parts:
[8]:
ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", (0, 526)],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()
[8]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 2t | 2020-05-13 12:00:00 | 2020-05-13 12:00:00 | 0 days | 0 | surface | 0 | regular_ll |
| 1 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
| 2 | v | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
When the part for a given file is None, the whole file is used.
[9]:
ds = ekd.from_source(
    "url",
    [
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test.grib", None],
        ["https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib", [(0, 240), (480, 240)]],
    ],
)
ds.to_fieldlist().ls()
[9]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 2t | 2020-05-13 12:00:00 | 2020-05-13 12:00:00 | 0 days | 0 | surface | 0 | regular_ll |
| 1 | msl | 2020-05-13 12:00:00 | 2020-05-13 12:00:00 | 0 days | 0 | surface | 0 | regular_ll |
| 2 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
| 3 | v | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
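When the per-file parts are kept in a mapping, the list of [url, parts] pairs used above can be built programmatically. This is a small sketch using plain Python only; the helper url_specs and the BASE constant are hypothetical names introduced for illustration.

```python
BASE = "https://sites.ecmwf.int/repository/earthkit-data/examples/"


def url_specs(parts_by_file, base=BASE):
    """Build the [[url, parts], ...] list from a {file name: parts} mapping.

    A value of None means the whole file is requested.
    """
    return [[base + name, parts] for name, parts in parts_by_file.items()]


specs = url_specs(
    {
        "test.grib": None,
        "test6.grib": [(0, 240), (480, 240)],
    }
)
for url, parts in specs:
    print(url, parts)
```

The specs list has the same shape as the second argument of the from_source("url", ...) call in the cell above and can be passed in its place.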
The parts keyword argument can still be used with multiple files; in this case the same parts are applied to each file in turn.
[10]:
ds = ekd.from_source(
    "url",
    [
        "https://sites.ecmwf.int/repository/earthkit-data/examples/test6.grib",
        "https://sites.ecmwf.int/repository/earthkit-data/examples/tuv_pl.grib",
    ],
    parts=(0, 240),
)
ds.to_fieldlist().ls()
[10]:
| | parameter.variable | time.valid_datetime | time.base_datetime | time.step | vertical.level | vertical.level_type | ensemble.member | geography.grid_type |
|---|---|---|---|---|---|---|---|---|
| 0 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |
| 1 | t | 2018-08-01 12:00:00 | 2018-08-01 12:00:00 | 0 days | 1000 | pressure | 0 | regular_ll |