Data sources

Getting data from a source

We can get data from a given source by using from_source():

from_source(name, *args, **kwargs)

Return a data object from the source specified by name.

Parameters:
  • name (str) – the source (see below)

  • *args (tuple) – specifies the data location and additional parameters to access the data

  • **kwargs (dict) – provides additional functionalities including caching, filtering, sorting and indexing

earthkit-data has the following built-in sources:

Name             Description
---------------  ------------------------------------------------------------------
file             read data from a file/files
file-pattern     read data from a list of files created from a pattern
url              read data from a URL
url-pattern      read data from a list of URLs created from a pattern
sample           read example data
stream           read data from a stream
memory           read data from a memory buffer
forcings         generate forcing data
list-of-dicts    read data from a list of dictionaries
multi            read data from multiple sources
ads              retrieve data from the Copernicus Atmosphere Data Store (ADS)
cds              retrieve data from the Copernicus Climate Data Store (CDS)
ecfs             retrieve data from the ECMWF ECFS File Storage system
ecmwf-open-data  retrieve ECMWF open data
fdb              retrieve data from the Fields DataBase (FDB)
gribjump         retrieve data from the FDB (Fields DataBase) using the gribjump library
mars             retrieve data from the ECMWF MARS archive
opendap          retrieve NetCDF data from OPeNDAP services
polytope         retrieve fields from the Polytope services
s3               retrieve data from Amazon S3 buckets
wekeo            retrieve data from WEkEO using the WEkEO grammar
wekeocds         retrieve CDS data stored on WEkEO using the cdsapi grammar
zarr             load data from a Zarr store


file

from_source("file", path, expand_user=True, expand_vars=False, unix_glob=True, recursive_glob=True, filter=None, parts=None)

The simplest source is file, which can access a local file/list of files.

Parameters:
  • path (str, list, tuple) – input path(s). Each path can be a file path or a directory path. If it is a directory path, it is recursively scanned for supported files. When a path is an archive format such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable files, which are then stored in the cache. Each filepath can contain the parts defining the byte ranges to read.

  • expand_user (bool) – replace the leading ~ or ~user in path by that user’s home directory. See os.path.expanduser

  • expand_vars (bool) – expand shell environment variables in path. See os.path.expandvars

  • unix_glob (bool) – allow UNIX globbing in path

  • recursive_glob (bool) – allow recursive scanning of directories. Only used when unix_glob is True

  • filter (str, callable) – apply filter to the files read from directories or archives. The filter can be a callable or a string. If it is a string, it is interpreted as a UNIX glob pattern. If it is a callable, it should accept the full file path as a string and return a boolean.

  • parts (pair, list or tuple of pairs, None) – the parts to read from the file(s) specified by path. Cannot be used when path already defines the parts.

  • stream (bool) – if True, the data is read as a stream. Directories and archives are supported. Stream based access is only available for GRIB and CoverageJson data. See details about streams here. New in version 0.11.0

  • read_all (bool) – if True, all the data is read straight to memory from a stream. Used when stream=True. New in version 0.11.0

earthkit-data will inspect the content of the files to check for any of the supported data formats.

When the input is an archive format such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable files, which are then stored in the cache.

The path can be used in a flexible way:

import earthkit.data as ekd

# UNIX globbing is allowed by default
ds = ekd.from_source("file", "path/to/t_*.grib")

# list of files can be specified
ds = ekd.from_source("file", ["path/to/f1.grib", "path/to/f2.grib"])

# a path can be a directory, in this case it is recursively scanned for supported files
ds = ekd.from_source("file", "path/to/dir")
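
The filter option is only described in words in the parameter list above. As a rough illustration, the sketch below shows how a glob string or a callable could be turned into a single predicate on file paths. The helper name make_filter and the choice to match the glob against the full path are assumptions made for this sketch; it is not earthkit-data's implementation.

```python
import fnmatch


def make_filter(filter_spec):
    """Turn a filter specification into a predicate on file paths (illustrative only)."""
    if filter_spec is None:
        # no filter: accept every file
        return lambda path: True
    if callable(filter_spec):
        # a callable receives the full file path and returns a bool
        return filter_spec
    # assumption: a string is matched as a UNIX glob against the full path
    return lambda path: fnmatch.fnmatchcase(path, filter_spec)


files = ["dir/a.grib", "dir/b.nc", "dir/sub/c.grib"]

by_glob = [p for p in files if make_filter("*.grib")(p)]
by_callable = [p for p in files if make_filter(lambda p: p.endswith(".nc"))(p)]
print(by_glob)
print(by_callable)
```

With the library itself the same intent would be expressed by passing filter="*.grib" or a callable to from_source("file", ...).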

The following examples use parts:

import earthkit.data as ekd

# reading only certain parts (byte ranges) from a single file
ds = ekd.from_source("file", "my.grib", parts=[(0, 150), (400, 160)])

# reading only certain parts (byte ranges) from multiple files
ds = ekd.from_source(
    "file",
    [
        ("a.grib", (0, 150)),
        ("b.grib", (240, 120)),
        ("c.grib", None),
        ("d.grib", [(240, 120), (720, 120)]),
    ],
)
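
To make the idea of byte ranges concrete, here is a small standard-library sketch of what reading parts from a file amounts to. It assumes that each pair means (offset, length), which is what the examples suggest; the helper read_parts is hypothetical and only mimics the concept, not the library's internals.

```python
import os
import tempfile


def read_parts(path, parts):
    """Return the bytes of each (offset, length) pair, concatenated in order."""
    chunks = []
    with open(path, "rb") as f:
        for offset, length in parts:
            f.seek(offset)                 # jump to the start of the part
            chunks.append(f.read(length))  # read exactly `length` bytes
    return b"".join(chunks)


# build a 100-byte demo file whose byte values equal their offsets
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    demo_path = tmp.name

selected = read_parts(demo_path, [(0, 4), (10, 3)])
os.remove(demo_path)
print(list(selected))
```

For GRIB data the pairs would typically be chosen so that each part covers one or more whole messages.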

file-pattern

from_source("file-pattern", pattern, *args, hive_partitioning=False, **kwargs)

The file-pattern source reads data from paths specified by a pattern.

Parameters:
  • pattern (str) – input path pattern using {} brackets to define parameters that can be substituted. See patterns for details.

  • *args (tuple) – specify the values to substitute into the pattern parameters. Each parameter can be a list/tuple or a single value.

  • hive_partitioning (bool) – control how the pattern is interpreted. See details below.

  • **kwargs (dict) – other keyword arguments specifying the parameter values

The actual behaviour and the type of the returned object depend on hive_partitioning:

hive_partitioning=False

When hive_partitioning is False, first, the pattern parameters are substituted with the values specified by the *args and **kwargs, see patterns for details. For this, all the possible values must be specified for each pattern parameter. Next, the paths are constructed by taking the Cartesian product of the substituted values. Finally, the resulting paths are read and from_source returns a single object (for GRIB data it will be a Fieldlist).

import datetime
import earthkit.data as ekd

# ds is a fieldlist
ds = ekd.from_source(
    "file-pattern",
    "path/to/data-{my_date:date(%Y-%m-%d)}-{run_time}-{param}.grib",
    {
        "my_date": datetime.datetime(2020, 5, 2),
        "run_time": [12, 18],
        "param": ["t2", "msl"],
    },
)

The code above substitutes “my_date”, “run_time” and “param” into the pattern and constructs the following file paths, which are read into a single GRIB Fieldlist:

path/to/data-2020-05-02-12-t2.grib
path/to/data-2020-05-02-12-msl.grib
path/to/data-2020-05-02-18-t2.grib
path/to/data-2020-05-02-18-msl.grib
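
The substitution and Cartesian product described above can be sketched with the standard library. This is only an illustration under stated assumptions: the helper expand_pattern is hypothetical, and the date is formatted with Python's own format spec ({my_date:%Y-%m-%d}) rather than the date(...) syntax used by earthkit-data patterns.

```python
import datetime
import itertools


def expand_pattern(pattern, **params):
    """Build every path obtained from the Cartesian product of the parameter values."""
    names = list(params)
    # treat single values as one-element lists
    value_lists = [v if isinstance(v, (list, tuple)) else [v] for v in params.values()]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in itertools.product(*value_lists)
    ]


paths = expand_pattern(
    "path/to/data-{my_date:%Y-%m-%d}-{run_time}-{param}.grib",
    my_date=datetime.datetime(2020, 5, 2),
    run_time=[12, 18],
    param=["t2", "msl"],
)
for p in paths:
    print(p)
```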

hive_partitioning=True

When hive_partitioning is True, the pattern defines a Hive partitioning with each pattern parameter interpreted as a metadata key. The returned object has a limited scope and only supports the sel() method. Calling this method will trigger a filesystem scan for all the matching files. During this scan, if the required metadata is present in the pattern, no files need to be opened to extract their metadata, which can be an enormous optimisation. Another advantage is that during the scan entire file system branches can be skipped based simply on inspecting the actual file path.

Pattern values are optional, but can be still specified to restrict the search to a specific set of values.

For the hive partitioning example below let us suppose we have the following directory structure containing several years of GRIB data:

mydir/
    20230101/
        myfile_t.grib
        myfile_r.grib
        myfile_u.grib
        myfile_v.grib
    20230102/
        myfile_t.grib
        myfile_r.grib
        myfile_u.grib
        myfile_v.grib
    20230103/
        myfile_t.grib
        myfile_r.grib
        myfile_u.grib
        myfile_v.grib
    20230104/
    ...

import earthkit.data as ekd

# At this point nothing is scanned/read yet. ds only has the
# sel() method.
ds = ekd.from_source(
    "file-pattern", "mydir/{date}/myfile_{param}.grib", hive_partitioning=True
)

# The following line will trigger a filesystem scan
# for all the matching files. The scan will be limited to the
# "mydir/20230101/" sub-directory and none of the GRIB files will be
# opened to extract their metadata. The returned object will
# be a Fieldlist.
ds1 = ds.sel(date="20230101", param=["t", "r"])
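
The reason no file has to be opened is that the metadata can be read off the path itself. The sketch below illustrates this idea on a plain list of path strings; the helpers pattern_to_regex and select are hypothetical and are not how earthkit-data implements the scan.

```python
import re


def pattern_to_regex(pattern):
    """Convert 'mydir/{date}/myfile_{param}.grib' into a regex with named groups."""
    pieces = re.split(r"\{(\w+)\}", pattern)  # alternates: literal, name, literal, ...
    regex = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            regex += re.escape(piece)             # literal text
        else:
            regex += f"(?P<{piece}>[^/]+)"        # one path component per parameter
    return re.compile(regex)


def select(paths, pattern, **conditions):
    """Keep the paths whose metadata, parsed from the path, match all conditions."""
    rx = pattern_to_regex(pattern)
    wanted = {
        k: set(v if isinstance(v, (list, tuple)) else [v]) for k, v in conditions.items()
    }
    matches = []
    for p in paths:
        m = rx.fullmatch(p)
        if m and all(m.group(k) in values for k, values in wanted.items()):
            matches.append(p)
    return matches


all_paths = [
    f"mydir/{date}/myfile_{param}.grib"
    for date in ("20230101", "20230102")
    for param in ("t", "r", "u", "v")
]
hits = select(
    all_paths, "mydir/{date}/myfile_{param}.grib", date="20230101", param=["t", "r"]
)
print(hits)
```

A real scan would additionally skip whole directories (here mydir/20230102/) as soon as their path component fails a condition.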

url

from_source("url", url, unpack=True, parts=None, stream=False, read_all=False)

The url source will download the data from the address specified and store it in the cache. The supported data formats are the same as for the file data source above.

Parameters:
  • url (str) – the URL(s) to download. Each URL can contain the parts defining the byte ranges to read.

  • unpack (bool) – for archive formats such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable file. To keep the downloaded file as is use unpack=False

  • parts (pair, list or tuple of pairs, None) – the parts to read from the resource(s) specified by url. Cannot be used when url already defines the parts.

  • stream (bool) – if True, the data is read as a stream. Otherwise the data is retrieved into a file and stored in the cache. This option only works for GRIB data. No archive formats supported (unpack is ignored). stream only works for http and https URLs. See details about streams here.

  • read_all (bool) – if True, all the data is read straight to memory from a stream. Used when stream=True. New in version 0.8.0

  • **kwargs (dict) – other keyword arguments specifying the request

>>> import earthkit.data as ekd
>>> ds = ekd.from_source(
...     "url",
...     "https://sites.ecmwf.int/repository/earthkit-data/examples/test4.grib",
... )
>>> ds.ls()
  centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
0   ecmf         t  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
1   ecmf         z  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
2   ecmf         t  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll
3   ecmf         z  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll

The following example uses parts to read only the first and third messages from the same file:

>>> import earthkit.data as ekd
>>> ds = ekd.from_source(
...     "url",
...     "https://sites.ecmwf.int/repository/earthkit-data/examples/test4.grib",
...     parts=[(0, 130428), (260856, 130428)],
... )
>>> ds.ls()
  centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
0   ecmf         t  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
1   ecmf         t  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll

url-pattern

from_source("url-pattern", url, unpack=True)

The url-pattern source will build URLs from the pattern specified, using the other arguments to fill the pattern. Each argument can be a list; the URLs are created from the Cartesian product of all the lists. Then each URL is downloaded and stored in the cache. The supported data formats are the same as for the file and url data sources above.

import earthkit.data as ekd

ds = ekd.from_source(
    "url-pattern",
    "https://www.example.com/data-{foo}-{bar}-{qux}.csv",
    foo=[1, 2, 3],
    bar=["a", "b"],
    qux="unique",
)

The code above will download and process the data from the following six URLs:

https://www.example.com/data-1-a-unique.csv
https://www.example.com/data-2-a-unique.csv
https://www.example.com/data-3-a-unique.csv
https://www.example.com/data-1-b-unique.csv
https://www.example.com/data-2-b-unique.csv
https://www.example.com/data-3-b-unique.csv

If the URLs point to archive formats, the data will be unpacked by url-pattern according to the unpack argument, in the same way as for the url source (see above).

sample

from_source("sample", name_or_path)

The sample source will download example data prepared for earthkit and store it in the cache. The supported data formats are the same as for the file data source above.

Parameters:
  • name_or_path (str, list, tuple) – input file name(s) or relative path(s) to the root of the remote storage folder.

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("sample", "storm_ophelia_wind_850.grib")
>>> ds.ls()
  centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
0   ecmf         u  isobaricInhPa    850  20171016         0         0       an       0  regular_ll
1   ecmf         v  isobaricInhPa    850  20171016         0         0       an       0  regular_ll

stream

from_source("stream", stream, read_all=False)

The stream source will read data from a stream (or streams), which can be an FDB stream, a standard Python IO stream or any object implementing the necessary stream methods. At the moment it only works for GRIB and CoverageJson data. For more details see here.

Parameters:
  • stream (stream, list, tuple) – the stream(s)

  • read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0

In the examples below, for simplicity, we create a file stream from a GRIB file. By default from_source() returns an object that can only be used as an iterator.

>>> import earthkit.data as ekd
>>> stream = open("docs/examples/test4.grib", "rb")
>>> ds = ekd.from_source("stream", stream)

# f is a GribField
>>> for f in ds:
...     print(f)
...
GribField(t,500,20070101,1200,0,0)
GribField(z,500,20070101,1200,0,0)
GribField(t,850,20070101,1200,0,0)
GribField(z,850,20070101,1200,0,0)

We can also iterate through the stream in batches of fixed size using batched():

>>> import earthkit.data as ekd
>>> stream = open("docs/examples/test4.grib", "rb")
>>> ds = ekd.from_source("stream", stream)

# f is a FieldList
>>> for f in ds.batched(2):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=2 [('t', 500), ('z', 500)]
len=2 [('t', 850), ('z', 850)]

When using group_by() we can iterate through the stream in groups defined by metadata keys. In this case each iteration step yields a FieldList.

>>> import earthkit.data as ekd
>>> stream = open("docs/examples/test4.grib", "rb")
>>> ds = ekd.from_source("stream", stream)

# f is a FieldList
>>> for f in ds.group_by("level"):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=2 [('t', 500), ('z', 500)]
len=2 [('t', 850), ('z', 850)]
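
The two iteration modes can be illustrated without any GRIB data. The sketch below applies the same ideas to a plain iterator of dictionaries; the functions batched and group_by here are stand-ins written for this example, and grouping consecutive items with itertools.groupby is an assumption about the behaviour, not a description of the library's code.

```python
import itertools


def batched(iterable, n):
    """Yield lists of at most n consecutive items until the iterator is exhausted."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, n))
        if not batch:
            return
        yield batch


def group_by(iterable, key):
    """Yield lists of consecutive items sharing the same value for `key`."""
    for _, group in itertools.groupby(iterable, key=lambda item: item[key]):
        yield list(group)


# metadata mimicking the four fields of the example file
fields = [
    {"param": "t", "level": 500},
    {"param": "z", "level": 500},
    {"param": "t", "level": 850},
    {"param": "z", "level": 850},
]

batches = [[f["param"] for f in b] for b in batched(iter(fields), 2)]
groups = [[(f["param"], f["level"]) for f in g] for g in group_by(iter(fields), "level")]
print(batches)
print(groups)
```

In both cases the underlying iterator is consumed only once, which is why a stream object cannot be iterated a second time.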

We can consume the whole stream and load all the data into memory by using read_all=True in from_source(). Use this option carefully!

>>> import earthkit.data as ekd
>>> stream = open("docs/examples/test4.grib", "rb")
>>> ds = ekd.from_source("stream", stream, read_all=True)

# ds is empty at this point, but calling any method on it will
# consume the whole stream
>>> len(ds)
4

# now ds stores all the messages in memory

memory

from_source("memory", buffer)

The memory source will read data from a memory buffer. Currently it only works for a buffer storing GRIB data or a single CoverageJson object. The result is a FieldList object storing all the data in memory.

import earthkit.data as ekd

# buffer storing a GRIB message
buffer = ...

ds = ekd.from_source("memory", buffer)

# f is the only GribField in ds
f = ds[0]

Please note that if the given input can be read as a stream we can also use the stream source to read the buffer using io.BytesIO. The equivalent code to the example above using a stream is as follows:

import io
import earthkit.data as ekd

# buffer storing a GRIB message
buffer = ...
stream = io.BytesIO(buffer)

ds = ekd.from_source("stream", stream, read_all=True)

# f is the only GribField in ds
f = ds[0]

forcings

from_source("forcings", source_or_dataset=None, *, request={}, **kwargs)

Parameters:
  • source_or_dataset (Source, FieldList or None) – the input data. It can be the object returned from from_source() or a FieldList. If it is None, a list-of-dicts source is built from the request. The first field in this data is used as a template to build the forcing fields.

  • request (dict) – specify the request

  • **kwargs (dict) – other keyword arguments specifying the request

The forcings source generates forcing fields.

list-of-dicts

from_source("list-of-dicts", list_of_dicts)

The list-of-dicts source will read data from a list of dictionaries. Each dictionary represents a single field and the result is a FieldList consisting of ArrayField fields.

Note

No attempt is made to represent the fields internally as GRIB messages, so field functionalities are limited, and some of them may not work at all. The fields cannot be saved to a GRIB file.

The only required key for a dictionary is “values”, which represents the data values. It can be a list, tuple or an ndarray. All the other keys define the metadata and are optional. However, many field functionalities require the existence of specific keys (see below).

The keys that might be interpreted internally can be grouped into the following categories:

Geography keys:

  • “latitudes”: the latitudes, iterable or ndarray

  • “longitudes”: the longitudes, iterable or ndarray

  • “distinctLatitudes”: the distinct latitudes, iterable or ndarray

  • “distinctLongitudes”: the distinct longitudes, iterable or ndarray

These keys are required to make any geography related field functionalities work (e.g. to_latlon()). The role of the keys depends on the grid type:

  • structured grids: “latitudes” and “longitudes” can define the distinct latitudes and longitudes or the full grid. The keys “distinctLatitudes” and “distinctLongitudes” are only used when “latitudes” and “longitudes” are not present and in this case they define the distinct latitudes and longitudes.

  • other grids: “latitudes” and “longitudes” must have the same number of points as “values”.

When other GRIB related geography keys are present, no attempt is made to check if they are consistent with the grid defined by “latitudes” and “longitudes”. Therefore their usage is strongly discouraged.

See: list-of-dicts: defining geography for more details.

Parameter keys:

  • “param”: the parameter name, alias to “shortName” if missing. Must be a str.

  • “shortName”: the parameter name, alias to “param” if missing. Must be a str.

Temporal keys:

  • “date”: the date part of the forecast reference time. Must be an int as YYYYMMDD (the same format as the “date” ecCodes GRIB key).

  • “time”: the time part of the forecast reference time. Must be an int as hhmm with leading zeros omitted (the same format as the “time” ecCodes GRIB key).

  • “dataDate”: alias to “date”

  • “dataTime”: alias to “time”

  • “forecast_reference_time”: the forecast reference time. Must be a datetime object. If not present it is automatically built from “date” and “time” or from “valid_datetime” and “step”.

  • “base_datetime”: alias to “forecast_reference_time”

  • “valid_datetime”: the valid datetime. Must be a datetime object. If not present it is automatically built from “forecast_reference_time” and “step”.

  • “step”: the forecast step. If it is an int, it specifies the number of hours. If it is a str it must use the same format as the “step” ecCodes GRIB key. Can be a timedelta object.

  • “step_timedelta”: the step timedelta. Must be a timedelta object. If not present it is automatically built from “step”.
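
The relationship between these temporal keys can be sketched with the standard library. The helpers below are hypothetical and only illustrate the arithmetic implied by the descriptions above (an int “date” as YYYYMMDD, an int “time” as hhmm, an int “step” in hours); they are not the library's code.

```python
import datetime


def reference_time(date, time):
    """Combine an int date (YYYYMMDD) and an int time (hhmm) into a datetime."""
    year, rest = divmod(date, 10000)
    month, day = divmod(rest, 100)
    hour, minute = divmod(time, 100)  # e.g. 600 -> 06:00, 0 -> 00:00
    return datetime.datetime(year, month, day, hour, minute)


def valid_time(base, step_hours):
    """Valid datetime = forecast reference time + step (given in hours)."""
    return base + datetime.timedelta(hours=step_hours)


base = reference_time(20070101, 1200)
valid = valid_time(base, 6)
print(base.isoformat(), valid.isoformat())
```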

Level keys:

  • “level”: the level value. Must be a number.

  • “levelist”: the level value. Must be a number.

  • “typeOfLevel”: the type of level. Must be a str.

  • “levtype”: the type of level. Must be a str.

These keys are supposed to be the same as the corresponding GRIB keys.

Ensemble keys:

  • “number”: the ensemble member number. Must be an int.

Other keys:

Other keys can be used to store additional metadata.
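
As an illustration of the input structure, the sketch below builds two field dictionaries and runs a basic consistency check on them. The helper check_field, the sample values and the flat layout of “values” for a structured grid are assumptions made for this example only.

```python
def check_field(d):
    """Basic consistency checks for one field dictionary (illustrative only)."""
    if "values" not in d:
        raise ValueError('the "values" key is required')
    n = len(d["values"])
    lat, lon = d.get("latitudes"), d.get("longitudes")
    if lat is not None and lon is not None:
        # either one lat/lon per value, or the distinct lat/lon of a structured grid
        full_grid = len(lat) == n and len(lon) == n
        distinct = len(lat) * len(lon) == n
        if not (full_grid or distinct):
            raise ValueError("latitudes/longitudes do not match the number of values")
    return True


fields = [
    {
        "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "latitudes": [10.0, 0.0],         # distinct latitudes
        "longitudes": [0.0, 20.0, 40.0],  # distinct longitudes
        "param": "t",
        "level": 500,
        "date": 20070101,
        "time": 1200,
    },
    {
        "values": [7.0, 8.0, 9.0],
        "latitudes": [10.0, 5.0, 0.0],    # one latitude per value
        "longitudes": [0.0, 10.0, 20.0],  # one longitude per value
        "param": "z",
        "level": 500,
    },
]

all_ok = all(check_field(f) for f in fields)
print(all_ok)
```

A list built this way is the kind of input that from_source("list-of-dicts", fields) is described as accepting.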

multi

from_source("multi", *sources, merger=None, **kwargs)

The multi source reads multiple sources.

Parameters:
  • *sources (tuple) – the sources

  • merger – if it is None, an attempt is made to merge/concatenate the sources by their classes (using the nearest common class). Otherwise the sources are merged/concatenated using the merger in a lazy way. The merger can be one of the following:

    • class/object implementing the to_xarray() or to_pandas() methods

    • callable

    • str, describing a call either to “concat” or “merge”. E.g.: “concat(concat_dim=time)”

    • tuple with 2 elements. The first element is a str, either “concat” or “merge”, and the second element is a dict with the keyword arguments for the call. E.g.: (“concat”, {“concat_dim”: “time”})

  • **kwargs (dict) – other keyword arguments

ads

from_source("ads", dataset, *args, request=None, **kwargs)

The ads source accesses the Copernicus Atmosphere Data Store (ADS), using the cdsapi package.

Parameters:
  • dataset (str) – the name of the ADS dataset

  • *args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • **kwargs (dict) – other keyword arguments specifying the request

Note

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.
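
The three steps of the request-building logic can be sketched in plain Python. This is a simplified illustration, not the library's code: the helper build_requests is hypothetical, it assumes that split_on is a key name or a list of key names producing one request per combination of values, and it assumes that **kwargs override keys already present in a request.

```python
import itertools


def _flatten(item):
    """Collect the dictionaries found in a dict, a list/tuple of dicts, or None."""
    if item is None:
        return []
    if isinstance(item, dict):
        return [item]
    found = []
    for sub in item:
        found.extend(_flatten(sub))
    return found


def build_requests(*args, request=None, **kwargs):
    # Step 1: every dictionary in `request` and `*args` is a separate request
    requests = _flatten(request) + _flatten(args)
    # Step 2: merge kwargs into each request, or use them as the only request
    if requests:
        requests = [{**r, **kwargs} for r in requests]
    elif kwargs:
        requests = [dict(kwargs)]
    # Step 3: split the requests that contain "split_on"
    result = []
    for r in requests:
        r = dict(r)
        split_on = r.pop("split_on", None)
        if split_on is None:
            result.append(r)
            continue
        keys = [split_on] if isinstance(split_on, str) else list(split_on)
        value_lists = [r[k] if isinstance(r[k], (list, tuple)) else [r[k]] for k in keys]
        for combo in itertools.product(*value_lists):
            result.append({**r, **dict(zip(keys, combo))})
    return result


reqs = build_requests(
    request={"variable": ["pm10", "pm1"], "date": "2012-12-12", "split_on": "variable"},
    time="12:00",
)
for r in reqs:
    print(r)
```

The same three steps are listed for the other request-based sources further down the page.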

Note

Currently, for accessing ADS earthkit-data requires the credentials for cdsapi to be stored in the RC file ~/.adsapirc.

When no ~/.adsapirc RC file exists a prompt will appear to specify the credentials for cdsapi and write them into ~/.adsapirc.

The following example retrieves CAMS global reanalysis GRIB data for 2 parameters:

import earthkit.data as ekd

ds = ekd.from_source(
    "ads",
    "cams-global-reanalysis-eac4",
    request=dict(
        variable=["particulate_matter_10um", "particulate_matter_1um"],
        area=[50, -50, 20, 50],  # N,W,S,E
        date="2012-12-12",
        time="12:00",
    ),
)

Data downloaded from the ADS is stored in the cache.

To access data from the ADS, you will need to register and retrieve an access token. The process is described here. For more information, see the ADS knowledge base.

cds

from_source("cds", dataset, *args, request=None, prompt=True, **kwargs)

The cds source accesses the Copernicus Climate Data Store (CDS), using the cdsapi package.

Parameters:
  • dataset (str) – the name of the CDS dataset

  • *args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • prompt (bool) –

    if True, it can offer a prompt to specify the credentials for cdsapi and write them into the default RC file ~/.cdsapirc. The prompt only appears when:

    • no cdsapi RC file exists at the default location ~/.cdsapirc

    • no cdsapi RC file exists at the location specified via the CDSAPI_RC environment variable

    • no credentials specified via the CDSAPI_URL and CDSAPI_KEY environment variables

  • **kwargs (dict) –

    other keyword arguments specifying the request
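
The three conditions under which the prompt appears can be written as a single check. The sketch below is only an illustration of the rule stated for the prompt parameter: the function credentials_missing is hypothetical, and the environment and the file-existence test are injected so that the logic can be tried without touching real files.

```python
import os


def credentials_missing(environ=None, exists=os.path.exists):
    """Return True only when none of the three credential sources is available."""
    environ = os.environ if environ is None else environ
    # 1. RC file at the default location
    if exists(os.path.expanduser("~/.cdsapirc")):
        return False
    # 2. RC file at the location given by CDSAPI_RC
    custom_rc = environ.get("CDSAPI_RC")
    if custom_rc and exists(custom_rc):
        return False
    # 3. credentials given directly through environment variables
    if environ.get("CDSAPI_URL") and environ.get("CDSAPI_KEY"):
        return False
    return True


nothing_found = credentials_missing({}, exists=lambda p: False)
from_env = credentials_missing(
    {"CDSAPI_URL": "https://example.invalid/api", "CDSAPI_KEY": "dummy"},
    exists=lambda p: False,
)
from_custom_rc = credentials_missing(
    {"CDSAPI_RC": "/tmp/my_rc"}, exists=lambda p: p == "/tmp/my_rc"
)
print(nothing_found, from_env, from_custom_rc)
```

Only the first case (nothing_found) corresponds to the situation in which a prompt would be offered when prompt=True.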

Note

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.

The following example retrieves ERA5 reanalysis GRIB data for a subarea. The request is specified using the request argument:

import earthkit.data as ekd

ds = ekd.from_source(
    "cds",
    "reanalysis-era5-single-levels",
    request=dict(
        product_type="reanalysis",
        area=[50, -10, 40, 10],  # N,W,S,E
        grid=[2, 2],
        date="2012-05-10",
    ),
)

Data downloaded from the CDS is stored in the cache.

To access data from the CDS, you will need to register and retrieve an access token. The process is described here. For more information, see the CDS knowledge base.

ecfs

from_source("ecfs", path)

The ecfs source provides access to ECMWF’s File Storage system. This service is only available at ECMWF.

The path has to start with ec: followed by the path to the file to retrieve.

ecmwf-open-data

from_source("ecmwf-open-data", *args, source="ecmwf", model="ifs", request=None, **kwargs)

The ecmwf-open-data source provides access to the ECMWF open data, which is a subset of ECMWF real-time forecast data made available to the public free of charge. It uses the ecmwf-opendata package.

Parameters:
  • *args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • source (str) – either the name of the server to contact or a fully qualified URL. Possible values are “ecmwf” to access ECMWF’s servers, or “azure” to access data hosted on Microsoft’s Azure. Default is “ecmwf”.

  • model (str) – name of the model that produced the data. Use “ifs” for the physics-driven model and “aifs” for the data-driven model. Please note that “aifs” is currently experimental and only produces a small subset of fields. Default is “ifs”.

  • **kwargs (dict) – other keyword arguments specifying the request

Note

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

Details about the request format can be found here.

The following example retrieves 2 surface parameters for 3 steps from the latest forecast:

import earthkit.data

ds = earthkit.data.from_source(
    "ecmwf-open-data",
    request=dict(param=["2t", "msl"], levtype="sfc", step=[0, 6, 12]),
)

The resulting GRIB data files are stored in the cache.

fdb

from_source("fdb", *args, config=None, userconfig=None, request=None, stream=True, read_all=False, lazy=False, **kwargs)

The fdb source accesses the FDB (Fields DataBase), which is a domain-specific object store developed at ECMWF for storing, indexing and retrieving GRIB data. earthkit-data uses the pyfdb package to retrieve data from FDB.

Parameters:
  • *args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries, but currently only one request is supported.

  • config (dict,str) – the FDB configuration directly passed to pyfdb.FDB(). If not provided, the configuration is either read from the environment or the default configuration is used. New in version 0.11.0

  • userconfig (dict,str) – the FDB user configuration directly passed to pyfdb.FDB(). If not provided, the configuration is either read from the environment or the default configuration is used. New in version 0.11.0

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests, but currently only one request is supported. New in version 0.18.0

  • stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJson data. See details about streams here.

  • read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0

  • lazy (bool) – if True, the data is read in a lazy way. This means the following:

    • GRIB data is not retrieved until it is explicitly/implicitly requested for a given field

    • metadata related calls (e.g. metadata() or sel()) work without retrieving the GRIB data

    • to_xarray() works without retrieving the GRIB data

    • the retrieved GRIB data is not cached (either in memory or on disk) but gets deleted as soon as the data values are extracted. Repeated requests for the data values will trigger a new retrieval.

    • the resulting FieldList always retrieves one GRIB field as a reference and stores it in memory throughout the lifetime of the FieldList. This is managed internally.

    When lazy=True the stream and read_all options are ignored. Please note that this is an experimental feature. New in version 0.14.0

  • **kwargs (dict) – other keyword arguments specifying the request

Note

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

The following example retrieves analysis GRIB data for 3 surface parameters as stream. By default we will consume one message at a time and ds can only be used as an iterator:

>>> import earthkit.data as ekd
>>> request = {
...     "class": "od",
...     "expver": "0001",
...     "stream": "oper",
...     "date": "20240421",
...     "time": [0, 12],
...     "domain": "g",
...     "type": "an",
...     "levtype": "sfc",
...     "step": 0,
...     "param": [151, 167, 168],
... }
>>>
>>> ds = ekd.from_source("fdb", request=request)
>>> for f in ds:
...     print(f)
...
GribField(msl,None,20240421,0,0,0)
GribField(2t,None,20240421,0,0,0)
GribField(2d,None,20240421,0,0,0)
GribField(msl,None,20240421,1200,0,0)
GribField(2t,None,20240421,1200,0,0)
GribField(2d,None,20240421,1200,0,0)

We can also iterate through the stream in batches of fixed size using batched(). ds is still just an iterator, but each f is now a FieldList containing 2 fields:

>>> ds = ekd.from_source("fdb", request=request)
>>> for f in ds.batched(2):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=2 [('msl', 0), ('2t', 0)]
len=2 [('2d', 0), ('msl', 0)]
len=2 [('2t', 0), ('2d', 0)]

When using group_by() we can iterate through the stream in groups defined by metadata keys. In this case each iteration step yields a FieldList.

>>> ds = ekd.from_source("fdb", request=request)
>>> for f in ds.group_by("time"):
...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
...
len=3 [('msl', 0), ('2t', 0), ('2d', 0)]
len=3 [('msl', 0), ('2t', 0), ('2d', 0)]

We can consume the whole stream and load all the data into memory by using read_all=True in from_source(). Use this option carefully!

>>> import earthkit.data as ekd
>>> ds = ekd.from_source("fdb", request=request, read_all=True)

# ds is empty at this point, but calling any method on it will
# consume the whole stream
>>> len(ds)
6

# now ds stores all the messages in memory


gribjump

from_source("gribjump", request, *, ranges=None, mask=None, indices=None, fetch_coords_from_fdb=False, fdb_kwargs=None, **kwargs)

New in version 0.17.0

The gribjump source enables fast retrieval of GRIB message subsets from the FDB (Fields DataBase) using the gribjump library. Both pygribjump and pyfdb must be installed. The pygribjump package uses findlibs to locate an installation of the gribjump library. If the library is not available on your system, you can install it via the gribjumplib wheel from PyPI. Installing gribjumplib from PyPI will also automatically install fdb5lib and other dependencies, which may take priority over any existing installations on your system.

Warning

⚠️ This source is experimental and may change in future versions without warning. It performs no validation that the specified grid indices, masks, or ranges correspond to the fields’ actual underlying grids. Incorrect usage may silently return wrong data points. The provided ranges or masks might correspond to unexpected points on the grid. This source is also currently not thread-safe.

Exactly one of the parameters ranges, mask or indices must be specified at a time.

Parameters:
  • request (dict) – the FDB request as a dictionary. GribJump requires strict value formatting (e.g., hdates as “YYYYMMDD”, not “YYYY-MM-DD”). Format errors may result in “DataNotFound” errors.

  • ranges (list[tuple[int, int]], optional) – a list of tuples specifying the ranges of 1D grid indices to retrieve in the form [(start1, end1), (start2, end2), …]. Ranges are end-exclusive: the start index is included in the range and the end index is not.

  • mask (numpy.array, optional) – a 1D boolean mask specifying which grid points to retrieve

  • indices (numpy.array, optional) – a 1D array of grid indices to retrieve

  • fetch_coords_from_fdb (bool, optional) – if True, loads the first field’s metadata from the FDB to extract the coordinates at the specified indices. If False, the coordinates are not loaded and no separate FDB request is made. Default is False. Please note that no validation is performed to ensure that all fields in the requests share the same grid.

  • fdb_kwargs (dict, optional) – only used when fetch_coords_from_fdb=True. A dict of keyword arguments passed to the pyfdb.FDB constructor. This allows the FDB configuration, user configuration, etc. to be specified. If not provided, the default configuration is used. These arguments are only passed to the FDB when fetching coordinates and are not used by GribJump for the extraction itself.
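The three selection parameters describe the same kind of thing, a set of 1D grid indices, in different forms. The following sketch shows how end-exclusive ranges expand to indices and how those indices correspond to a boolean mask. It uses plain Python lists in place of NumPy arrays, a made-up grid size of 8, and hypothetical helper names; it does not call gribjump.

```python
def ranges_to_indices(ranges):
    """Expand end-exclusive (start, end) ranges into a flat list of indices."""
    indices = []
    for start, end in ranges:
        indices.extend(range(start, end))
    return indices


def indices_to_mask(indices, size):
    """Build a boolean mask of length size that is True at the given indices."""
    mask = [False] * size
    for i in indices:
        mask[i] = True
    return mask


ranges = [(0, 3), (5, 7)]
indices = ranges_to_indices(ranges)
mask = indices_to_mask(indices, 8)
print(indices)  # [0, 1, 2, 5, 6]
print(mask)
```

Note that index 3 and index 7 are not selected, since the end of each range is excluded.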

The following example retrieves a subset from GRIB messages in the FDB using index ranges:

import earthkit.data as ekd
import numpy as np

request = {
    "class": "od",
    "type": "fc",
    "stream": "oper",
    "expver": "0001",
    "repres": "gg",
    "levtype": "sfc",
    "param": "2t",
    "date": "20250703",
    "time": 0,
    "step": list(range(0, 24, 6)),
    "domain": "g",
}

ranges = [(0, 10), (20, 30)]

source = ekd.from_source("gribjump", request, ranges=ranges)
ds = source.to_xarray()


mars

from_source("mars", *args, request=None, prompt=True, log="default", **kwargs)

The mars source retrieves data from the ECMWF MARS (Meteorological Archival and Retrieval System) archive.

Parameters:
  • *args (tuple) –

    positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • prompt (bool) –

    if True, it can offer a prompt to specify the credentials for web API and write them into the default RC file ~/.ecmwfapirc. The prompt only appears when:

    • no web API RC file exists at the default location ~/.ecmwfapirc

    • no web API RC file exists at the location specified via the ECMWF_API_RC_FILE environment variable

    • no credentials specified via the ECMWF_API_URL and ECMWF_API_KEY environment variables

  • log (str, None, callable, dict) –

    control the logging of the retrieval. The behaviour depends on the underlying MARS client used:

    • web API based access:

      • “default”: the built-in logging of web API is used (the log is written to stdout)

      • None: turn off logging

      • callable: the log is written to the specified callable. The callable should accept a single argument, a string with the log message.

      import earthkit.data as ekd
      
      
      def my_logging_function(msg):
          print("message=", msg)
      
      
      request = {...}
      ds = ekd.from_source("mars", request, log=my_logging_function)
      
    • direct MARS access:

      • “default”: log is written to stdout

      • None: turn off logging

      • dict specifying the “stdout” and/or the “stderr” kwargs for Python’s subprocess.run() method

  • **kwargs (dict) –

    other keyword arguments specifying the request

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
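The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the earthkit-data implementation: the function name build_requests_sketch is hypothetical, **kwargs are assumed to take precedence when a key is present in both places, and split_on is assumed to be a key name (or a list of key names) whose values each end up in a separate request.

```python
from itertools import product


def build_requests_sketch(*args, request=None, **kwargs):
    # Step 1: every dictionary found in request and *args is a separate request
    collected = []
    for item in (request, *args):
        if item is None:
            continue
        if isinstance(item, dict):
            collected.append(dict(item))
        else:
            collected.extend(dict(r) for r in item)

    # Step 2: merge **kwargs into each request, or use them as a single request
    if collected:
        collected = [{**r, **kwargs} for r in collected]
    elif kwargs:
        collected = [dict(kwargs)]

    # Step 3: split requests carrying split_on (assumed: one request per value)
    result = []
    for r in collected:
        keys = r.pop("split_on", None)
        if not keys:
            result.append(r)
            continue
        if isinstance(keys, str):
            keys = [keys]
        values = [r[k] if isinstance(r[k], (list, tuple)) else [r[k]] for k in keys]
        for combo in product(*values):
            result.append({**r, **dict(zip(keys, combo))})
    return result


reqs = build_requests_sketch(
    request={"param": ["2t", "msl"], "date": "20240421", "split_on": "param"},
    levtype="sfc",
)
for r in reqs:
    print(r)
```

With this input the sketch produces two requests, one per parameter, each carrying the merged levtype value.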

The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.

To figure out which data you need, or discover relevant data available in MARS, see the publicly accessible MARS catalog (or this access restricted catalog).

If the use-standalone-mars-client-when-available config option is True and the MARS client is installed (e.g. at ECMWF) the MARS access is direct. In this case the MARS client command can be specified via the MARS_CLIENT_EXECUTABLE environment variable. When it is not set the "/usr/local/bin/mars" path will be used.

If the standalone MARS client is not available or not enabled, the web API will be used. In order to use the web API you will need to register and retrieve an access token. For more extensive documentation about MARS, please refer to the MARS user documentation.

The following example retrieves analysis GRIB data for a subarea for 2 surface parameters:

import earthkit.data as ekd

ds = ekd.from_source(
    "mars",
    request={
        "param": ["2t", "msl"],
        "levtype": "sfc",
        "area": [50, -50, 20, 50],
        "grid": [2, 2],
        "date": "2023-05-10",
    },
)

Data downloaded from MARS is stored in the cache.


opendap

from_source("opendap", url)

The opendap source accesses NetCDF data from OPeNDAP services. OPeNDAP is an acronym for “Open-source Project for a Network Data Access Protocol”.

Parameters:

url (str) – the URL of the remote NetCDF file


polytope

from_source("polytope", collection, *args, address=None, user_email=None, user_key=None, request=None, stream=True, read_all=False, **kwargs)

The polytope source accesses the Polytope web services, using the polytope-client package.

Parameters:
  • collection (str) – the name of the polytope collection

  • *args (tuple) –

    positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • address (str) – specify the address of the polytope service

  • user_email (str) – specify the user email credential. Must be used together with user_key. This is an alternative to using the POLYTOPE_USER_EMAIL environment variable. New in version 0.7.0

  • user_key (str) – specify the user key credential. Must be used together with user_email. This is an alternative to using the POLYTOPE_USER_KEY environment variable. New in version 0.7.0

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJSON data. See details about streams here.

  • read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0

  • **kwargs (dict) –

    other keyword arguments, these can include options passed to the polytope-client

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

Please note that the preferred way to specify requests is via the request parameter, as it improves code readability.

The following example retrieves GRIB data from the “ecmwf-mars” polytope collection:

import earthkit.data as ekd

request = {
    "stream": "oper",
    "levtype": "pl",
    "levellist": "1",
    "param": "130.128",
    "step": "0/12",
    "time": "00:00:00",
    "date": "20200915",
    "type": "fc",
    "class": "rd",
    "expver": "hsvs",
    "domain": "g",
}

ds = ekd.from_source("polytope", "ecmwf-mars", request=request, stream=False)

Data downloaded from the polytope service is stored in the cache.

To access data from polytope, you will need to register and retrieve an access token.


s3

from_source("s3", *args, anon=True, aws_access_key=None, aws_secret_access_key=None, aws_token=None, stream=False, read_all=False)

New in version 0.11.0

The s3 source provides access to Amazon S3 buckets.

Parameters:
  • *args (tuple) –

    positional arguments specifying the request(s). Each request is represented by a dict. See detailed description below. A sequence of dicts can also be used to specify multiple requests.

  • anon (bool) –

    if True, use anonymous access; this only works for public buckets. If False, use the aws_access_key, aws_secret_access_key and aws_token credentials. These can also be specified as part of the request (request values override the kwargs). If no credentials are provided, botocore is used to load the AWS credentials.

  • aws_access_key (str) – the AWS access key. Can be overridden in a request. Used when anon=False.

  • aws_secret_access_key (str) – the AWS secret access key. Can be overridden in a request. Used when anon=False.

  • aws_token (str) – the AWS token only used for AWS Security Token Service (AWS STS) temporary credentials. Can be overridden in a request. Used when anon=False.

  • stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJSON data. See details about streams here.

  • read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True.

A request is a dictionary describing a single or multiple objects in a given bucket. It has the following format:

{
    "endpoint": endpoint,  # optional
    "region": region,  # optional
    "bucket": bucket,
    "objects": objects,
    "aws_access_key": aws_access_key,  # optional
    "aws_secret_access_key": aws_secret_access_key,  # optional
    "aws_token": aws_token,  # optional
}

where:

  • “endpoint”: specifies the S3 endpoint (optional). Defaults to "s3.amazonaws.com"

  • “region”: specifies the AWS region (optional). Defaults to "eu-west-2"

  • “bucket”: specifies the bucket name

  • “objects”: specifies the object in the bucket. A list/tuple of objects can be provided.

  • “aws_access_key”: the AWS access key (optional). It overrides aws_access_key. Only used when anon=False.

  • “aws_secret_access_key”: the AWS secret access key (optional). It overrides aws_secret_access_key. Only used when anon=False.

  • “aws_token”: the AWS token (optional). It overrides aws_token. Only used when anon=False.

An object can be:

  • the name of the object as a str

  • a dict in the following format:

    {"object": name, "parts": parts}
    

    where the optional “parts” can specify the parts (byte ranges) to read.
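As a minimal sketch of the format described above, the helper below assembles a request dictionary from a bucket name and a list of objects. The helper name make_s3_request is hypothetical and not part of earthkit-data; it assumes that plain object names and name/parts pairs can be mixed in one list. It only builds the dictionary and performs no network access.

```python
def make_s3_request(bucket, objects, endpoint=None, region=None):
    """Assemble a request dict in the format described above (hypothetical helper)."""
    normalised = []
    for obj in objects:
        if isinstance(obj, str):
            # a plain object name: the whole object is read
            normalised.append(obj)
        else:
            # a (name, parts) pair: only the given byte ranges are read
            name, parts = obj
            normalised.append({"object": name, "parts": parts})

    request = {"bucket": bucket, "objects": normalised}
    # optional keys are only added when given, so the documented defaults apply
    if endpoint is not None:
        request["endpoint"] = endpoint
    if region is not None:
        request["region"] = region
    return request


req = make_s3_request(
    "earthkit-test-data-public",
    ["test6.grib", ("tuv_pl.grib", (2400, 240))],
    endpoint="object-store.os-api.cci1.ecmwf.int",
)
print(req)
```

The resulting dictionary could then be passed as a positional argument to from_source("s3", ...).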

The following examples retrieve GRIB data from a publicly available bucket on the European Weather Cloud (EWC).

>>> import earthkit.data as ekd
>>> req = {
...     "endpoint": "object-store.os-api.cci1.ecmwf.int",
...     "bucket": "earthkit-test-data-public",
...     "objects": "test6.grib",
... }
>>> ds = ekd.from_source("s3", req, anon=True)
>>> ds.ls()
  centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
0   ecmf         t  isobaricInhPa   1000  20180801      1200         0       an       0  regular_ll
1   ecmf         u  isobaricInhPa   1000  20180801      1200         0       an       0  regular_ll
2   ecmf         v  isobaricInhPa   1000  20180801      1200         0       an       0  regular_ll
3   ecmf         t  isobaricInhPa    850  20180801      1200         0       an       0  regular_ll
4   ecmf         u  isobaricInhPa    850  20180801      1200         0       an       0  regular_ll
5   ecmf         v  isobaricInhPa    850  20180801      1200         0       an       0  regular_ll
>>> req = {
...     "endpoint": "object-store.os-api.cci1.ecmwf.int",
...     "bucket": "earthkit-test-data-public",
...     "objects": [
...         {"object": "test6.grib", "parts": (0, 240)},
...         {"object": "tuv_pl.grib", "parts": (2400, 240)},
...     ],
... }
>>>
>>> ds = ekd.from_source("s3", req, anon=True)
>>> ds.ls()
  centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
0   ecmf         t  isobaricInhPa   1000  20180801      1200         0       an       0  regular_ll
1   ecmf         u  isobaricInhPa    500  20180801      1200         0       an       0  regular_ll


wekeo

from_source("wekeo", dataset, *args, request=None, prompt=True, **kwargs)

WEkEO is the Copernicus DIAS reference service for environmental data and virtual processing environments. The wekeo source provides access to WEkEO using the WEkEO grammar. The retrieval is based on the hda Python API.

Parameters:
  • dataset (str) – the name of the WEkEO dataset

  • *args (tuple) –

    positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • prompt (bool) –

    if True, it can offer a prompt to specify the credentials for hda and write them into the default RC file ~/.hdarc. The prompt only appears when:

    • no hda RC file exists at the default location ~/.hdarc

    • no hda RC file exists at the location specified via the HDA_RC environment variable

    • no credentials specified via the HDA_USER and HDA_PASSWORD environment variables

  • **kwargs (dict) –

    other keyword arguments specifying the request

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

The following example retrieves monthly global Burnt Area (300 m) data derived from EO satellite imagery in NetCDF format:

import earthkit.data as ekd

ds = ekd.from_source(
    "wekeo",
    "EO:CLMS:DAT:CLMS_GLOBAL_BA_300M_V3_MONTHLY_NETCDF",
    request={
        "dataset_id": "EO:CLMS:DAT:CLMS_GLOBAL_BA_300M_V3_MONTHLY_NETCDF",
        "startdate": "2019-01-01T00:00:00.000Z",
        "enddate": "2019-01-01T23:59:59.999Z",
    },
)

Data downloaded from WEkEO is stored in the cache.

To access data from WEkEO, you will need to register and set up the Harmonized Data Access (HDA) API client. The process is described here.


wekeocds

from_source("wekeocds", dataset, *args, request=None, prompt=True, **kwargs)

WEkEO is the Copernicus DIAS reference service for environmental data and virtual processing environments. The wekeocds source provides access to Copernicus Climate Data Store (CDS) datasets served on WEkEO using the cdsapi grammar. The retrieval is based on the hda Python API.

Parameters:
  • dataset (str) – the name of the WEkEO dataset

  • *args (tuple) –

    positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries

  • request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0

  • prompt (bool) –

    if True, it can offer a prompt to specify the credentials for hda and write them into the default RC file ~/.hdarc. The prompt only appears when:

    • no hda RC file exists at the default location ~/.hdarc

    • no hda RC file exists at the location specified via the HDA_RC environment variable

    • no credentials specified via the HDA_USER and HDA_PASSWORD environment variables

  • **kwargs (dict) –

    other keyword arguments specifying the request

The following logic is applied to build the requests:

  1. All individual dictionaries found in request and *args are used as separate requests.

  2. If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.

  3. If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.

The following example retrieves ERA5 monthly averaged surface data for a single month in GRIB format:

import earthkit.data as ekd

ds = ekd.from_source(
    "wekeocds",
    "EO:ECMWF:DAT:REANALYSIS_ERA5_SINGLE_LEVELS_MONTHLY_MEANS_MONTHLY_MEANS",
    request=dict(
        variable=["2m_temperature", "mean_sea_level_pressure"],
        product_type=["monthly_averaged_reanalysis_by_hour_of_day"],
        year=["2012"],
        month=["12"],
        time=["11:00"],
        data_format="grib",
        download_format="zip",
    ),
)

Data downloaded from WEkEO is stored in the cache.

To access data from WEkEO, you will need to register and set up the Harmonized Data Access (HDA) API client. The process is described here.


zarr

from_source("zarr", path)

New in version 0.15.0

The zarr source accesses data from a Zarr store. Internally the data is loaded via the xarray.open_zarr() method, so only Zarr data supported by Xarray can be accessed. Requires zarr version 3 or later.

Parameters:

path (str) – path or URL to the Zarr store