Data sources
Getting data from a source
We can get data from a given source by using from_source():
- from_source(name, *args, **kwargs)
Return a data object from the source specified by name.
- Parameters:
name (str) – the source (see below)
*args (tuple) – specifies the data location and additional parameters to access the data
**kwargs (dict) – provides additional functionalities including caching, filtering, sorting and indexing
earthkit-data has the following built-in sources:
Name              Description
file              read data from a file/files
file-pattern      read data from a list of files created from a pattern
url               read data from a URL
url-pattern       read data from a list of URLs created from a pattern
sample            read example data
stream            read data from a stream
memory            read data from a memory buffer
forcings          generate forcing data
list-of-dicts     read data from a list of dictionaries
multi             read data from multiple sources
ads               retrieve data from the Copernicus Atmosphere Data Store (ADS)
cds               retrieve data from the Copernicus Climate Data Store (CDS)
ecfs              retrieve data from the ECMWF ECFS File Storage system
ecmwf-open-data   retrieve ECMWF open data
fdb               retrieve data from the Fields DataBase (FDB)
gribjump          retrieve data from the FDB (Fields DataBase) using the gribjump library
mars              retrieve data from the ECMWF MARS archive
opendap           retrieve NetCDF data from OPEnDAP services
polytope          retrieve fields from the Polytope services
s3                retrieve data from Amazon S3 buckets
wekeo             retrieve data from WEkEO using the WEkEO grammar
zarr              load data from a Zarr store
file
- from_source("file", path, expand_user=True, expand_vars=False, unix_glob=True, recursive_glob=True, filter=None, parts=None)
The simplest source is file, which can access a local file or a list of files.
- Parameters:
path (str, list, tuple) – input path(s). Each path can be a file path or a directory path. If it is a directory path, it is recursively scanned for supported files. When a path is an archive format such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable files, which are then stored in the cache. Each file path can contain the parts defining the byte ranges to read.
expand_user (bool) – replace the leading ~ or ~user in path by that user’s home directory. See os.path.expanduser
expand_vars (bool) – expand shell environment variables in path. See os.path.expandvars
unix_glob (bool) – allow UNIX globbing in path
recursive_glob (bool) – allow recursive scanning of directories. Only used when unix_glob is True
filter (str, callable) – apply filter to the files read from directories or archives. The filter can be a callable or a string. If it is a string, it is interpreted as a UNIX glob pattern. If it is a callable, it should accept the full file path as a string and return a boolean.
parts (pair, list or tuple of pairs, None) – the parts to read from the file(s) specified by path. Cannot be used when path already defines the parts.
stream (bool) – if True, the data is read as a stream. Directories and archives are supported. Stream based access is only available for GRIB and CoverageJson data. See details about streams here. New in version 0.11.0
read_all (bool) – if True, all the data is read straight to memory from a stream. Used when stream=True. New in version 0.11.0
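The two forms accepted by filter can be pictured with a small standard-library sketch. It is only an illustration: normalise_filter is a hypothetical helper, and matching the glob against the full path (rather than the file name only) is an assumption, not something stated above.

```python
import fnmatch
from typing import Callable, Optional, Union


def normalise_filter(
    flt: Optional[Union[str, Callable[[str], bool]]]
) -> Callable[[str], bool]:
    """Turn a glob string or a callable into a predicate on file paths."""
    if flt is None:
        # no filter: accept every path
        return lambda path: True
    if isinstance(flt, str):
        pattern = flt
        # assumption: the glob is matched against the full path
        return lambda path: fnmatch.fnmatchcase(path, pattern)
    # already a callable taking the path and returning a bool
    return flt


paths = [
    "data/2020/t_850.grib",
    "data/2020/z_500.grib",
    "data/2020/readme.txt",
]

only_grib = normalise_filter("*.grib")
only_t = normalise_filter(lambda path: "/t_" in path)

grib_paths = [p for p in paths if only_grib(p)]
t_paths = [p for p in paths if only_t(p)]
```

Either form ends up as a simple predicate, which is why a callable is the more flexible choice when a glob cannot express the condition.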
earthkit-data will inspect the content of the files to check for any of the supported data formats.
When the input is an archive format such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable files, which are then stored in the cache.
The path can be used in a flexible way:

    import earthkit.data as ekd

    # UNIX globbing is allowed by default
    ds = ekd.from_source("file", "path/to/t_*.grib")

    # list of files can be specified
    ds = ekd.from_source("file", ["path/to/f1.grib", "path/to/f2.grib"])

    # a path can be a directory, in this case it is recursively scanned for supported files
    ds = ekd.from_source("file", "path/to/dir")
The following examples use parts:

    import earthkit.data as ekd

    # reading only certain parts (byte ranges) from a single file
    ds = ekd.from_source("file", "my.grib", parts=[(0, 150), (400, 160)])

    # reading only certain parts (byte ranges) from multiple files
    ds = ekd.from_source(
        "file",
        [
            ("a.grib", (0, 150)),
            ("b.grib", (240, 120)),
            ("c.grib", None),
            ("d.grib", [(240, 120), (720, 120)]),
        ],
    )
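To make the byte-range idea concrete, the sketch below reads parts from a file using only the standard library. It assumes that each pair means (offset, length), which is how the pairs in the examples read but is not spelled out here; read_parts is a hypothetical helper and not part of earthkit-data.

```python
import os
import tempfile


def read_parts(path, parts):
    """Read the given byte ranges from a file.

    Assumption: each part is an (offset, length) pair.
    """
    chunks = []
    with open(path, "rb") as f:
        for offset, length in parts:
            f.seek(offset)
            chunks.append(f.read(length))
    return chunks


# Build a small throw-away file so that the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    tmp_path = tmp.name

try:
    chunks = read_parts(tmp_path, [(0, 4), (10, 3)])
finally:
    os.remove(tmp_path)
```

Only the requested bytes are read, which is what makes parts useful for picking individual messages out of a large file.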
file-pattern
- from_source("file-pattern", pattern, *args, hive_partitioning=False, **kwargs)
The file-pattern source reads data from paths specified by a pattern.
- Parameters:
pattern (str) – input path pattern using {} brackets to define parameters that can be substituted. See patterns for details.
*args (tuple) – specify the values to substitute into the parameters of pattern. Each parameter can be a list/tuple or a single value.
hive_partitioning (bool) – control how the pattern is interpreted. See details below.
**kwargs (dict) – other keyword arguments specifying the parameter values
The actual behaviour and the type of the returned object depend on hive_partitioning:
hive_partitioning=False
When hive_partitioning is False, first, the pattern parameters are substituted with the values specified by *args and **kwargs, see patterns for details. For this, all the possible values must be specified for each pattern parameter. Next, the paths are constructed by taking the Cartesian product of the substituted values. Finally, the resulting paths are read and from_source returns a single object (for GRIB data it will be a Fieldlist).

    import datetime
    import earthkit.data as ekd

    # ds is a fieldlist
    ds = ekd.from_source(
        "file-pattern",
        "path/to/data-{my_date:date(%Y-%m-%d)}-{run_time}-{param}.grib",
        {
            "my_date": datetime.datetime(2020, 5, 2),
            "run_time": [12, 18],
            "param": ["t2", "msl"],
        },
    )

The code above substitutes “my_date”, “run_time” and “param” into the pattern and constructs the following file paths, which are read into a single GRIB Fieldlist:

    path/to/data-2020-05-02-12-t2.grib
    path/to/data-2020-05-02-12-msl.grib
    path/to/data-2020-05-02-18-t2.grib
    path/to/data-2020-05-02-18-msl.grib
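The Cartesian product step can be reproduced with the standard library. This is a sketch under simplifying assumptions: plain str.format fields stand in for the earthkit-data pattern syntax (the date is formatted beforehand instead of using the date(...) specifier), and the variable names are illustrative only.

```python
import datetime
import itertools

# Hypothetical stand-in for the earthkit-data pattern: plain str.format
# fields are used instead of the "{my_date:date(%Y-%m-%d)}" syntax.
template = "path/to/data-{my_date}-{run_time}-{param}.grib"

values = {
    "my_date": [datetime.datetime(2020, 5, 2).strftime("%Y-%m-%d")],
    "run_time": [12, 18],
    "param": ["t2", "msl"],
}

# Cartesian product of all the values, in the order the keys are given.
keys = list(values)
paths = [
    template.format(**dict(zip(keys, combo)))
    for combo in itertools.product(*(values[k] for k in keys))
]

for p in paths:
    print(p)
```

With one date, two run times and two parameters the product contains 1 x 2 x 2 = 4 paths, matching the list above.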
hive_partitioning=True
When hive_partitioning is True, the pattern defines a Hive partitioning with each pattern parameter interpreted as a metadata key. The returned object has a limited scope, only supporting the sel() method. Calling this method will trigger a filesystem scan for all the matching files. During this scan, if the required metadata is present in the pattern, no files will be opened at all to extract their metadata, which can be an enormous optimisation. Another advantage is that during the scan entire file system branches can be skipped based simply on inspecting the actual file path.
Pattern values are optional, but can still be specified to restrict the search to a specific set of values.
For the hive partitioning example below let us suppose we have the following directory structure containing several years of GRIB data:

    mydir/
        20230101/
            myfile_t.grib
            myfile_r.grib
            myfile_u.grib
            myfile_v.grib
        20230102/
            myfile_t.grib
            myfile_r.grib
            myfile_u.grib
            myfile_v.grib
        20230103/
            myfile_t.grib
            myfile_r.grib
            myfile_u.grib
            myfile_v.grib
        20230104/
        ...

    import earthkit.data as ekd

    # At this point nothing is scanned/read yet. ds only has the
    # sel() method.
    ds = ekd.from_source(
        "file-pattern", "mydir/{date}/myfile_{param}.grib", hive_partitioning=True
    )

    # The following line will trigger a filesystem scan
    # for all the matching files. The scan will be limited to the
    # "mydir/20230101/" sub-directory and none of the GRIB files will be
    # opened to extract their metadata. The returned object will
    # be a Fieldlist.
    ds1 = ds.sel(date="20230101", param=["t", "r"])
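Why no file needs to be opened can be seen from a small sketch that selects files purely from their paths. It is not how earthkit-data implements the scan; pattern_to_regex and select are hypothetical helpers, and a list of path strings stands in for the filesystem.

```python
import re


def pattern_to_regex(pattern):
    """Convert "mydir/{date}/myfile_{param}.grib" into a regex with named groups."""
    out = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        out.append(re.escape(pattern[pos:m.start()]))
        # a parameter matches any run of characters within one path component
        out.append(f"(?P<{m.group(1)}>[^/]+)")
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    return re.compile("".join(out) + r"\Z")


def select(paths, pattern, **selection):
    """Keep the paths whose pattern parameters match the selection.

    Only the path strings are inspected; no file is opened.
    """
    rx = pattern_to_regex(pattern)
    wanted = {
        k: {str(x) for x in (v if isinstance(v, (list, tuple)) else [v])}
        for k, v in selection.items()
    }
    result = []
    for path in paths:
        m = rx.match(path)
        if m and all(m.group(k) in vals for k, vals in wanted.items()):
            result.append(path)
    return result


# Stand-in for the directory tree above.
paths = [
    f"mydir/{date}/myfile_{param}.grib"
    for date in ("20230101", "20230102")
    for param in ("t", "r", "u", "v")
]

matched = select(
    paths, "mydir/{date}/myfile_{param}.grib", date="20230101", param=["t", "r"]
)
```

A real scan could go one step further and skip a whole directory such as mydir/20230102/ as soon as its name fails to match the selected date.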
url
- from_source("url", url, unpack=True, parts=None, stream=False, read_all=False)
The url source will download the data from the address specified and store it in the cache. The supported data formats are the same as for the file data source above.
- Parameters:
url (str) – the URL(s) to download. Each URL can contain the parts defining the byte ranges to read.
unpack (bool) – for archive formats such as .zip, .tar, .tar.gz, etc, earthkit-data will attempt to open it and extract any usable file. To keep the downloaded file as is use unpack=False
parts (pair, list or tuple of pairs, None) – the parts to read from the resource(s) specified by url. Cannot be used when url already defines the parts.
stream (bool) – if True, the data is read as a stream. Otherwise the data is retrieved into a file and stored in the cache. This option only works for GRIB data. No archive formats supported (unpack is ignored). stream only works for http and https URLs. See details about streams here.
read_all (bool) – if True, all the data is read straight to memory from a stream. Used when stream=True. New in version 0.8.0
**kwargs (dict) – other keyword arguments specifying the request

    >>> import earthkit.data as ekd
    >>> ds = ekd.from_source(
    ...     "url",
    ...     "https://sites.ecmwf.int/repository/earthkit-data/examples/test4.grib",
    ... )
    >>> ds.ls()
      centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
    0   ecmf         t  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
    1   ecmf         z  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
    2   ecmf         t  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll
    3   ecmf         z  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll

    >>> import earthkit.data as ekd
    >>> ds = ekd.from_source(
    ...     "url",
    ...     "https://sites.ecmwf.int/repository/earthkit-data/examples/test4.grib",
    ...     parts=[(0, 130428), (260856, 130428)],
    ... )
    >>> ds.ls()
      centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
    0   ecmf         t  isobaricInhPa    500  20070101      1200         0       an       0  regular_ll
    1   ecmf         t  isobaricInhPa    850  20070101      1200         0       an       0  regular_ll
url-pattern
- from_source("url-pattern", url, unpack=True)
The url-pattern source will build URLs from the pattern specified, using the other arguments to fill the pattern. Each argument can be a list to iterate over, and the Cartesian product of all the lists is created. Then each URL is downloaded and stored in the cache. The supported data formats are the same as for the file and url data sources above.

    import earthkit.data as ekd

    ds = ekd.from_source(
        "url-pattern",
        "https://www.example.com/data-{foo}-{bar}-{qux}.csv",
        foo=[1, 2, 3],
        bar=["a", "b"],
        qux="unique",
    )

The code above will download and process the data from the following six URLs:

    https://www.example.com/data-1-a-unique.csv
    https://www.example.com/data-2-a-unique.csv
    https://www.example.com/data-3-a-unique.csv
    https://www.example.com/data-1-b-unique.csv
    https://www.example.com/data-2-b-unique.csv
    https://www.example.com/data-3-b-unique.csv

If the URLs point to an archive format, the data will be unpacked by url-pattern according to the unpack argument, similarly to what the url source does (see the url source above).
sample
- from_source("sample", name_or_path)
The sample source will download example data prepared for earthkit and store it in the cache. The supported data formats are the same as for the file data source above.
- Parameters:
name_or_path (str, list, tuple) – input file name(s) or relative path(s) to the root of the remote storage folder.

    >>> import earthkit.data as ekd
    >>> ds = ekd.from_source("sample", "storm_ophelia_wind_850.grib")
    >>> ds.ls()
      centre shortName    typeOfLevel  level  dataDate  dataTime stepRange dataType  number    gridType
    0   ecmf         u  isobaricInhPa    850  20171016         0         0       an       0  regular_ll
    1   ecmf         v  isobaricInhPa    850  20171016         0         0       an       0  regular_ll
stream
- from_source("stream", stream, read_all=False)
The stream source will read data from a stream (or streams), which can be an FDB stream, a standard Python IO stream or any object implementing the necessary stream methods. At the moment it only works for GRIB and CoverageJson data. For more details see here.
- Parameters:
stream (stream, list, tuple) – the stream(s)
read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0
In the examples below, for simplicity, we create a file stream from a GRIB file. By default from_source() returns an object that can only be used as an iterator.

    >>> import earthkit.data as ekd
    >>> stream = open("docs/examples/test4.grib", "rb")
    >>> ds = ekd.from_source("stream", stream)

    # f is a GribField
    >>> for f in ds:
    ...     print(f)
    ...
    GribField(t,500,20070101,1200,0,0)
    GribField(z,500,20070101,1200,0,0)
    GribField(t,850,20070101,1200,0,0)
    GribField(z,850,20070101,1200,0,0)

We can also iterate through the stream in batches of fixed size using batched():

    >>> import earthkit.data as ekd
    >>> stream = open("docs/examples/test4.grib", "rb")
    >>> ds = ekd.from_source("stream", stream)

    # f is a FieldList
    >>> for f in ds.batched(2):
    ...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
    ...
    len=2 [('t', 500), ('z', 500)]
    len=2 [('t', 850), ('z', 850)]

When using group_by() we can iterate through the stream in groups defined by metadata keys. In this case each iteration step yields a FieldList.

    >>> import earthkit.data as ekd
    >>> stream = open("docs/examples/test4.grib", "rb")
    >>> ds = ekd.from_source("stream", stream)

    # f is a FieldList
    >>> for f in ds.group_by("level"):
    ...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
    ...
    len=2 [('t', 500), ('z', 500)]
    len=2 [('t', 850), ('z', 850)]

We can consume the whole stream and load all the data into memory by using read_all=True in from_source(). Use this option carefully!

    >>> import earthkit.data as ekd
    >>> stream = open("docs/examples/test4.grib", "rb")
    >>> ds = ekd.from_source("stream", stream, read_all=True)

    # ds is empty at this point, but calling any method on it will
    # consume the whole stream
    >>> len(ds)
    4

    # now ds stores all the messages in memory
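The batched and grouped iteration patterns above can be pictured with plain Python iterators. This is only an analogy: the (param, level) tuples stand in for GRIB fields, the helper functions are hypothetical, and grouping consecutive items that share a key value is an assumption about the behaviour rather than a description of the earthkit-data implementation.

```python
import itertools


def field_stream():
    """Stand-in for a stream of GRIB fields: (param, level) pairs in stream order."""
    yield from [("t", 500), ("z", 500), ("t", 850), ("z", 850)]


def batched(iterable, n):
    """Yield lists of at most n consecutive items."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, n))
        if not batch:
            return
        yield batch


def group_by(iterable, key):
    """Yield lists of consecutive items sharing the same key value."""
    for _, group in itertools.groupby(iterable, key=key):
        yield list(group)


batches = list(batched(field_stream(), 2))
groups = list(group_by(field_stream(), key=lambda f: f[1]))
```

In both cases only one batch or group is held at a time, which is the point of stream-based access; read_all=True gives up that property in exchange for random access.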
memory
- from_source("memory", buffer)
The memory source will read data from a memory buffer. Currently it only works for a buffer storing GRIB data or a single CoverageJson object. The result is a FieldList object storing all the data in memory.

    import earthkit.data as ekd

    # buffer storing a GRIB message
    buffer = ...

    ds = ekd.from_source("memory", buffer)

    # f is the only GribField in ds
    f = ds[0]

Please note that if the given input can be read as a stream we can also use the stream source to read the buffer using io.BytesIO. The equivalent code to the example above using a stream is as follows:

    import io
    import earthkit.data as ekd

    # buffer storing a GRIB message
    buffer = ...
    stream = io.BytesIO(buffer)

    ds = ekd.from_source("stream", stream, read_all=True)

    # f is the only GribField in ds
    f = ds[0]
forcings
- from_source("forcings", source_or_dataset=None, *, request={}, **kwargs)
- Parameters:
source_or_dataset (Source, FieldList or None) – the input data. It can be the object returned from from_source() or a FieldList. If it is None, a list-of-dicts source is built from the request. The first field in this data is used as a template to build the forcing fields.
request (dict) – specify the request
**kwargs (dict) – other keyword arguments specifying the request
The forcings source generates forcing fields.
list-of-dicts
- from_source("list-of-dicts", list_of_dicts)
The list-of-dicts source will read data from a list of dictionaries. Each dictionary represents a single field and the result is a FieldList consisting of ArrayField fields.
Note
No attempt is made to represent the fields internally as GRIB messages, so field functionalities are limited, and some of them may not work at all. The fields cannot be saved to a GRIB file.
The only required key for a dictionary is “values”, which represents the data values. It can be a list, tuple or an ndarray. All the other keys define the metadata and are optional. However, many field functionalities require the existence of specific keys (see below).
The keys that might be interpreted internally can be grouped into the following categories:
Geography keys:
“latitudes”: the latitudes, iterable or ndarray
“longitudes”: the longitudes, iterable or ndarray
“distinctLatitudes”: the distinct latitudes, iterable or ndarray
“distinctLongitudes”: the distinct longitudes, iterable or ndarray
These keys are required to make any geography related field functionalities work (e.g. to_latlon()). The role of the keys depends on the grid type:
structured grids: “latitudes” and “longitudes” can define the distinct latitudes and longitudes or the full grid. The keys “distinctLatitudes” and “distinctLongitudes” are only used when “latitudes” and “longitudes” are not present and in this case they define the distinct latitudes and longitudes.
other grids: “latitudes” and “longitudes” must have the same number of points as “values”.
When other GRIB related geography keys are present, no attempt is made to check if they are consistent with the grid defined by “latitudes” and “longitudes”. Therefore their usage is strongly discouraged.
See: list-of-dicts: defining geography for more details.
Parameter keys:
“param”: the parameter name, alias to “shortName” if missing. Must be a str.
“shortName”: the parameter name, alias to “param” if missing. Must be a str.
Temporal keys:
“date”: the date part of the forecast reference time. Must be an int as YYYYMMDD (the same format as the “date” ecCodes GRIB key).
“time”: the time part of the forecast reference time. Must be an int as hhmm with leading zeros omitted (the same format as the “time” ecCodes GRIB key).
“dataDate”: alias to “date”
“dataTime”: alias to “time”
“forecast_reference_time”: the forecast reference time. Must be a datetime object. If not present it is automatically built from “date” and “time” or from “valid_datetime” and “step”.
“base_datetime”: alias to “forecast_reference_time”
“valid_datetime”: the valid datetime. Must be a datetime object. If not present it is automatically built from “forecast_reference_time” and “step”.
“step”: the forecast step. If it is an int, it specifies the number of hours. If it is a str it must use the same format as the “step” ecCodes GRIB key. Can be a timedelta object.
“step_timedelta”: the step timedelta. Must be a timedelta object. If not present it is automatically built from “step”.
Level keys:
“level”: the level value. Must be a number.
“levelist”: the level value. Must be a number.
“typeOfLevel”: the type of level. Must be a str.
“levtype”: the type of level. Must be a str.
These keys are supposed to be the same as the corresponding GRIB keys.
Ensemble keys:
“number”: the ensemble member number. Must be an int.
Other keys:
Other keys can be used to store additional metadata.
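How the temporal keys relate to each other can be worked through with the standard library. The sketch below is illustrative: reference_time is a hypothetical helper, the field values are made up, and the arithmetic only mirrors the description of the temporal keys (forecast reference time built from “date” and “time”, valid datetime built from the forecast reference time and “step”).

```python
import datetime


def reference_time(date: int, time: int) -> datetime.datetime:
    """Combine date (YYYYMMDD) and time (hhmm, leading zeros omitted)."""
    day = datetime.datetime.strptime(str(date), "%Y%m%d")
    return day + datetime.timedelta(hours=time // 100, minutes=time % 100)


# One dictionary describing a single field; only "values" is required.
field = {
    "values": [1.0, 2.0, 3.0, 4.0],
    "param": "t",
    "level": 500,
    "date": 20180801,
    "time": 1200,
    "step": 6,  # an int step is a number of hours
}

forecast_reference_time = reference_time(field["date"], field["time"])
step_timedelta = datetime.timedelta(hours=field["step"])
valid_datetime = forecast_reference_time + step_timedelta
```

A list of such dictionaries is what would be passed as list_of_dicts to from_source("list-of-dicts", list_of_dicts).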
multi
- from_source("multi", *sources, merger=None, **kwargs)
The multi source reads multiple sources.
- Parameters:
*sources (tuple) – the sources
merger – if it is None an attempt is made to merge/concatenate the sources by their classes (using the nearest common class). Otherwise the sources are merged/concatenated using the merger in a lazy way. The merger can be one of the following:
class/object implementing the to_xarray() or to_pandas() methods
callable
str, describing a call either to “concat” or “merge”. E.g.: “concat(concat_dim=time)”
tuple with 2 elements. The first element is a str, either “concat” or “merge”, and the second element is a dict with the keyword arguments for the call. E.g.: (“concat”, {“concat_dim”: “time”})
**kwargs (dict) – other keyword arguments
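The str and tuple forms of merger carry the same information. The sketch below converts one into the other to show the correspondence; parse_merger is a hypothetical helper written for this illustration, and how earthkit-data actually parses the string is not described here.

```python
import re


def parse_merger(text):
    """Parse "concat(concat_dim=time)" into ("concat", {"concat_dim": "time"})."""
    m = re.fullmatch(r"\s*(concat|merge)\s*(?:\((.*)\))?\s*", text)
    if m is None:
        raise ValueError(f"unsupported merger string: {text!r}")
    name, arg_text = m.group(1), m.group(2) or ""
    kwargs = {}
    # split "a=1, b=2" into key/value pairs; values are kept as strings
    for item in filter(None, (s.strip() for s in arg_text.split(","))):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        kwargs[key.strip()] = value.strip()
    return name, kwargs


merger = parse_merger("concat(concat_dim=time)")
```

The tuple form is the safer choice when a keyword argument is not a plain string, since the string form cannot express types.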
ads
- from_source("ads", dataset, *args, request=None, **kwargs)
The ads source accesses the Copernicus Atmosphere Data Store (ADS), using the cdsapi package.
- Parameters:
dataset (str) – the name of the ADS dataset
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
**kwargs (dict) – other keyword arguments specifying the request
Note
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
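The three rules can be sketched in plain Python. This is a hypothetical reading of the rules, not the earthkit-data implementation: build_requests is an invented name, keyword arguments are assumed to override keys already present in a request, and split_on is assumed to be a key name or a list of key names that produces one request per combination of the listed values.

```python
import itertools


def build_requests(*args, request=None, **kwargs):
    """Hypothetical sketch of the request-building rules."""
    # 1. collect the individual dictionaries from request and *args
    requests = []
    for item in (request, *args):
        if item is None:
            continue
        if isinstance(item, dict):
            requests.append(dict(item))
        else:
            requests.extend(dict(r) for r in item)

    # 2. merge **kwargs into each request, or use them as a single request
    if requests:
        requests = [{**r, **kwargs} for r in requests]
    elif kwargs:
        requests = [dict(kwargs)]

    # 3. split on the keys listed in split_on (assumed semantics, see above)
    result = []
    for r in requests:
        split_on = r.pop("split_on", None)
        if split_on is None:
            result.append(r)
            continue
        keys = [split_on] if isinstance(split_on, str) else list(split_on)
        value_lists = [
            r[k] if isinstance(r[k], (list, tuple)) else [r[k]] for k in keys
        ]
        for combo in itertools.product(*value_lists):
            result.append({**r, **dict(zip(keys, combo))})
    return result


reqs = build_requests(
    request={"param": ["2t", "msl"], "date": "2012-05-10", "split_on": "param"},
    time="12:00",
)
```

Under these assumptions the single input request becomes two requests, one per parameter, each carrying the merged time keyword.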
The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.
Note
Currently, for accessing ADS earthkit-data requires the credentials for cdsapi to be stored in the RC file ~/.adsapirc.
When no ~/.adsapirc RC file exists a prompt will appear to specify the credentials for cdsapi and write them into ~/.adsapirc.
The following example retrieves CAMS global reanalysis GRIB data for 2 parameters:

    import earthkit.data as ekd

    ds = ekd.from_source(
        "ads",
        "cams-global-reanalysis-eac4",
        request=dict(
            variable=["particulate_matter_10um", "particulate_matter_1um"],
            area=[50, -50, 20, 50],  # N,W,S,E
            date="2012-12-12",
            time="12:00",
        ),
    )

Data downloaded from the ADS is stored in the cache.
To access data from the ADS, you will need to register and retrieve an access token. The process is described here. For more information, see the ADS knowledge base.
cds
- from_source("cds", dataset, *args, request=None, prompt=True, **kwargs)
The cds source accesses the Copernicus Climate Data Store (CDS), using the cdsapi package.
- Parameters:
dataset (str) – the name of the CDS dataset
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
prompt (bool) – if True, it can offer a prompt to specify the credentials for cdsapi and write them into the default RC file ~/.cdsapirc. The prompt only appears when:
**kwargs (dict) – other keyword arguments specifying the request
Note
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.
The following example retrieves ERA5 reanalysis GRIB data for a subarea for 2 surface parameters. The request is specified using the request argument:

    import earthkit.data as ekd

    ds = ekd.from_source(
        "cds",
        "reanalysis-era5-single-levels",
        request=dict(
            product_type="reanalysis",
            area=[50, -10, 40, 10],  # N,W,S,E
            grid=[2, 2],
            date="2012-05-10",
        ),
    )

Data downloaded from the CDS is stored in the cache.
To access data from the CDS, you will need to register and retrieve an access token. The process is described here. For more information, see the CDS knowledge base.
ecfs
- from_source("ecfs", path)
The ecfs source provides access to ECMWF’s File Storage system. This service is only available at ECMWF.
The path has to start with ec: followed by the path to the file to retrieve.
ecmwf-open-data
- from_source("ecmwf-open-data", *args, source="ecmwf", model="ifs", request=None, **kwargs)
The ecmwf-open-data source provides access to the ECMWF open data, which is a subset of ECMWF real-time forecast data made available to the public free of charge. It uses the ecmwf-opendata package.
- Parameters:
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
source (str) – either the name of the server to contact or a fully qualified URL. Possible values are “ecmwf” to access ECMWF’s servers, or “azure” to access data hosted on Microsoft’s Azure. Default is “ecmwf”.
model (str) – name of the model that produced the data. Use “ifs” for the physics-driven model and “aifs” for the data-driven model. Please note that “aifs” is currently experimental and only produces a small subset of fields. Default is “ifs”.
**kwargs (dict) – other keyword arguments specifying the request
Note
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
Details about the request format can be found here.
The following example retrieves data for 2 surface parameters from the latest forecast:

    import earthkit.data

    ds = earthkit.data.from_source(
        "ecmwf-open-data",
        request=dict(param=["2t", "msl"], levtype="sfc", step=[0, 6, 12]),
    )
The resulting GRIB data files are stored in the cache.
fdb
- from_source("fdb", *args, config=None, userconfig=None, request=None, stream=True, read_all=False, lazy=False, **kwargs)
The fdb source accesses the FDB (Fields DataBase), which is a domain-specific object store developed at ECMWF for storing, indexing and retrieving GRIB data. earthkit-data uses the pyfdb package to retrieve data from FDB.
- Parameters:
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries, but currently only one request is supported.
config (dict, str) – the FDB configuration directly passed to pyfdb.FDB(). If not provided, the configuration is either read from the environment or the default configuration is used. New in version 0.11.0
userconfig (dict, str) – the FDB user configuration directly passed to pyfdb.FDB(). If not provided, the configuration is either read from the environment or the default configuration is used. New in version 0.11.0
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests, but currently only one request is supported. New in version 0.18.0
stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJson data. See details about streams here.
read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0
lazy (bool) – if True, the data is read in a lazy way. This means the following:
GRIB data is not retrieved until it is explicitly/implicitly requested for a given field
metadata related calls (e.g. metadata() or sel()) work without retrieving the GRIB data
to_xarray() works without retrieving the GRIB data
the retrieved GRIB data is not cached (either in memory or on disk) but gets deleted as soon as the data values are extracted. Repeated requests for the data values will trigger a new retrieval.
the resulting FieldList always retrieves one GRIB field as a reference and stores it in memory throughout the lifetime of the FieldList. This is managed internally.
When lazy=True the stream and read_all options are ignored. Please note that this is an experimental feature. New in version 0.14.0
**kwargs (dict) – other keyword arguments specifying the request
Note
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
The following example retrieves analysis GRIB data for 3 surface parameters as a stream. By default we will consume one message at a time and ds can only be used as an iterator:

    >>> import earthkit.data as ekd
    >>> request = {
    ...     "class": "od",
    ...     "expver": "0001",
    ...     "stream": "oper",
    ...     "date": "20240421",
    ...     "time": [0, 12],
    ...     "domain": "g",
    ...     "type": "an",
    ...     "levtype": "sfc",
    ...     "step": 0,
    ...     "param": [151, 167, 168],
    ... }
    >>>
    >>> ds = ekd.from_source("fdb", request=request)
    >>> for f in ds:
    ...     print(f)
    ...
    GribField(msl,None,20240421,0,0,0)
    GribField(2t,None,20240421,0,0,0)
    GribField(2d,None,20240421,0,0,0)
    GribField(msl,None,20240421,1200,0,0)
    GribField(2t,None,20240421,1200,0,0)
    GribField(2d,None,20240421,1200,0,0)

We can also iterate through the stream in batches of fixed size using batched(). ds is still just an iterator, but f is now a FieldList containing 2 fields:

    >>> ds = ekd.from_source("fdb", request=request)
    >>> for f in ds.batched(2):
    ...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
    ...
    len=2 [('msl', 0), ('2t', 0)]
    len=2 [('2d', 0), ('msl', 0)]
    len=2 [('2t', 0), ('2d', 0)]

When using group_by() we can iterate through the stream in groups defined by metadata keys. In this case each iteration step yields a FieldList.

    >>> ds = ekd.from_source("fdb", request=request)
    >>> for f in ds.group_by("time"):
    ...     print(f"len={len(f)} {f.metadata(('param', 'level'))}")
    ...
    len=3 [('msl', 0), ('2t', 0), ('2d', 0)]
    len=3 [('msl', 0), ('2t', 0), ('2d', 0)]

We can consume the whole stream and load all the data into memory by using read_all=True in from_source(). Use this option carefully!

    >>> import earthkit.data as ekd
    >>> ds = ekd.from_source("fdb", request=request, read_all=True)

    # ds is empty at this point, but calling any method on it will
    # consume the whole stream
    >>> len(ds)
    6

    # now ds stores all the messages in memory
gribjump
- from_source("gribjump", request, *, ranges=None, mask=None, indices=None, fetch_coords_from_fdb=False, fdb_kwargs=None, **kwargs)
New in version 0.17.0
The gribjump source enables fast retrieval of GRIB message subsets from the FDB (Fields DataBase) using the gribjump library. Both pygribjump and pyfdb must be installed. The pygribjump package uses findlibs to locate an installation of the gribjump library. If the library is not available on your system, you can install it via the gribjumplib wheel from PyPI. Installing gribjumplib from PyPI will also automatically install fdb5lib and other dependencies, which may take priority over any existing installations on your system.
Warning
⚠️ This source is experimental and may change in future versions without warning. It performs no validation that the specified grid indices, masks, or ranges correspond to the fields’ actual underlying grids. Incorrect usage may silently return wrong data points. The provided ranges or masks might correspond to unexpected points on the grid. This source is also currently not thread-safe.
Exactly one of the parameters ranges, mask or indices must be specified at a time.
- Parameters:
request (dict) – the FDB request as a dictionary. GribJump requires strict value formatting (e.g., hdates as “YYYYMMDD”, not “YYYY-MM-DD”). Format errors may result in “DataNotFound” errors.
ranges (list[tuple[int, int]], optional) – a list of tuples specifying the ranges of 1D grid indices to retrieve in the form [(start1, end1), (start2, end2), …]. Ranges are end-exclusive, meaning that the end index is not included in the range.
mask (numpy.array, optional) – a 1D boolean mask specifying which grid points to retrieve
indices (numpy.array, optional) – a 1D array of grid indices to retrieve
fetch_coords_from_fdb (bool, optional) – if True, loads the first field’s metadata from the FDB to extract the coordinates at the specified indices. If False, the coordinates are not loaded and no separate FDB request is made. Default is False. Please note that no validation is performed to ensure that all fields in the requests share the same grid.
fdb_kwargs (dict, optional) – only used when fetch_coords_from_fdb=True. A dict of keyword arguments passed to the pyfdb.FDB constructor. This allows specifying the FDB configuration, user configuration, etc. If not provided, the default configuration is used. These arguments are only passed to the FDB when fetching coordinates and are not used by GribJump for the extraction itself.
The following example retrieves a subset of grid points from GRIB messages in the FDB using index ranges:
    import earthkit.data as ekd
    import numpy as np

    request = {
        "class": "od",
        "type": "fc",
        "stream": "oper",
        "expver": "0001",
        "repres": "gg",
        "levtype": "sfc",
        "param": "2t",
        "date": "20250703",
        "time": 0,
        "step": list(range(0, 24, 6)),
        "domain": "g",
    }

    ranges = [(0, 10), (20, 30)]

    source = ekd.from_source("gribjump", request, ranges=ranges)
    ds = source.to_xarray()
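The half-open range convention and the "exactly one of" rule can be illustrated without an FDB. The following is a minimal sketch in plain Python. The helper names check_exactly_one and ranges_to_indices are hypothetical and are not part of earthkit-data or pygribjump; the sketch only shows which 1D grid indices a given ranges value would select.

```python
def check_exactly_one(ranges=None, mask=None, indices=None):
    """Return the name of the single selection argument that was given.

    Hypothetical helper mirroring the documented rule that exactly one of
    ranges, mask or indices must be specified.
    """
    given = [
        name
        for name, value in (("ranges", ranges), ("mask", mask), ("indices", indices))
        if value is not None
    ]
    if len(given) != 1:
        raise ValueError(f"specify exactly one of ranges, mask or indices, got: {given}")
    return given[0]


def ranges_to_indices(ranges):
    """Expand half-open (start, end) ranges into a flat list of 1D grid indices."""
    indices = []
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f"invalid range: ({start}, {end})")
        # The end index is excluded, as with Python's built-in range().
        indices.extend(range(start, end))
    return indices


ranges = [(0, 10), (20, 30)]
selected = ranges_to_indices(ranges)
print(check_exactly_one(ranges=ranges))
print(len(selected), selected[:3], selected[-3:])
```

With the ranges used in the example above, 20 grid points are selected: indices 0 to 9 and 20 to 29. Index 10 and index 30 are not included.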
Further examples:
mars
- from_source("mars", *args, request=None, prompt=True, log="default", **kwargs)
The mars source retrieves data from the ECMWF MARS (Meteorological Archival and Retrieval System) archive.
- Parameters:
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries.
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
prompt (bool) – if True, it can offer a prompt to specify the credentials for the web API and write them into the default RC file ~/.ecmwfapirc. The prompt only appears when:
log (str, None, callable, dict) –
control the logging of the retrieval. The behaviour depends on the underlying MARS client used:
web API based access:
“default”: the built-in logging of the web API is used (the log is written to stdout)
None: turn off logging
callable: the log is written to the specified callable. The callable should accept a single argument, a string with the log message.
    import earthkit.data as ekd


    def my_logging_function(msg):
        print("message=", msg)


    request = {...}
    ds = ekd.from_source("mars", request, log=my_logging_function)
direct MARS access:
“default”: the log is written to stdout
None: turn off logging
dict specifying the “stdout” and/or the “stderr” kwargs for Python’s subprocess.run() method
**kwargs (dict) –
other keyword arguments specifying the request
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
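The request-building rules above can be sketched in plain Python. This is an illustration only, not the earthkit-data implementation: build_requests and split_request are hypothetical names, and several details are assumptions (dictionaries from request are listed before those from *args, **kwargs values take precedence when merged, and split_on is a key name or a list of key names producing one request per combination of values).

```python
from itertools import product


def _flatten(item):
    """Collect request dicts from a dict, or a (nested) list/tuple of dicts."""
    if item is None:
        return []
    if isinstance(item, dict):
        return [dict(item)]
    if isinstance(item, (list, tuple)):
        result = []
        for sub_item in item:
            result.extend(_flatten(sub_item))
        return result
    raise TypeError(f"unsupported request type: {type(item).__name__}")


def split_request(req):
    """Split one request on the keys named in its split_on entry (assumed semantics)."""
    req = dict(req)
    keys = req.pop("split_on", None)
    if keys is None:
        return [req]
    if isinstance(keys, str):
        keys = [keys]
    values = [v if isinstance(v, (list, tuple)) else [v] for v in (req[k] for k in keys)]
    # One request per combination of the values of the split keys.
    return [{**req, **dict(zip(keys, combo))} for combo in product(*values)]


def build_requests(*args, request=None, **kwargs):
    """Combine request, *args and **kwargs into a flat list of request dicts."""
    requests = _flatten(request) + _flatten(args)
    if not requests and kwargs:
        # Only keyword arguments were given: they form a single request.
        requests = [{}]
    merged = [{**r, **kwargs} for r in requests]
    result = []
    for r in merged:
        result.extend(split_request(r))
    return result


for r in build_requests(
    request={"param": ["2t", "msl"], "levtype": "sfc", "split_on": "param"},
    date="20230510",
):
    print(r)
```

In this sketch the single input request becomes two requests, one per value of param, each carrying the date keyword argument.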
The requests can contain GRIB post-processing options such as grid and area for regridding and sub-area extraction, respectively. They can also contain the earthkit-data specific split_on parameter.
To figure out which data you need, or discover relevant data available in MARS, see the publicly accessible MARS catalog (or this access restricted catalog).
If the use-standalone-mars-client-when-available config option is True and the MARS client is installed (e.g. at ECMWF), the MARS access is direct. In this case the MARS client command can be specified via the MARS_CLIENT_EXECUTABLE environment variable. When it is not set, the "/usr/local/bin/mars" path will be used.
If the standalone MARS client is not available or not enabled, the web API will be used. In order to use the web API you will need to register and retrieve an access token. For more extensive documentation about MARS, please refer to the MARS user documentation.
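The choice between direct access and the web API can be pictured with a small sketch. It is not the earthkit-data implementation: mars_client_path and choose_access are hypothetical names, and treating "the MARS client is installed" as "the executable path exists" is an assumption made for illustration.

```python
import os

DEFAULT_MARS_CLIENT = "/usr/local/bin/mars"


def mars_client_path(environ=None):
    """Return the client command: MARS_CLIENT_EXECUTABLE if set, else the default path."""
    environ = os.environ if environ is None else environ
    return environ.get("MARS_CLIENT_EXECUTABLE", DEFAULT_MARS_CLIENT)


def choose_access(use_standalone, environ=None, exists=os.path.exists):
    """Return ("direct", path) or ("webapi", None).

    use_standalone stands for the value of the
    use-standalone-mars-client-when-available config option.
    The exists callable is an assumed way of detecting an installed client.
    """
    path = mars_client_path(environ)
    if use_standalone and exists(path):
        return "direct", path
    return "webapi", None


print(choose_access(use_standalone=True))
```

The environ and exists arguments are only there so that the decision can be exercised without changing the real environment.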
The following example retrieves analysis GRIB data for a subarea for 2 surface parameters:
    import earthkit.data as ekd

    ds = ekd.from_source(
        "mars",
        request={
            "param": ["2t", "msl"],
            "levtype": "sfc",
            "area": [50, -50, 20, 50],
            "grid": [2, 2],
            "date": "2023-05-10",
        },
    )
Data downloaded from MARS is stored in the cache.
Further examples:
opendap
- from_source("opendap", url)
The opendap source accesses NetCDF data from OPeNDAP services. OPeNDAP is an acronym for “Open-source Project for a Network Data Access Protocol”.
- Parameters:
url (str) – the URL of the remote NetCDF file
Examples:
polytope
- from_source("polytope", collection, *args, address=None, user_email=None, user_key=None, request=None, stream=True, read_all=False, **kwargs)
The polytope source accesses the Polytope web services, using the polytope-client package.
- Parameters:
collection (str) – the name of the polytope collection
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries.
address (str) – specify the address of the polytope service
user_email (str) – specify the user email credential. Must be used together with user_key. This is an alternative to using the POLYTOPE_USER_EMAIL environment variable. New in version 0.7.0
user_key (str) – specify the user key credential. Must be used together with user_email. This is an alternative to using the POLYTOPE_USER_KEY environment variable. New in version 0.7.0
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJson data. See details about streams here.
read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True. New in version 0.8.0
**kwargs (dict) –
other keyword arguments, these can include options passed to the polytope-client
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
Please note that the preferred way to specify requests is via the request parameter, as it improves code readability.
The following example retrieves GRIB data from the “ecmwf-mars” polytope collection:
    import earthkit.data as ekd

    request = {
        "stream": "oper",
        "levtype": "pl",
        "levellist": "1",
        "param": "130.128",
        "step": "0/12",
        "time": "00:00:00",
        "date": "20200915",
        "type": "fc",
        "class": "rd",
        "expver": "hsvs",
        "domain": "g",
    }

    ds = ekd.from_source("polytope", "ecmwf-mars", request=request, stream=False)
Data downloaded from the polytope service is stored in the cache.
To access data from polytope, you will need to register and retrieve an access token.
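The relationship between the user_email and user_key arguments and the POLYTOPE_USER_EMAIL and POLYTOPE_USER_KEY environment variables can be sketched as follows. This is an assumption-laden illustration, not the code used by earthkit-data or polytope-client: resolve_polytope_credentials is a hypothetical helper, and giving explicit arguments precedence over the environment is an assumption.

```python
import os


def resolve_polytope_credentials(user_email=None, user_key=None, environ=None):
    """Return a dict with the credentials to use, or None if none were found."""
    environ = os.environ if environ is None else environ

    # The two arguments must be used together.
    if (user_email is None) != (user_key is None):
        raise ValueError("user_email and user_key must be used together")

    if user_email is not None:
        # Assumption: explicitly passed credentials win over the environment.
        return {"user_email": user_email, "user_key": user_key}

    email = environ.get("POLYTOPE_USER_EMAIL")
    key = environ.get("POLYTOPE_USER_KEY")
    if email and key:
        return {"user_email": email, "user_key": key}

    # Nothing found here; other mechanisms of the client would apply.
    return None


print(resolve_polytope_credentials(environ={}))
```

Passing only one of the two arguments raises an error in this sketch, which mirrors the "must be used together" requirement in the parameter descriptions.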
Further examples:
s3
- from_source("s3", *args, anon=True, aws_access_key=None, aws_secret_access_key=None, aws_token=None, stream=False, read_all=False)
New in version 0.11.0
The s3 source provides access to Amazon S3 buckets.
- Parameters:
*args (tuple) –
positional arguments specifying the request(s). Each request is represented by a dict. See detailed description below. A sequence of dicts can also be used to specify multiple requests.
anon (bool) – if True, use anonymous access; this will only work for public buckets. If False, use the aws_access_key, aws_secret_access_key and aws_token credentials. These can also be specified as part of the request (request values override the kwargs). If no credentials are provided, botocore is used to load the AWS credentials from:
a configuration file. Note that this does not include s3cmd configuration files (e.g. “.s3cfg”).
aws_access_key (str) – the AWS access key. Can be overridden in a request. Used when anon=False.
aws_secret_access_key (str) – the AWS secret access key. Can be overridden in a request. Used when anon=False.
aws_token (str) – the AWS token, only used for AWS Security Token Service (AWS STS) temporary credentials. Can be overridden in a request. Used when anon=False.
stream (bool) – if True, the data is read as a stream. Otherwise it is retrieved into a file and stored in the cache. Stream-based access only works for GRIB and CoverageJson data. See details about streams here.
read_all (bool) – if True, all the data is read into memory from a stream. Used when stream=True.
A request is a dictionary describing one or more objects in a given bucket. It has the following format:
    {
        "endpoint": endpoint,  # optional
        "region": region,  # optional
        "bucket": bucket,
        "objects": objects,
        "aws_access_key": aws_access_key,  # optional
        "aws_secret_access_key": aws_secret_access_key,  # optional
        "aws_token": aws_token,  # optional
    }
where:
“endpoint”: specifies the S3 endpoint (optional). Defaults to "s3.amazonaws.com"
“region”: specifies the AWS region (optional). Defaults to "eu-west-2"
“bucket”: specifies the bucket name
“objects”: specifies the object in the bucket. A list/tuple of objects can be provided.
“aws_access_key”: the AWS access key (optional). It overrides the aws_access_key keyword argument. Only used when anon=False.
“aws_secret_access_key”: the AWS secret access key (optional). It overrides the aws_secret_access_key keyword argument. Only used when anon=False.
“aws_token”: the AWS token (optional). It overrides the aws_token keyword argument. Only used when anon=False.
An object can be:
the name of the object as a str
a dict in the following format:
{"object": name, "parts": parts}
where the optional “parts” can specify the parts (byte ranges) to read.
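The request format described above accepts several shapes for “objects” and lets request values override the credential keyword arguments. The sketch below shows one way to normalise these in plain Python. It is not the earthkit-data implementation; normalise_objects and resolve_credentials are hypothetical helpers written only to make the rules concrete.

```python
CREDENTIAL_KEYS = ("aws_access_key", "aws_secret_access_key", "aws_token")


def normalise_objects(objects):
    """Turn a str, a dict or a list/tuple of these into a list of object dicts."""
    if isinstance(objects, (str, dict)):
        objects = [objects]
    result = []
    for obj in objects:
        if isinstance(obj, str):
            # A bare name means the whole object is read.
            result.append({"object": obj, "parts": None})
        elif isinstance(obj, dict):
            result.append({"object": obj["object"], "parts": obj.get("parts")})
        else:
            raise TypeError(f"unsupported object specification: {obj!r}")
    return result


def resolve_credentials(request, anon=True, **kwargs):
    """Pick credentials for one request; request values override the kwargs."""
    if anon:
        # Anonymous access does not use any credentials.
        return {}
    credentials = {}
    for key in CREDENTIAL_KEYS:
        value = request.get(key)
        if value is None:
            value = kwargs.get(key)
        if value is not None:
            credentials[key] = value
    return credentials


request = {
    "bucket": "earthkit-test-data-public",
    "objects": ["test6.grib", {"object": "tuv_pl.grib", "parts": (2400, 240)}],
}
print(normalise_objects(request["objects"]))
print(resolve_credentials(request, anon=True))
```

When the returned credentials are empty and anon=False, the documented fallback to botocore would apply; that step is outside the scope of this sketch.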
The following examples retrieve GRIB data from a publicly available bucket on the European Weather Cloud (EWC).
    >>> import earthkit.data as ekd
    >>> req = {
    ...     "endpoint": "object-store.os-api.cci1.ecmwf.int",
    ...     "bucket": "earthkit-test-data-public",
    ...     "objects": "test6.grib",
    ... }
    >>> ds = ekd.from_source("s3", req, anon=True)
    >>> ds.ls()
       centre  shortName    typeOfLevel  level  dataDate  dataTime  stepRange  dataType  number    gridType
    0    ecmf          t  isobaricInhPa   1000  20180801      1200          0        an       0  regular_ll
    1    ecmf          u  isobaricInhPa   1000  20180801      1200          0        an       0  regular_ll
    2    ecmf          v  isobaricInhPa   1000  20180801      1200          0        an       0  regular_ll
    3    ecmf          t  isobaricInhPa    850  20180801      1200          0        an       0  regular_ll
    4    ecmf          u  isobaricInhPa    850  20180801      1200          0        an       0  regular_ll
    5    ecmf          v  isobaricInhPa    850  20180801      1200          0        an       0  regular_ll
    >>> req = {
    ...     "endpoint": "object-store.os-api.cci1.ecmwf.int",
    ...     "bucket": "earthkit-test-data-public",
    ...     "objects": [
    ...         {"object": "test6.grib", "parts": (0, 240)},
    ...         {"object": "tuv_pl.grib", "parts": (2400, 240)},
    ...     ],
    ... }
    >>>
    >>> ds = ekd.from_source("s3", req, anon=True)
    >>> ds.ls()
       centre  shortName    typeOfLevel  level  dataDate  dataTime  stepRange  dataType  number    gridType
    0    ecmf          t  isobaricInhPa   1000  20180801      1200          0        an       0  regular_ll
    1    ecmf          u  isobaricInhPa    500  20180801      1200          0        an       0  regular_ll
Further examples:
wekeo
- from_source("wekeo", dataset, *args, request=None, prompt=True, **kwargs)
WEkEO is the Copernicus DIAS reference service for environmental data and virtual processing environments. The wekeo source provides access to WEkEO using the WEkEO grammar. The retrieval is based on the hda Python API.
- Parameters:
dataset (str) – the name of the WEkEO dataset
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries.
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
prompt (bool) – if True, it can offer a prompt to specify the credentials for hda and write them into the default RC file ~/.hdarc. The prompt only appears when:
**kwargs (dict) –
other keyword arguments specifying the request
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
The following example retrieves Normalized Difference Vegetation Index data derived from EO satellite imagery in NetCDF format:
    import earthkit.data as ekd

    ds = ekd.from_source(
        "wekeo",
        "EO:CLMS:DAT:CLMS_GLOBAL_BA_300M_V3_MONTHLY_NETCDF",
        request={
            "dataset_id": "EO:CLMS:DAT:CLMS_GLOBAL_BA_300M_V3_MONTHLY_NETCDF",
            "startdate": "2019-01-01T00:00:00.000Z",
            "enddate": "2019-01-01T23:59:59.999Z",
        },
    )
Data downloaded from WEkEO is stored in the cache.
To access data from WEkEO, you will need to register and set up the Harmonized Data Access (HDA) API client. The process is described here.
Further examples:
wekeocds
- from_source("wekeocds", dataset, *args, request=None, prompt=True, **kwargs)
WEkEO is the Copernicus DIAS reference service for environmental data and virtual processing environments. The wekeocds source provides access to Copernicus Climate Data Store (CDS) datasets served on WEkEO using the cdsapi grammar. The retrieval is based on the hda Python API.
- Parameters:
dataset (str) – the name of the WEkEO dataset
*args (tuple) – positional arguments representing request dictionaries. Each item can be a dictionary or a list/tuple of dictionaries.
request (dict, list/tuple of dicts, None) – specify the request as a dictionary. A list/tuple of dicts can be used to specify multiple requests. New in version 0.18.0
prompt (bool) – if True, it can offer a prompt to specify the credentials for hda and write them into the default RC file ~/.hdarc. The prompt only appears when:
**kwargs (dict) –
other keyword arguments specifying the request
The following logic is applied to build the requests:
All individual dictionaries found in request and *args are used as separate requests.
If **kwargs are provided, they are merged into each request dictionary. If only **kwargs are provided (no request or *args specified), they form a single request.
If a request contains the split_on key, the request is split into multiple requests based on the specified keys and their values.
The following example retrieves ERA5 monthly averaged surface data for two parameters in GRIB format:
    import earthkit.data as ekd

    ds = ekd.from_source(
        "wekeocds",
        "EO:ECMWF:DAT:REANALYSIS_ERA5_SINGLE_LEVELS_MONTHLY_MEANS_MONTHLY_MEANS",
        request=dict(
            variable=["2m_temperature", "mean_sea_level_pressure"],
            product_type=["monthly_averaged_reanalysis_by_hour_of_day"],
            year=["2012"],
            month=["12"],
            time=["11:00"],
            data_format="grib",
            download_format="zip",
        ),
    )
Data downloaded from WEkEO is stored in the cache.
To access data from WEkEO, you will need to register and set up the Harmonized Data Access (HDA) API client. The process is described here.
Further examples:
zarr
- from_source("zarr", path)
New in version 0.15.0
The zarr source accesses data from a Zarr store. Internally the data is loaded via the xarray.open_zarr() method, so only Zarr data supported by Xarray can be accessed. Requires zarr version 3 or later (zarr >= 3).
- Parameters:
path (str) – path or URL to the Zarr store
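Since the source requires zarr >= 3, it can be useful to check the installed version before loading a store. The following is a minimal sketch using only the standard library; zarr_major_version and zarr_requirement_met are hypothetical helper names, and the simple parsing of the major version from the first dot-separated component is an assumption that holds for version strings such as "3.0.1".

```python
from importlib import metadata


def zarr_major_version(version=None):
    """Return the major version number of zarr.

    If version is None, the installed distribution is queried, which raises
    importlib.metadata.PackageNotFoundError when zarr is not installed.
    """
    if version is None:
        version = metadata.version("zarr")
    return int(version.split(".")[0])


def zarr_requirement_met(version=None):
    """Return True if the given (or installed) zarr version is 3 or later."""
    return zarr_major_version(version) >= 3


# Explicit version strings are used here so the sketch runs without zarr installed.
print(zarr_requirement_met("3.0.1"))
print(zarr_requirement_met("2.18.0"))
```

Calling zarr_requirement_met() without arguments checks the version installed in the current environment.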