Caching

Purpose

earthkit-data uses a dedicated directory to store the results of remote data access and some GRIB/BUFR indexing information. By default this directory is unmanaged (its size is not checked/limited) and no caching is provided for the files in it, i.e. repeated calls to from_source() for remote services and URLs will download the data again!

When caching is enabled this directory also serves as a cache. This means that if we run from_source() again with the same arguments, the data is loaded from the cache instead of being downloaded again. Additionally, caching offers monitoring and disk space management. When the cache is full, cached data is deleted according to the configuration (i.e. oldest data is deleted first). The cache is implemented using an SQLite database running in a separate thread.
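The oldest-first eviction described above can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea, not the earthkit-data implementation: the table layout and the helper names (make_index, add_entry, total_size, evict_oldest) are hypothetical.

```python
import sqlite3


def make_index():
    """Create an in-memory index of cached files (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (path TEXT PRIMARY KEY, size INTEGER, created REAL)"
    )
    return conn


def add_entry(conn, path, size, created):
    """Register a cached file with its size in bytes and creation time."""
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)", (path, size, created)
    )


def total_size(conn):
    """Return the total number of bytes recorded in the index."""
    return conn.execute("SELECT COALESCE(SUM(size), 0) FROM entries").fetchone()[0]


def evict_oldest(conn, max_size):
    """Delete entries, oldest first, until the total size is <= max_size."""
    evicted = []
    while total_size(conn) > max_size:
        row = conn.execute(
            "SELECT path FROM entries ORDER BY created ASC LIMIT 1"
        ).fetchone()
        if row is None:
            break
        conn.execute("DELETE FROM entries WHERE path = ?", (row[0],))
        evicted.append(row[0])
    return evicted


conn = make_index()
add_entry(conn, "a.grib", 400, 1.0)  # oldest entry
add_entry(conn, "b.grib", 300, 2.0)
add_entry(conn, "c.grib", 500, 3.0)
evicted = evict_oldest(conn, max_size=800)
print(evicted, total_size(conn))
```

A real cache would also remove the files from disk and guard the database against concurrent access; the sketch only tracks the bookkeeping.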

Please note that the earthkit-data cache configuration is managed through the Configuration.

Warning

By default, caching is disabled, i.e. the cache-policy is off.

Warning

The earthkit-data cache is intended to be used by a single user. Sharing the cache among multiple users is not recommended. Downloading a local copy of data to a shared disk so that multiple users can work with it is a different use case and should be supported through mirrors.

Cache policies

The primary config option to control the cache is cache-policy, which can take the following values:

  • off (default)

  • temporary

  • user

The cache location can be read and modified with Python (see the details below).

Tip

See the Cache policies notebook for examples.

Note

It is recommended to restart your Jupyter kernels after changing the cache policy or location.

Off cache policy

When the cache-policy is “off” no caching is available. This is the default value. In this case all files are downloaded into an unmanaged temporary directory created by tempfile.TemporaryDirectory. Since caching is disabled, all repeated calls to from_source() for remote services and URLs will download the data again! This temporary directory is unique for each earthkit-data session. When the directory object goes out of scope (at the latest on exit) the directory is cleaned up.

Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead, we need to call the directory() cache method.

>>> from earthkit.data import cache, config
>>> config.set("cache-policy", "off")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
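The lifecycle of such a directory can be illustrated with the standard library alone. The sketch below does not use earthkit-data; it only shows how tempfile.TemporaryDirectory behaves, and the file name download.grib is made up for the example.

```python
import os
import tempfile

# Each TemporaryDirectory object gets its own unique path.
first = tempfile.TemporaryDirectory()
second = tempfile.TemporaryDirectory()
paths_differ = first.name != second.name

# Files written inside exist only as long as the directory does.
target = os.path.join(first.name, "download.grib")
with open(target, "wb") as f:
    f.write(b"GRIB")
existed_before = os.path.exists(target)

# Explicit cleanup removes the directory and everything in it. The same
# happens implicitly when the object is garbage collected or at exit.
first.cleanup()
second.cleanup()
exists_after = os.path.exists(first.name)

print(paths_differ, existed_before, exists_after)
```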

We can specify the parent directory for the temporary directory by using the temporary-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.data import cache, config
>>> s = {
...     "cache-policy": "off",
...     "temporary-directory-root": "~/my_demo_tmp",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_tmp/tmp0iiuvsz5'

Temporary cache policy

When the cache-policy is “temporary” the cache will be active and located in a managed temporary directory created by tempfile.TemporaryDirectory. This directory will be unique for each earthkit-data session. When the directory object goes out of scope (at the latest on exit) the cache is cleaned up.

Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead, we need to call the directory() cache method.

>>> from earthkit.data import cache, config
>>> config.set("cache-policy", "temporary")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'

We can specify the parent directory for the temporary cache by using the temporary-cache-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.data import cache, config
>>> s = {
...     "cache-policy": "temporary",
...     "temporary-cache-directory-root": "~/my_demo_cache",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_cache/tmp0iiuvsz5'
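The effect of a parent directory can be reproduced with the dir argument of tempfile.TemporaryDirectory. The sketch below is a standard-library illustration under the assumption that the root option is passed on in a similar way. It does not call earthkit-data, and the root used here is a throwaway directory standing in for a path such as ~/my_demo_cache.

```python
import os
import tempfile

# Stand-in for a user-chosen root directory.
root = tempfile.mkdtemp()

# The temporary directory is created directly under the given root.
with tempfile.TemporaryDirectory(dir=root) as cache_dir:
    parent = os.path.dirname(cache_dir)
    inside = os.path.exists(cache_dir)

# Leaving the block removes the temporary directory but not its root.
gone = not os.path.exists(cache_dir)
root_kept = os.path.isdir(root)
os.rmdir(root)

print(parent == root, inside, gone, root_kept)
```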

User cache policy

When the cache-policy is “user” the cache will be active and created in a managed directory defined by the user-cache-directory config option.

The user cache directory is not cleaned up on exit, so the next time you start earthkit-data it will still be there unless it has been deleted manually or it is set in a way that a different path is assigned to it on each startup. Also, when you run multiple sessions of earthkit-data under the same user they will share the same cache.

The default value of the user cache directory depends on your system:

  • /tmp/earthkit-data-$USER for Linux,

  • C:\Users\$USER\AppData\Local\Temp\earthkit-data-$USER for Windows

  • /tmp/.../earthkit-data-$USER for macOS
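The pattern in the list above (system temporary directory plus earthkit-data-$USER) can be sketched as follows. This assumes the default is derived from the system temporary directory; the helper default_user_cache_dir is hypothetical and is not part of earthkit-data.

```python
import getpass
import os
import tempfile


def default_user_cache_dir(user=None):
    """Return <system temp dir>/earthkit-data-<user> (illustrative only)."""
    if user is None:
        user = getpass.getuser()
    return os.path.join(tempfile.gettempdir(), f"earthkit-data-{user}")


path = default_user_cache_dir("myusername")
print(path)
```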

We can query the directory path via the Configuration and also by calling the directory() cache method.

>>> from earthkit.data import cache, config
>>> config.set("cache-policy", "user")
>>> config.get("user-cache-directory")
'/tmp/earthkit-data-myusername'
>>> cache.directory()
'/tmp/earthkit-data-myusername'

The following code shows how to change the user-cache-directory config option:

>>> from earthkit.data import config
>>> config.get("user-cache-directory")  # Find the current cache directory
'/tmp/earthkit-data-myusername'
>>> # Change the value of the config option
>>> config.set("user-cache-directory", "/big-disk/earthkit-data-cache")

# Python kernel restarted

>>> from earthkit.data import config
>>> config.get("user-cache-directory")  # Cache directory has been modified
'/big-disk/earthkit-data-cache'

More generally, the earthkit-data config options can be read, modified and reset to their default values from Python; see the Configs documentation.

Cache methods

The cache is controlled by a global object, which we can access as earthkit.data.cache.

>>> from earthkit.data import cache
>>> cache
<earthkit.data.core.caching.Cache object at 0x117be7040>

When cache-policy is user or temporary, a set of methods is available on this object to manage and interact with the cache.

Methods/properties of the cache object

Method/property            Description
-------------------------  ---------------------------------------------------
policy                     Get the current cache policy object.
directory()                Return the path to the current cache directory.
size()                     Return the total number of bytes stored in the cache.
check_size()               Check the cache size and trim it down when needed.
entries()                  Dump the entries stored in the cache.
summary_dump_database()    Return the number of items and total size of the cache.
purge()                    Delete entries from the cache.

Warning

check_size() runs automatically when a new entry is added to the cache or when any of the cache config parameters change.

Examples:

>>> from earthkit.data import cache
>>> cache.policy.name
'user'
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/earthkit-data-myusername'
>>> cache.size()
846785699
>>> cache.summary_dump_database()
(40, 846785699)
>>> d = cache.entries()
>>> len(d)
40
>>> d[0].get("creation_date")
'2023-10-30 14:48:31.320322'
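The creation_date strings shown above can be parsed with the standard library, for example to list entries oldest first. The sketch below works on stand-in data and assumes that every entry is a dict carrying a creation_date in the format shown; the helper oldest_first is hypothetical.

```python
from datetime import datetime


def oldest_first(entries):
    """Sort entry dicts by their "creation_date" string, oldest first."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    return sorted(entries, key=lambda e: datetime.strptime(e["creation_date"], fmt))


# Stand-in data shaped like the output above. With earthkit-data installed
# this list could be replaced by the result of cache.entries().
sample = [
    {"creation_date": "2023-10-30 14:48:31.320322"},
    {"creation_date": "2023-09-12 08:05:10.000001"},
    {"creation_date": "2023-11-02 19:30:00.500000"},
]
ordered = oldest_first(sample)
print(ordered[0]["creation_date"])
```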

Cache limits

Warning

These config options do not work when cache-policy is off.

Maximum-cache-size

The maximum-cache-size config option ensures that earthkit-data does not use too much disk space. Its value sets the maximum disk space used by the earthkit-data cache. When the cache disk usage goes above this limit, earthkit-data triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-size is absolute (such as “10G”, “10M”, “1K”). To disable it use None.
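To make the absolute size notation concrete, the sketch below converts values such as “10G” into a number of bytes. The helper parse_size is hypothetical and not part of earthkit-data, and it assumes binary multiples (1K = 1024 bytes); the library may interpret the suffixes differently.

```python
def parse_size(value):
    """Convert "10G", "10M", "1K" or a plain byte count to an int.

    Returns None for None, meaning the limit is disabled.
    Binary multiples (1K = 1024 bytes) are an assumption of this sketch.
    """
    if value is None:
        return None
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    text = str(value).strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


print(parse_size("1K"), parse_size("10M"), parse_size(None))
```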

Maximum-cache-disk-usage

The maximum-cache-disk-usage setting ensures that earthkit-data does not fill your disk. It specifies the maximum disk usage (as a percentage) on the filesystem containing the cache directory. When the total disk usage (so this is not the cache usage alone) goes above this limit, earthkit-data triggers its cache cleaning mechanism to free up space before downloading additional data. The value of maximum-cache-disk-usage is relative (such as “90%” or “100%”). To disable it use None.

Warning

If your disk is filled by another application, earthkit-data will happily delete its cached data to make room for the other application as soon as it has a chance.
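The difference between cache usage and total disk usage can be seen with shutil.disk_usage, which reports figures for the whole filesystem containing a path. The sketch below is an illustration only: the helpers disk_usage_percent and over_limit are hypothetical, and earthkit-data may compute the percentage differently.

```python
import shutil
import tempfile


def disk_usage_percent(path):
    """Return the used share (0-100) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_limit(path, limit="95%"):
    """Return True when the filesystem usage exceeds a limit such as "90%"."""
    if limit is None:  # None disables the check
        return False
    return disk_usage_percent(path) > float(limit.rstrip("%"))


cache_dir = tempfile.gettempdir()  # stand-in for the cache directory
print(over_limit(cache_dir, None), over_limit(cache_dir, "100%"))
```

Because the figure covers the whole filesystem, files written by other applications count towards the limit as well, which is what the warning above refers to.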

Cache config parameters

Name                              Default                      Description
--------------------------------  ---------------------------  ----------------------------------------------------------------------
cache-policy                      'off'                        Caching policy. Valid values: off, temporary and user. See Caching
                                                               for more information.
grib-handle-cache-size            1                            Maximum number of GRIB handles cached in memory per fieldlist with
                                                               data on disk. Used when grib-handle-policy is cache. See GRIB field
                                                               memory management for more information.
maximum-cache-disk-usage          '95%'                        Specify maximum disk usage as a percentage of the full disk capacity
                                                               on the filesystem the cache is located (e.g.: 90%). When the total
                                                               disk usage exceeds this limit (it’s not limited to the cache usage
                                                               alone), earthkit-data evicts older cached entries until the usage is
                                                               below the specified limit. Can be set to None. Ignored when
                                                               cache-policy is off. See Caching for more information.
maximum-cache-size                None                         Maximum disk space used by the earthkit-data cache (e.g.: 100G or
                                                               2T). When exceeded, earthkit-data evicts older cached entries until
                                                               the usage is below the specified limit. Can be set to None. Ignored
                                                               when cache-policy is off. See Caching for more information.
temporary-cache-directory-root    None                         Parent of the cache directory when cache-policy is temporary. See
                                                               Caching for more information.
use-grib-metadata-cache           True                         Use in-memory cache kept in each field for GRIB metadata access in
                                                               fieldlists with data on disk. See GRIB field memory management for
                                                               more information.
use-message-position-index-cache  False                        Stores message offset index for GRIB/BUFR files in the cache.
user-cache-directory              'TMP/earthkit-data-${USER}'  Cache directory used when cache-policy is user. See Caching for more
                                                               information.

Other earthkit-data config options can be found in the Configs documentation.