{ "cells": [ { "cell_type": "markdown", "id": "b52638c2-dcb9-4d81-b527-6a6b03298d89", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Retrieving subsets from GRIB files via GribJump" ] }, { "cell_type": "raw", "id": "32d4fcf7-8f81-4782-89a9-2aae2f4cb2c5", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "This example demonstrates how the experimental :ref:`data-sources-gribjump` source allows efficient retrieval of individual grid cells from GRIB messages stored in an FDB. The source is a thin wrapper around the Python bindings of `GribJump `_." ] }, { "cell_type": "code", "execution_count": 1, "id": "06c4aefb", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import earthkit.data" ] }, { "cell_type": "markdown", "id": "0e7e19c7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "GribJump can retrieve ranges of grid cells from GRIB files in an FDB that were\n", "previously indexed by GribJump (e.g. using `gribjump-scan`). To use the\n", "`gribjump` source in earthkit-data, the environment must point to an FDB and\n", "set the GribJump-specific environment variables.\n", "\n", "⚠️ Please be aware that this source currently does not validate\n", "that the grid indices specified by the user actually correspond to the fields'\n", "underlying grids. Make sure that every field matched by the specified\n", "FDB request is defined on the grid you expect. Because of this, we also need to\n", "tell GribJump to ignore any missing grid validation information via the\n", "`GRIBJUMP_IGNORE_GRID` environment variable."
] }, { "cell_type": "code", "execution_count": 2, "id": "ffc76940", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.environ.setdefault(\"FDB_HOME\", \"\")\n", "os.environ.setdefault(\"FDB5_CONFIG_FILE\", \"\")\n", "os.environ.setdefault(\"GRIBJUMP_CONFIG_FILE\", \"\")\n", "os.environ.setdefault(\"GRIBJUMP_IGNORE_GRID\", \"1\")" ] }, { "cell_type": "markdown", "id": "44141c94-08e3-4c8d-b46c-fefd7a1b3d8b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "#### How To Use" ] }, { "cell_type": "raw", "id": "f2c9afde-4eab-407d-9e24-3527edf6a25b", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The :ref:`data-sources-gribjump` source works similarly to the :ref:`data-sources-fdb` source and receives a dictionary with an FDB request.\n", "Please note that the MARS syntax for ranges and lists using \"/\" is not supported. Only scalar values and\n", "Python lists are supported.\n", "\n", "The second required parameter is one of `ranges`, `indices`, or `mask`, selecting the grid cells which should\n", "be extracted. For convenience, one can set the additional parameter `fetch_coords_from_fdb=True` to make an extra\n", "request directly to the FDB that retrieves the latitude and longitude of the extracted cells and includes\n", "them in the metadata of the retrieved fields."
] }, { "cell_type": "code", "execution_count": null, "id": "cd0c1962", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "source = earthkit.data.from_source(\n", " \"gribjump\",\n", " {\n", " \"class\": \"ce\",\n", " \"expver\": \"0001\",\n", " \"stream\": \"efcl\",\n", " \"date\": \"20230101\",\n", " \"model\": \"lisflood\",\n", " \"domain\": \"g\",\n", " \"origin\": \"ecmf\",\n", " \"step\": 6,\n", " \"type\": \"sfo\",\n", " \"levtype\": \"sfc\",\n", " \"param\": \"240023\",\n", " \"time\": [\"0000\", \"0600\"],\n", " \"hdate\": [\"20200101\", \"20200102\"],\n", " },\n", " ranges=[(1234, 2345)],\n", " fetch_coords_from_fdb=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "eb808136", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gribjump Engine: Built file map: 0.022177 second elapsed, 0.011457 second cpu\n", "Starting 8 threads\n", "Gribjump Progress: 1 of 1 tasks complete\n", "Gribjump Engine: All tasks finished: 0.334884 second elapsed, 0.162512 second cpu\n", "Gribjump Engine: Repackaged results: 8e-06 second elapsed, 7e-06 second cpu\n", "Engine::extract: 1.7e-05 second elapsed, 1.5e-05 second cpu\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>param</th><th>level</th><th>base_datetime</th><th>valid_datetime</th><th>step</th><th>number</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>240023</td><td>None</td><td>2020-01-01T00:00:00</td><td>2020-01-01T06:00:00</td><td>6</td><td>None</td></tr>\n",
"<tr><th>1</th><td>240023</td><td>None</td><td>2020-01-01T06:00:00</td><td>2020-01-01T12:00:00</td><td>6</td><td>None</td></tr>\n",
"<tr><th>2</th><td>240023</td><td>None</td><td>2020-01-02T00:00:00</td><td>2020-01-02T06:00:00</td><td>6</td><td>None</td></tr>\n",
"<tr><th>3</th><td>240023</td><td>None</td><td>2020-01-02T06:00:00</td><td>2020-01-02T12:00:00</td><td>6</td><td>None</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " param level base_datetime valid_datetime step number\n", "0 240023 None 2020-01-01T00:00:00 2020-01-01T06:00:00 6 None\n", "1 240023 None 2020-01-01T06:00:00 2020-01-01T12:00:00 6 None\n", "2 240023 None 2020-01-02T00:00:00 2020-01-02T06:00:00 6 None\n", "3 240023 None 2020-01-02T06:00:00 2020-01-02T12:00:00 6 None" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "source.ls()" ] }, { "cell_type": "code", "execution_count": 5, "id": "7eff5b19", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>&lt;xarray.Dataset&gt; Size: 62kB\n",
       "Dimensions:                  (forecast_reference_time: 4, index: 1111)\n",
       "Coordinates:\n",
       "  * forecast_reference_time  (forecast_reference_time) datetime64[ns] 32B 202...\n",
       "    latitude                 (index) float64 9kB ...\n",
       "    longitude                (index) float64 9kB ...\n",
       "  * index                    (index) int64 9kB 1234 1235 1236 ... 2342 2343 2344\n",
       "Data variables:\n",
       "    240023                   (forecast_reference_time, index) float64 36kB ...\n",
       "Attributes: (12/13)\n",
       "    param:        240023\n",
       "    class:        ce\n",
       "    stream:       efcl\n",
       "    levtype:      sfc\n",
       "    type:         sfo\n",
       "    expver:       0001\n",
       "    ...           ...\n",
       "    hdate:        20200101\n",
       "    time:         0000\n",
       "    origin:       ecmf\n",
       "    domain:       g\n",
       "    Conventions:  CF-1.8\n",
       "    institution:  ECMWF</pre>
" ], "text/plain": [ "<xarray.Dataset> Size: 62kB\n", "Dimensions: (forecast_reference_time: 4, index: 1111)\n", "Coordinates:\n", " * forecast_reference_time (forecast_reference_time) datetime64[ns] 32B 202...\n", " latitude (index) float64 9kB ...\n", " longitude (index) float64 9kB ...\n", " * index (index) int64 9kB 1234 1235 1236 ... 2342 2343 2344\n", "Data variables:\n", " 240023 (forecast_reference_time, index) float64 36kB ...\n", "Attributes: (12/13)\n", " param: 240023\n", " class: ce\n", " stream: efcl\n", " levtype: sfc\n", " type: sfo\n", " expver: 0001\n", " ... ...\n", " hdate: 20200101\n", " time: 0000\n", " origin: ecmf\n", " domain: g\n", " Conventions: CF-1.8\n", " institution: ECMWF" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = source.to_xarray()\n", "ds" ] }, { "cell_type": "markdown", "id": "045235f5-175b-434a-8998-6af85670cbac", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "#### Selection and Groupings" ] }, { "cell_type": "raw", "id": "1953e91a-c48a-4b47-9be5-fe395f39698a", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The :ref:`data-sources-gribjump` source offers limited support for the selection methods (`.sel()` and\n", "`.isel()`), the grouping method (`.group_by()`), and anything else implemented for a\n", "`SimpleFieldList`. Keep in mind, however, that the only metadata available\n", "for these operations comes from the FDB request dictionary supplied by the user. Any\n", "selection value must match the type of the corresponding value in this dictionary\n", "(e.g. `hdate=\"20200101\"` as a string if the request specified `hdate` as strings)."
] }, { "cell_type": "code", "execution_count": 6, "id": "2c1e99ae", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data=SimpleFieldList(2) 2\n", "SimpleFieldList(1) (1, 1111) ['2020-01-01T00:00:00']\n", "SimpleFieldList(1) (1, 1111) ['2020-01-01T06:00:00']\n" ] } ], "source": [ "groups = source.sel(hdate=\"20200101\").group_by(\"time\")\n", "for group in groups:\n", "    print(group, group.to_numpy().shape, group.metadata('base_datetime'))" ] }, { "cell_type": "markdown", "id": "8e1626a9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "#### Extraction Options\n", "\n", "You can specify the grid points to extract through one of three options. GribJump\n", "treats every field as a flattened 1D array, so all grid point positions must be\n", "given as indices into this flattened representation.\n", "\n", "* **Ranges:** A list of tuples `(start, end)` defining contiguous ranges of grid\n", " points to extract. As shown in the example above, each tuple specifies a start\n", " index (inclusive) and end index (exclusive) in the flattened 1D array\n", " representation of the grid. For example, `[(0, 100), (200, 300)]` would extract\n", " grid points 0-99 and 200-299.\n", "\n", "* **Indices:** A 1D numpy array or list of specific grid point indices to extract\n", " from the flattened grid. This allows for non-contiguous extraction of\n", " individual grid points. For example, `np.array([5, 10, 15, 20])` would extract\n", " exactly those four grid points. This array must be sorted in ascending order.\n", "\n", "* **Masks:** A numpy boolean array where `True` indicates grid points to extract\n", " and `False` indicates points to skip. The mask must have the same length as\n", " the total number of grid points in the field. 
However, no such validation is\n", " performed and passing a mask with an invalid shape will silently return wrong\n", " results.\n", "\n", "Only one of these methods can be used at a time. Please also note that GribJump\n", "uses ranges internally regardless of what the user specifies. Converting the\n", "user's chosen representation to ranges can be expensive when multiple\n", "fields are accessed simultaneously." ] }, { "cell_type": "markdown", "id": "6fe61883", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "##### Code Examples" ] }, { "cell_type": "code", "execution_count": 7, "id": "60165c68", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gribjump Engine: Built file map: 0.010474 second elapsed, 0.008713 second cpu\n", "Gribjump Progress: 1 of 1 tasks complete\n", "Gribjump Engine: All tasks finished: 0.039335 second elapsed, 0.039178 second cpu\n", "Gribjump Engine: Repackaged results: 6e-06 second elapsed, 5e-06 second cpu\n", "Engine::extract: 2e-05 second elapsed, 2e-05 second cpu\n", "Extracted dataset (ranges): <xarray.Dataset> Size: 36kB\n", "Dimensions: (index: 2222)\n", "Coordinates:\n", " * index (index) int64 18kB 1234 1235 1236 1237 1238 ... 4563 4564 4565 4566\n", "Data variables:\n", " 240023 (index) float64 18kB ...\n", "Attributes: (12/13)\n", " param: 240023\n", " class: ce\n", " stream: efcl\n", " levtype: sfc\n", " type: sfo\n", " expver: 0001\n", " ... 
...\n", " hdate: 20200101\n", " time: 0000\n", " origin: ecmf\n", " domain: g\n", " Conventions: CF-1.8\n", " institution: ECMWF\n", "Gribjump Engine: Built file map: 0.009283 second elapsed, 0.007779 second cpu\n", "Gribjump Progress: 1 of 1 tasks complete\n", "Gribjump Engine: All tasks finished: 0.039215 second elapsed, 0.038721 second cpu\n", "Gribjump Engine: Repackaged results: 5e-06 second elapsed, 5e-06 second cpu\n", "Engine::extract: 2.3e-05 second elapsed, 2.2e-05 second cpu\n", "Extracted dataset (indices): <xarray.Dataset> Size: 80B\n", "Dimensions: (index: 5)\n", "Coordinates:\n", " * index (index) int64 40B 10 50 100 150 200\n", "Data variables:\n", " 240023 (index) float64 40B ...\n", "Attributes: (12/13)\n", " param: 240023\n", " class: ce\n", " stream: efcl\n", " levtype: sfc\n", " type: sfo\n", " expver: 0001\n", " ... ...\n", " hdate: 20200101\n", " time: 0000\n", " origin: ecmf\n", " domain: g\n", " Conventions: CF-1.8\n", " institution: ECMWF\n", "Gribjump Engine: Built file map: 0.012851 second elapsed, 0.009124 second cpu\n", "Gribjump Progress: 1 of 1 tasks complete\n", "Gribjump Engine: All tasks finished: 1 second elapsed, 1 second cpu\n", "Gribjump Engine: Repackaged results: 6e-06 second elapsed, 6e-06 second cpu\n", "Engine::extract: 2.7e-05 second elapsed, 2.6e-05 second cpu\n", "Extracted dataset (mask): <xarray.Dataset> Size: 11MB\n", "Dimensions: (index: 672975)\n", "Coordinates:\n", " * index (index) int64 5MB 10 11 32 41 ... 13454079 13454087 13454093\n", "Data variables:\n", " 240023 (index) float64 5MB ...\n", "Attributes: (12/13)\n", " param: 240023\n", " class: ce\n", " stream: efcl\n", " levtype: sfc\n", " type: sfo\n", " expver: 0001\n", " ... 
...\n", " hdate: 20200101\n", " time: 0000\n", " origin: ecmf\n", " domain: g\n", " Conventions: CF-1.8\n", " institution: ECMWF\n" ] } ], "source": [ "request = {\n", " \"class\": \"ce\",\n", " \"expver\": \"0001\",\n", " \"stream\": \"efcl\",\n", " \"date\": \"20230101\",\n", " \"model\": \"lisflood\",\n", " \"domain\": \"g\",\n", " \"origin\": \"ecmf\",\n", " \"step\": 6,\n", " \"type\": \"sfo\",\n", " \"levtype\": \"sfc\",\n", " \"param\": \"240023\",\n", " \"time\": \"0000\",\n", " \"hdate\": \"20200101\",\n", "}\n", "\n", "# Example 1: Using ranges\n", "source_ranges = earthkit.data.from_source(\n", " \"gribjump\",\n", " request,\n", " ranges=[(1234, 2345), (3456, 4567)],\n", ")\n", "ds = source_ranges.to_xarray()\n", "print(\"Extracted dataset (ranges):\", ds)\n", "\n", "# Example 2: Using indices to extract specific grid points\n", "indices = np.array([10, 50, 100, 150, 200])\n", "source_indices = earthkit.data.from_source(\n", " \"gribjump\",\n", " request,\n", " indices=indices,\n", ")\n", "print(\"Extracted dataset (indices):\", source_indices.to_xarray())\n", "\n", "# Example 3: Using a boolean mask with random selection\n", "shape = 4530 * 2970 # Depends on your grid size\n", "mask = np.random.choice([True, False], size=shape, p=[0.05, 0.95])\n", "\n", "source_mask = earthkit.data.from_source(\n", " \"gribjump\",\n", " request,\n", " mask=mask,\n", ")\n", "print(\"Extracted dataset (mask):\", source_mask.to_xarray())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 5 }