{ "cells": [ { "cell_type": "markdown", "id": "8fdfb068-2bee-47eb-9dc3-b48d7cd6ab28", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Retrieving data from S3 buckets" ] }, { "cell_type": "raw", "id": "449009cf-adc2-4045-8402-1d3cd00d3b73", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The :ref:`data-sources-s3` data source provides access to `Amazon S3 `_ buckets.\n", "\n", "In this example we will read GRIB data from a publicly available `Amazon S3 `_ bucket on the European Weather Cloud (EWC)." ] }, { "cell_type": "markdown", "id": "bb761cba-a7d8-4c72-a314-0c61376a37e2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Getting a whole object" ] }, { "cell_type": "markdown", "id": "51b17e18-6bf5-46bc-be6f-95d040d8d2af", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "#### Disk based access" ] }, { "cell_type": "raw", "id": "0593f2a0-5a70-49a3-9473-9df98dc9832c", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "By default the data is downloaded and stored in the :ref:`cache `. Since we know that the bucket is public we use ``anon=True`` in :ref:`from_source() ` to bypass the S3 authentication." 
] }, { "cell_type": "code", "execution_count": 1, "id": "d67d6ebf-ecef-4e65-b9cc-da9251b0d66c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "66227b7222e943968b92e86aeca6b153", "version_major": 2, "version_minor": 0 }, "text/plain": [ "test6.grib: 0%| | 0.00/1.41k [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
centreshortNametypeOfLevelleveldataDatedataTimestepRangedataTypenumbergridType
0ecmftisobaricInhPa10002018080112000an0regular_ll
1ecmfuisobaricInhPa10002018080112000an0regular_ll
2ecmfvisobaricInhPa10002018080112000an0regular_ll
3ecmftisobaricInhPa8502018080112000an0regular_ll
4ecmfuisobaricInhPa8502018080112000an0regular_ll
5ecmfvisobaricInhPa8502018080112000an0regular_ll
\n", "" ], "text/plain": [ " centre shortName typeOfLevel level dataDate dataTime stepRange \\\n", "0 ecmf t isobaricInhPa 1000 20180801 1200 0 \n", "1 ecmf u isobaricInhPa 1000 20180801 1200 0 \n", "2 ecmf v isobaricInhPa 1000 20180801 1200 0 \n", "3 ecmf t isobaricInhPa 850 20180801 1200 0 \n", "4 ecmf u isobaricInhPa 850 20180801 1200 0 \n", "5 ecmf v isobaricInhPa 850 20180801 1200 0 \n", "\n", " dataType number gridType \n", "0 an 0 regular_ll \n", "1 an 0 regular_ll \n", "2 an 0 regular_ll \n", "3 an 0 regular_ll \n", "4 an 0 regular_ll \n", "5 an 0 regular_ll " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import earthkit.data as ekd\n", "\n", "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\", \n", " \"objects\": \"test6.grib\",\n", " }\n", "\n", "ds = ekd.from_source(\"s3\", req, anon=True) \n", "ds.ls()" ] }, { "cell_type": "markdown", "id": "0f9e038b-c81c-4fce-bcd8-9e53c379c22f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "#### Reading as a stream" ] }, { "cell_type": "raw", "id": "d750791d-570e-4a83-85c5-fbbe5d0769f6", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "We can read GRIB data from an S3 bucket as a stream without writing anything to disk. This can be activated by calling :ref:`from_source() ` with ``stream=True``. By default we get a stream iterator, which we can consume field by field." 
] }, { "cell_type": "code", "execution_count": 2, "id": "d9f20fcb-3759-4327-acf3-622d0d03b518", "metadata": { "editable": true, "raw_mimetype": "", "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GribField(t,1000,20180801,1200,0,0)\n", "GribField(u,1000,20180801,1200,0,0)\n", "GribField(v,1000,20180801,1200,0,0)\n", "GribField(t,850,20180801,1200,0,0)\n", "GribField(u,850,20180801,1200,0,0)\n", "GribField(v,850,20180801,1200,0,0)\n" ] } ], "source": [ "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\", \n", " \"objects\": \"test6.grib\",\n", " }\n", "\n", "ds = ekd.from_source(\"s3\", req, stream=True, anon=True) \n", "\n", "for f in ds:\n", " # f is a GribField object. It is deleted when it goes out of scope\n", " print(f)\n" ] }, { "cell_type": "raw", "id": "b8699c09-4adf-46cb-94a4-56c72cb69300", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "With the :py:meth:`batched ` method we can iterate through the stream in batches of fixed size. For example, the following code reads the GRIB data in batches of 2 messages." 
] }, { "cell_type": "code", "execution_count": 3, "id": "14709514-1672-4391-86f7-bd553d047496", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len=2\n", " GribField(t,1000,20180801,1200,0,0)\n", " GribField(u,1000,20180801,1200,0,0)\n", "len=2\n", " GribField(v,1000,20180801,1200,0,0)\n", " GribField(t,850,20180801,1200,0,0)\n", "len=2\n", " GribField(u,850,20180801,1200,0,0)\n", " GribField(v,850,20180801,1200,0,0)\n" ] } ], "source": [ "ds = ekd.from_source(\"s3\", req, stream=True, anon=True) \n", "\n", "for f in ds.batched(2):\n", " # f is a fieldlist\n", " print(f\"len={len(f)}\")\n", " for g in f:\n", " print(f\" {g}\")" ] }, { "cell_type": "raw", "id": "16f8ff1a-546e-4789-9d89-522da100b3e8", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "With the ``read_all=True`` option we will load the whole object into memory. " ] }, { "cell_type": "code", "execution_count": 4, "id": "2cf20278-762c-4ff4-93bf-acd2220856a2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
centreshortNametypeOfLevelleveldataDatedataTimestepRangedataTypenumbergridType
0ecmftisobaricInhPa10002018080112000an0regular_ll
1ecmfuisobaricInhPa10002018080112000an0regular_ll
2ecmfvisobaricInhPa10002018080112000an0regular_ll
3ecmftisobaricInhPa8502018080112000an0regular_ll
4ecmfuisobaricInhPa8502018080112000an0regular_ll
5ecmfvisobaricInhPa8502018080112000an0regular_ll
\n", "
" ], "text/plain": [ " centre shortName typeOfLevel level dataDate dataTime stepRange \\\n", "0 ecmf t isobaricInhPa 1000 20180801 1200 0 \n", "1 ecmf u isobaricInhPa 1000 20180801 1200 0 \n", "2 ecmf v isobaricInhPa 1000 20180801 1200 0 \n", "3 ecmf t isobaricInhPa 850 20180801 1200 0 \n", "4 ecmf u isobaricInhPa 850 20180801 1200 0 \n", "5 ecmf v isobaricInhPa 850 20180801 1200 0 \n", "\n", " dataType number gridType \n", "0 an 0 regular_ll \n", "1 an 0 regular_ll \n", "2 an 0 regular_ll \n", "3 an 0 regular_ll \n", "4 an 0 regular_ll \n", "5 an 0 regular_ll " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = ekd.from_source(\"s3\", req, stream=True, read_all=True, anon=True) \n", "ds.ls()" ] }, { "cell_type": "markdown", "id": "6f74674f-269b-4f16-b543-90a0d427fd6b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Getting multiple objects" ] }, { "cell_type": "code", "execution_count": 5, "id": "69103ab9-e724-4cb0-b1d7-05e10a26e091", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "97c2cab62e884449a9c9b925415f505b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2 [00:00` (byte ranges) we want to read. It works both in stream and non-stream mode." 
] }, { "cell_type": "code", "execution_count": 6, "id": "680c7137-a57f-40e9-935c-351050037816", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "86f5a8dc836240bc9e73a9869ad0eb13", "version_major": 2, "version_minor": 0 }, "text/plain": [ "test6.grib: 0%| | 0.00/480 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
centreshortNametypeOfLevelleveldataDatedataTimestepRangedataTypenumbergridType
0ecmfuisobaricInhPa10002018080112000an0regular_ll
1ecmfvisobaricInhPa10002018080112000an0regular_ll
\n", "" ], "text/plain": [ " centre shortName typeOfLevel level dataDate dataTime stepRange \\\n", "0 ecmf u isobaricInhPa 1000 20180801 1200 0 \n", "1 ecmf v isobaricInhPa 1000 20180801 1200 0 \n", "\n", " dataType number gridType \n", "0 an 0 regular_ll \n", "1 an 0 regular_ll " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\",\n", " \"objects\": { \"object\": \"test6.grib\", \"parts\": (240, 480)},\n", " }\n", "\n", "ds = ekd.from_source(\"s3\", req, anon=True) \n", "ds.ls()" ] }, { "cell_type": "code", "execution_count": 7, "id": "d0dc0235-f6f7-4309-9aef-c679acec75ae", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "20ebfaa4cc5141828ff955fa83e90f7b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "test6.grib: 0%| | 0.00/480 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
centreshortNametypeOfLevelleveldataDatedataTimestepRangedataTypenumbergridType
0ecmftisobaricInhPa10002018080112000an0regular_ll
1ecmfvisobaricInhPa10002018080112000an0regular_ll
\n", "" ], "text/plain": [ " centre shortName typeOfLevel level dataDate dataTime stepRange \\\n", "0 ecmf t isobaricInhPa 1000 20180801 1200 0 \n", "1 ecmf v isobaricInhPa 1000 20180801 1200 0 \n", "\n", " dataType number gridType \n", "0 an 0 regular_ll \n", "1 an 0 regular_ll " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\",\n", " \"objects\": { \"object\": \"test6.grib\", \"parts\": [(0, 240), (480, 240)]},\n", " }\n", "\n", "ds = ekd.from_source(\"s3\", req, anon=True) \n", "ds.ls()" ] }, { "cell_type": "markdown", "id": "1ad0d5da-b870-4bc5-b440-82ddb70d9685", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Getting parts of multiple objects" ] }, { "cell_type": "code", "execution_count": 8, "id": "b234011b-86b2-4ecc-9362-fedb27561343", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "94357b25bb624aa7b8771df07083a432", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
centreshortNametypeOfLevelleveldataDatedataTimestepRangedataTypenumbergridType
0ecmftisobaricInhPa10002018080112000an0regular_ll
1ecmfuisobaricInhPa5002018080112000an0regular_ll
\n", "" ], "text/plain": [ " centre shortName typeOfLevel level dataDate dataTime stepRange \\\n", "0 ecmf t isobaricInhPa 1000 20180801 1200 0 \n", "1 ecmf u isobaricInhPa 500 20180801 1200 0 \n", "\n", " dataType number gridType \n", "0 an 0 regular_ll \n", "1 an 0 regular_ll " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\", \n", " \"objects\": [{\"object\": \"test6.grib\", \"parts\": (0, 240)}, \n", " {\"object\": \"tuv_pl.grib\", \"parts\": (2400, 240)}],\n", " }\n", "\n", "ds = ekd.from_source(\"s3\", req, anon=True) \n", "ds.ls()" ] }, { "cell_type": "markdown", "id": "f528e5a1-9376-4b0a-bfdf-c9bc6a8a7482", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Using parts with a stream" ] }, { "cell_type": "raw", "id": "0e69210d-8df0-444f-9761-e1110db77ccb", "metadata": { "editable": true, "raw_mimetype": "text/restructuredtext", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The :ref:`parts ` (byte ranges) still work when used with streams." ] }, { "cell_type": "code", "execution_count": 9, "id": "71827f00-d5ee-43bd-a88d-74f01bed2c39", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GribField(u,1000,20180801,1200,0,0)\n", "GribField(v,1000,20180801,1200,0,0)\n" ] } ], "source": [ "req = {\"endpoint\": \"object-store.os-api.cci1.ecmwf.int\",\n", " \"bucket\": \"earthkit-test-data-public\",\n", " \"objects\": { \"object\": \"test6.grib\", \"parts\": (240, 480)},\n", " }\n", "\n", "\n", "ds = ekd.from_source(\"s3\", req, stream=True, anon=True) \n", "\n", "for f in ds:\n", " # f is a GribField object. It is deleted when it goes out of scope\n", " print(f)" ] }, { "cell_type": "code", "execution_count": null, "id": "86576539-7cf1-494b-8b49-0a85fb3d128b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "dev", "language": "python", "name": "dev" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }