{ "cells": [ { "cell_type": "markdown", "id": "a1e27a82-7792-4544-939a-b3ca2adc87fd", "metadata": {}, "source": [ "# Build a catalog for CESM1 timeseries output" ] }, { "cell_type": "markdown", "id": "beb0521b-922f-48f2-960c-bb2526ba1671", "metadata": {}, "source": [ "## Imports\n", "First, we import the `Builder` object and the parser we are using, in this case `parse_cesm_timeseries`." ] }, { "cell_type": "code", "execution_count": 1, "id": "94483fcd-b5ec-4668-985f-291675c52637", "metadata": {}, "outputs": [], "source": [ "from ecgtools import Builder\n", "from ecgtools.parsers.cesm import parse_cesm_timeseries" ] }, { "cell_type": "markdown", "id": "63fad45a-3a88-415c-bc2e-6df9af0802a6", "metadata": {}, "source": [ "## Set up the Builder\n", "In this example, we are using CESM1 model output stored in `/glade/campaign/univ/udeo0005/cesmLE_no_pinatubo/`. We exclude the `hist` and `rest` directories, since we only want timeseries files." ] }, { "cell_type": "code", "execution_count": 2, "id": "d771fac5-09b2-498c-9de7-d579f447f1d4", "metadata": {}, "outputs": [], "source": [ "b = Builder(\n", "    # Where to look for model output\n", "    \"/glade/campaign/univ/udeo0005/cesmLE_no_pinatubo/\",\n", "    depth=5,\n", "    exclude_patterns=[\"*/hist/*\", \"*/rest/*\"],\n", "    njobs=-1,\n", ")" ] }, { "cell_type": "markdown", "id": "06b6ba6d-b468-41ba-9c63-228e16febf6a", "metadata": {}, "source": [ "## Configuring the parser\n", "Since we are working with CESM1 model output, we specify the stream information to ensure the parser adds the correct metadata to our catalog." ] }, { "cell_type": "markdown", "id": "79849523-c095-480e-b1e7-df74cf1d6624", "metadata": {}, "source": [ "Notice that the parser takes two arguments: `file` and `user_streams_dict`.\n", "\n", "The `user_streams_dict` argument allows us to customize the stream information." ] }, { "cell_type": "code", "execution_count": 3, "id": "fb05360f-6f4f-458e-91e9-0cbb262164dc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m 
\u001b[0mparse_cesm_timeseries\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser_streams_dict\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m Parser for CESM Timeseries files\n", "\u001b[0;31mFile:\u001b[0m /glade/work/mgrover/git_repos/ecgtools/ecgtools/parsers/cesm.py\n", "\u001b[0;31mType:\u001b[0m function\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "?parse_cesm_timeseries" ] }, { "cell_type": "markdown", "id": "e7cc1e7d-a526-40cf-89bb-11063a7b7fdb", "metadata": {}, "source": [ "We set up a dictionary formatted as follows. It is good practice to investigate which streams are included in your CESM model output, along with their associated components and frequencies. Since this is CESM1 output, the `frequency` information is not stored within the attributes, which means the parser will use the [`default_streams` dictionary](https://github.com/mgrover1/ecgtools/blob/main/ecgtools/parsers/cesm.py#L12#L58) to parse the files and assign metadata within the catalog." ] }, { "cell_type": "code", "execution_count": 4, "id": "9aa1b599-e1fd-489f-bd3f-7d017e8ab3ee", "metadata": {}, "outputs": [], "source": [ "stream_info = {\n", "    'cam.h0': {'component': 'atm', 'frequency': 'month_1'},\n", "    'cam.h1': {'component': 'atm', 'frequency': 'day_1'},\n", "    'cam.h2': {'component': 'atm', 'frequency': 'hour_6'},\n", "    'cice.h': {'component': 'ice', 'frequency': 'month_1'},\n", "    'cice.h1': {'component': 'ice', 'frequency': 'day_1'},\n", "    'clm2.h0': {'component': 'lnd', 'frequency': 'month_1'},\n", "    'clm2.h1': {'component': 'lnd', 'frequency': 'day_1'},\n", "    'pop.h.ecosys.nyear1': {'component': 'ocn', 'frequency': 'year_1'},\n", "    'pop.h.nday1': {'component': 'ocn', 'frequency': 'day_1'},\n", "    'pop.h': {'component': 'ocn', 'frequency': 'month_1'},\n", "}" ] }, { "cell_type": 
"markdown", "id": "29ed06bc-7f6c-41f7-b1e0-918ace433294", "metadata": {}, "source": [ "## Build the catalog\n", "Now that we setup our `stream_info` dictionary, we feed the parser (`parse_cesm_timeseries`) and the `stream_info` dictionary into the `.build()` call, using the following syntax" ] }, { "cell_type": "code", "execution_count": 5, "id": "5cfe931a-f39e-4311-aaf2-2f38c350af70", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 36 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 1.9s finished\n", "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 36 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 90 tasks | elapsed: 1.6s\n", "[Parallel(n_jobs=-1)]: Done 216 tasks | elapsed: 2.1s\n", "[Parallel(n_jobs=-1)]: Done 378 tasks | elapsed: 2.7s\n", "[Parallel(n_jobs=-1)]: Done 576 tasks | elapsed: 3.4s\n", "[Parallel(n_jobs=-1)]: Done 810 tasks | elapsed: 4.3s\n", "[Parallel(n_jobs=-1)]: Done 1080 tasks | elapsed: 5.4s\n", "[Parallel(n_jobs=-1)]: Done 1386 tasks | elapsed: 6.5s\n", "[Parallel(n_jobs=-1)]: Done 1728 tasks | elapsed: 7.8s\n", "[Parallel(n_jobs=-1)]: Done 2106 tasks | elapsed: 9.2s\n", "[Parallel(n_jobs=-1)]: Done 2520 tasks | elapsed: 10.7s\n", "[Parallel(n_jobs=-1)]: Done 2970 tasks | elapsed: 12.2s\n", "[Parallel(n_jobs=-1)]: Done 3888 tasks | elapsed: 15.5s\n", "[Parallel(n_jobs=-1)]: Done 4932 tasks | elapsed: 18.9s\n", "[Parallel(n_jobs=-1)]: Done 6048 tasks | elapsed: 22.6s\n", "[Parallel(n_jobs=-1)]: Done 7236 tasks | elapsed: 26.8s\n", "[Parallel(n_jobs=-1)]: Done 8496 tasks | elapsed: 31.9s\n", "[Parallel(n_jobs=-1)]: Done 9828 tasks | elapsed: 36.8s\n", "[Parallel(n_jobs=-1)]: Done 11232 tasks | elapsed: 42.4s\n", "[Parallel(n_jobs=-1)]: Done 12708 tasks | elapsed: 48.2s\n", "[Parallel(n_jobs=-1)]: Done 14256 tasks | elapsed: 54.5s\n", "[Parallel(n_jobs=-1)]: Done 15876 tasks | elapsed: 1.0min\n", 
"[Parallel(n_jobs=-1)]: Done 17568 tasks | elapsed: 1.1min\n", "[Parallel(n_jobs=-1)]: Done 19332 tasks | elapsed: 1.2min\n", "[Parallel(n_jobs=-1)]: Done 20736 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 22464 tasks | elapsed: 1.6min\n", "[Parallel(n_jobs=-1)]: Done 24444 tasks | elapsed: 1.6min\n", "[Parallel(n_jobs=-1)]: Done 26496 tasks | elapsed: 1.7min\n", "[Parallel(n_jobs=-1)]: Done 28620 tasks | elapsed: 1.8min\n", "[Parallel(n_jobs=-1)]: Done 31716 tasks | elapsed: 1.9min\n", "[Parallel(n_jobs=-1)]: Done 36252 tasks | elapsed: 2.1min\n", "[Parallel(n_jobs=-1)]: Done 38733 out of 38733 | elapsed: 2.2min finished\n", "/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 1950 assets/files. A list of these assets can be found in `.invalid_assets` attribute.\n", " parsing_func, parsing_func_kwargs\n" ] }, { "data": { "text/plain": [ "Builder(root_path=PosixPath('/glade/campaign/univ/udeo0005/cesmLE_no_pinatubo'), extension='.nc', depth=5, exclude_patterns=['*/hist/*', '*/rest/*'], njobs=-1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.build(parse_cesm_timeseries, parsing_func_kwargs={'user_streams_dict': stream_info})" ] }, { "cell_type": "markdown", "id": "294685b5-695e-4027-a7b8-7193d9aef30b", "metadata": {}, "source": [ "## Inspect the catalog" ] }, { "cell_type": "markdown", "id": "1693590d-8516-4651-baad-c567c23e75be", "metadata": {}, "source": [ "It looks like the primary files missed were zonally averaged timeseries files, which for now, we are not concerned about dealing with " ] }, { "cell_type": "code", "execution_count": 8, "id": "e6af3ae2-8367-47a7-999d-748f197abe77", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved catalog location: /glade/work/mgrover/intake-esm-catalogs/pinatubo-LE.json and /glade/work/mgrover/intake-esm-catalogs/pinatubo-LE.csv\n" ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "/glade/u/home/mgrover/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/ipykernel_launcher.py:17: UserWarning: Unable to parse 1950 assets/files. A list of these assets can be found in /glade/work/mgrover/intake-esm-catalogs/invalid_assets_pinatubo-LE.csv.\n" ] } ], "source": [ "b.save(\n", " '/glade/work/mgrover/intake-esm-catalogs/pinatubo-LE.csv',\n", " # Column name including filepath\n", " path_column_name='path',\n", " # Column name including variables\n", " variable_column_name='variable',\n", " # Data file format - could be netcdf or zarr (in this case, netcdf)\n", " data_format=\"netcdf\",\n", " # Which attributes to groupby when reading in variables using intake-esm\n", " groupby_attrs=[\"component\", \"stream\", \"case\"],\n", " # Aggregations which are fed into xarray when reading in data using intake\n", " aggregations=[\n", " {\n", " \"type\": \"join_existing\",\n", " \"attribute_name\": \"time_range\",\n", " \"options\": {\"dim\": \"time\", \"coords\": \"minimal\", \"compat\": \"override\"},\n", " }\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "c002400b-33e5-4b91-a63f-0c72d00df923", "metadata": {}, "source": [ "## Test the Catalog" ] }, { "cell_type": "code", "execution_count": 9, "id": "2a60872d-c08d-4b5d-b208-2984022cb875", "metadata": {}, "outputs": [], "source": [ "import intake" ] }, { "cell_type": "code", "execution_count": 10, "id": "c28576f5-71ac-42ad-a08c-9e5bb5134ea8", "metadata": {}, "outputs": [], "source": [ "col = intake.open_esm_datastore('/glade/work/mgrover/intake-esm-catalogs/pinatubo-LE.json')" ] }, { "cell_type": "code", "execution_count": 11, "id": "cfd0ce51-8414-4301-ad66-fc70df171213", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['cam.h1', 'cam.h0', 'pop.h.ecosys.nday1', 'pop.h.nday1',\n", " 'pop.h.ecosys.nyear1', 'pop.h', 'clm2.h1', 'clm2.h0', 'rtm.h1',\n", " 'rtm.h0', 'cice.h1', 'cice.h'], dtype=object)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" 
} ], "source": [ "col.df.stream.unique()" ] }, { "cell_type": "code", "execution_count": 12, "id": "c93534fc-2377-4853-a018-aaa743918e7b", "metadata": {}, "outputs": [], "source": [ "cat = col.search(variable='TEMP', frequency='month_1')" ] }, { "cell_type": "code", "execution_count": 13, "id": "f796b5e3-4411-492b-9be8-c6e46f6017d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--> The keys in the returned dictionary of datasets are constructed as follows:\n", "\t'component.stream.case'\n" ] }, { "data": { "text/html": [ "\n", "