{ "cells": [ { "cell_type": "markdown", "id": "sublime-university", "metadata": {}, "source": [ "# Building an Intake-esm catalog from CESM2 History Files\n", "\n", "This example covers how to build an intake-esm catalog from Community Earth System Model v2 (CESM2) model output. In this case, we use model output generated with the default component set (compset) detailed in the [CESM Quickstart Guide](https://escomp.github.io/CESM/versions/cesm2.1/html/).\n", "\n", "## What's a \"history\" file?\n", "A history file is the default output from CESM, where each file is a single time \"slice\" with every variable from the component of interest. These types of files can be difficult to work with, since one is often interested in a time series of a single variable. Building a catalog can be helpful in accessing your data, querying for certain variables, and potentially creating timeseries files later down the road.\n", "\n", "Let's get started!\n", "\n", "## Imports\n", "The only parts of ecgtools we need are the `Builder` object and the `parse_cesm_history` parser from the CESM parsers! We import `glob` to take a look at the files we are parsing." ] }, { "cell_type": "code", "execution_count": 1, "id": "outstanding-blackjack", "metadata": {}, "outputs": [], "source": [ "import glob\n", "\n", "from ecgtools import Builder\n", "from ecgtools.parsers.cesm import parse_cesm_history" ] }, { "cell_type": "markdown", "id": "electric-privacy", "metadata": {}, "source": [ "### Understanding the Directory Structure\n", "\n", "The first step in setting up the `Builder` object is determining where your files are stored. 
As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in `/glade/work/mgrover/cesm_test_data/`\n", "\n", "Taking a look at that directory, we see that there is a single case `b.e20.B1850.f19_g17.test`" ] }, { "cell_type": "code", "execution_count": 2, "id": "excellent-donor", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/*')" ] }, { "cell_type": "markdown", "id": "oriental-record", "metadata": {}, "source": [ "Once we go into that directory, we see all the different components, including the atmosphere (atm), ocean (ocn), and land (lnd)!" ] }, { "cell_type": "code", "execution_count": 3, "id": "smaller-bubble", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/logs',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/cpl',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/lnd',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/esp',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/glc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/rof',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/rest',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/wav',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ice']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/*')" ] }, { "cell_type": "markdown", "id": "delayed-chuck", "metadata": {}, "source": [ "If we go one step 
further, we notice that each component contains a `hist` directory, which holds the model output." ] }, { "cell_type": "code", "execution_count": 4, "id": "white-hebrew", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0002-08.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0001-09.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0002-07.nc']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/*/*.nc')[0:3]" ] }, { "cell_type": "markdown", "id": "brief-class", "metadata": {}, "source": [ "If we take a look at the `ocn` component, though, we notice that it also contains a few timeseries files..." ] }, { "cell_type": "code", "execution_count": 5, "id": "tracked-friend", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.pCO2SURF.000101-001012.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.SiO3_RIV_FLUX.000101-001012.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.graze_sp_zootot.000101-001012.nc']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/*/*.nc')[0:3]" ] }, { "cell_type": "markdown", "id": "large-hardware", "metadata": {}, "source": [ "When we set up our catalog builder, we will need to exclude the timeseries (tseries) and restart (rest) directories!\n", "\n", "Now that we understand the directory structure, let's make the 
catalog." ] }, { "cell_type": "markdown", "id": "wound-administration", "metadata": {}, "source": [ "## Build the catalog!\n", "\n", "Let's start by inspecting the builder object" ] }, { "cell_type": "code", "execution_count": 6, "id": "played-killer", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mInit signature:\u001b[0m\n", "\u001b[0mBuilder\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mroot_path\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mpydantic\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDirectoryPath\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mextension\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'.nc'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdepth\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mexclude_patterns\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mnjobs\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "Generates a catalog from a list of files.\n", "\n", "Parameters\n", "----------\n", "root_path : str\n", " Path of root directory.\n", "extension : str, optional\n", " File extension, by default None. 
If None, the builder will look for files with\n", " \"*.nc\" extension.\n", "depth : int, optional\n", " Recursion depth. Recursively crawl `root_path` up to a specified depth, by default 0\n", "exclude_patterns : list, optional\n", " Directory, file patterns to exclude during catalog generation.\n", " These could be substring or regular expressions. by default None\n", "njobs : int, optional\n", " The maximum number of concurrently running jobs,\n", " by default -1 meaning all CPUs are used.\n", "\u001b[0;31mFile:\u001b[0m /glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py\n", "\u001b[0;31mType:\u001b[0m type\n", "\u001b[0;31mSubclasses:\u001b[0m \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "?Builder" ] }, { "cell_type": "code", "execution_count": 7, "id": "physical-messenger", "metadata": {}, "outputs": [], "source": [ "b = Builder(\n", " # Directory with the output\n", " \"/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/\",\n", " # Depth of 1 since we are sending it to the case output directory\n", " depth=1,\n", " # Exclude the timeseries and restart directories\n", " exclude_patterns=[\"*/tseries/*\", \"*/rest/*\"],\n", " # Number of jobs to execute - should be equal to # threads you are using\n", " njobs=5,\n", ")" ] }, { "cell_type": "markdown", "id": "corrected-preference", "metadata": {}, "source": [ "Double check the object is set up..." ] }, { "cell_type": "code", "execution_count": 8, "id": "passing-reliance", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Builder(root_path=PosixPath('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test'), extension='.nc', depth=1, exclude_patterns=['*/tseries/*', '*/rest/*'], njobs=5)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b" ] }, { "cell_type": "markdown", "id": "natural-barrier", "metadata": {}, "source": [ "We are good to go! Let's build the catalog by calling `.build()` on the object! 
By default, it will use the `LokyBackend`, which is described in the [Joblib documentation](https://joblib.readthedocs.io/en/latest/parallel.html).\n", "\n", "We also add in the parser here! By default, the parsers use a `default_streams` dictionary to parse the files.\n", "\n", "This dictionary follows the convention:\n", "```python\n", "{'stream': {'component': 'some_component', 'frequency': 'frequency_num'}}\n", "```\n", "\n", "Here is an example of the first few entries!\n", "```python\n", "{'cam.h0': {'component': 'atm', 'frequency': 'month_1'},\n", " 'cam.h1': {'component': 'atm', 'frequency': 'day_1'},\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "id": "racial-championship", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.\n", "[Parallel(n_jobs=5)]: Done 6 out of 12 | elapsed: 0.2s remaining: 0.2s\n", "[Parallel(n_jobs=5)]: Done 9 out of 12 | elapsed: 0.2s remaining: 0.1s\n", "[Parallel(n_jobs=5)]: Done 12 out of 12 | elapsed: 0.2s remaining: 0.0s\n", "[Parallel(n_jobs=5)]: Done 12 out of 12 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.\n", "[Parallel(n_jobs=5)]: Done 8 tasks | elapsed: 1.5s\n", "[Parallel(n_jobs=5)]: Done 62 tasks | elapsed: 5.1s\n", "[Parallel(n_jobs=5)]: Done 152 tasks | elapsed: 12.3s\n", "[Parallel(n_jobs=5)]: Done 264 out of 264 | elapsed: 16.8s finished\n", "/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 5 assets/files. 
A list of these assets can be found in `.invalid_assets` attribute.\n", " parsing_func, parsing_func_kwargs\n" ] } ], "source": [ "b = b.build( # Use the parse_cesm_history parsing function\n", " parse_cesm_history,\n", ")" ] }, { "cell_type": "markdown", "id": "automated-pharmacy", "metadata": {}, "source": [ "## Inspect the Catalog" ] }, { "cell_type": "markdown", "id": "premium-chapel", "metadata": {}, "source": [ "Now that the catalog is built, we can inspect the dataframe which is used to create the catalog by calling `.df` on the builder object" ] }, { "cell_type": "code", "execution_count": 10, "id": "internal-warren", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | component | stream | case | date | frequency | variables | path |
|---|---|---|---|---|---|---|---|
| 0 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-08 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 1 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0001-09 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 2 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-07 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 3 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0003-05 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 4 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-01 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 259 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-08 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 260 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-03 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 261 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-11 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 262 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-10 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 263 | ice | cice.h | b.e20.B1850.f19_g17.test | 0003-12 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
259 rows × 7 columns
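
Most of the columns in this dataframe (`component`, `stream`, `case`, `date`, `frequency`) can be read straight off a history file's name. The sketch below illustrates that idea under stated assumptions: it assumes the `<case>.<stream>.<date>.nc` naming pattern seen in the directory listings, uses only the two `default_streams` entries shown earlier, and defines a hypothetical helper, `parse_history_filename`, that is not part of ecgtools. It is not how `parse_cesm_history` is implemented, and it leaves out the `variables` column, which requires opening the file.

```python
from pathlib import PurePosixPath

# Two entries copied from the default_streams example in this notebook;
# the dictionary used by ecgtools covers more streams than these.
EXAMPLE_STREAMS = {
    "cam.h0": {"component": "atm", "frequency": "month_1"},
    "cam.h1": {"component": "atm", "frequency": "day_1"},
}


def parse_history_filename(path, streams=EXAMPLE_STREAMS):
    """Hypothetical helper: split '<case>.<stream>.<date>.nc' into catalog fields."""
    name = PurePosixPath(path).name
    for stream, info in streams.items():
        marker = f".{stream}."
        if marker in name:
            # Everything before the stream is the case name,
            # everything after it is the date plus the extension.
            case, remainder = name.split(marker, 1)
            date = remainder.rsplit(".", 1)[0]  # drop the '.nc' extension
            return {
                "component": info["component"],
                "stream": stream,
                "case": case,
                "date": date,
                "frequency": info["frequency"],
                "path": str(path),
            }
    raise ValueError(f"No known stream found in {name!r}")


record = parse_history_filename(
    "/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/"
    "atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0002-08.nc"
)
print(record["component"], record["stream"], record["case"], record["date"])
```

For the first atmosphere file listed earlier, this yields the same `component`, `stream`, `case`, `date`, and `frequency` values as row 0 of the dataframe. A file whose name contains none of the known streams raises a `ValueError` here.
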
| | INVALID_ASSET | TRACEBACK |
|---|---|---|
| 15 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 28 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 34 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 130 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 191 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
None catalog with 10 dataset(s) from 262 asset(s):
| | unique |
|---|---|
| component | 6 |
| stream | 10 |
| case | 1 |
| date | 79 |
| frequency | 4 |
| variables | 1449 |
| path | 262 |
None catalog with 1 dataset(s) from 36 asset(s):
| | unique |
|---|---|
| component | 1 |
| stream | 1 |
| case | 1 |
| date | 36 |
| frequency | 1 |
| variables | 434 |
| path | 36 |