{ "cells": [ { "cell_type": "markdown", "id": "56cac4e2-73b8-4e82-b5a5-3d59879be2f9", "metadata": {}, "source": [ "# Use a custom parser\n", "\n", "While many of the parsers included within this library may be useful, we do not have parsers for **every** dataset out there. If you are interested in adding your own parser (and hopefully contributing that parser to the main repo 😊), check out this walkthrough of how to build one!" ] }, { "cell_type": "markdown", "id": "1f1af82a-a475-478b-a6e1-d5fccb98623a", "metadata": {}, "source": [ "## What is a Parser?\n", "Basically, a parser collects information from two main sources:\n", "* The file path string\n", "* The dataset itself\n", "\n", "This means there are two main steps:\n", "* Parsing the file path string, splitting it on some delimiter (for example `_`)\n", "* Opening the file, and extracting variables and their attributes, or even global attributes\n", "\n", "The result of a parser is a dictionary of fields for a single file. These dictionaries become the rows of the catalog, which is stored in a `pandas.DataFrame`.\n", "\n", "It would probably be **more helpful** to walk through a concrete example of this..." ] }, { "cell_type": "markdown", "id": "e02d6ea1-faf0-4ce6-9b6b-31e4bbc079b5", "metadata": { "tags": [] }, "source": [ "## Example of Building a Parser\n", "Let's say we have a list of files that we want to parse! In this example, we are using a set of observational data on NCAR HPC resources. 
A full blog post detailing this dataset and comparison is [included here](https://ncar.github.io/esds/posts/2021/intake-obs-cesm2le-comparison/)" ] }, { "cell_type": "markdown", "id": "8a7cf547-d32e-4700-b54e-43898c6b13d7", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 16, "id": "5ee1b377-04f5-4afb-98c8-4c81bf702aec", "metadata": {}, "outputs": [], "source": [ "import glob\n", "import pathlib\n", "import traceback\n", "from datetime import datetime\n", "\n", "import xarray as xr\n", "\n", "from ecgtools import Builder\n", "from ecgtools.builder import INVALID_ASSET, TRACEBACK" ] }, { "cell_type": "code", "execution_count": 3, "id": "1942a0c6-d031-4f21-93c4-e01cfb805d7e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PD_02_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PI_05_climo.nc',\n", " 
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPCOSP_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPFD_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCP_12_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/JRA25_SON_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/LEGATES_04_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRAW_19x2_09_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRA_12_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MISRCOSP_JJA_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MODIS_ANN_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/NVAP_03_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/PRECL_07_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_09_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_SEAICE_DJF_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/TRMM_MAM_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WARREN_DJF_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WILLMOTT_04_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/XIEARKIN_09_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsg_10_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlso_ANN_climo.nc',\n", " '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsw_MAM_climo.nc']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))\n", "files[::20]" ] }, { "cell_type": "markdown", "id": "02e6e231-712f-4306-a23f-8ccd6bafc735", "metadata": {}, "source": [ "Observational datasets in this directory follow the convention `source_(month/season/annual)_climo.nc`.\n", "\n", "Let’s open up one of those datasets" ] }, { "cell_type": "code", "execution_count": 5, "id": 
"0b32e4fe-026e-4301-9907-4b0f48f4e536", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset>\n",
       "Dimensions:  (lat: 180, lon: 360, time: 1)\n",
       "Coordinates:\n",
       "  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5\n",
       "  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5\n",
       "  * time     (time) float32 1.0\n",
       "Data variables:\n",
       "    SOLIN    (time, lat, lon) float32 495.0 495.0 495.0 495.0 ... 0.0 0.0 0.0\n",
       "    FLUT     (time, lat, lon) float32 187.4 187.4 187.4 ... 170.7 170.7 170.7\n",
       "    FLUTC    (time, lat, lon) float32 188.8 188.8 188.8 ... 178.9 178.9 178.9\n",
       "    FSNTOA   (time, lat, lon) float32 147.8 147.8 147.8 ... -0.049 -0.049 -0.049\n",
       "    FSNTOAC  (time, lat, lon) float32 150.9 150.9 150.9 ... -0.006 -0.006 -0.006\n",
       "    SWCF     (time, lat, lon) float32 -3.149 -3.149 -3.149 ... -0.043 -0.043\n",
       "    LWCF     (time, lat, lon) float32 1.391 1.391 1.391 ... 8.272 8.272 8.272\n",
       "    RESTOA   (time, lat, lon) float32 -39.6 -39.6 -39.6 ... -170.7 -170.7 -170.7\n",
       "    ALBEDO   (time, lat, lon) float32 0.7015 0.7015 0.7015 ... nan nan nan\n",
       "    ALBEDOC  (time, lat, lon) float32 0.6951 0.6951 0.6951 ... nan nan nan\n",
       "    gw       (lat) float64 0.0001523 0.0004569 0.0007613 ... 0.0004569 0.0001523\n",
       "Attributes:\n",
       "    version:             This is version 2.8: March 7, 2014\n",
       "    institution:         NASA Langley Research Center\n",
       "    comment:             Data is from East to West and South to North. Climat...\n",
       "    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...\n",
       "    AMWG_author:         Cecile Hannay\n",
       "    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package\n",
       "    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...\n",
       "    NCO:                 20140724
" ], "text/plain": [ "<xarray.Dataset>\n",
       "Dimensions:  (lat: 180, lon: 360, time: 1)\n",
       "Coordinates:\n",
       "  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5\n",
       "  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5\n",
       "  * time     (time) float32 1.0\n",
       "Data variables:\n",
       "    SOLIN    (time, lat, lon) float32 ...\n",
       "    FLUT     (time, lat, lon) float32 ...\n",
       "    FLUTC    (time, lat, lon) float32 ...\n",
       "    FSNTOA   (time, lat, lon) float32 ...\n",
       "    FSNTOAC  (time, lat, lon) float32 ...\n",
       "    SWCF     (time, lat, lon) float32 ...\n",
       "    LWCF     (time, lat, lon) float32 ...\n",
       "    RESTOA   (time, lat, lon) float32 ...\n",
       "    ALBEDO   (time, lat, lon) float32 ...\n",
       "    ALBEDOC  (time, lat, lon) float32 ...\n",
       "    gw       (lat) float64 ...\n",
       "Attributes:\n",
       "    version:             This is version 2.8: March 7, 2014\n",
       "    institution:         NASA Langley Research Center\n",
       "    comment:             Data is from East to West and South to North. Climat...\n",
       "    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...\n",
       "    AMWG_author:         Cecile Hannay\n",
       "    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package\n",
       "    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...\n",
       "    NCO:                 20140724" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')\n", "ds" ] }, { "cell_type": "markdown", "id": "136fe816-4e4c-401e-beb1-14fb2bcb3a76", "metadata": {}, "source": [ "We see that this dataset is gridded on a global 1° grid (180 × 360 cells, with cell centers at 0.5°, 1.5°, and so on), with several variables related to radiative fluxes (e.g., TOA net shortwave)" ] }, { "cell_type": "markdown", "id": "dc84ad08-0d5c-4e38-be9e-0f4b1910a0e9", "metadata": {}, "source": [ "### Parsing the Filepath\n", "As mentioned before, the first step is parsing out information from the filepath. 
Here, we use [pathlib](https://docs.python.org/3/library/pathlib.html), which is helpful when working with filepaths generically. The `stem` attribute gives the filename without its directory or final suffix." ] }, { "cell_type": "code", "execution_count": 7, "id": "00118c42-f314-43ec-b31f-18b1eccd92ad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'AIRS_01_climo'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = pathlib.Path(files[0])\n", "path.stem" ] }, { "cell_type": "markdown", "id": "12800996-7570-4ac3-840a-b82b7fd25674", "metadata": {}, "source": [ "The stem can be split using `.split('_')`, which separates it into the following:\n", "* Observational dataset source\n", "* Month number, season, or annual\n", "* “climo”" ] }, { "cell_type": "code", "execution_count": 8, "id": "ee0baf5a-6a46-402b-b54b-f8e5d10f0f2d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['AIRS', '01', 'climo']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.stem.split('_')" ] }, { "cell_type": "markdown", "id": "24f7c30b-8bfd-493d-a20a-8b359a349778", "metadata": {}, "source": [ "### Open the File for More Information\n", "We can also gather useful insight by opening the file!" ] }, { "cell_type": "code", "execution_count": 10, "id": "8b37d494-b524-4f19-b6bf-f6b5ce72f432", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset>\n",
       "Dimensions:  (lat: 94, lev: 13, lon: 192, month: 1, time: 1)\n",
       "Coordinates:\n",
       "  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54\n",
       "  * time     (time) int32 1\n",
       "  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0\n",
       "  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1\n",
       "  * month    (month) int32 0\n",
       "Data variables:\n",
       "    gw       (lat) float64 0.0008309 0.001933 0.003035 ... 0.001933 0.0008309\n",
       "    T        (time, lev, lat, lon) float32 ...\n",
       "    RELHUM   (time, lev, lat, lon) float32 ...\n",
       "    O3       (time, lev, lat, lon) float32 ...\n",
       "    SHUM     (time, lev, lat, lon) float32 ...\n",
       "    PREH2O   (month, lat, lon) float32 nan nan nan nan nan ... nan nan 1.961 nan\n",
       "Attributes:\n",
       "    creation_date:             Thu Mar 13 09:28:11 MDT 2008\n",
       "    interpolation:             bilinear\n",
       "    outliers:                  \\nAll [RELHUM>100] and [T>323] were set to _Fi...\n",
       "    html:                      \\nhttp://www.cgd.ucar.edu/cms/andrew/papers/ge...\n",
       "    reference:                 \\nA. Gettelman, W.D. Collins, E.J. Fetzer, A. ...\n",
       "    source:                    Andrew Gettleman\n",
       "    file:                      airsmm48_all_4d_rt_v5_c3.nc\n",
       "    title:                     AIRS: 9/2002 - 8/2006\n",
       "    history:                   Tue Mar 18 14:35:30 2008: ncrename -O -v gwt,g...\n",
       "    nco_openmp_thread_number:  1
" ], "text/plain": [ "<xarray.Dataset>\n",
       "Dimensions:  (lat: 94, lev: 13, lon: 192, month: 1, time: 1)\n",
       "Coordinates:\n",
       "  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54\n",
       "  * time     (time) int32 1\n",
       "  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0\n",
       "  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1\n",
       "  * month    (month) int32 0\n",
       "Data variables:\n",
       "    gw       (lat) float64 ...\n",
       "    T        (time, lev, lat, lon) float32 ...\n",
       "    RELHUM   (time, lev, lat, lon) float32 ...\n",
       "    O3       (time, lev, lat, lon) float32 ...\n",
       "    SHUM     (time, lev, lat, lon) float32 ...\n",
       "    PREH2O   (month, lat, lon) float32 ...\n",
       "Attributes:\n",
       "    creation_date:             Thu Mar 13 09:28:11 MDT 2008\n",
       "    interpolation:             bilinear\n",
       "    outliers:                  \\nAll [RELHUM>100] and [T>323] were set to _Fi...\n",
       "    html:                      \\nhttp://www.cgd.ucar.edu/cms/andrew/papers/ge...\n",
       "    reference:                 \\nA. Gettelman, W.D. Collins, E.J. Fetzer, A. ...\n",
       "    source:                    Andrew Gettleman\n",
       "    file:                      airsmm48_all_4d_rt_v5_c3.nc\n",
       "    title:                     AIRS: 9/2002 - 8/2006\n",
       "    history:                   Tue Mar 18 14:35:30 2008: ncrename -O -v gwt,g...\n",
       "    nco_openmp_thread_number:  1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = xr.open_dataset(files[0])\n", "ds" ] }, { "cell_type": "markdown", "id": "b953602e-25ef-4858-ad49-e8f08fc6851e", "metadata": {}, "source": [ "Let’s look at the variable “Temperature” (`T`)" ] }, { "cell_type": "code", "execution_count": 11, "id": "eb066248-6b59-41b0-a598-6b5e11b5cb5e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>\n",
       "[234624 values with dtype=float32]\n",
       "Coordinates:\n",
       "  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54\n",
       "  * time     (time) int32 1\n",
       "  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0\n",
       "  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1\n",
       "Attributes:\n",
       "    units:        K\n",
       "    long_name:    Temperature\n",
       "    climatology:  AIRS monthly climatology 9/2002-8/2006
" ], "text/plain": [ "<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>\n",
       "[234624 values with dtype=float32]\n",
       "Coordinates:\n",
       "  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54\n",
       "  * time     (time) int32 1\n",
       "  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0\n",
       "  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1\n",
       "Attributes:\n",
       "    units:        K\n",
       "    long_name:    Temperature\n",
       "    climatology:  AIRS monthly climatology 9/2002-8/2006" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.T" ] }, { "cell_type": "markdown", "id": "61d5d65c-0981-468f-b3a8-efd661ce50e4", "metadata": {}, "source": [ "In this case, we want to include the list of variables available from this single file, such that each entry in our catalog represents a single file. We can collect the variables that carry a `long_name` attribute using the following:" ] }, { "cell_type": "code", "execution_count": 13, "id": "7ac3888a-6080-4aa3-915e-46da36983c19", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "variable_list = [var for var in ds if 'long_name' in ds[var].attrs]\n", "variable_list" ] }, { "cell_type": "markdown", "id": "28ffb10d-fc27-41d9-b647-492dde8a1a69", "metadata": {}, "source": [ "### Assembling These Parts into a Function\n", "Now that we have methods of extracting the relevant information, we can assemble this into a function which returns a dictionary. You'll notice the addition of exception handling: if a file cannot be parsed, the function returns the file path and the associated traceback instead, so the unparsable file ends up in a separate `pandas.DataFrame` of invalid assets." 
] }, { "cell_type": "code", "execution_count": 15, "id": "7a92c3d3-5a6b-4cc2-9fa4-6cde2aabc332", "metadata": {}, "outputs": [], "source": [ "def parse_amwg_obs(file):\n",
       "    \"\"\"Parse an atmospheric observational data file into a dictionary of catalog fields\"\"\"\n",
       "    file = pathlib.Path(file)\n",
       "    info = {}\n",
       "\n",
       "    try:\n",
       "        # Split the filename stem, e.g. 'AIRS_01_climo' -> ['AIRS', '01', 'climo']\n",
       "        stem = file.stem\n",
       "        split = stem.split('_')\n",
       "        source = split[0]\n",
       "        temporal = split[-2]\n",
       "        # A two-character field is treated as a month number and converted to e.g. 'JAN'\n",
       "        if len(temporal) == 2:\n",
       "            month_number = int(temporal)\n",
       "            time_period = 'monthly'\n",
       "            temporal = datetime(2020, month_number, 1).strftime('%b').upper()\n",
       "\n",
       "        elif temporal == 'ANN':\n",
       "            time_period = 'annual'\n",
       "        else:\n",
       "            time_period = 'seasonal'\n",
       "\n",
       "        # Open the file lazily and keep the variables that have a long_name attribute\n",
       "        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:\n",
       "            variable_list = [var for var in ds if 'long_name' in ds[var].attrs]\n",
       "\n",
       "            info = {\n",
       "                'source': source,\n",
       "                'temporal': temporal,\n",
       "                'time_period': time_period,\n",
       "                'variable': variable_list,\n",
       "                'path': str(file),\n",
       "            }\n",
       "\n",
       "        return info\n",
       "\n",
       "    except Exception:\n",
       "        # Report the file and its traceback instead of raising\n",
       "        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}" ] }, { "cell_type": "markdown", "id": "74c3f396-a1c1-4750-a4c8-94e71853d81e", "metadata": { "tags": [] }, "source": [ "### Test this Parser on Some Files\n", "We can try this parser on a single file to make sure that it returns a dictionary" ] }, { "cell_type": "code", "execution_count": 20, "id": "6ff7d2e9-2a27-4692-8000-01b757992ffb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'source': 'AIRS',\n", " 'temporal': 'JAN',\n", " 'time_period': 'monthly',\n", " 'variable': ['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O'],\n", " 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_amwg_obs(files[0])" ] }, { "cell_type": "markdown", "id": "9081ddcb-a706-4766-ae58-dd9fa450364a", "metadata": {}, "source": [ "Now that we made sure that it 
works, we can use it with `ecgtools`!\n", "\n", "First, we set up the `Builder` object" ] }, { "cell_type": "code", "execution_count": 17, "id": "14665e0c-d8ee-432b-9638-fd4543384892", "metadata": {}, "outputs": [], "source": [ "b = Builder(paths=['/glade/p/cesm/amwg/amwg_diagnostics/obs_data'])" ] }, { "cell_type": "markdown", "id": "e42702eb-bf01-47fe-b1f7-9621e635b106", "metadata": {}, "source": [ "Next, we build the catalog using our newly created parser!" ] }, { "cell_type": "code", "execution_count": 18, "id": "e43f83de-bcac-49f9-a19e-08abb5e1542a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.9s remaining: 0.0s\n", "[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.9s finished\n", "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 82 tasks | elapsed: 3.4s\n", "[Parallel(n_jobs=-1)]: Done 216 tasks | elapsed: 3.8s\n", "[Parallel(n_jobs=-1)]: Done 760 tasks | elapsed: 4.1s\n", "[Parallel(n_jobs=-1)]: Done 2333 tasks | elapsed: 5.1s\n", "[Parallel(n_jobs=-1)]: Done 2882 tasks | elapsed: 5.5s\n", "[Parallel(n_jobs=-1)]: Done 3096 out of 3096 | elapsed: 5.8s finished\n", "/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 510 assets/files. 
A list of these assets can be found in `.invalid_assets` attribute.\n", " parsing_func, parsing_func_kwargs\n" ] }, { "data": { "text/plain": [ "Builder(root_path=PosixPath('/glade/p/cesm/amwg/amwg_diagnostics/obs_data'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.build(parsing_func=parse_amwg_obs)" ] }, { "cell_type": "markdown", "id": "1580e4f2-eef2-4221-83cb-880e6248f73a", "metadata": {}, "source": [ "Let's take a look at our resultant catalog..." ] }, { "cell_type": "code", "execution_count": 19, "id": "f680c5ec-5e32-4eb9-913c-71f3265abd39", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
       "<thead><tr><th></th><th>source</th><th>temporal</th><th>time_period</th><th>variable</th><th>path</th></tr></thead>\n",
       "<tbody>\n",
       "<tr><th>0</th><td>ABLE-2A</td><td>c2h6</td><td>seasonal</td><td>[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>1</th><td>ABLE-2A</td><td>c2h6</td><td>seasonal</td><td>[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>2</th><td>ABLE-2A</td><td>c3h8</td><td>seasonal</td><td>[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>3</th><td>ABLE-2A</td><td>c3h8</td><td>seasonal</td><td>[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>6</th><td>ABLE-2A</td><td>noday</td><td>seasonal</td><td>[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
       "<tr><th>3091</th><td>ozonesondes</td><td>polar1995</td><td>seasonal</td><td>[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>3092</th><td>ozonesondes</td><td>tropics11995</td><td>seasonal</td><td>[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>3093</th><td>ozonesondes</td><td>tropics21995</td><td>seasonal</td><td>[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>3094</th><td>ozonesondes</td><td>tropics31995</td><td>seasonal</td><td>[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "<tr><th>3095</th><td>ozonesondes</td><td>europe1995</td><td>seasonal</td><td>[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...</td><td>/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...</td></tr>\n",
       "</tbody>\n",
       "</table>\n",
       "<p>2586 rows × 5 columns</p>\n",
       "
" ], "text/plain": [ " source temporal time_period \\\n", "0 ABLE-2A c2h6 seasonal \n", "1 ABLE-2A c2h6 seasonal \n", "2 ABLE-2A c3h8 seasonal \n", "3 ABLE-2A c3h8 seasonal \n", "6 ABLE-2A noday seasonal \n", "... ... ... ... \n", "3091 ozonesondes polar1995 seasonal \n", "3092 ozonesondes tropics11995 seasonal \n", "3093 ozonesondes tropics21995 seasonal \n", "3094 ozonesondes tropics31995 seasonal \n", "3095 ozonesondes europe1995 seasonal \n", "\n", " variable \\\n", "0 [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... \n", "1 [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... \n", "2 [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... \n", "3 [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... \n", "6 [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... \n", "... ... \n", "3091 [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... \n", "3092 [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... \n", "3093 [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... \n", "3094 [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... \n", "3095 [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... \n", "\n", " path \n", "0 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "1 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "2 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "3 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "6 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "... ... \n", "3091 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "3092 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "3093 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "3094 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... \n", "3095 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... 
\n", "\n", "[2586 rows x 5 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.df" ] }, { "cell_type": "code", "execution_count": null, "id": "3da8bfe4-2074-4aa2-b595-feb08f4663f8", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 5 }