Use a custom parser

While many of the parsers included in this library may be useful, we do not have parsers for every dataset out there. If you are interested in adding your own parser (and hopefully contributing that parser to the main repo 😊 ), check out this walkthrough of how to build one!

What is a Parser?

Basically, a parser collects information from two main sources:

  • The file string

  • The dataset itself

This means there are two main steps:

  • Parsing the file string by splitting it on a delimiter (such as an underscore)

  • Opening the file, and extracting variables and their attributes, or even global attributes

Each call to a “parser” returns a dictionary of fields for one file. These dictionaries are added to the catalog, which is stored in a pandas.DataFrame.
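The two steps above can be sketched as a small function. This is only an illustration of the shape of a parser: the name parse_example, the delimiter argument, and the name_parts field are hypothetical, and the step that opens the file is left out so the sketch runs without any data on disk.

```python
import pathlib


def parse_example(path, delimiter="_"):
    """Hypothetical parser: return one dictionary of catalog fields for one file."""
    path = pathlib.Path(path)

    # Step 1: split the file name (without its extension) on a delimiter.
    parts = path.stem.split(delimiter)

    # Step 2 would open the file and read variables or attributes.
    # It is omitted here so that the sketch needs no file on disk.
    return {"name_parts": parts, "path": str(path)}


record = parse_example("some_dir/SOURCE_01_climo.nc")
print(record["name_parts"])
```

The important part is the return value: one flat dictionary per file, whose keys become the columns of the catalog.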

It would probably be more helpful to walk through a concrete example of this…

Example of Building a Parser

Let’s say we have a list of files which we want to parse! In this example, we are using a set of observational data on NCAR HPC resources. A full blog post detailing this dataset and comparison is included here

Imports

import glob
import pathlib
import traceback
from datetime import datetime

import xarray as xr

from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK
files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))
files[::20]
['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PD_02_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PI_05_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPFD_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCP_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/JRA25_SON_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/LEGATES_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRAW_19x2_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRA_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MISRCOSP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MODIS_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/NVAP_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/PRECL_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_SEAICE_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/TRMM_MAM_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WARREN_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WILLMOTT_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/XIEARKIN_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsg_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlso_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsw_MAM_climo.nc']

Observational datasets in this directory follow the convention source_(month/season/annual)_climo.nc.
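Not every file in the listing follows this convention (the .cdf file, for example), so it can help to check names before parsing them. Below is a rough sketch using a regular expression. The pattern and the set of season labels (ANN, DJF, MAM, JJA, SON) are assumptions drawn from the file names shown above, not a rule defined by the dataset.

```python
import re

# Assumed pattern: <source>_<two-digit month or season label>_climo.nc
CLIMO_PATTERN = re.compile(
    r"^(?P<source>.+)_(?P<temporal>\d{2}|ANN|DJF|MAM|JJA|SON)_climo\.nc$"
)

names = [
    "AIRS_01_climo.nc",
    "HadISST_CL_03_climo.nc",
    "SSMI_SEAICE_DJF_climo.nc",
    "ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf",
]

# Map each name to its match object (None when the name does not fit).
matches = {name: CLIMO_PATTERN.match(name) for name in names}
follows_convention = [name for name, m in matches.items() if m]
print(follows_convention)
```

Note that the source part may itself contain underscores, as in HadISST_CL and SSMI_SEAICE.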

Let’s open up one of those datasets

ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')
ds
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 1)
Coordinates:
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) float32 1.0
Data variables:
    SOLIN    (time, lat, lon) float32 495.0 495.0 495.0 495.0 ... 0.0 0.0 0.0
    FLUT     (time, lat, lon) float32 187.4 187.4 187.4 ... 170.7 170.7 170.7
    FLUTC    (time, lat, lon) float32 188.8 188.8 188.8 ... 178.9 178.9 178.9
    FSNTOA   (time, lat, lon) float32 147.8 147.8 147.8 ... -0.049 -0.049 -0.049
    FSNTOAC  (time, lat, lon) float32 150.9 150.9 150.9 ... -0.006 -0.006 -0.006
    SWCF     (time, lat, lon) float32 -3.149 -3.149 -3.149 ... -0.043 -0.043
    LWCF     (time, lat, lon) float32 1.391 1.391 1.391 ... 8.272 8.272 8.272
    RESTOA   (time, lat, lon) float32 -39.6 -39.6 -39.6 ... -170.7 -170.7 -170.7
    ALBEDO   (time, lat, lon) float32 0.7015 0.7015 0.7015 ... nan nan nan
    ALBEDOC  (time, lat, lon) float32 0.6951 0.6951 0.6951 ... nan nan nan
    gw       (lat) float64 0.0001523 0.0004569 0.0007613 ... 0.0004569 0.0001523
Attributes:
    version:             This is version 2.8: March 7, 2014
    institution:         NASA Langley Research Center
    comment:             Data is from East to West and South to North. Climat...
    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...
    AMWG_author:         Cecile Hannay
    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package
    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...
    NCO:                 20140724

We see that this dataset is gridded on a global 1° grid (180 × 360 points, with cell centers at 0.5°, 1.5°, and so on), with several variables related to radiative fluxes (e.g., TOA net shortwave)

Parsing the Filepath

As mentioned before, the first step is parsing out information from the filepath. Here, we use pathlib, which is helpful when working with filepaths generically

path = pathlib.Path(files[0])
path.stem
'AIRS_01_climo'

This stem can be split using .split('_'), which separates it into the following:

  • Observational dataset source

  • Month Number, Season, or Annual

  • “climo”

path.stem.split('_')
['AIRS', '01', 'climo']
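Some sources in the listing contain underscores themselves (for example HadISST_CL_03_climo.nc), in which case a plain split yields more than three parts. One possible alternative, shown here only as a sketch and not what the parser in this walkthrough does, is to split from the right at most twice with str.rsplit:

```python
import pathlib

stem = pathlib.Path("HadISST_CL_03_climo.nc").stem

# Splitting on every underscore breaks the source name apart.
all_parts = stem.split("_")
print(all_parts)  # ['HadISST', 'CL', '03', 'climo']

# Splitting from the right, at most twice, keeps the source together.
source, temporal, suffix = stem.rsplit("_", 2)
print(source, temporal, suffix)  # HadISST_CL 03 climo
```

Which behavior is preferable depends on whether the catalog should treat HadISST_CL and HadISST_PD as separate sources.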

Open the File for More Information

We can also gather useful insight by opening the file!

ds = xr.open_dataset(files[0])
ds
<xarray.Dataset>
Dimensions:  (lat: 94, lev: 13, lon: 192, month: 1, time: 1)
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
  * month    (month) int32 0
Data variables:
    gw       (lat) float64 0.0008309 0.001933 0.003035 ... 0.001933 0.0008309
    T        (time, lev, lat, lon) float32 ...
    RELHUM   (time, lev, lat, lon) float32 ...
    O3       (time, lev, lat, lon) float32 ...
    SHUM     (time, lev, lat, lon) float32 ...
    PREH2O   (month, lat, lon) float32 nan nan nan nan nan ... nan nan 1.961 nan
Attributes:
    creation_date:             Thu Mar 13 09:28:11 MDT 2008
    interpolation:             bilinear
    outliers:                  \nAll [RELHUM>100] and [T>323] were set to _Fi...
    html:                      \nhttp://www.cgd.ucar.edu/cms/andrew/papers/ge...
    reference:                 \nA. Gettelman, W.D. Collins, E.J. Fetzer, A. ...
    source:                    Andrew Gettleman
    file:                      airsmm48_all_4d_rt_v5_c3.nc
    title:                     AIRS: 9/2002 - 8/2006
    history:                   Tue Mar 18 14:35:30 2008: ncrename -O -v gwt,g...
    nco_openmp_thread_number:  1

Let’s look at the variable “Temperature” (T)

ds.T
<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>
[234624 values with dtype=float32]
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
    units:        K
    long_name:    Temperature
    climatology:  AIRS monthly climatology 9/2002-8/2006

In this case, we want to include the list of variables available from this single file, such that each entry in our catalog represents a single file. We can search for variables in this dataset using the following:

variable_list = [var for var in ds if 'long_name' in ds[var].attrs]
variable_list
['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O']

Assembling These Parts into a Function

Now that we have methods of extracting the relevant information, we can assemble this into a function which returns a dictionary. You’ll notice the addition of exception handling: if a file cannot be parsed, the function returns that file and the associated traceback instead, and these are added to a pandas.DataFrame of unparsable files.

def parse_amwg_obs(file):
    """Parse a file of atmospheric observational data into catalog fields."""
    file = pathlib.Path(file)
    info = {}

    try:
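        stem = file.stem
        # File names look like <source>_<temporal>_climo, so split on underscores
        split = stem.split('_')
        source = split[0]
        # The temporal token sits just before the trailing 'climo'
        temporal = split[-2]
        if len(temporal) == 2:
            # Two-character tokens are treated as month numbers, e.g. '01' -> 'JAN'
            month_number = int(temporal)
            time_period = 'monthly'
            temporal = datetime(2020, month_number, 1).strftime('%b').upper()

        elif temporal == 'ANN':
            time_period = 'annual'
        else:
            # Any other token (e.g. 'DJF') is labeled seasonal
            time_period = 'seasonal'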

        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            variable_list = [var for var in ds if 'long_name' in ds[var].attrs]

            info = {
                'source': source,
                'temporal': temporal,
                'time_period': time_period,
                'variable': variable_list,
                'path': str(file),
            }

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
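In the function above, any token that is neither two characters long nor 'ANN' is labeled seasonal, so file names that do not follow the convention can still end up in the catalog with a misleading time_period. A stricter variant is sketched below. The helper name classify_temporal and the list of season labels are assumptions for illustration. Raising an error for unknown tokens inside the try block would route such files to the invalid assets instead.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
SEASONS = ("DJF", "MAM", "JJA", "SON")  # assumed season labels


def classify_temporal(token):
    """Return (label, time_period), or raise ValueError for an unknown token."""
    # Two-digit month numbers between 01 and 12 map to a month abbreviation.
    if len(token) == 2 and token.isdecimal() and 1 <= int(token) <= 12:
        return MONTHS[int(token) - 1], "monthly"
    if token == "ANN":
        return token, "annual"
    if token in SEASONS:
        return token, "seasonal"
    raise ValueError(f"unrecognized temporal token: {token!r}")


print(classify_temporal("01"))
print(classify_temporal("JJA"))
```

Using a fixed tuple of month abbreviations also avoids depending on the system locale, which strftime('%b') does.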

Test this Parser on Some Files

We can try this parser on a single file, to make sure that it returns a dictionary

parse_amwg_obs(files[0])
{'source': 'AIRS',
 'temporal': 'JAN',
 'time_period': 'monthly',
 'variable': ['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}

Now that we have made sure that it works, we can use it in ecgtools!

First, we set up the Builder object

b = Builder(paths=['/glade/p/cesm/amwg/amwg_diagnostics/obs_data'])

Next, we build the catalog using our newly created parser!

b.build(parsing_func=parse_amwg_obs)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 216 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 760 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 2333 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 2882 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 3096 out of 3096 | elapsed:    5.8s finished
/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 510 assets/files. A list of these assets can be found in `.invalid_assets` attribute.
  parsing_func, parsing_func_kwargs
Builder(root_path=PosixPath('/glade/p/cesm/amwg/amwg_diagnostics/obs_data'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)
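The warning above reports 510 unparsable assets out of 3096 files, which leaves 3096 − 510 = 2586 catalog rows. The sketch below illustrates the idea of separating the two kinds of results returned by a parser. It is not the ecgtools implementation: the key names INVALID_KEY and TRACEBACK_KEY are stand-ins for the constants imported from ecgtools.builder, and the records are made up.

```python
# Stand-in key names; the real constants are imported from ecgtools.builder.
INVALID_KEY = "INVALID_ASSET"
TRACEBACK_KEY = "TRACEBACK"

# Made-up parser results: two parsed files and one failure.
results = [
    {"source": "AIRS", "temporal": "JAN", "path": "AIRS_01_climo.nc"},
    {INVALID_KEY: "broken_file.nc", TRACEBACK_KEY: "Traceback (most recent call last): ..."},
    {"source": "CERES", "temporal": "JUL", "path": "CERES_07_climo.nc"},
]

# Records carrying the invalid-asset key are set aside; the rest form the catalog.
valid = [r for r in results if INVALID_KEY not in r]
invalid = [r for r in results if INVALID_KEY in r]
print(len(valid), len(invalid))
```

As the warning notes, the unparsable files from the real build can be inspected through the `.invalid_assets` attribute of the Builder.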

Let’s take a look at our resultant catalog…

b.df
      source       temporal      time_period  variable                                           path
0     ABLE-2A      c2h6          seasonal     [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
1     ABLE-2A      c2h6          seasonal     [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
2     ABLE-2A      c3h8          seasonal     [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3     ABLE-2A      c3h8          seasonal     [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
6     ABLE-2A      noday         seasonal     [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
...   ...          ...           ...          ...                                                ...
3091  ozonesondes  polar1995     seasonal     [levels, o3_mean, o3_med, o3_num, o3_std, o3_w...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3092  ozonesondes  tropics11995  seasonal     [levels, o3_mean, o3_med, o3_num, o3_std, o3_w...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3093  ozonesondes  tropics21995  seasonal     [levels, o3_mean, o3_med, o3_num, o3_std, o3_w...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3094  ozonesondes  tropics31995  seasonal     [levels, o3_mean, o3_med, o3_num, o3_std, o3_w...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3095  ozonesondes  europe1995    seasonal     [levels, o3_mean, o3_med, o3_num, o3_std, o3_w...  /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...

2586 rows × 5 columns