Use a custom parser#

While many of the parsers included in this library may be useful, we do not have parsers for every dataset out there. If you are interested in adding your own parser (and hopefully contributing that parser to the main repo 😊), check out this walkthrough of how to build one!

What is a Parser?#

Basically, a parser collects information from two main sources:

The file string
The dataset itself

This means there are two main steps:

Parsing out the file string, separating based on some symbol
Opening the file, and extracting variables and their attributes, or even global attributes

A parser returns a dictionary of fields for each file. These dictionaries are the entries added to the catalog, which is stored as a pandas.DataFrame.
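As a rough illustration of how dictionaries become catalog rows, the sketch below turns a few parser-style dictionaries into a pandas.DataFrame. The records are made up (hypothetical sources, variables, and paths); only the general shape, one dictionary per file and one row per dictionary, comes from the description above.

```python
import pandas as pd

# Hypothetical parser output: one dictionary per file (all values are invented)
records = [
    {'source': 'OBS_A', 'temporal': 'JAN', 'variable': ['T', 'Q'], 'path': 'OBS_A_01_climo.nc'},
    {'source': 'OBS_A', 'temporal': 'ANN', 'variable': ['T'], 'path': 'OBS_A_ANN_climo.nc'},
]

# Each dictionary becomes a row and each key becomes a column
catalog = pd.DataFrame(records)
print(catalog[['source', 'temporal', 'path']])
```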

It would probably be more helpful to walk through a concrete example of this…

Example of Building a Parser#

Let’s say we have a list of files that we want to parse. In this example, we are using a set of observational data on NCAR HPC resources. A full blog post detailing this dataset and comparison is included here.

Parsing the Filepath#

As mentioned before, the first step is parsing out information from the filepath. Here, we use pathlib, which can be helpful when working with filepaths generically.

import pathlib

# `files` is the list of file paths we want to parse
path = pathlib.Path(files[0])
path.stem

'AIRS_01_climo'

The stem can be split using .split('_'), which separates it into the following parts:

Observational dataset source
Month Number, Season, or Annual
“climo”

path.stem.split('_')

['AIRS', '01', 'climo']
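The split can also be wrapped in a small helper that names each part and rejects file names that do not follow the three-part pattern. This is only a sketch: split_stem and the 'suffix' key are hypothetical names chosen for illustration, not part of ecgtools.

```python
import pathlib


def split_stem(path, sep='_'):
    """Split a '<source>_<temporal>_climo' file name into named parts."""
    parts = pathlib.Path(path).stem.split(sep)
    if len(parts) != 3:
        # Fail loudly instead of guessing which piece is which
        raise ValueError(f'expected 3 parts in {path!r}, found {len(parts)}')
    source, temporal, suffix = parts
    return {'source': source, 'temporal': temporal, 'suffix': suffix}


print(split_stem('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'))
# {'source': 'AIRS', 'temporal': '01', 'suffix': 'climo'}
```

The parser assembled in the next section is more lenient: it takes the first and second-to-last pieces without checking how many there are.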

Assembling These Parts into a Function#

Now that we have methods of extracting the relevant information, we can assemble them into a function that returns a dictionary. You’ll notice the addition of exception handling: if a file cannot be parsed, the function returns the file and the associated traceback instead, so that unparsable files can be collected in a separate pandas.DataFrame.

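import pathlib
import traceback
from datetime import datetime

import xarray as xr

from ecgtools.builder import INVALID_ASSET, TRACEBACK


def parse_amwg_obs(file):
    """Parse one AMWG atmospheric observational data file into a catalog entry."""
    file = pathlib.Path(file)
    info = {}

    try:
        # File names look like <source>_<temporal>_climo, e.g. AIRS_01_climo
        stem = file.stem
        split = stem.split('_')
        source = split[0]
        temporal = split[-2]

        # A two-character token is treated as a month number and converted to e.g. 'JAN'
        if len(temporal) == 2:
            month_number = int(temporal)
            time_period = 'monthly'
            temporal = datetime(2020, month_number, 1).strftime('%b').upper()

        elif temporal == 'ANN':
            time_period = 'annual'
        else:
            time_period = 'seasonal'

        # Open the file lazily and keep the variables that carry a long_name attribute
        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            variable_list = [var for var in ds if 'long_name' in ds[var].attrs]

            info = {
                'source': source,
                'temporal': temporal,
                'time_period': time_period,
                'variable': variable_list,
                'path': str(file),
            }

        return info

    except Exception:
        # Report the file and its traceback instead of stopping the whole build
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}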
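The exception handling can also be factored out so that any parsing function gets the same behaviour. The sketch below is one way to do that, not how ecgtools does it: with_error_capture and parse_month are hypothetical names, and the two string constants are local stand-ins (assumed values) for the INVALID_ASSET and TRACEBACK keys used above.

```python
import functools
import traceback

# Local stand-ins for the keys used in the parser above (assumed values)
INVALID_ASSET = 'INVALID_ASSET'
TRACEBACK = 'TRACEBACK'


def with_error_capture(parser):
    """Wrap a parser so that failures are returned as a dictionary, not raised."""

    @functools.wraps(parser)
    def wrapper(file):
        try:
            return parser(file)
        except Exception:
            return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}

    return wrapper


@with_error_capture
def parse_month(file):
    # Toy parser: raises ValueError when the second token is not a number
    return {'month': int(str(file).split('_')[1])}


print(parse_month('AIRS_01_climo'))          # {'month': 1}
print(parse_month('AIRS_ANN_climo').keys())  # dict_keys(['INVALID_ASSET', 'TRACEBACK'])
```

Keeping the try/except in one place means a new parser only has to describe the successful case.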

Test this Parser on Some Files#

We can try this parser on a single file to make sure that it returns a dictionary.

parse_amwg_obs(files[0])

{'source': 'AIRS',
 'temporal': 'JAN',
 'time_period': 'monthly',
 'variable': ['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}
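Because parse_amwg_obs opens each file, checking it end to end requires the data on disk. One option, sketched here, is to pull the file-name logic into a small helper and test it on its own. classify_temporal is a hypothetical name, not part of ecgtools; its body mirrors the branching in the parser above.

```python
from datetime import datetime


def classify_temporal(token):
    """Return (temporal, time_period) for the temporal token of a file name."""
    if len(token) == 2:
        # Two characters: a month number such as '01'.
        # strftime('%b') follows the active locale; the default gives English names.
        month_number = int(token)
        return datetime(2020, month_number, 1).strftime('%b').upper(), 'monthly'
    if token == 'ANN':
        return token, 'annual'
    return token, 'seasonal'


print(classify_temporal('01'))   # ('JAN', 'monthly')
print(classify_temporal('ANN'))  # ('ANN', 'annual')
print(classify_temporal('DJF'))  # ('DJF', 'seasonal')
```

Any token that is neither two characters long nor 'ANN' is labelled seasonal, so it is worth checking the catalog for values that do not look like seasons.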

Now that we have made sure that it works, we can use it in ecgtools!

First, we set up the Builder object.

from ecgtools import Builder

b = Builder(paths=['/glade/p/cesm/amwg/amwg_diagnostics/obs_data'])

Next, we build the catalog using our newly created parser!

b.build(parsing_func=parse_amwg_obs)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 216 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 760 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 2333 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 2882 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 3096 out of 3096 | elapsed:    5.8s finished
/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 510 assets/files. A list of these assets can be found in `.invalid_assets` attribute.
  parsing_func, parsing_func_kwargs

Builder(root_path=PosixPath('/glade/p/cesm/amwg/amwg_diagnostics/obs_data'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)
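The warning above reports that 510 files could not be parsed. Conceptually, the dictionaries returned by the parsing function are split into valid entries and failures. The sketch below imitates that bookkeeping with plain pandas on made-up results; it is an approximation for illustration, not the code in ecgtools, and the two key names are local stand-ins with assumed values.

```python
import pandas as pd

# Local stand-ins for the keys returned by the parser on failure (assumed values)
INVALID_ASSET = 'INVALID_ASSET'
TRACEBACK = 'TRACEBACK'

# Hypothetical parser results: two successes and one failure
results = [
    {'source': 'OBS_A', 'temporal': 'JAN', 'path': 'OBS_A_01_climo.nc'},
    {INVALID_ASSET: 'notes.nc', TRACEBACK: 'Traceback (most recent call last): ...'},
    {'source': 'OBS_A', 'temporal': 'ANN', 'path': 'OBS_A_ANN_climo.nc'},
]

# Entries carrying the failure key go to one table, everything else to the catalog
valid = pd.DataFrame([r for r in results if INVALID_ASSET not in r])
invalid = pd.DataFrame([r for r in results if INVALID_ASSET in r])

print(len(valid), len(invalid))  # 2 1
```

The same arithmetic holds for the run above: 3096 files were found, 510 could not be parsed, and the catalog has 2586 rows.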

Let’s take a look at our resultant catalog…

b.df

	source	temporal	time_period	variable	path
0	ABLE-2A	c2h6	seasonal	[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
1	ABLE-2A	c2h6	seasonal	[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
2	ABLE-2A	c3h8	seasonal	[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3	ABLE-2A	c3h8	seasonal	[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
6	ABLE-2A	noday	seasonal	[dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
...	...	...	...	...	...
3091	ozonesondes	polar1995	seasonal	[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3092	ozonesondes	tropics11995	seasonal	[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3093	ozonesondes	tropics21995	seasonal	[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3094	ozonesondes	tropics31995	seasonal	[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...
3095	ozonesondes	europe1995	seasonal	[levels, o3_mean, o3_med, o3_num, o3_std, o3_w...	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/c...

2586 rows × 5 columns
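Since the catalog is a pandas.DataFrame, the usual pandas tools apply to it. As a small sketch on invented rows (the values below are hypothetical; only the column names match the catalog above), this selects the monthly entries that contain a given variable:

```python
import pandas as pd

# Hypothetical rows with the same columns as the catalog above
df = pd.DataFrame(
    {
        'source': ['OBS_A', 'OBS_A', 'OBS_B'],
        'temporal': ['JAN', 'ANN', 'JAN'],
        'time_period': ['monthly', 'annual', 'monthly'],
        'variable': [['T', 'Q'], ['T'], ['O3']],
        'path': ['OBS_A_01_climo.nc', 'OBS_A_ANN_climo.nc', 'OBS_B_01_climo.nc'],
    }
)

# The 'variable' column holds lists, so membership is tested row by row
has_t = df['variable'].apply(lambda names: 'T' in names)
subset = df[(df['time_period'] == 'monthly') & has_t]

print(subset['path'].tolist())  # ['OBS_A_01_climo.nc']
```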