Use a custom parser¶
While many of the parsers included within this library may be useful, we do not have parsers for every dataset out there. If you are interested in adding your own parser (and hopefully contributing that parser to the main repo 😊), check out this walkthrough of how to build one!
What is a Parser?¶
Basically, a parser collects information from two main sources:
The file string
The dataset itself
This means there are two main steps:
Parsing out the file string, separating based on some symbol
Opening the file, and extracting variables and their attributes, or even global attributes
The result from a “parser” is a dictionary of fields to add to the catalog, stored in a pandas.DataFrame
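As a rough illustration of those two steps and their result, here is a minimal sketch that uses only the standard library. The function name `toy_parser`, its field names, and the file names are hypothetical. It covers only the first step (splitting the file string), since the second step needs a real file to open.

```python
import pathlib


def toy_parser(file):
    """Hypothetical parser: split the file name into catalog fields."""
    path = pathlib.Path(file)
    # Step 1: parse the file string, separating on '_'
    source, period, kind = path.stem.split('_')
    # Step 2 (opening the file for variables and attributes) is omitted here
    return {'source': source, 'period': period, 'kind': kind, 'path': str(path)}


# One dictionary per file; each dictionary becomes one catalog row
records = [toy_parser(name) for name in ['OBS_01_climo.nc', 'OBS_ANN_climo.nc']]
for record in records:
    print(record)
```

Each dictionary holds the fields for one file, which is the shape a catalog table needs: one row per file, one column per key.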
It would probably be more helpful to walk through a concrete example of this…
Example of Building a Parser¶
Let’s say we have a list of files that we want to parse. In this example, we are using a set of observational data on NCAR HPC resources. A full blog post detailing this dataset and comparison is included here
Imports¶
import glob
import pathlib
import traceback
from datetime import datetime
import xarray as xr
from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK
files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))
files[::20]
['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PD_02_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PI_05_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPCOSP_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPFD_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCP_12_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/JRA25_SON_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/LEGATES_04_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRAW_19x2_09_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRA_12_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MISRCOSP_JJA_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MODIS_ANN_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/NVAP_03_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/PRECL_07_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_09_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_SEAICE_DJF_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/TRMM_MAM_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WARREN_DJF_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WILLMOTT_04_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/XIEARKIN_09_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsg_10_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlso_ANN_climo.nc',
'/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsw_MAM_climo.nc']
Observational datasets in this directory follow the convention source_(month/season/annual)_climo.nc.
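A quick way to see which file names follow this convention is a regular expression. This is only a sketch: the pattern below assumes the temporal token is either a two-digit month or a three-letter uppercase code (such as DJF or ANN), which matches the names listed above but is not a documented rule of the dataset. `CLIMO_PATTERN` and `follows_convention` are names invented for this illustration.

```python
import re

# Assumed pattern: <source>_<two-digit month | three-letter code>_climo.nc
CLIMO_PATTERN = re.compile(r'^(?P<source>.+)_(?P<temporal>\d{2}|[A-Z]{3})_climo\.nc$')


def follows_convention(filename):
    """Return True when the file name matches the assumed climo pattern."""
    return CLIMO_PATTERN.match(filename) is not None


names = [
    'AIRS_01_climo.nc',
    'GPCP_JJA_climo.nc',
    'MODIS_ANN_climo.nc',
    'ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
]
for name in names:
    print(name, follows_convention(name))
```

Names that fail the check, like the ARM .cdf file, are the ones a parser has to either handle specially or report as unparsable.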
Let’s open up one of those datasets
ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')
ds
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 1)
Coordinates:
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) float32 1.0
Data variables:
    SOLIN    (time, lat, lon) float32 495.0 495.0 495.0 495.0 ... 0.0 0.0 0.0
    FLUT     (time, lat, lon) float32 187.4 187.4 187.4 ... 170.7 170.7 170.7
    FLUTC    (time, lat, lon) float32 188.8 188.8 188.8 ... 178.9 178.9 178.9
    FSNTOA   (time, lat, lon) float32 147.8 147.8 147.8 ... -0.049 -0.049 -0.049
    FSNTOAC  (time, lat, lon) float32 150.9 150.9 150.9 ... -0.006 -0.006 -0.006
    SWCF     (time, lat, lon) float32 -3.149 -3.149 -3.149 ... -0.043 -0.043
    LWCF     (time, lat, lon) float32 1.391 1.391 1.391 ... 8.272 8.272 8.272
    RESTOA   (time, lat, lon) float32 -39.6 -39.6 -39.6 ... -170.7 -170.7 -170.7
    ALBEDO   (time, lat, lon) float32 0.7015 0.7015 0.7015 ... nan nan nan
    ALBEDOC  (time, lat, lon) float32 0.6951 0.6951 0.6951 ... nan nan nan
    gw       (lat) float64 0.0001523 0.0004569 0.0007613 ... 0.0004569 0.0001523
Attributes:
    version:             This is version 2.8: March 7, 2014
    institution:         NASA Langley Research Center
    comment:             Data is from East to West and South to North. Climat...
    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...
    AMWG_author:         Cecile Hannay
    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package
    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...
    NCO:                 20140724
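We see that this dataset is gridded on a global 1° grid (180 × 360 cells, centered on half-degree values such as 0.5, 1.5, …), with several variables describing top-of-atmosphere radiative fluxes (e.g. FSNTOA, the TOA net shortwave flux).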
Parsing the Filepath¶
As mentioned before, the first step is parsing out information from the filepath. Here, we use pathlib, which is helpful when working with filepaths generically.
path = pathlib.Path(files[0])
path.stem
'AIRS_01_climo'
The stem can be split using .split('_'), which separates it into the following:
Observational dataset source
Month Number, Season, or Annual
“climo”
path.stem.split('_')
['AIRS', '01', 'climo']
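The middle piece needs a little interpretation before it is useful in a catalog. The sketch below shows one way to classify it, following the same rule as the parser assembled later in this walkthrough (two characters means a month number, ANN means annual, anything else is treated as a season). The helper name `classify_temporal` and the hard-coded `MONTHS` tuple are assumptions made for this example. The tuple is used instead of `strftime('%b')` so that the result does not depend on the system locale.

```python
MONTHS = ('JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC')


def classify_temporal(token):
    """Return (label, time_period) for the temporal piece of a file name."""
    if len(token) == 2:
        # Two characters: interpret as a month number, e.g. '01' -> 'JAN'
        month_number = int(token)
        if not 1 <= month_number <= 12:
            raise ValueError(f'not a month number: {token!r}')
        return MONTHS[month_number - 1], 'monthly'
    if token == 'ANN':
        return token, 'annual'
    # Anything else (e.g. 'DJF', 'JJA') is treated as a season
    return token, 'seasonal'


for stem in ['AIRS_01_climo', 'MODIS_ANN_climo', 'GPCP_JJA_climo']:
    source, temporal, _ = stem.split('_')
    print(source, classify_temporal(temporal))
```

The explicit range check is a design choice for the sketch: without it, a token such as '00' would silently index the last month.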
Open the File for More Information¶
We can also gather useful insight by opening the file!
ds = xr.open_dataset(files[0])
ds
<xarray.Dataset>
Dimensions:  (lat: 94, lev: 13, lon: 192, month: 1, time: 1)
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
  * month    (month) int32 0
Data variables:
    gw       (lat) float64 0.0008309 0.001933 0.003035 ... 0.001933 0.0008309
    T        (time, lev, lat, lon) float32 ...
    RELHUM   (time, lev, lat, lon) float32 ...
    O3       (time, lev, lat, lon) float32 ...
    SHUM     (time, lev, lat, lon) float32 ...
    PREH2O   (month, lat, lon) float32 nan nan nan nan nan ... nan nan 1.961 nan
Attributes:
    creation_date:             Thu Mar 13 09:28:11 MDT 2008
    interpolation:             bilinear
    outliers:                  \nAll [RELHUM>100] and [T>323] were set to _Fi...
    html:                      \nhttp://www.cgd.ucar.edu/cms/andrew/papers/ge...
    reference:                 \nA. Gettelman, W.D. Collins, E.J. Fetzer, A. ...
    source:                    Andrew Gettleman
    file:                      airsmm48_all_4d_rt_v5_c3.nc
    title:                     AIRS: 9/2002 - 8/2006
    history:                   Tue Mar 18 14:35:30 2008: ncrename -O -v gwt,g...
    nco_openmp_thread_number:  1
Let’s look at the variable “Temperature” (T)
ds.T
<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>
[234624 values with dtype=float32]
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
    units:        K
    long_name:    Temperature
    climatology:  AIRS monthly climatology 9/2002-8/2006
In this case, we want to include the list of variables available from this single file, such that each entry in our catalog represents a single file. We can search for variables in this dataset using the following:
variable_list = [var for var in ds if 'long_name' in ds[var].attrs]
variable_list
['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O']
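To see what this filter keeps and drops without needing the file itself, the same comprehension can be run on a plain dictionary standing in for the dataset. The `fake_attrs` mapping is invented for this sketch: its `T` entry reuses the units and long_name shown for temperature above, while `time_bnds` is a hypothetical variable with no long_name attribute.

```python
# Stand-in for ds[var].attrs: variable name -> attribute dictionary
fake_attrs = {
    'T': {'units': 'K', 'long_name': 'Temperature'},
    'time_bnds': {},  # hypothetical variable without a long_name
}

# Same filter as above, applied to the stand-in
variable_list = [var for var in fake_attrs if 'long_name' in fake_attrs[var]]
print(variable_list)
```

Variables without a long_name are left out of the catalog entry, so this filter may need adjusting for datasets whose variables are described only by other attributes.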
Assembling These Parts into a Function¶
Now that we have methods of extracting the relevant information, we can assemble them into a function which returns a dictionary. Note the exception handling: if a file cannot be parsed, the function returns the file path together with the associated traceback, so the failure is recorded in a pandas.DataFrame of unparsable files instead of stopping the build.
def parse_amwg_obs(file):
    """Parse one atmospheric observational data file into a dictionary of catalog fields."""
    file = pathlib.Path(file)
    info = {}

    try:
        # Filename convention: source_(month/season/annual)_climo
        stem = file.stem
        split = stem.split('_')
        source = split[0]
        temporal = split[-2]

        if len(temporal) == 2:
            # Two-character token: a month number such as '01'
            month_number = int(temporal)
            time_period = 'monthly'
            temporal = datetime(2020, month_number, 1).strftime('%b').upper()
        elif temporal == 'ANN':
            time_period = 'annual'
        else:
            time_period = 'seasonal'

        # Open the file and keep the variables that carry a long_name attribute
        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            variable_list = [var for var in ds if 'long_name' in ds[var].attrs]

            info = {
                'source': source,
                'temporal': temporal,
                'time_period': time_period,
                'variable': variable_list,
                'path': str(file),
            }

        return info

    except Exception:
        # Record the unparsable file and its traceback instead of raising
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
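The effect of the except branch is easier to see in isolation. The following sketch uses a hypothetical stand-in parser, `parse_name`, that only splits the file name, and local string constants in place of `INVALID_ASSET` and `TRACEBACK` (their values here are assumptions, not taken from ecgtools). It shows how results can be separated afterwards into parsed entries and failures. It illustrates the pattern only, not how Builder does this internally.

```python
import pathlib
import traceback

# Local stand-ins for the ecgtools constants (values assumed for this sketch)
INVALID_ASSET = 'INVALID_ASSET'
TRACEBACK = 'TRACEBACK'


def parse_name(file):
    """Hypothetical parser: file name only, with the same try/except shape."""
    file = pathlib.Path(file)
    try:
        split = file.stem.split('_')
        return {'source': split[0], 'temporal': split[-2], 'path': str(file)}
    except Exception:
        return {INVALID_ASSET: str(file), TRACEBACK: traceback.format_exc()}


results = [parse_name(name) for name in ['AIRS_01_climo.nc', 'README.txt']]

# Entries carrying the INVALID_ASSET key are failures; the rest are catalog rows
valid = [r for r in results if INVALID_ASSET not in r]
invalid = [r for r in results if INVALID_ASSET in r]

print(len(valid), 'parsed;', len(invalid), 'unparsable')
# Last line of the stored traceback names the exception that was caught
print(invalid[0][TRACEBACK].splitlines()[-1])
```

Because the failure is returned as data rather than raised, one badly named file does not interrupt the parsing of the others, and the stored traceback shows why it failed.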
Test this Parser on Some Files¶
We can try this parser on a single file, to make sure that it returns a dictionary
parse_amwg_obs(files[0])
{'source': 'AIRS',
'temporal': 'JAN',
'time_period': 'monthly',
'variable': ['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O'],
'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}
Now that we have made sure that it works, we can use it with ecgtools!
First, we set up the Builder object
b = Builder(paths=['/glade/p/cesm/amwg/amwg_diagnostics/obs_data'])
Next, we build the catalog using our newly created parser!
b.build(parsing_func=parse_amwg_obs)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.9s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done 82 tasks | elapsed: 3.4s
[Parallel(n_jobs=-1)]: Done 216 tasks | elapsed: 3.8s
[Parallel(n_jobs=-1)]: Done 760 tasks | elapsed: 4.1s
[Parallel(n_jobs=-1)]: Done 2333 tasks | elapsed: 5.1s
[Parallel(n_jobs=-1)]: Done 2882 tasks | elapsed: 5.5s
[Parallel(n_jobs=-1)]: Done 3096 out of 3096 | elapsed: 5.8s finished
/glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py:180: UserWarning: Unable to parse 510 assets/files. A list of these assets can be found in `.invalid_assets` attribute.
parsing_func, parsing_func_kwargs
Builder(root_path=PosixPath('/glade/p/cesm/amwg/amwg_diagnostics/obs_data'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)
Let’s take a look at our resultant catalog…
b.df
| | source | temporal | time_period | variable | path |
---|---|---|---|---|---|
0 | ABLE-2A | c2h6 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
1 | ABLE-2A | c2h6 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
2 | ABLE-2A | c3h8 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
3 | ABLE-2A | c3h8 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
6 | ABLE-2A | noday | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
... | ... | ... | ... | ... | ... |
3091 | ozonesondes | polar1995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
3092 | ozonesondes | tropics11995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
3093 | ozonesondes | tropics21995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
3094 | ozonesondes | tropics31995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
3095 | ozonesondes | europe1995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
2586 rows × 5 columns