read_bufr

read_bufr(path, columns=[], filters={}, required_columns=True, flat=False)

Extract data from a BUFR file as a pandas.DataFrame containing the specified columns, applying the filters in either hierarchical or flat mode.

Parameters
  • path (str, bytes, os.PathLike or a message_list_object) – path to the BUFR file or a message_list_object

  • columns (str, sequence[str]) –

    a list of ecCodes BUFR keys to extract for each BUFR message/subset. When flat is True columns must be one of the following string values:

    • “all”, empty str or empty list (default): all the columns are extracted

    • “header”: only the columns from the header section are extracted

    • “data”: only the columns from the data section are extracted

  • filters (dict) – defines the conditions when to extract the specified columns. The individual conditions are combined together with the logical AND operator to form the filter. See Filters for details.

  • required_columns (bool, iterable[str]) –

    the list of ecCodes BUFR keys that are required to be present in the BUFR message/subset. Bool values are interpreted as follows:

    • if flat is False:

      • True means all the keys in columns are required

      • False means no columns are required

    • if flat is True either bool value means no columns are required

  • flat (bool) – enables flat extraction mode. When it is True each message/subset is treated as a flat list, while when it is False (default), data is extracted as if the message had a tree-like hierarchy. See details below. New in version 0.10.0

Return type

pandas.DataFrame

To use read_bufr() correctly for a given BUFR file, you first need to understand the structure of the messages and the keys/values you can use for data extraction and filter definition. The BUFR structure can be explored with the ecCodes command-line tools bufr_ls and bufr_dump. You can also use CodesUI or Metview, which provide graphical user interfaces to inspect BUFR/GRIB data.

There are some notebook examples available demonstrating how to use read_bufr() for various observation/forecast BUFR data types.

BUFR keys

ecCodes keys from both the BUFR header and data sections are supported in columns, filters and required_columns. However, there are some limitations:

  • keys containing the rank e.g. “#1#latitude” cannot be used

  • key attributes e.g. “latitude->code” cannot be used

The “count” generated key, which refers to the message index, is also supported but please note that message indexing starts at 1 and not at 0!

There is also a set of computed keys that can be used for read_bufr():

  • “data_datetime” (datetime.datetime): generated from the “year”, “month”, “day”, “hour”, “minute”, “second” keys in the BUFR data section.

  • “typical_datetime” (datetime.datetime): generated from the “typicalYear”, “typicalMonth”, “typicalDay”, “typicalHour”, “typicalMinute”, “typicalSecond” keys in the BUFR header section.

  • “WMO_station_id”: generated from the “blockNumber” and “stationNumber” keys as:

    blockNumber*1000+stationNumber
    
  • “geometry”: values extracted as a list of:

    [longitude,latitude,heightOfStationGroundAboveMeanSeaLevel]
    

    as required for geopandas.

  • “CRS”: generated from the “coordinateReferenceSystem” key using the following mapping:

    coordinateReferenceSystem   CRS
    0                           EPSG:4326
    1                           EPSG:4258
    2                           EPSG:4269
    3                           EPSG:4314
    4 or 5                      not supported
    missing                     EPSG:4326
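The arithmetic behind the computed keys listed above can be illustrated with a short sketch. It is not pdbufr's implementation: the function names are hypothetical, and raising ValueError for the unsupported values 4 and 5 is an assumption made only for this example.

```python
# Illustrative only: mirrors the formulas in the list above, not pdbufr's source.

# Mapping from the table above; the values 4 and 5 are not supported.
CRS_BY_CODE = {0: "EPSG:4326", 1: "EPSG:4258", 2: "EPSG:4269", 3: "EPSG:4314"}


def wmo_station_id(block_number, station_number):
    # blockNumber*1000+stationNumber
    return block_number * 1000 + station_number


def geometry(longitude, latitude, height):
    # [longitude, latitude, heightOfStationGroundAboveMeanSeaLevel]
    return [longitude, latitude, height]


def crs(code=None):
    # A missing coordinateReferenceSystem key falls back to EPSG:4326.
    if code is None:
        return "EPSG:4326"
    if code not in CRS_BY_CODE:
        # Assumption of this sketch: unsupported values raise an error.
        raise ValueError(f"coordinateReferenceSystem={code} is not supported")
    return CRS_BY_CODE[code]


print(wmo_station_id(12, 925))  # 12925
print(crs())                    # EPSG:4326
```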

Note

The computed keys do not preserve their position in columns but are placed at the end of the resulting DataFrame.

Filters

The filter conditions are specified as a dict via filters and determine when the specified columns will actually be extracted.

Single value

A filter condition can be a single value match:

filters={"blockNumber": 12}

List of values

A list of values specifies an “in” relation:

filters={"stationNumber": [843, 925]}
filters={"blockNumber": range(10, 13)}

Slices

Intervals can be expressed as a slice (the boundaries are inclusive):

# closed interval (>=273.16 and <=293.16)
filters={"airTemperature": slice(273.16, 293.16)}

# open-ended interval, no lower bound (<=273.16)
filters={"airTemperature": slice(None, 273.16)}

# open-ended interval, no upper bound (>=273.16)
filters={"airTemperature": slice(273.16, None)}

Callables

We can even use a callable condition. This example uses a lambda expression to filter values in a certain range:

filters={"airTemperature": lambda x: x > 250 and x <= 300}

The same task can also be achieved by using a function:

import pdbufr

def filter_temp(t):
    # keep values in the (250, 300] range
    return t > 250 and t <= 300

df = pdbufr.read_bufr("temp.bufr",
    columns=("latitude", "longitude", "airTemperature"),
    filters={"airTemperature": filter_temp},
)

Combining conditions

When multiple conditions are specified they are connected with a logical AND:

filters={"blockNumber": 12,
         "stationNumber": [843, 925],
         "airTemperature": slice(273.16, 293.16)}

A geographical filter can be defined like this:

# locations in the 40W,10S - 30E,20N area
filters={"latitude": slice(-10, 20),
         "longitude": slice(-40, 30)}

while the following expression can be used as a temporal filter:

import datetime

filters={"data_datetime":
         slice(datetime.datetime(2009, 1, 23, 13, 0),
               datetime.datetime(2009, 1, 23, 13, 1))}
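To make the semantics of the condition types in this section concrete, here is a minimal pure-Python sketch of how a single value, a list, a slice and a callable could each be evaluated and then combined with a logical AND. The helper names condition_matches and filters_match are hypothetical, and the code is an illustration of the documented behaviour rather than pdbufr's implementation.

```python
def condition_matches(condition, value):
    """Return True if value satisfies one filter condition."""
    if callable(condition):
        # Callable: the function decides.
        return bool(condition(value))
    if isinstance(condition, slice):
        # Slice: an interval with inclusive boundaries; None means unbounded.
        lower_ok = condition.start is None or value >= condition.start
        upper_ok = condition.stop is None or value <= condition.stop
        return lower_ok and upper_ok
    if isinstance(condition, (list, tuple, set, range)):
        # List of values: an "in" relation.
        return value in condition
    # Single value: equality.
    return value == condition


def filters_match(filters, values):
    # All conditions must hold (logical AND); a key missing from values fails.
    return all(
        key in values and condition_matches(cond, values[key])
        for key, cond in filters.items()
    )


filters = {
    "blockNumber": 12,
    "stationNumber": [843, 925],
    "airTemperature": slice(273.16, 293.16),
}
inside = {"blockNumber": 12, "stationNumber": 925, "airTemperature": 293.16}
outside = {"blockNumber": 12, "stationNumber": 925, "airTemperature": 300.0}
print(filters_match(filters, inside))   # True (the upper boundary is inclusive)
print(filters_match(filters, outside))  # False
```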

Hierarchical mode

When flat is False the contents of a BUFR message/subset are interpreted as a hierarchical structure. This is based on a certain group of BUFR keys (related to instrumentation, location etc.), which according to the WMO BUFR manual introduce a new hierarchy level in the message/subset. During data extraction read_bufr() traverses this hierarchy, and when all the columns have been collected and all the filters match, a new record is added to the output. In this way several records can be extracted from the same message/subset.
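The idea that one message can yield several records can be illustrated with a deliberately simplified sketch. It is not the algorithm pdbufr uses: the message is modelled as a plain list of (key, value) pairs, the set of level-opening keys is passed in by hand, filters are left out, and extract_records is a hypothetical name.

```python
def extract_records(message, columns, coord_keys):
    current = {}   # values collected on the current branch
    order = []     # keys of `current` in the order they were collected
    records = []
    for key, value in message:
        if key in coord_keys and key in current:
            # The coordinate key repeats: a new branch starts at its level,
            # so drop it together with everything collected after it.
            start = order.index(key)
            for stale in order[start:]:
                del current[stale]
            del order[start:]
        if key not in current:
            order.append(key)
        current[key] = value
        # Emit a record as soon as a requested key completes the set.
        if key in columns and all(c in current for c in columns):
            records.append({c: current[c] for c in columns})
    return records


# Simplified stand-in for one radiosonde message: one location, two levels.
message = [
    ("latitude", 58.47),
    ("longitude", -78.08),
    ("pressure", 100300.0),
    ("airTemperature", 258.3),
    ("pressure", 100000.0),
    ("airTemperature", 259.7),
]
columns = ("latitude", "longitude", "pressure", "airTemperature")
records = extract_records(message, columns, {"latitude", "longitude", "pressure"})
for record in records:
    print(record)
```

Both records share the same latitude and longitude because those values sit higher in the hierarchy than the repeated pressure level.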

Example

In this example we extract values from a classic radiosonde observation BUFR file. Here each message contains a single location (“latitude”, “longitude”) with several pressure levels of temperature, dewpoint etc. The message hierarchy is shown in the following snapshot:

_images/temp_structure.png

To extract the temperature profile for the first two stations we can use this code:

df = pdbufr.read_bufr("temp.bufr",
    columns=("latitude", "longitude", "pressure", "airTemperature"),
    filters={"count": [1, 2]},
)

which results in the following DataFrame:

    latitude  longitude  pressure  airTemperature
0      58.47     -78.08  100300.0           258.3
1      58.47     -78.08  100000.0           259.7
2      58.47     -78.08   99800.0           261.1
...
46     53.75     -73.67   25000.0           221.1
47     53.75     -73.67   23200.0           223.1
48     53.75     -73.67   20500.0           221.5

[48 rows x 4 columns]

Flat mode

New in version 0.10.0

When flat is True messages/subsets are extracted as a whole preserving the column order (see the note below for exceptions) and each extracted message/subset will be a separate record in the resulting DataFrame.

With filters we can control which messages/subsets should be selected. By default, all the columns in a message/subset are extracted (see the exceptions below), but this can be changed by setting columns to “header” or “data” to get only the header or data section keys. Other column selection modes are not available.

In the resulting DataFrame the original ecCodes keys containing the rank are used as column names, e.g. “#1#latitude” instead of “latitude”. The following set of keys are omitted:

  • from the header: “unexpandedDescriptors”

  • from the data section: data description operator qualifiers (e.g. “delayedDescriptorReplicationFactor”) and “operator”

  • key attributes e.g. “latitude->code”

The rank appearing in the keys in a message containing uncompressed subsets is not reset by ecCodes when a new subset starts. To make the columns as aligned as possible in the output, read_bufr() resets the rank and ensures that e.g. the first “latitude” key is always called “#1#latitude” in each uncompressed subset.

filters control what messages/subsets should be extracted from the BUFR file. They are interpreted in a different way than in the hierarchical mode:

  • they can only contain keys without a rank

  • for non-computed keys the filter condition matches if there is a match for the same key with any given rank in the message/subset. E.g. if

    filters = {"pressure": 50000}
    

    and there is e.g. a value “#12#pressure” = 50000 in the message/subset then the filter matches.

  • for computed keys the filter condition matches if there is a match for the involved keys at their first occurrence (i.e. rank=1) in the message/subset. E.g.:

    filters = {"WMO_station_id": 12925}
    

    matches if “#1#blockNumber” = 12 and “#1#stationNumber” = 925 in the message/subset (remember WMO_station_id=blockNumber*1000+stationNumber)
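The two matching rules above can be sketched in plain Python. This is only an illustration under stated assumptions: a flat record is modelled as a dict of ranked keys, the helper names are hypothetical, and the “#11#pressure” value is invented for the example.

```python
import re

RANK_PATTERN = re.compile(r"^#(\d+)#(.+)$")


def split_rank(key):
    """Split '#12#pressure' into (12, 'pressure'); unranked keys give (None, key)."""
    match = RANK_PATTERN.match(key)
    if match:
        return int(match.group(1)), match.group(2)
    return None, key


def matches_any_rank(record, key, expected):
    # Non-computed key: a match at any rank is enough.
    return any(
        split_rank(name)[1] == key and value == expected
        for name, value in record.items()
    )


def matches_station_id(record, expected):
    # Computed key: only the first occurrence (rank 1) of the inputs is used.
    block = record.get("#1#blockNumber")
    station = record.get("#1#stationNumber")
    if block is None or station is None:
        return False
    return block * 1000 + station == expected


record = {
    "#1#blockNumber": 12,
    "#1#stationNumber": 925,
    "#11#pressure": 70000,  # hypothetical value for illustration
    "#12#pressure": 50000,
}
print(matches_any_rank(record, "pressure", 50000))  # True
print(matches_station_id(record, 12925))            # True
```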

Warning

Messages/subsets in a BUFR file can have a different set of BUFR keys. When a new message/subset is processed, read_bufr() adds it to the resulting DataFrame as a new record, and columns that are not yet present in the output are automatically appended by pandas to the end, changing the original order of keys for that message. When this happens read_bufr() prints a warning message (see the example below or the Flat dump notebook for details).

Example

We use the same radiosonde BUFR file as for the hierarchical mode example above. To extract all the data values for the first two stations we can use this code:

df = pdbufr.read_bufr("temp.bufr",
    columns="data",
    flat=True,
    filters={"count": [1, 2]},
)

which results in the following DataFrame:

   subsetNumber  #1#blockNumber  #1#stationNumber  ...  #25#airTemperature  #25#dewpointTemperature  #25#windDirection  #25#windSpeed
0             1              71               907  ...                 NaN                      NaN                NaN            NaN
1             1              71               823  ...               221.5                    191.5                NaN            NaN

[2 rows x 197 columns]

and generates the following warning:

Warning: not all BUFR messages/subsets have the same structure in the input file.
Non-overlapping columns (starting with column[189] = #1#generatingApplication)
were added to end of the resulting dataframe altering the original column order
for these messages.

This warning can be disabled by using the warnings module. The code below produces the same DataFrame as the one above but does not print the warning message:

import warnings
warnings.filterwarnings("ignore", module="pdbufr")

df = pdbufr.read_bufr("temp.bufr",
    columns="data",
    flat=True,
    filters={"count": [1, 2]},
)

Note

See the Flat dump notebook for more details.

Message list object

read_bufr() can take a message list object as an input. It is particularly useful if we already have the BUFR data in another object/storage structure and we want to directly use it with pdbufr.

A message list object is a sequence of messages, where a message must be a mutable mapping of str BUFR keys to values. Ideally, the message object should implement a context manager (__enter__ and __exit__) and also the is_coord method, which determines if a key is a BUFR coordinate descriptor. If any of these methods are not available, read_bufr() will automatically create a wrapper object to provide default implementations. For details see MessageWrapper in pdbufr/bufr_structure.py.
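As a rough sketch of the shape described above, the class below is a mutable mapping that also provides the context manager methods and is_coord. Everything in it is hypothetical: the class name, the is_coord(key) signature, and the fixed set of coordinate keys are assumptions, and a real message would have to expose all the keys that read_bufr() reads, which this example does not attempt.

```python
class DictMessage(dict):
    """Hypothetical message type: a mutable mapping of str BUFR keys to values."""

    # Assumption for this sketch: a fixed set of keys treated as coordinates.
    COORD_KEYS = frozenset({"latitude", "longitude", "pressure"})

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to release for an in-memory dict; do not suppress exceptions.
        return False

    def is_coord(self, key):
        # Assumed signature: takes a key name and returns a bool.
        return key in self.COORD_KEYS


# A message list object is then simply a sequence of such messages.
messages = [
    DictMessage(latitude=58.47, longitude=-78.08, pressure=100300.0, airTemperature=258.3),
    DictMessage(latitude=53.75, longitude=-73.67, pressure=25000.0, airTemperature=221.1),
]

with messages[0] as message:
    print(message["latitude"], message.is_coord("latitude"))
```

A sequence like messages is the kind of object that would be passed in place of path. Whether this minimal content is sufficient for read_bufr() has not been verified here.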