read_bufr
- read_bufr(path, columns=[], filters={}, required_columns=True, flat=False)
Extract data from BUFR as a pandas.DataFrame with the specified columns, applying the filters, in either hierarchical or flat mode.
- Parameters
path (str, bytes, os.PathLike or a message_list_object) – path to the BUFR file or a message_list_object
columns (str, sequence[str]) – a list of ecCodes BUFR keys to extract for each BUFR message/subset. When flat is True, columns must be one of the following string values:
“all”, empty str or empty list (default): all the columns are extracted
“header”: only the columns from the header section are extracted
“data”: only the columns from the data section are extracted
filters (dict) – defines the conditions under which the specified columns are extracted. The individual conditions are combined with the logical AND operator to form the filter. See Filters for details.
required_columns (bool, iterable[str]) – the list of ecCodes BUFR keys that are required to be present in the BUFR message/subset. Bool values are interpreted as follows:
if flat is False:
True means all the keys in columns are required
False means no columns are required
if flat is True, either bool value means no columns are required
flat (bool) – enables flat extraction mode. When it is True, each message/subset is treated as a flat list; when it is False (default), data is extracted as if the message had a tree-like hierarchy. See details below. New in version 0.10.0
- Return type
pandas.DataFrame
To use read_bufr() correctly for a given BUFR file, you first need to understand the structure of the messages and the keys/values available for data extraction and filter definition. The BUFR structure can be explored with the ecCodes command line tools bufr_ls and bufr_dump. You can also use CodesUI or Metview, which provide graphical user interfaces to inspect BUFR/GRIB data.
There are some notebook examples available demonstrating how to use read_bufr() for various observation/forecast BUFR data types.
BUFR keys
ecCodes keys from both the BUFR header and data sections are supported in columns, filters and required_columns. However, there are some limitations:
keys containing the rank e.g. “#1#latitude” cannot be used
key attributes e.g. “latitude->code” cannot be used
The “count” generated key, which refers to the message index, is also supported but please note that message indexing starts at 1 and not at 0!
There is also a set of computed keys that can be used with read_bufr():
“data_datetime” (datetime.datetime): generated from the “year”, “month”, “day”, “hour”, “minute”, “second” keys in the BUFR data section.
“typical_datetime” (datetime.datetime): generated from the “typicalYear”, “typicalMonth”, “typicalDay”, “typicalHour”, “typicalMinute”, “typicalSecond” keys in the BUFR header section.
“WMO_station_id”: generated from the “blockNumber” and “stationNumber” keys as:
blockNumber*1000+stationNumber
“geometry”: values extracted as a list of:
[longitude,latitude,heightOfStationGroundAboveMeanSeaLevel]
as required for geopandas.
“CRS”: generated from the “coordinateReferenceSystem” key using the following mapping:
coordinateReferenceSystem    CRS
0                            EPSG:4326
1                            EPSG:4258
2                            EPSG:4269
3                            EPSG:4314
4 or 5                       not supported
missing                      EPSG:4326
Note
The computed keys do not preserve their position in columns but are placed at the end of the resulting DataFrame.
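The arithmetic behind the computed keys can be illustrated with a few lines of plain Python. This is only a sketch of the documented formulas and mapping, not pdbufr's own code: the function names, the use of None for a missing value, and returning None for the unsupported codes are all assumptions made for this example.

```python
from typing import Optional

# Mapping taken from the table above; codes 4 and 5 are documented as not supported.
CRS_BY_CODE = {0: "EPSG:4326", 1: "EPSG:4258", 2: "EPSG:4269", 3: "EPSG:4314"}


def wmo_station_id(block_number: int, station_number: int) -> int:
    """Combine the two keys as blockNumber*1000 + stationNumber."""
    return block_number * 1000 + station_number


def crs_from_code(code: Optional[int]) -> Optional[str]:
    """Return the EPSG string for a coordinateReferenceSystem value.

    A missing value (None in this sketch) falls back to EPSG:4326.
    Returning None for unsupported codes is a choice made for this sketch only.
    """
    if code is None:
        return "EPSG:4326"
    return CRS_BY_CODE.get(code)


def geometry(longitude: float, latitude: float, height: float) -> list:
    """Order the values as [longitude, latitude, height]."""
    return [longitude, latitude, height]


# Block 12, station 925 gives the identifier 12925.
station = wmo_station_id(12, 925)
```

Note that longitude comes first in the “geometry” list, which is easy to get wrong when the source keys are usually quoted as latitude, longitude.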
Filters
The filter conditions are specified as a dict via filters and determine when the specified columns will actually be extracted.
Single value
A filter condition can be a single value match:
filters={"blockNumber": 12}
List of values
A list of values specifies an “in” relation:
filters={"stationNumber": [843, 925]}
filters={"blockNumber": range(10, 13)}
Slices
Intervals can be expressed as a slice (the boundaries are inclusive):
# closed interval (>=273.16 and <=293.16)
filters={"airTemperature": slice(273.16, 293.16)}
# open-ended interval (<=273.16)
filters={"airTemperature": slice(None, 273.16)}
# open-ended interval (>=273.16)
filters={"airTemperature": slice(273.16, None)}
Callables
We can even use a callable condition. This example uses a lambda expression to filter values in a certain range:
filters={"airTemperature": lambda x: x > 250 and x <= 300}
The same task can also be achieved by using a function:
import pdbufr

def filter_temp(t):
    # keep temperatures in the range (250, 300]
    return t > 250 and t <= 300

df = pdbufr.read_bufr("temp.bufr",
    columns=("latitude", "longitude", "airTemperature"),
    filters={"airTemperature": filter_temp},
)
Combining conditions
When multiple conditions are specified they are connected with a logical AND:
filters={"blockNumber": 12,
         "stationNumber": [843, 925],
         "airTemperature": slice(273.16, 293.16)}
A geographical filter can be defined like this:
# locations in the 40W,10S - 30E,20N area
filters={"latitude": slice(-10, 20),
         "longitude": slice(-40, 30)}
while the following expression can be used as a temporal filter:
filters={"data_datetime": slice(datetime.datetime(2009,1,23,13,0),
                                datetime.datetime(2009,1,23,13,1))}
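To make the semantics of the four condition types concrete, the sketch below evaluates them against a single record in plain Python. It only mirrors the behaviour described in this section (equality, membership, inclusive slices, callables, logical AND) and is not how pdbufr implements filtering; the helper names matches and matches_all, and the rule that a missing key never matches, are assumptions made for this example.

```python
import datetime


def matches(condition, value):
    """Return True if value satisfies one filter condition (illustrative only)."""
    if callable(condition):
        return bool(condition(value))
    if isinstance(condition, slice):
        # Both boundaries are inclusive; None means unbounded on that side.
        lower_ok = condition.start is None or value >= condition.start
        upper_ok = condition.stop is None or value <= condition.stop
        return lower_ok and upper_ok
    if isinstance(condition, (list, tuple, set, range)):
        return value in condition
    return value == condition


def matches_all(filters, record):
    """Combine the conditions with a logical AND; a missing key never matches."""
    return all(
        key in record and matches(condition, record[key])
        for key, condition in filters.items()
    )


record = {"blockNumber": 12, "stationNumber": 925, "airTemperature": 280.0}
filters = {
    "blockNumber": 12,
    "stationNumber": [843, 925],
    "airTemperature": slice(273.16, 293.16),
}
selected = matches_all(filters, record)

# A temporal window works the same way because datetimes are comparable.
window = slice(
    datetime.datetime(2009, 1, 23, 13, 0),
    datetime.datetime(2009, 1, 23, 13, 1),
)
in_window = matches(window, datetime.datetime(2009, 1, 23, 13, 0, 30))
```

Because range(10, 13) excludes its upper bound, a range filter behaves differently from a slice with the same numbers: the slice includes 13, the range does not.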
Hierarchical mode
When flat is False, the contents of a BUFR message/subset are interpreted as a hierarchical structure. This is based on a certain group of BUFR keys (related to instrumentation, location etc.), which according to the WMO BUFR manual introduce a new hierarchy level in the message/subset. During data extraction read_bufr() traverses this hierarchy, and when all the columns are collected and all the filters match, a new record is added to the output. In this way several records can be extracted from the same message/subset.
Example
In this example we extract values from a classic radiosonde observation BUFR file. Here each message contains a single location (“latitude”, “longitude”) with several pressure levels of temperature, dewpoint etc. The message hierarchy is shown in the following snapshot:
To extract the temperature profile for the first two stations we can use this code:
df = pdbufr.read_bufr("temp.bufr",
    columns=("latitude", "longitude", "pressure", "airTemperature"),
    filters={"count": [1, 2]},
)
which results in the following DataFrame:
    latitude  longitude  pressure  airTemperature
0      58.47     -78.08  100300.0           258.3
1      58.47     -78.08  100000.0           259.7
2      58.47     -78.08   99800.0           261.1
...
46     53.75     -73.67   25000.0           221.1
47     53.75     -73.67   23200.0           223.1
48     53.75     -73.67   20500.0           221.5
[48 rows x 4 columns]
Flat mode
New in version 0.10.0
When flat is True, messages/subsets are extracted as a whole, preserving the column order (see the note below for exceptions), and each extracted message/subset will be a separate record in the resulting DataFrame.
With filters we can control which messages/subsets should be selected. By default, all the columns in a message/subset are extracted (see the exceptions below), but this can be changed by setting columns to “header” or “data” to get only the header or data section keys. Other column selection modes are not available.
In the resulting DataFrame the original ecCodes keys containing the rank are used as column names, e.g. “#1#latitude” instead of “latitude”. The following set of keys are omitted:
from the header: “unexpandedDescriptors”
from the data section: data description operator qualifiers (e.g. “delayedDescriptorReplicationFactor”) and “operator”
key attributes e.g. “latitude->code”
The rank appearing in the keys of a message containing uncompressed subsets is not reset by ecCodes when a new subset starts. To make the columns as aligned as possible in the output, read_bufr() resets the rank and ensures that e.g. the first “latitude” key is always called “#1#latitude” in each uncompressed subset.
filters control which messages/subsets are extracted from the BUFR file. They are interpreted differently than in hierarchical mode:
they can only contain keys without a rank
for non-computed keys the filter condition matches if there is a match for the same key with any given rank in the message/subset. E.g. if
filters = {"pressure": 50000}
and there is e.g. a value “#12#pressure” = 50000 in the message/subset then the filter matches.
for computed keys the filter condition matches if there is a match for the involved keys at their first occurrence (i.e. rank=1) in the message/subset. E.g.:
filters = {"WMO_station_id": 12925}
matches if “#1#blockNumber” = 12 and “#1#stationNumber” = 925 in the message/subset (remember WMO_station_id=blockNumber*1000+stationNumber)
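The two matching rules above can be sketched in plain Python against a flat message represented as a dict of ranked keys. This is an illustration of the documented behaviour only, restricted to single-value conditions; the helper names and the regular expression are assumptions for this example and are not taken from pdbufr.

```python
import re

# Ranked ecCodes keys look like "#12#pressure".
RANKED_KEY = re.compile(r"^#(\d+)#(.+)$")


def split_rank(column):
    """Split '#12#pressure' into (12, 'pressure'); unranked keys get rank None."""
    found = RANKED_KEY.match(column)
    if found:
        return int(found.group(1)), found.group(2)
    return None, column


def any_rank_matches(message, key, wanted):
    """True if any ranked occurrence of key in the flat message equals wanted."""
    return any(
        split_rank(column)[1] == key and value == wanted
        for column, value in message.items()
    )


def first_rank_station_id(message):
    """Compute WMO_station_id from the rank-1 occurrences only."""
    return message["#1#blockNumber"] * 1000 + message["#1#stationNumber"]


message = {
    "#1#blockNumber": 12,
    "#1#stationNumber": 925,
    "#11#pressure": 70000,
    "#12#pressure": 50000,
}

pressure_hit = any_rank_matches(message, "pressure", 50000)
station_hit = first_rank_station_id(message) == 12925
```

The asymmetry is the point of the sketch: a non-computed key is searched across every rank, while a computed key is evaluated once from the rank-1 values.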
Warning
Messages/subsets in a BUFR file can have a different set of BUFR keys. When a new message/subset is processed, read_bufr() adds it to the resulting DataFrame as a new record, and columns that are not yet present in the output are automatically appended by pandas to the end, changing the original order of keys for that message. When this happens, pdbufr prints a warning message to stdout (see the example below or the Flat dump notebook for details).
Example
We use the same radiosonde BUFR file as for the hierarchical mode example above. To extract all the data values for the first two stations we can use this code:
df = pdbufr.read_bufr("temp.bufr",
    columns="data",
    flat=True,
    filters={"count": [1, 2]},
)
which results in the following DataFrame:
   subsetNumber  #1#blockNumber  #1#stationNumber  ...  #25#airTemperature  #25#dewpointTemperature  #25#windDirection  #25#windSpeed
0             1              71               907  ...                 NaN                      NaN                NaN            NaN
1             1              71               823  ...               221.5                    191.5                NaN            NaN
[2 rows x 197 columns]
and generates the following warning:
Warning: not all BUFR messages/subsets have the same structure in the input file. Non-overlapping columns (starting with column[189] = #1#generatingApplication) were added to end of the resulting dataframe altering the original column order for these messages.
This warning can be disabled by using the warnings module. The code below produces the same DataFrame as the one above but does not print the warning message:
import warnings
warnings.filterwarnings("ignore", module="pdbufr")

df = pdbufr.read_bufr("temp.bufr",
    columns="data",
    flat=True,
    filters={"count": [1, 2]},
)
Note
See the Flat dump notebook for more details.
Message list object
read_bufr() can take a message list object as an input. This is particularly useful if we already have the BUFR data in another object/storage structure and want to use it directly with pdbufr.
A message list object is a sequence of messages, where each message must be a mutable mapping of str BUFR keys to values. Ideally, the message object should implement a context manager (__enter__ and __exit__) and also the is_coord method, which determines if a key is a BUFR coordinate descriptor. If any of these methods are not available, read_bufr() will automatically create a wrapper object to provide default implementations. For details see MessageWrapper in pdbufr/bufr_structure.py.
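As a rough illustration of the shape described above, the sketch below builds a message list from in-memory dicts. Everything in it is hypothetical: the class name DictMessage, the COORD_KEYS set, and the assumed is_coord(key) signature returning a bool are not taken from pdbufr, and the sketch does not show whether these few keys are sufficient for an actual extraction.

```python
from collections.abc import MutableMapping

# Illustrative set of keys treated as coordinates in this sketch only.
COORD_KEYS = {"latitude", "longitude", "pressure"}


class DictMessage(dict):
    """A dict-based message providing the optional methods described above."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to release for in-memory data; do not suppress exceptions.
        return None

    def is_coord(self, key):
        # Assumed signature: takes the key name and returns a bool.
        return key in COORD_KEYS


# A message list object: a sequence of mutable str -> value mappings.
messages = [
    DictMessage({"latitude": 58.47, "longitude": -78.08, "airTemperature": 258.3}),
    DictMessage({"latitude": 53.75, "longitude": -73.67, "airTemperature": 221.1}),
]
```

A list like this would be passed as the first argument in place of a file path, for example pdbufr.read_bufr(messages, columns=("latitude", "longitude", "airTemperature")). According to the text above, plain dict objects without these methods should also be accepted, since read_bufr() wraps them automatically.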