This page was generated from a notebook available in the repository for this website.

Objectives

In this notebook we will demonstrate the use of h5py for writing complex data structures. This tutorial follows the h5py documentation closely.

Requirements

We will build a python environment in the usual way. Our goal is to install the h5py package.

In [1]:

# start with Anaconda or a virtualenv
# on MARCC: ml anaconda
# then build the environment with: conda env update --file reqs.yaml -p ./path/to/new/env
# then use the environment with: conda activate ./path/to/new/env
# then start a notebook with: jupyter notebook
# the reqs.yaml file contents are:
_ = """
name: demo_h5py
dependencies:
  - python==3.8.3
  - conda-forge::ipdb
  - h5py
  - notebook
  - pip
  - pip:
    - scipy
    - numpy
    - pyyaml
    - ruamel
    - mpi4py
"""

In [2]:

# alternative virtual environment instructions
_ = """
# get a python if you are on a cluster
$ ml python/3.8.6
# install a virtual environment
$ python -m venv ./env
# activate the environment
$ source ./env/bin/activate
# install packages as needed
(env) $ pip install h5py
"""

Use cases

Case 1: Writing and reading arrays

The documentation says that HDF5 files can hold two kinds of objects: groups, which work like dictionaries, and datasets, which work like NumPy arrays.

In [3]:

import numpy as np
import h5py

In [4]:

eg_dims = (5,4,3)
data = np.random.rand(*eg_dims)

In [5]:

# check the docs on the create_dataset function before we start
h5py.File.create_dataset?
# we see that shape is required if the data is not provided

In [6]:

# make a new file (we always use a context manager)
# the context manager guarantees the file is closed without an explicit fp.close()
with h5py.File('my_data.hdf5','w') as fp:
    # create a dataset. the shape and dtype are inferred from the data
    fp.create_dataset('result_1',data=data)

In [7]:

# check that the file exists using a shell escape (!) in Jupyter
! ls
demo-h5py-v2.ipynb my_data.hdf5       test
demo-h5py.ipynb    richer_data.hdf5

In [8]:

# read the file, opening it explicitly in read mode
with h5py.File('my_data.hdf5','r') as fp:
    # print the object (it tells us that it is an HDF5 file)
    print(fp)
    # the object acts like a dict
    print(list(fp.keys()))
    # we can check the dimensions of the result object
    result = fp['result_1']
    print(result.shape)
    # we can validate the dimensions
    if not fp['result_1'].shape==eg_dims: 
        raise ValueError('wrong dimensions!')
    # we can slice the arrays per usual
    subset = result[:2,...,:2]
    print(subset.shape)
<HDF5 file "my_data.hdf5" (mode r)>
['result_1']
(5, 4, 3)
(2, 4, 2)
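
Before moving on, note that create_dataset also accepts optional chunking and compression arguments. The following is a minimal sketch of the same write as above; the file name my_data_compressed.hdf5 is only illustrative.

# the same write as above, but chunked and gzip-compressed (illustrative file name)
with h5py.File('my_data_compressed.hdf5','w') as fp:
    fp.create_dataset('result_1',data=data,
        chunks=True,compression='gzip',compression_opts=4)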

Case 2: Using the hierarchy

In short, a hierarchical data file contains a POSIX-like filesystem: groups act as directories and datasets act as files.

In [9]:

# create a new file
with h5py.File('richer_data.hdf5','w') as fp:
    # create a group
    grp = fp.create_group('timeseries')
    for i in range(10):
        # generate some fake timeseries data
        ts = np.random.rand(1000,2)
        grp.create_dataset(str(i),data=ts)

In [10]:

# read the data
with h5py.File('richer_data.hdf5','r') as fp:
    # access a dataset by its POSIX-like path
    print(fp['timeseries/2'])
    # you can print the data by 
    # casting it: print(np.array(fp['timeseries/2']))
<HDF5 dataset "2": shape (1000, 2), type "<f8">
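
You can also walk the hierarchy programmatically. The sketch below uses the visititems method to list every group and dataset path in the file we just wrote.

# walk the hierarchy and report every group and dataset
with h5py.File('richer_data.hdf5','r') as fp:
    def report(name,obj):
        # name is the path inside the file; obj is a Group or Dataset
        kind = 'group' if isinstance(obj,h5py.Group) else 'dataset'
        print('%s: /%s'%(kind,name))
    fp.visititems(report)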

Question: how can we add columns to arrays?

One good way to include this kind of metadata is to attach it directly to the array itself, using a structured NumPy dtype.

In [11]:

# first we found a nice example from the docs
np.array?

In [12]:

# in this example we set the string and 
# float types, along with names, for our columns
data_with_cols = np.array(
    [('ryan',2.5),('jane',4.0)],
    dtype=[('student','<S4'),('grades','<f4')])

In [13]:

# the strings are converted to bytes
print(data_with_cols)
# we can review the column names here
print(data_with_cols.dtype.names)
# and this result can be stored directly with h5py
# in the next section we add unstructured data
# (possibly metadata) to the h5py file
[(b'ryan', 2.5) (b'jane', 4. )]
('student', 'grades')

In [14]:

# we can now add this to our file
# the 'a' mode opens an existing file for appending
# we delete the dataset first if it already exists, so it can be rewritten
with h5py.File('richer_data.hdf5','a') as fp:
    if 'student_data' in fp: del fp['student_data']
    fp.create_dataset('student_data',data=data_with_cols)

In [15]:

# check the file
with h5py.File('richer_data.hdf5','r') as fp:
    print(list(fp.keys()))
['student_data', 'timeseries']
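
When we read the compound dataset back, we can select columns and rows just as we did with the in-memory structured array. A minimal sketch:

# read the compound dataset back into a structured NumPy array
with h5py.File('richer_data.hdf5','r') as fp:
    dset = fp['student_data']
    print(dset.dtype.names)   # ('student', 'grades')
    arr = dset[()]            # read the whole dataset into memory
    print(arr['grades'])      # select one column
    print(arr[0])             # select one row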

In [16]:

# here is an alternative way to build the structured array
# start with separate columns, each with its own type
students = np.array(['ryan','jane','nikhil']).astype('<S16')
grades = np.array([2.5,4.0,3.6])
students,grades
(array([b'ryan', b'jane', b'nikhil'], dtype='|S16'), array([2.5, 4. , 3.6]))

In [17]:

# if you transpose this without a structured dtype, everything is coerced to one string type
# note that h5py will not tolerate NumPy's unicode ('<U') dtype, hence the byte strings above
np.array(np.transpose((students,grades)))
array([[b'ryan', b'2.5'],
       [b'jane', b'4.0'],
       [b'nikhil', b'3.6']], dtype='|S32')

In [18]:

n_rows = students.shape[0]
dt = np.dtype([('student','<S32'),('grade','<f4')])
student_data = np.empty(n_rows,dtype=dt)
student_data['student'] = students
student_data['grade'] = grades

In [19]:

# our data is now structured by dtype
student_data
array([(b'ryan', 2.5), (b'jane', 4. ), (b'nikhil', 3.6)],
      dtype=[('student', 'S32'), ('grade', '<f4')])

In [20]:

# we can select one column or row
print(student_data['grade'])
print(student_data[1])
[2.5 4.  3.6]
(b'jane', 4.)

Users interested in more data science-oriented structures are encouraged to check out pandas.
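
For example, the structured array above converts directly to a DataFrame. This is a minimal sketch assuming pandas has been installed separately (it is not in the environment files above).

# a minimal sketch, assuming pandas is installed (pip install pandas)
import pandas as pd
df = pd.DataFrame(student_data)
# decode the byte strings left over from the '<S32' dtype
df['student'] = df['student'].str.decode('utf-8')
print(df)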

Case 3: Using metadata (attributes)

In [21]:

# simulate some metadata
meta = {
    'model':{
        'k0':1.2,'k1':1.3,'tau':0.01,'mass':123.},
    'integrator':{
        'dt':0.001,'method':'rk4',},
    'data':{
        'source_path':'path/to/data',
        'host':'bluecrab',},}

In [22]:

# alternatively, round-trip the metadata through YAML (not in the standard library; pip install pyyaml)
import yaml
text = yaml.dump(meta)
meta = yaml.load(text,Loader=yaml.SafeLoader)
print(text)
data:
  host: bluecrab
  source_path: path/to/data
integrator:
  dt: 0.001
  method: rk4
model:
  k0: 1.2
  k1: 1.3
  mass: 123.0
  tau: 0.01

In [23]:

# serialize the data
import json
meta_s = json.dumps(meta)
print(meta_s)
{"data": {"host": "bluecrab", "source_path": "path/to/data"}, "integrator": {"dt": 0.001, "method": "rk4"}, "model": {"k0": 1.2, "k1": 1.3, "mass": 123.0, "tau": 0.01}}

In [24]:

# create a new file, this time with metadata
with h5py.File('richer_data.hdf5','w') as fp:
    # add the metadata
    fp.create_dataset('meta',data=np.string_(meta_s))
    # create a group
    grp = fp.create_group('timeseries')
    for i in range(10):
        # generate some fake timeseries data
        ts = np.random.rand(1000,2)
        grp.create_dataset(str(i),data=ts)

In [25]:

# extract the data
with h5py.File('richer_data.hdf5','r') as fp:
    #! first we get an h5py object
    #! meta_s = fp['meta']
    #! next we realize our docs are old
    #! meta_s = fp['meta'].value
    #! next we get a bytes object
    #! meta_s = fp['meta'][()]
    meta_s = fp['meta'][()].decode()
    #! print(meta_s)
    # unpack the data
    meta = json.loads(meta_s)
    print(yaml.dump(meta))
    # get an item in the timeseries
    result = fp['timeseries/8']
    # you can view the data by
    # casting it as an array: print(np.array(result))
data:
  host: bluecrab
  source_path: path/to/data
integrator:
  dt: 0.001
  method: rk4
model:
  k0: 1.2
  k1: 1.3
  mass: 123.0
  tau: 0.01

In [26]:

with h5py.File('richer_data.hdf5','r') as fp:
    this = fp['meta'][()].decode()
    meta = json.loads(this)

In [27]:

import pprint
pprint.pprint(meta,width=10)
{'data': {'host': 'bluecrab',
          'source_path': 'path/to/data'},
 'integrator': {'dt': 0.001,
                'method': 'rk4'},
 'model': {'k0': 1.2,
           'k1': 1.3,
           'mass': 123.0,
           'tau': 0.01}}
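
Although we stored the metadata above as a serialized JSON string in a dataset, h5py also provides a native attributes interface (.attrs) on files, groups, and datasets, which suits small scalar or string metadata. A minimal sketch:

# attach small metadata items directly as HDF5 attributes
with h5py.File('richer_data.hdf5','a') as fp:
    fp['timeseries'].attrs['dt'] = 0.001
    fp['timeseries'].attrs['method'] = 'rk4'
# read the attributes back
with h5py.File('richer_data.hdf5','r') as fp:
    print(dict(fp['timeseries'].attrs))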

Case 4: Hashing your data

Or, how do I organize the data? The following is a poor-man’s database.

In [28]:

# if you change the metadata, you change the hash
meta['data']['host'] = 'rockfish'

In [29]:

import hashlib,json
# serialize the data with some special flags
# read about stability: https://stackoverflow.com/a/22003440
meta_s = json.dumps(meta,
    ensure_ascii=True,sort_keys=True,default=str)
hashcode = hashlib.sha1(meta_s.encode()).hexdigest()[:10]
# we can save files whose names are effectively unique to the metadata
print('data_%s.h5py'%hashcode)
data_8c9c7dcc57.h5py
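
Putting the pieces together, the sketch below writes a data file whose name carries the metadata hash and stores the serialized metadata alongside a result, so each file describes itself. The dataset names and the random result are only placeholders.

# write a self-describing file named by the metadata hash (placeholder contents)
fname = 'data_%s.h5py'%hashcode
with h5py.File(fname,'w') as fp:
    fp.create_dataset('meta',data=np.string_(meta_s))
    fp.create_dataset('result',data=np.random.rand(100,2))
print('wrote %s'%fname)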

To conclude: you can treat each file as a metaphorical row in a database, so that the filesystem acts as a large parallel database. This saves the effort of the traditional wrappers, handlers, validation, and so on that a database requires; you only have to perform some modest file management. In the future we can discuss more fully featured database solutions.