Machine learning: a very quick primer
This post introduces the fundamentals of two popular machine learning (ML) tools: PyTorch and TensorFlow. Rather than covering the academic foundations of the field, which would be much more appropriate for a proper lecture, we will work through a few simple examples that demonstrate the basic patterns for solving problems with ML. In particular, we will start with a simple linear regression to connect the ML methods to a very common statistical inference technique. As we formulate these problems, we will also review best practices for deploying ML codes on Blue Crab.
Using ML and GPU accelerators on Blue Crab
As of this writing (December 2019) we have not yet upgraded the CUDA drivers on Blue Crab because a driver and kernel update is pending. For this reason, users must carefully control their software versions. Thankfully, we already know how to do this using custom conda environments.
For the following exercise, we will install PyTorch version 1.3, which has been compiled against CUDA 9.2, using the environment file below. After consulting the TensorFlow compatibility charts, we have also selected version 1.12, the latest version compatible with CUDA 9.2. Since this version of TensorFlow requires Python 3.6, we have also requested that version in our requirements.yaml file.
```yaml
name: ml
dependencies:
  - python<3.7
  - pytorch::torchvision
  - pytorch::pytorch=1.3=*cuda9.2*
  - matplotlib
  - numpy
  - scipy
  - jupyter
  - nb_conda_kernels
  # use tensorflow-gpu if you have a GPU
  - tensorflow==1.12
  - scikit-learn
  - pandas
  - seaborn
  - pip
  - pip:
    - torchviz
```
Note that this environment may be prepared in advance of class.
Linear regression with PyTorch
In this section we will use PyTorch to perform a simple linear regression. This might seem like “fishing with dynamite,” but the connection between a familiar regression and the vast ML infrastructure is fairly direct and worth understanding in detail.
The code in this section tracks this useful tutorial.
Before starting with PyTorch we will perform a simple linear regression using some synthetic data.
In [1]:
In [2]:
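A minimal sketch of the data generation, in the spirit of the referenced tutorial (the true intercept 1, slope 2, noise scale 0.1, and the 80/20 split are assumptions):

```python
import numpy as np

np.random.seed(42)

# synthetic data: y = 1 + 2x + noise (true parameters are assumptions)
N = 100
x = np.random.rand(N, 1)
y = 1 + 2 * x + 0.1 * np.random.randn(N, 1)

# shuffle the indices and hold out 20 points for validation
idx = np.random.permutation(N)
x_train, y_train = x[idx[:80]], y[idx[:80]]
x_val, y_val = x[idx[80:]], y[idx[80:]]
```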
Gradient descent
To perform this regression, we will use the gradient descent optimization algorithm. In short, this method iteratively searches for a local minimum of a function by taking steps in the direction of the negative gradient at the current point. You can think of it like walking downhill along the steepest downward path. The procedure includes three steps:
- estimate the loss
- compute the gradients
- update our parameters
We repeat this procedure until we think we have found a minimum. The number of sample points used in each update determines whether we are performing stochastic ($1$ point), mini-batch ($<N$ points), or batch ($N$ points) gradient descent.
Our model for the data is $y = a + bx + \epsilon$, where $\epsilon$ is random noise. For a linear regression, we define the error $E$, or loss function, as the mean squared error (MSE) given below.
\[\begin{aligned} E &= \frac{1}{N} \sum_{i=1}^{N}{({y_i - \hat{y_i}})^{2}} \\ E &= \frac{1}{N} \sum_{i=1}^{N}{(y_i - a - bx_i)^{2}} \end{aligned}\]The gradients, which tell us how the error changes as we vary each parameter, depend on the partial derivatives, which can be computed with the chain rule.
\[\begin{aligned} \frac{\partial E}{\partial a} &= \frac{\partial E}{\partial \hat{y_i}} \cdot \frac{\partial \hat{y_i}}{\partial a} = \frac{1}{N} \sum_{i=1}^{N} {2(y_i-a- bx_i)\cdot(-1)} = -2 \frac{1}{N} \sum_{i=1}^{N} {(y_i - \hat{y_i})} \\ \frac{\partial E}{\partial b} &= \frac{\partial E}{\partial \hat{y_i}} \cdot \frac{\partial \hat{y_i}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} {2(y_i-a- bx_i)\cdot(-x_i)} = -2 \frac{1}{N} \sum_{i=1}^{N} {x_i (y_i - \hat{y_i})} \\ \end{aligned}\]Lastly, we update according to a learning rate $\eta$ using these definitions.
\[\begin{aligned} a &= a - \eta \frac{\partial E}{\partial a} \\ b &= b - \eta \frac{\partial E}{\partial b} \\ \end{aligned}\]The choice of loss function, calculation of the gradient, and update method are the key ingredients for implementing this optimization. The following code is comparatively simple.
In [3]:
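A sketch of the manual gradient-descent loop implementing the three steps above (the learning rate and epoch count are assumptions):

```python
# re-seed so the random initial guess is reproducible
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)
print(a, b)

lr = 1e-1        # learning rate (eta)
n_epochs = 1000

for epoch in range(n_epochs):
    yhat = a + b * x_train
    error = y_train - yhat
    # gradients of the MSE loss with respect to a and b
    a_grad = -2 * error.mean()
    b_grad = -2 * (x_train * error).mean()
    # step each parameter against its gradient
    a = a - lr * a_grad
    b = b - lr * b_grad

print(a, b)
```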
[0.49671415] [-0.1382643]
[1.02354094] [1.96896411]
The results above show the initial random guess followed by the result of the optimization. Next we will check the results of our basic regression using scikit-learn, which provides a single function to perform the regression. We can see that the results are nearly identical.
In [4]:
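A sketch of the same fit using scikit-learn's LinearRegression:

```python
from sklearn.linear_model import LinearRegression

# fit the same training data with ordinary least squares
regressor = LinearRegression()
regressor.fit(x_train, y_train)
print(regressor.intercept_, regressor.coef_[0])
```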
[1.02354075] [1.96896447]
We prepare to use PyTorch by loading the libraries and selecting a device. For the remainder of the tutorial, we will offload our work to the GPU if one exists. Note that using the GPUs on Blue Crab may require special instructions which will be discussed in class. In particular, it may be necessary to load a very specific version of PyTorch which is built against our current maximum CUDA toolkit version, which is itself set by the device drivers on our machine.
In [5]:
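A sketch of the imports and device selection (on a GPU node the device string would be 'cuda' instead):

```python
import torch
import torch.optim as optim
import torch.nn as nn

# use the GPU whenever one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
```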
'cpu'
Note that PyTorch allows you to convert numpy objects and also send them directly to your GPU device. PyTorch manages its own data types. We use the following techniques later in the exercise when we build the regression in PyTorch.
In [6]:
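For example, converting the numpy training data into float tensors on the chosen device might look like this:

```python
# from_numpy wraps the numpy array; .float() and .to(device) then
# produce a tensor of PyTorch's own FloatTensor type on the device
x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

print(type(x_train), type(x_train_tensor), x_train_tensor.type())
```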
<class 'numpy.ndarray'> <class 'torch.Tensor'> torch.FloatTensor
A PyTorch tensor is an implementation of the standard (mathematical) tensor which allows you to easily compute its gradients. This radically simplifies the process of building machine learning models because it saves you the effort of deriving the update procedure manually.
In [7]:
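A sketch of the same gradient-descent loop using autograd, where loss.backward() replaces our hand-derived gradients (the seed and hyperparameters are assumptions):

```python
torch.manual_seed(42)
# requires_grad=True asks PyTorch to track gradients for these tensors
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 1e-1
n_epochs = 1000

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    loss = ((y_train_tensor - yhat) ** 2).mean()
    loss.backward()           # autograd computes d(loss)/da and d(loss)/db
    with torch.no_grad():     # update the parameters outside the graph
        a -= lr * a.grad
        b -= lr * b.grad
    a.grad.zero_()            # reset the gradients for the next pass
    b.grad.zero_()

print(a, b)
```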
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)
We can inspect the dynamic computation graph for this model. In the following diagram, blue boxes are tensors with computable gradients, gray boxes represent Python operations, and green boxes are the starting points for the computation of gradients. The resulting map shows us how the backward function infers the gradients for tensors with the requires_grad flag set at construction.
In [8]:
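The torchviz package included in our environment can render this graph; a minimal sketch:

```python
from torchviz import make_dot

# rebuild the forward pass and visualize the resulting graph
yhat = a + b * x_train_tensor
make_dot(yhat)
```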
This analysis gives us useful insights into the construction of more complex models. We can refine the training procedure further by replacing our manual update steps with the use of a proper optimizer.
In [9]:
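A sketch using optim.SGD, so that step() and zero_grad() replace the manual update and gradient reset:

```python
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

optimizer = optim.SGD([a, b], lr=1e-1)

for epoch in range(1000):
    yhat = a + b * x_train_tensor
    loss = ((y_train_tensor - yhat) ** 2).mean()
    loss.backward()
    optimizer.step()          # update every registered parameter
    optimizer.zero_grad()     # reset all gradients at once

print(a, b)
```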
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)
Our linear regression uses the standard mean-squared error (MSE) as the loss function; however, many other machine learning models use alternative loss functions, many of which lack an analytical solution.
In [12]:
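Swapping in PyTorch's built-in MSE loss gives a sketch of the final version of the loop:

```python
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD([a, b], lr=1e-1)

for epoch in range(1000):
    yhat = a + b * x_train_tensor
    loss = loss_fn(yhat, y_train_tensor)   # replaces the manual MSE
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(a, b)
```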
tensor([1.0235], requires_grad=True) tensor([1.9690], requires_grad=True)
The remainder of the source material for this post dives deeper into the process of building these kinds of regressions. For now, we will turn to TensorFlow for a similar exercise.
TensorFlow and Keras
There are many competing frameworks for performing ML on your data, and TensorFlow provides a useful alternative. The following exercise performs a similar, but higher-dimensional, regression based on the documentation here. First we must import a number of libraries, including the abstraction layer Keras, which makes it easier to work with TensorFlow. Keras is an API which can drive multiple ML frameworks, sometimes called backends.
In [13]:
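A sketch of the imports, following the referenced tutorial:

```python
from __future__ import absolute_import, division, print_function

import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)
```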
First we will prepare some data. In this example, we will attempt to predict the fuel efficiency of cars built in the 1970s and 80s using attributes like the number of cylinders, displacement, horsepower, and weight. Use the following commands to collect the data.
In [16]:
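Following the TensorFlow tutorial, the dataset can be fetched from the UCI Machine Learning Repository:

```python
dataset_path = keras.utils.get_file(
    'auto-mpg.data',
    'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data')
dataset_path
```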
'/Users/rpb/.keras/datasets/auto-mpg.data'
In [18]:
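We then read the raw file into a pandas DataFrame, marking '?' entries as missing, and inspect the last few rows (the parsing options follow the tutorial):

```python
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)
dataset = raw_dataset.copy()
dataset.tail()
```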
| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |
We can inspect some of the features of the data, and clean the data by removing incomplete records.
In [21]:
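Counting the missing values per column:

```python
dataset.isna().sum()
```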
MPG 0
Cylinders 0
Displacement 0
Horsepower 6
Weight 0
Acceleration 0
Model Year 0
Origin 0
dtype: int64
In [22]:
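Dropping the six incomplete records:

```python
dataset = dataset.dropna()
```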
We further clean the data by mapping the origin column to separate categories.
In [23]:
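A sketch of the one-hot encoding; the mapping {1: USA, 2: Europe, 3: Japan} follows the tutorial, and pd.get_dummies produces the three indicator columns seen below:

```python
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
```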
In [24]:
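Viewing the tail of the cleaned data:

```python
dataset.tail()
```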
| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Europe | Japan | USA |
|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 0 | 0 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 1 | 0 | 0 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 0 | 0 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 0 | 0 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 0 | 0 | 1 |
Next we will split the data into training and testing sets.
In [25]:
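A sketch of the random 80/20 split (the random_state value is an assumption):

```python
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
```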
The seaborn pair plot allows us to inspect the pairwise correlations between four columns in our data set (off-diagonal panels) along with their distributions (diagonal panels).
In [26]:
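For example, plotting four of the columns (this column choice follows the tutorial):

```python
sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']],
             diag_kind='kde')
```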
<seaborn.axisgrid.PairGrid at 0x1a3c7d5e10>
We can also collect summary statistics.
In [27]:
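We transpose the output of describe and drop the MPG target so the statistics cover only the features:

```python
train_stats = train_dataset.describe()
train_stats.pop('MPG')
train_stats = train_stats.transpose()
train_stats
```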
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Cylinders | 314.0 | 5.477707 | 1.699788 | 3.0 | 4.00 | 4.0 | 8.00 | 8.0 |
| Displacement | 314.0 | 195.318471 | 104.331589 | 68.0 | 105.50 | 151.0 | 265.75 | 455.0 |
| Horsepower | 314.0 | 104.869427 | 38.096214 | 46.0 | 76.25 | 94.5 | 128.00 | 225.0 |
| Weight | 314.0 | 2990.251592 | 843.898596 | 1649.0 | 2256.50 | 2822.5 | 3608.00 | 5140.0 |
| Acceleration | 314.0 | 15.559236 | 2.789230 | 8.0 | 13.80 | 15.5 | 17.20 | 24.8 |
| Model Year | 314.0 | 75.898089 | 3.675642 | 70.0 | 73.00 | 76.0 | 79.00 | 82.0 |
| Europe | 314.0 | 0.178344 | 0.383413 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Japan | 314.0 | 0.197452 | 0.398712 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| USA | 314.0 | 0.624204 | 0.485101 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
Before continuing, we must separate the labels from the features in order to make our predictions.
In [33]:
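Popping the MPG column gives us the labels:

```python
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
```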
We also normalize the data because the features span very different numerical ranges. This typically helps with convergence.
In [34]:
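A sketch of the normalization, using the training statistics computed above for both sets:

```python
def norm(x):
    # standardize each feature using the training-set statistics
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
```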
We are finally ready to build the model. In the following function we will generate a Sequential model with two densely connected hidden layers along with an output layer that returns a single continuous value. We wrap the model in a function so that we can rebuild it later.
In [35]:
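A sketch of the model-building function (the layer widths and RMSprop learning rate follow the tutorial):

```python
def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation=tf.nn.relu,
                     input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)                       # single continuous output
    ])
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.RMSprop(0.001),
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
```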
In [36]:
In [37]:
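Building the model and printing its summary:

```python
model = build_model()
model.summary()
```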
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 64) 640
_________________________________________________________________
dense_4 (Dense) (None, 64) 4160
_________________________________________________________________
dense_5 (Dense) (None, 1) 65
=================================================================
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
_________________________________________________________________
To test the model, we can take 10 examples from our training data and use model.predict.
In [38]:
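Taking ten rows from the normalized training data:

```python
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result
```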
array([[0.14024973],
[0.15446681],
[0.07823651],
[0.08530468],
[0.30942103],
[0.23088537],
[0.37081474],
[0.03645729],
[0.22059843],
[0.30308604]], dtype=float32)
To train on all of the data, we iterate over smaller batches. One epoch represents a single pass over all of the data.
In [39]:
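A sketch of the training call (1000 epochs and a 20% validation split follow the tutorial; verbose=0 suppresses the per-epoch log):

```python
EPOCHS = 1000

history = model.fit(normed_train_data, train_labels,
                    epochs=EPOCHS, validation_split=0.2, verbose=0)
```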
After the training is complete, we can inspect the history. Note that the TensorFlow tutorial includes a useful set of callbacks for viewing the results as training progresses. These are not available in our version of TensorFlow, but we will discuss them in class.
In [40]:
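Packing the history into a DataFrame for inspection:

```python
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
```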
| | val_loss | val_mean_absolute_error | val_mean_squared_error | loss | mean_absolute_error | mean_squared_error | epoch |
|---|---|---|---|---|---|---|---|
| 995 | 9.490661 | 2.434818 | 9.490661 | 2.585021 | 1.045370 | 2.585021 | 995 |
| 996 | 9.364464 | 2.404208 | 9.364464 | 2.392526 | 0.978031 | 2.392526 | 996 |
| 997 | 9.934777 | 2.476030 | 9.934777 | 2.339362 | 0.990820 | 2.339362 | 997 |
| 998 | 9.749488 | 2.503732 | 9.749488 | 2.428867 | 0.997238 | 2.428867 | 998 |
| 999 | 9.696823 | 2.466403 | 9.696823 | 2.400815 | 0.975567 | 2.400815 | 999 |
We can compare our test predictions with the labels and also view the error distribution.
In [41]:
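A sketch of the predicted-versus-true scatter plot (the axis limits are assumptions):

```python
import matplotlib.pyplot as plt

test_predictions = model.predict(normed_test_data).flatten()

plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.plot([0, 50], [0, 50])   # a perfect model would fall on this line
plt.show()
```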
In [42]:
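And a histogram of the prediction errors:

```python
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [MPG]')
plt.ylabel('Count')
plt.show()
```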
The workflow above includes many of the same components as our first example; however, much of the design of our algorithm has been compartmentalized in the keras.Sequential function. The question of which API, framework, or library is best for your research may be partly a matter of taste and partly determined by your technical needs. In our final exercise, we will return to PyTorch to perform an image classification, which provides a more complex model compared to the regression we started with.
Image classification with PyTorch
The following exercise comes directly from the PyTorch documentation. In this section we will classify images from the CIFAR10 dataset. Classification and regression are the two main types of supervised learning.
In [43]:
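The imports for this exercise:

```python
import torch
import torchvision
import torchvision.transforms as transforms
```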
We will start by downloading the data.
In [44]:
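Following the PyTorch tutorial, torchvision downloads CIFAR10 and wraps it in DataLoader objects:

```python
# normalize the images from [0, 1] to [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
```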
Files already downloaded and verified
Files already downloaded and verified
We can plot a few representative images.
In [45]:
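A sketch of the plotting helper from the tutorial, applied to one training batch:

```python
import matplotlib.pyplot as plt
import numpy as np

def imshow(img):
    img = img / 2 + 0.5                       # unnormalize back to [0, 1]
    plt.imshow(np.transpose(img.numpy(), (1, 2, 0)))
    plt.show()

dataiter = iter(trainloader)
images, labels = next(dataiter)
imshow(torchvision.utils.make_grid(images))
print(' '.join(classes[labels[j]] for j in range(4)))
```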
horse dog frog dog
Next we will build the model by subclassing nn.Module and defining the forward function. In the following example we have built a convolutional neural network.
In [46]:
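The convolutional network from the tutorial: two convolution/pooling stages followed by three fully connected layers:

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)       # 3 input channels, 6 filters
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)          # one output per class

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)            # flatten for the dense layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = Net()
```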
Next we will select the classification cross-entropy loss function along with stochastic gradient descent (SGD) with momentum.
In [47]:
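Per the tutorial:

```python
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
```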
The training loop is very similar to the one we started with: we compute the error, backpropagate, and then step the optimizer to update the parameters. We train for two epochs, each of which covers all of the data in the trainloader object.
In [48]:
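The training loop from the tutorial, reporting the running loss every 2000 mini-batches:

```python
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()                 # reset gradients
        outputs = net(inputs)                 # forward pass
        loss = criterion(outputs, labels)
        loss.backward()                       # backpropagate
        optimizer.step()                      # update parameters
        running_loss += loss.item()
        if i % 2000 == 1999:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
```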
[1, 2000] loss: 2.187
[1, 4000] loss: 1.874
[1, 6000] loss: 1.707
[1, 8000] loss: 1.607
[1, 10000] loss: 1.547
[1, 12000] loss: 1.504
[2, 2000] loss: 1.449
[2, 4000] loss: 1.391
[2, 6000] loss: 1.370
[2, 8000] loss: 1.339
[2, 10000] loss: 1.332
[2, 12000] loss: 1.294
Finished Training
In [49]:
We can review some of our predictions by iterating over the test set.
In [56]:
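Plotting the first test batch with the true labels:

```python
dataiter = iter(testloader)
images, labels = next(dataiter)
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))
```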
GroundTruth: cat ship ship plane
In [51]:
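The class with the highest output is the prediction:

```python
outputs = net(images)
_, predicted = torch.max(outputs, 1)
print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))
```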
Predicted: deer ship bird ship
Overall accuracy is much higher than chance (10%, since there are 10 categories).
In [52]:
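Counting correct predictions over the whole test set:

```python
correct = 0
total = 0
with torch.no_grad():        # no gradients needed for evaluation
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
```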
Accuracy of the network on the 10000 test images: 56 %
We can also analyze the resulting predictions to see which images are the hardest to classify.
In [53]:
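Tallying correct predictions per class:

```python
class_correct = [0.0] * 10
class_total = [0.0] * 10
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):                    # batch size is 4
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
```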
Accuracy of plane : 56 %
Accuracy of car : 79 %
Accuracy of bird : 39 %
Accuracy of cat : 38 %
Accuracy of deer : 43 %
Accuracy of dog : 29 %
Accuracy of frog : 79 %
Accuracy of horse : 56 %
Accuracy of ship : 73 %
Accuracy of truck : 64 %
We have only scratched the surface so far. The three exercises and corresponding discussions above are meant to emphasize the following important features of ML workflows.
- Machine learning methods have a strong basis in statistical inference, and many statistical questions can be reformulated with machine learning models to interrogate large and complex data sets.
- Machine learning methods perform high-dimensional optimizations of many model parameters.
- Our methods must carefully avoid the perils of overfitting by training and testing on separate data sets.
- The various machine learning tools provide similar levels of abstraction with different methods for formulating the underlying optimization algorithms, loss functions, and model structure.
- We should carefully track our software versions so that they are both compatible with our HPC cluster and reproducible.
- Our predictions are only as good as the data!
We recommend that readers interested in learning more check out the Stanford course on Convolutional Neural Networks for Visual Recognition. Users with specific questions about the hardware, software, or best practices for building machine learning models are welcome to contact our staff (marcc-help@marcc.jhu.edu) for further advice.