
Different ways of optimization with deep learning toolkits

This is one of a series of notes taken for the Columbia course E6489.

In the last note, we suggested that any model can be represented as a deep network, so we encourage you to think broadly and be open to new models. Following this idea, this note discusses different ways of using deep learning toolkits.

Classical optimization via Theano

To simplify the discussion, we stick to Theano, but you can find similar solutions in TensorFlow and PyTorch.

In Theano, we can use the function theano.tensor.grad() to compute the gradient of a scalar expression with respect to the parameters. For example:

my_grad = theano.tensor.grad(loss, shared_w)

Thus we can connect Theano to a classical optimization algorithm. Here we choose the L-BFGS-B algorithm from SciPy and show that, with only one small trick, we can build a Theano interface that can be called by scipy.optimize.fmin_l_bfgs_b.
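The snippet below assumes that the model has already defined shared_w (a Theano shared variable holding the parameters), a scalar loss expression built from it, and an initial parameter vector w_init. As a minimal, purely illustrative setup (the quadratic loss here is an assumption for demonstration, not part of the original example), these could look like:

import numpy as np
import theano
from theano import tensor as T

# Initial parameter vector; shared_w is the Theano shared variable
# holding the parameters that will be optimized
w_init = np.zeros(10, dtype=np.float32)
shared_w = theano.shared(w_init, name='w')

# Purely illustrative scalar loss: squared distance to a fixed target
target = np.arange(10, dtype=np.float32)
loss = T.sum((shared_w - target) ** 2)

With loss, shared_w, and w_init in place, the interface for SciPy is: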

import scipy.optimize

# Symbolic gradient of the loss with respect to the shared parameters
grad = T.grad(loss, shared_w)

# Compile functions that evaluate the loss and gradient at the
# current value of shared_w (no explicit inputs are needed)
f_loss = theano.function([], loss)
f_grad = theano.function([], grad)

print('before opt:', f_loss())

# Trick: wrap the compiled functions as plain NumPy functions that the
# SciPy optimizer can call; L-BFGS-B works in float64, while the shared
# parameters are stored in float32
def eval_loss(x0):
    shared_w.set_value(x0.astype(np.float32))
    return f_loss().astype('float64')


def eval_grad(x0):
    shared_w.set_value(x0.astype(np.float32))
    return np.array(f_grad()).flatten().astype('float64')

opt_result = scipy.optimize.fmin_l_bfgs_b(eval_loss, w_init, fprime=eval_grad, maxfun=40)

The following figure shows an example of such an optimization process.

Figure: traditional optimization (gradient descent).

Stochastic optimization

The example above is quite similar to traditional optimization work (except that it may make your code shorter). However, the story changes when the dataset becomes too big to fit in memory.

When the dataset becomes too big to be loaded at once, a straightforward strategy is to randomly sample one example, or a batch of examples, at a time, and then update the model parameters using only those examples. We repeat this sampling-and-update process until all the data have been visited a number of times. This optimization strategy is called stochastic gradient descent (SGD). Keras provides a very easy-to-use wrapper for SGD (see the short example after the figure). The figure below shows an example of SGD optimization, and you can easily see the differences between SGD and GD.

Figure: stochastic gradient descent.
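As a concrete illustration of the Keras wrapper mentioned above, here is a minimal sketch that trains a small network with plain SGD. The data, architecture, learning rate, and batch size are all illustrative assumptions; Keras shuffles the training set and draws a mini-batch of batch_size examples for every parameter update, sweeping over the data epochs times.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Hypothetical toy data: 10,000 examples with 20 features, binary labels
X = np.random.randn(10000, 20).astype('float32')
y = (np.random.rand(10000) > 0.5).astype('float32')

# A small fully connected network; the architecture is only illustrative
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))

# Plain SGD: each update uses a random mini-batch of 32 examples
model.compile(optimizer=SGD(lr=0.01), loss='binary_crossentropy')
model.fit(X, y, batch_size=32, epochs=10)

Switching to a variant such as momentum SGD or Adam only requires changing the optimizer argument, e.g. SGD(lr=0.01, momentum=0.9) or Adam().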

Although stochastic optimization is conceptually only a small modification, it has several important implications:

1. SGD and its variants make it possible to learn from big datasets.

2. Finding the global minimum is no longer crucial when the training set is large, so the classical criticism that gradient methods get stuck in local minima carries much less weight for SGD.

3. GPUs have become the most popular tool for speeding up mini-batch SGD.

Nowadays a powerful GPU server can easily handle datasets with millions of examples or hundreds of gigabytes of data. Many companies, including Google, Amazon, Facebook, NVIDIA, and Tesla, are competing hard to provide solutions for learning from ever larger datasets. It will be interesting to watch how and where optimization with bigger datasets evolves in the future.
