Implementing convolution as a matrix multiplication

CNN中的卷积操作

卷积层是CNNs网络中可以说是最重要的层了，卷积层的主要作用是对输入图像求卷积运算。如下图所示，输入图片的维数为$[c_0,h_0,w_0]$ ；卷积核的维数为$[c_1,c_0,h_k,w_k]$，其中$c_0$在图中没有表示出来，一个卷积核可以看成由$c_1$个维数为$[c_0,h_k,w_k]$的三维滤波器组成；除了这些参数通常在计算卷积运算的时候还有一些超参数比如：stride（步长）：$S$,padding（填充）：$P$。

根据上面所说的参数就可以求出输出特征的维数为$[c_1,h_1,w_1]$,其中$h_1 = (h_0-h_k+2P)/S+1$,$w_1 = (w_0-w_k+2P)/S+1$。

卷积的计算过程其实很简单，但不是很容易说清楚，下面通过代码来说明。

基本环境设置:

%load_ext cython  #代码运行在jupyter-notebook中
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

卷积层计算的代码如下，想象一副图像尺寸为MxM，卷积核mxm。在计算时，卷积核与图像中每个mxm大小的图像块做element-wise相乘，然后得到的结果相加得到一个值，然后再移动一个stride，做同样的运算，直到整副输入图像遍历完，上述过程得到的值就组成了输出特征，具体运算过程还是看代码。

def conv_forward_naive(x, w, b, conv_param):
  out = None
  stride = conv_param['stride']
  pad = conv_param['pad']
  N, C, W, H = x.shape
  F, C, HH, WW = w.shape
  H_out = 1 + (H + 2 * pad - HH) / stride
  W_out = 1 + (W + 2 * pad - WW) / stride
  npad = ((0,0), (0,0), (pad,pad), (pad,pad))
  x_pad = np.pad(x, pad_width=npad, mode='constant', constant_values=0)
  out = np.zeros((N, F, H_out, W_out))
  for i in xrange(N):
      for j in xrange(F):
          for k in xrange(H_out):
              for z in xrange(W_out):
                  out[i, j, k, z] = np.sum(x_pad[i, :, k*stride:k*stride+HH,  z*stride:z*stride+WW]*w[j, :, :, :])+b[j]            
  cache = (x, w, b, conv_param)
  return out, cache

下面来检测下上面的卷积计算代码，我们人为的设置两个卷积核（分别为求灰度特征，和边缘特征），然后对两幅输入图像求卷积，观察输出的结果：

from scipy.misc import imread, imresize
kitten, puppy = imread('kitten.jpg'), imread('puppy.jpg')

d = kitten.shape[1] - kitten.shape[0]
kitten_cropped = kitten[:, d/2:-d/2, :]

img_size = 200   # Make this smaller if it runs too slow
x = np.zeros((2, 3, img_size, img_size))
x[0, :, :, :] = imresize(puppy, (img_size, img_size)).transpose((2, 0, 1))
x[1, :, :, :] = imresize(kitten_cropped, (img_size, img_size)).transpose((2, 0, 1))

# Set up a convolutional weights holding 2 filters, each 3x3
w = np.zeros((2, 3, 3, 3))

# The first filter converts the image to grayscale.
# Set up the red, green, and blue channels of the filter.
w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]
w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]
w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]

# Second filter detects horizontal edges in the blue channel.
w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

# Vector of biases. We don't need any bias for the grayscale
# filter, but for the edge detection filter we want to add 128
# to each output so that nothing is negative.
b = np.array([0, 128])

# Compute the result of convolving each input in x with each filter in w,
# offsetting by b, and storing the results in out.
out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})

def imshow_noax(img, normalize=True):
    """ Tiny helper to show images as uint8 and remove axis labels """
    if normalize:
        img_max, img_min = np.max(img), np.min(img)
        img = 255.0 * (img - img_min) / (img_max - img_min)
    plt.imshow(img.astype('uint8'))
    plt.gca().axis('off')

# Show the original images and the results of the conv operation
plt.subplot(2, 3, 1)
imshow_noax(puppy, normalize=False)
plt.title('Original image')
plt.subplot(2, 3, 2)
imshow_noax(out[0, 0])
plt.title('Grayscale')
plt.subplot(2, 3, 3)
imshow_noax(out[0, 1])
plt.title('Edges')
plt.subplot(2, 3, 4)
imshow_noax(kitten_cropped, normalize=False)
plt.subplot(2, 3, 5)
imshow_noax(out[1, 0])
plt.subplot(2, 3, 6)
imshow_noax(out[1, 1])
plt.show()

图像经过卷积后，输入结果如下所示：

im2col

运行上面代码的时候，我们发现对这两张图片计算卷积还是比较慢的，而在CNN中是存在大量的卷积运算的，所以我们需要一个更加快速的计算卷积的方法。如下图所示为Caffe中计算卷积的示意图，通过上面普通卷积运算的实现我们可以发现，卷积操作实际上是在对输入特征的一定范围内和卷积核滤波器做点乘，如下图我们可以利用这一特性把卷积操作转换成两个大矩阵相乘。

把输入图像要经行卷积操作的这一区域展成列向量的操作通常称为im2col，具体过程如下图所示。

下图为一个具体的例子，看懂下面这个图应该就会清楚上面的做法。

下面的im2col_cython是使用Cython代码来实现im2col功能，有关Cython在Python中的具体使用可参考：Python速度优化-Cython中numpy以及多线程的使用。

%%cython
import cython
cimport numpy as np
import numpy as np
ctypedef fused DTYPE_t:
    np.float32_t
    np.float64_t

def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height,
                  int field_width, int padding, int stride):
    cdef int N = x.shape[0]
    cdef int C = x.shape[1]
    cdef int H = x.shape[2]
    cdef int W = x.shape[3]

    cdef int HH = (H + 2 * padding - field_height) / stride + 1
    cdef int WW = (W + 2 * padding - field_width) / stride + 1

    cdef int p = padding
    cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x,
            ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')

    cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros(
            (C * field_height * field_width, N * HH * WW),
            dtype=x.dtype)

    # Moving the inner loop to a C function with no bounds checking works, but does
    # not seem to help performance in any measurable way.

    cdef int c, ii, jj, row, yy, xx, i, col

    for c in range(C):
        for yy in range(HH):
            for xx in range(WW):
                for ii in range(field_height):
                    for jj in range(field_width):
                        row = c * field_width * field_height + ii * field_height + jj
                        for i in range(N):
                            col = yy * WW * N + xx * N + i
                            cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj]
    return cols

调用上面的im2col_cython函数来实现卷积操作：

def conv_forward_im2col(x, w, b, conv_param):
  """
  A fast implementation of the forward pass for a convolutional layer
  based on im2col and col2im.
  """
  N, C, H, W = x.shape
  num_filters, _, filter_height, filter_width = w.shape
  stride, pad = conv_param['stride'], conv_param['pad']

  # Check dimensions
  assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work'
  assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work'

  # Create output
  out_height = (H + 2 * pad - filter_height) / stride + 1
  out_width = (W + 2 * pad - filter_width) / stride + 1
  out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype)

  # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride)
  x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride)
  res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1)

  out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0])
  out = out.transpose(3, 0, 1, 2)

  cache = (x, w, b, conv_param, x_cols)
  return out, cache

测试使用im2col方法的卷积操作，从输出的图片可以看出和原始卷积方法一样。

out, _ = conv_forward_im2col(x, w, b, {'stride': 1, 'pad': 1})
# Show the original images and the results of the conv operation
plt.subplot(2, 3, 1)
imshow_noax(puppy, normalize=False)
plt.title('Original image')
plt.subplot(2, 3, 2)
imshow_noax(out[0, 0])
plt.title('Grayscale')
plt.subplot(2, 3, 3)
imshow_noax(out[0, 1])
plt.title('Edges')
plt.subplot(2, 3, 4)
imshow_noax(kitten_cropped, normalize=False)
plt.subplot(2, 3, 5)
imshow_noax(out[1, 0])
plt.subplot(2, 3, 6)
imshow_noax(out[1, 1])
plt.show()

下面来测试一下使用两种方法的时间，使用原始的卷积操作每次循环需要2.19s，而使用im2col方法则只需要28.3ms，时间大概缩短了77倍，当然这其中也包括了使用Cython所降低的时间，但总体上来说还是大大加快了卷积的计算速度。

虽然使用im2col方法加快了计算速度，但也会使用更多的内存，因为把输入图像转换为col的时候，会有很多重复的元素。

1 2	%timeit conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1}) %timeit conv_forward_im2col(x, w, b, {'stride': 1, 'pad': 1})

1 2	1 loop, best of 3: 2.19 s per loop 10 loops, best of 3: 28.3 ms per loop

参考

Convolutional Neural Networks (CNNs / ConvNets)

深入理解Caffe源码（卷积实现详细分析