https://cloud.google.com/vision/

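This looks most useful for text recognition; it even handles text rotated by 90 degrees well. For objects and faces, recognition usually has to be tailored to the specific domain, so I expect there will still be many cases where you have to build your own deep learning (CNN, R-CNN) architecture.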






lol




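Fellow developers, you have worked really hard all this time, struggling to understand the English and the theory at once.

There is a really good Korean-language video series, so I am linking it here. It explains the theory of deep learning in a really easy way.

If you build your foundations with these videos and then develop insight and intuition through a variety of hands-on applications with Caffe and similar tools, I think an ordinary developer can use deep learning as a good secondary weapon in their own projects.


* I guess this is why people swear by video lectures.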





Provided by: 모두의 연구소, researcher 이찬우



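[Deep Learning] 1. Introduction

[Deep Learning] 2. Linear Regression and Gradient Descent

[Deep Learning] 3. Gradient Descent & Normal Eq.

[Deep Learning] 4. Logistic Regression

[Deep Learning] 5. Cost Function of Logistic Regression

[Deep Learning] 6. Neural Network Representation

[Deep Learning] 7. Backpropagation in Neural Networks

[Deep Learning] Introduction to Graphs

[Deep Learning] RNN Introduction

[Deep Learning] RNN : LSTM Basic

[Deep Learning] RNN : LSTM Structure

[Deep Learning] RNN : Back Propagation

[Deep Learning] RNN Training Mechanism

[Deep Learning] Convolutional NeuralNet 1

[Deep Learning] Convolutional NeuralNet 2

[Deep Learning] CNN Back Propagation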


















Example function for converting binaryproto -> npy

 
#!/usr/bin/env python
import caffe
import numpy as np
import sys


## proto / datum / ndarray conversion
def blobproto_to_array(blob, return_diff=False):
    """
    Convert a blob proto to an array. In default, we will just return the data,
    unless return_diff is True, in which case we will return the diff.
    """
    # Read the data into an array
    if return_diff:
        data = np.array(blob.diff)
    else:
        data = np.array(blob.data)

    # Reshape the array
    if blob.HasField('num') or blob.HasField('channels') or blob.HasField('height') or blob.HasField('width'):
        # Use legacy 4D shape
        return data.reshape(blob.num, blob.channels, blob.height, blob.width)  # bug? the leading num axis makes the result 4-D
    else:
        return data.reshape(blob.shape.dim)



# Parse the binaryproto mean file and save it as .npy
blob = caffe.proto.caffe_pb2.BlobProto()
with open('./company_mean.binaryproto', 'rb') as f:
    blob.ParseFromString(f.read())
arr = blobproto_to_array(blob)
np.save('./company_mean.npy', arr)


A problem!!!

The function above is adapted from blobproto_to_array in io.py under python/caffe, but when I ran caffe with a mean file
converted to npy this way, it raised ValueError('Mean shape invalid').
Judging from the error message, the exception comes from inside the set_mean function:

    def set_mean(self, in_, mean):
        ms = mean.shape
        if mean.ndim == 1:
            # broadcast channels
            if ms[0] != self.inputs[in_][1]:
                raise ValueError('Mean channels incompatible with input.')
            mean = mean[:, np.newaxis, np.newaxis]
        else:
            # elementwise mean
            if len(ms) == 2:
                ms = (1,) + ms
            if len(ms) != 3:
                raise ValueError('Mean shape invalid')

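As you can see, the exception is raised because the length of ms is not 3.
The shape ms comes from data.reshape(blob.num, blob.channels, blob.height, blob.width) in the first
example, and once I dropped the first argument, blob.num, the exception went away.
In other words, combining the original image with the mean image needs only channels, height, and width;
because num was included, the shape had length 4, and that was the problem.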
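To make the shape problem concrete, here is a minimal sketch that works on shape tuples only. The helper names drop_num_axis and check_mean_shape are hypothetical (they are not part of Caffe), check_mean_shape just mirrors the length check quoted from set_mean above, and the 256 x 256 mean size is an assumed example value.

```python
def drop_num_axis(shape):
    """Drop a leading num axis of size 1, so (1, C, H, W) becomes (C, H, W)."""
    shape = tuple(shape)
    if len(shape) == 4 and shape[0] == 1:
        return shape[1:]
    return shape


def check_mean_shape(shape):
    """Mirror the length check that set_mean applies to a non-1-D mean."""
    ms = tuple(shape)
    if len(ms) == 2:
        ms = (1,) + ms
    if len(ms) != 3:
        raise ValueError('Mean shape invalid')
    return ms


legacy_shape = (1, 3, 256, 256)  # assumed num, channels, height, width
fixed_shape = drop_num_axis(legacy_shape)
print(check_mean_shape(fixed_shape))  # (3, 256, 256)
```

On the real array, the equivalent step would be to save arr[0] instead of arr when arr has shape (1, C, H, W).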


Layers

To create a Caffe model you need to define the model architecture in a protocol buffer definition file (prototxt).

Caffe layers and their parameters are defined in the protocol buffer definitions for the project in caffe.proto.

Vision Layers

  • Header: ./include/caffe/vision_layers.hpp

Vision layers usually take images as input and produce other images as output. A typical “image” in the real-world may have one color channel (c = 1), as in a grayscale image, or three color channels (c = 3) as in an RGB (red, green, blue) image. But in this context, the distinguishing characteristic of an image is its spatial structure: usually an image has some non-trivial height h > 1 and width w > 1. This 2D geometry naturally lends itself to certain decisions about how to process the input. In particular, most of the vision layers work by applying a particular operation to some region of the input to produce a corresponding region of the output. In contrast, other layers (with few exceptions) ignore the spatial structure of the input, effectively treating it as “one big vector” with dimension c * h * w.

Convolution

  • Layer type: Convolution
  • CPU implementation: ./src/caffe/layers/convolution_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/convolution_layer.cu
  • Parameters (ConvolutionParameter convolution_param)
    • Required
      • num_output (c_o): the number of filters
      • kernel_size (or kernel_h and kernel_w): specifies height and width of each filter
    • Strongly Recommended
      • weight_filler [default type: 'constant' value: 0]
    • Optional
      • bias_term [default true]: specifies whether to learn and apply a set of additive biases to the filter outputs
      • pad (or pad_h and pad_w) [default 0]: specifies the number of pixels to (implicitly) add to each side of the input
      • stride (or stride_h and stride_w) [default 1]: specifies the intervals at which to apply the filters to the input
      • group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a subset of the input. Specifically, the input and output channels are separated into g groups, and the i-th output group channels will be only connected to the i-th input group channels.
  • Input
    • n * c_i * h_i * w_i
  • Output
    • n * c_o * h_o * w_o, where h_o = (h_i + 2 * pad_h - kernel_h) / stride_h + 1 and w_o likewise.
  • Sample (as seen in ./models/bvlc_reference_caffenet/train_val.prototxt)

    layer {
      name: "conv1"
      type: "Convolution"
      bottom: "data"
      top: "conv1"
      # learning rate and decay multipliers for the filters
      param { lr_mult: 1 decay_mult: 1 }
      # learning rate and decay multipliers for the biases
      param { lr_mult: 2 decay_mult: 0 }
      convolution_param {
        num_output: 96     # learn 96 filters
        kernel_size: 11    # each filter is 11x11
        stride: 4          # step 4 pixels between each filter application
        weight_filler {
          type: "gaussian" # initialize the filters from a Gaussian
          std: 0.01        # distribution with stdev 0.01 (default mean: 0)
        }
        bias_filler {
          type: "constant" # initialize the biases to zero (0)
          value: 0
        }
      }
    }
    

The Convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
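The output-size formula above can be checked with a few lines of Python. This is only a sketch: conv_output_size is a hypothetical helper, not a Caffe function, and the 227-pixel input is an assumed crop size rather than something stated in this section.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """h_o = (h_i + 2 * pad_h - kernel_h) / stride_h + 1, using integer division."""
    return (in_size + 2 * pad - kernel) // stride + 1


# conv1 from the sample above: kernel_size 11, stride 4, no padding.
h_o = conv_output_size(227, kernel=11, stride=4)  # 227 is an assumed input height
print(h_o)  # 55
```

The same call works for the width, since w_o is computed likewise.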

Pooling

  • Layer type: Pooling
  • CPU implementation: ./src/caffe/layers/pooling_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/pooling_layer.cu
  • Parameters (PoolingParameter pooling_param)
    • Required
      • kernel_size (or kernel_h and kernel_w): specifies height and width of each filter
    • Optional
      • pool [default MAX]: the pooling method. Currently MAX, AVE, or STOCHASTIC
      • pad (or pad_h and pad_w) [default 0]: specifies the number of pixels to (implicitly) add to each side of the input
      • stride (or stride_h and stride_w) [default 1]: specifies the intervals at which to apply the filters to the input
  • Input
    • n * c * h_i * w_i
  • Output
    • n * c * h_o * w_o, where h_o and w_o are computed in the same way as convolution.
  • Sample (as seen in ./models/bvlc_reference_caffenet/train_val.prototxt)

    layer {
      name: "pool1"
      type: "Pooling"
      bottom: "conv1"
      top: "pool1"
      pooling_param {
        pool: MAX
        kernel_size: 3 # pool over a 3x3 region
        stride: 2      # step two pixels (in the bottom blob) between pooling regions
      }
    }
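To see what MAX pooling with kernel_size 3 and stride 2 does to a single channel, here is a small pure-Python sketch. max_pool_2d is a hypothetical helper, it assumes no padding, and it only keeps windows that fit entirely inside the input, which may differ from how Caffe rounds at the borders.

```python
def max_pool_2d(image, kernel, stride):
    """Max-pool a 2D list of numbers over kernel x kernel windows."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - kernel + 1, stride):
        row = []
        for left in range(0, w - kernel + 1, stride):
            window = [image[top + i][left + j]
                      for i in range(kernel) for j in range(kernel)]
            row.append(max(window))
        out.append(row)
    return out


# A 5x5 input holding 0..24 in row-major order.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
print(max_pool_2d(image, kernel=3, stride=2))  # [[12, 14], [22, 24]]
```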
    

Local Response Normalization (LRN)

  • Layer type: LRN
  • CPU Implementation: ./src/caffe/layers/lrn_layer.cpp
  • CUDA GPU Implementation: ./src/caffe/layers/lrn_layer.cu
  • Parameters (LRNParameter lrn_param)
    • Optional
      • local_size [default 5]: the number of channels to sum over (for cross channel LRN) or the side length of the square region to sum over (for within channel LRN)
      • alpha [default 1]: the scaling parameter (see below)
      • beta [default 0.75]: the exponent (see below)
      • norm_region [default ACROSS_CHANNELS]: whether to sum over adjacent channels (ACROSS_CHANNELS) or nearby spatial locations (WITHIN_CHANNEL)

The local response normalization layer performs a kind of “lateral inhibition” by normalizing over local input regions. In ACROSS_CHANNELS mode, the local regions extend across nearby channels, but have no spatial extent (i.e., they have shape local_size x 1 x 1). In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels (i.e., they have shape 1 x local_size x local_size). Each input value is divided by $(1 + (\alpha/n) \sum_i x_i^2)^\beta$, where $n$ is the size of each local region, and the sum is taken over the region centered at that value (zero padding is added where necessary).
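As a rough illustration of the ACROSS_CHANNELS case, the sketch below normalizes the channel values at one spatial position. lrn_across_channels is a hypothetical helper, the input numbers are made up, and alpha and beta are passed explicitly rather than relying on any defaults.

```python
def lrn_across_channels(values, local_size, alpha, beta):
    """Divide each channel value by (1 + (alpha / n) * sum of squares) ** beta.

    values holds one number per channel at a single spatial position; channels
    outside the valid range contribute zero (zero padding).
    """
    half = local_size // 2
    out = []
    for c, x in enumerate(values):
        lo, hi = max(0, c - half), min(len(values), c + half + 1)
        sq_sum = sum(v * v for v in values[lo:hi])
        out.append(x / (1.0 + (alpha / local_size) * sq_sum) ** beta)
    return out


print(lrn_across_channels([1.0, 2.0, 3.0], local_size=3, alpha=3.0, beta=1.0))
```

With these toy values the three scale factors are 6, 15, and 14, so the outputs are 1/6, 2/15, and 3/14.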

im2col

Im2col is a helper for doing the image-to-column transformation that you most likely do not need to know about. This is used in Caffe’s original convolution to do matrix multiplication by laying out all patches into a matrix.
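For the curious, the idea can be sketched in pure Python for a single-channel image. Both function names are hypothetical, patches are stored as rows here (the transpose of a column layout), and the filter is applied without flipping.

```python
def im2col(image, kernel, stride=1):
    """Flatten every kernel x kernel patch of a 2D list into one row."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - kernel + 1, stride):
        for left in range(0, w - kernel + 1, stride):
            patches.append([image[top + i][left + j]
                            for i in range(kernel) for j in range(kernel)])
    return patches


def conv_via_im2col(image, filt, stride=1):
    """Apply a square filter as a matrix-vector product over the patch matrix."""
    flat = [v for row in filt for v in row]
    return [sum(a * b for a, b in zip(patch, flat))
            for patch in im2col(image, len(filt), stride)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(conv_via_im2col(image, [[1, 0], [0, 1]]))  # [6, 8, 12, 14]
```

Once the patches sit in a matrix, applying many filters becomes one matrix multiplication, which is the point of the transformation.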

Loss Layers

Loss drives learning by comparing an output to a target and assigning cost to minimize. The loss itself is computed by the forward pass and the gradient w.r.t. the loss is computed by the backward pass.

Softmax

  • Layer type: SoftmaxWithLoss

The softmax loss layer computes the multinomial logistic loss of the softmax of its inputs. It’s conceptually identical to a softmax layer followed by a multinomial logistic loss layer, but provides a more numerically stable gradient.
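One common way to get that numerical stability is to fold the softmax and the log into a single log-sum-exp. The sketch below shows this for one sample; softmax_loss is a hypothetical helper and this is not necessarily how Caffe implements the layer internally.

```python
import math


def softmax_loss(scores, label):
    """Multinomial logistic loss of softmax(scores) for a single sample."""
    m = max(scores)  # subtracting the max keeps exp() from overflowing
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


print(softmax_loss([0.0, 0.0], 0))     # log(2), about 0.693
print(softmax_loss([1000.0, 0.0], 0))  # 0.0, with no overflow
```

A naive exp(1000.0) would overflow, which is why the two steps are better computed together.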

Sum-of-Squares / Euclidean

  • Layer type: EuclideanLoss

The Euclidean loss layer computes the sum of squares of differences of its two inputs, $\frac{1}{2N} \sum_{i=1}^{N} \| x^1_i - x^2_i \|_2^2$.
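In code, that is the squared difference summed over all elements and divided by 2N, where N is the number of samples. A minimal sketch with a hypothetical helper and made-up inputs:

```python
def euclidean_loss(pred, target):
    """Sum of squared differences over all samples, divided by 2N."""
    n = len(pred)
    total = sum((p - t) ** 2
                for x1, x2 in zip(pred, target)
                for p, t in zip(x1, x2))
    return total / (2.0 * n)


pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [0.0, 4.0]]
print(euclidean_loss(pred, target))  # (4 + 9) / (2 * 2) = 3.25
```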

Hinge / Margin

  • Layer type: HingeLoss
  • CPU implementation: ./src/caffe/layers/hinge_loss_layer.cpp
  • CUDA GPU implementation: none yet
  • Parameters (HingeLossParameter hinge_loss_param)
    • Optional
      • norm [default L1]: the norm used. Currently L1, L2
  • Inputs
    • n * c * h * w Predictions
    • n * 1 * 1 * 1 Labels
  • Output
    • 1 * 1 * 1 * 1 Computed Loss
  • Samples

    # L1 Norm
    layer {
      name: "loss"
      type: "HingeLoss"
      bottom: "pred"
      bottom: "label"
    }
    
    # L2 Norm
    layer {
      name: "loss"
      type: "HingeLoss"
      bottom: "pred"
      bottom: "label"
      top: "loss"
      hinge_loss_param {
        norm: L2
      }
    }
    

The hinge loss layer computes a one-vs-all hinge or squared hinge loss.
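Here is how I understand the one-vs-all hinge loss, as a sketch rather than a copy of the Caffe code: the true class is pushed above +1, every other class below -1, and the margins are summed (L1) or squared and summed (L2), then averaged over the batch. hinge_loss is a hypothetical helper and the scores are made up.

```python
def hinge_loss(preds, labels, norm="L1"):
    """One-vs-all hinge (L1) or squared hinge (L2) loss, averaged over samples."""
    total = 0.0
    for scores, label in zip(preds, labels):
        for k, s in enumerate(scores):
            sign = 1.0 if k == label else -1.0
            margin = max(0.0, 1.0 - sign * s)
            total += margin if norm == "L1" else margin ** 2
    return total / len(preds)


preds = [[2.0, -1.0, 0.5]]  # one sample, three class scores
print(hinge_loss(preds, [0], norm="L1"))  # 1.5
print(hinge_loss(preds, [0], norm="L2"))  # 2.25
```

Only the third class violates its margin here (1 + 0.5 = 1.5), so it alone contributes to the loss.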

Sigmoid Cross-Entropy

SigmoidCrossEntropyLoss

Infogain

InfogainLoss

Accuracy and Top-k

Accuracy scores the output as the accuracy of output with respect to target – it is not actually a loss and has no backward step.
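A top-k accuracy check is simple enough to sketch in a few lines. top_k_accuracy is a hypothetical helper, the scores are made up, and ties are not handled in any special way.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 correct
```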

Activation / Neuron Layers

In general, activation / neuron layers are element-wise operators, taking one bottom blob and producing one top blob of the same size. In the layers below, we will ignore the input and output sizes as they are identical:

  • Input
    • n * c * h * w
  • Output
    • n * c * h * w

ReLU / Rectified-Linear and Leaky-ReLU

  • Layer type: ReLU
  • CPU implementation: ./src/caffe/layers/relu_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/relu_layer.cu
  • Parameters (ReLUParameter relu_param)
    • Optional
      • negative_slope [default 0]: specifies whether to leak the negative part by multiplying it with the slope value rather than setting it to 0.
  • Sample (as seen in ./models/bvlc_reference_caffenet/train_val.prototxt)

    layer {
      name: "relu1"
      type: "ReLU"
      bottom: "conv1"
      top: "conv1"
    }
    

Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0. When the negative slope parameter is not set, it is equivalent to the standard ReLU function of taking max(x, 0). It also supports in-place computation, meaning that the bottom and the top blob can be the same, which saves memory.
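The element-wise rule is short enough to write out directly. relu here is a hypothetical stand-alone function for a single value, not the Caffe layer itself.

```python
def relu(x, negative_slope=0.0):
    """x if x > 0, otherwise negative_slope * x."""
    return x if x > 0 else negative_slope * x


print(relu(3.0))                       # 3.0
print(relu(-2.0))                      # zero: standard ReLU
print(relu(-2.0, negative_slope=0.1))  # about -0.2: leaky ReLU
```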

Sigmoid

  • Layer type: Sigmoid
  • CPU implementation: ./src/caffe/layers/sigmoid_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/sigmoid_layer.cu
  • Sample (as seen in ./examples/mnist/mnist_autoencoder.prototxt)

    layer {
      name: "encode1neuron"
      bottom: "encode1"
      top: "encode1neuron"
      type: "Sigmoid"
    }
    

The Sigmoid layer computes the output as sigmoid(x) for each input element x.

TanH / Hyperbolic Tangent

  • Layer type: TanH
  • CPU implementation: ./src/caffe/layers/tanh_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/tanh_layer.cu
  • Sample

    layer {
      name: "layer"
      bottom: "in"
      top: "out"
      type: "TanH"
    }
    

The TanH layer computes the output as tanh(x) for each input element x.

Absolute Value

  • Layer type: AbsVal
  • CPU implementation: ./src/caffe/layers/absval_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/absval_layer.cu
  • Sample

    layer {
      name: "layer"
      bottom: "in"
      top: "out"
      type: "AbsVal"
    }
    

The AbsVal layer computes the output as abs(x) for each input element x.

Power

  • Layer type: Power
  • CPU implementation: ./src/caffe/layers/power_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/power_layer.cu
  • Parameters (PowerParameter power_param)
    • Optional
      • power [default 1]
      • scale [default 1]
      • shift [default 0]
  • Sample

    layer {
      name: "layer"
      bottom: "in"
      top: "out"
      type: "Power"
      power_param {
        power: 1
        scale: 1
        shift: 0
      }
    }
    

The Power layer computes the output as (shift + scale * x) ^ power for each input element x.

BNLL

  • Layer type: BNLL
  • CPU implementation: ./src/caffe/layers/bnll_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/bnll_layer.cu
  • Sample

    layer {
      name: "layer"
      bottom: "in"
      top: "out"
      type: "BNLL"
    }
    

The BNLL (binomial normal log likelihood) layer computes the output as log(1 + exp(x)) for each input element x.
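Computing log(1 + exp(x)) literally overflows for large x, so a common trick is to branch on the sign of x. The sketch below shows that trick with a hypothetical bnll function for a single value; it is not taken from the Caffe source.

```python
import math


def bnll(x):
    """log(1 + exp(x)), rearranged so exp() never sees a large positive argument."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


print(bnll(0.0))     # log(2), about 0.693
print(bnll(1000.0))  # 1000.0, where the naive formula would overflow
```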

Data Layers

Data enters Caffe through data layers: they lie at the bottom of nets. Data can come from efficient databases (LevelDB or LMDB), directly from memory, or, when efficiency is not critical, from files on disk in HDF5 or common image formats.

Common input preprocessing (mean subtraction, scaling, random cropping, and mirroring) is available by specifying TransformationParameters.

Database

  • Layer type: Data
  • Parameters
    • Required
      • source: the name of the directory containing the database
      • batch_size: the number of inputs to process at one time
    • Optional
      • rand_skip: skip up to this number of inputs at the beginning; useful for asynchronous sgd
      • backend [default LEVELDB]: choose whether to use a LEVELDB or LMDB

In-Memory

  • Layer type: MemoryData
  • Parameters
    • Required
      • batch_size, channels, height, width: specify the size of input chunks to read from memory

The memory data layer reads data directly from memory, without copying it. In order to use it, one must call MemoryDataLayer::Reset (from C++) or Net.set_input_arrays (from Python) in order to specify a source of contiguous data (as 4D row major array), which is read one batch-sized chunk at a time.

HDF5 Input

  • Layer type: HDF5Data
  • Parameters
    • Required
      • source: the name of the file to read from
      • batch_size

HDF5 Output

  • Layer type: HDF5Output
  • Parameters
    • Required
      • file_name: name of file to write to

The HDF5 output layer performs the opposite function of the other layers in this section: it writes its input blobs to disk.

Images

  • Layer type: ImageData
  • Parameters
    • Required
      • source: name of a text file, with each line giving an image filename and label
      • batch_size: number of images to batch together
    • Optional
      • rand_skip
      • shuffle [default false]
      • new_height, new_width: if provided, resize all images to this size

Windows

WindowData

Dummy

DummyData is for development and debugging. See DummyDataParameter.

Common Layers

Inner Product

  • Layer type: InnerProduct
  • CPU implementation: ./src/caffe/layers/inner_product_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/inner_product_layer.cu
  • Parameters (InnerProductParameter inner_product_param)
    • Required
      • num_output (c_o): the number of filters
    • Strongly recommended
      • weight_filler [default type: 'constant' value: 0]
    • Optional
      • bias_filler [default type: 'constant' value: 0]
      • bias_term [default true]: specifies whether to learn and apply a set of additive biases to the filter outputs
  • Input
    • n * c_i * h_i * w_i
  • Output
    • n * c_o * 1 * 1
  • Sample

    layer {
      name: "fc8"
      type: "InnerProduct"
      # learning rate and decay multipliers for the weights
      param { lr_mult: 1 decay_mult: 1 }
      # learning rate and decay multipliers for the biases
      param { lr_mult: 2 decay_mult: 0 }
      inner_product_param {
        num_output: 1000
        weight_filler {
          type: "gaussian"
          std: 0.01
        }
        bias_filler {
          type: "constant"
          value: 0
        }
      }
      bottom: "fc7"
      top: "fc8"
    }
    

The InnerProduct layer (also usually referred to as the fully connected layer) treats the input as a simple vector and produces an output in the form of a single vector (with the blob’s height and width set to 1).

Splitting

The Split layer is a utility layer that splits an input blob to multiple output blobs. This is used when a blob is fed into multiple output layers.

Flattening

The Flatten layer is a utility layer that flattens an input of shape n * c * h * w to a simple vector output of shape n * (c*h*w).

Reshape

  • Layer type: Reshape
  • Implementation: ./src/caffe/layers/reshape_layer.cpp
  • Parameters (ReshapeParameter reshape_param)
    • Optional: (also see detailed description below)
      • shape
  • Input
    • a single blob with arbitrary dimensions
  • Output
    • the same blob, with modified dimensions, as specified by reshape_param
  • Sample

      layer {
        name: "reshape"
        type: "Reshape"
        bottom: "input"
        top: "output"
        reshape_param {
          shape {
            dim: 0  # copy the dimension from below
            dim: 2
            dim: 3
            dim: -1 # infer it from the other dimensions
          }
        }
      }
    

The Reshape layer can be used to change the dimensions of its input, without changing its data. Just like the Flatten layer, only the dimensions are changed; no data is copied in the process.

Output dimensions are specified by the ReshapeParam proto. Positive numbers are used directly, setting the corresponding dimension of the output blob. In addition, two special values are accepted for any of the target dimension values:

  • 0 means “copy the respective dimension of the bottom layer”. That is, if the bottom has 2 as its 1st dimension, the top will have 2 as its 1st dimension as well, given dim: 0 as the 1st target dimension.
  • -1 stands for “infer this from the other dimensions”. This behavior is similar to that of -1 in numpy’s reshape or [] in MATLAB’s reshape: this dimension is calculated to keep the overall element count the same as in the bottom layer. At most one -1 can be used in a reshape operation.

As another example, specifying reshape_param { shape { dim: 0 dim: -1 } } makes the layer behave in exactly the same way as the Flatten layer.
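The rules for 0 and -1 can be sketched as a small shape calculator. resolve_reshape is a hypothetical helper that only works on shape tuples, and the bottom shape (64, 3, 4, 6) is a made-up example.

```python
def resolve_reshape(bottom_shape, dims):
    """Resolve target dims: 0 copies the bottom dimension, -1 is inferred."""
    out = [bottom_shape[i] if d == 0 else d for i, d in enumerate(dims)]
    if out.count(-1) > 1:
        raise ValueError("at most one -1 is allowed")
    total = 1
    for s in bottom_shape:
        total *= s
    if -1 in out:
        known = 1
        for d in out:
            if d != -1:
                known *= d
        if total % known != 0:
            raise ValueError("element count cannot be preserved")
        out[out.index(-1)] = total // known
    return tuple(out)


print(resolve_reshape((64, 3, 4, 6), [0, 2, 3, -1]))  # (64, 2, 3, 12)
print(resolve_reshape((64, 3, 4, 6), [0, -1]))        # (64, 72), same as Flatten
```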

Concatenation

  • Layer type: Concat
  • CPU implementation: ./src/caffe/layers/concat_layer.cpp
  • CUDA GPU implementation: ./src/caffe/layers/concat_layer.cu
  • Parameters (ConcatParameter concat_param)
    • Optional
      • axis [default 1]: 0 for concatenation along num and 1 for channels.
  • Input
    • n_i * c_i * h * w for each input blob i from 1 to K.
  • Output
    • if axis = 0: (n_1 + n_2 + ... + n_K) * c_1 * h * w, and all input c_i should be the same.
    • if axis = 1: n_1 * (c_1 + c_2 + ... + c_K) * h * w, and all input n_i should be the same.
  • Sample

    layer {
      name: "concat"
      bottom: "in1"
      bottom: "in2"
      top: "out"
      type: "Concat"
      concat_param {
        axis: 1
      }
    }
    

The Concat layer is a utility layer that concatenates its multiple input blobs to one single output blob.
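The output shape rules for the two axes can be sketched as follows. concat_shape is a hypothetical helper that only works on shape tuples, and the shapes are made-up examples.

```python
def concat_shape(shapes, axis=1):
    """Output shape of concatenating blobs along axis; other dims must match."""
    first = shapes[0]
    for s in shapes[1:]:
        if len(s) != len(first):
            raise ValueError("all inputs need the same number of dimensions")
        for i, (a, b) in enumerate(zip(s, first)):
            if i != axis and a != b:
                raise ValueError("dimension %d does not match" % i)
    out = list(first)
    out[axis] = sum(s[axis] for s in shapes)
    return tuple(out)


print(concat_shape([(8, 3, 5, 5), (8, 4, 5, 5)], axis=1))  # (8, 7, 5, 5)
print(concat_shape([(2, 3, 5, 5), (6, 3, 5, 5)], axis=0))  # (8, 3, 5, 5)
```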

Slicing

The Slice layer is a utility layer that slices an input layer to multiple output layers along a given dimension (currently num or channel only) with given slice indices.

  • Sample

    layer {
      name: "slicer_label"
      type: "Slice"
      bottom: "label"
      ## Example of label with a shape N x 3 x 1 x 1
      top: "label1"
      top: "label2"
      top: "label3"
      slice_param {
        axis: 1
        slice_point: 1
        slice_point: 2
      }
    }
    

axis indicates the target axis; slice_point indicates indexes in the selected dimension (the number of indices must be equal to the number of top blobs minus one).
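To see how slice_point values turn into top blob sizes, here is a tiny sketch. slice_sizes is a hypothetical helper that returns the size of each top blob along the sliced axis.

```python
def slice_sizes(axis_size, slice_points):
    """Sizes of the top blobs along the sliced axis, one more than the slice points."""
    bounds = [0] + list(slice_points) + [axis_size]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


# The sample above: a label of shape N x 3 x 1 x 1 sliced along axis 1 at 1 and 2.
print(slice_sizes(3, [1, 2]))   # [1, 1, 1]
print(slice_sizes(10, [2, 7]))  # [2, 5, 3]
```

Two slice points give three top blobs, which matches the rule that the number of indices equals the number of tops minus one.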

Elementwise Operations

Eltwise

Argmax

ArgMax

Softmax

Softmax

Mean-Variance Normalization

MVN

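Let's fine-tune the CaffeNet model for style recognition using the “Flickr Style” data.


Fine-tuning takes an already-trained model, adapts its architecture to a new purpose, and resumes training from the already-learned model weights. Let's fine-tune the BVLC-distributed CaffeNet model on a different dataset (Flickr Style) to predict 20 image styles instead of recognizing 1,000 object categories.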

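Explanation

The Flickr images in the style dataset look very similar to the ImageNet dataset on which the bvlc_reference_caffenet model was trained. The CaffeNet model works well for object classification, and we would like to use it for our style classification as well. We have 80,000 images to train on for our purpose, and we will start fine-tuning from parameters learned on 1,000,000 ImageNet images. If we provide the weights argument to the caffe train command, the pretrained weights are loaded into our model, with layers matched by name. In other words, the new classifier is built on top of the previously trained model.

Instead of the original 1,000 classes we will predict 20 style classes, so we need to change the last layer of the model: in the prototxt we rename it from fc8 to fc8_flickr. Since bvlc_reference_caffenet has no layer named fc8_flickr, that layer begins training with random weights.

We will also decrease the overall learning rate (base_lr) in the solver prototxt, but increase lr_mult on the newly introduced layer. The idea is to have the rest of the model change very slowly with the new data while letting the new layer learn fast. In addition, we set stepsize in the solver to a lower value than we would use when training from scratch, since we are virtually far along in training and therefore want the learning rate to go down faster. Note that we could also entirely prevent fine-tuning of all layers other than fc8_flickr by setting their lr_mult to 0.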
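The interplay of base_lr, stepsize, and lr_mult can be sketched numerically, assuming the solver uses a "step" learning rate policy. Both functions are hypothetical helpers; base_lr = 0.001 matches the lr printed in the training log, while gamma = 0.1, stepsize = 20000, and lr_mult = 10 are assumed example values, not settings confirmed by this text.

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Step policy: multiply the rate by gamma once every stepsize iterations."""
    return base_lr * gamma ** (iteration // stepsize)


def layer_lr(global_lr, lr_mult):
    """Per-layer rate: the global rate scaled by the layer's lr_mult."""
    return global_lr * lr_mult


base_lr, gamma, stepsize = 0.001, 0.1, 20000  # gamma and stepsize are assumptions
for it in (0, 20000, 40000):
    g = step_lr(base_lr, gamma, stepsize, it)
    # lr_mult 1 for pretrained layers, 10 (assumed) for the new layer, 0 to freeze
    print(it, g, layer_lr(g, 1), layer_lr(g, 10), layer_lr(g, 0))
```

A smaller stepsize makes the drops happen earlier, and lr_mult = 0 freezes a layer regardless of the global rate.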

Procedure

All steps are run from the caffe root directory.

The dataset is distributed as a list of URLs with corresponding labels. Using the script assemble_data.py, we will download a subset of the data and split it into train and val sets.

caffe % python examples/finetune_flickr_style/assemble_data.py --workers=-1 --images=2000 --seed 831486
Downloading 2000 images with 7 workers...
Writing train/val for 1939 successfully downloaded images.

This script downloads the images and writes the train/val file lists into the data/flickr_style folder. The prototxts in this example assume that these exist, and they also assume the presence of the ImageNet mean file (if you do not have it, use get_ilsvrc_aux.sh in the data/ilsvrc12 folder to download it).

We will also need the ImageNet-trained model (models/bvlc_reference_caffenet.caffemodel); if you do not have it, obtain it with ./scripts/download_model_binary.py.

Now we can train! (You can also fine-tune in CPU mode by leaving out the -gpu flag.)

caffe % ./build/tools/caffe train -solver models/finetune_flickr_style/solver.prototxt -weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel -gpu 0

[...]

I0828 22:10:04.025378  9718 solver.cpp:46] Solver scaffolding done.
I0828 22:10:04.025388  9718 caffe.cpp:95] Use GPU with device ID 0
I0828 22:10:04.192004  9718 caffe.cpp:107] Finetuning from models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel

[...]

I0828 22:17:48.338963 11510 solver.cpp:165] Solving FlickrStyleCaffeNet
I0828 22:17:48.339010 11510 solver.cpp:251] Iteration 0, Testing net (#0)
I0828 22:18:14.313817 11510 solver.cpp:302]     Test net output #0: accuracy = 0.0308
I0828 22:18:14.476822 11510 solver.cpp:195] Iteration 0, loss = 3.78589
I0828 22:18:14.476878 11510 solver.cpp:397] Iteration 0, lr = 0.001
I0828 22:18:19.700408 11510 solver.cpp:195] Iteration 20, loss = 3.25728
I0828 22:18:19.700461 11510 solver.cpp:397] Iteration 20, lr = 0.001
I0828 22:18:24.924685 11510 solver.cpp:195] Iteration 40, loss = 2.18531
I0828 22:18:24.924741 11510 solver.cpp:397] Iteration 40, lr = 0.001
I0828 22:18:30.114858 11510 solver.cpp:195] Iteration 60, loss = 2.4915
I0828 22:18:30.114910 11510 solver.cpp:397] Iteration 60, lr = 0.001
I0828 22:18:35.328071 11510 solver.cpp:195] Iteration 80, loss = 2.04539
I0828 22:18:35.328127 11510 solver.cpp:397] Iteration 80, lr = 0.001
I0828 22:18:40.588317 11510 solver.cpp:195] Iteration 100, loss = 2.1924
I0828 22:18:40.588373 11510 solver.cpp:397] Iteration 100, lr = 0.001
I0828 22:18:46.171576 11510 solver.cpp:195] Iteration 120, loss = 2.25107
I0828 22:18:46.171669 11510 solver.cpp:397] Iteration 120, lr = 0.001
I0828 22:18:51.757809 11510 solver.cpp:195] Iteration 140, loss = 1.355
I0828 22:18:51.757863 11510 solver.cpp:397] Iteration 140, lr = 0.001
I0828 22:18:57.345080 11510 solver.cpp:195] Iteration 160, loss = 1.40815
I0828 22:18:57.345135 11510 solver.cpp:397] Iteration 160, lr = 0.001
I0828 22:19:02.928794 11510 solver.cpp:195] Iteration 180, loss = 1.6558
I0828 22:19:02.928850 11510 solver.cpp:397] Iteration 180, lr = 0.001
I0828 22:19:08.514497 11510 solver.cpp:195] Iteration 200, loss = 0.88126
I0828 22:19:08.514552 11510 solver.cpp:397] Iteration 200, lr = 0.001

[...]

I0828 22:22:40.789010 11510 solver.cpp:195] Iteration 960, loss = 0.112586
I0828 22:22:40.789175 11510 solver.cpp:397] Iteration 960, lr = 0.001
I0828 22:22:46.376626 11510 solver.cpp:195] Iteration 980, loss = 0.0959077
I0828 22:22:46.376682 11510 solver.cpp:397] Iteration 980, lr = 0.001
I0828 22:22:51.687258 11510 solver.cpp:251] Iteration 1000, Testing net (#0)
I0828 22:23:17.438894 11510 solver.cpp:302]     Test net output #0: accuracy = 0.2356

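Note how rapidly the loss went down. An accuracy of 23.5% may look unimpressive, but it was reached in only 1,000 iterations, and it is evidence that the model is starting to learn quickly and well. Once the model is fine-tuned on the whole training set for 100,000 iterations, the final accuracy is 39.16%. This takes about 7 hours in Caffe on a K40 GPU.

For comparison, here is how the loss goes down when we do not start from a pre-trained model: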
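If you want to look at the loss curve rather than scroll through the log, the "Iteration N, loss = X" lines are easy to pull out with a regular expression. This is a sketch: parse_losses is a hypothetical helper and it assumes the log format shown above.

```python
import re

LOSS_RE = re.compile(r"Iteration (\d+), loss = ([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return (iteration, loss) pairs found in a Caffe training log."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in LOSS_RE.finditer(log_text)]


sample = """\
I0828 22:18:14.476822 11510 solver.cpp:195] Iteration 0, loss = 3.78589
I0828 22:18:14.476878 11510 solver.cpp:397] Iteration 0, lr = 0.001
I0828 22:19:08.514497 11510 solver.cpp:195] Iteration 200, loss = 0.88126
"""
print(parse_losses(sample))  # [(0, 3.78589), (200, 0.88126)]
```

The lr lines are skipped because they do not match the "loss =" part of the pattern.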

I0828 22:24:18.624004 12919 solver.cpp:165] Solving FlickrStyleCaffeNet
I0828 22:24:18.624099 12919 solver.cpp:251] Iteration 0, Testing net (#0)
I0828 22:24:44.520992 12919 solver.cpp:302]     Test net output #0: accuracy = 0.0366
I0828 22:24:44.676905 12919 solver.cpp:195] Iteration 0, loss = 3.47942
I0828 22:24:44.677120 12919 solver.cpp:397] Iteration 0, lr = 0.001
I0828 22:24:50.152454 12919 solver.cpp:195] Iteration 20, loss = 2.99694
I0828 22:24:50.152509 12919 solver.cpp:397] Iteration 20, lr = 0.001
I0828 22:24:55.736256 12919 solver.cpp:195] Iteration 40, loss = 3.0498
I0828 22:24:55.736311 12919 solver.cpp:397] Iteration 40, lr = 0.001
I0828 22:25:01.316514 12919 solver.cpp:195] Iteration 60, loss = 2.99549
I0828 22:25:01.316567 12919 solver.cpp:397] Iteration 60, lr = 0.001
I0828 22:25:06.899554 12919 solver.cpp:195] Iteration 80, loss = 3.00573
I0828 22:25:06.899610 12919 solver.cpp:397] Iteration 80, lr = 0.001
I0828 22:25:12.484624 12919 solver.cpp:195] Iteration 100, loss = 2.99094
I0828 22:25:12.484678 12919 solver.cpp:397] Iteration 100, lr = 0.001
I0828 22:25:18.069056 12919 solver.cpp:195] Iteration 120, loss = 3.01616
I0828 22:25:18.069149 12919 solver.cpp:397] Iteration 120, lr = 0.001
I0828 22:25:23.650928 12919 solver.cpp:195] Iteration 140, loss = 2.98786
I0828 22:25:23.650984 12919 solver.cpp:397] Iteration 140, lr = 0.001
I0828 22:25:29.235535 12919 solver.cpp:195] Iteration 160, loss = 3.00724
I0828 22:25:29.235589 12919 solver.cpp:397] Iteration 160, lr = 0.001
I0828 22:25:34.816898 12919 solver.cpp:195] Iteration 180, loss = 3.00099
I0828 22:25:34.816953 12919 solver.cpp:397] Iteration 180, lr = 0.001
I0828 22:25:40.396656 12919 solver.cpp:195] Iteration 200, loss = 2.99848
I0828 22:25:40.396711 12919 solver.cpp:397] Iteration 200, lr = 0.001

[...]

I0828 22:29:12.539094 12919 solver.cpp:195] Iteration 960, loss = 2.99203
I0828 22:29:12.539258 12919 solver.cpp:397] Iteration 960, lr = 0.001
I0828 22:29:18.123092 12919 solver.cpp:195] Iteration 980, loss = 2.99345
I0828 22:29:18.123147 12919 solver.cpp:397] Iteration 980, lr = 0.001
I0828 22:29:23.432059 12919 solver.cpp:251] Iteration 1000, Testing net (#0)
I0828 22:29:49.409044 12919 solver.cpp:302]     Test net output #0: accuracy = 0.0572

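Because this model starts learning from scratch, it gets off to a poor start.

Fine-tuning is a good fit when training from scratch would not work well for lack of time or data. It is fast even in CPU mode, and with GPU fine-tuning a useful model can be learned in minutes or hours rather than days or weeks. Learning a new task such as Flickr style recognition from an ImageNet-pretrained model can also require less data than training from scratch.

Now try fine-tuning on your own tasks and data!

Trained model

We provide a model trained on all 80,000 images, with a final accuracy of 39%. You can get it simply by running ./scripts/download_model_binary.py models/finetune_flickr_style.

License

The Flickr Style dataset is available only as a list of image URLs. Some of the images may be copyrighted. Training a category-recognition model for research/non-commercial use is probably possible with this data, but the result should not be used for commercial purposes.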



Translated from http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html

* A collection of pre-built models

 

Network in Network model

This model is described in detail in the following ICLR-2014 paper:

Network In Network
M. Lin, Q. Chen, S. Yan
International Conference on Learning Representations, 2014 (arXiv:1312.4400)

Please cite the paper if you use the models.


Models from the BMVC-2014 paper "Return of the Devil in the Details: Delving Deep into Convolutional Nets"

These models were trained on the ILSVRC-2012 dataset. For details, see the project page and the BMVC-2014 paper:

Return of the Devil in the Details: Delving Deep into Convolutional Nets
K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman
British Machine Vision Conference, 2014 (arXiv:1405.3531)

Please cite the paper if you use the models.


Models used by the VGG team in ILSVRC-2014

These are improved versions of the models used by the VGG team in ILSVRC-2014. See the project page and the arXiv paper:

Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan, A. Zisserman
arXiv:1409.1556

Please cite the paper if you use the models.


Places-CNN model from MIT.

Places CNN is described in the following NIPS 2014 paper:

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva
Learning Deep Features for Scene Recognition using Places Database.
Advances in Neural Information Processing Systems 27 (NIPS) spotlight, 2014.

The project page is here


GoogLeNet GPU implementation from Princeton.

We implemented GoogLeNet using a single GPU. Our main contribution is an effective way to initialize the network and a trick to overcome the GPU memory constraint by accumulating gradients over two training iterations.


Fully Convolutional Semantic Segmentation Models (FCN-Xs)

These models are described in the paper:

Fully Convolutional Networks for Semantic Segmentation
Jonathan Long, Evan Shelhamer, Trevor Darrell
CVPR 2015
arXiv:1411.4038


CaffeNet fine-tuned for Oxford flowers dataset

https://gist.github.com/jimgoo/0179e52305ca768a601f

This is the reference CaffeNet (modified AlexNet) fine-tuned for the Oxford 102 category flower dataset. The number of outputs in the inner product layer has been set to 102 to reflect the number of flower categories. Hyperparameter choices reflect those in Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data. The global learning rate is reduced while the learning rate for the final fully connected layer is increased relative to the other layers.


CNN Models for Salient Object Subitizing.

CNN models described in the following CVPR'15 paper "Salient Object Subitizing":

Salient Object Subitizing
J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price and R. Mech. 
CVPR, 2015.

Deep Learning of Binary Hash Codes for Fast Image Retrieval

We present an effective deep learning framework to create the hash-like binary codes for fast image retrieval. The details can be found in the following "CVPRW'15 paper":

Deep Learning of Binary Hash Codes for Fast Image Retrieval
K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen
CVPR 2015, DeepVision workshop


Places_CNDS_models on Scene Recognition

The details of training this model are described in the following report. Please cite this work if the model is useful for you.

Training Deeper Convolutional Networks with Deep Supervision
L. Wang, C. Lee, Z. Tu, S. Lazebnik, arXiv:1505.02496, 2015


Models for Age and Gender Classification.


GoogLeNet_cars on car model classification

GoogLeNet_cars is the GoogLeNet model pre-trained on ImageNet classification task and fine-tuned on 431 car models in CompCars dataset. It is described in the technical report. Please cite the following work if the model is useful for you.

A Large-Scale Car Dataset for Fine-Grained Categorization and Verification
L. Yang, P. Luo, C. C. Loy, X. Tang, arXiv:1506.08959, 2015


ParseNet: Looking wider to see better

These models are described in the paper:

ParseNet: Looking Wider to See Better
Wei Liu, Andrew Rabinovich, Alexander C. Berg
arXiv:1506.04579


SegNet and Bayesian SegNet

SegNet is a real-time semantic segmentation architecture for scene understanding. Code and trained models for SegNet and Bayesian SegNet are available.

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla
arXiv preprint arXiv:1511.00561, 2015

Conditional Random Fields as Recurrent Neural Networks

Code (with Matlab/Python API) and model are described in the ICCV 2015 paper

Conditional Random Fields as Recurrent Neural Networks
S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr
ICCV 2015.


Holistically-Nested Edge Detection

The model and code provided are described in the ICCV 2015 paper:

Holistically-Nested Edge Detection
Saining Xie and Zhuowen Tu
ICCV 2015


Translating Videos to Natural Language

These models are described in this NAACL-HLT 2015 paper.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks 
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko   
NAACL-HLT 2015

More details can be found on this project page.


VGG Face CNN descriptor

These models are described in this BMVC 2015 paper.

Deep Face Recognition 
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman    
BMVC 2015

More details can be found on this project page.


Yearbook Photo Dating

Model from the ICCV 2015 Extreme Imaging Workshop paper:

A Century of Portraits: Exploring the Visual Historical Record of American High School Yearbooks 
Shiry Ginosar, Kate Rakelly, Brian Yin, Sarah Sachs, Alyosha Efros
ICCV Workshop 2015

Model and prototxt files: Yearbook


CCNN: Constrained Convolutional Neural Networks for Weakly Supervised Segmentation

These models are described in the ICCV 2015 paper.

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation
Deepak Pathak, Philipp Krähenbühl, Trevor Darrell
ICCV 2015
arXiv:1506.03648

These are pre-release models. They do not run in any current version of BVLC/caffe, as they require unmerged PRs. Full details, source code, models, prototxts are available here: CCNN.


Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns

We provide models for facial emotion classification for different image representations obtained using mapped binary patterns. See the Project page for more details.

The models are described in the following paper:

Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns
Gil Levi and Tal Hassner
Proc. ACM International Conference on Multimodal Interaction (ICMI), Seattle, Nov. 2015

If you find our models useful, please add suitable reference to our paper in your work.


Facial Landmark Detection with Tweaked Convolutional Neural Networks

We provide source code and a model for the article: Yue Wu and Tal Hassner, "Facial Landmark Detection with Tweaked Convolutional Neural Networks", arXiv preprint arXiv:1511.04031, 12 Nov. 2015. See the project page for more information about this project.

Written by Ishay Tubi

This software is provided as is, without any warranty, with no legal constraints. If you find our models useful, please add suitable reference to our paper in your work.


Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Download pre-computed Faster R-CNN detectors:

cd $FRCN_ROOT
./data/scripts/fetch_faster_rcnn_models.sh

This will populate the $FRCN_ROOT/data folder with faster_rcnn_models. See data/README.md for details. These models were trained on VOC 2007 trainval.


Sequence to Sequence - Video to Text

These models are described in this ICCV 2015 paper.

Sequence to Sequence - Video to Text
S. Venugopalan, M. Rohrbach, J. Donahue, T. Darrell, R. Mooney, K. Saenko
The IEEE International Conference on Computer Vision (ICCV) 2015

More details can be found on this project page.

Model:
S2VT_VGG_RGB:
This is the S2VT (RGB) model described in the ICCV 2015 paper. It uses video frame features from the VGG-16 layer model. This is trained only on the Youtube video dataset.

Compatibility:
These are pre-release models. They do not run in any current version of BVLC/caffe, as they require unmerged PRs. The models are currently supported by the recurrent branch of the Caffe fork provided at https://github.com/jeffdonahue/caffe/tree/recurrent and https://github.com/vsubhashini/caffe/tree/recurrent.


ResNets: Deep Residual Networks from MSRA at ImageNet and COCO 2015

This repository contains the original models (ResNet-50, ResNet-101, and ResNet-152) described in the paper "Deep Residual Learning for Image Recognition" (http://arxiv.org/abs/1512.03385). These are the models used in the ILSVRC and COCO 2015 competitions, where they won 1st place in ImageNet classification, ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

More instructions with prototxt and binary weight files are in: https://github.com/KaimingHe/deep-residual-networks



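Assuming you installed Caffe for image classification, it comes down to the following two things.

1.  Train on my own images (or the sample images) to build my own model.

2.  Use my model (or a sample model) to see how well it understands (classifies) images.


Terminology)

* Defining a model: the network structure, parameters, and so on that are used to train on the data, taken together. Training with this definition produces the 'model'.

* Fine-tuning: adjusting an existing model and searching for better values, in order to get a better-suited model or better classification.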


1.  Train on your data


a. Caffe already comes with several pretrained models. Let's look at the one trained on ImageNet images as an example.

/path/to/imagenet/train/n01440764/n01440764_10026.JPEG  //   ImageNet images like this are the input.

b. You also need to convert your images into the input data format Caffe expects (lmdb or leveldb).

For ImageNet, examples/imagenet/create_imagenet.sh was used for this, so take a look at it.

c. To train and classify better, compute the image mean and subtract it from each image.

./examples/imagenet/make_imagenet_mean.sh    // For ImageNet, this script computes the mean.

data/ilsvrc12/imagenet_mean.binaryproto   // The mean computed from the JPEGs is stored in this binaryproto format.
The blobproto_to_array function in io.py (previously convert.py) and similar helpers are then used to turn it into the final ilsvrc_2012_mean.npy.

d. Define the model.

 These are the files used to build the ImageNet model (in other words, use them as a reference when building your own):

 The network definition file is models/bvlc_reference_caffenet/train_val.prototxt,

 the input data are ilsvrc12_train_leveldb and ilsvrc12_val_leveldb,

 and the solver is bvlc_reference_caffenet/solver.prototxt.


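e. Train on the images.

./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt

The ./examples/imagenet/train_caffenet.sh script wraps this for convenience.

In other words, the caffe train command trains on the data, and it takes the solver definition as its argument.


f. Intermediate training results are saved as snapshots, and training can be resumed from one of them.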

./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --snapshot=models/bvlc_reference_caffenet/caffenet_train_iter_10000.solverstate


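2. Use the model to see how well it understands (classifies) my images.


 a. For ImageNet, models/bvlc_reference_caffenet/deploy.prototxt holds the model definition used for classification, and models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel, the final model produced in step 1, is of course used as well.


 b. After a fresh Caffe install, you can go to the Caffe root and simply run run cat1.jpg to have it recognize the cat. Internally, run calls python/classify.py. The relevant part of classify.py looks like the excerpt below, and you can take it and use it to classify your own data.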

# Build the classifier (inputs: the model from step 1, the mean data, etc.)
classifier = caffe.Classifier(args.model_def, args.pretrained_model,
        image_dims=image_dims, mean=mean,
        input_scale=args.input_scale, raw_scale=args.raw_scale,
        channel_swap=channel_swap)

# Load the cat image (npy file, image folder, or a single image such as jpg)
args.input_file = os.path.expanduser(args.input_file)
if args.input_file.endswith('npy'):
    print("Loading file: %s" % args.input_file)
    inputs = np.load(args.input_file)
elif os.path.isdir(args.input_file):
    print("Loading folder: %s" % args.input_file)
    inputs = [caffe.io.load_image(im_f)
              for im_f in glob.glob(args.input_file + '/*.' + args.ext)]
else:
    print("Loading file: %s" % args.input_file)
    inputs = [caffe.io.load_image(args.input_file)]

print("Classifying %d inputs." % len(inputs))

# Run the classification
start = time.time()
predictions = classifier.predict(inputs, not args.center_only)
print("Done in %.2f s." % (time.time() - start))
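Once you have predictions, you usually want the most likely classes. The sketch below assumes predictions is an (N x C) NumPy array of class probabilities, which is how classifier.predict documents its return value; the top_k helper and the fake 4-class array are made up for illustration, so it runs without Caffe.

```python
import numpy as np

def top_k(predictions, k=5):
    """Return (class indices, probabilities) of the k best classes per input."""
    order = np.argsort(predictions, axis=1)[:, ::-1][:, :k]
    probs = np.take_along_axis(predictions, order, axis=1)
    return order, probs

# Stand-in for classifier.predict output: 1 input, 4 classes.
fake_predictions = np.array([[0.1, 0.6, 0.05, 0.25]])
indices, probs = top_k(fake_predictions, k=2)
```

The returned indices can then be looked up in synset_words.txt to get human-readable class names.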


3. Deep learning concepts in brief

 See http://sanghyukchun.github.io/75/


Brewing ImageNet

This guide is meant to get you ready to train your own model on your own data. If you just want an ImageNet-trained network, then note that since training takes a lot of energy and we hate global warming, we provide the CaffeNet model trained as described below in the model zoo.

Data Preparation

The guide specifies all paths and assumes all commands are executed from the root caffe directory.

By “ImageNet” we here mean the ILSVRC12 challenge, but you can easily train on the whole of ImageNet as well, just with more disk space, and a little longer training time.

We assume that you already have downloaded the ImageNet training data and validation data, and they are stored on your disk like:

/path/to/imagenet/train/n01440764/n01440764_10026.JPEG
/path/to/imagenet/val/ILSVRC2012_val_00000001.JPEG

You will first need to prepare some auxiliary data for training. This data can be downloaded by:

./data/ilsvrc12/get_ilsvrc_aux.sh

The training and validation input are described in train.txt and val.txt as text listing all the files and their labels. Note that we use a different indexing for labels than the ILSVRC devkit: we sort the synset names in their ASCII order, and then label them from 0 to 999. See synset_words.txt for the synset/name mapping.
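The labeling rule above (sort the synset names in ASCII order, then number them from 0) is easy to reproduce for your own dataset. The sketch below uses hypothetical helpers build_label_map and listing_line, a three-synset toy list instead of the full 1000, and assumes the listing file has one "<relative path> <label>" entry per line.

```python
def build_label_map(synsets):
    """Map each synset ID to an integer label by sorting the IDs in ASCII order."""
    return {synset: label for label, synset in enumerate(sorted(synsets))}

def listing_line(relative_path, label_map):
    """Format one listing entry as '<relative path> <label>' (assumed layout)."""
    synset = relative_path.split("/")[0]
    return "%s %d" % (relative_path, label_map[synset])

# Toy subset of synset folders, deliberately given out of order.
synsets = ["n01443537", "n01440764", "n01484850"]
label_map = build_label_map(synsets)
line = listing_line("n01440764/n01440764_10026.JPEG", label_map)
```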

You may want to resize the images to 256x256 in advance. By default, we do not explicitly do this because in a cluster environment, one may benefit from resizing images in a parallel fashion, using mapreduce. For example, Yangqing used his lightweight mincepie package. If you prefer things to be simpler, you can also use shell commands, something like:

for name in /path/to/imagenet/val/*.JPEG; do
    convert -resize 256x256\! "$name" "$name"
done

Take a look at examples/imagenet/create_imagenet.sh. Set the paths to the train and val dirs as needed, and set “RESIZE=true” to resize all images to 256x256 if you haven’t resized the images in advance. Now simply create the leveldbs with examples/imagenet/create_imagenet.sh. Note that examples/imagenet/ilsvrc12_train_leveldb and examples/imagenet/ilsvrc12_val_leveldb should not exist before this execution. They will be created by the script. GLOG_logtostderr=1 simply dumps more information for you to inspect, and you can safely ignore it.

Compute Image Mean

The model requires us to subtract the image mean from each image, so we have to compute the mean. tools/compute_image_mean.cpp implements that - it is also a good example for familiarizing yourself with how to manipulate the multiple components, such as protocol buffers, leveldbs, and logging, if you are not familiar with them. Anyway, the mean computation can be carried out as:

./examples/imagenet/make_imagenet_mean.sh

which will make data/ilsvrc12/imagenet_mean.binaryproto.
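Conceptually, the mean image is just the per-pixel average over the training set, and "subtracting the mean" centers every input around zero. The NumPy sketch below is only an illustration of that idea, not what compute_image_mean.cpp literally does; the helper names and the tiny random images are assumptions.

```python
import numpy as np

def compute_mean_image(images):
    """Per-pixel mean over a stack of images shaped (N, C, H, W)."""
    return images.mean(axis=0)

def subtract_mean(image, mean_image):
    """Center a single (C, H, W) image by subtracting the mean image."""
    return image.astype(np.float32) - mean_image

rng = np.random.default_rng(0)
# Toy "dataset": 4 random 3-channel 8x8 images instead of 256x256 JPEGs.
images = rng.integers(0, 256, size=(4, 3, 8, 8)).astype(np.float32)

mean_image = compute_mean_image(images)
centered = np.stack([subtract_mean(im, mean_image) for im in images])
```

Note that the mean keeps the (C, H, W) shape of a single image, which matters when it is later passed to the classifier.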

Model Definition

We are going to describe a reference implementation for the approach first proposed by Krizhevsky, Sutskever, and Hinton in their NIPS 2012 paper.

The network definition (models/bvlc_reference_caffenet/train_val.prototxt) follows the one in Krizhevsky et al. Note that if you deviated from file paths suggested in this guide, you’ll need to adjust the relevant paths in the .prototxt files.

If you look carefully at models/bvlc_reference_caffenet/train_val.prototxt, you will notice several include sections specifying either phase: TRAIN or phase: TEST. These sections allow us to define two closely related networks in one file: the network used for training and the network used for testing. These two networks are almost identical, sharing all layers except for those marked with include { phase: TRAIN } or include { phase: TEST }. In this case, only the input layers and one output layer are different.

Input layer differences: The training network’s data input layer draws its data from examples/imagenet/ilsvrc12_train_leveldb and randomly mirrors the input image. The testing network’s data layer takes data from examples/imagenet/ilsvrc12_val_leveldb and does not perform random mirroring.
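Random mirroring is a simple augmentation: during training each image is flipped left-to-right at random, and at test time it is left alone. A minimal NumPy sketch follows; the random_mirror helper and the 50% flip probability are assumptions for illustration, not Caffe's data layer code.

```python
import numpy as np

def random_mirror(image, rng, training=True):
    """Flip a (C, H, W) image horizontally with probability 0.5 when training."""
    if training and rng.random() < 0.5:
        return image[:, :, ::-1]
    return image

rng = np.random.default_rng(0)
image = np.arange(12).reshape(1, 3, 4)   # toy 1-channel 3x4 "image"

test_output = random_mirror(image, rng, training=False)   # never flipped
train_output = random_mirror(image, rng, training=True)   # flipped or not
```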

Output layer differences: Both networks output the softmax_loss layer, which in training is used to compute the loss function and to initialize the backpropagation, while in validation this loss is simply reported. The testing network also has a second output layer, accuracy, which is used to report the accuracy on the test set. In the process of training, the test network will occasionally be instantiated and tested on the test set, producing lines like Test score #0: xxx and Test score #1: xxx. In this case score 0 is the accuracy (which will start around 1/1000 = 0.001 for an untrained network) and score 1 is the loss (which will start around 7 for an untrained network).
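The starting values quoted above are what you get from pure guessing over 1000 classes: accuracy is 1/1000, and the softmax loss of a uniform prediction is -ln(1/1000). A quick check in plain Python:

```python
import math

num_classes = 1000

# An untrained network is roughly guessing uniformly among the classes.
chance_accuracy = 1.0 / num_classes

# Softmax (cross-entropy) loss when every class gets probability 1/1000.
chance_loss = -math.log(1.0 / num_classes)
```

If the loss stays stuck near this value for a long time, the network is not learning anything yet.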

We will also lay out a protocol buffer for running the solver. Let’s make a few plans:

  • We will run in batches of 256, and run a total of 450,000 iterations (about 90 epochs).
  • For every 1,000 iterations, we test the learned net on the validation data.
  • We set the initial learning rate to 0.01, and decrease it every 100,000 iterations (about 20 epochs).
  • Information will be displayed every 20 iterations.
  • The network will be trained with momentum 0.9 and a weight decay of 0.0005.
  • For every 10,000 iterations, we will take a snapshot of the current status.

Sound good? This is implemented in models/bvlc_reference_caffenet/solver.prototxt.
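The epoch counts in the plan above can be checked with a little arithmetic, and the step schedule is easy to sketch. Two assumptions are made here that the text does not state: the ILSVRC12 training set has 1,281,167 images, and each decrease multiplies the learning rate by 0.1 (the `gamma` below). The helper names are hypothetical.

```python
NUM_TRAIN_IMAGES = 1281167  # assumption: size of the ILSVRC12 training set

def epochs_for(iterations, batch_size=256, num_images=NUM_TRAIN_IMAGES):
    """How many passes over the data a number of iterations corresponds to."""
    return iterations * batch_size / num_images

def step_learning_rate(iteration, base_lr=0.01, stepsize=100000, gamma=0.1):
    """Step policy: multiply the rate by gamma every `stepsize` iterations."""
    return base_lr * gamma ** (iteration // stepsize)

total_epochs = epochs_for(450000)      # about 90
epochs_per_step = epochs_for(100000)   # about 20
lr_after_first_step = step_learning_rate(100000)
```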

Training ImageNet

Ready? Let’s train.

./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt

Sit back and enjoy!

On a K40 machine, every 20 iterations take about 26.5 seconds to run (while on a K20 this takes 36 seconds), so effectively about 5.2 ms per image for the full forward-backward pass. About 2 ms of this is on forward, and the rest is backward. If you are interested in dissecting the computation time, you can run

./build/tools/caffe time --model=models/bvlc_reference_caffenet/train_val.prototxt
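The per-image figure for the K40 follows directly from the numbers given: 20 iterations of 256 images in 26.5 seconds. The same numbers also give a rough estimate of the full 450,000-iteration run; that estimate ignores the time spent on periodic testing and snapshots, so treat it as a lower bound.

```python
seconds_per_20_iterations = 26.5   # K40 figure quoted in the text
batch_size = 256
total_iterations = 450000

# Time per image for one forward-backward pass, in milliseconds.
ms_per_image = seconds_per_20_iterations / (20 * batch_size) * 1000

# Rough total training time in days (training iterations only).
total_days = total_iterations / 20 * seconds_per_20_iterations / 86400
```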

Resume Training?

We all experience times when the power goes out, or we feel like rewarding ourselves a little by playing Battlefield (does anyone still remember Quake?). Since we are snapshotting intermediate results during training, we will be able to resume from snapshots. This can be done as easily as:

./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --snapshot=models/bvlc_reference_caffenet/caffenet_train_iter_10000.solverstate

where in the script caffenet_train_iter_10000.solverstate is the solver state snapshot that stores all necessary information to recover the exact solver state (including the parameters, momentum history, etc).
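If you have many snapshots lying around, a small script can pick the newest one and build the resume command for you. This is only a convenience sketch: latest_snapshot and resume_command are hypothetical helpers, and the pattern assumes file names ending in _iter_<N>.solverstate like the one shown above.

```python
import re

SNAPSHOT_PATTERN = re.compile(r"_iter_(\d+)\.solverstate$")

def latest_snapshot(filenames):
    """Return the .solverstate name with the highest iteration, or None."""
    best_name, best_iter = None, -1
    for name in filenames:
        match = SNAPSHOT_PATTERN.search(name)
        if match and int(match.group(1)) > best_iter:
            best_name, best_iter = name, int(match.group(1))
    return best_name

def resume_command(solver, snapshot):
    """Build the argument list for resuming training from a snapshot."""
    return ["./build/tools/caffe", "train",
            "--solver=" + solver, "--snapshot=" + snapshot]

files = [
    "caffenet_train_iter_10000.solverstate",
    "caffenet_train_iter_20000.solverstate",
    "caffenet_train_iter_20000.caffemodel",   # weights only, not a solver state
]
newest = latest_snapshot(files)
command = resume_command("models/bvlc_reference_caffenet/solver.prototxt",
                         "models/bvlc_reference_caffenet/" + newest)
```

The resulting list can be handed to subprocess.run from the Caffe root directory.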

Parting Words

Hope you liked this recipe! Many researchers have gone further since the ILSVRC 2012 challenge, changing the network architecture and/or fine-tuning the various parameters in the network to address new data and tasks. Caffe lets you explore different network choices more easily by simply writing different prototxt files - isn’t that exciting?

And since now you have a trained network, check out how to use it with the Python interface for classifying ImageNet.
