The network defines the entire model bottom-to-top, from input data to loss. As data and derivatives flow through the network in the forward and backward passes, Caffe stores, communicates, and manipulates the information as blobs: the blob is the standard array and unified memory interface for the framework. The layer comes next as the foundation of both model and computation. The net follows as the collection and connection of layers. The details of the blob describe how information is stored and communicated in and across layers and nets.

Solving is configured separately to decouple modeling and optimization.

The sections below go over each of these components in more detail.

Blob storage and communication

A Blob is a wrapper over the actual data being processed and passed along by Caffe, and also under the hood provides synchronization capability between the CPU and the GPU. Mathematically, a blob is an N-dimensional array stored in a C-contiguous fashion.

Caffe stores and communicates data using blobs. Blobs provide a unified memory interface holding data; e.g., batches of images, model parameters, and derivatives for optimization.

Blobs conceal the computational and mental overhead of mixed CPU/GPU operation by synchronizing from the CPU host to the GPU device as needed. Memory on the host and device is allocated on demand (lazily) for efficient memory usage.

The conventional blob dimensions for batches of image data are number N x channel K x height H x width W. Blob memory is row-major in layout, so the last / rightmost dimension changes fastest. For example, in a 4D blob, the value at index (n, k, h, w) is physically located at index ((n * K + k) * H + h) * W + w.

  • Number / N is the batch size of the data. Batch processing achieves better throughput for communication and device processing. For an ImageNet training batch of 256 images N = 256.
  • Channel / K is the feature dimension, e.g., for RGB images K = 3.
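The row-major index formula above can be checked with a short sketch. The function name blob_offset and the example shape are illustrative assumptions, not part of Caffe.

```python
def blob_offset(shape, index):
    """Flat row-major offset of `index` inside an array of the given `shape`."""
    offset = 0
    for dim, i in zip(shape, index):
        if not 0 <= i < dim:
            raise IndexError("index out of range")
        # Horner-style accumulation: the rightmost axis changes fastest.
        offset = offset * dim + i
    return offset


# Hypothetical 4D blob: N x K x H x W
shape = (2, 3, 4, 5)
N, K, H, W = shape
n, k, h, w = 1, 2, 3, 4

# Agrees with the closed form given in the text.
assert blob_offset(shape, (n, k, h, w)) == ((n * K + k) * H + h) * W + w
print(blob_offset(shape, (n, k, h, w)))
```

Because the loop works for any number of axes, the same sketch also covers 2D blobs of shape (N, D).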

Note that although many blobs in Caffe examples are 4D with axes for image applications, it is totally valid to use blobs for non-image applications. For example, if you simply need fully-connected networks like the conventional multi-layer perceptron, use 2D blobs (shape (N, D)) and call the InnerProductLayer (which we will cover soon).

Parameter blob dimensions vary according to the type and configuration of the layer. For a convolution layer with 96 filters of 11 x 11 spatial dimension and 3 input channels the parameter blob is 96 x 3 x 11 x 11. For an inner product / fully-connected layer with 1000 output channels and 1024 input channels the parameter blob is 1000 x 1024.
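As a quick check of the two examples above, the following sketch computes the weight blob shapes and their element counts. The helper names are hypothetical, and bias blobs are deliberately left out.

```python
from math import prod


def conv_weight_shape(num_filters, in_channels, kernel_h, kernel_w):
    # filters x input channels x kernel height x kernel width
    return (num_filters, in_channels, kernel_h, kernel_w)


def inner_product_weight_shape(num_output, num_input):
    # output channels x input channels
    return (num_output, num_input)


conv = conv_weight_shape(96, 3, 11, 11)
ip = inner_product_weight_shape(1000, 1024)
print(conv, prod(conv))
print(ip, prod(ip))
```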

For custom data it may be necessary to hack together your own input preparation tool or data layer. However, once your data is in, your job is done. The modularity of layers accomplishes the rest of the work for you.

Implementation Details

As we are often interested in the values as well as the gradients of the blob, a Blob stores two chunks of memory, data and diff. The former is the normal data that we pass along, and the latter is the gradient computed by the network.

Further, as the actual values could be stored either on the CPU or on the GPU, there are two different ways to access them: the const way, which does not change the values, and the mutable way, which changes the values:

const Dtype* cpu_data() const;
Dtype* mutable_cpu_data();

(similarly for gpu and diff).

The reason for this design is that a Blob uses a SyncedMem class to synchronize values between the CPU and GPU in order to hide the synchronization details and to minimize data transfer. A rule of thumb is to always use the const call if you do not want to change the values, and to never store the pointers in your own object. Every time you work on a blob, call the functions to get the pointers, as the SyncedMem needs these calls to figure out when to copy data.

In practice when GPUs are present, one loads data from the disk to a blob in CPU code, calls a device kernel to do GPU computation, and ferries the blob off to the next layer, ignoring low-level details while maintaining a high level of performance. As long as all layers have GPU implementations, all the intermediate data and gradients will remain in the GPU.

If you want to check out when a Blob will copy data, here is an illustrative example:

// Assuming that data are on the CPU initially, and we have a blob.
const Dtype* foo;
Dtype* bar;
foo = blob.gpu_data(); // data copied cpu->gpu.
foo = blob.cpu_data(); // no data copied since both have up-to-date contents.
bar = blob.mutable_gpu_data(); // no data copied.
// ... some operations ...
bar = blob.mutable_gpu_data(); // no data copied when we are still on GPU.
foo = blob.cpu_data(); // data copied gpu->cpu, since the gpu side has modified the data
foo = blob.gpu_data(); // no data copied since both have up-to-date contents
bar = blob.mutable_cpu_data(); // still no data copied.
bar = blob.mutable_gpu_data(); // data copied cpu->gpu.
bar = blob.mutable_cpu_data(); // data copied gpu->cpu.
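The copy behaviour in the comments above can be reproduced with a toy state machine. This is a minimal sketch, not Caffe's SyncedMem: the class ToySyncedMem and its "head" bookkeeping are assumptions made for illustration, and no real memory is moved.

```python
class ToySyncedMem:
    """Tracks which side holds the freshest copy and logs simulated transfers."""

    def __init__(self):
        self.head = "cpu"   # assumption: data starts on the CPU
        self.copies = []    # log of simulated transfers

    def _to_cpu(self):
        if self.head == "gpu":          # only the GPU copy is fresh
            self.copies.append("gpu->cpu")
            self.head = "synced"

    def _to_gpu(self):
        if self.head == "cpu":          # only the CPU copy is fresh
            self.copies.append("cpu->gpu")
            self.head = "synced"

    def cpu_data(self):
        self._to_cpu()

    def gpu_data(self):
        self._to_gpu()

    def mutable_cpu_data(self):
        self._to_cpu()
        self.head = "cpu"   # caller may write, so the GPU copy becomes stale

    def mutable_gpu_data(self):
        self._to_gpu()
        self.head = "gpu"   # caller may write, so the CPU copy becomes stale


mem = ToySyncedMem()
# Same call sequence as the C++ example above.
for call in ("gpu_data", "cpu_data", "mutable_gpu_data", "mutable_gpu_data",
             "cpu_data", "gpu_data", "mutable_cpu_data", "mutable_gpu_data",
             "mutable_cpu_data"):
    getattr(mem, call)()
print(mem.copies)
```

The sketch also shows why stored pointers are risky: a mutable call is the only signal that one side has become stale.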

Layer computation and connections

The layer is the essence of a model and the fundamental unit of computation. Layers convolve filters, pool, take inner products, apply nonlinearities like rectified-linear and sigmoid and other elementwise transformations, normalize, load data, and compute losses like softmax and hinge. See the layer catalogue for all operations. Most of the types needed for state-of-the-art deep learning tasks are there.

A layer with bottom and top blob.

A layer takes input through bottom connections and makes output through top connections.

Each layer type defines three critical computations: setup, forward, and backward.

  • Setup: initialize the layer and its connections once at model initialization.
  • Forward: given input from bottom compute the output and send to the top.
  • Backward: given the gradient w.r.t. the top output, compute the gradient w.r.t. the input and send it to the bottom. A layer with parameters computes the gradient w.r.t. its parameters and stores it internally.

More specifically, there will be two Forward and Backward functions implemented, one for CPU and one for GPU. If you do not implement a GPU version, the layer will fall back to the CPU functions as a backup option. This may come in handy if you would like to do quick experiments, although it may come with additional data transfer cost (its inputs will be copied from GPU to CPU, and its outputs will be copied back from CPU to GPU).

Layers have two key responsibilities for the operation of the network as a whole: a forward pass that takes the inputs and produces the outputs, and a backward pass that takes the gradient with respect to the output and computes the gradients with respect to the parameters and to the inputs, which are in turn back-propagated to earlier layers. The passes of the whole network are simply the composition of each layer’s forward and backward.

Developing custom layers requires minimal effort thanks to the compositionality of the network and the modularity of the code. Define the setup, forward, and backward for the layer and it is ready for inclusion in a net.
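To make the setup / forward / backward contract concrete, here is a minimal sketch of a toy layer. Real Caffe layers are C++ classes operating on blobs; the class ToyScaleLayer, its method names, and the use of plain Python lists are assumptions made purely for illustration.

```python
class ToyScaleLayer:
    """Hypothetical layer computing top = weight * bottom, elementwise."""

    def setup(self, weight=1.0):
        # One-time initialization of the parameter and its gradient.
        self.weight = weight
        self.weight_diff = 0.0

    def forward(self, bottom):
        # Keep the input: the backward pass needs it for the parameter gradient.
        self.bottom = list(bottom)
        return [self.weight * x for x in self.bottom]

    def backward(self, top_diff):
        # Gradient w.r.t. the parameter is stored inside the layer.
        self.weight_diff = sum(g * x for g, x in zip(top_diff, self.bottom))
        # Gradient w.r.t. the input is sent to the bottom.
        return [self.weight * g for g in top_diff]


layer = ToyScaleLayer()
layer.setup(weight=2.0)
top = layer.forward([1.0, -3.0])
bottom_diff = layer.backward([1.0, 1.0])
print(top, bottom_diff, layer.weight_diff)
```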

Net definition and operation

The net jointly defines a function and its gradient by composition and auto-differentiation. The composition of every layer’s output computes the function to do a given task, and the composition of every layer’s backward computes the gradient from the loss to learn the task. Caffe models are end-to-end machine learning engines.

The net is a set of layers connected in a computation graph – a directed acyclic graph (DAG) to be exact. Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes. A typical net begins with a data layer that loads from disk and ends with a loss layer that computes the objective for a task such as classification or reconstruction.

The net is defined as a set of layers and their connections in a plaintext modeling language. A simple logistic regression classifier (figure: Softmax Regression) is defined by

name: "LogReg"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "input_leveldb"
    batch_size: 64
  }
}
layer {
  name: "ip"
  type: "InnerProduct"
  bottom: "data"
  top: "ip"
  inner_product_param {
    num_output: 2
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip"
  bottom: "label"
  top: "loss"
}

Model initialization is handled by Net::Init(). The initialization mainly does two things: it scaffolds the overall DAG by creating the blobs and layers (for C++ geeks: the network will retain ownership of the blobs and layers during its lifetime), and it calls the layers’ SetUp() function. It also does a set of other bookkeeping things, such as validating the correctness of the overall network architecture. During initialization the Net also explains its progress by logging to INFO as it goes:

I0902 22:52:17.931977 2079114000 net.cpp:39] Initializing net from parameters:
name: "LogReg"
[...model prototxt printout...]
# construct the network layer-by-layer
I0902 22:52:17.932152 2079114000 net.cpp:67] Creating Layer mnist
I0902 22:52:17.932165 2079114000 net.cpp:356] mnist -> data
I0902 22:52:17.932188 2079114000 net.cpp:356] mnist -> label
I0902 22:52:17.932200 2079114000 net.cpp:96] Setting up mnist
I0902 22:52:17.935807 2079114000 data_layer.cpp:135] Opening leveldb input_leveldb
I0902 22:52:17.937155 2079114000 data_layer.cpp:195] output data size: 64,1,28,28
I0902 22:52:17.938570 2079114000 net.cpp:103] Top shape: 64 1 28 28 (50176)
I0902 22:52:17.938593 2079114000 net.cpp:103] Top shape: 64 (64)
I0902 22:52:17.938611 2079114000 net.cpp:67] Creating Layer ip
I0902 22:52:17.938617 2079114000 net.cpp:394] ip <- data
I0902 22:52:17.939177 2079114000 net.cpp:356] ip -> ip
I0902 22:52:17.939196 2079114000 net.cpp:96] Setting up ip
I0902 22:52:17.940289 2079114000 net.cpp:103] Top shape: 64 2 (128)
I0902 22:52:17.941270 2079114000 net.cpp:67] Creating Layer loss
I0902 22:52:17.941305 2079114000 net.cpp:394] loss <- ip
I0902 22:52:17.941314 2079114000 net.cpp:394] loss <- label
I0902 22:52:17.941323 2079114000 net.cpp:356] loss -> loss
# set up the loss and configure the backward pass
I0902 22:52:17.941328 2079114000 net.cpp:96] Setting up loss
I0902 22:52:17.941328 2079114000 net.cpp:103] Top shape: (1)
I0902 22:52:17.941329 2079114000 net.cpp:109]     with loss weight 1
I0902 22:52:17.941779 2079114000 net.cpp:170] loss needs backward computation.
I0902 22:52:17.941787 2079114000 net.cpp:170] ip needs backward computation.
I0902 22:52:17.941794 2079114000 net.cpp:172] mnist does not need backward computation.
# determine outputs
I0902 22:52:17.941800 2079114000 net.cpp:208] This network produces output loss
# finish initialization and report memory usage
I0902 22:52:17.941810 2079114000 net.cpp:467] Collecting Learning Rate and Weight Decay.
I0902 22:52:17.941818 2079114000 net.cpp:219] Network initialization done.
I0902 22:52:17.941824 2079114000 net.cpp:220] Memory required for data: 201476
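The "Memory required for data" figure in the log is consistent with summing the element counts of the top blobs and multiplying by the size of a float. The sketch below reproduces that arithmetic; the 4-byte element size is an assumption (single-precision Dtype), and the interpretation of the log line is inferred from the numbers rather than taken from Caffe's source.

```python
from math import prod

# Top blob shapes as reported in the log above.
top_shapes = {
    "data": (64, 1, 28, 28),
    "label": (64,),
    "ip": (64, 2),
    "loss": (),  # scalar; the log prints "Top shape: (1)"
}

# prod(()) is 1, so the scalar loss counts as one element.
counts = {name: prod(shape) for name, shape in top_shapes.items()}

BYTES_PER_ELEMENT = 4  # assumption: 32-bit floats
memory_bytes = sum(counts.values()) * BYTES_PER_ELEMENT
print(counts, memory_bytes)
```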

Note that the construction of the network is device agnostic: recall our earlier explanation that blobs and layers hide implementation details from the model definition. After construction, the network is run on either CPU or GPU by setting a single switch defined in Caffe::mode() and set by Caffe::set_mode(). Layers come with corresponding CPU and GPU routines that produce identical results (up to numerical errors, and with tests to guard it). The CPU / GPU switch is seamless and independent of the model definition. For research and deployment alike it is best to divide model and implementation.

Model format

The models are defined in plaintext protocol buffer schema (prototxt) while the learned models are serialized as binary protocol buffer (binaryproto) .caffemodel files.

The model format is defined by the protobuf schema in caffe.proto. The source file is mostly self-explanatory so one is encouraged to check it out.

Caffe speaks Google Protocol Buffer for the following strengths: minimal-size binary strings when serialized, efficient serialization, a human-readable text format compatible with the binary version, and efficient interface implementations in multiple languages, most notably C++ and Python. This all contributes to the flexibility and extensibility of modeling in Caffe.

Scenario 1

Question: I can't figure out what this does even after reading the docs. Please help.

I am trying to use a model that I previously trained on my own image data for prediction. I borrowed the ImageNet code as a template when building the model, and as far as I can tell, the only way to run prediction is to use wraper.py, which references the file ilsvrc_2012_mean.npy. This is clearly a numpy file containing the ImageNet mean, so I assume it was precomputed for the ImageNet dataset. However, the only mean file I have for my own data is imagenet_mean.binaryproto.

It was created by make_imagenet_mean.sh.

Is this the correct data to use in my case? If not, how do I create the equivalent of ilsvrc_2012_mean.npy for my own data?


Scenario 2

To build a mean file for our own data, we first need to know how ilsvrc_2012_mean.npy, the mean file for ImageNet, is created.

The contents of /examples/imagenet/make_imagenet_mean.sh are as follows:

$TOOLS/compute_image_mean $EXAMPLE/ilsvrc12_train_lmdb \
  $DATA/imagenet_mean.binaryproto

The mean is computed by compute_image_mean, and the source is ilsvrc12_train_lmdb. So if we convert our own data to LMDB and pass it to compute_image_mean, it will produce the mean file for us.


Scenario 3

The resulting mean.binaryproto must then be converted to npy. How? One tip: convert.py no longer exists, and the blobproto_to_array function has moved to caffe.io. It can be found in caffe\python\caffe\io.py.
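A conversion along these lines can be sketched as follows. It assumes that pycaffe and numpy are installed, and the function name binaryproto_to_npy and the file paths are hypothetical. Only BlobProto and caffe.io.blobproto_to_array are Caffe names; treat the rest as illustration.

```python
def binaryproto_to_npy(binaryproto_path, npy_path):
    """Convert a Caffe mean .binaryproto file into a .npy file."""
    # Imported lazily so the function can be defined without Caffe installed.
    import numpy as np
    import caffe

    blob = caffe.proto.caffe_pb2.BlobProto()
    with open(binaryproto_path, "rb") as f:
        blob.ParseFromString(f.read())

    # For a mean file the blob holds a single image, so the first entry
    # of the returned array is the (channels, height, width) mean.
    mean = np.array(caffe.io.blobproto_to_array(blob))[0]
    np.save(npy_path, mean)
    return mean


# Example call (paths are placeholders):
# binaryproto_to_npy("imagenet_mean.binaryproto", "my_mean.npy")
```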


Scenario 4

classifier = caffe.Classifier(args.model_def, args.pretrained_model,
        image_dims=image_dims, mean=mean,
        input_scale=args.input_scale, raw_scale=args.raw_scale,
        channel_swap=channel_swap)

Classification is finally performed like this, and the npy created in Scenario 3 goes in as mean.

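Installing Ubuntu + CUDA + CAFFE (written up March 5, 2016)

* Environment
   - Ubuntu 14.04 Desktop (x64)
   - NVIDIA GTX 980
   - Gigabyte H170-GAMING 3
   - Intel i5-6600 CPU @ 3.30GHz (4 cores)

* Notes

   - Dual monitors were not recognized.
   - Switching the graphics driver through Ubuntu's software update settings sometimes breaks the system.
   - In the future, install Ubuntu Server or make good use of Docker.

* Notes specific to this post
    - Includes steps for accessing and using the machine remotely.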



1. Install Ubuntu (14.04) / Samba / graphics driver

* Installed Ubuntu 14.04 from a bootable USB drive.

1.1 Installing Korean input tools (keyboard, etc.) afterwards is difficult, so installing the Korean version from the start is recommended

1.1.1 If you mainly work in terminal mode, the English version is recommended

1.2 Update apt

1.2.1 Double-click /etc/apt/sources.list in the file browser

1.2.2 Change the mirror to ftp.daum.net or similar

1.2.3 Run the following in a terminal

1.2.3.1 sudo apt-get update

1.2.3.2 sudo apt-get upgrade

1.2.4 Alternatively, wait a while and the software updater pops up automatically; run it

1.2.4.1 With apt-get upgrade alone, some packages are not updated.

1.3 Change the power-saving settings

1.3.1 Adjust as desired (power saving still partly applies even if you later work in terminal mode.)

1.3.2 Even sudo service lightdm stop does not seem to stop it completely

1.4 Change the network address to manual and edit it

1.5 Install the SSH server

1.5.1 sudo apt-get install openssh-server

1.5.2 From here on you can work over a terminal

1.6 Install Samba

1.6.1 sudo apt-get install samba samba-common-bin

1.6.2 Add a user with sudo smbpasswd -a yourusername

1.6.3 Edit /etc/samba/smb.conf and append the following at the end

1.6.3.1 [yourusername]

1.6.3.2 comment = yourusername

1.6.3.3 valid users = yourusername

1.6.3.4 path = /home/yourusername

1.6.3.5 browsable = yes

1.6.3.6 writable = yes

1.6.4 Reboot

1.7 Mount an external data folder as a network drive (if the training images are on external storage)

1.7.1 sudo apt-get install cifs-utils

1.7.2 Edit /etc/fstab

1.7.2.1 //<external-ip>/Share /home/pi/work/share cifs guest,uid=1000,gid=1000,iocharset=utf8 0 0

1.8 Switch the kernel driver to nvidia (done in X Window)

1.8.1 Replace the default nouveau driver with the nvidia driver

1.8.2 Check: lspci -vnn | grep VGA -A 10

1.8.2.1 It should show kernel driver in use: nouveau

1.8.3 In the desktop, under Settings, Software Updates, Other, change nouveau to nvidia

1.8.4 After this, the screen may not come up after a reboot. Press Ctrl-Alt-F1 to switch to terminal mode

1.8.5 Alternatively, install with sudo apt-get install nvidia-current; I did not verify whether the kernel driver changes in this case


2. Install CUDA

2.1 URL : http://www.r-tutor.com/gpu-computing/cuda-installation/cuda7.5-ubuntu 

2.2 deb download : https://developer.nvidia.com/cuda-downloads 

2.3 CUDA Repository 

2.3.1 sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb 

2.3.2 sudo apt-get update 

2.4 CUDA Toolkit 

2.4.1 sudo apt-get install cuda 

2.4.2 reboot 

2.5 CUDA environment (edit .bashrc)

2.5.1 export CUDA_HOME=/usr/local/cuda-7.5  

2.5.2 export LD_LIBRARY_PATH=${CUDA_HOME}/lib64  

2.5.3 export PATH=${CUDA_HOME}/bin:${PATH}:. 

2.6 CUDA SDK Samples 

2.6.1 Run the following in the samples directory: cuda-install-samples-7.5.sh ~

2.6.2 cd ~/NVIDIA_CUDA-7.5_Samples  

2.6.3 cd 1_Utilities/deviceQuery  

2.6.4 make

2.6.5 Run the deviceQuery sample to check that the installation succeeded

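3. Install CAFFE

3.1 Search for a suitable installation guide and follow along with it as a reference.

3.2 Install the essential build packages and the latest kernel headers

3.2.1 sudo apt-get install build-essential

3.2.2 sudo apt-get install linux-headers-$(uname -r)

3.2.3 (Or run uname -r in a terminal and substitute its output above)

3.3 Install the dependency libraries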

3.3.1 sudo apt-get install -y libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev protobuf-compiler gfortran libjpeg62 libfreeimage-dev libatlas-base-dev git python-dev python-pip libgoogle-glog-dev libbz2-dev libxml2-dev libxslt1-dev libffi-dev libssl-dev libgflags-dev liblmdb-dev 

3.4 Get the Caffe source and install the Python packages

3.4.1 git clone https://github.com/BVLC/caffe.git 

3.4.2 cd caffe 

3.4.3 cat python/requirements.txt | xargs -L 1 sudo pip install 

3.5 Create symbolic links to avoid conflicts

3.5.1 sudo ln -s /usr/include/python2.7/ /usr/local/include/python2.7 

3.5.2 sudo ln -s /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy /usr/local/include/python2.7/numpy 

3.6 Copy the Makefile config and edit it

3.6.1 cp Makefile.config.example Makefile.config

3.6.2 vi Makefile.config

3.6.3 #CPU_ONLY := 1 (uncomment this line if you are not using a GPU)

3.6.4 PYTHON_INCLUDE := /usr/local/include/python2.7 \

3.6.5                     /usr/local/include/python2.7/numpy

3.7 Build

3.7.1 make pycaffe 

3.7.2 make all 

3.7.3 make test 

3.8 Download the ImageNet Caffe model and labels

3.8.1 python scripts/download_model_binary.py models/bvlc_reference_caffenet

3.8.2 sh data/ilsvrc12/get_ilsvrc_aux.sh

3.9 Edit python/classify.py and io.py

3.9.1 See below


4. Test CAFFE

4.1 Run python python/classify.py --print_results examples/images/cat.jpg foo and check the result

4.2 You can also use run, e.g. run car1.jpg.


It was recognized as a tabby.


5. Things to learn next

5-1. Understand what a model is, what training means, and what classification means

5-2. Learn fine-tuning through experience.

5-3. Look at the models Caffe already provides (CaffeNet, MNIST, etc.)

5-4. Classify your own data using a model Caffe already provides

5-5. Build your own model with Caffe

5-6. Classify your own data with your own model

5-7. Get better results through fine-tuning


https://cmusatyalab.github.io/openface/

Face recognition using deep learning



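Goal

In this session,

  • We will look at the basics of face detection (using Haar feature-based cascade classifiers).
  • We will also look at extending it to other kinds of detection, such as eye detection.

Basics

Object detection using Haar feature-based cascade classifiers is a very effective object detection method proposed by Paul Viola and Michael Jones in their 2001 paper, "Rapid Object Detection using a Boosted Cascade of Simple Features". It is a machine-learning-based approach in which a cascade function is trained from a large number of positive and negative images.

Here we will work with face detection. Initially, the algorithm needs a large number of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from them. For this, Haar features are used. They are just like a convolutional kernel. Each feature is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle.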


Figure: Haar features (haar_features.jpg)

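All possible sizes and locations of each kernel are used to calculate a large number of features. (Imagine how much computation that requires: even a 24x24 window results in over 160,000 features.) For each feature, we need the sum of the pixels under the white and black rectangles. To solve this, the authors introduced the integral image, which simplifies the calculation of these pixel sums.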
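The integral image idea can be sketched in a few lines: once the summed-area table is built, any rectangle sum needs only four lookups. The function names and the left-white / right-black layout of the two-rectangle feature are illustrative assumptions, not the exact features from the paper.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Hypothetical feature: black (right half) sum minus white (left half) sum."""
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return black - white


img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
ii = integral_image(img)
print(rect_sum(ii, 1, 0, 2, 2), two_rect_feature(ii, 0, 0, 4, 2))
```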

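Computing all of these features is unnecessary. For example, consider the image below. The top row shows two good features. The first feature captures the property that the eye region is usually darker than the nose and cheeks. The second captures the property that the eyes are darker than the bridge of the nose. Applying the same windows to the cheeks or anywhere else seems pointless. So how do we select the best features out of 160,000+? This is handled by Adaboost.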

Figure: two good Haar features (haar.png)

For this, we apply each and every feature to all the training images. For each feature, we find the best threshold that separates the faces (positive) from the non-faces (negative). Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best classify the face and non-face images. (The process is not as simple as this. Each image is given an equal weight in the beginning. After each classification, the weights of misclassified images are increased. Then the same process is repeated: new error rates and new weights are calculated. The process continues until the required accuracy or error rate is achieved or the required number of features is found.)
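One round of this select-then-reweight loop can be sketched as below. Note that this follows the textbook discrete AdaBoost update rather than the exact weight update in the Viola-Jones paper, and the function names and toy data are assumptions.

```python
import math


def stump_predict(value, threshold, polarity):
    # polarity +1: predict face when value >= threshold; -1 flips the rule.
    return polarity if value >= threshold else -polarity


def best_stump(values, labels, weights):
    """Return (weighted_error, threshold, polarity) minimizing the weighted error.
    Labels are +1 (face) or -1 (non-face)."""
    best = (float("inf"), None, None)
    for thr in sorted(set(values)):
        for polarity in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if stump_predict(v, thr, polarity) != y)
            if err < best[0]:
                best = (err, thr, polarity)
    return best


def reweight(values, labels, weights, thr, polarity, err):
    """Increase weights of misclassified samples, then renormalize."""
    eps = 1e-10  # guards against division by zero for a perfect stump
    alpha = 0.5 * math.log((1 - err + eps) / (err + eps))
    new = [w * math.exp(-alpha * y * stump_predict(v, thr, polarity))
           for v, y, w in zip(values, labels, weights)]
    total = sum(new)
    return [w / total for w in new], alpha


values = [1.0, 2.0, 3.0, 4.0]      # one feature evaluated on four images
labels = [-1, 1, -1, 1]
weights = [0.25] * 4               # equal weights at the start
err, thr, pol = best_stump(values, labels, weights)
weights, alpha = reweight(values, labels, weights, thr, pol, err)
print(err, thr, pol, weights)
```

In the toy data the third sample is the only one the chosen stump gets wrong, so its weight grows while the others shrink.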

Final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can't classify the image, but together with others forms a strong classifier. The paper says even 200 features provide detection with 95% accuracy. Their final setup had around 6000 features. (Imagine a reduction from 160000+ features to 6000 features. That is a big gain).

So now you take an image. Take each 24x24 window. Apply 6000 features to it. Check if it is face or not. Wow.. Wow.. Isn't it a little inefficient and time consuming? Yes, it is. Authors have a good solution for that.

In an image, most of the image region is non-face region. So it is a better idea to have a simple method to check if a window is not a face region. If it is not, discard it in a single shot. Don't process it again. Instead focus on region where there can be a face. This way, we can find more time to check a possible face region.

For this they introduced the concept of a Cascade of Classifiers. Instead of applying all 6000 features to a window, the features are grouped into different stages of classifiers and applied one by one. (Normally the first few stages contain very few features.) If a window fails the first stage, discard it; the remaining features are not considered. If it passes, apply the second stage of features and continue the process. A window that passes all stages is a face region. How is that for a plan!

The authors' detector had 6000+ features in 38 stages, with 1, 10, 25, 25 and 50 features in the first five stages. (The two features in the above image were actually obtained as the best two features from Adaboost.) According to the authors, on average 10 features out of 6000+ are evaluated per sub-window.
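The early-rejection logic of the cascade can be sketched as follows. The stage representation (a list of feature callables plus a threshold on their summed score) is a simplifying assumption for illustration; real cascades use weighted weak classifiers per stage.

```python
def cascade_classify(window, stages):
    """Return (is_face, features_evaluated) for one window.

    `stages` is a list of (features, threshold) pairs. A window is rejected
    as soon as the summed feature score of a stage falls below its threshold.
    """
    evaluated = 0
    for features, threshold in stages:
        score = 0.0
        for feature in features:
            score += feature(window)
            evaluated += 1
        if score < threshold:
            return False, evaluated   # rejected early, later stages skipped
    return True, evaluated


# Toy stages: one cheap feature first, two more in the second stage.
stages = [
    ([lambda w: w[0]], 0.5),
    ([lambda w: w[1], lambda w: w[2]], 1.5),
]

print(cascade_classify((0.0, 1.0, 1.0), stages))
print(cascade_classify((1.0, 1.0, 1.0), stages))
```

Most windows behave like the first call: they are discarded after a single feature, which is where the speedup comes from.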

So this is a simple intuitive explanation of how Viola-Jones face detection works. Read paper for more details or check out the references in Additional Resources section.

(Video link: https://www.youtube.com/watch?v=WfdYYNamHZ8)

Haar-cascade Detection in OpenCV

OpenCV comes with a trainer as well as detector. If you want to train your own classifier for any object like car, planes etc. you can use OpenCV to create one. Its full details are given here: Cascade Classifier Training.

Here we will deal with detection. OpenCV already contains many pre-trained classifiers for face, eyes, smile etc. Those XML files are stored in opencv/data/haarcascades/ folder. Let's create face and eye detector with OpenCV.

First we need to load the required XML classifiers. Then load our input image (or video) in grayscale mode.

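import numpy as np
import cv2

# Load the pre-trained Haar cascade classifiers (XML files shipped with OpenCV).
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier('haarcascade_eye.xml')

# Read the input image and convert it to grayscale for detection.
img = cv2.imread('sachin.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)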

Now we find the faces in the image. If faces are found, it returns the positions of detected faces as Rect(x,y,w,h). Once we get these locations, we can create a ROI for the face and apply eye detection on this ROI (since eyes are always on the face !!! ).

# Detect faces (scaleFactor=1.3, minNeighbors=5).
faces = face_cascade.detectMultiScale(gray, 1.3, 5)
for (x, y, w, h) in faces:
    # Draw a blue box around the face.
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
    # Restrict eye detection to the face region.
    roi_gray = gray[y:y+h, x:x+w]
    roi_color = img[y:y+h, x:x+w]
    eyes = eye_cascade.detectMultiScale(roi_gray)
    for (ex, ey, ew, eh) in eyes:
        # Draw a green box around each eye.
        cv2.rectangle(roi_color, (ex, ey), (ex+ew, ey+eh), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Result looks like below:

Figure: face and eye detection result (face.jpg)



HOG detectMultiScale parameters explained


Figure 2: On my system, it takes approximately 0.09s to process a single image using the default parameters.


img (required)

This parameter is pretty obvious: it is the image that contains the objects we want to detect (people, in this photo). It is the only required argument to the detectMultiScale function. The image can be either color or grayscale.

hitThreshold (optional)

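The hitThreshold parameter is optional and is not used by default in the detectMultiScale function.

The OpenCV documentation says only this:

:  Threshold for the distance between features and SVM classifying plane.

If the Euclidean distance between the HOG features and the SVM plane exceeds the threshold, the detection is rejected. In my personal opinion, it is best to leave this parameter alone if you do not want to raise the false-positive detection rate.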

winStride (optional)

The winStride  parameter is a 2-tuple that dictates the “step size” in both the x and y location of the sliding window.

Both winStride and scale are extremely important parameters that need to be set properly. These parameters have tremendous implications not only for the accuracy of your detector, but also for the speed at which your detector runs.

In the context of object detection, a sliding window is a rectangular region of fixed width and height that “slides” across an image, just like in the following figure:

Figure 3: An example of applying a sliding window to an image for face detection.

At each stop of the sliding window (and for each level of the image pyramid, discussed in the scale section below), we (1) extract HOG features and (2) pass these features on to our Linear SVM for classification. The process of feature extraction and classifier decision is an expensive one, so we would prefer to evaluate as few windows as possible if our intention is to run our Python script in near real-time.

The smaller winStride  is, the more windows need to be evaluated (which can quickly turn into quite the computational burden):

Figure 4: Decreasing the winStride increases the amount of time it takes to process each image.

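Changing winStride to (4, 4) increased the detection time to 0.27s. Conversely, making winStride larger reduces the number of windows to evaluate, which makes the detector much faster but raises the overall chance of missing detections.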

Figure 5: Increasing the winStride can reduce our pedestrian detection time (0.09s down to 0.06s, respectively), but as you can see, we miss out on detecting the boy in the background.

I usually start with a winStride of (4, 4) and then increase the value little by little until the tradeoff between speed and detection accuracy is reasonable.
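To get a feel for how strongly winStride affects the workload, the following sketch counts sliding-window positions on a single pyramid level. It assumes a 64x128 detection window and ignores padding; the function name and the 400x256 image size are hypothetical.

```python
def count_windows(image_size, win_stride, window_size=(64, 128)):
    """Number of sliding-window positions on one image (no padding)."""
    img_w, img_h = image_size
    win_w, win_h = window_size
    step_x, step_y = win_stride
    if img_w < win_w or img_h < win_h:
        return 0
    across = (img_w - win_w) // step_x + 1
    down = (img_h - win_h) // step_y + 1
    return across * down


image_size = (400, 256)  # width, height of a hypothetical resized frame
for stride in [(4, 4), (8, 8), (16, 16)]:
    print(stride, count_windows(image_size, stride))
```

Halving the stride in both directions roughly quadruples the number of windows, which matches the direction of the timing changes reported above.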

padding (optional)

The padding parameter is a tuple which indicates the number of pixels in both the x and y directions by which the sliding window ROI is “padded” prior to HOG feature extraction.

As suggested by Dalal and Triggs in their 2005 CVPR paper, Histogram of Oriented Gradients for Human Detection, adding a bit of padding surrounding the image ROI prior to HOG feature extraction and classification can actually increase the accuracy of your detector.

Typical values for padding include (8, 8), (16, 16), (24, 24), and (32, 32).

scale (optional)

An image pyramid is a multi-scale representation of an image:

Figure 6: An example image pyramid.

At each layer of the image pyramid the image is downsized and (optionally) smoothed via a Gaussian filter.

This scale  parameter controls the factor in which our image is resized at each layer of the image pyramid, ultimately influencing the number of levels in the image pyramid.

A smaller scale increases the number of layers in the image pyramid and increases the time it takes to process the image.

Figure 7: Decreasing the scale to 1.01

The amount of time it takes to process our image has significantly jumped to 0.3s. We also now have an issue of overlapping bounding boxes. However, that issue can be easily remedied using non-maxima suppression.

Meanwhile a larger scale will decrease the number of layers in the pyramid as well as decrease the amount of time it takes to detect objects in an image:

Figure 8: Increasing our scale allows us to process nearly 20 images per second -- at the expense of missing some detections.

Here we can see that we performed pedestrian detection in only 0.02s, implying that we can process nearly 50 images per second. However, this comes at the expense of missing some detections, as evidenced by the figure above.

Finally, if you decrease both winStride and scale at the same time, you’ll dramatically increase the amount of time it takes to perform object detection:

Figure 9: Decreasing both the scale and window stride.

We are able to detect both people in the image — but it’s taken almost half a second to perform this detection, which is absolutely not suitable for real-time applications.

Keep in mind that for each layer of the pyramid a sliding window with winStride  steps is moved across the entire layer. While it’s important to evaluate multiple layers of the image pyramid, allowing us to find objects in our image at different scales, it also adds a significant computational burden since each layer also implies a series of sliding windows, HOG feature extractions, and decisions by our SVM must be performed.

Typical values for scale are normally in the range [1.01, 1.5]. If you intend on running detectMultiScale in real-time, this value should be as large as possible without significantly sacrificing detection accuracy.

Again, along with the winStride , the scale  is the most important parameter for you to tune in terms of detection speed.
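The effect of scale on the number of pyramid layers can be estimated with a small sketch. It assumes the pyramid stops once the image is smaller than a 64x128 detection window and ignores any cap on the number of levels that the library may apply; the function name and image size are hypothetical.

```python
def pyramid_levels(image_size, scale, window_size=(64, 128)):
    """Count pyramid layers until the image no longer fits the window."""
    if scale <= 1.0:
        raise ValueError("scale must be greater than 1")
    w, h = image_size
    levels = 0
    while w >= window_size[0] and h >= window_size[1]:
        levels += 1
        w /= scale   # each layer shrinks the image by the scale factor
        h /= scale
    return levels


image_size = (400, 256)
for scale in (1.01, 1.05, 1.5):
    print(scale, pyramid_levels(image_size, scale))
```

Since every layer is scanned with its own sliding window, the layer count multiplies the per-layer window count from the winStride discussion.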

finalThreshold (optional)

I honestly can’t even find finalThreshold inside the OpenCV documentation (specifically for the Python bindings) and I have no idea what it does. I assume it has some relation to the hitThreshold, allowing us to apply a “final threshold” to the potential hits, weeding out potential false-positives, but again, that’s simply speculation based on the argument name.

If anyone knows what this parameter controls, please leave a comment at the bottom of this post.

useMeanShiftGrouping (optional)

The useMeanShiftGrouping  parameter is a boolean indicating whether or not mean-shift grouping should be performed to handle potential overlapping bounding boxes. This value defaults to False  and in my opinion, should never be set to True  — use non-maxima suppression instead; you’ll get much better results.

When using HOG + Linear SVM object detectors you will undoubtedly run into the issue of multiple, overlapping bounding boxes where the detector has fired numerous times in regions surrounding the object we are trying to detect:

Figure 10: An example of detecting multiple, overlapping bounding boxes.

To suppress these multiple bounding boxes, Dalal suggested using mean shift (Slide 18). However, in my experience mean shift performs sub-optimally and should not be used as a method of bounding box suppression, as evidenced by the image below:

Figure 11: Applying mean-shift to handle overlapping bounding boxes.

Instead, utilize non-maxima suppression (NMS). Not only is NMS faster, but it obtains much more accurate final detections:

Figure 12: Instead of applying mean-shift, utilize NMS instead. Your results will be much better.
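For reference, here is a minimal sketch of greedy non-maxima suppression based on intersection-over-union. It is one common variant and not necessarily the implementation used to produce the figures; the function names, the score inputs, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box of every group of heavily overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]


boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (200, 20, 240, 100)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

Boxes returned by a detector as (x, y, w, h) would need converting to corner form (x, y, x + w, y + h) before being passed in.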

Tips on speeding up the object detection process

Whether you’re batch processing a dataset of images or looking to get your HOG detector to run in real-time (or as close to real-time as feasible), these three tips should help you milk as much performance out of your detector as possible:

  1. Resize your image or frame to be as small as possible without sacrificing detection accuracy. Prior to calling the detectMultiScale function, reduce the width and height of your image. The smaller your image is, the less data there is to process, and thus the detector will run faster.
  2. Tune your scale and winStride parameters. These two arguments have a tremendous impact on your object detector speed. Both scale and winStride should be as large as possible, again, without sacrificing detector accuracy.
  3. If your detector still is not fast enough, you might want to look into re-implementing your program in C/C++. Python is great and you can do a lot with it. But sometimes you need the compiled binary speed of C or C++. This is especially true for resource constrained environments.
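
To get a feel for why the first two tips matter, the sketch below estimates how many 64x128 windows a sliding-window detector has to classify for a given image size, winStride, and scale. It is a simplified model and not OpenCV's exact internal arithmetic; the function name count_windows and the max_levels cap are assumptions made for illustration.

```python
def count_windows(img_w, img_h, win_stride=(4, 4), scale=1.05,
                  win_size=(64, 128), max_levels=64):
    """Rough count of sliding-window positions over an image pyramid."""
    total = 0
    factor = 1.0
    for _ in range(max_levels):
        # Size of the image at this pyramid level.
        w = int(img_w / factor)
        h = int(img_h / factor)
        if w < win_size[0] or h < win_size[1]:
            break  # the detection window no longer fits
        nx = (w - win_size[0]) // win_stride[0] + 1
        ny = (h - win_size[1]) // win_stride[1] + 1
        total += nx * ny
        factor *= scale
    return total


# Compare a few hypothetical settings.
print(count_windows(640, 480, win_stride=(4, 4), scale=1.05))
print(count_windows(640, 480, win_stride=(8, 8), scale=1.5))
print(count_windows(320, 240, win_stride=(4, 4), scale=1.05))
```

Under this model, halving the image dimensions, doubling the stride, or increasing the scale step each cuts the number of windows substantially. That is where the speedup comes from.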

Summary

In this lesson we reviewed the parameters to the detectMultiScale function of the HOG descriptor and SVM detector. Specifically, we examined these parameter values in the context of pedestrian detection. We also discussed the speed and accuracy tradeoffs you must consider when utilizing HOG detectors.

If your goal is to apply HOG + Linear SVM in (near) real-time applications, you’ll first want to start by resizing your image to be as small as possible without sacrificing detection accuracy: the smaller the image is, the less data there is to process. You can always keep track of your resizing factor and multiply the returned bounding boxes by this factor to obtain the bounding box sizes in relation to the original image size.
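
A minimal sketch of that bookkeeping, assuming the image was resized with its aspect ratio preserved so that a single factor applies to both axes. The helper names resize_factor and boxes_to_original are hypothetical:

```python
def resize_factor(orig_width, resized_width):
    """Factor that maps coordinates in the resized image back to the original."""
    return orig_width / float(resized_width)


def boxes_to_original(boxes, factor):
    """Scale (x, y, w, h) boxes found on the resized image back up."""
    return [tuple(int(round(v * factor)) for v in box) for box in boxes]


# Hypothetical example: a 1600-pixel-wide image resized to 400 pixels wide.
factor = resize_factor(1600, 400)
detections = [(10, 20, 64, 128)]  # boxes found on the small image
print(boxes_to_original(detections, factor))  # [(40, 80, 256, 512)]
```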

Secondly, be sure to play with your scale and winStride parameters. These values can dramatically affect the detection accuracy (as well as the false-positive rate) of your detector.

Finally, if you still are not obtaining your desired frames per second (assuming you are working on a real-time application), you might want to consider re-implementing your program in C/C++. While Python is very fast (all things considered), there are times you cannot beat the speed of a binary executable.


HOG PERSON DETECTOR TUTORIAL

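One of the most popular and successful "person detectors" is HOG with an SVM approach. When I attended the Embedded Vision Summit in April 2013, it was the most common algorithm I heard about.

HOG stands for Histograms of Oriented Gradients, and it is a type of feature descriptor. The intent of a feature descriptor is to generalize an object so that the same object (in this case a person) produces as close as possible to the same descriptor even when it is viewed under slightly different conditions. This makes classification easier.

The creators of this approach trained an SVM (a type of machine learning algorithm for classification that maximizes the margin between the classes) to recognize HOG descriptors of people.

The HOG person detector is fairly simple to understand (compared to SIFT object recognition, for example). One of the main reasons is that it uses a single "global" feature to describe a person rather than a collection of "local" features. This means the entire person is represented by a single feature vector, as opposed to many feature vectors representing smaller parts of the person.

The HOG person detector uses a sliding detection window that is moved around the image in small steps. At each position, a HOG descriptor is computed for the detection window. This descriptor is then shown to the trained SVM, which classifies it as either "person" or "not a person".

To recognize people at different scales, the image is also subsampled to multiple sizes, and each of these subsampled images is searched.

Original Work

The HOG person detector was introduced by Dalal and Triggs at the CVPR conference in 2005. The original paper is available here.

The original training data set is available here.

Gradient Histograms

The HOG person detector uses a detection window that is 64 pixels wide by 128 pixels tall. Below are some of the original images used to train the detector, cropped to the 64x128 window.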

TrainingImages

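To compute the HOG descriptor, we operate on 8x8 pixel cells within the detection window. These cells will later be organized into overlapping blocks.

Here is a zoomed-in version of one of the images, with an 8x8 cell drawn in red, to give you an idea of the cell size and image resolution we are working with.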

crop001025a

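Within a cell, we compute the gradient vector at each pixel (see the post on gradient vectors if you are unfamiliar with them). This gives 64 gradient vectors (8x8 pixels), which are then put into a 9-bin histogram, reducing 64 vectors to 9 values. The histogram ranges from 0 to 180 degrees, so each bin covers 20 degrees.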

Histogram


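Note: Dalal and Triggs used "unsigned gradients", meaning the orientations range only from 0 to 180 degrees instead of 0 to 360.

For each gradient vector, its contribution to the histogram is given by the magnitude of the vector (so stronger gradients have a bigger impact on the histogram). The contribution is split between the two closest bins. For example, if a gradient vector has an angle of 85 degrees, its magnitude is divided between the bins centered at 70 and 90 degrees rather than going entirely to the 90-degree bin: 1/4 of the magnitude is added to the 70-degree bin and 3/4 to the 90-degree bin.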
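
The split described above can be sketched as a linear interpolation between neighboring bin centers. This is only a minimal illustration: the function name vote is hypothetical, bin centers are taken to be 10, 30, ..., 170 degrees, and the wrap-around between the 170-degree and 10-degree bins is an assumption based on the gradients being unsigned.

```python
import math


def vote(angle_deg, magnitude, n_bins=9, bin_width=20.0):
    """Return a histogram holding one gradient's contribution,
    split between the two nearest bin centers."""
    hist = [0.0] * n_bins
    angle = angle_deg % 180.0            # unsigned gradient: 0..180 degrees
    pos = angle / bin_width - 0.5        # position measured in bin centers
    lower = int(math.floor(pos))         # index of the bin center at or below
    frac = pos - lower                   # how far we are toward the next center
    hist[lower % n_bins] += magnitude * (1.0 - frac)
    hist[(lower + 1) % n_bins] += magnitude * frac
    return hist


# The 85-degree example from the text, with unit magnitude:
h = vote(85.0, 1.0)
print(h[3], h[4])  # 0.25 to the 70-degree bin, 0.75 to the 90-degree bin
```

Summing the output of vote over all 64 gradients in a cell would give that cell's 9-bin histogram.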

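I believe the intent of splitting the contribution is to minimize the problem of gradients that lie right on the boundary between two bins. Otherwise, if a strong gradient sat right on the edge of a bin, a slight change in the gradient angle could push it into the neighboring bin and have a strong impact on the histogram.

Why put the gradients into a histogram rather than simply using the gradient values directly? The gradient histogram is a form of "quantization". In this case we reduce 64 vectors with 2 components each down to a string of just 9 values (the magnitude of each bin). Compressing the feature descriptor may be important for the performance of the classifier, but I believe the main intent is actually to generalize the contents of the 8x8 cell.

Imagine that you deformed the contents of the 8x8 cell slightly. You would still have roughly the same vectors, but they would be in slightly different positions within the cell and have slightly different angles. The histogram bins allow for some play in the angles of the gradients, and certainly in their positions (the histogram does not encode where each gradient is within the cell, only the distribution of gradients within the cell).

Normalizing Gradient Vectors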

The next step in computing the descriptors is to normalize the histograms. Let’s take a moment to first look at the effect of normalizing gradient vectors in general.

In my post on gradient vectors, I show how you can add or subtract a fixed amount of brightness to every pixel in the image, and you’ll still get the same gradient vectors at every pixel.

It turns out that by normalizing your gradient vectors, you can also make them invariant to multiplications of the pixel values.  Take a look at the below examples. The first image shows a pixel, highlighted in red, in the original image. In the second image, all pixel values have been increased by 50. In the third image, all pixel values in the original image have been multiplied by 1.5.

Multiplication

Notice how the third image displays an increase in contrast. The effect of the multiplication is that bright pixels became much brighter while dark pixels only became a little brighter, thereby increasing the contrast between the light and dark parts of the image.

Let’s look at the actual pixel values and how the gradient vector changes in these three images. The numbers in the boxes below represent the values of the pixels surrounding the pixel marked in red.

Magnitudes

The gradient vectors are equivalent in the first and second images, but in the third, the gradient vector magnitude has increased by a factor of 1.5.

If you divide all three vectors by their respective magnitudes, you get the same result for all three vectors: [ 0.71  0.71]’.

So in the above example we see that by dividing the gradient vectors by their magnitude we can make them invariant (or at least more robust) to changes in contrast.

Dividing a vector by its magnitude is referred to as normalizing the vector to unit length, because the resulting vector has a magnitude of 1. Normalizing a vector does not affect its orientation, only the magnitude.
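
A small numeric check of this idea. The pixel values below are hypothetical (they are not read from the figures), and the gradient convention used here (right minus left, bottom minus top) is an assumption for the sketch:

```python
import math


def gradient(left, right, top, bottom):
    """Gradient at a pixel from its four neighbors: [dx, dy]."""
    return [right - left, bottom - top]


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v] if mag > 0 else list(v)


# Hypothetical neighbor values around one pixel.
left, right, top, bottom = 70.0, 120.0, 50.0, 100.0

g_orig = gradient(left, right, top, bottom)
g_plus = gradient(left + 50, right + 50, top + 50, bottom + 50)    # brightness +50
g_mult = gradient(left * 1.5, right * 1.5, top * 1.5, bottom * 1.5)  # contrast x1.5

print(g_orig, g_plus, g_mult)
print(normalize(g_orig), normalize(g_plus), normalize(g_mult))
```

Adding a constant leaves the gradient unchanged, multiplying scales its magnitude by the same factor, and after normalization all three vectors come out identical.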

Histogram Normalization

Recall that the value in each of the nine bins in the histogram is based on the magnitudes of the gradients in the 8×8 pixel cell over which it was computed. If every pixel in a cell is multiplied by 1.5, for example, then we saw above that the magnitude of all of the gradients in the cell will be increased by a factor of 1.5 as well. In turn, this means that the value for each bin of the histogram will also be increased by 1.5x. By normalizing the histogram, we can make it invariant to this type of illumination change.

Block Normalization

Rather than normalize each histogram individually, the cells are first grouped into blocks and normalized based on all histograms in the block.

The blocks used by Dalal and Triggs consisted of 2 cells by 2 cells. The blocks have “50% overlap”, which is best described through the illustration below.

Blocks

This block normalization is performed by concatenating the histograms of the four cells within the block into a vector with 36 components (4 histograms x 9 bins per histogram). Divide this vector by its magnitude to normalize it.
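
In code, the block normalization step might look like the sketch below. The function name normalize_block is hypothetical, and the small eps term is an assumption added only to avoid dividing by zero for an empty block; it is not described in the text above.

```python
import math


def normalize_block(cell_hists, eps=1e-6):
    """Concatenate the histograms of a block's cells and scale to unit length."""
    vec = [v for hist in cell_hists for v in hist]
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]


# Four hypothetical 9-bin cell histograms forming one 2x2 block.
block = [[float(i + j) for j in range(9)] for i in range(1, 5)]
descriptor = normalize_block(block)
print(len(descriptor))  # 36

# Multiplying every bin by 1.5 (a contrast change) gives the same result.
scaled = [[1.5 * v for v in hist] for hist in block]
print(max(abs(a - b) for a, b in zip(descriptor, normalize_block(scaled))))
```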

The effect of the block overlap is that each cell will appear multiple times in the final descriptor, but normalized by a different set of neighboring cells. (Specifically, the corner cells appear once, the other edge cells appear twice each, and the interior cells appear four times each).
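
The appearance counts can be verified with a short sketch. It assumes the 64x128 window is an 8x16 grid of cells and that 2x2-cell blocks advance one cell at a time (the "50% overlap"); the function name cell_appearances is hypothetical.

```python
def cell_appearances(cx, cy, cells_x=8, cells_y=16, block_cells=2):
    """Number of overlapping blocks that contain the cell at (cx, cy)."""
    def along(c, n_cells):
        n_blocks = n_cells - block_cells + 1      # block positions on this axis
        first = max(0, c - block_cells + 1)       # first block covering cell c
        last = min(c, n_blocks - 1)               # last block covering cell c
        return last - first + 1
    return along(cx, cells_x) * along(cy, cells_y)


print(cell_appearances(0, 0))  # corner cell: 1
print(cell_appearances(3, 0))  # edge cell: 2
print(cell_appearances(3, 5))  # interior cell: 4
```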

Honestly, my understanding of the rationale behind the block normalization is still a little shaky. In my earlier normalization example with the penguin flash card, I multiplied every pixel in the image by 1.5, effectively increasing the contrast by the same amount over the whole image. I imagine the rationale in the block normalization approach is that changes in contrast are more likely to occur over smaller regions within the image. So rather than normalizing over the entire image, we normalize within a small region around the cell.

Final Descriptor Size

The 64 x 128 pixel detection window will be divided into 7 blocks across and 15 blocks vertically, for a total of 105 blocks. Each block contains 4 cells with a 9-bin histogram for each cell, for a total of 36 values per block. This brings the final vector size to 7 blocks across x 15 blocks vertically x 4 cells per block x 9 bins per histogram = 3,780 values.
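
The same arithmetic as a small sketch, which makes it easy to see how the descriptor length would change with other window or cell sizes. The function name and parameter names are hypothetical, and the block stride of one cell is assumed from the 50% overlap described earlier.

```python
def hog_descriptor_size(win_w=64, win_h=128, cell=8,
                        block_cells=2, stride_cells=1, bins=9):
    """Return (blocks across, blocks down, total descriptor length)."""
    cells_x = win_w // cell
    cells_y = win_h // cell
    blocks_x = (cells_x - block_cells) // stride_cells + 1
    blocks_y = (cells_y - block_cells) // stride_cells + 1
    values_per_block = block_cells * block_cells * bins
    return blocks_x, blocks_y, blocks_x * blocks_y * values_per_block


print(hog_descriptor_size())  # (7, 15, 3780)
```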

HOG Detector in OpenCV

OpenCV includes a class for running the HOG person detector on an image.

Check out this post for some example code that should get you up and running quickly with the HOG person detector, using a webcam as the video source.

HOG Descriptor in Octave / MATLAB

To help in my understanding of the HOG descriptor, as well as to allow me to easily test out modifications to the descriptor, I’ve written a function in Octave for computing the HOG descriptor for a 64×128 image.

As a starting point, I began with the MATLAB code provided by another researcher here. That code doesn’t implement all of the features of the original HOG person detector, though, and didn’t make very effective use of vectorization.

I’ve dedicated a separate post to the Octave code; check it out here.



If you install OpenCV, the samples include python2/peopledetect.py, which you can use as a starting point.

A post explaining the function is here.

