Go to Part 1    (I'm currently focusing on the blockchain field, so I haven't had time to translate Part 2... sorry!)

Part II: Advanced concepts

We now have a very good intuition of what convolution is, what is going on in convolutional nets, and why convolutional nets are so powerful. But we can dig deeper to understand what is really going on within a convolution operation. In doing so, we will see that the original interpretation of computing a convolution is rather cumbersome, and we can develop more sophisticated interpretations which will help us to think about convolutions much more broadly, so that we can apply them to many different kinds of data. To achieve this deeper understanding, the first step is to understand the convolution theorem.

The convolution theorem

To develop the concept of convolution further, we make use of the convolution theorem, which relates convolution in the time/space domain — where convolution features an unwieldy integral or sum — to a mere element-wise multiplication in the frequency/Fourier domain. This theorem is very powerful and is widely applied in many sciences. The convolution theorem is also one of the reasons why the fast Fourier transform (FFT) algorithm is thought by some to be one of the most important algorithms of the 20th century.

convolution theorem

The first equation is the one dimensional continuous convolution theorem of two general continuous functions; the second equation is the 2D discrete convolution theorem for discrete image data. Here {\otimes} denotes a convolution operation, {\mathcal{F}} denotes the Fourier transform, {\mathcal{F}^{-1}} the inverse Fourier transform, and {\sqrt{2\pi}} is a normalization constant. Note that “discrete” here means that our data consists of a countable number of variables (pixels); and 1D means that our variables can be laid out in one dimension in a meaningful way, e.g. time is one dimensional (one second after the other), images are two dimensional (pixels have rows and columns), videos are three dimensional (pixels have rows and columns, and images come one after another).
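As a quick sanity check of the theorem, here is a minimal NumPy sketch (my own illustrative example, not from the original post) that computes the same 1D convolution once directly and once by element-wise multiplication in the Fourier domain:

```python
import numpy as np

def fft_convolve(f, g):
    """Convolve two 1D signals via the Fourier domain.

    By the convolution theorem, convolution in the time domain equals
    element-wise multiplication in the frequency domain. Zero-padding
    both signals to the full output length makes the circular
    convolution computed by the FFT match ordinary (linear) convolution.
    """
    n = len(f) + len(g) - 1
    F = np.fft.fft(f, n)            # transform both signals
    G = np.fft.fft(g, n)
    return np.fft.ifft(F * G).real  # multiply, then transform back

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 0.5])            # a simple smoothing kernel

direct = np.convolve(f, g)          # direct sum-based convolution
via_fft = fft_convolve(f, g)
print(np.allclose(direct, via_fft))  # True
```

For large inputs the Fourier route is much faster: the direct sum costs O(nk) for a length-k kernel, while the FFT route costs O(n log n) regardless of kernel size — one reason the FFT is considered so important.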

To get a better understanding of what happens in the convolution theorem, we will now look at the interpretation of Fourier transforms with respect to digital image processing.

Fast Fourier transforms

The fast Fourier transform is an algorithm that transforms data from the space/time domain into the frequency or Fourier domain. The Fourier transform describes the original function as a sum of wave-like cosine and sine terms. It is important to note that the Fourier transform is generally complex valued, which means that a real value is transformed into a complex value with a real and an imaginary part. Usually the imaginary part is only important for certain operations and for transforming the frequencies back into the space/time domain, and it will be largely ignored in this blog post. Below you can see a visualization of how a signal (a function of information, often with a time parameter, often periodic) is transformed by a Fourier transform.

Fourier_transform_time_and_frequency_domains
Transformation of the time domain (red) into the frequency domain (blue). Source

You may be unaware of this, but it might well be that you see Fourier transformed values on a daily basis: If the red signal is a song then the blue values might be the equalizer bars displayed by your mp3 player.

The Fourier domain for images

fourier Transforms
Images by Fisher & Koryllos (1998). Bob Fisher also runs an excellent website about Fourier transforms and image processing in general.

How can we imagine frequencies for images? Imagine a piece of paper with one of the two patterns from above on it. Now imagine a wave traveling from one edge of the paper to the other, where the wave pierces through the paper at each stripe of a certain color and hovers over the other. Such waves pierce the black and white parts at specific intervals, for example every two pixels — this represents the frequency. In the Fourier transform, lower frequencies are closer to the center and higher frequencies are at the edges (the maximum frequency for an image is at the very edge). The locations of Fourier transform values with high intensity (white in the images) are ordered according to the direction of the greatest change in intensity in the original image. This is very apparent from the next image and its log Fourier transforms (applying the log to the real values decreases the differences in pixel intensity in the image — we see the information more easily this way).

fourier_direction_detection
Images by Fisher & Koryllos (1998). Source
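This behavior can be reproduced with a few lines of NumPy (a hypothetical striped test image, not the figures above): the strong frequencies in the log spectrum line up along the direction in which the stripes change intensity.

```python
import numpy as np

# A striped test image: vertical stripes, one white column every 4 pixels.
img = np.zeros((64, 64))
img[:, ::4] = 1.0

# 2D FFT, shifted so low frequencies sit at the center as in the figures.
spectrum = np.fft.fftshift(np.fft.fft2(img))
log_magnitude = np.log(1.0 + np.abs(spectrum))  # log makes the values visible

# The stripes change intensity along the horizontal direction, so the
# strong frequency components sit on the central horizontal axis.
mag = np.abs(spectrum)
mag[32, 32] = 0.0                   # ignore the DC (mean) component
peak_row, peak_col = np.unravel_index(np.argmax(mag), mag.shape)
print(peak_row)  # 32 -- the central row, i.e. the horizontal axis
```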

We immediately see that a Fourier transform contains a lot of information about the orientation of an object in an image. If an object is turned by, say, 37 degrees, it is difficult to tell that from the original pixel information, but very clear from the Fourier transformed values.

This is an important insight: Due to the convolution theorem, we can imagine that convolutional nets operate on images in the Fourier domain and from the images above we now know that images in that domain contain a lot of information about orientation. Thus convolutional nets should be better than traditional algorithms when it comes to rotated images and this is indeed the case (although convolutional nets are still very bad at this when we compare them to human vision).

Frequency filtering and convolution

The reason why the convolution operation is often described as a filtering operation, and why convolution kernels are often called filters, will become apparent from the next example, which is very close to convolution.

Images by Fisher & Koryllos (1998). Source

If we transform the original image with a Fourier transform and then multiply it by a circle padded with zeros (zeros = black) in the Fourier domain, we filter out all high frequency values (they will be set to zero, due to the zero padded values). Note that the filtered image still has the same striped pattern, but its quality is much worse now — this is how JPEG compression works (although a different but similar transform is used): we transform the image, keep only certain frequencies, and transform back to the spatial image domain; the compression ratio would be the size of the black area relative to the size of the circle in this example.
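A minimal NumPy sketch of this frequency filtering (my own illustrative example with a hypothetical striped image): we keep only the frequencies inside a centered circle and transform back.

```python
import numpy as np

def lowpass_filter(img, radius):
    """Keep only low frequencies: multiply the shifted spectrum by a
    circle padded with zeros (ones inside `radius`, zeros outside),
    then transform back to the spatial domain."""
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

# A striped image: the stripe pattern survives the filter, but its
# sharp edges (the high frequencies) are lost.
img = np.zeros((64, 64))
img[:, ::8] = 1.0
filtered = lowpass_filter(img, radius=12)
print(filtered.std() < img.std())  # True -- high-frequency energy removed
```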

If we now imagine that the circle is a convolution kernel, then we have fully fledged convolution — just as in convolutional nets. There are still many tricks to speed up and stabilize the computation of convolutions with Fourier transforms, but this is the basic principle of how it is done.

Now that we have established the meaning of the convolution theorem and Fourier transforms, we can apply this understanding to different fields in science and enhance our interpretation of convolution in deep learning.

Insights from fluid mechanics

Fluid mechanics concerns itself with the creation of differential equation models for flows of fluids like air and water (air flows around an airplane; water flows around suspended parts of a bridge). Fourier transforms not only simplify convolution, but also differentiation, and this is why Fourier transforms are widely used in the field of fluid mechanics, or any field with differential equations for that matter.  Sometimes the only way to find an analytic solution to a fluid flow problem is to simplify a partial differential equation with a Fourier transform. In this process we can sometimes rewrite the solution of such a partial differential equation in terms of a convolution of two functions which then allows for very easy interpretation of the solution. This is the case for the diffusion equation in one dimension, and for some two dimensional diffusion processes for functions in cylindrical or spherical polar coordinates.

Diffusion

You can mix two fluids (milk and coffee) by moving the fluid with an outside force (stirring with a spoon) — this is called convection and is usually very fast. But you could also wait, and the two fluids would mix on their own (if it is chemically possible) — this is called diffusion, which is usually very slow compared to convection.

Imagine an aquarium that is split into two by a thin, removable barrier where one side of the aquarium is filled with salt water, and the other side with fresh water. If you now remove the thin barrier carefully, the two fluids will mix together until the whole aquarium has the same concentration of salt everywhere. This process is more “violent” the greater the difference in saltiness between the fresh water and salt water.

Now imagine you have a square aquarium with 256×256 thin barriers that separate 256×256 cubes each with different salt concentration. If you remove the barrier now, there will be little mixing between two cubes with little difference in salt concentration, but rapid mixing between two cubes with very different salt concentrations. Now imagine that the 256×256 grid is an image, the cubes are pixels, and the salt concentration is the intensity of each pixel. Instead of diffusion of salt concentrations we now have diffusion of pixel information.

It turns out that this is exactly one part of the convolution for the solution of the diffusion equation: one part is simply the initial concentrations of a certain fluid in a certain area — or, in image terms, the initial image with its initial pixel intensities. To complete the interpretation of convolution as a diffusion process, we need to interpret the second part of the solution to the diffusion equation: the propagator.

Interpreting the propagator

The propagator is a probability density function, which denotes into which direction fluid particles diffuse over time. The problem here is that we do not have a probability function in deep learning, but a convolution kernel — how can we unify these concepts?

We can apply a normalization that turns the convolution kernel into a probability density function. This is just like computing the softmax over output values in a classification task. Here is the softmax normalization for the edge detector kernel from the first example above.

softmax
Softmax of an edge detector: To calculate the softmax normalization, we take each value {x} of the kernel and apply {e^x}. After that we divide by the sum of all {e^x}. Please note that this technique to calculate the softmax will be fine for most convolution kernels, but for more complex data the computation is a bit different, to ensure numerical stability (floating point computation is inherently unstable for very large and very small values, and you have to carefully navigate around trouble in this case).
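A small NumPy sketch of this normalization, assuming the 3×3 edge detector with 8 in the center and −1 elsewhere (which produces the 0.0001 values discussed in the text); subtracting the maximum before exponentiating is the standard stability trick alluded to in the caption:

```python
import numpy as np

def kernel_softmax(kernel):
    """Turn a convolution kernel into a probability density function:
    exponentiate each entry, then divide by the sum of all exponentials.
    Subtracting the maximum first leaves the result unchanged but keeps
    the exponentials in a numerically safe range."""
    e = np.exp(kernel - kernel.max())
    return e / e.sum()

# An edge detector: a large center weight surrounded by negative weights.
edge = np.array([[-1., -1., -1.],
                 [-1.,  8., -1.],
                 [-1., -1., -1.]])
p = kernel_softmax(edge)
print(p.sum())  # 1.0 -- a valid probability distribution
```

Under this kernel, nearly all of the probability mass (about 0.999) sits on the center pixel, while each surrounding pixel gets roughly 0.0001.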

Now we have a full interpretation of convolution on images in terms of diffusion. We can imagine the operation of convolution as a two-part diffusion process: firstly, there is strong diffusion where pixel intensities change (from black to white, or from yellow to blue, etc.), and secondly, the diffusion process in an area is regulated by the probability distribution of the convolution kernel. That means that each pixel in the kernel area diffuses into another position within the kernel according to the kernel probability density.

For the edge detector above, almost all information in the surrounding area will concentrate in a single place (this is unnatural for diffusion in fluids, but this interpretation is mathematically correct). For example, all pixels that lie under the 0.0001 values will very likely flow into the center pixel and accumulate there. The final concentration will be largest where the differences between neighboring pixels are largest, because there the diffusion process is most marked. In turn, the greatest differences between neighboring pixels occur where the edges between different objects are, which explains why the kernel above is an edge detector.

So there we have it: convolution as diffusion of information. We can apply this interpretation directly to other kernels. Sometimes we have to apply a softmax normalization for the interpretation, but generally the numbers themselves say a lot about what will happen. Take the following kernel, for example. Can you interpret what that kernel is doing? Click here to find the solution (there is a link back to this position).

softmax_quiz

Wait, there is something fishy here

How come we get deterministic behavior if we have a convolution kernel with probabilities? Don't we have to interpret this as single particles diffusing according to the probability distribution of the kernel, that is, according to the propagator?

Yes, this is indeed true. However, if you take a tiny piece of fluid, say a tiny drop of water, you still have millions of water molecules in that drop, and while a single molecule behaves stochastically according to the probability distribution of the propagator, a whole bunch of molecules has quasi-deterministic behavior — this is an important interpretation from statistical mechanics and thus also for diffusion in fluid mechanics. We can interpret the probabilities of the propagator as the average distribution of information or pixel intensities; thus our interpretation is correct from the viewpoint of fluid mechanics. However, there is also a valid stochastic interpretation of convolution.

Insights from quantum mechanics

The propagator is an important concept in quantum mechanics. In quantum mechanics a particle can be in a superposition, where it has two or more properties that are mutually exclusive in our empirical world: for example, in quantum mechanics a particle can be at two places at the same time — that is, a single object in two places.

However, when you measure the state of the particle — for example, where the particle is right now — it will be either in one place or the other. In other terms, you destroy the superposition state by observing the particle. The propagator then describes the probability distribution over where you can expect the particle to be. So after a measurement, a particle might be — according to the probability distribution of the propagator — in place A with 30% probability and in place B with 70% probability.

If we have entangled particles (spooky action at a distance), a few particles can hold hundreds or even millions of different states at the same time — this is the power promised by quantum computers.

So if we use this interpretation for deep learning, we can think of the pixels in an image as being in a superposition state, so that in each image patch each pixel is in 9 positions at the same time (if our kernel is 3×3). Once we apply the convolution, we make a measurement and the superposition of each pixel collapses into a single position as described by the probability distribution of the convolution kernel. In other words: for each pixel, we choose one of the 9 pixels at random (with the probabilities of the kernel) and the resulting pixel is the average of all these pixels. For this interpretation to be true, this needs to be a truly stochastic process, which means that the same image and the same kernel will generally yield different results. This interpretation does not relate one to one to convolution, but it might give you ideas for how to apply convolution in stochastic ways or how to develop quantum algorithms for convolutional nets. A quantum algorithm would be able to calculate all possible combinations described by the kernel with one computation, and in linear time/qubits with respect to the size of the image and kernel.

Insights from probability theory

Convolution is closely related to cross-correlation. Cross-correlation is an operation that takes a small piece of information (a few seconds of a song) and filters a large piece of information (the whole song) for similarity (similar techniques are used on YouTube to automatically tag videos for copyright infringement).

Relation between cross-correlation and convolution: Here {\star} denotes cross-correlation and {f^*} denotes the complex conjugate of {f}.

While cross-correlation may seem unwieldy, there is a trick with which we can easily relate it to convolution in deep learning: for images, we can simply turn the search image upside down to perform cross-correlation through convolution. When we perform convolution of an image of a person with an upside-down image of a face, the result will be an image with one or multiple bright pixels at the location where the face matches the person.

crosscorrelation_Example
Cross-correlation via convolution: The input and kernel are padded with zeros and the kernel is rotated by 180 degrees. The white spot marks the area with the strongest pixel-wise correlation between image and kernel. Note that the output image is in the spatial domain, the inverse Fourier transform was already applied. Images taken from Steven Smith’s excellent free online book about digital signal processing.
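The flip trick is easy to verify in one dimension with NumPy (a hypothetical "song snippet" search rather than the face example): cross-correlation equals convolution with the reversed template, and the peak marks the match location.

```python
import numpy as np

signal = np.array([0., 1., 3., 2., 0., 1., 3., 2., 0.])
template = np.array([1., 3., 2.])   # the small piece we search for

# Cross-correlation equals convolution with the flipped template
# (the 1D analogue of rotating an image kernel by 180 degrees).
corr = np.correlate(signal, template, mode="full")
conv = np.convolve(signal, template[::-1], mode="full")
print(np.allclose(corr, conv))  # True

# The "bright pixel": correlation peaks where the template matches.
best = int(np.argmax(corr)) - (len(template) - 1)
print(best)  # 1 -- the first occurrence of the template in the signal
```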

This example also illustrates padding with zeros to stabilize the Fourier transform, which is required in many versions of the Fourier transform. There are versions which require different padding schemes: some implementations wrap the kernel around itself and require padding only for the kernel, and yet other implementations perform divide-and-conquer steps and require no padding at all. I will not expand on this; the literature on Fourier transforms is vast and there are many tricks to be learned to make them run better — especially for images.

At lower levels, convolutional nets will not perform cross-correlation, because we know that they perform edge detection in the very first convolutional layers. But in later layers, where more abstract features are generated, it is possible that a convolutional net learns to perform cross-correlation through convolution. It is imaginable that the bright pixels from the cross-correlation are redirected to units which detect faces (the Google Brain project has some units in its architecture which are dedicated to faces, cats, etc.; maybe cross-correlation plays a role here?).

Insights from statistics

What is the difference between statistical models and machine learning models? Statistical models often concentrate on very few variables which can be easily interpreted. Statistical models are built to answer questions: Is drug A better than drug B?

Machine learning models are about predictive performance: Drug A increases successful outcomes by 17.83% with respect to drug B for people with age X, but 22.34% for people with age Y.

Machine learning models are often much more powerful for prediction than statistical models, but they are not reliable. Statistical models are important to reach accurate and reliable conclusions:  Even when drug A is 17.83% better than drug B, we do not know if this might be due to chance or not; we need statistical models to determine this.

Two important statistical models for time series data are the weighted moving average and the autoregressive model, which can be combined into the ARIMA model (autoregressive integrated moving average model). ARIMA models are rather weak when compared to models like long short-term memory (LSTM) recurrent neural networks, but ARIMA models are extremely robust when you have low dimensional data (1-5 dimensions). Although their interpretation takes some effort, ARIMA models are not a black box like deep learning algorithms, and this is a great advantage if you need very reliable models.

It turns out that we can rewrite these models as convolutions, and thus we can show that convolutions in deep learning can be interpreted as functions which produce local ARIMA features that are then passed to the next layer. The two ideas do not overlap fully, however, so we must be cautious about when we can really apply this interpretation.

autoregression_weighted_average

Here {C(\mbox{kernel})} is a constant function which takes the kernel as parameter; white noise is data with mean zero, a standard deviation of one, and each variable is uncorrelated with respect to the other variables.

When we pre-process data, we often make it very similar to white noise: we often center it around zero and set the variance/standard deviation to one. Creating uncorrelated variables is used less often because it is computationally intensive; however, conceptually it is straightforward: we reorient the axes along the eigenvectors of the data.

eigenvector_decorrelation
Decorrelation by reorientation along eigenvectors: The eigenvectors of this data are represented by the arrows. If we want to decorrelate the data, we reorient the axes to have the same direction as the eigenvectors. This technique is also used in PCA, where the dimensions with the least variance (shortest eigenvectors) are dropped after reorientation.
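A minimal NumPy sketch of this decorrelation step, on hypothetical correlated 2D data: rotating the centered data onto the eigenvectors of its covariance matrix makes the off-diagonal covariance vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2D data: the second variable largely follows
# the first, so the covariance matrix has a large off-diagonal entry.
x = rng.normal(size=1000)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=1000)])

# Center the data, then reorient the axes along the eigenvectors of the
# covariance matrix; in the rotated frame the variables are uncorrelated.
data -= data.mean(axis=0)
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
decorrelated = data @ eigvecs

new_cov = np.cov(decorrelated, rowvar=False)
print(abs(new_cov[0, 1]) < 1e-10)  # True -- off-diagonal covariance is ~0
```

Dropping the columns with the smallest eigenvalues at this point is exactly the PCA dimensionality reduction mentioned in the caption.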

Now, if we take {C(\mbox{kernel})} to be the bias, then we have an expression that is very similar to a convolution in deep learning. So the outputs from a convolutional layer can be interpreted as outputs from an autoregressive model if we pre-process the data to be white noise.

The interpretation of the weighted moving average is simple: It is just standard convolution on some data (input) with a certain weight (kernel). This interpretation becomes clearer when we look at the Gaussian smoothing kernel at the end of the page. The Gaussian smoothing kernel can be interpreted as a weighted average of the pixels in each pixel’s neighborhood, or in other words, the pixels are averaged in their neighborhood (pixels “blend in”, edges are smoothed).
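The weighted-moving-average reading can be sketched in NumPy with a hypothetical 1D spike signal: the Gaussian kernel's weights sum to one, so each output value is a true weighted average of its neighborhood, and the spike blends into its neighbors.

```python
import numpy as np

# A 1D Gaussian kernel: weights fall off with distance from the center
# and are normalized to sum to one, making this a weighted average.
x = np.arange(-2, 3)
kernel = np.exp(-x**2 / 2.0)
kernel /= kernel.sum()

signal = np.array([0., 0., 0., 10., 0., 0., 0.])  # a sharp spike
smoothed = np.convolve(signal, kernel, mode="same")

# The spike "blends in": its mass spreads over the neighborhood, but the
# total is preserved because the weights sum to one.
print(np.isclose(smoothed.sum(), signal.sum()))  # True
```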

While a single kernel cannot create both autoregressive and weighted moving average features, we usually have multiple kernels, and in combination all these kernels might contain some features which are like a weighted moving average model and some which are like an autoregressive model.

Conclusion

In this blog post we have seen what convolution is all about and why it is so powerful in deep learning. The interpretation of image patches is easy to understand and easy to compute, but it has many conceptual limitations. We developed convolutions via Fourier transforms and saw that Fourier transforms contain a lot of information about the orientation of an image. With the powerful convolution theorem we then developed an interpretation of convolution as the diffusion of information across pixels. We then extended the concept of the propagator from quantum mechanics to arrive at a stochastic interpretation of the usually deterministic convolution process. We showed that cross-correlation is very similar to convolution and that the performance of convolutional nets may depend on the correlation between feature maps which is induced through convolution. Finally, we finished by relating convolution to autoregressive and moving average models.

Personally, I found it very interesting to work on this blog post. For a long time I felt that my undergraduate studies in mathematics and statistics were somehow wasted, because they were so impractical (even though I study applied math). But later — like an emergent property — all these thoughts linked together and a practically useful understanding emerged. I think this is a great example of why one should be patient and carefully study all university courses — even if they seem useless at first.

convolution_quiz
Solution to the quiz above: The information diffuses nearly equally among all pixels, and this process will be stronger for neighboring pixels that differ more. This means that sharp edges will be smoothed out, and information that is in one pixel will diffuse and mix slightly with the surrounding pixels. This kernel is known as a Gaussian blur or Gaussian smoothing. Continue reading. Sources: 1 2


I like to think I have a knack for explaining things simply, so I considered doing the same for CNNs (convolutional neural networks), but it proved too much. Explaining something simply is only possible once you have a firm understanding of it, and since I am still adrift in the storm that is CNNs, I will instead translate a good blog post from overseas. For an ordinary software engineer who wants to make use of deep learning, I don't think a mathematical understanding is necessary, but an intuitive understanding definitely is, and this post should be a good guide for that.


To understand this post, it helps to have studied the following prerequisites first. (Of course, an intuitive understanding of these is enough as well. If someone explained them to you in person, half a day would suffice, but if you study them from books on your own it could take quite a while.)

- Perceptrons
- Gradient descent
- Overfitting
- The backpropagation algorithm
- The Sobel mask (Sobel edge operator)
- Linear regression / logistic regression
- Sigmoid

(Harder topics like autoencoders and RBMs can wait for later!)


* A tip for people new to deep learning

The CNN covered in this post is a kind of deep learning used mainly for image recognition (it also works for speech and one-dimensional time-series data). At the 2012 world image-recognition competition (ILSVRC), SuperVision from the University of Toronto in Canada came out of nowhere and won by a wide margin over the world's leading institutions, and the method it used was based on CNNs. (Until then, approaches such as SIFT and HOG had dominated.)
In this competition, ten million images are used to train the machine-learning models, and 150,000 images are used for testing to measure the accuracy.
In short: recognize a cat image as a cat, and you succeed!

This cat recognition relied on a remarkable invention called 'feature representation learning', in which the computer creates the feature representations by itself.

The following, from Google's cat research and other materials, should give you some intuition. The lower layers only recognize 'shapes' that appear frequently in images, such as dots and edges; going up, the layers can recognize figures such as circles and triangles; above that, they capture forms such as faces. When classifying a new image, if such forms appear with high probability, the image is classified as 'a cat'.





The top shows the CONV 1 layer and the bottom the CONV 5 layer, from training on a cat image with the AlexNet architecture; each box shows the activation map associated with one filter. The activations are sparse (mostly zero, shown as black in the images above) and mostly local.


Understanding Convolution in Deep Learning

2015-03-26 by Tim Dettmers 

Convolution is probably the most important concept in deep learning right now. Convolution and convolutional nets are at the forefront of most of machine learning, and they have lifted deep learning to stardom. But what makes convolution so powerful? How does it work? This blog post answers these questions by comparing convolution with other concepts, and it should help you build an intuitive understanding of it.

There are already some blog posts about convolution, but I think they mostly create confusion through unnecessary mathematical detail (presented as though it helps understanding). I won't claim this post is free of mathematical notation either, but at least I will present it together with images everyone can understand, to support a conceptual grasp. The goal of the first part of this post is to let anyone understand convolution and convolutional neural networks intuitively. The second part explains deeper concepts, for researchers and for anyone who wants a more thorough understanding.

What is convolution?

This entire blog post is devoted to answering this question precisely. But first, let's get our bearings: roughly speaking, what is convolution?

To begin with, you can picture convolution as the mixing of information. Imagine two buckets full of information that are poured into one bucket and then mixed according to a specific rule. Each bucket has its own recipe, which tells how the pieces of information mix with each other in the single bucket. In short, convolution is an ordered procedure by which two pieces of information intertwine. (Translator's note: think of the dot product of two vectors.)

Convolution can also be described mathematically; in fact, it is a mathematical operation just like addition, multiplication, or differentiation, and it can be a good tool for simplifying complex equations. Convolutions are used heavily in physics and engineering precisely because they often simplify such complex equations. In the second part we will look at the relationships to, and insights from, these fields, but for now we will look at convolution only from a practical point of view.

How do we apply convolution to images?

When we apply convolution to an image, we can think of it in two dimensions: an image with a width and a height. When we mix the two buckets, the first bucket holds the input image (the full three-dimensional array of pixels), where the red, green, and blue color channels each form a matrix. (Translator's note: you can also reduce the three colors to a single grayscale channel.) A pixel consists of an integer between 0 and 255 in each color channel. The second bucket holds the convolution kernel, a single matrix of real numbers, whose size and pattern form a recipe for how to mix the input image with the kernel through the convolution operation. The output of the kernel is an image called a 'feature map'. For each kernel there will be one feature map per color channel. (Translator's note: the more kinds of kernels, i.e. second buckets, you have, the more varied the feature maps you get.)

convolution

A feature map produced by mixing the input image with an edge-detection kernel (the second bucket).
(Translator's note: the key point of CNNs is that they create this kernel automatically.)


We have intertwined these two pieces of information through convolution. One way to apply convolution is to take an image patch the size of the kernel from the input image and compute the convolution of that patch with the kernel. (Translator's note: if the input image above is 100×100 and the convolution kernel is 3×3, then for the matrix multiplication a kernel-sized piece is cut out of the input image; this cut-out piece is called a patch.)

The sum of one such operation maps to one pixel in the feature map. After one pixel of the feature map has been computed, the image patch moves one step to the right, a new patch is defined, and that patch is combined with the kernel to compute the next pixel. Let's follow this procedure with the image below.

Calculating convolution by operating on image patches.

(Translator's note: when the kernel is applied to patches there are further techniques, such as adjusting padding and stride values, pooling to shrink a large image to a quarter of its size, dropout, connecting layers only partially, or grouping and sharing parameters; you can approach these after building overall intuition. For now, just focus on the idea that we insert some filter between the input image and the output to produce a new set of images.)
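The patch procedure described above can be sketched in a few lines of NumPy (a hypothetical minimal example; like convolutional nets, it slides the kernel without flipping it, which is strictly speaking cross-correlation):

```python
import numpy as np

def convolve2d_patches(image, kernel):
    """Naive convolution: slide over the image, cut out a patch the size
    of the kernel at each position, multiply element-wise with the
    kernel, and sum; each sum becomes one pixel of the feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]   # the cut-out patch
            out[i, j] = np.sum(patch * kernel)  # one feature-map pixel
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0       # an averaging kernel
feature_map = convolve2d_patches(image, kernel)
print(feature_map.shape)  # (2, 2) -- a 3x3 kernel fits 2x2 positions
```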

Why is the convolution of images important in machine learning?

Images contain a wealth of information that can be extracted in various ways. As a good example, let's look at a project I took part in. Burda Bootcamp is a hackathon-style prototyping lab where students build technically risky projects at high speed in a very short time frame; we built 11 products in two months. In one project I wanted to build a fashion-image search with deep autoencoders: you upload an image of a fashion item, and the autoencoder should find images containing clothes of a similar style.

Now, if you want to describe the differences between clothing styles, the colors of the clothes will not be very useful for doing so, and neither will things like brand emblems. The most important thing is the outline and shape of the clothing itself: in general, the shape of a blouse is very different from that of a shirt or a jacket. So if we design a filter that removes the irrelevant information, we avoid the disaster of items being distinguished by irrelevant details; in the end, we simply perform convolution with such a kernel.

A colleague of mine preprocessed the data by applying a Sobel edge detector (similar to the edge detection in the first figure), a filter that removes everything from the image except the outlines of the objects' shapes. This is why applying a convolution is often called filtering, and why the kernels are also called filters. The resulting feature maps are a great help when you want to distinguish different kinds of clothing. Look at the images below.

autoencoder_fashion_features_and_results

Using this kind of procedure, taking an input, transforming it, and feeding the transformed images to an algorithm, is called feature engineering. Feature engineering is very hard, because there are few resources that help you learn the skill. As a result, very few people can apply feature engineering well across a wide range of tasks; it is said to be the most important skill to score well in Kaggle competitions.

Feature engineering is so hard because different features are suitable for each type of data and each type of problem. Knowledge of feature engineering for image tasks is of little use for time-series data, and even across similar image tasks it is not easy to engineer good features, since which objects in the image matter depends on what we are trying to do. All of this takes a great deal of experience. So feature engineering is very difficult, and for each new task in front of you, you have to start again from scratch. But, but!!!! What if the right kernels for most tasks could be found automatically?

Enter CNNs (Convolutional Neural Networks)

Instead of our kernels having fixed, hand-chosen values (translator's note: instead of an engineer picking the kernels directly), convolutional nets do exactly that for us. As we train our convolutional net, the kernels become better and better at filtering a given image (or a given feature map). Because this process is automatic, it is called feature learning. Feature learning generalizes automatically to each new objective: we simply train our network to find new filters suited to the new objective. (Translator's note: architecture models such as CaffeNet and GoogleNet already exist; fine-tuning is still required, and a completely different domain remains a hard challenge.) This is what makes convolutional nets so powerful: feature engineering is no longer difficult, because it happens automatically!

Usually we do not learn a single kernel in a convolutional net; instead, we learn hierarchies of many kernels at once. For example, applying 32 kernels of size 16×16 to a 256×256 image would produce 32 feature maps of size 241×241. So we automatically learn 32 new features (with shapes that suit our objective), and these features serve as the input to the next kernel (that is, the data passes through several kernels, just as the human brain consists of several layers of neurons). Once the hierarchical features have been learned, we simply pass them to a fully connected layer, which combines them to classify the image into a particular class (cat, bicycle). (Translator's note: the choice is made probabilistically, depending on how strongly particular representations are present; a regression method called softmax is commonly used.)

This is just about everything you need to understand CNNs at a conceptual level; Part 2 goes into more detail.

Go to Part 2   Go to the original   Understanding through visualization
