
Chapter 16. Convolutional Neural Networks (CNN)

Recommended reading: 【Algorithm】 Algorithm Table of Contents


1. Overview

2. Components

3. Models

4. Examples



1. Overview

⑴ Concept

① CNN: a deep neural network that combines the filtering function (convolution) of traditional image processing with a neural network

② Also called a convolutional neural network

③ A typical CNN can have on the order of 100 million parameters

④ A typical CNN algorithm consists of 10–20 layers

⑤ Beyond its original purpose, CNNs are now used in many fields, including natural language processing

⑥ CNNs are considered to correspond to visual cortex V1, which recognizes edges, and visual cortex V4, which recognizes color, etc.

⑦ Comparison between a general neural network and a CNN



Table 1. Comparison between a general neural network and a CNN


⑵ Background of introduction

① Fully-connected layer: According to the universal approximation theorem, patterns such as images can be recognized even without the special architecture called CNN

② However, a fully-connected layer requires too many parameters, so the computing burden is high and the training time is long

③ A fully-connected layer is essentially a special case of a CNN

⑶ Assumptions

Assumption 1. spatial locality : the idea that the pattern of an entire object can be fully identified from patterns obtained from only local parts rather than the whole image

Assumption 2. positional invariance: the idea that the same pattern will be recognized regardless of position or viewing angle

⑷ Order

Example 1. Input → augmentation_layer → Conv2D, MaxPooling2D, etc.

Example 2. Input - Embedding - Convolution - Max pooling - Convolution 2 - Max pooling - ReLU - Linear(fc) - Output

○ Embedding: a layer that converts a one-hot vector into a dense vector

③ nested conv-layers: low-level features → mid-level features → high- level features → trainable classifier
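The Embedding step above can be sketched in NumPy: an embedding layer is just a trainable matrix, and looking up a token id selects one row, which is equivalent to multiplying the one-hot vector by that matrix (the sizes here are illustrative, not from the text).

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))  # trainable in a real model

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

dense_via_matmul = one_hot @ E   # one-hot vector × embedding matrix
dense_via_lookup = E[token_id]   # direct row lookup (what an Embedding layer does)

assert np.allclose(dense_via_matmul, dense_via_lookup)
```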

⑸ Activation functions

① identity: this form is called a linear classifier

② sigmoid: σ(x) = 1 / (1 + e^(-x))

③ tanh(x)

④ ReLU (rectified linear unit): max(0, x). Most frequently used

⑤ leaky ReLU: max(0.1x, x)

⑥ maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)

⑦ ELU (exponential linear unit): x if x ≥ 0; α(e^x − 1) if x < 0

⑧ softmax
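The activation functions above can be written out in a few lines of NumPy (a minimal sketch; the α value for ELU is a common default, and real frameworks ship these built in):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
assert np.allclose(relu(x), [0.0, 0.0, 3.0])
assert np.allclose(leaky_relu(x), [-0.2, 0.0, 3.0])
assert np.isclose(softmax(x).sum(), 1.0)  # softmax outputs a probability vector
```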



2. Components

Component 1. Input: input layer

Component 2. augmentation_layer

① Definition: makes the input more diverse through transformations such as random cropping and random rotation

② Purpose: this helps the model learn more robustly

③ Reference: https://www.tensorflow.org/tutorials/images/data_augmentation
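The two transformations mentioned in the definition can be sketched in plain NumPy (a toy version; tf.keras provides these as RandomFlip / RandomCrop layers, per the tutorial linked above):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_flip(img):
    # flip left-right with probability 0.5
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img, out_h, out_w):
    # cut a random out_h × out_w window out of the image
    h, w, _ = img.shape
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w, :]

img = rng.random((32, 32, 3))  # a dummy 32 × 32 RGB image
aug = random_crop(random_flip(img), 28, 28)
assert aug.shape == (28, 28, 3)
```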

Component 3. convolutional_layer: Conv2D, etc.


model = tf.keras.Sequential([
    ...
    layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=2),
    ...
])


① Purpose: to identify local patterns

② Input: an input of W × H × C is given

○ W: width

○ H: height

○ C: number of channels on the input side (e.g.: RGB channels)

③ Parameters: a total of four hyperparameters are required

○ K: number of filters. That is, the number of channels on the output side. A filter is also called a kernel

○ F: size of the filter. Also called kernel_size

○ S: stride. The interval at which the filter moves with the specified step

○ P: padding or zero padding. A preprocessing method that adds pixels with value 0 to the borders to prevent image shrinkage

○ Unlike the pooling_layer, it has padding and kernel_size but no separate spatial-extent hyperparameter

Number of parameters = number of parameters required for the filters = K(F²C + 1)

○ In CNN deep learning algorithms, these parameters are learned
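The parameter-count formula K(F²C + 1) follows directly: each of the K filters has F × F weights per input channel plus one bias. A one-line check (the C = 3 RGB input is an assumed example):

```python
def conv_params(K, F, C):
    # K filters, each with F*F weights per input channel, plus 1 bias each
    return K * (F * F * C + 1)

# e.g. a Conv2D layer with filters=96, kernel_size=3 on an RGB input:
assert conv_params(96, 3, 3) == 96 * (9 * 3 + 1)  # = 2688
```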

④ Operation: normalization is performed for each inner product

○ Reason: because the cosine angle between two vectors is important, not the absolute values of each vector

○ Example of the operation: in the example below, connected lines indicate multiplication, and summation is applied across lines



Figure 1. Example of the operation

(However, note that the bias is omitted in the figure above)
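The sliding inner product described above can be sketched for a single channel (stride 1, no padding, bias omitted as in the figure; the 2 × 2 kernel is an illustrative choice):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # each output pixel = elementwise product of the kernel with the
    # image patch under it, summed (an inner product per position)
    H, W = img.shape
    F, _ = kernel.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + F, j:j + F] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])       # responds to the diagonal difference
out = conv2d_valid(img, kernel)
assert out.shape == (3, 3)              # (4 - 2 + 1) per axis
assert out[0, 0] == img[0, 0] - img[1, 1]
```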


⑤ Filters: generally, 3 × 3 filters are used frequently

○ If the filter is large, the map becomes smaller quickly

○ If the image is large, too much computation is required

○ Filter sizes are mostly odd numbers

○ 1 × 1 conv: used to identify differences between channels

⑥ Output: if the output size of the convolutional layer is W’ × H’ × K,

○ W’ = (W - F + 2P) / S + 1

○ H’ = (H - F + 2P) / S + 1

○ In the fully-connected case, the total number of required parameters = (W × H × C + 1) × (W’ × H’ × K)

○ In the formulas for W’ and H’ above, the “+1” counts the kernel’s starting position, in addition to the number of one-step moves it can make across the input

○ If S = 1 and P = (F-1) / 2, then W’ = W and H’ = H: in this case, it is expressed in Python as follows


model = tf.keras.Sequential([
    ...
    layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=1, padding='same'),
    ...
])
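The output-size formula above can be checked with a small helper (a sketch; the 227 × 227 input with an 11 × 11 stride-4 filter is AlexNet’s well-known first conv layer):

```python
def conv_out(W, F, P, S):
    # W' = (W - F + 2P) / S + 1 for one spatial axis
    return (W - F + 2 * P) // S + 1

assert conv_out(32, 3, 0, 1) == 30    # no padding: the map shrinks
assert conv_out(32, 3, 1, 1) == 32    # P = (F-1)/2 = 1, S = 1: 'same', W' = W
assert conv_out(227, 11, 0, 4) == 55  # AlexNet's first conv layer
```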


⑦ Output example



Figure 2. Example of CNN output by a convolutional layer


Component 4. pooling_layer


model = tf.keras.Sequential([
    ...
    layers.MaxPooling2D((2,2), strides=2),
    ...
])


① Purpose: to reduce the input dimension

② Input: an input of W × H × C is given

○ W: width

○ H: height

○ C: number of channels on the input side (e.g.: RGB channels)

③ Parameters: a total of two hyperparameters are required

○ F: spatial extent

○ S: stride

○ Usually set as F = S = 2

○ Unlike convolutional_layer, it has spatial extent but does not have padding or kernel_size

Number of parameters = 0

○ Reason: because only simple operations that do not require learning, such as averaging, are performed

○ Types of operations: max pooling (e.g., MaxPooling2D), average pooling

④ Output: if the output size of the pooling layer is W’ × H’ × C (pooling preserves the number of channels),

○ W’ = (W - F) / S + 1

○ H’ = (H - F) / S + 1
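The pooling behavior above can be sketched for a single channel with the usual F = S = 2 setting, which halves each spatial dimension with zero trainable parameters:

```python
import numpy as np

def max_pool2d(x, F=2, S=2):
    # take the max over each F × F window, stepping by S
    H, W = x.shape
    out = np.zeros(((H - F) // S + 1, (W - F) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * S:i * S + F, j * S:j * S + F].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
out = max_pool2d(x)
assert out.shape == (2, 2)       # W' = (4 - 2)/2 + 1 = 2
assert out[0, 0] == 5.0          # max of the top-left block {0, 1, 4, 5}
```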

Component 5. Miscellaneous

① BatchNormalization: quite effective for normalizing batch data. The author uses it after maxpooling

② Flatten

③ Dense

④ Dropout



3. Models

Example 1. TensorFlow API


tf.keras.layers.Conv2D(
    filters,
    kernel_size,
    strides=(1,1),
    padding="valid",
    data_format=None,
    dilation_rate=(1,1),
    groups=1,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)


Example 2. U-net



Figure 3. Structure of U-net


① Frequently used in biomedical images

Example 3. AlexNet



Figure 4. Structure of AlexNet


① The square box on the far left is the input layer

② Starting from the input layer, the square boxes on the right are called CONV1, ···, CONV5, respectively

③ The maxpooling layers are called POOL1, ···, POOL3 from left to right

④ The network parts labeled maxpooling, dense, and dense on the far right are called FC6, FC7, and FC8, respectively

⑤ Uses ImageNet as the training dataset

Example 4. Znet: a 3D extension of 2D U-net

Example 5. DIP(deep image prior)



4. Examples

Example 1. ImageNet

① Labeled for computer vision research

② Inspired by WordNet

③ Created by Fei Fei Li

④ Contains more than 1 million images in 1,000 categories

Example 2. CIFAR-10

① A famous toy image classification dataset

② Consists of 60,000 small RGB images with width and height of 32 pixels

③ There are a total of 10 image classes

○ airplane

○ automobile

○ bird

○ cat

○ deer

○ dog

○ frog

○ horse

○ ship

○ truck

④ Of the 60,000 images, 50,000 are the training set and 10,000 are the test set



Entered: 2021.12.01 10:50

Revised: 2022.11.21 01:18
