Chapter 16. Convolutional Neural Networks (CNN)

Recommended reading: 【Algorithm】 Algorithm Table of Contents

1. Overview

⑴ Concept

① CNN: a deep neural network that combines the filtering function (convolution) of traditional image processing with a neural network

② CNN stands for convolutional neural network; also called a ConvNet

③ Classic large CNNs have tens to hundreds of millions of parameters (e.g., AlexNet has about 60 million and VGG-16 about 138 million)

④ A typical CNN architecture consists of roughly 10–20 layers

⑤ Beyond its original purpose, CNNs are now used in many fields, including natural language processing

⑥ CNNs are considered to correspond to visual cortex V1, which recognizes edges, and visual cortex V4, which recognizes color, etc.

⑦ Comparison between a general neural network and a CNN

Table 1. Comparison between a general neural network and a CNN

⑵ Background of introduction

① Fully-connected layer: According to the universal approximation theorem, patterns such as images can be recognized even without the special architecture called CNN

② However, a fully-connected layer requires too many parameters, so the computing burden is high and the training time is long

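③ A fully-connected layer can be viewed as a special case of a convolutional layer: one whose filter covers the entire input (F = W = H with P = 0), so that W’ = H’ = 1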

⑶ Assumptions

① Assumption 1. Spatial locality: the idea that an object's overall pattern can be identified by combining patterns found in local parts of the image, without looking at the whole image at once

② Assumption 2. Positional invariance (translational invariance): the idea that the same pattern will be recognized regardless of where it appears in the image

⑷ Order

① Example 1. augmentation_layer → Input → Conv2D, MaxPooling2D, etc.

② Example 2. Input - Embedding - Convolution - Max pooling - Convolution 2 - Max pooling - ReLU - Linear(fc) - Output

○ Embedding: a layer that converts a one hot vector into a dense vector

③ Stacked conv layers: low-level features → mid-level features → high-level features → trainable classifier

⑸ Activation function

① identity: this form is called a linear classifier

② sigmoid σ(x) = 1 / (1 + e^-x)

③ tanh(x)

④ ReLU (rectified linear unit): max(0, x). Most frequently used

⑤ leaky ReLU: max(0.1x, x)

⑥ maxout: max(w₁^Tx + b₁, w₂^Tx + b₂)

⑦ elu (exponential linear unit): x if x ≥ 0; α(e^x - 1) if x < 0

⑧ softmax
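
The activation functions listed above can be sketched in plain Python as follows. This is a minimal illustration for scalar inputs (and a list for softmax), not the implementation any framework uses. The function names, the default α = 1.0 for elu, and subtracting the maximum inside softmax for numerical stability are assumptions made for this sketch.

```python
import math

def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    # For slope < 1 this keeps x when x >= 0 and returns slope * x otherwise.
    return max(slope * x, x)

def maxout(x, w1, b1, w2, b2):
    # x, w1, w2 are equal-length lists; the output is the larger of two affine maps.
    z1 = sum(w * v for w, v in zip(w1, x)) + b1
    z2 = sum(w * v for w, v in zip(w2, x)) + b2
    return max(z1, z2)

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def softmax(xs):
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike the other functions, softmax acts on a whole vector and returns values that sum to 1, which is why it is usually placed at the output of a classifier.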

2. Components

⑴ Component 1. Input: input layer

⑵ Component 2. augmentation_layer

① Definition: makes the input more diverse through transformations such as random cropping and random rotation

② Purpose: this helps the model learn more robustly

③ Reference: https://www.tensorflow.org/tutorials/images/data_augmentation
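
The idea of random cropping and random rotation can be sketched without any framework. The helpers below are hypothetical names written for this illustration, they operate on an image stored as a nested list, and the rotation is restricted to multiples of 90° to avoid interpolation (an assumption of this sketch; library augmentation layers typically rotate by arbitrary angles).

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position of a 2D nested list."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def rot90_clockwise(image):
    """Rotate a 2D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def random_rot90(image, rng):
    """Rotate by a random multiple of 90 degrees (0, 90, 180, or 270)."""
    for _ in range(rng.randrange(4)):
        image = rot90_clockwise(image)
    return image

rng = random.Random(0)                  # seeded so the sketch is reproducible
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = random_rot90(random_crop(image, 3, 3, rng), rng)
```

Each call produces a slightly different view of the same image, which is what makes the training input more diverse.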

⑶ Component 3. convolutional_layer: Conv2D, etc.

model = tf.keras.Sequential([
    ...
    layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=2),
    ...
])

① Purpose: to identify local patterns

② Input: an input of W × H × C is given

○ W: width

○ H: height

○ C: number of channels on the input side (e.g.: RGB channels)

③ Parameters: a total of four hyperparameters are required

○ K: number of filters, which equals the number of channels on the output side (e.g., filters=96 gives 96 output channels). A filter is also called a kernel

○ F: size of the filter. Also called kernel_size

○ S: stride. The number of pixels the filter moves at each step

○ P: padding or zero padding. A preprocessing method that adds pixels with value 0 to the borders to prevent image shrinkage

○ In Keras, the filter size F is passed as kernel_size and P is controlled by padding; the analogous window size of a pooling_layer is called its spatial extent (pool_size)

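○ Number of parameters = number of parameters required for the filters = K(F²C + 1): each of the K filters has F × F × C weights plus one bias

○ During training, the filter weights and biases are learned; K, F, S, and P are hyperparameters fixed in advance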

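④ Operation: each output value is the inner product between a filter and the input patch it covers; in the formulation described here, this inner product is normalized

○ Reason: what matters is the cosine of the angle between the two vectors (how similar their directions are), not the magnitude of each vector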

○ Example of the operation: in the example below, connected lines indicate multiplication, and summation is applied across lines

Figure 1. Example of the operation

(However, note that the bias is omitted in the figure above)
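
The multiply-and-sum operation can be sketched in plain Python for a single filter. This is a minimal illustration, not how Conv2D is implemented: the function name and the nested-list layout (image as H × W × C, filter as F × F × C) are assumptions of this sketch, the bias is included, and the inner product is left unnormalized, which is what standard convolution layers compute.

```python
def conv2d_single_filter(image, kernel, bias=0.0, stride=1, padding=0):
    """Slide one F x F x C filter over an H x W x C image (nested lists)."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    # Zero padding: surround the image with `padding` rows/columns of zeros.
    padded = [[[0.0] * c for _ in range(w + 2 * padding)]
              for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = list(image[i][j])
    out_h = (h - f + 2 * padding) // stride + 1
    out_w = (w - f + 2 * padding) // stride + 1
    out = []
    for oi in range(out_h):
        row = []
        for oj in range(out_w):
            acc = bias
            # Inner product between the filter and the patch it currently covers.
            for di in range(f):
                for dj in range(f):
                    for ch in range(c):
                        acc += (padded[oi * stride + di][oj * stride + dj][ch]
                                * kernel[di][dj][ch])
            row.append(acc)
        out.append(row)
    return out

# 3 x 3 image with one channel, 2 x 2 filter of ones: each output sums a 2 x 2 patch.
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
kernel = [[[1.0], [1.0]],
          [[1.0], [1.0]]]
feature_map = conv2d_single_filter(image, kernel)  # [[12.0, 16.0], [24.0, 28.0]]
```

A layer with K filters repeats this computation K times with different weights and stacks the K resulting maps as output channels.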

⑤ Filters: generally, 3 × 3 filters are used frequently

○ If the filter is large, the feature map shrinks quickly from layer to layer

○ If the image is large, too much computation is required

○ Filter sizes are mostly odd numbers

○ 1 × 1 conv: used to capture relationships between channels at a single spatial position (it mixes channels without looking at neighboring pixels)

⑥ Output: if the output size of the convolutional layer is W’ × H’ × K,

○ W’ = (W - F + 2P) / S + 1

○ H’ = (H - F + 2P) / S + 1

○ In the fully-connected case, the total number of required parameters = (W × H × C + 1) × (W’ × H’ × K)

○ In the formulas for W’ and H’ above, the “+1” counts the filter's starting position: the filter can shift (W - F + 2P) / S times in steps of S, so it occupies that many positions plus the initial one

○ If S = 1 and P = (F-1) / 2, then W’ = W and H’ = H: in this case, it is expressed in Python as follows

model = tf.keras.Sequential([
    ...
    layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=1, padding='same'),
    ...
])
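
The size and parameter-count formulas above can be checked numerically with a short sketch. The function names are hypothetical, and using integer division (rounding down when the stride does not divide evenly) is an assumption of this sketch. The example input of 32 × 32 × 3 is chosen only for illustration; K = 96 and F = 3 match the Conv2D snippet above.

```python
def conv_output_size(w, h, f, s, p):
    """W' = (W - F + 2P) / S + 1 and H' = (H - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1, (h - f + 2 * p) // s + 1

def conv_param_count(f, c, k):
    """K filters, each with F * F * C weights and one bias."""
    return k * (f * f * c + 1)

def fully_connected_param_count(w, h, c, w_out, h_out, k):
    """Every output unit sees every input value plus one bias."""
    return (w * h * c + 1) * (w_out * h_out * k)

W, H, C = 32, 32, 3   # illustrative input size
K, F = 96, 3

# S = 1 and P = (F - 1) / 2 = 1 keep the spatial size unchanged.
same_size = conv_output_size(W, H, F, s=1, p=1)      # (32, 32)
# S = 2 without padding roughly halves the spatial size.
strided_size = conv_output_size(W, H, F, s=2, p=0)   # (15, 15)

conv_params = conv_param_count(F, C, K)              # 96 * (27 + 1) = 2688
fc_params = fully_connected_param_count(W, H, C, *same_size, K)
```

For this input, the convolutional layer needs 2,688 parameters, while a fully-connected layer producing an output of the same size would need hundreds of millions, which illustrates the computing burden mentioned in the background section.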

⑦ Output example

Figure 2. Example of CNN output by a convolutional layer

⑷ Component 4. pooling_layer

model = tf.keras.Sequential([
    ...
    layers.MaxPooling2D((2,2), strides=2),
    ...
])

① Purpose: to reduce the input dimension

② Input: an input of W × H × C is given

○ W: width

○ H: height

○ C: number of channels on the input side (e.g.: RGB channels)

③ Parameters: a total of two hyperparameters are required

○ F: spatial extent

○ S: stride

○ Usually set as F = S = 2

○ The spatial extent plays the role that kernel_size plays in a convolutional_layer; in Keras it is passed as pool_size, and padding is usually left at its default (no padding)

○ Number of parameters = 0

○ Reason: only simple operations that do not require learning, such as taking the maximum or the average, are performed

○ Types of operations: max pooling (e.g., MaxPooling2D), average pooling

④ Output: if the output size of the pooling layer is W’ × H’ × C (pooling is applied to each channel separately, so the number of channels is unchanged),

○ W’ = (W - F) / S + 1

○ H’ = (H - F) / S + 1
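
Max pooling with spatial extent F and stride S can be sketched as follows for a single channel. The function name and the nested-list representation are assumptions of this sketch; a real layer applies the same operation to every channel independently.

```python
def max_pool2d(feature_map, f=2, s=2):
    """Keep the maximum of every f x f window, moving s pixels at a time."""
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    return [[max(feature_map[oi * s + di][oj * s + dj]
                 for di in range(f) for dj in range(f))
             for oj in range(out_w)]
            for oi in range(out_h)]

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = max_pool2d(feature_map)  # [[6, 8], [14, 16]]
```

With F = S = 2, the width and height are halved and no value is learned, which matches the parameter count of 0 stated above.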

⑸ Component 5. Miscellaneous

① BatchNormalization

○ Quite effective for normalizing batch data. The author uses it after maxpooling

○ Problem 1. It behaves differently during training and inference.

○ Problem 2. It introduces additional state (running mean and variance) that must be updated outside standard gradient updates.

○ Problem 3. It breaks a key assumption: most layers operate independently on each batch element, but batch norm computes statistics across the batch. This makes it incompatible with tools like vmap or sharded training (pmap, pjit) unless you take special care to synchronize statistics across devices.

② Flatten

③ Dense

④ Dropout

○ Problem 1. Dropout introduces stochasticity, making it harder to determine whether poor performance is due to randomness or a deeper issue.
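
Problems 1 and 2 listed under BatchNormalization can be made concrete with a small sketch. The class below is a simplified, hypothetical version for one scalar feature: it omits the learnable scale and shift, and the momentum and epsilon values are assumptions chosen for illustration.

```python
class BatchNorm1D:
    """Batch normalization for one scalar feature (no learnable scale/shift)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # Extra state that is updated outside of gradient descent.
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: statistics come from the current batch as a whole.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Inference: statistics come from the stored running averages.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

bn = BatchNorm1D()
train_out = bn([1.0, 2.0, 3.0], training=True)   # normalized with batch statistics
infer_out = bn([1.0, 2.0, 3.0], training=False)  # normalized with running statistics
```

The same input gives different outputs in the two modes, and each training-mode output depends on the other elements in the batch, which is the cross-batch dependence described in Problem 3.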

3. Models

⑴ Example 1. TensorFlow API

tf.keras.layers.Conv2D(
    filters,
    kernel_size,
    strides=(1,1),
    padding="valid",
    data_format=None,
    dilation_rate=(1,1),
    groups=1,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

⑵ Example 2. ResNet

① A CNN architecture introduced in 2015.

② Introduced to address the vanishing gradient problem, where gradients are not effectively propagated through the network, slowing down learning.

③ To solve this problem, ResNet introduced residual connections, also known as skip connections. These connections allow information to bypass some layers.

④ Applications: computer vision tasks such as image classification and object detection.
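
A residual (skip) connection computes F(x) + x instead of F(x). The sketch below illustrates only this one idea on plain lists; the function names are hypothetical and `dead_layers` is a stand-in for skipped layers whose output carries no signal, not part of any actual ResNet implementation.

```python
def residual_block(x, layer_fn):
    """Return F(x) + x, where F is the stack of layers being skipped over."""
    fx = layer_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x to be added to it")
    return [xi + fi for xi, fi in zip(x, fx)]

def dead_layers(x):
    # Hypothetical stand-in for layers whose output has collapsed to zero.
    return [0.0 for _ in x]

x = [1.0, -2.0, 3.0]
y = residual_block(x, dead_layers)  # the input still reaches the output unchanged
```

Because the identity path adds x directly to the output, the block passes information (and, during backpropagation, gradients) through even when the skipped layers contribute little.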

⑶ Example 3. U-Net

Figure 3. Structure of U-Net

① Frequently used in biomedical images

⑷ Example 4. AlexNet

Figure 4. Structure of AlexNet

① The square box on the far left is the input layer

② Starting from the input layer, the square boxes on the right are called CONV1, ···, CONV5, respectively

③ The maxpooling layers are called POOL1, ···, POOL3 from left to right

④ The network parts labeled maxpooling, dense, and dense on the far right are called FC6, FC7, and FC8, respectively

⑤ Uses ImageNet as the training dataset

⑸ Example 5. Znet: a 3D extension of the 2D U-Net

⑹ Example 6. DIP (deep image prior)

4. Examples

⑴ Example 1. ImageNet

① Labeled for computer vision research

② Inspired by WordNet

③ Created by Fei-Fei Li and collaborators

④ The subset commonly used for benchmarking (ILSVRC) contains more than 1 million images in 1,000 categories

⑵ Example 2. CIFAR-10

① A famous toy image classification dataset

② Consists of 60,000 small RGB images with width and height of 32 pixels

③ There are a total of 10 image classes

○ airplane

○ automobile

○ bird

○ cat

○ deer

○ dog

○ frog

○ horse

○ ship

○ truck

④ Of the 60,000 images, 50,000 are the training set and 10,000 are the test set

Entered: 2021.12.01 10:50

Revised: 2022.11.21 01:18
