Chapter 16. Convolutional Neural Networks (CNN)
Recommended reading: 【Algorithm】 Algorithm Table of Contents
1. Overview
2. Components
3. Models
4. Examples
1. Overview
⑴ Concept
① CNN: a deep neural network that combines the filtering function (convolution) of traditional image processing with a neural network
② Also called a convolutional neural network
③ A typical CNN has on the order of 100 million parameters
④ A typical CNN consists of 10–20 layers
⑤ Beyond its original purpose, CNNs are now used in many fields, including natural language processing
⑥ CNNs are considered to correspond to visual cortex V1, which recognizes edges, and visual cortex V4, which recognizes color, etc.
⑦ Comparison between a general neural network and a CNN
Table 1. Comparison between a general neural network and a CNN
⑵ Background of introduction
① Fully-connected layer: by the universal approximation theorem, a plain fully-connected network can in principle recognize patterns such as images even without the specialized architecture called CNN
② However, a fully-connected layer requires too many parameters, so the computing burden is high and the training time is long
③ A fully-connected layer is essentially a special case of a CNN
⑶ Assumptions
① Assumption 1. spatial locality: the pattern of an entire object can be identified from patterns obtained from local parts of the image, rather than from the whole image at once
② Assumption 2. positional invariance: the same pattern should be recognized regardless of its position or viewing angle
⑷ Order
① Example 1. augmentation_layer → Input → Conv2D, MaxPooling2D, etc.
② Example 2. Input - Embedding - Convolution - Max pooling - Convolution 2 - Max pooling - ReLU - Linear(fc) - Output
○ Embedding: a layer that converts a one-hot vector into a dense vector
③ Nested conv layers: low-level features → mid-level features → high-level features → trainable classifier
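As a concrete illustration of the Embedding step above: multiplying by a one-hot vector is equivalent to selecting one row of a weight matrix, which is exactly what an embedding layer computes. A minimal NumPy sketch (the matrix values and sizes here are arbitrary placeholders):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
# Embedding weight matrix: one dense vector per vocabulary index
W = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

token_id = 2
one_hot = np.eye(vocab_size)[token_id]   # one-hot vector for token 2

# Multiplying the one-hot vector by W...
dense_via_matmul = one_hot @ W
# ...is the same as a simple row lookup, which is what an Embedding layer does
dense_via_lookup = W[token_id]

assert np.allclose(dense_via_matmul, dense_via_lookup)
```

In practice the lookup form is used because it avoids materializing the one-hot vector; the matrix W is learned like any other layer weight.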
⑸ Activation functions
① identity: with this choice, the layer is a linear classifier
② sigmoid: σ(x) = 1 / (1 + e^(−x))
③ tanh(x)
④ ReLU (rectified linear unit): max(0, x). The most frequently used
⑤ leaky ReLU: max(0.1x, x)
⑥ maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
⑦ ELU (exponential linear unit): x if x ≥ 0; α(e^x − 1) if x < 0
⑧ softmax
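The activation functions above can be written directly in NumPy; a minimal sketch, using α = 1 for ELU and slope 0.1 for leaky ReLU as in the definitions above:

```python
import numpy as np

def relu(x):
    # max(0, x), element-wise
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x): small negative slope instead of a hard zero
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # x for x >= 0; alpha * (e^x - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))            # negatives clipped to 0
print(leaky_relu(x))      # negatives scaled by 0.1
print(elu(x))             # smooth saturation for negatives
print(softmax(x).sum())   # 1.0
```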
2. Components
⑴ Component 1. Input: input layer
⑵ Component 2. augmentation_layer
① Definition: makes the input more diverse through transformations such as random cropping and random rotation
② Purpose: this helps the model learn more robustly
③ Reference: https://www.tensorflow.org/tutorials/images/data_augmentation
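The linked tutorial does this with built-in Keras layers such as layers.RandomFlip and layers.RandomRotation; as a library-free illustration of the idea, here is a minimal NumPy sketch of random cropping and random flipping (the sizes and the fake image are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, crop_h, crop_w):
    # Pick a random top-left corner, then slice out a crop_h x crop_w patch
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def random_flip(img):
    # Flip left-right with probability 0.5; labels are unchanged
    return img[:, ::-1] if rng.random() < 0.5 else img

img = rng.random((32, 32, 3))        # a fake 32x32 RGB image
aug = random_flip(random_crop(img, 28, 28))
print(aug.shape)                     # (28, 28, 3)
```

Each epoch then sees a slightly different version of every image, which is what makes the model more robust.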
⑶ Component 3. convolutional_layer: Conv2D, etc.
model = tf.keras.Sequential([
...
layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=2),
...
])
① Purpose: to identify local patterns
② Input: an input of W × H × C is given
○ W: width
○ H: height
○ C: number of channels on the input side (e.g.: RGB channels)
③ Parameters: a total of four hyperparameters are required
○ K: number of filters, i.e., the number of channels on the output side. A filter is also called a kernel
○ F: size of the filter. Also called kernel_size
○ S: stride. The step size with which the filter slides across the input
○ P: padding or zero padding. A preprocessing step that adds pixels with value 0 around the borders to prevent the image from shrinking
○ Unlike the pooling_layer, which has a spatial extent but no padding, the convolutional layer has padding and kernel_size
○ Number of parameters = number of parameters required for the filters = K(F²C + 1)
○ In CNN deep learning algorithms, these parameters are learned
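The count K(F²C + 1) is F²C weights plus one bias per filter, summed over the K filters. A minimal sketch checking it for the Conv2D(filters=96, kernel_size=3) example above, assuming a 3-channel (RGB) input:

```python
def conv2d_param_count(K, F, C):
    # Each of the K filters has F*F*C weights plus 1 bias
    return K * (F * F * C + 1)

# Conv2D(filters=96, kernel_size=3) on an RGB (C = 3) input:
print(conv2d_param_count(K=96, F=3, C=3))  # 2688

# A 1x1 conv on the same input: one weight per channel plus a bias, per filter
print(conv2d_param_count(K=1, F=1, C=3))   # 4
```

Note that the count is independent of W and H: the same filters are reused at every spatial position, which is why a conv layer needs far fewer parameters than a fully-connected layer.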
④ Operation: at each position, the filter is applied as an inner product with the local patch, and each inner product can be normalized
○ Reason: what matters is the cosine angle between the two vectors, not their absolute magnitudes
○ Example of the operation: in the example below, connected lines indicate multiplication, and summation is applied across lines
Figure 1. Example of the operation
(However, note that the bias is omitted in the figure above)
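The multiply-and-sum operation described above can be sketched in NumPy as a plain stride-1, no-padding cross-correlation with a single filter; the bias is omitted here as in the figure, and the sample image and kernel are arbitrary:

```python
import numpy as np

def conv2d_single(img, kernel):
    # img: (H, W); kernel: (F, F); stride 1, no padding, bias omitted
    H, W = img.shape
    F = kernel.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product of the filter with the local patch
            out[i, j] = np.sum(img[i:i + F, j:j + F] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))          # all-ones filter: each output is a patch sum
print(conv2d_single(img, kernel))  # 2x2 map of 3x3 patch sums
```

A real layer runs this once per filter (K times) and stacks the results along the channel axis; deep-learning libraries replace the Python loops with vectorized kernels.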
⑤ Filters: generally, 3 × 3 filters are used frequently
○ If the filter is large, the map becomes smaller quickly
○ If the image is large, too much computation is required
○ Filter sizes are mostly odd numbers
○ 1 × 1 conv: used to identify differences between channels
⑥ Output: if the output size of the convolutional layer is W’ × H’ × K,
○ W’ = (W - F + 2P) / S + 1
○ H’ = (H - F + 2P) / S + 1
○ In the fully-connected case, the total number of required parameters = (W × H × C + 1) × (W’ × H’ × K)
○ In the formulas for W’ and H’ above, the “+1” counts the kernel’s starting position: (W − F + 2P)/S is the number of steps the kernel takes, and the initial position contributes one more output
○ If S = 1 and P = (F-1) / 2, then W’ = W and H’ = H: in this case, it is expressed in Python as follows
model = tf.keras.Sequential([
...
layers.Conv2D(filters=96, kernel_size=3, activation='elu', strides=1, padding='same'),
...
])
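The output-size formula can be checked numerically; a minimal sketch covering both a 'valid' convolution (no padding, like the strides=2 example earlier) and the padding='same' case S = 1, P = (F − 1)/2 from the text (the 32 × 32 input size is an arbitrary example):

```python
def conv_output_size(W, F, S, P):
    # W' = (W - F + 2P) / S + 1, with integer (floor) division for odd remainders
    return (W - F + 2 * P) // S + 1

# 'valid' convolution: 32x32 input, 3x3 filter, stride 2, no padding
print(conv_output_size(32, F=3, S=2, P=0))             # 15

# 'same' convolution: S = 1, P = (F - 1) / 2 keeps the spatial size unchanged
F = 3
print(conv_output_size(32, F=F, S=1, P=(F - 1) // 2))  # 32
```

The same function applies per dimension, so a W × H input yields a W′ × H′ output, with K channels stacked on top.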
⑦ Output example
Figure 2. Example of CNN output by a convolutional layer
⑷ Component 4. pooling_layer
model = tf.keras.Sequential([
...
layers.MaxPooling2D((2,2), strides=2),
...
])
① Purpose: to reduce the input dimension
② Input: an input of W × H × C is given
○ W: width
○ H: height
○ C: number of channels on the input side (e.g.: RGB channels)
③ Parameters: a total of two hyperparameters are required
○ F: spatial extent
○ S: stride
○ Usually set as F = S = 2
○ Unlike convolutional_layer, it has spatial extent but does not have padding or kernel_size
○ Number of parameters = 0
○ Reason: because only simple operations that do not require learning, such as taking a maximum or an average, are performed
○ Types of operations: max pooling (e.g., MaxPooling2D), average pooling
④ Output: if the output size of the pooling layer is W’ × H’ × C (pooling preserves the number of channels),
○ W’ = (W - F) / S + 1
○ H’ = (H - F) / S + 1
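A minimal NumPy sketch of max pooling with F = S = 2, the usual setting mentioned above; note there are no learned parameters, only a fixed max over each window (the sample image is an arbitrary example):

```python
import numpy as np

def max_pool2d(img, F=2, S=2):
    # img: (H, W); F: spatial extent, S: stride -- no learnable parameters
    H, W = img.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum over each F x F window
            out[i, j] = img[i * S:i * S + F, j * S:j * S + F].max()
    return out

img = np.array([[ 1.,  2.,  5.,  6.],
                [ 3.,  4.,  7.,  8.],
                [ 9., 10., 13., 14.],
                [11., 12., 15., 16.]])
print(max_pool2d(img))   # each 2x2 block is reduced to its maximum
```

Applied channel by channel, this turns a W × H × C input into a W′ × H′ × C output, halving each spatial dimension when F = S = 2.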
⑸ Component 5. Miscellaneous
① BatchNormalization: normalizes activations over each batch, which is quite effective; the author uses it after max pooling
② Flatten
③ Dense
④ Dropout
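What Flatten, Dense, and Dropout compute can be stated directly at the array level; a minimal NumPy sketch of the training-time (inverted-dropout) forward pass, with arbitrary placeholder shapes and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((2, 4, 4, 3))        # a batch of 2 fake 4x4 feature maps, 3 channels

# Flatten: collapse everything except the batch dimension
flat = x.reshape(x.shape[0], -1)    # shape (2, 48)

# Dense: an affine map y = xW + b followed by an activation (ReLU here)
W = rng.standard_normal((48, 10))
b = np.zeros(10)
dense = np.maximum(0.0, flat @ W + b)   # shape (2, 10)

# Inverted dropout (training time): zero each unit with probability p,
# and rescale the survivors by 1/(1-p) so the expected activation is unchanged
p = 0.5
mask = (rng.random(dense.shape) >= p) / (1.0 - p)
dropped = dense * mask

print(flat.shape, dense.shape, dropped.shape)  # (2, 48) (2, 10) (2, 10)
```

At inference time the dropout step is skipped entirely; the 1/(1−p) rescaling during training is what makes that possible.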
3. Models
⑴ Example 1. TensorFlow API
tf.keras.layers.Conv2D(
filters,
kernel_size,
strides=(1,1),
padding="valid",
data_format=None,
dilation_rate=(1,1),
groups=1,
activation=None,
use_bias=True,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
kernel_regularizer=None,
bias_regularizer=None,
activity_regularizer=None,
kernel_constraint=None,
bias_constraint=None,
**kwargs
)
⑵ Example 2. U-net
Figure 3. Structure of U-net
① Frequently used in biomedical images
⑶ Example 3. AlexNet
Figure 4. Structure of AlexNet
① The square box on the far left is the input layer
② Starting from the input layer, the square boxes on the right are called CONV1, ···, CONV5, respectively
③ The maxpooling layers are called POOL1, ···, POOL3 from left to right
④ The network parts labeled maxpooling, dense, and dense on the far right are called FC6, FC7, and FC8, respectively
⑤ Uses ImageNet as the training dataset
⑷ Example 4. Znet: a 3D extension of 2D U-net
⑸ Example 5. DIP (deep image prior)
4. Examples
⑴ Example 1. ImageNet
① Labeled for computer vision research
② Inspired by WordNet
③ Created by Fei Fei Li
④ The widely used ILSVRC subset contains more than 1 million images in 1,000 categories
⑵ Example 2. CIFAR-10
① A famous toy image classification dataset
② Consists of 60,000 small RGB images with width and height of 32 pixels
③ There are a total of 10 image classes
○ airplane
○ automobile
○ bird
○ cat
○ deer
○ dog
○ frog
○ horse
○ ship
○ truck
④ Of the 60,000 images, 50,000 are the training set and 10,000 are the test set
Entered: 2021.12.01 10:50
Revised: 2022.11.21 01:18