Chapter 22. Image Generative Models

Recommended Reading : 【Algorithm】 Algorithm Index

1. DIP

2. Computer Vision Foundation Models

3. Image Generative Model

4. Vision-Language Model

5. Video Generative Model

1. DIP (deep image prior)

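⑴ Features : Fits a randomly initialized CNN to a single input image, without any training dataset. The CNN architecture itself acts as the prior, so the fitted network can generate a restored image (e.g., denoising, inpainting, super-resolution).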

2. Computer Vision Foundation Models

⑴ Vision Transformer (ViT)

① ViT uses only the transformer encoder structure

② Step 1. Split the image into multiple small patches and treat each patch as a token for input into the transformer

③ Step 2. Embed each patch using the transformer encoder

④ Step 3. Just like embedding words in a sentence and outputting the sentence embedding that represents the sentence’s meaning, ViT learns the relationships between patches and outputs features representing the whole image.

⑤ Limitation: The cost of self-attention grows with the square of the number of patches that make up the image, which makes it difficult to process a high-resolution image in a single pass.

○ Solution 1: Divide the given image into smaller patches and apply ViT independently to each patch (e.g., iSTAR).

○ Solution 2: Use an extended self-attention mechanism for long token sequences, such as the dilated attention of LongNet (e.g., Prov-GigaPath).
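The patch-splitting step and the quadratic cost described above can be sketched in plain Python. This is a minimal illustration, not how any particular ViT library implements it: the helper names (`patchify`, `attention_pairs`, `tiled_attention_pairs`) and the toy 8 × 8 image are assumptions made for the example, and the learned linear projection that normally follows patch flattening is omitted.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch block into a single token vector.
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def tiled_attention_pairs(num_tokens, tokens_per_tile):
    """Pairs scored when attention runs independently inside each tile."""
    num_tiles = num_tokens // tokens_per_tile
    return num_tiles * attention_pairs(tokens_per_tile)


# Toy 8 x 8 single-channel image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tokens = patchify(image, patch=4)

print(len(tokens), len(tokens[0]))      # 4 tokens, 16 values each
print(attention_pairs(len(tokens)))     # 16 pairs for 4 tokens
print(attention_pairs(16))              # 256 pairs for 16 tokens
print(tiled_attention_pairs(16, 4))     # 64 pairs when split into 4 tiles
```

Quadrupling the number of tokens multiplies the full-attention pair count by 16, whereas running attention independently per tile (the idea behind Solution 1) grows only linearly with the number of tiles, at the price of losing interactions between tiles.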

⑵ Types

① DINO (self-distillation with no labels)

② iBOT (image BERT pretraining with online tokenizer)

③ BEiT: A ViT variant that adopts the idea of the BERT model, trained similarly to masked language modeling.

○ iSTAR: Used to enhance the resolution of spatial transcriptomics. It utilizes a BEiT-based model trained with the DINO method.

Figure 1. Diagram for data preparation step in iSTAR

○ Step 1. Divide the given image into 256 × 256 patches.

○ Step 2. Further divide each patch into sub-patches of 16 × 16 pixels, giving a 16 × 16 grid of sub-patches per patch.

○ Step 3. Apply ViT (denoted as f2) to each sub-patch to obtain a 384-dimensional vector.

○ Step 4. Aggregate the 384-dimensional vectors to form a 16 × 16 × 384 data structure, then apply another ViT (denoted as f1) to obtain a 192-dimensional vector.

○ Step 5. Gather the 192-dimensional vectors and apply ViT (denoted as f0).

○ Feature extraction and loss function formulation.

④ Swin Transformer: A ViT variant that uses window-based local self-attention.

⑤ CTransPath : Wang et al., Medical Image Analysis (2022)

⑥ UNI : Chen et al., Nature Medicine (2024)

⑦ CONCH (CONtrastive learning from Captions for Histopathology) : Lu et al., Nature Medicine (2024)

⑧ Virchow : Vorontsov et al., arXiv (2023)

⑨ RudolfV : Dippel et al., arXiv (2024)

⑩ Campanella : Campanella et al., arXiv (2023)

⑪ Prov-GigaPath (2024): A vision foundation model announced by Microsoft, trained on about 170,000 whole-slide pathology images (1.3 billion tiles).

⑫ PRISM
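The iSTAR data preparation steps listed under BEiT above (Steps 1 to 5) can be followed as shape bookkeeping. The sketch below only tracks array shapes and runs no model. It assumes that the sub-patches are 16 × 16 pixels, which is consistent with the 16 × 16 × 384 structure in Step 4. The function name `hierarchy_shapes` and the 4096 × 4096 example image size are hypothetical choices for illustration.

```python
def hierarchy_shapes(image_height, image_width, patch=256, sub_patch=16,
                     sub_dim=384, patch_dim=192):
    """Return the shapes produced at each level of the patch hierarchy."""
    if image_height % patch or image_width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    grid = patch // sub_patch            # sub-patches along one patch side
    patch_rows = image_height // patch   # patches along the image height
    patch_cols = image_width // patch    # patches along the image width
    return {
        # Steps 1-2: each 256 x 256 patch holds grid x grid sub-patches.
        "sub_patches_per_patch": grid * grid,
        # Steps 3-4: f2 maps each sub-patch to a sub_dim vector.
        "f2_output_per_patch": (grid, grid, sub_dim),
        # Step 4: f1 maps that structure to one patch_dim vector.
        "f1_output_per_patch": (patch_dim,),
        # Step 5: f0 receives one patch_dim vector per patch.
        "f0_input": (patch_rows, patch_cols, patch_dim),
    }


# Assumed example size; any multiple of 256 works.
shapes = hierarchy_shapes(4096, 4096)
for name, shape in shapes.items():
    print(name, shape)
```

With these assumptions each 256 × 256 patch contains 256 sub-patches, so f1 sees a 16 × 16 × 384 input per patch, and f0 sees a 16 × 16 grid of 192-dimensional vectors for the whole example image.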

3. Image Generative Model

⑴ Types

① DALL·E3 (OpenAI)

② Midjourney

③ Stable Diffusion

④ Sora (OpenAI): Text-to-video generation

⑤ Video LLM

4. Vision-Language Model

⑴ Types

① Stable Diffusion: A latent diffusion model that generates images from natural-language prompts

② MedGemma

5. Video Generative Model

⑴ Types

① XVFI (eXtreme Video Frame Interpolation): A frame interpolation method based on optical flow.

② FILM (Frame Interpolation for Large Motion): Encoder + U-Net-like decoder

Input: 2024.04.22 14:08
