Chapter 22. Image Generative Models
Recommended Reading : 【Algorithm】 Algorithm Index
1. DIP (deep image prior)
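⑴ Features : Deliberately overfits a randomly initialized CNN to a single input image, with no external training data. The network architecture itself acts as the prior, so the fitted output can serve as a newly generated (e.g., denoised, inpainted, or super-resolved) image.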
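The idea can be sketched in a few lines. The following is a minimal illustration assuming PyTorch, not the original DIP implementation: the tiny three-layer network, the function name `deep_image_prior`, the 8-channel noise input, and all hyperparameters are hypothetical choices made for brevity.

```python
import torch
import torch.nn as nn


def deep_image_prior(target, steps=100, lr=0.01, seed=0):
    """Fit a randomly initialized CNN to one image of shape (1, C, H, W).

    Returns the network output and the per-step loss values.
    """
    torch.manual_seed(seed)
    _, channels, height, width = target.shape
    # Hypothetical toy network; the published method uses a much deeper
    # encoder-decoder architecture.
    net = nn.Sequential(
        nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, channels, kernel_size=3, padding=1), nn.Sigmoid(),
    )
    z = torch.rand(1, 8, height, width)  # fixed random input, never updated
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = ((net(z) - target) ** 2).mean()  # fit the single image only
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return net(z).detach(), losses


# Toy data: a white square on black, corrupted with Gaussian noise.
torch.manual_seed(0)
clean = torch.zeros(1, 1, 16, 16)
clean[..., 4:12, 4:12] = 1.0
noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0, 1)

restored, losses = deep_image_prior(noisy, steps=100)
```

Only the network weights are optimized, and the only supervision is the degraded image itself. In practice the number of steps matters: stopping early tends to recover image structure before the network starts to reproduce the noise.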
2. Computer Vision Foundation Models
⑴ Vision Transformer (ViT)
① ViT uses only the transformer encoder structure
② Step 1. Split the image into multiple small patches and treat each patch as a token for input into the transformer
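③ Step 2. Linearly project each patch into an embedding vector, add position information, and pass the resulting token sequence through the transformer encoder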
④ Step 3. Just like embedding words in a sentence and outputting the sentence embedding that represents the sentence’s meaning, ViT learns the relationships between patches and outputs features representing the whole image.
⑤ Limitation: The cost of self-attention grows with the square of the number of patches in the image, which makes it difficult to process a high-resolution image in a single pass.
○ Solution 1: Divide the image into smaller tiles and apply ViT independently to each tile (e.g., iSTAR).
○ Solution 2: Replace full self-attention with an extended, sparser mechanism, such as the dilated self-attention of LongNet (e.g., Prov-GigaPath).
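The three steps and the quadratic-cost limitation above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not a full ViT: it uses one attention layer with random weights, omits position embeddings, the class token, multi-head attention, and MLP blocks, and uses mean pooling as an assumed way to get the whole-image feature. All function names are hypothetical.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def attention_entries(side, patch):
    """Number of entries in the attention matrix for a square image."""
    n_tokens = (side // patch) ** 2
    return n_tokens * n_tokens


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

# Step 1: image -> 16 patch tokens, each holding 8 * 8 * 3 = 192 values.
patches = patchify(image, patch=8)

# Step 2: linear patch embedding, then one attention layer.
dim = 16
w_embed = rng.normal(size=(patches.shape[1], dim)) * 0.02
tokens = patches @ w_embed
wq, wk, wv = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, wq, wk, wv)

# Step 3: pool the patch outputs into one feature for the whole image.
image_feature = out.mean(axis=0)

# Limitation: doubling the image side multiplies the attention matrix by 16.
small = attention_entries(224, 16)  # 196 tokens
large = attention_entries(448, 16)  # 784 tokens
```

Because every token attends to every other token, `weights` has one row and one column per patch. That is why tiling (Solution 1) or sparser attention (Solution 2) is needed for very large images.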
⑵ Types
① BEiT: A ViT variant that adopts the idea of the BERT model, trained similarly to masked language modeling.
○ iSTAR: Used to enhance the resolution of spatial transcriptomics. It utilizes a BEiT-based model trained with the DINO method.
② Swin Transformer: A ViT variant that uses window-based local self-attention.
③ CTransPath : Wang et al., Medical Image Analysis (2022)
④ UNI : Chen et al., Nature Medicine (2024)
⑤ CONCH (CONtrastive learning from Captions for Histopathology) : Lu et al., Nature Medicine (2024)
⑥ Virchow : Vorontsov et al., arXiv (2023)
⑦ RudolfV : Dippel et al., arXiv (2024)
⑧ Campanella : Campanella et al., arXiv (2023)
⑨ Prov-GigaPath: A vision foundation model announced by Microsoft in 2024, trained on about 170,000 whole-slide pathology images (1.3 billion tiles).
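The window-based local self-attention noted for the Swin Transformer in the list above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions: attention uses the raw tokens with no learned projections, and the shifted windows that Swin alternates between layers are omitted. The function names are hypothetical.

```python
import numpy as np


def window_partition(tokens, window):
    """Regroup an (H, W, D) token grid into (num_windows, window*window, D)."""
    h, w, d = tokens.shape
    assert h % window == 0 and w % window == 0, "grid must tile evenly"
    x = tokens.reshape(h // window, window, w // window, window, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, d)


def local_attention(windows):
    """Softmax self-attention computed independently inside each window."""
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(windows.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows


rng = np.random.default_rng(0)
grid = rng.random((8, 8, 4))                 # 8 x 8 tokens, 4 features each
windows = window_partition(grid, window=4)   # 4 windows of 16 tokens
out = local_attention(windows)

# Token pairs scored: global attention vs. attention restricted to windows.
global_pairs = (8 * 8) ** 2
local_pairs = windows.shape[0] * windows.shape[1] ** 2
```

With a fixed window size, the number of scored token pairs grows linearly with the number of tokens rather than quadratically, at the price of each token seeing only its own window within a layer.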
3. Image Generation Models
⑴ Types
① DALL·E 3 (OpenAI)
② Midjourney
③ Stable Diffusion
④ Sora (OpenAI)
⑤ Video LLM
Input: 2024.04.22 14:08