
Chapter 22. Image Generative Models

Recommended Reading: 【Algorithm】 Algorithm Index


1. DIP

2. Computer Vision Foundation Models

3. Image Generative Models



1. DIP (Deep Image Prior)

⑴ Features: Fits a randomly initialized CNN to a single input image without any training data; the network architecture itself acts as an image prior, so a restored or newly generated image can be read out before the network overfits to noise (see the sketch below).
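Below is a minimal PyTorch sketch of the DIP idea, assuming a toy three-layer CNN and a random tensor as a placeholder for the corrupted image; the original paper (Ulyanov et al., 2018) uses a deeper encoder-decoder, so the sizes and the 500-step budget here are illustrative assumptions.

```python
# Minimal deep image prior (DIP) sketch: fit an untrained CNN to one image.
import torch
import torch.nn as nn

def make_net(channels=3, width=64):
    # Small conv net whose architecture acts as the image prior.
    return nn.Sequential(
        nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, channels, 3, padding=1), nn.Sigmoid(),
    )

noisy = torch.rand(1, 3, 64, 64)   # placeholder for the corrupted input image
z = torch.rand_like(noisy)         # fixed random input code
net = make_net()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):            # early stopping is essential: the net
    opt.zero_grad()                # fits image structure before it fits noise
    loss = ((net(z) - noisy) ** 2).mean()
    loss.backward()
    opt.step()

restored = net(z).detach()         # read the output before overfitting to noise
```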



2. Computer Vision Foundation Models

⑴ Vision Transformer (ViT)

① ViT uses only the transformer encoder structure

Step 1. Split the image into multiple small patches and treat each patch as a token for input into the transformer

Step 2. Linearly embed each patch, add position embeddings, and feed the resulting token sequence into the transformer encoder

Step 3. Just as word embeddings in a sentence are combined into a sentence embedding that represents the sentence's meaning, ViT learns the relationships between patches and outputs a feature vector representing the whole image (see the sketch after this list).

Limitation: The cost of self-attention grows with the square of the number of patches that make up the image, making it difficult to process high-resolution images in one pass.

Solution 1: Divide the given image into smaller patches and apply ViT independently to each patch (e.g., iSTAR).

Solution 2: Introduce an extended self-attention mechanism, such as the dilated self-attention of LongNet (e.g., Prov-GigaPath).
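The following PyTorch sketch walks through Steps 1-3 for a 224 × 224 image; the patch size (16), embedding width (192), and two-layer encoder are illustrative assumptions, and mean pooling stands in for the [CLS] token the original ViT uses.

```python
# Minimal ViT-style patch embedding and encoding sketch.
import torch
import torch.nn as nn

img = torch.rand(1, 3, 224, 224)
patch, dim = 16, 192

# Step 1: split the image into 16x16 patches and flatten each into a token.
tokens = img.unfold(2, patch, patch).unfold(3, patch, patch)       # (1, 3, 14, 14, 16, 16)
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # (1, 196, 768)

# Step 2: linearly embed each patch and add learnable position embeddings.
embed = nn.Linear(3 * patch * patch, dim)
pos = nn.Parameter(torch.zeros(1, 14 * 14, dim))
x = embed(tokens) + pos

# Step 3: the transformer encoder relates patches to one another; pooling the
# outputs gives one feature vector for the whole image.
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
features = encoder(x).mean(dim=1)   # (1, 192) image-level feature
```

Note that the attention matrix inside the encoder is 196 × 196 here; doubling the image resolution quadruples the number of patches and multiplies the attention cost by sixteen, which is exactly the limitation described above.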

⑵ Types

① BEiT: A ViT variant that adopts the idea of BERT; it is pre-trained with masked image modeling, analogous to masked language modeling.

○ iSTAR: Used to enhance the resolution of spatial transcriptomics; it uses a BEiT-based model trained with the DINO self-distillation method.



Figure 1. Diagram for data preparation step in iSTAR


Step 1. Divide the given image into 256 × 256 patches.

Step 2. Further divide each patch into 16 × 16 sub-patches.

Step 3. Apply ViT (denoted as f2) to each sub-patch to obtain a 384-dimensional vector.

Step 4. Aggregate the 384-dimensional vectors to form a 16 × 16 × 384 data structure, then apply another ViT (denoted as f1) to obtain a 192-dimensional vector.

Step 5. Gather the 192-dimensional patch vectors and apply a final ViT (denoted as f0); a shape-level sketch of this hierarchy follows this list.
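Below is a shape-level sketch of the hierarchy in Steps 1-5. The three stages f2, f1, f0 are mocked with random outputs, since the real iSTAR stages are pretrained HIPT-style ViTs; only the tensor shapes follow the steps above, and the four-patch image is an assumption.

```python
# Shape-level sketch of iSTAR's hierarchical feature extraction (mocked ViTs).
import torch

def f2(sub_patches):                 # sub-patch-level ViT: 16x16 px -> 384-d
    return torch.rand(sub_patches.shape[0], 384)

def f1(sub_features):                # patch-level ViT: 16x16x384 -> 192-d
    return torch.rand(192)

def f0(patch_features):              # image-level ViT over 192-d patch vectors
    return patch_features.mean(dim=0)

image_patches = torch.rand(4, 3, 256, 256)              # Step 1: 256x256 patches
patch_vectors = []
for p in image_patches:
    subs = p.unfold(1, 16, 16).unfold(2, 16, 16)        # Step 2: 16x16 sub-patches
    subs = subs.permute(1, 2, 0, 3, 4).reshape(-1, 3, 16, 16)
    sub_feats = f2(subs).reshape(16, 16, 384)           # Step 3: 384-d per sub-patch
    patch_vectors.append(f1(sub_feats))                 # Step 4: 192-d per patch
image_feature = f0(torch.stack(patch_vectors))          # Step 5: aggregate patches
```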

○ Feature extraction and loss function formulation.




② Swin Transformer: A ViT variant that computes self-attention within local windows, shifting the windows between successive layers so that information can cross window borders (a window-partition sketch follows the model list below).

③ CTransPath: Wang et al., Medical Image Analysis (2022)

④ UNI: Chen et al., Nature Medicine (2024)

⑤ CONCH (CONtrastive learning from Captions for Histopathology): Lu et al., Nature Medicine (2024)

⑥ Virchow: Vorontsov et al., arXiv (2023)

⑦ RudolfV: Dippel et al., arXiv (2024)

⑧ Campanella: Campanella et al., arXiv (2023)

⑨ Prov-GigaPath: A vision foundation model released by Microsoft, trained on 1.3 billion tiles from about 170,000 whole-slide pathology images (2024).
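Below is a minimal sketch of the window partitioning behind Swin-style local self-attention, assuming a 56 × 56 × 96 feature map and 7 × 7 windows; the shifted-window step is omitted for brevity.

```python
# Window-based local self-attention sketch (Swin-style partitioning).
import torch
import torch.nn as nn

B, H, W, C, win = 1, 56, 56, 96, 7
x = torch.rand(B, H, W, C)

# Partition the feature map into non-overlapping win x win windows of tokens.
windows = x.reshape(B, H // win, win, W // win, win, C)
windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)  # (64, 49, 96)

# Self-attention runs inside each 49-token window instead of across all
# 3136 tokens, so the cost stays linear in image size for a fixed window.
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = attn(windows, windows, windows)
```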



3. Image Generative Models

⑴ Types

① DALL·E 3 (OpenAI)

② Midjourney

③ Stable Diffusion (see the usage sketch after this list)

④ Sora (OpenAI)

⑤ Video LLM
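As a concrete example of using one of these models, the sketch below generates an image with Stable Diffusion through Hugging Face's diffusers library; the checkpoint name and prompt are assumptions, and a CUDA GPU is assumed for the fp16 setting.

```python
# Minimal text-to-image sketch with Stable Diffusion via diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")            # the pipeline returns PIL images
```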



Posted: 2024.04.22 14:08
