
Chapter 21. Natural Language Processing (NLP) and Large Language Models (LLM)

Recommended Readings : 【Algorithms】 Table of Algorithm Contents


1. Natural Language Processing

2. Large Language Models

3. Bioinformatics and Language Models


a. Prompt Engineering

b. Natural Language Processing and LLM Useful Function Collection

c. Research Topics related to LLM



1. Natural Language Processing (NLP)

⑴ Definition : AI models based on text

⑵ Text Preprocessing : Preprocessing to make unstructured text recognizable to computers

① Tokenization

○ Dividing sentences or corpora into tokens, the minimal units of meaning that a computer can process

○ English : Mainly divided by spaces

○ 75 English words ≃ 100 tokens

○ Example : I / ate / noodle / very / deliciously

○ Example : OpenAI Tokenizer

② Part-of-Speech Tagging (POS tagging)

○ Technique to tag the parts of speech of morphemes

③ Lemmatization

○ Technique to find lemmas (base words) from words

○ Example : am, are, is → be

④ Stemming

○ Technique to obtain stems by removing prefixes and suffixes from words

⑤ Stopword Removal

○ Technique to remove words that contribute little to actual meaning analysis, such as particles and suffixes
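
○ The preprocessing steps ①–⑤ above can be sketched minimally with NLTK (this assumes the nltk package is installed and the punkt, averaged_perceptron_tagger, wordnet, and stopwords resources have been downloaded; the sentence is only an example):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

sentence = "I ate the noodles very deliciously"

tokens = word_tokenize(sentence)                                       # ① tokenization
tags = nltk.pos_tag(tokens)                                            # ② part-of-speech tagging
lemmas = [WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens]   # ③ lemmatization (e.g., ate -> eat)
stems = [PorterStemmer().stem(t) for t in tokens]                      # ④ stemming
content = [t for t in tokens if t.lower() not in stopwords.words('english')]  # ⑤ stopword removal

print(tokens, tags, lemmas, stems, content, sep="\n")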

⑶ Text Mining

① Topic Modeling

○ A statistical model used in machine learning and natural language processing to discover the abstract themes, called ‘topics’, that occur in a collection of documents.

○ Used to uncover the hidden meaning structure in the text body.

② Word Cloud

○ A visualization technique that uses natural language processing to count word frequencies and display them so that frequently mentioned terms, such as people’s interests, stand out.

③ Social Network Analysis (SNA)

○ A technique for analyzing and visualizing the network characteristics and structure among people within a group.

④ TF-IDF (Term Frequency-Inverse Document Frequency)

○ A technique for quantifying how important a word is to a specific document within a collection of documents.
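
○ A minimal TF-IDF sketch using scikit-learn (assumed installed); the toy documents are arbitrary. Each entry of the resulting matrix scores how important a word is to one document relative to the whole collection:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # shape: (n_documents, n_terms)

# Terms with the highest TF-IDF weight in the first document
terms = vectorizer.get_feature_names_out()
row = tfidf[0].toarray().ravel()
print([terms[i] for i in np.argsort(row)[::-1][:3]])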

⑷ Types of topic models

Type 1: Latent Semantic Analysis (LSA)

Type 2: Probabilistic Latent Semantic Analysis (PLSA)

Type 3: Latent Dirichlet Allocation (LDA): A generative probabilistic model. Can be used for deconvolution even without a reference (ref)
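
○ A minimal topic-modeling sketch with scikit-learn’s LatentDirichletAllocation (assumed installed); the toy corpus and the choice of two topics are arbitrary assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "glucose insulin diabetes blood sugar",
    "insulin resistance and glucose metabolism",
    "transformer attention language model",
    "large language models use attention",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                  # per-document topic proportions
print(doc_topics.round(2))

# Top terms per discovered topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {k}: {top_terms}")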

⑸ Evaluation of Language Models

① Parallel text datasets : Canadian Parliament proceedings (English ↔ French), European Parliament proceedings (multiple languages supported)

② SNS

③ BLEU score (ref)

④ Perplexity
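
○ A minimal sketch of two of these measures: BLEU via NLTK’s sentence_bleu (assumed installed), and perplexity computed directly as the exponential of the average negative log-likelihood of the tokens (the probabilities below are made up for illustration):

import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
bleu = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# Perplexity: suppose a model assigned these probabilities to the tokens it generated
token_probs = [0.25, 0.10, 0.50, 0.05]
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(f"Perplexity: {math.exp(avg_nll):.2f}")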



2. Large Language Models (LLM)

⑴ Definition : Natural language processing models with billions of parameters

① Counting parameters method: here

② Term 1: Meaning of 7B, 13B, 30B, 65B : The model has 7 billion, 13 billion, 30 billion, or 65 billion parameters, respectively

③ Term 2: Token : The unit of text processed by the model

○ Word-level tokenization : [ChatGPT, is, an, AI, language, model, .]

○ Subword-level tokenization : [Chat, G, PT, is, an, AI, language, model, .]

○ Character-level tokenization : [C, h, a, t, G, P, T, i, s, a, n, A, I, l, a, n, g, u, a, g, e, m, o, d, e, l]
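
○ A minimal sketch with the Hugging Face transformers library (assumed installed) illustrating subword-level tokenization and parameter counting (cf. ① and ② above); GPT-2 small is used here only because it is light enough to download quickly:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "ChatGPT is an AI language model."
print(tokenizer.tokenize(text))        # subword-level tokens (note the 'Ġ' marker for spaces)
print(text.split())                    # naive word-level tokenization
print(list(text))                      # character-level tokenization

# Counting parameters: GPT-2 small reports roughly 124M here; a "7B" model would report ~7e9
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")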

④ Term 3: Meaning of 0-shot, 1-shot, etc. : The number of example inputs given per task

○ 0-shot prompting


Q: <Question>?
A:


○ Few-shot prompting


Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
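
○ A minimal sketch of assembling such prompts programmatically (the helper function and the example Q/A pairs are hypothetical, not part of any particular API):

def build_few_shot_prompt(examples, question):
    """Assemble a few-shot prompt from (question, answer) example pairs.

    `examples` is a list of (question, answer) tuples; an empty list yields a 0-shot prompt.
    """
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}?")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}?")
    lines.append("A:")
    return "\n".join(lines)

# 2-shot example
examples = [("What is 2 + 2", "4"), ("What is the capital of France", "Paris")]
print(build_few_shot_prompt(examples, "What is the chemical symbol for gold"))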


⑵ Transformer

Stage 1. Transformer Encoder: Converts the input sequence into a higher-level representation

○ Training Methods: NSP, MLM

○ Next Sentence Prediction (NSP): Composes pairs of sentences A and B from the input data, and determines whether B follows A

○ First Sentence: “The quick brown fox jumps over the lazy dog.”

○ Second Sentence: “The dog is not amused.”

○ Prediction Result: Whether the second sentence follows the first sentence (e.g., “No”)

○ Masked Language Modeling (MLM): Masks some words in the sentence and predicts the masked words

○ In transformers like BERT, the encoder is trained.

1-1. Self-Attention Mechanism

1-1-1. Splits the sentence into multiple tokens

1-1-2. For each token in the input sequence, generates a “query”, a “key”, and a “value”; from the query of one token and the key of another, an attention score is calculated for every pair of tokens, indicating how much each token should attend to the others (a minimal sketch follows item 1-1-6 below)



Figure 1. Attention score of a pair of tokens


1-1-3. Positional Encoding: The transformer cannot directly know the order of the input words, so a positional encoding representing each word’s position is added to its embedding. This allows the model to understand the order of the sequence. For example, when the encoding is defined with sine and cosine functions, the closer two positions are, the larger the inner product of their encodings, which lets the model capture adjacency relationships.

1-1-4. Each token (word) within the encoder learns the relationships with all other tokens.

1-1-5. Token Embedding: Based on the attention scores, a weighted sum of the value vectors is computed and each token is transformed into a new representation. An embedding to which positional encoding has been added is called a positional embedding.

1-1-6. When there are multiple attention heads, it is called multi-head attention.
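
○ A minimal NumPy sketch of sinusoidal positional encoding (1-1-3) and scaled dot-product self-attention (1-1-2, 1-1-5); the randomly initialized weight matrices stand in for learned parameters, and the dimensions are arbitrary:

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sine on even dimensions, cosine on odd dimensions
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # queries, keys, values per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # attention score for every token pair
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                                    # weighted sum of values = new token embeddings

seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                # (5, 16)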

1-2. FFN (Feed-Forward Neural Network): Refines the representations generated through self-attention

○ Transforms each token embedding independently (position-wise), using non-linear activation functions

1-3. Add & Norm (LayerNorm): Adds a residual (skip) connection and applies layer normalization after the self-attention and FFN sublayers

1-4. The structure of “self-attention → Add & Norm → FFN → Add & Norm” is repeated across several layers

○ This gradually transforms each token embedding into a more context-aware vector.

Type 1. Initial Layers:

○ Attention heads in each layer have significant variance in focus areas within the input features

○ During this stage, broad exploration helps to distinguish between important and less important information

Type 2. Intermediate Layers:

○ Attention heads in each layer consistently focus on lower-ranked information

○ Identifies noise inherent in the data and understands details

Type 3. Final Layers:

○ Attention heads in each layer consistently focus on higher-ranked information

○ Extracts critical information and contributes to making final decisions

1-5. Sentence embedding: Finally, the token embeddings are aggregated into a single vector that represents the overall meaning of the sentence. Note that the Transformer decoder does not take this sentence embedding as input; instead, it receives the contextualized token embeddings from the encoder.

Stage 2. Transformer Decoder: Takes the representation generated by the encoder and ultimately generates the desired output

○ Training Method: NWP



Figure 2. Problem definition for the Transformer decoder


○ Next Word Prediction (NWP): The task of predicting the next word within a given context

○ In transformers like GPT, the decoder is trained.

2-1. Masked Self-Attention Mechanism

2-1-1. The decoder refers to the token embeddings generated by the encoder

2-1-2. Each token within the decoder refers only to the previous tokens to predict the next token: Applies masking to ensure the model does not see future information. The figure below shows that each token is embedded into vector representations



Figure 3. Token embedding results and the next-word prediction setup
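
○ A minimal NumPy sketch of the causal (look-ahead) mask used in masked self-attention: score positions corresponding to future tokens are set to -inf before the softmax, so their attention weights become zero and each token attends only to itself and earlier tokens:

import numpy as np

def causal_mask(seq_len):
    i = np.arange(seq_len)
    # -inf where the key position (column) lies after the query position (row)
    return np.where(i[None, :] > i[:, None], -np.inf, 0.0)

scores = np.random.default_rng(0).normal(size=(4, 4))     # raw attention scores (toy values)
masked = scores + causal_mask(4)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                               # entries above the diagonal are exactly 0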


2-2. FFN (Feed-Forward Neural Network)

○ Used to refine the representations generated by the decoder to produce the final output

○ Similar to the encoder, the FFN non-linearly transforms each token embedding



Figure 4. Attention and multilayer perceptron


2-3. Multilayer Decoder: Stacks multiple decoder layers to ultimately generate the output sequence

○ By stacking attention and multilayer perceptron (FFN) blocks in series, the model can keep generating a sentence token by token

○ The decoder combines the information from the encoder with the information of the currently generated sequence to predict the next token

○ To date, the cleverest thinker of all time was ___ → undoubtedly

○ To date, the cleverest thinker of all time was undoubtedly ___ → Einstein

○ To date, the cleverest thinker of all time was undoubtedly Einstein, ___ → for



Figure 5. Multiple decoders


⑶ Important parameters

① temperature = 0: produces very deterministic outputs

② max_tokens = 100 (API call setting): the response is capped at 100 tokens

③ top_p = 1.0: considers all possible tokens when generating the response

④ frequency_penalty = 0.0: the model does not avoid repeating tokens any more than it naturally would

⑤ presence_penalty = 0.0: no penalty is applied for reusing tokens
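
○ A minimal sketch of how parameters ①–⑤ appear together in an API call, here using the OpenAI Python SDK (v1-style interface); the model name and prompt are arbitrary examples, and the OPENAI_API_KEY environment variable is assumed to be set:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",          # example model name; substitute any available chat model
    messages=[{"role": "user", "content": "Explain what a transformer is in one sentence."}],
    temperature=0,                # ① very deterministic outputs
    max_tokens=100,               # ② response capped at 100 tokens
    top_p=1.0,                    # ③ consider the full token distribution
    frequency_penalty=0.0,        # ④ no extra penalty for frequent tokens
    presence_penalty=0.0,         # ⑤ no penalty for reusing tokens
)
print(response.choices[0].message.content)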

⑷ Types

① BERT, RoBERTa, ALBERT

○ Uses a bidirectional transformer encoder, which is advantageous for understanding text because it considers both the left and right context of the input sequence.

○ The input for BERT is at most 512 tokens, but the input length can be increased by extending the positional embeddings: a power of two (2^n) is commonly used as the maximum input length.

○ Example code that loads BERT or BioBERT from Hugging Face and creates an attention matrix for a given sentence:


import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
import seaborn as sns

# Load BERT model and tokenizer (OPTION 1)
'''
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_attentions=True)
'''

# Load BioBERT model and tokenizer (OPTION 2)
model_name = 'dmis-lab/biobert-base-cased-v1.1'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_attentions=True)

# Input sentence
sentence = "Find diseases associated with glucose"

# Tokenization
inputs = tokenizer(sentence, return_tensors='pt')

# Calculate outputs and attention weights through the model
outputs = model(**inputs)
attentions = outputs.attentions  # These are the attention weights for each layer

# Visualize the attention weights from the first head of the first layer
attention = attentions[0][0][0].detach().numpy()

# Token list
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Visualize Attention weights
plt.figure(figsize=(10, 10))
sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens, cmap='viridis')
plt.title('Attention Weights')
plt.show()


② GPT, GPT-2, GPT-3, GPT-J, GPT-Neo, GPT-3.5, GPT-4

○ Focuses on predicting the next word by sequentially considering the context from left to right in the input sequence.

○ Uses an autoregressive model.




③ Comparison between BERT and GPT


| | BERT | GPT |
| --- | --- | --- |
| Developer | Google AI | OpenAI |
| Input Data | Considers both left and right context of the input sequence | Considers the input sequence sequentially from left to right |
| Parameters | 340 M (= 0.34 B) | 1.5 B |
| Training Method | Transformer Encoder (MLM, NSP) | Transformer Decoder (NWP) |
| Training Data | 3 TB | 45 TB |
| Main Application Area | Text Understanding | Text Generation |

Table 1. Comparison between BERT and GPT


④ Gopher

⑤ Chinchilla

⑥ Flan, PaLM, Flan-PaLM

⑦ OPT-IML

⑧ LLaMA, LLaMA2 : Models that can be downloaded and run locally. Developed by Meta

⑨ Alpaca

⑩ XLNet, T5, CTRL, BART

⑪ ollama : Supports the following LLMs.

○ Llama2 (7B)

○ Mistral (7B)

○ Dolphin Phi (2.7B)

○ Phi-2 (2.7B)

○ Neural Chat (7B)

○ Starling (7B)

○ Code Llama (7B)

○ Llama2 Uncensored (7B)

○ Llama2 (13B)

○ Llama2 (70B)

○ Orca Mini (3B)

○ Vicuna (7B)

○ LLaVA (7B)

⑫ AlphaGeometry: Combines a language model with symbolic deduction to solve geometry problems at the International Mathematical Olympiad (IMO) level.

⑬ MiniLM

○ Function that transforms arbitrary variable-length natural language sentences into 384-dimensional vectors by considering their meaning (ref)
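
○ A minimal sketch using the sentence-transformers library (assumed installed) with the all-MiniLM-L6-v2 checkpoint, which maps sentences to 384-dimensional vectors; the example sentences are arbitrary:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = [
    "Find diseases associated with glucose",
    "Glucose metabolism disorders include diabetes",
]
embeddings = model.encode(sentences)      # shape: (2, 384)
print(embeddings.shape)

# Cosine similarity between the two sentence vectors
print(util.cos_sim(embeddings[0], embeddings[1]))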



⑭ Other useful generative AI proprietary tools

GitHub Copilot

Perplexity AI

Consensus

Scite

○ SciSpace / typeset.io

○ Elicit.com

Claude AI

○ Elicit AI

○ Research Rabbit

Gemini

Mistral AI

○ Tabnine

○ CodiumAI

○ AWS Code Whisperer

○ Sourcegraph Cody

○ NotebookLM



3. Bioinformatics and Language Models




⑴ BioBERT

⑵ BioNER

⑶ SQuAD

⑷ BioASQ

⑸ PubMedGPT (BioMedLM)

⑹ BioGPT

⑺ scBERT

⑻ GPT-Neo

⑼ PubMedBERT

⑽ BioLinkBERT

⑾ DRAGON

⑿ BioMedLM

⒀ Med-PaLM, Med-PaLM M

⒁ BioMedGPT

⒂ tGPT

⒃ CellLM

⒄ Geneformer: Based on BERT. Uses a transformer encoder-based architecture. Used with a pretraining → fine-tuning approach. Zero-shot capabilities are practically useless.

⒅ scGPT: Based on GPT. Uses a transformer decoder-based architecture. Used with a pretraining → fine-tuning approach. The zero-shot performance of the pretrained model is also quite excellent.

⒆ scFoundation

⒇ SCimilarity

(21) CellPLM



Posted: 2021-12-11 17:34

Modified: 2023-05-18 11:36
