Chapter 21. Natural Language Processing (NLP) and Large Language Models (LLM)
Recommended Readings : 【Algorithms】 Table of Algorithm Contents
1. Natural Language Processing
2. Large Language Models
3. Bioinformatics and Language Models
b. Natural Language Processing and LLM Useful Function Collection
c. Research Topics related to LLM
1. Natural Language Processing (NLP)
⑴ Definition : AI models based on text
⑵ Text Preprocessing : Preprocessing that turns unstructured text into a form computers can process (a minimal code sketch follows this list)
① Tokenization
○ Dividing sentences or corpora into tokens, the minimal meaningful units that the computer will process
○ English : Mainly divided by spaces
○ As a rule of thumb, 75 English words ≃ 100 tokens (i.e., 1 token ≈ 0.75 words)
○ Example : I / ate / noodle / very / deliciously
○ Example : OpenAI Tokenizer
② Part-of-Speech Tagging (POS tagging)
○ Technique to tag the parts of speech of morphemes
③ Lemmatization
○ Technique to find lemmas (base words) from words
○ Example : am, are, is → be
④ Stemming
○ Technique to obtain stems by removing prefixes and suffixes from words
⑤ Stopword Removal
○ Technique to remove words that contribute little to the actual meaning analysis, such as particles, affixes, and other function words
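○ A minimal sketch of the preprocessing steps above, assuming the NLTK package and its 'punkt', 'averaged_perceptron_tagger', 'wordnet', and 'stopwords' resources have been downloaded:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

sentence = "I ate noodles very deliciously"
tokens = word_tokenize(sentence)                                               # tokenization
tagged = pos_tag(tokens)                                                       # part-of-speech tagging
lemmas = [WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens]           # lemmatization (e.g., ate -> eat)
stems = [PorterStemmer().stem(t) for t in tokens]                              # stemming
filtered = [t for t in tokens if t.lower() not in stopwords.words('english')]  # stopword removal
print(tokens, tagged, lemmas, stems, filtered, sep="\n")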
⑶ Text Mining
① Topic Modeling
○ A statistical modeling approach from machine learning and natural language processing used to discover the abstract ‘topics’ that occur in a collection of documents.
○ Used to uncover the hidden meaning structure in the text body.
② Word Cloud
○ A visualization technique that counts word frequencies in a text and displays frequent words more prominently, giving a quick view of people’s main interests or themes.
③ Social Network Analysis (SNA)
○ An analytical technique for analyzing and visualizing the network characteristics and structure among people within a group.
④ TF-IDF (Term Frequency-Inverse Document Frequency)
○ A weighting scheme that quantifies how important a word is to a specific document within a collection of documents: the weight grows with the word’s frequency in that document and shrinks the more documents the word appears in (see the sketch below).
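○ A minimal TF-IDF sketch, assuming scikit-learn is available; TfidfVectorizer weights each term by its frequency in a document and down-weights terms that appear in many documents:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are popular pets"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)              # rows: documents, columns: vocabulary terms
print(vectorizer.get_feature_names_out())       # vocabulary
print(X.toarray().round(2))                     # higher weight = more characteristic of that document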
⑷ Topic Modeling Methods
① Type 1: Latent Semantic Analysis (LSA)
② Type 2: Probabilistic Latent Semantic Analysis (PLSA)
③ Type 3: Latent Dirichlet Allocation (LDA): A generative probabilistic model. Can even be used for reference-free deconvolution (ref). A minimal sketch follows.
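○ A minimal LDA sketch, assuming scikit-learn; the toy documents and two topics below are illustrative placeholders:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["gene expression in cancer cells",
        "tumor growth and gene mutation",
        "stock market prices fell today",
        "investors expect market growth"]
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.components_.round(2))        # topic-word weights
print(lda.transform(counts).round(2))  # document-topic proportions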
⑸ Evaluation of Language Models
① Parallel text datasets : Canadian Parliament proceedings (English ↔ French), European Parliament proceedings (multiple languages)
② Social media (SNS) text
③ BLEU score (ref)
④ Perplexity (see the sketch below)
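○ Perplexity is the exponentiated average negative log-probability the model assigns to each token; a minimal numerical sketch (the per-token probabilities below are hypothetical):
import math

token_probs = [0.2, 0.1, 0.4, 0.25]    # hypothetical per-token probabilities p(w_i | w_<i)
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(round(perplexity, 2))            # ≈ 4.73; lower perplexity = better language model
# For BLEU, nltk.translate.bleu_score.sentence_bleu can compare a candidate sentence with reference translations.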
2. Large Language Models (LLM)
⑴ Definition : Natural language processing models with billions of parameters
① How to count the parameters of a model (ref)
② Term 1: Meaning of 7B, 13B, 30B, 65B : The model has 7 billion, 13 billion, 30 billion, or 65 billion parameters, respectively
③ Term 2: Token : The unit of text processed by the model
○ Word-level tokenization : [ChatGPT, is, an, AI, language, model, .]
○ Subword-level tokenization : [Chat, G, PT, is, an, AI, language, model, .]
○ Character-level tokenization : [C, h, a, t, G, P, T, i, s, a, n, A, I, l, a, n, g, u, a, g, e, m, o, d, e, l]
④ Term 3: Meaning of 0-shot, 1-shot, etc. : The number of worked examples included in the prompt for each task (a prompt-building sketch follows the templates below)
○ 0-shot prompting
Q: <Question>?
A:
○ Few-shot prompting
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
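○ A minimal sketch of building such prompts programmatically; the Q/A pairs are placeholders, not from any dataset:
examples = [("What is the capital of France?", "Paris"),
            ("What is 2 + 2?", "4")]
question = "What is the largest planet in the Solar System?"
# Few-shot prompt: prepend worked examples before the actual question
prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples) + f"Q: {question}\nA:"
print(prompt)   # 2-shot prompt; with an empty examples list this reduces to 0-shot prompting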
⑵ Transformer
① Stage 1. Transformer Encoder: Converts the input sequence into a higher-level representation
○ Training Methods: NSP, MLM
○ Next Sentence Prediction (NSP): Composes pairs of sentences A and B from the input data, and determines whether B follows A
○ First Sentence: “The quick brown fox jumps over the lazy dog.”
○ Second Sentence: “The dog is not amused.”
○ Prediction Result: Whether the second sentence follows the first sentence (e.g., “No”)
○ Masked Language Modeling (MLM): Masks some words in the sentence and predicts the masked words
○ In transformers like BERT, the encoder is trained.
○ 1-1. Self-Attention Mechanism
○ 1-1-1. Splits the sentence into multiple tokens
○ 1-1-2. Generates a “query”, “key”, and “value” vector for each token in the input sequence; for each pair of tokens, the attention score (computed from the query of one token and the key of the other) indicates how much that token should attend to the other
Figure 1. Attention score of a pair of tokens
○ 1-1-3. Positional Encoding: The transformer has no inherent notion of word order, so a positional encoding is added to each token to inject position information, allowing the model to use the order of the sequence. For example, with sine and cosine encodings, nearby positions have larger inner products, so adjacency relationships can be recovered (see the sketch at the end of Stage 1).
○ 1-1-4. Each token (word) within the encoder learns the relationships with all other tokens.
○ 1-1-5. Token Embedding: Using the attention scores as weights, a weighted sum over the value vectors is computed, transforming each token into a new representation. A token embedding to which positional encoding has been added is called a positional embedding.
○ 1-1-6. When there are multiple attention heads, it is called multi-head attention.
○ 1-2. FFN (Feed-Forward Neural Network): Refines the representations generated through self-attention
○ Transforms each token embedding independently (position-wise), using non-linear activation functions
○ 1-3. Add & Norm (LayerNorm): Performs layer normalization after self-attention and FFN
○ 1-4. The structure of “self-attention → Add & Norm → FFN → Add & Norm” is repeated across several layers
○ This gradually transforms each token embedding into a more context-aware vector.
○ Type 1. Initial Layers:
○ Attention heads in each layer have significant variance in focus areas within the input features
○ During this stage, broad exploration helps to distinguish between important and less important information
○ Type 2. Intermediate Layers:
○ Attention heads in each layer consistently focus on lower-ranked information
○ Identifies noise inherent in the data and understands details
○ Type 3. Final Layers:
○ Attention heads in each layer consistently focus on higher-ranked information
○ Extracts critical information and contributes to making final decisions
○ 1-5. Sentence embedding: Ultimately, it synthesizes each token embedding to convert it into a single vector that represents the overall meaning of the sentence. The Transformer decoder does not take the sentence embedding created by the Transformer encoder as input. Instead, the Transformer decoder receives the contextualized token embeddings from the encoder as input.
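○ A minimal single-head self-attention sketch with sinusoidal positional encoding, using NumPy and random toy weights (not trained parameters):
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding added to token embeddings
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention for a single head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                               # token-pair attention scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V, weights                                           # new token embeddings, attention matrix

np.random.seed(0)
seq_len, d_model = 5, 16                                                  # e.g., a 5-token sentence
X = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)                                              # (5, 16) (5, 5)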
② Stage 2. Transformer Decoder: Takes the representation generated by the encoder and ultimately generates the desired output
○ Training Method: NWP
Figure 2. Defining Transformer Decoder Problems
○ Next Word Prediction (NWP): The task of predicting the next word within a given context
○ In transformers like GPT, the decoder is trained.
○ 2-1. Masked Self-Attention Mechanism
○ 2-1-1. The decoder refers to the token embeddings generated by the encoder
○ 2-1-2. Each token within the decoder attends only to the preceding tokens when predicting the next token: masking is applied so that the model cannot see future information (see the causal-mask sketch at the end of this subsection). The figure below shows each token embedded as a vector representation
Figure 3. Token embedding results and next word prediction situation
○ 2-2. FFN (Feed-Forward Neural Network)
○ Used to refine the representations generated by the decoder to produce the final output
○ Similar to the encoder, the FFN non-linearly transforms each token embedding
Figure 4. attention multilayer perceptron
○ 2-3. Multilayer Decoder: Stacks multiple decoder layers to ultimately generate the output sequence
○ By serially attaching attention and multilayer perceptron, it is possible to continuously generate sentences
○ The decoder combines the information from the encoder with the information of the currently generated sequence to predict the next token
○ To date, the cleverest thinker of all time was ___ → undoubtedly
○ To date, the cleverest thinker of all time was undoubtedly ___ → Einstein
○ To date, the cleverest thinker of all time was undoubtedly Einstein, ___ → for
Figure 5. Multiple decoders
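○ A minimal sketch of the decoder’s masked (causal) self-attention in NumPy with toy weights; each token can only attend to itself and earlier tokens:
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    # Decoder-style self-attention: a causal mask blocks attention to future tokens
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)                 # True above the diagonal (future positions)
    scores = np.where(mask, -1e9, scores)                                 # masked positions get ~zero attention weight
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

np.random.seed(0)
X = np.random.randn(4, 8)                                                 # 4 tokens, toy embeddings
Wq, Wk, Wv = (np.random.randn(8, 8) * 0.1 for _ in range(3))
_, attn = masked_self_attention(X, Wq, Wk, Wv)
print(attn.round(2))                                                      # lower-triangular attention matrix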
⑶ Important parameters (an API-call sketch follows this list)
① temperature = 0 : makes decoding essentially deterministic (the most likely token is always chosen)
② max_tokens = 100 : caps the response at 100 tokens (API call setting)
③ top_p = 1.0 : considers the full token distribution when generating the response
④ frequency_penalty = 0.0 : does not avoid repeating tokens more than the model naturally would
⑤ presence_penalty = 0.0 : no penalty applied for reusing tokens that have already appeared
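○ A hedged sketch of these settings using the OpenAI Python SDK (v1.x interface); the model name is illustrative and an API key must be available in the environment:
from openai import OpenAI

client = OpenAI()                                   # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",                            # illustrative model name
    messages=[{"role": "user", "content": "Summarize the Transformer in one sentence."}],
    temperature=0,                                  # deterministic outputs
    max_tokens=100,                                 # response capped at 100 tokens
    top_p=1.0,                                      # consider the full token distribution
    frequency_penalty=0.0,                          # no extra penalty on frequently repeated tokens
    presence_penalty=0.0,                           # no penalty for reusing tokens
)
print(response.choices[0].message.content)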
⑷ Types
① BERT (Bidirectional Encoder Representations from Transformers)
○ Uses a bidirectional transformer, which is advantageous for understanding text because it considers context on both the left and the right side of each position in the input sequence.
○ BERT’s input is limited to a maximum of 512 tokens, but the limit can be raised by extending the positional embeddings: a power of two (2^n) is commonly used as the maximum input length.
○ Example code that loads BERT or BioBERT from Hugging Face and visualizes the attention matrix for a given sentence:
import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
import seaborn as sns
# Load BERT model and tokenizer (OPTION 1)
'''
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_attentions=True)
'''
# Load BioBERT model and tokenizer (OPTION 2)
model_name = 'dmis-lab/biobert-base-cased-v1.1'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_attentions=True)
# Input sentence
sentence = "Find diseases associated with glucose"
# Tokenization
inputs = tokenizer(sentence, return_tensors='pt')
# Calculate outputs and attention weights through the model
outputs = model(**inputs)
attentions = outputs.attentions # These are the attention weights for each layer
# Visualize the attention weights from the first head of the first layer
attention = attentions[0][0][0].detach().numpy()
# Token list
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
# Visualize Attention weights
plt.figure(figsize=(10, 10))
sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens, cmap='viridis')
plt.title('Attention Weights')
plt.show()
② GPT (Generative Pre-trained Transformer)
○ Focuses on predicting the next word by considering the context sequentially from left to right in the input sequence.
○ Uses an autoregressive model.
③ Comparison between BERT and GPT
| | BERT | GPT |
|---|---|---|
| Developer | Google AI | OpenAI |
| Input Data | Considers both left and right context of the input sequence | Processes the input sequence sequentially from left to right |
| Parameters | 340 M (= 0.34 B, BERT-large) | 1.5 B (GPT-2); 175 B (GPT-3); undisclosed for GPT-4 |
| Training Method | Transformer encoder (MLM, NSP) | Transformer decoder (NWP) |
| Training Data | ≈ 3.3 B words (BooksCorpus + English Wikipedia, ~16 GB) | ≈ 45 TB of raw text (GPT-3, before filtering) |
| Main Application Area | Text understanding | Text generation |
Table 1. Comparison between BERT and GPT
④ Gopher
⑥ Flan, PaLM, Flan-PaLM
⑦ OPT-IML
⑧ LLaMA, LLaMA2 : Locally installable open-weight models. Developed by Meta
⑨ Alpaca
⑪ ollama : A tool for running LLMs locally. Supports the following LLMs (a sketch of calling a local model follows this list).
○ Llama2 (7B)
○ Mistral (7B)
○ Dolphin Phi (2.7B)
○ Phi-2 (2.7B)
○ Neural Chat (7B)
○ Starling (7B)
○ Code Llama (7B)
○ Llama2 Uncensored (7B)
○ Llama2 (13B)
○ Llama2 (70B)
○ Orca Mini (3B)
○ Vicuna (7B)
○ LLaVA (7B)
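○ A hedged sketch of querying one of these models through a locally running ollama server’s REST API (assumes ollama is installed, the model has been pulled, and the default local endpoint is used):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",          # default local ollama endpoint (assumed)
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])                      # the model's generated answer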
⑫ AlphaGeometry: Combines a neural language model with a symbolic deduction engine to solve geometry problems at the International Mathematical Olympiad (IMO) level.
⑬ MiniLM
○ A function that maps arbitrary variable-length natural language sentences to 384-dimensional vectors reflecting their meaning (ref); a minimal usage sketch follows
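○ A minimal sketch, assuming the sentence-transformers package; 'all-MiniLM-L6-v2' is a MiniLM-based model that outputs 384-dimensional sentence vectors:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Find diseases associated with glucose",
                           "Glucose metabolism disorders"])
print(embeddings.shape)                             # (2, 384)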
⑭ Other useful generative AI proprietary tools
○ Scite
○ SciSpace / typeset.io
○ Elicit (elicit.com)
○ Research Rabbit
○ Gemini
○ Tabnine
○ CodiumAI
○ Amazon CodeWhisperer
○ Sourcegraph Cody
○ NotebookLM
3. Bioinformatics and Language Models
⑴ BioBERT
⑵ BioNER
⑶ SQuAD
⑷ BioASQ
⑸ PubMedGPT (BioMedLM)
⑹ BioGPT
⑺ scBERT
⑻ GPT-Neo
⑼ PubMedBERT
⑽ BioLinkBERT
⑾ DRAGON
⑿ BioMedLM
⒂ tGPT
⒃ CellLM
⒄ Geneformer: Based on BERT. Uses a transformer encoder-based architecture. Utilized with a pretraining → finetuning approach. Zero-shot capabilities are practically useless.
⒅ scGPT: Based on GPT. Uses a transformer decoder-based architecture. Utilized with a pretraining → finetuning approach. The zero-shot performance of the pretraining model is also quite excellent.
⒇ CellPLM
Posted: 2021-12-11 17:34
Modified: 2023-05-18 11:36