
Chapter 21. Natural Language Processing (NLP) and Large Language Models (LLM)

Recommended Readings : 【Algorithms】 Table of Algorithm Contents


1. Natural Language Processing

2. Large Language Models

3. Bioinformatics and Language Models


a. Prompt Engineering

b. Natural Language Processing and LLM Useful Function Collection

c. Research Topics related to LLM



1. Natural Language Processing (NLP)

⑴ Definition : AI models that process and understand text

⑵ Text Preprocessing : Converting unstructured text into a form computers can work with (a minimal NLTK sketch follows this list)

① Tokenization

○ Dividing sentences or corpora into tokens, the minimal units of meaning the computer will process

○ English : mainly split on whitespace

○ 75 English words ≃ 100 tokens

○ Example : I / ate / noodle / very / deliciously

○ Example : OpenAI Tokenizer

② Part-of-Speech Tagging (POS tagging)

○ Technique to tag the parts of speech of morphemes

③ Lemmatization

○ Technique to find lemmas (base words) from words

○ Example : am, are, is → be

④ Stemming

○ Technique to obtain stems by removing prefixes and suffixes from words

⑤ Stopword Removal

○ Technique to remove words that contribute little to actual meaning analysis, such as particles and suffixes (e.g., articles and prepositions in English)
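
A minimal sketch of steps ① through ⑤ using NLTK (an assumed dependency; resource names vary slightly across NLTK versions):

# Minimal preprocessing sketch with NLTK (assumes `pip install nltk`).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time resource downloads (newer NLTK versions may instead need
# "punkt_tab" and "averaged_perceptron_tagger_eng").
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"):
    nltk.download(pkg, quiet=True)

text = "I ate noodles very deliciously"

tokens = word_tokenize(text)                                          # ① tokenization
tagged = nltk.pos_tag(tokens)                                         # ② POS tagging
lemmas = [WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens]  # ③ am/are/is → be
stems = [PorterStemmer().stem(t) for t in tokens]                     # ④ suffix stripping
content = [t for t in tokens if t.lower() not in stopwords.words("english")]  # ⑤
print(tokens, tagged, lemmas, stems, content, sep="\n")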

⑶ Text Mining

① Topic Modeling

○ A statistical modeling approach in machine learning and natural language processing for discovering the abstract topics that occur in a collection of documents.

○ Used to uncover the hidden semantic structure of a body of text.

② Word Cloud

○ A visualization technique that counts word frequencies in a text and draws each word at a size proportional to its frequency, giving a quick picture of people’s interests.

③ Social Network Analysis (SNA)

○ A technique for analyzing and visualizing the network structure and characteristics of relationships among people within a group.

④ TF-IDF (Term Frequency-Inverse Document Frequency)

○ A weighting scheme expressing how important a word is to a specific document within a collection of documents: words frequent in the document but rare across the collection score highest.
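
For illustration, a minimal TF-IDF sketch with scikit-learn’s TfidfVectorizer (an assumed dependency; the three documents are toy data):

# TF-IDF weights with scikit-learn (assumes `pip install scikit-learn`).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be pets",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # rows = documents, columns = vocabulary terms
# Words frequent in document 0 but rare elsewhere get the highest weights.
for term, col in sorted(vec.vocabulary_.items()):
    print(f"{term}: {X[0, col]:.3f}")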

⑷ Transformer

① Problem Definition: Next Word Prediction



Figure 1. Transformer Problem Definition


Step 1. Segment the sentence into tokens and map each token into the embedding space



Figure 2. Token Embedding


Step 2. Assign attention weights between words so that the next word can be predicted



Figure 3. Attention Weights and Next Word Prediction


Step 3. Pass the attention output through a multilayer perceptron to refine each token’s representation



Figure 4. Attention Multilayer Perceptron


Step 4. Connect attention and multilayer-perceptron blocks in series and generate the sentence one token at a time

○ To date, the cleverest thinker of all time was ___ → undoubtedly

○ To date, the cleverest thinker of all time was undoubtedly ___ → Einstein

○ To date, the cleverest thinker of all time was undoubtedly Einstein, ___ → for



Figure 5. Serial Circuit of Attention and Perceptrons

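Steps 1 through 3 can be condensed into a toy sketch of scaled dot-product self-attention; the single head, random weights, and tiny shapes are assumptions for illustration, not the full Transformer architecture:

# Toy scaled dot-product self-attention (numpy only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise token similarities
    weights = softmax(scores)                 # attention weights, rows sum to 1
    return weights @ V                        # each token = weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens embedded in 8 dims (Step 1)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8) refined representations
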

⑸ Types

Type 1: Latent Semantic Analysis (LSA)

Type 2: Probabilistic Latent Semantic Analysis (PLSA)

Type 3: Latent Dirichlet Allocation (LDA)

○ A generative probabilistic model

○ Can even be used for reference-free deconvolution (ref)
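
A minimal LDA sketch using scikit-learn’s LatentDirichletAllocation (an assumed dependency; the corpus and topic count are toy choices):

# Toy LDA topic modeling with scikit-learn (assumed installed).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "genes dna rna protein expression",
    "protein folding structure genes",
    "stocks market trading prices",
    "market prices inflation trading",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)                      # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):                  # word weights per topic
    top = vocab[topic.argsort()[::-1][:3]]
    print(f"topic {k}:", ", ".join(top))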

⑹ Evaluation of Language Models

① Parallel text datasets : Canadian Parliament proceedings (English ↔ French), European Parliament proceedings (Europarl; many languages supported)

② SNS

③ BLEU score (ref)

④ Perplexity : the exponentiated average negative log-likelihood per token; lower is better
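
For intuition, perplexity can be computed by hand; the per-token probabilities below are invented for illustration:

# Perplexity = exp(average negative log-likelihood per token).
import math

token_probs = [0.2, 0.1, 0.4, 0.05, 0.3]  # model's probability for each observed token
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(math.exp(nll), 2))            # lower perplexity = less "surprised" model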



2. Large Language Models (LLM)

⑴ Definition : Natural language processing models with billions of parameters

① Method for counting parameters: here

② Term 1: Meaning of 7B, 13B, 30B, 65B : The model has 7 billion, 13 billion, 30 billion, or 65 billion parameters, respectively

③ Term 2: Token : The unit of text processed by the model (a tiktoken sketch follows the examples below)

○ Word-level tokenization : [ChatGPT, is, an, AI, language, model, .]

○ Subword-level tokenization : [Chat, G, PT, is, an, AI, language, model, .]

○ Character-level tokenization : [C, h, a, t, G, P, T, i, s, a, n, A, I, l, a, n, g, u, a, g, e, m, o, d, e, l]
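
Subword tokenization as used by GPT-style models can be inspected with the tiktoken library (an assumed dependency; cl100k_base is the encoding used by recent OpenAI models):

# Inspecting subword tokenization with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("ChatGPT is an AI language model.")
print(ids)                             # integer token ids
print([enc.decode([i]) for i in ids])  # the subword piece behind each id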

④ Term 3: Meaning of 0-shot, 1-shot, etc. : The number of example inputs given per task

○ 0-shot prompting


Q: <Question>?
A:


○ Few-shot prompting


Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
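
A few-shot prompt like the template above could be sent through the OpenAI Python SDK as follows; the model name and the capital-city examples are assumptions for illustration:

# Few-shot prompting via the OpenAI SDK (assumes `pip install openai`
# and OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()
few_shot = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Canada?\nA:"
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat model works
    messages=[{"role": "user", "content": few_shot}],
    temperature=0,        # deterministic completion of the pattern
)
print(resp.choices[0].message.content)  # expected: "Ottawa"
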


⑵ Types

① GPT, GPT-2, GPT-3, GPT-J, GPT-Neo, GPT-3.5, GPT-4

② Gopher

③ Chinchilla

④ Flan, PaLM, Flan-PaLM

⑤ OPT-IML

⑥ LLaMA, LLaMA2 : Models that can be downloaded and run locally. Developed by Meta

⑦ Alpaca

⑧ BERT, RoBERTa, ALBERT

⑨ XLNet, T5, CTRL, BART

⑩ ollama : Supports the following LLMs (a minimal Python call sketch follows this list).

○ Llama2 (7B)

○ Mistral (7B)

○ Dolphin Phi (2.7B)

○ Phi-2 (2.7B)

○ Neural Chat (7B)

○ Starling (7B)

○ Code Llama (7B)

○ Llama2 Uncensored (7B)

○ Llama2 (13B)

○ Llama2 (70B)

○ Orca Mini (3B)

○ Vicuna (7B)

○ LLaVA (7B)
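
As referenced above, a model served by ollama can be queried over its default local REST endpoint; this sketch assumes the server is running and llama2 has been pulled:

# Querying a local ollama server from Python (standard library only).
# Assumes `ollama pull llama2` has been run and the server listens on
# its default port 11434.
import json
import urllib.request

payload = {"model": "llama2", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.loads(r.read())["response"])  # the model's full completion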

⑪ AlphaGeometry: Implements symbolic deduction to solve geometry problems at the International Mathematical Olympiad (IMO) level.

⑫ Other useful generative AI proprietary tools

○ GitHub Copilot

○ https://www.perplexity.ai

○ https://consensus.app

○ https://scite.ai/assistant

○ SciSpace / typeset.io

○ Elicit.com

⑶ Important parameters

① temperature = 0: produces highly deterministic outputs

② max_tokens = 100: caps the response at 100 tokens (an API call setting)

③ top_p = 1.0: considers the full token distribution when generating the response

④ frequency_penalty = 0.0: does not avoid repeating tokens more than the model naturally would

⑤ presence_penalty = 0.0: no penalty applied for reusing tokens
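
Here is how those five parameters might appear together in an OpenAI chat completion call; the model name and prompt are placeholders:

# The five parameters above in a single API call (assumes `pip install openai`).
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",       # hypothetical model choice
    messages=[{"role": "user", "content": "Summarize attention in one line."}],
    temperature=0,             # ① very deterministic outputs
    max_tokens=100,            # ② response capped at 100 tokens
    top_p=1.0,                 # ③ full token distribution considered
    frequency_penalty=0.0,     # ④ no extra penalty on frequent tokens
    presence_penalty=0.0,      # ⑤ no penalty for reusing tokens
)
print(resp.choices[0].message.content)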

⑷ Role of Programmers

Role 1: Development of LLM models : Not very practical

○ LLM model development requires substantial server resources, making it difficult without the scale of OpenAI, Meta, Google, etc.

○ Even Naver and Kakao have not reached that scale

○ Some predict that ChatGPT 4.0 might be the final version of LLM unless OpenAI, Meta, Google, etc. join forces

Role 2: Fine-tuning

○ The model itself is already defined; it is improved for a specific domain by supplying substantial training data

○ Example : ChatGPT fine-tuning (a hedged API sketch follows)
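
A hedged sketch of what launching such a fine-tuning job looks like with the OpenAI SDK; the file name, JSONL format, and base model are assumptions:

# Launching an OpenAI fine-tuning job. Assumes "train.jsonl" holds
# chat-formatted training examples; the base model name is illustrative.
from openai import OpenAI

client = OpenAI()
f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
print(job.id, job.status)  # poll until the job reaches "succeeded"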

Role 3: Crawling from DB and tagging with LLM

Role 4: Prompt Engineering

Role 5: Utilizing existing models with different frontends, backends for various services

⑸ Limitations of ChatGPT

① It cannot process images

② Lack of timeliness: cannot be solved by parameter tuning

③ Lack of creativity: only strives for the average

④ Does not understand the real world: the gap between the real world, governed by quantum mechanics, and cyberspace, governed by semiconductors



3. Bioinformatics and Language Models




⑴ BioBERT

⑵ BioNER

⑶ SQuAD

⑷ BioASQ

⑸ PubMedGPT (BioMedLM)

⑹ BioGPT

⑺ scBERT

⑻ GPT-Neo

⑼ PubMedBERT

⑽ BioLinkBERT

⑾ DRAGON

⑿ BioMedLM

⒀ Med-PaLM, Med-PaLM M

⒁ BioMedGPT

⒂ tGPT

⒃ CellLM

⒄ Geneformer

⒅ scGPT

⒆ scFoundation

⒇ SCimilarity

(21) CellPLM



Posted: 2021-12-11 17:34

Modified: 2023-05-18 11:36
