Chapter 21. Natural Language Processing (NLP) and Large Language Models (LLM)
Recommended Readings : 【Algorithms】 Table of Algorithm Contents
1. Natural Language Processing
2. Large Language Models (LLM)
3. Bioinformatics and Language Models
b. Natural Language Processing and LLM Useful Function Collection
c. Research Topics related to LLM
1. Natural Language Processing (NLP)
⑴ Definition : AI models that understand and generate text in human language
⑵ Text Preprocessing : Preprocessing to make unstructured text recognizable to computers
① Tokenization
○ Dividing sentences or corpora into minimum meaning units, tokens, for computer recognition
○ English : Mainly divided by spaces
○ 75 English words ≃ 100 tokens
○ Example : I / ate / noodle / very / deliciously
○ Example : OpenAI Tokenizer
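A minimal sketch of word-level tokenization using only a regular expression; real tokenizers such as OpenAI's byte-pair encoder learn subword vocabularies instead of splitting on spaces and punctuation:

```python
import re

def tokenize(text):
    """Toy tokenizer: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("I ate noodle very deliciously.")
# tokens == ['I', 'ate', 'noodle', 'very', 'deliciously', '.']
```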
② Part-of-Speech Tagging (POS tagging)
○ Technique to tag the parts of speech of morphemes
③ Lemmatization
○ Technique to find lemmas (base words) from words
○ Example : am, are, is → be
④ Stemming
○ Technique to obtain the stem of a word by stripping affixes such as prefixes and suffixes
⑤ Stopword Removal
○ Technique to remove words that contribute little to actual meaning analysis, such as articles, particles, and suffixes
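Lemmatization, stemming, and stopword removal can be sketched together in a toy pipeline. The lemma table, suffix list, and stopword set below are illustrative assumptions, not real linguistic resources (real systems use e.g. WordNet or Porter's stemming rules):

```python
# Toy preprocessing pipeline: lemmatization -> stemming -> stopword removal.
LEMMAS = {"am": "be", "are": "be", "is": "be", "ate": "eat"}  # tiny lemma table
STOPWORDS = {"i", "the", "a", "an", "very", "of"}             # tiny stopword set

def lemmatize(word):
    """Look up the base form (lemma) of a word; fall back to the word itself."""
    return LEMMAS.get(word, word)

def stem(word):
    """Crudely strip one common suffix (a real stemmer applies ordered rules)."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    return [stem(lemmatize(t)) for t in tokens if t not in STOPWORDS]

print(preprocess(["am", "are", "running"]))  # ['be', 'be', 'runn']
```

Note how crude stemming can produce non-words like "runn"; that is expected, since stems only need to be consistent, not dictionary words.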
⑶ Text Mining
① Topic Modeling
○ One of the statistical models in the fields of machine learning and natural language processing used to discover abstract topics, referred to as ‘topics’, within a collection of documents.
○ Used to uncover the hidden meaning structure in the text body.
② Word Cloud
○ A visualization technique that counts word frequencies in a text and draws more frequent words larger, giving a quick view of people’s interests.
③ Social Network Analysis (SNA)
○ An analytical technique for analyzing and visualizing the network characteristics and structure among people within a group.
④ TF-IDF (Term Frequency-Inverse Document Frequency)
○ A technique used to extract how important a word is within a specific document, in a collection of multiple documents.
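TF-IDF can be computed directly from its definition, tf-idf(t, d) = tf(t, d) · log(N / df(t)): a term scores high when it is frequent in one document but rare across the collection. A minimal sketch (real libraries such as scikit-learn apply smoothed variants of the idf term):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} map per document."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        scores.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return scores

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
scores = tf_idf(docs)
# "mat" occurs in only one of three documents, so it scores log(3) there;
# "cat" occurs in two documents, so its idf is the smaller log(3/2).
```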
⑷ Transformer
① Problem Definition: Next Word Prediction
Figure 1. Transformer Problem Definition
② Step 1. Segment the sentence into tokens and map each token into the embedding space
Figure 2. Token Embedding
③ Step 2. Assign attention weights to each word, allowing for next word prediction
Figure 3. Attention Weights and Next Word Prediction
④ Step 3. Pass the attention output through a multilayer perceptron (feed-forward network) to refine the token representations
Figure 4. Attention Multilayer Perceptron
⑤ Step 4. Connect attention and multilayer-perceptron blocks in series to generate sentences token by token
○ To date, the cleverest thinker of all time was ___ → undoubtedly
○ To date, the cleverest thinker of all time was undoubtedly ___ → Einstein
○ To date, the cleverest thinker of all time was undoubtedly Einstein, ___ → for
Figure 5. Serial Circuit of Attention and Perceptrons
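The attention mechanism in Step 2 can be sketched as single-head scaled dot-product attention, written in plain Python over lists of vectors. This omits the learned query/key/value projection matrices and the multi-head structure of a real Transformer layer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (a sketch, no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # attention weights over the keys
        # weighted sum of value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Two identical keys receive equal weight (0.5 each), so the output
# is the average of the two value vectors.
print(attention([[1, 0]], [[1, 0], [1, 0]], [[1, 0], [0, 1]]))  # [[0.5, 0.5]]
```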
⑸ Types
① Type 1: Latent Semantic Analysis (LSA)
② Type 2: Probabilistic Latent Semantic Analysis (PLSA)
③ Type 3: Latent Dirichlet Allocation (LDA)
○ Generative probabilistic models
○ Can be used even for reference-free deconvolution (ref)
⑹ Evaluation of Language Models
① Parallel text datasets : Canadian Parliament Hansard (English ↔ French), European Parliament (multiple languages supported)
② SNS
③ BLEU score (ref)
④ Perplexity
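Perplexity is the exponentiated average negative log-probability that the model assigns to the actual tokens of a held-out text; lower is better. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)), where p_i is the probability
    the model assigned to the i-th actual token. Lower is better."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 0.25 to every actual token has
# perplexity 4 -- it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25] * 10))  # 4.0
```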
2. Large Language Models (LLM)
⑴ Definition : Natural language processing models with billions of parameters
① Counting parameters method: here
② Term 1: Meaning of 7B, 13B, 30B, 65B : Models with 7 billion, 13 billion, 30 billion, and 65 billion parameters, respectively
③ Term 2: Token : The unit of text processed by the model
○ Word-level tokenization : [ChatGPT, is, an, AI, language, model, .]
○ Subword-level tokenization : [Chat, G, PT, is, an, AI, language, model, .]
○ Character-level tokenization : [C, h, a, t, G, P, T, i, s, a, n, A, I, l, a, n, g, u, a, g, e, m, o, d, e, l]
④ Term 3: Meaning of 0-shot, 1-shot, etc. : The number of example inputs given per task
○ 0-shot prompting
Q: <Question>?
A:
○ Few-shot prompting
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A: <Answer>
Q: <Question>?
A:
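Building such prompts programmatically is straightforward; a minimal sketch in which the Q/A layout mirrors the template above (the example pairs are placeholders):

```python
def few_shot_prompt(examples, question):
    """Build a few-shot prompt from (question, answer) example pairs.
    With examples=[], this degenerates to 0-shot prompting."""
    parts = [f"Q: {q}?\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}?\nA:")  # the model completes this final answer
    return "\n".join(parts)

print(few_shot_prompt([("2+2", "4"), ("5+5", "10")], "3+3"))
```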
⑵ Types
② Gopher
④ Flan, PaLM, Flan-PaLM
⑤ OPT-IML
⑥ LLaMA, LLaMA2 : Installed models. Developed by Meta
⑦ Alpaca
⑩ ollama : Supports the following LLMs.
○ Llama2 (7B)
○ Mistral (7B)
○ Dolphin Phi (2.7B)
○ Phi-2 (2.7B)
○ Neural Chat (7B)
○ Starling (7B)
○ Code Llama (7B)
○ Llama2 Uncensored (7B)
○ Llama2 (13B)
○ Llama2 (70B)
○ Orca Mini (3B)
○ Vicuna (7B)
○ LLaVA (7B)
⑪ AlphaGeometry: Implements symbolic deduction to solve geometry problems at the International Mathematical Olympiad (IMO) level.
⑫ Other useful generative AI proprietary tools
○ GitHub Copilot
○ SciSpace / typeset.io
○ Elicit.com
⑶ Important parameters
① temperature = 0: makes outputs essentially deterministic
② max_tokens = 100 (API call setting): response is capped at 100 tokens
③ top_p = 1.0: considers the full token distribution when generating the response
④ frequency_penalty = 0.0: no penalty on token frequency; the model repeats tokens as often as it naturally would
⑤ presence_penalty = 0.0: no penalty for reusing tokens that have already appeared
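How temperature and top_p shape next-token sampling can be illustrated with a toy sampler; this is a sketch of the mechanism, not any provider's actual implementation:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0):
    """Toy next-token sampler. logits: {token: score}.
    temperature = 0 picks the argmax (deterministic); top_p < 1 restricts
    sampling to the smallest set of tokens whose cumulative probability
    reaches top_p (nucleus sampling)."""
    if temperature == 0:
        return max(logits, key=logits.get)  # deterministic: highest score wins
    # temperature-scaled softmax over the scores
    probs = {t: math.exp(s / temperature) for t, s in logits.items()}
    z = sum(probs.values())
    probs = {t: p / z for t, p in probs.items()}
    # nucleus (top_p) filtering: keep most-probable tokens until mass >= top_p
    kept, cum = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        cum += p
        if cum >= top_p:
            break
    z = sum(kept.values())
    return random.choices(list(kept), [p / z for p in kept.values()])[0]

print(sample_next({"a": 3.0, "b": 1.0}, temperature=0))  # 'a'
```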
⑷ Role of Programmers
① Role 1: Development of LLM models : Not very practical
○ LLM model development requires substantial server resources, making it difficult without the scale of OpenAI, Meta, Google, etc.
○ Even Naver and Kakao have not reached that scale
○ Some predict that ChatGPT 4.0 might be the final version of LLM unless OpenAI, Meta, Google, etc. join forces
② Role 2: Fine-tuning
○ The model is already defined; performance in a specific field is improved by providing substantial training data
○ Example : ChatGPT fine-tuning
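OpenAI's chat fine-tuning, for instance, expects a JSONL training file with one {"messages": [...]} record per line; a sketch with placeholder Q/A pairs (the questions, answers, and system message below are illustrative assumptions):

```python
import json

# Placeholder training pairs for the sketch.
pairs = [
    ("What is tokenization?", "Splitting text into minimum meaning units called tokens."),
    ("What does 7B mean?", "A model with 7 billion parameters."),
]

def to_jsonl(pairs, system="You answer NLP questions concisely."):
    """Serialize (question, answer) pairs into chat-format JSONL:
    one {"messages": [system, user, assistant]} object per line."""
    lines = []
    for q, a in pairs:
        lines.append(json.dumps({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": q},
            {"role": "assistant", "content": a},
        ]}))
    return "\n".join(lines)

# To write the file for upload to the fine-tuning API:
# with open("train.jsonl", "w") as f:
#     f.write(to_jsonl(pairs))
```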
③ Role 3: Crawling from DB and tagging with LLM
④ Role 4: Prompt Engineering
⑤ Role 5: Utilizing existing models with different frontends, backends for various services
⑸ Limitations of ChatGPT
① It cannot process images
② Lack of timeliness: cannot be solved by parameter tuning
③ Lack of creativity: only strives for the average
④ Does not understand the real world: there is a gap between the physical world, governed by quantum mechanics, and cyberspace, which runs on semiconductors
3. Bioinformatics and Language Models
⑴ BioBERT
⑵ BioNER
⑶ SQuAD
⑷ BioASQ
⑸ PubMedGPT (BioMedLM)
⑹ BioGPT
⑺ scBERT
⑻ GPT-Neo
⑼ PubMedBERT
⑽ BioLinkBERT
⑾ DRAGON
⑿ BioMedLM
⒂ tGPT
⒃ CellLM
⒅ scGPT
⒇ CellPLM
Input: 2021-12-11 17:34
Modified: 2023-05-18 11:36