How LLM Works Under the Hood
Topic: How LLM Works Under the Hood
Presenter: Cindy Sheng
Additional Resources:
Building Generative AI Applications
Coach Ken LinkedIn:
https://commitway.com/linkedin
Linear regression with 2 variables and 3 variables
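A minimal sketch of the 2-variable case (one input x, one output y), assuming an ordinary least-squares fit. The function name fit_line and the sample data are made up for illustration. The 3-variable case adds a second input and a second weight.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Illustrative data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Here w and b are the "parameters" of the model, the same idea that scales up to billions of parameters in an LLM.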
Neural network. Outer layers + hidden layers
GPT-1: 117 million parameters
GPT-2: 1.5 billion parameters
GPT-3: 175 billion parameters
Model size increases as time progresses
Claude is smaller with good performance
Large language model
Powerful, expensive, slow
MMLU (Massive Multitask Language Understanding)
Benchmark with 57 subjects
Compared against random guessing
Getting close to expert level
Better than average people
Knowledge + problem solving
Expensive
GPU
Can compute in parallel because the individual multiply-add operations in a matrix multiplication are independent of each other
1D: vector
2D: matrix
3D or higher: tensor
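A small sketch of these shapes using plain nested lists, plus a matrix-vector product to show why the work parallelizes. The helper names shape and matvec are made up for illustration; real frameworks store tensors in dedicated array types.

```python
def shape(x):
    """Return the shape of a regularly nested, non-empty list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def matvec(matrix, vector):
    # Each output entry depends only on one row of the matrix and the
    # input vector, so all rows could be computed at the same time.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

vector = [1.0, 2.0, 3.0]                                    # 1D
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                 # 2D
tensor = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # 3D

print(shape(vector), shape(matrix), shape(tensor))
print(matvec(matrix, vector))
```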
LLAMA 16,000 x H100 = $500M
LLAMA 405B technical report
https://ai.meta.com/blog/meta-llama-3-1/
LLAMA: 10.6 million
GPT4: 4.5 million
LLM training
Green: web data
Pile: Common Crawl
Reddit: informal written dialog
Stack Exchange
Wikipedia
Blue: research papers and documents
Orange: Bible, movies
Total 825 GB
Processing
Tokenization: break text into words or sub-word segments
GPT-2 used BPE (byte pair encoding)
Cost is very high: 10.6 million for the LLAMA model
1 word is about 1.3 tokens
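A toy sketch of the idea behind BPE: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration, not GPT-2's actual tokenizer, which works on bytes and uses a merge table learned from a large corpus. The function names and the sample text are made up.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of the pair with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from single characters
for _ in range(2):                 # two merge steps
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After two merges the frequent fragment "low" has become a single token, while rarer endings stay as separate pieces. This is why one word can cost more than one token.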
Embedding: convert a token into a list of floating-point numbers
GPT-2: 50,257 x 1024 (vocabulary size x embedding dimension)
LLAMA 32,000 x
LLaMA 7B: 4,096
LLaMA 13B: 5,120
LLaMA 30B: 6,656
LLaMA 65B: 8,192
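A rough sketch of the token-to-numbers lookup, often called an embedding table: one row of floats per token in the vocabulary. The parameter count uses the GPT-2 figures from the notes. The toy vocabulary, the 4-dimensional vectors, and the random values are assumptions for illustration; in a real model the values are learned during training.

```python
import random

# Size of the lookup table quoted in the notes for GPT-2.
vocab_size, embed_dim = 50_257, 1_024
embedding_params = vocab_size * embed_dim
print(embedding_params)

# Toy table: hypothetical 5-token vocabulary, 4 numbers per token.
rng = random.Random(0)
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
toy_dim = 4
toy_table = [[rng.uniform(-1, 1) for _ in range(toy_dim)] for _ in toy_vocab]

def embed(tokens):
    """Look up one row of the table per token."""
    return [toy_table[toy_vocab[t]] for t in tokens]

vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))
```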
====
Attention:
The relationship between every pair of tokens
Position information is essentially added as well
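A minimal sketch of scaled dot-product attention, assuming the token vectors are used directly as queries, keys, and values. Real models first apply learned projections, use multiple heads, and mask future tokens; all of that is left out here. The function names and the input vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token (all pairs).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights says how much one token attends to every other token, so the table has one entry per pair.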
Activation function
Convert from linear to non-linear
Linear: regardless of how many layers there are, a purely linear network is always equivalent to one layer
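A small check of this claim, assuming layers without bias terms and using ReLU as the example activation function. The weight matrices are arbitrary illustrative values.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

def relu(v):
    return [max(0, x) for x in v]

W1 = [[1, -2], [3, 4]]   # first layer
W2 = [[2, 0], [-1, 1]]   # second layer
x = [1, 1]

two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)
with_activation = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers, one_merged_layer, with_activation)
```

The two linear layers give the same result as the single merged matrix, while inserting the activation between them gives a different result that no single matrix reproduces in general.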
Output the token with the highest probability
Temperature = 0: always choose the token with the highest probability
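A sketch of how temperature can affect the choice of the next token, assuming the common approach of dividing the model's raw scores (logits) by the temperature before the softmax. The function name sample_token and the logits are made up; treating temperature 0 as a plain argmax is a convention assumed here.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores at the given temperature."""
    if temperature == 0:
        # Greedy: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
rng = random.Random(0)

greedy = sample_token(logits, 0, rng)
cold = [sample_token(logits, 0.1, rng) for _ in range(1000)]
warm = [sample_token(logits, 2.0, rng) for _ in range(1000)]
print(greedy, cold.count(0), warm.count(0))
```

A low temperature concentrates nearly all choices on the top token, and a higher temperature spreads them across the alternatives.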
Limits of Pre-trained Model
Pretrained model
Trained up to a certain time
Hallucinations
Domain shift
Task shift
Solving these: be specific about the domain and task in the prompt
Resource constraint
No access to private data
Retrieval Augmented Generation
To increase effectiveness: may change the embedding model
Fine-tuning: train the LLM further, especially with high-quality data
Usually the embedding model used to search the database is different from the LLM's own embedding model
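A toy sketch of the retrieval step in Retrieval Augmented Generation: embed the question, find the most similar document, and put it into the prompt. The embed function here is a hypothetical stand-in that just counts words; a real system would call a separate embedding model and a vector database. The documents and question are invented.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

documents = [  # stands in for private data the LLM never saw
    "The office is closed on public holidays",
    "Expense reports are due on the fifth of each month",
    "The VPN requires two factor authentication",
]
question = "When are expense reports due"
context = retrieve(question, documents)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

The prompt, now carrying the retrieved text, is what gets sent to the LLM, so the model can answer from data it was never trained on.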