How LLM Works Under the Hood

AI & LLMs

Topic: How LLM Works Under the Hood

Presenter: Cindy Sheng

Additional Resources:


Building Generative AI Applications

Coach Ken LinkedIn:

https://commitway.com/linkedin

Linear regression with 2 variables and 3 variables
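To make the regression note concrete, here is a minimal sketch of a linear model with two and three input variables. The toy data, the learning rate, and the helper names `predict` and `fit` are assumptions chosen for illustration; they are not from the talk.

```python
def predict(weights, bias, features):
    """Linear model: weighted sum of the input features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fit(samples, targets, lr=0.01, epochs=5000):
    """Fit weights and bias with batch gradient descent on mean squared error."""
    n_features = len(samples[0])
    n = len(samples)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in zip(samples, targets):
            error = predict(weights, bias, features) - target
            for j, x in enumerate(features):
                grad_w[j] += 2 * error * x / n
            grad_b += 2 * error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Two variables: toy data generated from y = 2*x1 + 3*x2 + 1 (made up).
samples_2 = [(1, 2), (2, 1), (3, 4), (4, 3), (0, 1)]
targets_2 = [2 * a + 3 * b + 1 for a, b in samples_2]
w2, b2 = fit(samples_2, targets_2)

# Three variables: the same formula, just one more weight.
three_var = predict([2.0, 3.0, -1.0], 1.0, (1.0, 2.0, 3.0))
```

On this toy data the fitted weights should land close to 2 and 3 with a bias close to 1. A neural network stacks many such weighted sums, with an activation function between them.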

Neural network: input and output layers plus hidden layers

GPT-1: 117 million parameters

GPT-2: 1.5 billion parameters

GPT-3: 175 billion parameters

Model size increases as time progresses

Claude is smaller with good performance

Large language model

Powerful, expensive, slow

Benchmark: Massive Multitask Language Understanding (MMLU), 57 subjects

Compared against random guessing (25% on four-choice questions)

Getting close to expert level

Better than average people

Knowledge + problem solving

Expensive

GPU

Can compute in parallel because many of the calculations (for example, each output of a matrix multiplication) are independent of each other

1D: Vector

2D: matrix

≥3D: tensor
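A small sketch of the three shapes, using plain nested lists rather than a GPU library. The helper names `shape` and `matvec` are hypothetical, and `shape` assumes non-empty, regularly nested lists.

```python
def shape(nested):
    """Return the dimensions of a regularly nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


vector = [1.0, 2.0, 3.0]                       # 1D
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # 2D
tensor = [matrix, matrix]                      # 3D: a stack of matrices


def matvec(m, v):
    """Each output element uses only one row of m, so rows could be computed in parallel."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


result = matvec(matrix, vector)
```

Here the two entries of `result` do not depend on each other, which is the kind of independence a GPU exploits by computing many of them at once.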

LLaMA: 16,000 x H100 GPUs ≈ $500M

LLAMA 405B technical report

https://ai.meta.com/blog/meta-llama-3-1/

LLAMA: 10.6 million

GPT4: 4.5 million

LLM training

Green: web data

Pile: common crawl

Reddit: informal written dialog

Stackexchange

Wikipedia

Blue: research paper and documents

Orange: Bible, movies

Total 825 GB

Processing

Tokenization: break text into words or sub-word segments

GPT-2 used BPE (byte pair encoding)

Cost is very high: 10.6 million for LLAMA model

1 word is about 1.3 tokens
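The core step of BPE training can be sketched as: count adjacent symbol pairs, then merge the most frequent pair into one new symbol, and repeat. This is a simplified character-level illustration, not GPT-2's actual byte-level tokenizer; the toy corpus and its word frequencies are made up.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters with a made-up frequency.
corpus = {tuple("low"): 5, tuple("lot"): 3, tuple("own"): 2, tuple("new"): 4}
first_pair = most_frequent_pair(corpus)
after_one_merge = merge_pair(corpus, first_pair)
```

In this corpus the pair `("l", "o")` is the most frequent, so after one merge "low" is represented as the two symbols `"lo"` and `"w"`. Repeating the step grows the vocabulary one merged symbol at a time.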

Embedding: convert a token into a list of floating-point numbers

GPT-2: 50,257 x 1,024

LLAMA 32,000 x

LLaMA 7B: 4,096

LLaMA 13B: 5,120

LLaMA 30B: 6,656

LLaMA 65B: 8,192
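An embedding table is a matrix with one row per token in the vocabulary, so looking up a token is just row indexing. The sketch below builds a tiny random table (real values are learned during training) and then works out the table sizes from the numbers quoted above. The function name `make_embedding_table` and the 10 x 4 toy size are assumptions for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` floats per token id; real models learn these values in training."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


# Tiny illustrative table (10 tokens, 4 dimensions) so it is cheap to build.
table = make_embedding_table(vocab_size=10, dim=4)
token_ids = [3, 7, 3]
vectors = [table[t] for t in token_ids]   # embedding lookup is plain row indexing

# Parameter counts of the full-size tables, using the sizes quoted in the notes.
gpt2_embedding_params = 50_257 * 1_024      # 51,463,168
llama_7b_embedding_params = 32_000 * 4_096  # 131,072,000
```

The same token id always maps to the same vector, which is why `vectors[0]` and `vectors[2]` are identical here.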

====

Attention:

Captures the relationship between every pair of tokens

Position information is added to each token's embedding, since attention by itself ignores word order
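A minimal sketch of scaled dot-product self-attention, which scores every pair of tokens and then mixes the token vectors by those scores. As a simplifying assumption, queries, keys, and values are all the raw input vectors; a real transformer first passes them through learned projection matrices and uses several attention heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = the input vectors."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, vectors)) for j in range(dim)]
        for row in weights
    ]
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
weights, outputs = self_attention(tokens)
```

`weights` is a 3 x 3 table with one score for every pair of tokens, and each row sums to 1. Nothing in the computation depends on the order of the tokens, which is why position information has to be supplied separately.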

Activation function

Convert from linear to non-linear

Linear: regardless of how many layers there are, a stack of purely linear layers is always equivalent to one layer
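The collapse of stacked linear layers can be checked on a small example: applying two weight matrices in sequence gives the same result as applying their product once, and putting an activation between them breaks that equivalence. The weights are made up, biases are left out for brevity, and ReLU is used as one common choice of activation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(matrix, vector):
    """Apply a linear layer (no bias) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """A common activation function: negative values become zero."""
    return [max(0.0, v) for v in vector]


w1 = [[1.0, -2.0], [-3.0, 4.0]]
w2 = [[2.0, 1.0], [0.0, -1.0]]
x = [1.0, 1.0]

two_linear_layers = apply(w2, apply(w1, x))       # [-1.0, -1.0]
one_merged_layer = apply(matmul(w2, w1), x)       # [-1.0, -1.0], same result
with_activation = apply(w2, relu(apply(w1, x)))   # [1.0, -1.0], no longer the same
```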

Output the token with the highest probability

Temperature = 0: always choose the token with highest probability
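A sketch of how temperature affects the choice of the next token: the raw scores are divided by the temperature before the softmax, and a temperature of 0 is treated as a special case that takes the highest-scoring token directly. The function name `sample_token` and the example scores are assumptions for illustration.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw scores; temperature 0 falls back to argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
greedy = [sample_token(logits, 0) for _ in range(5)]

rng = random.Random(0)    # fixed seed so the run is repeatable
sampled = [sample_token(logits, 1.0, rng) for _ in range(20)]
```

With temperature 0 every call returns index 0, the highest-scoring token. With temperature 1.0 the lower-scoring tokens can also be picked, and higher temperatures flatten the distribution further.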

Limits of Pre-trained Model

Pretrained model

Trained on data up to a certain cutoff date

Hallucinations

Domain shift

Task shift

Solving these: be specific about the domain and task in the prompt

Resource constraint

No access to private data

Retrieval Augmented Generation

To improve effectiveness: may change the embedding model

Fine-tuning: train the LLM further, especially with high-quality data

The embedding model used to search the database is usually different from the embedding layer inside the LLM
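The retrieval step of RAG can be sketched as: embed the question, rank the stored documents by similarity, and paste the best matches into the prompt. The `embed` function below is a stand-in that only counts words; a real system would call a separate embedding model here. The documents and the prompt wording are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


documents = [
    "The refund policy allows returns within 30 days",
    "The office is closed on public holidays",
    "Support tickets are answered within two business days",
]
question = "What is the refund policy"
context = retrieve(question, documents)
prompt = "Answer using only this context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```

The LLM then answers from `prompt`, so it can use private or recent data without being retrained. Swapping `embed` for a better embedding model changes which documents are retrieved, which is one way to tune effectiveness.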