NLP & Embeddings
Interview Context
The core of NLP interviews is not memorizing model architectures, but understanding the evolution of representations (one-hot → static embeddings → contextual embeddings → LLMs) and what problem each stage solved. Interviewers want to see that you understand why embeddings are needed and how to use them in real scenarios.
What You Should Understand
- Understand the evolution from one-hot to contextual embeddings and what each step solved
- Know how Word2Vec / GloVe work and their limitations
- Be able to compare BERT and GPT architectures and their use cases
- Understand why tokenization methods (BPE, WordPiece) matter
- Know the tradeoffs of fine-tuning vs feature extraction vs prompting
- Understand the basic idea of RAG and embedding-based retrieval
The Evolution of Text Representation
| Era | Method | Dimension | Semantic Info | Context-Aware |
|---|---|---|---|---|
| 1990s | One-hot | Vocab size (10K-100K) | None | No |
| 2000s | Bag-of-Words / TF-IDF | Vocab size (sparse) | Statistical | No |
| 2013 | Word2Vec / GloVe | 100-300 (dense) | Yes (static) | No |
| 2018 | ELMo | 1024 | Yes | Yes (BiLSTM) |
| 2018 | BERT | 768 (base) | Yes | Yes (Transformer) |
| 2020+ | GPT-3/4, LLaMA | 4096-12288 | Yes | Yes (Transformer, massive scale) |
Each step solves the core limitation of the one before it.
Text Preprocessing
Tokenization: Why It Matters
Models cannot consume raw text directly; it must first be split into tokens (sub-word units). The choice of tokenization directly affects model performance.
Tokenization Methods
| Method | How It Works | Used By |
|---|---|---|
| Word-level | Split by spaces | Legacy (vocabulary too large) |
| Character-level | Each character = 1 token | Very long sequences, limited context |
| BPE (Byte-Pair Encoding) | Iteratively merge frequent character pairs | GPT-2, GPT-3, LLaMA |
| WordPiece | Like BPE but uses likelihood to decide merges | BERT |
| SentencePiece | Language-agnostic, treats input as byte stream | T5, LLaMA, multilingual models |
BPE: How It Works
- Start with individual characters as initial vocabulary
- Count all adjacent character pairs in corpus
- Merge the most frequent pair into one token
- Repeat until desired vocabulary size reached
Example: "low" "lower" "lowest" → after merging l+o, lo+w → tokens: low, er, est
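The merge loop above can be sketched in a few lines of plain Python, a toy version over the same three words (real tokenizer libraries use much more optimized implementations):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged, new_symbol = " ".join(pair), "".join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

# Words stored as space-separated characters, with toy corpus frequencies
vocab = {"l o w": 5, "l o w e r": 2, "l o w e s t": 2}
for _ in range(2):  # two merge steps: l+o, then lo+w
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)

print(vocab)  # {'low': 5, 'low e r': 2, 'low e s t': 2}
```

Continuing the merges would eventually produce `er` and `est` as tokens, matching the example above.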
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unbelievable")
# → ['un', '##be', '##lie', '##va', '##ble']
# WordPiece: ## prefix means "continuation of previous token"
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
tokens_gpt = tokenizer_gpt.tokenize("unbelievable")
# → ['un', 'believ', 'able']
# BPE: different splitting strategy
Common Interview Question
"Why not use word-level tokenization?" The vocabulary becomes huge (English has 100K+ unique words) and OOV (out-of-vocabulary) words cannot be handled. Sub-word tokenization covers all possible words with a limited vocabulary (30K-50K): rare words get split into known sub-words (e.g. "unbelievable" → "un" + "believ" + "able").
Static Word Embeddings
One-Hot: The Starting Point
Each word = a sparse vector with a single 1, e.g. "cat" = (0, …, 0, 1, 0, …, 0) in a |V|-dimensional space.
Problems: no semantic information (the distance between "cat" and "dog" equals the distance between "cat" and "car": every pair of distinct words is exactly √2 apart), dimensionality equals vocabulary size, and the vectors are extremely sparse.
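A quick numpy check of the equal-distance problem, using a toy 5-word vocabulary:

```python
import numpy as np

vocab = ["cat", "dog", "car", "the", "run"]
one_hot = np.eye(len(vocab))  # row i = one-hot vector of vocab[i]

cat, dog, car = one_hot[0], one_hot[1], one_hot[2]
# Every pair of distinct one-hot vectors is exactly sqrt(2) apart...
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - car))
# ...and has cosine similarity 0: no sense in which "cat" is closer to "dog"
print(cat @ dog, cat @ car)  # 0.0 0.0
```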
TF-IDF
Term Frequency × Inverse Document Frequency:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. TF-IDF gives high weight to words that are frequent in this document but rare across the corpus. Better than raw counts, but still high-dimensional, sparse, and without semantic similarity.
Word2Vec (Mikolov et al., 2013)
Train a shallow neural network to predict words from context, and use the learned weights as embeddings.
Two architectures:
| | CBOW | Skip-gram |
|---|---|---|
| Input | Context words | Center word |
| Output | Predict center word | Predict context words |
| Better for | Frequent words | Rare words |
| Training speed | Faster | Slower |
Skip-gram objective: maximize the average log probability over the corpus

(1/T) Σ_t Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t), where p(w_O | w_I) = exp(v′_{w_O} · v_{w_I}) / Σ_{w=1}^{|V|} exp(v′_w · v_{w_I})

The denominator sums over the entire vocabulary → too expensive → use negative sampling (sample a few random negatives instead of computing the full softmax).
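The negative-sampling loss for one (center, context) pair can be written directly in numpy. The toy vectors below are random stand-ins for the two embedding tables:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_negatives = 100, 5

v_center = rng.normal(size=dim)               # input embedding of the center word
v_context = rng.normal(size=dim)              # output embedding of a true context word
v_negs = rng.normal(size=(n_negatives, dim))  # output embeddings of sampled negatives

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Maximize log sigmoid(v_context . v_center) + sum_k log sigmoid(-v_neg_k . v_center):
# one positive dot product plus k negative ones, instead of a |V|-way softmax.
loss = -(np.log(sigmoid(v_context @ v_center))
         + np.sum(np.log(sigmoid(-(v_negs @ v_center)))))
print(loss)
```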
The magic: Learned embeddings capture semantic relationships:
from gensim.models import Word2Vec
# Train Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1) # sg=1: skip-gram
# Get embedding
vec = model.wv["king"] # 100-dim dense vector
# Find similar words
model.wv.most_similar("king") # → [("queen", 0.85), ("prince", 0.78), ...]
# Analogy: king - man + woman = ?
model.wv.most_similar(positive=["king", "woman"], negative=["man"]) # → queen
GloVe (Pennington et al., 2014)
Global Vectors for Word Representation — combines count-based and prediction-based methods:
J = Σ_{i,j} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where X_ij is the co-occurrence count of words i and j in a context window, and f is a weighting function that caps the influence of very frequent pairs.
Intuition: co-occurrence statistics already contain semantic information; GloVe compresses these statistics into dense vectors.
FastText (Bojanowski et al., 2017)
Extends Word2Vec by representing each word as a bag of character n-grams: e.g. with n = 3, "where" → <wh, whe, her, ere, re> plus the whole word <where>.
A word's embedding is the sum of its n-gram embeddings.
Key advantage: can generate embeddings for unseen words (by summing their n-gram vectors). Word2Vec and GloVe have no embedding at all for OOV words.
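The n-gram decomposition is easy to sketch in plain Python; FastText wraps each word in `<` `>` boundary markers and keeps the full word as one extra token:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with < > boundary markers."""
    padded = f"<{word}>"
    grams = [padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)]
    return grams + [padded]  # the whole word is kept as its own token

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

An unseen word like "whereby" shares several of these n-grams, so it still gets a reasonable embedding as the sum of its n-gram vectors.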
Static Embedding Limitations
| Limitation | Example |
|---|---|
| Polysemy | "bank" (financial institution vs riverbank) → one vector for two completely different meanings |
| Context-free | "I love this bank" vs "sitting by the bank" → identical embedding |
| Fixed vocabulary | OOV words have no embedding (FastText partially solves this) |
These limitations motivated contextual embeddings: the same word gets a different embedding in different contexts.
Contextual Embeddings
The Key Insight
Static embeddings: e = E[w]. The same word always maps to the same vector, regardless of context.
Contextual embeddings: e_t = f(w_1, …, w_n, t). The same word gets a different vector in each sentence, computed from the full sequence.
ELMo (Peters et al., 2018)
Embeddings from Language Models — use pre-trained BiLSTM language model:
- Train forward + backward LSTM on large corpus
- For each word, concatenate hidden states from all LSTM layers
- Task-specific weighted combination of layers
ELMo proved that contextual embeddings are far better than static ones. But BiLSTMs are sequential → slow → the Transformer replaced them.
BERT (Devlin et al., 2018)
Bidirectional Encoder Representations from Transformers:
| Aspect | Detail |
|---|---|
| Architecture | Transformer encoder only |
| Direction | Bidirectional — sees full left + right context |
| Pre-training | MLM (Masked Language Model) + NSP |
| Input | WordPiece tokens + [CLS] + [SEP] |
| Output | Contextual embedding per token |
Masked Language Model (MLM):
- Randomly mask 15% of tokens
- Model predicts the masked tokens from surrounding context
- Forces model to learn deep bidirectional representations
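The masking step can be sketched in plain Python. BERT additionally applies an 80/10/10 rule to the 15% of selected positions: 80% become [MASK], 10% a random token, 10% stay unchanged (a toy sketch, not the tokenizer-level implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked, labels); labels[i] is None where no prediction is needed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:        # select ~15% of positions
            labels.append(tok)              # model must recover the original token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")     # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)          # 10%: keep, but still predict
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the bank by the river was beautiful".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
```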
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The bank by the river was beautiful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# outputs.last_hidden_state: [1, seq_len, 768]
# "bank" embedding is DIFFERENT in "river bank" vs "investment bank"
cls_embedding = outputs.last_hidden_state[:, 0, :] # [CLS] token → sentence embedding
GPT (Radford et al., 2018+)
Generative Pre-trained Transformer:
| Aspect | Detail |
|---|---|
| Architecture | Transformer decoder only |
| Direction | Left-to-right (autoregressive) |
| Pre-training | Next token prediction (CLM) |
| Output | Generates one token at a time |
| Scaling | GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 |
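Autoregressive decoding is just a loop: score next tokens given the prefix, pick one, append, repeat. A toy greedy sketch with a hypothetical bigram table standing in for the Transformer:

```python
# Hypothetical next-token scores keyed on the last token (stand-in for a real model)
bigram_scores = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.8, "ran": 0.2},
    "sat": {"</s>": 1.0},
}

def greedy_generate(start="<s>", max_len=10):
    tokens = [start]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        scores = bigram_scores[tokens[-1]]
        tokens.append(max(scores, key=scores.get))  # greedy: take the argmax token
    return tokens

print(greedy_generate())  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

A real GPT conditions on the entire prefix, not just the last token, and sampling strategies (temperature, top-p) often replace the plain argmax.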
BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Context | Bidirectional | Left-to-right only |
| Pre-training | MLM (predict masked tokens) | CLM (predict next token) |
| Strength | Understanding (classification, NER, similarity) | Generation (text, code, conversation) |
| Weakness | Cannot generate text | Weaker at bidirectional understanding |
| Fine-tuning | Add task head + fine-tune | Prompt engineering or fine-tune |
| Embedding use | [CLS] or mean pooling | Last token or mean |
Classic Interview Confusion
"Can BERT do text generation?" No. BERT is encoder-only and bidirectional, with no autoregressive mechanism. Generation requires GPT (decoder-only, autoregressive). Conversely, GPT is weaker than BERT at classification (no bidirectional context).
Fine-Tuning Strategies
Three Approaches
| Strategy | What | When | Data Needed |
|---|---|---|---|
| Feature extraction | Freeze model, extract embeddings → train separate classifier | Very small data (< 1K) | Very little |
| Fine-tuning | Unfreeze model, train end-to-end on task data | Medium data (1K-100K) | Some |
| Prompt engineering | Design input prompts, no weight update | Zero/few-shot | None |
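Feature extraction in a few lines: frozen-model embeddings feed a cheap classifier, and only the classifier trains. Random vectors below stand in for the frozen encoder's outputs (in practice, e.g. Sentence-BERT encodings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))   # stand-in for 200 frozen sentence embeddings
y = rng.integers(0, 2, size=200)  # binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # only this trains
preds = clf.predict(X_test)       # the encoder itself is never updated
```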
Fine-Tuning BERT
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# BERT + new classification head → fine-tune all weights
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # small lr for fine-tuning
    weight_decay=0.01,
    warmup_steps=500,
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
Fine-Tuning Tips
(1) Use a small learning rate (2e-5 to 5e-5): the pre-trained weights are already good, don't destroy them. (2) Warmup matters. (3) 2-4 epochs is usually enough; more easily overfits on small data. (4) Freeze the bottom layers if data is very scarce (the universal features in the lower layers don't need to change).
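Freezing bottom layers is one loop over `named_parameters`. The toy module below stands in for a BERT-style encoder (real BERT-base parameters are named `encoder.layer.0` … `encoder.layer.11`, so the same name-prefix trick applies):

```python
import torch.nn as nn

# Toy 4-layer "encoder" standing in for a real pre-trained model
model = nn.ModuleDict({
    "embeddings": nn.Embedding(100, 32),
    "layer": nn.ModuleList([nn.Linear(32, 32) for _ in range(4)]),
    "classifier": nn.Linear(32, 2),
})

n_frozen = 2
for name, param in model.named_parameters():
    # Freeze embeddings and the bottom n_frozen layers; keep the rest trainable
    if name.startswith("embeddings") or any(
        name.startswith(f"layer.{i}.") for i in range(n_frozen)
    ):
        param.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only layer.2.*, layer.3.* and classifier.* remain trainable
```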
Sentence Embeddings
Word embeddings are not sentence embeddings: token-level representations must be aggregated into a sentence-level one.
Methods
| Method | How | Quality |
|---|---|---|
| [CLS] token | Use BERT's [CLS] output | Not great ([CLS] is optimized for NSP, not similarity) |
| Mean pooling | Average all token embeddings | Better than [CLS], simple |
| Sentence-BERT | Fine-tune with siamese network on NLI data | Best for similarity tasks |
| Instructor / E5 | Task-specific instruction + embedding | State-of-the-art |
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["This is a sentence.", "This is another one."])
# cosine similarity → semantic similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
RAG (Retrieval-Augmented Generation)
RAG combines retrieval and generation, letting an LLM access external knowledge without retraining:
How It Works
- Index: encode documents into vectors with an embedding model → store in a vector database
- Retrieve: encode the user query → find the top-K most similar documents via ANN search
- Generate: concatenate the retrieved documents with the user query → feed to the LLM → generate the answer
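The retrieve step is nearest-neighbor search in embedding space. A brute-force numpy sketch with random stand-ins for real document embeddings (production systems swap in an ANN index such as FAISS or HNSW):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 384))                      # "encoded documents"
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # L2-normalize rows

query = rng.normal(size=384)
query /= np.linalg.norm(query)

# With unit vectors, cosine similarity is a plain dot product
scores = doc_vecs @ query
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar documents
print(top_k, scores[top_k])
```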
Why RAG Matters
| Problem | How RAG Solves It |
|---|---|
| Hallucination | The LLM's answer is grounded in retrieved evidence |
| Knowledge cutoff | Can retrieve from up-to-date documents |
| Domain expertise | Index domain-specific documents without fine-tuning |
| Attribution | Can cite sources (which documents were used) |
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
# 1. Index documents (documents: a list of langchain Document objects)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)
# 2. Retrieve relevant docs for a query
docs = vectorstore.similarity_search("What is attention mechanism?", k=3)
# 3. Generate answer using retrieved context + LLM
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Based on these documents:\n{context}\n\nAnswer: What is attention?"
answer = llm(prompt)  # llm: any callable LLM client
RAG in Interviews
RAG is a high-frequency interview topic in 2024+. Key concept: embedding quality determines retrieval quality, which determines generation quality ("garbage in, garbage out"). The chunking strategy (how documents are split) and the choice of embedding model often matter more than the LLM itself.
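A minimal sliding-window chunker (character-based with overlap; real pipelines often chunk by tokens, sentences, or document structure such as headings):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "word " * 200  # 1000 characters of toy text
chunks = chunk_text(doc)
# Adjacent chunks share `overlap` characters, so context isn't cut mid-thought
print(len(chunks), len(chunks[0]))  # 7 200
```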
Real-World Use Cases
Case 1: Credit Card Fraud — Text Features from Merchant Descriptions
Merchant names and descriptions contain fraud signal; use embeddings to add text features to the fraud model:
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
merchant_embeddings = model.encode(df["merchant_description"].tolist())
# Concatenate with numerical features → feed to GBM or MLP
X = np.hstack([numerical_features, merchant_embeddings])
Interview follow-up: "Why not TF-IDF?" TF-IDF does not capture semantic similarity (the TF-IDF vectors of "online store" and "e-commerce shop" are completely different even though the meanings are close). Pre-trained embeddings encode semantic meaning directly.
Case 2: Recommender Systems — Semantic Search for Content
Use embeddings for content-based recommendation: embed the user query or liked items → find similar items:
| Approach | Method |
|---|---|
| Product search | Query embedding vs product description embeddings → cosine similarity |
| Similar items | Item embedding similarity (content-based) |
| Cold start | New item description → embedding → find similar existing items |
Case 3: Customer Segmentation — Clustering on Support Tickets
Embed customer support tickets → cluster → automatically categorize issue types:
- Embed all tickets with Sentence-BERT
- Reduce dimensions with UMAP
- Cluster with K-Means or HDBSCAN
- Label each cluster by examining representative tickets
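The pipeline above, sketched with sklearn. Random vectors stand in for Sentence-BERT ticket embeddings, and PCA stands in for UMAP to keep dependencies minimal:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # stand-in for embedded tickets

reduced = PCA(n_components=10, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(reduced)

# Inspect a few representative tickets per cluster to hand-label issue types
for c in range(5):
    members = np.where(labels == c)[0][:3]
    print(f"cluster {c}: tickets {members.tolist()}")
```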
Hands-on: NLP in Python
Word2Vec
from gensim.models import Word2Vec
# sentences: list of lists of words
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)
model.wv.most_similar("king")
# → [("queen", 0.85), ("prince", 0.78), ...]
model.wv.most_similar(positive=["king", "woman"], negative=["man"])
# → [("queen", 0.89)]
BERT Embeddings
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
emb1 = get_embedding("The cat sat on the mat.")
emb2 = get_embedding("A kitten was resting on the rug.")
# High cosine similarity — semantically similar
Sentence-BERT for Similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["I love machine learning", "ML is my passion", "The weather is nice"]
embeddings = model.encode(sentences)
sim_matrix = cosine_similarity(embeddings)
# sentences[0] vs [1]: high similarity (~0.85)
# sentences[0] vs [2]: low similarity (~0.15)
Interview Signals
What interviewers listen for:
- You can explain the one-hot → Word2Vec → BERT evolution and what each step solved
- You know Word2Vec's polysemy limitation (one word = one vector)
- You can compare BERT vs GPT architectures and their use cases
- You understand why sub-word tokenization beats word-level
- You know how RAG works and why embedding quality matters
Practice
Flashcards
How do Word2Vec's skip-gram and CBOW differ?
Skip-gram: input = center word, predicts context words → better for rare words. CBOW: input = context words, predicts the center word → better for frequent words, faster to train. Skip-gram usually produces better embeddings (especially on small datasets).
Quiz
How does the Word2Vec analogy king - man + woman ≈ queen work?