NLP & Embeddings

Interview Context

The core of an NLP interview is not reciting model architectures; it is understanding the evolution of representations (one-hot → static embeddings → contextual embeddings → LLMs) and what problem each stage solved. Interviewers want to know that you understand why embeddings are needed and how to use them in real-world scenarios.

What You Should Understand

  • Understand the evolution from one-hot to contextual embeddings and the problem each step solved
  • Know the principles and limitations of Word2Vec / GloVe
  • Be able to compare the architectural differences and use cases of BERT and GPT
  • Understand why tokenization methods (BPE, WordPiece) matter
  • Know the tradeoffs between fine-tuning, feature extraction, and prompting
  • Understand the basics of RAG and embedding-based retrieval

The Evolution of Text Representation

| Era | Method | Dimension | Semantic Info | Context-Aware |
|---|---|---|---|---|
| 1990s | One-hot | Vocab size (10K-100K) | None | No |
| 2000s | Bag-of-Words / TF-IDF | Vocab size (sparse) | Statistical | No |
| 2013 | Word2Vec / GloVe | 100-300 (dense) | Yes (static) | No |
| 2018 | ELMo | 1024 | Yes | Yes (BiLSTM) |
| 2018 | BERT | 768 (base) | Yes | Yes (Transformer) |
| 2020+ | GPT-3/4, LLaMA | 4096-12288 | Yes | Yes (Transformer, massive scale) |

Each step resolved the core limitation of the previous one.

Text Preprocessing

Tokenization: Why It Matters

A model cannot consume raw text directly; the text must first be split into tokens (sub-word units). The choice of tokenization directly affects model performance.

Tokenization Methods

| Method | How It Works | Used By |
|---|---|---|
| Word-level | Split by spaces | Legacy (vocabulary too large) |
| Character-level | Each character = 1 token | Very long sequences, limited context |
| BPE (Byte-Pair Encoding) | Iteratively merge frequent character pairs | GPT-2, GPT-3, LLaMA |
| WordPiece | Like BPE but uses likelihood to decide merges | BERT |
| SentencePiece | Language-agnostic, treats input as byte stream | T5, LLaMA, multilingual models |

BPE: How It Works

  1. Start with individual characters as initial vocabulary
  2. Count all adjacent character pairs in corpus
  3. Merge the most frequent pair into one token
  4. Repeat until desired vocabulary size reached

Example: "low" "lower" "lowest" → after merging l+o, lo+w → tokens: low, er, est
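The merge loop above can be sketched in a few lines (a toy implementation for illustration, not any production tokenizer):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Toy BPE: word_freqs maps word -> frequency; returns learned merges and vocab."""
    # Each word starts as a tuple of single characters
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
# merges: [('l', 'o'), ('lo', 'w')] — reproducing the l+o, lo+w example above
```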

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unbelievable")
# → ['un', '##be', '##lie', '##va', '##ble']
# WordPiece: ## prefix means "continuation of previous token"

tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
tokens_gpt = tokenizer_gpt.tokenize("unbelievable")
# → ['un', 'believ', 'able']
# BPE: different splitting strategy

Common Interview Question

"Why not use word-level tokenization?" The vocabulary becomes too large (English has 100K+ unique words) and OOV (out-of-vocabulary) words cannot be handled. Sub-word tokenization covers all possible words with a limited vocabulary (30K-50K): rare words are split into known sub-words (e.g. "unbelievable" → "un" + "believe" + "able").

Static Word Embeddings

One-Hot: The Starting Point

Each word = a sparse vector with a single 1:

\text{"cat"} = [0, 0, 1, 0, \ldots, 0] \quad \text{(10K-dim, only one 1)}

Problems: no semantic information (the distance between "cat" and "dog" equals the distance between "cat" and "car", both \sqrt{2}); dimensionality = vocabulary size; extremely sparse.
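The equal-distance problem is easy to see directly (a minimal sketch with a toy 4-word vocabulary):

```python
import numpy as np

# Toy 4-word vocabulary; each word is a one-hot basis vector
vocab = ["cat", "dog", "car", "the"]
one_hot = np.eye(len(vocab))
cat, dog, car = one_hot[0], one_hot[1], one_hot[2]

# Every pair of distinct words is exactly sqrt(2) apart: no semantics encoded
d_cat_dog = np.linalg.norm(cat - dog)
d_cat_car = np.linalg.norm(cat - car)
```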

TF-IDF

Term Frequency × Inverse Document Frequency:

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}

TF-IDF gives high weight to words that are frequent in this document but rare across the corpus. Better than raw counts, but still high-dimensional, sparse, and without semantic similarity.

Word2Vec (Mikolov et al., 2013)

Train a shallow neural network to predict words from context, and use the learned weights as embeddings.

Two architectures:

| | CBOW | Skip-gram |
|---|---|---|
| Input | Context words | Center word |
| Output | Predict center word | Predict context words |
| Better for | Frequent words | Rare words |
| Training speed | Faster | Slower |

Skip-gram objective:

\max \sum_{(w, c)} \log P(c \mid w) = \max \sum_{(w, c)} \log \frac{\exp(\mathbf{v}_c^\top \mathbf{v}_w)}{\sum_{c'} \exp(\mathbf{v}_{c'}^\top \mathbf{v}_w)}

The denominator sums over the entire vocabulary, which is far too expensive, so negative sampling is used instead (sample a few random negatives rather than computing the full softmax).
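Concretely, skip-gram with negative sampling (SGNS) maximizes, for each positive pair (w, c) with k sampled negatives (\sigma is the sigmoid):

```latex
\log \sigma(\mathbf{v}_c^\top \mathbf{v}_w)
+ \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{c_i}^\top \mathbf{v}_w) \right]
```

This turns one softmax over the whole vocabulary into k + 1 cheap binary classifications.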

The magic: Learned embeddings capture semantic relationships:

\text{king} - \text{man} + \text{woman} \approx \text{queen}

from gensim.models import Word2Vec

# Train Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)  # sg=1: skip-gram
# Get embedding
vec = model.wv["king"]  # 100-dim dense vector
# Find similar words
model.wv.most_similar("king")  # → [("queen", 0.85), ("prince", 0.78), ...]
# Analogy: king - man + woman = ?
model.wv.most_similar(positive=["king", "woman"], negative=["man"])  # → queen

GloVe (Pennington et al., 2014)

Global Vectors for Word Representation — combines count-based and prediction-based methods:

J = \sum_{i,j} f(X_{ij})(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2

where X_{ij} is the co-occurrence count of words i and j in a context window.

Intuition: co-occurrence statistics already contain semantic information; GloVe compresses those statistics into dense vectors.

FastText (Bojanowski et al., 2017)

Extends Word2Vec by representing each word as a bag of character n-grams:

\text{"where"} \to \{\text{"<wh"}, \text{"whe"}, \text{"her"}, \text{"ere"}, \text{"re>"}\}

Word embedding = sum of its n-gram embeddings.

Key advantage: FastText can generate embeddings for unseen words (by summing their n-gram vectors). Word2Vec and GloVe cannot handle OOV words at all.
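The n-gram decomposition itself is easy to sketch (a minimal stand-in for FastText's subword extraction; the function and parameter names are illustrative):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Decompose a word into boundary-marked character n-grams, FastText-style."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("where", 3, 3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word's embedding is then the sum of the vectors of its n-grams, which is why OOV words still get a representation.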

Static Embedding Limitations

| Limitation | Example |
|---|---|
| Polysemy | "bank" (financial institution vs riverbank) → one vector for two completely different meanings |
| Context-free | "I love this bank" vs "sitting by the bank" → identical embedding |
| Fixed vocabulary | OOV words have no embedding (FastText partially solves this) |

These limitations motivated contextual embeddings: the same word receives a different embedding in each context.

Contextual Embeddings

The Key Insight

Static embeddings: embed(word). The same word always gets the same vector.

Contextual embeddings: embed(word, context). The same word gets a different vector in each sentence.

ELMo (Peters et al., 2018)

Embeddings from Language Models — use pre-trained BiLSTM language model:

  1. Train forward + backward LSTM on large corpus
  2. For each word, concatenate hidden states from all LSTM layers
  3. Task-specific weighted combination of layers

ELMo proved that contextual embeddings far outperform static ones. But a BiLSTM is sequential, hence slow, so the Transformer replaced it.

BERT (Devlin et al., 2018)

Bidirectional Encoder Representations from Transformers:

| Aspect | Detail |
|---|---|
| Architecture | Transformer encoder only |
| Direction | Bidirectional, sees full left and right context |
| Pre-training | MLM (Masked Language Model) + NSP |
| Input | WordPiece tokens + [CLS] + [SEP] |
| Output | Contextual embedding per token |

Masked Language Model (MLM):

  • Randomly mask 15% of tokens
  • Model predicts the masked tokens from surrounding context
  • Forces model to learn deep bidirectional representations

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank by the river was beautiful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state: [1, seq_len, 768]
# "bank" embedding is DIFFERENT in "river bank" vs "investment bank"
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token → sentence embedding

GPT (Radford et al., 2018+)

Generative Pre-trained Transformer:

| Aspect | Detail |
|---|---|
| Architecture | Transformer decoder only |
| Direction | Left-to-right (autoregressive) |
| Pre-training | Next token prediction (CLM) |
| Output | Generates one token at a time |
| Scaling | GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 |

BERT vs GPT

| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Context | Bidirectional | Left-to-right only |
| Pre-training | MLM (predict masked tokens) | CLM (predict next token) |
| Strength | Understanding (classification, NER, similarity) | Generation (text, code, conversation) |
| Weakness | Cannot generate text | Weaker at bidirectional understanding |
| Fine-tuning | Add task head + fine-tune | Prompt engineering or fine-tune |
| Embedding use | [CLS] or mean pooling | Last token or mean |

Classic Interview Confusion

"Can BERT do text generation?" No. BERT is encoder-only and bidirectional, with no autoregressive mechanism. Generation requires GPT (decoder-only, autoregressive). Conversely, GPT is weaker than BERT at classification (no bidirectional context).

Fine-Tuning Strategies

Three Approaches

| Strategy | What | When | Data Needed |
|---|---|---|---|
| Feature extraction | Freeze model, extract embeddings → train separate classifier | Very small data (< 1K) | Very little |
| Fine-tuning | Unfreeze model, train end-to-end on task data | Medium data (1K-100K) | Some |
| Prompt engineering | Design input prompts, no weight update | Zero/few-shot | None |
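The feature-extraction row can be sketched as follows; synthetic vectors stand in for the frozen encoder's output (in practice they would come from something like SentenceTransformer(...).encode(texts) or BERT mean pooling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for frozen-encoder embeddings: 200 texts, 384-dim vectors
# (384 matches e.g. all-MiniLM-L6-v2 output)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))
labels = (embeddings[:, 0] > 0).astype(int)  # toy target for illustration

# Only this lightweight classifier head is trained; the encoder never updates
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
```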

Fine-Tuning BERT

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# BERT + new classification head → fine-tune all weights

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,           # small lr for fine-tuning
    weight_decay=0.01,
    warmup_steps=500,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

Fine-Tuning Tips

  1. Keep the learning rate small (2e-5 to 5e-5): the pre-trained weights are already good, don't destroy them.
  2. Warmup matters.
  3. 2-4 epochs are usually enough; more epochs easily overfit on small data.
  4. Freeze the bottom layers when data is very scarce (the universal features in lower layers don't need to change).
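Tip (4) might look like this; a stand-in 12-block encoder keeps the sketch self-contained (with Hugging Face BERT the blocks live at model.bert.encoder.layer, and the same requires_grad pattern applies):

```python
import torch.nn as nn

# Stand-in encoder: 12 identical blocks, mirroring BERT-base's 12 layers
layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
head = nn.Linear(768, 2)  # task-specific classification head

# Freeze the bottom 8 layers; only the top 4 plus the head stay trainable
for layer in layers[:8]:
    for p in layer.parameters():
        p.requires_grad = False

n_trainable = sum(
    p.numel()
    for p in list(layers.parameters()) + list(head.parameters())
    if p.requires_grad
)
```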

Sentence Embeddings

Word embeddings are not sentence embeddings: token-level representations must be aggregated into a sentence-level one.

Methods

| Method | How | Quality |
|---|---|---|
| [CLS] token | Use BERT's [CLS] output | Not good enough ([CLS] is optimized for NSP, not similarity) |
| Mean pooling | Average all token embeddings | Better than [CLS], simple |
| Sentence-BERT | Fine-tune with siamese network on NLI data | Best for similarity tasks |
| Instructor / E5 | Task-specific instruction + embedding | State-of-the-art |

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["This is a sentence.", "This is another one."])
# cosine similarity → semantic similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])

RAG (Retrieval-Augmented Generation)

RAG combines retrieval and generation, letting an LLM access external knowledge without retraining:

How It Works

  1. Index: 把 documents 用 embedding model encode 成 vectors → store in vector database
  2. Retrieve: User query → encode → find top-K most similar documents via ANN search
  3. Generate: Concatenate retrieved documents + user query → feed to LLM → generate answer

Why RAG Matters

| Problem | How RAG Solves It |
|---|---|
| Hallucination | Answers are grounded in retrieved evidence |
| Knowledge cutoff | Can retrieve from up-to-date documents |
| Domain expertise | Index domain-specific documents without fine-tuning |
| Attribution | Can cite sources (which documents were used) |

from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# 1. Index documents
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant docs for a query
docs = vectorstore.similarity_search("What is attention mechanism?", k=3)

# 3. Generate answer using retrieved context + LLM
prompt = f"Based on these documents:\n{docs}\n\nAnswer: What is attention?"
answer = llm(prompt)  # llm: any LLM client callable, defined elsewhere

RAG in Interviews

RAG is a high-frequency interview topic in 2024 and beyond. The key chain: embedding quality determines retrieval quality, and retrieval quality determines generation quality ("garbage in, garbage out"). The chunking strategy (how documents are split) and the choice of embedding model often matter more than the LLM itself.

Real-World Use Cases

Case 1: Credit Card Fraud — Text Features from Merchant Descriptions

Merchant names and descriptions carry fraud signal; use embeddings to add text features to the fraud model:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
merchant_embeddings = model.encode(df["merchant_description"].tolist())
# Concatenate with numerical features → feed to GBM or MLP
X = np.hstack([numerical_features, merchant_embeddings])

Interview follow-up: "Why not TF-IDF?" TF-IDF does not capture semantic similarity (the TF-IDF vectors of "online store" and "e-commerce shop" are completely different even though the meanings are close). Pre-trained embeddings encode semantic meaning directly.

Case 2: Recommender Systems — Semantic Search for Content

Use embeddings for content-based recommendation: embed the user query or liked items, then find similar items:

| Approach | Method |
|---|---|
| Product search | Query embedding vs product description embeddings → cosine similarity |
| Similar items | Item embedding similarity (content-based) |
| Cold start | New item description → embedding → find similar existing items |
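All three approaches reduce to nearest-neighbor search in embedding space. A minimal cosine-similarity retrieval sketch (synthetic 2-D vectors stand in for real embeddings, which would come from the same encoder on both sides):

```python
import numpy as np

def top_k_similar(query_vec, item_vecs, k=3):
    """Return indices of the k items most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q  # cosine similarity of each item to the query
    return np.argsort(scores)[::-1][:k]

item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k_similar(query, item_vecs, k=2))  # → [0 2]
```

At production scale this exact search is replaced by ANN indexes (e.g. FAISS, as in the RAG example above).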

Case 3: Customer Segmentation — Clustering on Support Tickets

Embed customer support tickets, cluster them, and automatically categorize issue types:

  1. Embed all tickets with Sentence-BERT
  2. Reduce dimensions with UMAP
  3. Cluster with K-Means or HDBSCAN
  4. Label each cluster by examining representative tickets
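A minimal sketch of the pipeline above (synthetic vectors stand in for the Sentence-BERT embeddings; K-Means is applied directly, since UMAP and HDBSCAN live in separate packages, umap-learn and hdbscan):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for ticket embeddings: two well-separated groups of 16-dim vectors
# (in practice: ticket_embeddings = SentenceTransformer(...).encode(tickets))
rng = np.random.default_rng(0)
ticket_embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 16)),
    rng.normal(1.0, 0.1, size=(50, 16)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ticket_embeddings)
labels = kmeans.labels_
# Step 4: inspect a few representative tickets per cluster to hand-label issue types
```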

Hands-on: NLP in Python

Word2Vec

from gensim.models import Word2Vec

# sentences: list of lists of words
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)
model.wv.most_similar("king")
# → [("queen", 0.85), ("prince", 0.78), ...]
model.wv.most_similar(positive=["king", "woman"], negative=["man"])
# → [("queen", 0.89)]

BERT Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

emb1 = get_embedding("The cat sat on the mat.")
emb2 = get_embedding("A kitten was resting on the rug.")
# High cosine similarity — semantically similar

Sentence-BERT for Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["I love machine learning", "ML is my passion", "The weather is nice"]
embeddings = model.encode(sentences)

sim_matrix = cosine_similarity(embeddings)
# sentences[0] vs [1]: high similarity (~0.85)
# sentences[0] vs [2]: low similarity (~0.15)

Interview Signals

What interviewers listen for:

  • You can explain the one-hot → Word2Vec → BERT evolution and the problem each step solved
  • You know Word2Vec's polysemy limitation (one word = one vector)
  • You can compare BERT vs GPT architectures and use cases
  • You understand why sub-word tokenization beats word-level
  • You know how RAG works and why embedding quality matters

Practice

Flashcards


How do Word2Vec's skip-gram and CBOW differ?

Skip-gram: input = center word, predict context words; better for rare words. CBOW: input = context words, predict center word; better for frequent words and faster to train. Skip-gram usually produces better embeddings (especially on small datasets).


Quiz


How does the Word2Vec analogy king - man + woman ≈ queen work?
