NLP & Embeddings

Interview Context

The core of an NLP interview is not reciting model architectures; it is understanding the evolution of representations (one-hot → static embeddings → contextual embeddings → LLMs) and what problem each stage solved. Interviewers want to know that you understand why embeddings are needed and how to use them in real-world scenarios.

What You Should Understand

  • Understand the evolution from one-hot to contextual embeddings and the problem each step solved
  • Know the principles and limitations of Word2Vec / GloVe
  • Be able to compare the architectural differences and use cases of BERT and GPT
  • Understand why tokenization methods (BPE, WordPiece) matter
  • Know the tradeoffs between fine-tuning, feature extraction, and prompting
  • Understand the basics of RAG and embedding-based retrieval

The Evolution of Text Representation

| Era | Method | Dimension | Semantic Info | Context-Aware |
|---|---|---|---|---|
| 1990s | One-hot | Vocab size (10K-100K) | None | No |
| 2000s | Bag-of-Words / TF-IDF | Vocab size (sparse) | Statistical | No |
| 2013 | Word2Vec / GloVe | 100-300 (dense) | Yes (static) | No |
| 2018 | ELMo | 1024 | Yes | Yes (BiLSTM) |
| 2018 | BERT | 768 (base) | Yes | Yes (Transformer) |
| 2020+ | GPT-3/4, LLaMA | 4096-12288 | Yes | Yes (Transformer, massive scale) |

Each step resolved the core limitation of the previous one.

Text Preprocessing

Tokenization: Why It Matters

A model cannot consume raw text directly; the text must first be split into tokens (sub-word units). The choice of tokenization directly affects model performance.

Tokenization Methods

| Method | How It Works | Used By |
|---|---|---|
| Word-level | Split by spaces | Legacy (vocabulary too large) |
| Character-level | Each character = 1 token | Very long sequences, limited context |
| BPE (Byte-Pair Encoding) | Iteratively merge frequent character pairs | GPT-2, GPT-3, LLaMA |
| WordPiece | Like BPE but uses likelihood to decide merges | BERT |
| SentencePiece | Language-agnostic, treats input as byte stream | T5, LLaMA, multilingual models |

BPE: How It Works

  1. Start with individual characters as initial vocabulary
  2. Count all adjacent character pairs in corpus
  3. Merge the most frequent pair into one token
  4. Repeat until desired vocabulary size reached

Example: "low" "lower" "lowest" → after merging l+o, lo+w → tokens: low, er, est
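The merge loop above can be sketched in a few lines (a toy implementation for illustration, not any production tokenizer):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Toy BPE: word_freqs maps word -> frequency; returns learned merges and vocab."""
    # Each word starts as a tuple of single characters
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
# merges: [('l', 'o'), ('lo', 'w')] — reproducing the l+o, lo+w example above
```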

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unbelievable")
# → ['un', '##be', '##lie', '##va', '##ble']
# WordPiece: ## prefix means "continuation of previous token"

tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
tokens_gpt = tokenizer_gpt.tokenize("unbelievable")
# → ['un', 'believ', 'able']
# BPE: different splitting strategy

Common Interview Question

"Why not use word-level tokenization?" The vocabulary becomes too large (English has 100K+ unique words) and OOV (out-of-vocabulary) words cannot be handled. Sub-word tokenization covers all possible words with a limited vocabulary (30K-50K): rare words are split into known sub-words (e.g. "unbelievable" → "un" + "believe" + "able").

Static Word Embeddings

One-Hot: The Starting Point

Each word = a sparse vector with a single 1:

\text{"cat"} = [0, 0, 1, 0, \ldots, 0] \quad \text{(10K-dim, only one 1)}

Problems: no semantic information (the distance between "cat" and "dog" equals the distance between "cat" and "car", both \sqrt{2}); dimensionality = vocabulary size; extremely sparse.
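The equal-distance problem is easy to see directly (a minimal sketch with a toy 4-word vocabulary):

```python
import numpy as np

# Toy 4-word vocabulary; each word is a one-hot basis vector
vocab = ["cat", "dog", "car", "the"]
one_hot = np.eye(len(vocab))
cat, dog, car = one_hot[0], one_hot[1], one_hot[2]

# Every pair of distinct words is exactly sqrt(2) apart: no semantics encoded
d_cat_dog = np.linalg.norm(cat - dog)
d_cat_car = np.linalg.norm(cat - car)
```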

TF-IDF

Term Frequency × Inverse Document Frequency:

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}

TF-IDF gives high weight to words that are frequent in this document but rare across the corpus. Better than raw counts, but still high-dimensional, sparse, and without semantic similarity.

Word2Vec (Mikolov et al., 2013)

Train a shallow neural network to predict words from context, and use the learned weights as embeddings.

Two architectures:

| | CBOW | Skip-gram |
|---|---|---|
| Input | Context words | Center word |
| Output | Predict center word | Predict context words |
| Better for | Frequent words | Rare words |
| Training speed | Faster | Slower |

Skip-gram objective:

\max \sum_{(w, c)} \log P(c \mid w) = \max \sum_{(w, c)} \log \frac{\exp(\mathbf{v}_c^\top \mathbf{v}_w)}{\sum_{c'} \exp(\mathbf{v}_{c'}^\top \mathbf{v}_w)}

The denominator sums over the entire vocabulary, which is far too expensive, so negative sampling is used instead (sample a few random negatives rather than computing the full softmax).
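Concretely, skip-gram with negative sampling (SGNS) maximizes, for each positive pair (w, c) with k sampled negatives (\sigma is the sigmoid):

```latex
\log \sigma(\mathbf{v}_c^\top \mathbf{v}_w)
+ \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{c_i}^\top \mathbf{v}_w) \right]
```

This turns one softmax over the whole vocabulary into k + 1 cheap binary classifications.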

The magic: Learned embeddings capture semantic relationships:

\text{king} - \text{man} + \text{woman} \approx \text{queen}

from gensim.models import Word2Vec

# Train Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)  # sg=1: skip-gram
# Get embedding
vec = model.wv["king"]  # 100-dim dense vector
# Find similar words
model.wv.most_similar("king")  # → [("queen", 0.85), ("prince", 0.78), ...]
# Analogy: king - man + woman = ?
model.wv.most_similar(positive=["king", "woman"], negative=["man"])  # → queen

GloVe (Pennington et al., 2014)

Global Vectors for Word Representation — combines count-based and prediction-based methods:

J = \sum_{i,j} f(X_{ij})(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2

where X_{ij} is the co-occurrence count of words i and j in a context window.

Intuition: co-occurrence statistics already contain semantic information; GloVe compresses those statistics into dense vectors.

FastText (Bojanowski et al., 2017)

Extends Word2Vec by representing each word as a bag of character n-grams:

\text{"where"} \to \{\text{"<wh"}, \text{"whe"}, \text{"her"}, \text{"ere"}, \text{"re>"}\}

Word embedding = sum of its n-gram embeddings.

Key advantage: FastText can generate embeddings for unseen words (by summing their n-gram vectors). Word2Vec and GloVe cannot handle OOV words at all.
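The n-gram decomposition itself is easy to sketch (a minimal stand-in for FastText's subword extraction; the function and parameter names are illustrative):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Decompose a word into boundary-marked character n-grams, FastText-style."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("where", 3, 3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word's embedding is then the sum of the vectors of its n-grams, which is why OOV words still get a representation.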

Static Embedding Limitations

| Limitation | Example |
|---|---|
| Polysemy | "bank" (financial institution vs riverbank) → one vector for two completely different meanings |
| Context-free | "I love this bank" vs "sitting by the bank" → identical embedding |
| Fixed vocabulary | OOV words have no embedding (FastText partially solves this) |

These limitations motivated contextual embeddings: the same word receives a different embedding in each context.

Contextual Embeddings

The Key Insight

Static embeddings: embed(word). The same word always gets the same vector.

Contextual embeddings: embed(word, context). The same word gets a different vector in each sentence.

ELMo (Peters et al., 2018)

Embeddings from Language Models — use pre-trained BiLSTM language model:

  1. Train forward + backward LSTM on large corpus
  2. For each word, concatenate hidden states from all LSTM layers
  3. Task-specific weighted combination of layers

ELMo proved that contextual embeddings far outperform static ones. But a BiLSTM is sequential, hence slow, so the Transformer replaced it.

BERT (Devlin et al., 2018)

Bidirectional Encoder Representations from Transformers:

| Aspect | Detail |
|---|---|
| Architecture | Transformer encoder only |
| Direction | Bidirectional, sees full left and right context |
| Pre-training | MLM (Masked Language Model) + NSP |
| Input | WordPiece tokens + [CLS] + [SEP] |
| Output | Contextual embedding per token |

Masked Language Model (MLM):

  • Randomly mask 15% of tokens
  • Model predicts the masked tokens from surrounding context
  • Forces model to learn deep bidirectional representations

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank by the river was beautiful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state: [1, seq_len, 768]
# "bank" embedding is DIFFERENT in "river bank" vs "investment bank"
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token → sentence embedding

GPT (Radford et al., 2018+)

Generative Pre-trained Transformer:

| Aspect | Detail |
|---|---|
| Architecture | Transformer decoder only |
| Direction | Left-to-right (autoregressive) |
| Pre-training | Next token prediction (CLM) |
| Output | Generates one token at a time |
| Scaling | GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 |

BERT vs GPT

| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Context | Bidirectional | Left-to-right only |
| Pre-training | MLM (predict masked tokens) | CLM (predict next token) |
| Strength | Understanding (classification, NER, similarity) | Generation (text, code, conversation) |
| Weakness | Cannot generate text | Weaker at bidirectional understanding |
| Fine-tuning | Add task head + fine-tune | Prompt engineering or fine-tune |
| Embedding use | [CLS] or mean pooling | Last token or mean |

Classic Interview Confusion

"Can BERT do text generation?" No. BERT is encoder-only and bidirectional, with no autoregressive mechanism. Generation requires GPT (decoder-only, autoregressive). Conversely, GPT is weaker than BERT at classification (no bidirectional context).

Fine-Tuning Strategies

Three Approaches

| Strategy | What | When | Data Needed |
|---|---|---|---|
| Feature extraction | Freeze model, extract embeddings → train separate classifier | Very small data (< 1K) | Very little |
| Fine-tuning | Unfreeze model, train end-to-end on task data | Medium data (1K-100K) | Some |
| Prompt engineering | Design input prompts, no weight update | Zero/few-shot | None |
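The feature-extraction row can be sketched as follows; synthetic vectors stand in for the frozen encoder's output (in practice they would come from something like SentenceTransformer(...).encode(texts) or BERT mean pooling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for frozen-encoder embeddings: 200 texts, 384-dim vectors
# (384 matches e.g. all-MiniLM-L6-v2 output)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))
labels = (embeddings[:, 0] > 0).astype(int)  # toy target for illustration

# Only this lightweight classifier head is trained; the encoder never updates
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
```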

Fine-Tuning BERT

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# BERT + new classification head → fine-tune all weights

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,           # small lr for fine-tuning
    weight_decay=0.01,
    warmup_steps=500,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

Fine-Tuning Tips

  1. Keep the learning rate small (2e-5 to 5e-5): the pre-trained weights are already good, don't destroy them.
  2. Warmup matters.
  3. 2-4 epochs are usually enough; more epochs easily overfit on small data.
  4. Freeze the bottom layers when data is very scarce (the universal features in lower layers don't need to change).
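Tip (4) might look like this; a stand-in 12-block encoder keeps the sketch self-contained (with Hugging Face BERT the blocks live at model.bert.encoder.layer, and the same requires_grad pattern applies):

```python
import torch.nn as nn

# Stand-in encoder: 12 identical blocks, mirroring BERT-base's 12 layers
layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
head = nn.Linear(768, 2)  # task-specific classification head

# Freeze the bottom 8 layers; only the top 4 plus the head stay trainable
for layer in layers[:8]:
    for p in layer.parameters():
        p.requires_grad = False

n_trainable = sum(
    p.numel()
    for p in list(layers.parameters()) + list(head.parameters())
    if p.requires_grad
)
```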

Sentence Embeddings

Word embeddings are not sentence embeddings: token-level representations must be aggregated into a sentence-level one.

Methods

| Method | How | Quality |
|---|---|---|
| [CLS] token | Use BERT's [CLS] output | Not good enough ([CLS] is optimized for NSP, not similarity) |
| Mean pooling | Average all token embeddings | Better than [CLS], simple |
| Sentence-BERT | Fine-tune with siamese network on NLI data | Best for similarity tasks |
| Instructor / E5 | Task-specific instruction + embedding | State-of-the-art |

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["This is a sentence.", "This is another one."])
# cosine similarity → semantic similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])

RAG (Retrieval-Augmented Generation)

RAG combines retrieval and generation, letting an LLM access external knowledge without retraining:

How It Works

  1. Index: 把 documents 用 embedding model encode 成 vectors → store in vector database
  2. Retrieve: User query → encode → find top-K most similar documents via ANN search
  3. Generate: Concatenate retrieved documents + user query → feed to LLM → generate answer

Why RAG Matters

| Problem | How RAG Solves It |
|---|---|
| Hallucination | Answers are grounded in retrieved evidence |
| Knowledge cutoff | Can retrieve from up-to-date documents |
| Domain expertise | Index domain-specific documents without fine-tuning |
| Attribution | Can cite sources (which documents were used) |

from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# 1. Index documents
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant docs for a query
docs = vectorstore.similarity_search("What is attention mechanism?", k=3)

# 3. Generate answer using retrieved context + LLM
prompt = f"Based on these documents:\n{docs}\n\nAnswer: What is attention?"
answer = llm(prompt)  # llm: any LLM client callable, defined elsewhere

RAG in Interviews

RAG is a high-frequency interview topic in 2024 and beyond. The key chain: embedding quality determines retrieval quality, and retrieval quality determines generation quality ("garbage in, garbage out"). The chunking strategy (how documents are split) and the choice of embedding model often matter more than the LLM itself.

Real-World Use Cases

Case 1: Credit Card Fraud — Text Features from Merchant Descriptions

Merchant names and descriptions carry fraud signal; use embeddings to add text features to the fraud model:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
merchant_embeddings = model.encode(df["merchant_description"].tolist())
# Concatenate with numerical features → feed to GBM or MLP
X = np.hstack([numerical_features, merchant_embeddings])

Interview follow-up: "Why not TF-IDF?" TF-IDF does not capture semantic similarity (the TF-IDF vectors of "online store" and "e-commerce shop" are completely different even though the meanings are close). Pre-trained embeddings encode semantic meaning directly.

Case 2: Recommender Systems — Semantic Search for Content

Use embeddings for content-based recommendation: embed the user query or liked items, then find similar items:

| Approach | Method |
|---|---|
| Product search | Query embedding vs product description embeddings → cosine similarity |
| Similar items | Item embedding similarity (content-based) |
| Cold start | New item description → embedding → find similar existing items |
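All three approaches reduce to nearest-neighbor search in embedding space. A minimal cosine-similarity retrieval sketch (synthetic 2-D vectors stand in for real embeddings, which would come from the same encoder on both sides):

```python
import numpy as np

def top_k_similar(query_vec, item_vecs, k=3):
    """Return indices of the k items most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q  # cosine similarity of each item to the query
    return np.argsort(scores)[::-1][:k]

item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k_similar(query, item_vecs, k=2))  # → [0 2]
```

At production scale this exact search is replaced by ANN indexes (e.g. FAISS, as in the RAG example above).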

Case 3: Customer Segmentation — Clustering on Support Tickets

Embed customer support tickets, cluster them, and automatically categorize issue types:

  1. Embed all tickets with Sentence-BERT
  2. Reduce dimensions with UMAP
  3. Cluster with K-Means or HDBSCAN
  4. Label each cluster by examining representative tickets
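A minimal sketch of the pipeline above (synthetic vectors stand in for the Sentence-BERT embeddings; K-Means is applied directly, since UMAP and HDBSCAN live in separate packages, umap-learn and hdbscan):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for ticket embeddings: two well-separated groups of 16-dim vectors
# (in practice: ticket_embeddings = SentenceTransformer(...).encode(tickets))
rng = np.random.default_rng(0)
ticket_embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 16)),
    rng.normal(1.0, 0.1, size=(50, 16)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ticket_embeddings)
labels = kmeans.labels_
# Step 4: inspect a few representative tickets per cluster to hand-label issue types
```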

Hands-on: NLP in Python

Word2Vec

from gensim.models import Word2Vec

# sentences: list of lists of words
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)
model.wv.most_similar("king")
# → [("queen", 0.85), ("prince", 0.78), ...]
model.wv.most_similar(positive=["king", "woman"], negative=["man"])
# → [("queen", 0.89)]

BERT Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

emb1 = get_embedding("The cat sat on the mat.")
emb2 = get_embedding("A kitten was resting on the rug.")
# High cosine similarity — semantically similar

Sentence-BERT for Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["I love machine learning", "ML is my passion", "The weather is nice"]
embeddings = model.encode(sentences)

sim_matrix = cosine_similarity(embeddings)
# sentences[0] vs [1]: high similarity (~0.85)
# sentences[0] vs [2]: low similarity (~0.15)

Interview Signals

What interviewers listen for:

  • You can explain the one-hot → Word2Vec → BERT evolution and the problem each step solved
  • You know Word2Vec's polysemy limitation (one word = one vector)
  • You can compare BERT vs GPT architectures and use cases
  • You understand why sub-word tokenization beats word-level
  • You know how RAG works and why embedding quality matters

Practice

Flashcards


How do Word2Vec's skip-gram and CBOW differ?

Skip-gram: input = center word, predict context words; better for rare words. CBOW: input = context words, predict center word; better for frequent words and faster to train. Skip-gram usually produces better embeddings (especially on small datasets).


Quiz


How does the Word2Vec analogy king - man + woman ≈ queen work?
