Generative Models

Interview Context

Generative models usually come up at the conceptual level in interviews — you don't need to derive every step of the ELBO, but you should be able to explain the core ideas behind VAEs / GANs / diffusion models, their respective trade-offs, and their practical applications in data science (data augmentation, anomaly detection, synthetic data).

What You Should Understand

  • Distinguish the fundamental difference between generative and discriminative models
  • Understand the evolution from autoencoder → VAE and the properties of the VAE latent space
  • Know the GAN min-max game and its training challenges (mode collapse, instability)
  • Understand diffusion models at a high level (forward/reverse process)
  • Name practical applications of generative models in DS

Generative vs Discriminative

| Aspect | Discriminative | Generative |
| --- | --- | --- |
| Learns | $P(y \mid x)$ (decision boundary) | $P(x)$ or $P(x, y)$ (data distribution) |
| Goal | Classify / predict | Generate new data similar to training data |
| Examples | Logistic regression, SVM, neural nets | VAE, GAN, diffusion, GPT |
| Data efficiency | Usually needs less data | Usually needs more data |
| Can generate? | No | Yes — sample from the learned distribution |
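The distinction can be made concrete with a toy sketch (NumPy, hypothetical data): a generative classifier fits one Gaussian per class, which lets it both classify via Bayes' rule *and* sample new points — a purely discriminative boundary offers no analogous sampling operation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D dataset: two classes with different means (hypothetical data)
x0 = rng.normal(-2.0, 1.0, size=500)  # class 0
x1 = rng.normal(+2.0, 1.0, size=500)  # class 1

# Generative route: model P(x | y) with one Gaussian per class
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()

def gaussian_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def classify(x):
    # Bayes' rule with equal priors: pick the class with higher P(x | y)
    return int(gaussian_pdf(x, mu1, s1) > gaussian_pdf(x, mu0, s0))

# Because P(x | y) is modeled explicitly, we can also *sample* new data
new_class1_points = rng.normal(mu1, s1, size=10)
```

A discriminative model (e.g., logistic regression) would learn only the boundary near $x = 0$ and could not produce `new_class1_points`.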

Why Should a DS Understand Generative Models?

Not every DS needs to train GANs. But generative models are directly useful in these scenarios: (1) data augmentation for imbalanced data, (2) anomaly detection (learn normal → detect abnormal), (3) synthetic data generation for privacy, (4) understanding LLMs (GPT is a generative model).

Autoencoders

Architecture

An autoencoder learns a compressed representation of data by training to reconstruct its input:

$$\text{Input } \mathbf{x} \xrightarrow{\text{Encoder } f} \text{Latent code } \mathbf{z} \xrightarrow{\text{Decoder } g} \text{Reconstruction } \hat{\mathbf{x}}$$

$$\min_{\theta} \|\mathbf{x} - g(f(\mathbf{x}))\|^2$$

Bottleneck: $\text{dim}(\mathbf{z}) \ll \text{dim}(\mathbf{x})$ forces the network to learn a compressed representation — only the most important information survives.

Relation to PCA

A linear autoencoder trained with MSE loss recovers the same subspace as PCA (though not necessarily the orthonormal principal-component basis). A non-linear autoencoder (e.g., with ReLU activations) learns a non-linear manifold — more powerful but harder to interpret.
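A quick way to see the PCA connection — a sketch with synthetic data: the top-$k$ SVD reconstruction is exactly the best a $k$-dimensional linear bottleneck can do.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features
Xc = X - X.mean(axis=0)  # center, as PCA does

# PCA via SVD: project onto the top-k right singular vectors, then map back
k = 3
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T   # "encoder": linear projection to k dimensions
X_hat = Z @ Vt[:k]  # "decoder": linear map back to input space

mse = ((Xc - X_hat) ** 2).mean()
# No linear autoencoder with a k-dim bottleneck can achieve lower MSE:
# the optimal linear bottleneck spans the top-k principal subspace
```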

Use Cases

| Application | How |
| --- | --- |
| Dimensionality reduction | Use the encoder output $\mathbf{z}$ as features |
| Anomaly detection | Train on normal data → high reconstruction error = anomaly |
| Denoising | Train on noisy inputs to reconstruct clean outputs |
| Pre-training | Use the encoder as a feature extractor for downstream tasks |

Autoencoder Limitations

The latent space an autoencoder learns is unstructured — there is no smooth interpolation between regions, and you cannot sample meaningful new data from it. This is exactly the problem VAEs solve.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
```

Variational Autoencoder (VAE)

The Key Idea

VAE adds probabilistic structure to the latent space — encoder outputs a distribution (mean + variance), not a single point:

$$\text{Encoder: } q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \qquad \text{Decoder: } p_\theta(\mathbf{x} \mid \mathbf{z})$$

Training does not only minimize reconstruction error — it also pushes the latent distribution toward a prior $p(\mathbf{z}) = \mathcal{N}(0, I)$.

The Loss Function (ELBO)

$$\mathcal{L} = \underbrace{-E_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction loss}} + \underbrace{D_{KL}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{KL divergence}}$$

| Term | What It Does |
| --- | --- |
| Reconstruction loss | Ensures the decoder can rebuild $\mathbf{x}$ from $\mathbf{z}$ (same as an autoencoder) |
| KL divergence | Forces the encoder's latent distribution toward a standard normal → a smooth, continuous latent space |

The KL divergence is the VAE's magic — it gives the latent space structure, which makes sampling and interpolation possible.
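For a Gaussian encoder and a standard normal prior, this KL term has a closed form, $-\tfrac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$ — exactly the term VAE implementations sum up. A small cross-check against `torch.distributions`:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -1.0])
log_var = torch.tensor([0.2, -0.3])
std = torch.exp(0.5 * log_var)

# Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions --
# the exact KL term used in VAE losses
kl_closed = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Cross-check against torch.distributions
kl_ref = kl_divergence(Normal(mu, std), Normal(torch.zeros(2), torch.ones(2))).sum()
```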

Reparameterization Trick

Problem: sampling from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is a stochastic operation → you cannot backpropagate through it.

Solution: Reparameterize as deterministic function + external noise:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$

Now $\mathbf{z}$ is a differentiable function of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ → backpropagation works normally. The randomness is externalized into $\boldsymbol{\epsilon}$.
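A minimal check that gradients really do flow through a reparameterized sample:

```python
import torch

mu = torch.tensor([0.0, 1.0], requires_grad=True)
log_var = torch.zeros(2, requires_grad=True)

# z is a deterministic, differentiable function of (mu, log_var);
# all randomness lives in eps
eps = torch.randn(2)
z = mu + torch.exp(0.5 * log_var) * eps

z.sum().backward()
```

Since $\partial z_i / \partial \mu_i = 1$, `mu.grad` comes out as all ones; sampling `z` directly from a distribution would have broken this gradient path.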

Interview Key Insight

"Why can a VAE generate but an autoencoder cannot?" — An autoencoder's latent space has no structure → a random point in latent space does not necessarily decode to meaningful output. The VAE's KL loss forces the latent space toward a smooth Gaussian → any point sampled from N(0, I) decodes to reasonable output.

VAE vs Autoencoder

| Aspect | Autoencoder | VAE |
| --- | --- | --- |
| Latent space | Unstructured points | Structured Gaussian distribution |
| Can generate? | No (random z → garbage) | Yes (sample z ~ N(0, I) → meaningful output) |
| Loss | Reconstruction only | Reconstruction + KL divergence |
| Interpolation | May produce artifacts | Smooth interpolation in latent space |
| Encoder output | Single vector z | Mean μ + variance σ² |

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # μ and log(σ²)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps  # z = μ + σ * ε

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var
```

Generative Adversarial Networks (GANs)

The Min-Max Game

Two networks compete against each other:

  • Generator GG: Takes random noise zp(z)\mathbf{z} \sim p(\mathbf{z}) → produces fake data G(z)G(\mathbf{z})
  • Discriminator DD: Distinguishes real data from fake — outputs probability of being real
$$\min_G \max_D \; E_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + E_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]$$

| Player | Goal | Training Signal |
| --- | --- | --- |
| Generator | Fool the discriminator (produce realistic fakes) | Maximize $D(G(z))$ (discriminator thinks fakes are real) |
| Discriminator | Correctly identify real vs fake | Maximize accuracy on both real and fake |

Intuition: the generator is a counterfeiter and the discriminator is the police. They improve each other — eventually the generator's fakes are indistinguishable from real data.

Training Dynamics

```python
# GAN training loop (simplified; generator, discriminator, optimizers,
# bce_loss = nn.BCELoss(), batch_size, latent_dim are assumed to exist)
for real_data in dataloader:
    # 1. Train Discriminator
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    z = torch.randn(batch_size, latent_dim)
    fake = generator(z).detach()  # don't update G in this step

    d_optimizer.zero_grad()
    d_loss = bce_loss(discriminator(real_data), real_labels) + \
             bce_loss(discriminator(fake), fake_labels)
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator (non-saturating loss)
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z)
    g_optimizer.zero_grad()
    g_loss = bce_loss(discriminator(fake), real_labels)  # want D to call fakes real
    g_loss.backward()
    g_optimizer.step()
```

GAN Training Challenges

| Challenge | Description | Solutions |
| --- | --- | --- |
| Mode collapse | Generator produces only a few kinds of output (ignores diversity) | Mini-batch discrimination, unrolled GAN, Wasserstein loss |
| Training instability | Imbalance between G and D training → oscillation, divergence | Spectral normalization, progressive growing, careful lr tuning |
| Vanishing gradient for G | If D is too strong → $\log(1 - D(G(z))) \approx 0$ → G's gradient vanishes | Use $-\log D(G(z))$ instead (non-saturating loss) |
| No convergence guarantee | Min-max optimization lacks the convergence theory of plain minimization | In practice: tricks + careful monitoring |
| Evaluation | No single metric (no likelihood as in a VAE) | FID score, IS score |

Mode Collapse

The GAN's most infamous problem: the generator discovers that one kind of output fools the discriminator → it produces only that output → diversity is lost entirely. For example, when generating faces, every generated face looks the same. Wasserstein GAN (WGAN) replaces the JS divergence with the earth mover's distance, which greatly mitigates this problem.

GAN Variants

| Variant | Innovation | Impact |
| --- | --- | --- |
| DCGAN (2015) | Conv layers + architectural guidelines | Made GANs stable for images |
| WGAN (2017) | Wasserstein distance; gradient penalty (WGAN-GP) | Much better training stability, reduced mode collapse |
| Progressive GAN (2018) | Grow resolution progressively (4→8→...→1024) | First high-res face generation |
| StyleGAN (2019–2021) | Style-based generator, adaptive normalization | Photorealistic faces, controllable generation |
| Conditional GAN | G and D conditioned on class labels | Generate specific classes (e.g., the digit 7) |

Diffusion Models

High-Level Idea

Diffusion models learn to reverse a gradual noise-adding process:

Forward process (fixed, not learned): Gradually add Gaussian noise to data over TT steps until it becomes pure noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$
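A step worth knowing (standard DDPM algebra, not stated above): iterating this one-step transition gives a closed-form marginal. With $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big) \quad\Longleftrightarrow\quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},$$

so any $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ without simulating the intermediate steps — this is what makes training on a random $t$ cheap.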

Reverse process (learned): A neural network learns to denoise step by step:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

Training: Network learns to predict the noise ϵ\boldsymbol{\epsilon} that was added at each step:

$$\mathcal{L} = E_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2]$$

Generation: start from pure noise $\mathbf{x}_T \sim \mathcal{N}(0, I)$ → iteratively denoise → obtain $\mathbf{x}_0$.
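The forward process can be simulated in one shot via its closed-form marginal — a small sketch with a DDPM-style linear β schedule and toy tensors (no real data):

```python
import torch

torch.manual_seed(0)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # DDPM-style linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod of (1 - beta_s)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps."""
    a = alpha_bar[t]
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps

x0 = torch.randn(4, 8)        # toy "data"
eps = torch.randn_like(x0)

x_early = q_sample(x0, 10, eps)     # early step: still mostly signal
x_late = q_sample(x0, T - 1, eps)   # late step: almost pure noise
```

At $t = T-1$, `alpha_bar` is tiny, so `x_late` is essentially the injected noise — exactly the regime generation starts from.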

Why Diffusion Models Won

| Aspect | VAE | GAN | Diffusion |
| --- | --- | --- | --- |
| Training stability | Stable | Unstable | Very stable |
| Mode coverage | Good | Mode collapse risk | Excellent |
| Sample quality | Blurry | Sharp | Sharp + diverse |
| Likelihood | Has ELBO | No likelihood | Has likelihood bound |
| Speed | Fast | Fast | Slow (many denoising steps) |
| Controllability | Limited | Conditional GAN | Excellent (classifier-free guidance) |

Diffusion vs GAN: Interview Focus

"Why did diffusion models displace GANs?" — Three reasons: (1) more stable training (no adversarial game, just MSE loss on noise prediction), (2) no mode collapse (covers the full data distribution), (3) better controllability (text-to-image via classifier-free guidance). Downside: slow generation (requires iterating over T steps).

Key Applications

| Application | Model | How |
| --- | --- | --- |
| Text-to-image | Stable Diffusion, DALL-E 2/3, Midjourney | Text prompt → CLIP text encoder → conditioning for diffusion |
| Image editing | InstructPix2Pix, SDEdit | Edit existing images via text instructions |
| Video generation | Sora | Extend diffusion to the temporal dimension |
| Audio | AudioLDM | Diffusion in spectrogram space |
| Protein structure | AlphaFold 3 | Diffusion for 3D structure prediction |

Model Comparison

| Model | How It Generates | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Autoencoder | Decode from learned z | Simple, fast | Cannot generate (unstructured latent space) | Compression, anomaly detection |
| VAE | Sample z ~ N(0, I), decode | Stable training, has likelihood | Blurry outputs | Structured latent space, interpolation |
| GAN | G(noise) → fake data | Sharp outputs | Unstable training, mode collapse | Image generation (legacy) |
| Diffusion | Iterative denoising | Best quality + diversity | Slow generation | SOTA image/video/audio generation |
| Autoregressive (GPT) | Predict next token | Excellent for text | Sequential → slow | Text generation, LLMs |

Real-World Use Cases

Case 1: Credit Card Fraud — Anomaly Detection with an Autoencoder

Train autoencoder on normal transactions only → high reconstruction error = potential fraud:

```python
import torch

# Sketch: `autoencoder` is the Autoencoder module above, already trained
# on normal transactions only (training loop omitted)
with torch.no_grad():
    # At inference: compute per-transaction reconstruction error
    reconstructed = autoencoder(X_test)
    errors = ((X_test - reconstructed) ** 2).mean(dim=1)

    # Threshold = 99th percentile of errors on held-out *normal* data
    # (errors_val computed the same way on a validation set)
    threshold = torch.quantile(errors_val, 0.99)

# High error → anomaly → potential fraud
is_fraud = errors > threshold
```

This follows the same logic as PCA reconstruction error — normal data lies on a low-dimensional manifold → it can be well reconstructed. Fraud data deviates from the manifold → high reconstruction error.

Case 2: Imbalanced Data — Synthetic Minority Oversampling

A GAN or VAE can generate synthetic samples for the minority class — better than SMOTE because it captures the non-linear manifold:

```python
# Train a VAE only on fraud transactions (training loop omitted;
# the VAE class above has no .fit method)
vae = VAE(input_dim=features.shape[1], latent_dim=16)
# ... optimize vae_loss on X_fraud ...

# Generate synthetic fraud samples by decoding draws from the prior
z = torch.randn(1000, 16)  # sample z ~ N(0, I)
synthetic_fraud = vae.decoder(z)
# Augment training data with synthetic fraud → retrain the classifier
```

Pitfalls of Synthetic Data

The quality of generated data is proportional to the quality of the training data. If the minority class has only 50 records, the distribution a VAE/GAN learns is not good enough → synthetic data may be noisy or unrealistic. You typically need at least a few hundred to a few thousand minority samples before generative augmentation is worth it.

Case 3: Privacy-Preserving Synthetic Data

Use a generative model to produce a synthetic dataset in place of the real data — protecting privacy:

| Step | Action |
| --- | --- |
| 1 | Train a VAE / GAN / diffusion model on the real data |
| 2 | Generate a synthetic dataset of the same size |
| 3 | Verify: the synthetic data's statistical properties match the real data |
| 4 | Verify: the synthetic data cannot be reverse-engineered into individual records |
| 5 | Share the synthetic data instead of the real data |

Privacy-sensitive fields such as healthcare and finance increasingly use synthetic data for research and model development.
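Step 3's statistical check can be sketched as follows (NumPy, with stand-in arrays in place of the real and generated tables): compare marginal means/stds and the correlation structure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins: `real` would be the real table, `synthetic` the generated one
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5000, 2))
synthetic = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5000, 2))

# Compare marginal means/stds and pairwise correlations
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()
corr_gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synthetic.T)).max()

looks_similar = mean_gap < 0.2 and std_gap < 0.2 and corr_gap < 0.2
```

In practice you would add distribution-level tests (e.g., per-column two-sample tests) and a membership-inference check for step 4; the thresholds here are illustrative.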

Hands-on: Generative Models in PyTorch

VAE

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # μ and log(σ²)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

def vae_loss(recon_x, x, mu, log_var):
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
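One training step for a VAE of this kind can be sketched end-to-end — `TinyVAE` below is a hypothetical, scaled-down stand-in for the class above, kept small so the step is self-contained; the loss is the same reconstruction + KL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# TinyVAE: a hypothetical, scaled-down stand-in for the VAE above
class TinyVAE(nn.Module):
    def __init__(self, d=20, h=16, z_dim=4):
        super().__init__()
        self.enc = nn.Linear(d, h)
        self.fc_mu = nn.Linear(h, z_dim)
        self.fc_var = nn.Linear(h, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, d), nn.Sigmoid()
        )

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, log_var = self.fc_mu(h), self.fc_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, log_var

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.rand(32, 20)  # stand-in batch with values in [0, 1]

# One training step: forward, loss = reconstruction + KL, backprop, update
recon, mu, log_var = vae(x)
recon_term = F.binary_cross_entropy(recon, x, reduction="sum")
kl_term = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
loss = recon_term + kl_term

opt.zero_grad()
loss.backward()
opt.step()
```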

Simple GAN

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, output_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```

Interview Signals

What interviewers listen for:

  • You can distinguish generative vs discriminative models
  • You know why the VAE's KL loss gives the latent space structure
  • You can explain the GAN min-max game and the mode collapse problem
  • You know why diffusion models displaced GANs
  • You can name practical DS applications of generative models (anomaly detection, data augmentation, synthetic data)

Practice

Flashcards


What is the core difference between generative and discriminative models?

Discriminative models learn P(y|x) (the decision boundary) → classify/predict. Generative models learn P(x) or P(x, y) (the data distribution) → can generate new data. Logistic regression, SVM, plain neural nets = discriminative. VAE, GAN, GPT = generative.


Quiz

Question 1/10

What is the most fundamental difference between a VAE and an autoencoder?
