SQL Interview Questions
Interview Context
SQL questions and metric-definition questions usually appear together. The interviewer wants to know whether you can turn raw event tables into trustworthy conclusions. The key is not just correct syntax, but whether you can clearly define "what am I computing, who is in the denominator, and what is the grain."
What You Should Understand
- Be comfortable with joins, window functions, aggregation grain, and deduplication, and how they differ
- Be able to define precisely how DAU, retention, conversion, and funnel conversion are computed
- Know what biases arise when event-level, user-level, and session-level metrics are mixed
- When numbers look wrong, first check data freshness, missing data, duplicates, and the denominator definition
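The level-mixing bias in the third point can be made concrete. A sketch, assuming a hypothetical events table with user_id and event_type columns:

```sql
-- Event-level CTR: total clicks / total views (inflated by heavy clickers)
SELECT
  SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
  / NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0) AS event_level_ctr
FROM events;

-- User-level rate: share of users who clicked at least once vs. viewed at least once
SELECT
  COUNT(DISTINCT CASE WHEN event_type = 'click' THEN user_id END) * 1.0
  / NULLIF(COUNT(DISTINCT CASE WHEN event_type = 'view' THEN user_id END), 0) AS user_level_rate
FROM events;
```

The two numbers diverge whenever a few users generate many events; state which one you are computing before you write it.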
JOIN Types
The Four Basic Joins
-- INNER JOIN: only matching rows from both tables
SELECT u.user_id, o.order_id
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id;
-- LEFT JOIN: all rows from left table, matching from right (NULL if no match)
SELECT u.user_id, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;
-- RIGHT JOIN: all rows from right table, matching from left
-- (rarely used — just swap table order and use LEFT JOIN)
-- FULL OUTER JOIN: all rows from both tables
SELECT u.user_id, o.order_id
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id;
Join Comparison
| Join Type | Left Table | Right Table | Use Case |
|---|---|---|---|
| INNER | Only matching | Only matching | Only rows with a match on both sides |
| LEFT | All rows | Matching (NULL if none) | Keep every left-table row (most common) |
| RIGHT | Matching (NULL if none) | All rows | Rarely used; swap table order and use LEFT JOIN |
| FULL OUTER | All rows | All rows | Rows present in both tables, or in only one |
| CROSS | All rows | All rows | Cartesian product (rarely used unless deliberate) |
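One deliberate use of CROSS JOIN is building a dense date spine so every user appears on every day, even days with zero activity. A sketch, assuming hypothetical calendar_dates(dt) and daily_activity(user_id, activity_date, event_count) tables:

```sql
SELECT d.dt, u.user_id, COALESCE(a.event_count, 0) AS event_count
FROM calendar_dates d
CROSS JOIN users u
LEFT JOIN daily_activity a
  ON a.user_id = u.user_id AND a.activity_date = d.dt;
```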
Join Pitfalls: Row Multiplication
Many-to-many joins are the most common SQL bug in interviews:
-- DANGEROUS: 1 user × 3 orders × 2 logins = 6 rows per user
SELECT u.user_id, o.order_id, l.login_time
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN logins l ON u.user_id = l.user_id;
Solution: Aggregate before joining:
-- CORRECT: aggregate first, then join → 1 row per user
SELECT u.user_id, o.order_count, l.login_count
FROM users u
LEFT JOIN (
SELECT user_id, COUNT(*) AS order_count
FROM orders GROUP BY user_id
) o ON u.user_id = o.user_id
LEFT JOIN (
SELECT user_id, COUNT(*) AS login_count
FROM logins GROUP BY user_id
) l ON u.user_id = l.user_id;
Common Interview Trap: JOIN Fanout
If you write several JOINs without thinking about grain, the interviewer will almost certainly ask "does this JOIN fan out?" Always work out the grain of each table, and the grain of the result after the JOIN.
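A quick fanout check you can narrate in an interview: compare row counts before and after the join (a sketch against the users/orders tables above):

```sql
-- If joined_rows > left_rows, the join multiplied rows
SELECT
  (SELECT COUNT(*) FROM users) AS left_rows,
  (SELECT COUNT(*) FROM users u JOIN orders o ON u.user_id = o.user_id) AS joined_rows;
```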
Window Functions
Window functions compute values across a set of rows without collapsing them (unlike GROUP BY).
Syntax
function_name() OVER (
[PARTITION BY column1, column2, ...]
[ORDER BY column3, column4, ...]
[ROWS/RANGE BETWEEN ... AND ...]
)
ROW_NUMBER, RANK, DENSE_RANK
-- Get each user's most recent order
SELECT * FROM (
SELECT user_id, order_id, order_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS rn
FROM orders
) ranked WHERE rn = 1;
| Function | Ties | Gaps | Scores: 100, 90, 90, 80 |
|---|---|---|---|
| ROW_NUMBER | Arbitrary | No | 1, 2, 3, 4 |
| RANK | Same rank | Yes | 1, 2, 2, 4 |
| DENSE_RANK | Same rank | No | 1, 2, 2, 3 |
Roughly 80% of window function interview questions need only ROW_NUMBER + PARTITION BY.
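The comparison table above can be reproduced with an inline dataset (PostgreSQL-style VALUES):

```sql
SELECT score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,   -- 1, 2, 3, 4
  RANK()       OVER (ORDER BY score DESC) AS rnk,       -- 1, 2, 2, 4
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk  -- 1, 2, 2, 3
FROM (VALUES (100), (90), (90), (80)) AS t(score)
ORDER BY score DESC;
```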
Running Totals and Moving Averages
-- Cumulative revenue
SELECT order_date, daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue
FROM daily_sales;
-- 7-day moving average
SELECT order_date, daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_sales;
LAG and LEAD
-- Day-over-day change
SELECT order_date, daily_revenue,
LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS prev_day,
daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY order_date) AS dod_change
FROM daily_sales;
-- LEAD: next row's value
-- LAG(x, 2): 2 rows back
-- LAG(x, 1, 0): default value 0 if no previous row
NTILE and Percentiles
-- Divide users into 4 quartiles by spending
SELECT user_id, total_spend,
NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM user_stats;
-- Percentile rank
SELECT user_id, total_spend,
PERCENT_RANK() OVER (ORDER BY total_spend) AS pct_rank
FROM user_stats;
Window Function Interview Essentials
Typical scenarios: latest row per user (ROW_NUMBER), top N per category (ROW_NUMBER + PARTITION BY), deduplication keeping the most recent record (ROW_NUMBER), day-over-day growth (LAG), cumulative sum (SUM OVER), moving average (AVG OVER ROWS BETWEEN).
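The top-N-per-category scenario as a sketch (assuming a hypothetical product_revenue(category, product_id, revenue) table):

```sql
-- Top 3 products by revenue within each category
SELECT category, product_id, revenue
FROM (
  SELECT category, product_id, revenue,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM product_revenue
) ranked
WHERE rn <= 3;
```

Use RANK instead of ROW_NUMBER if all tied products should be kept.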
Common Table Expressions (CTEs)
CTEs make complex queries readable by breaking them into logical steps:
WITH daily_metrics AS (
SELECT DATE(event_time) AS event_date,
COUNT(DISTINCT user_id) AS dau
FROM events WHERE event_type = 'page_view'
GROUP BY DATE(event_time)
),
weekly_avg AS (
SELECT event_date, dau,
AVG(dau) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7d_avg
FROM daily_metrics
)
SELECT * FROM weekly_avg ORDER BY event_date;
CTE vs Subquery
| Aspect | CTE | Subquery |
|---|---|---|
| Readability | Easier to read (step-by-step) | Nested → hard to read |
| Reuse | Can be referenced multiple times | Must be rewritten for each reference |
| Debug | Each CTE can be run on its own | Hard to debug |
| Performance | Some databases materialize CTEs | Usually inlined and optimized |
| Rule of thumb | More than 2 levels of nesting → use a CTE | Simple one-off → subquery is fine |
In interviews, prefer CTEs: they make your reasoning easier for the interviewer to follow.
Grain Awareness
What Is Grain?
The grain of a table or query = "what does one row represent?" Getting this wrong is the root cause of most SQL bugs.
| Table | Grain | One Row = |
|---|---|---|
| users | User | One unique user |
| orders | Order | One order (user can have many) |
| order_items | Order × Item | One item within one order |
| events | Event | One event (many per user per day) |
| daily_user_stats | User × Day | One user on one day |
Grain Mismatch Examples
-- WRONG: counting events, calling it "users"
SELECT COUNT(*) AS "user_count" -- Actually event count!
FROM events WHERE event_type = 'purchase';
-- CORRECT: count distinct users
SELECT COUNT(DISTINCT user_id) AS user_count
FROM events WHERE event_type = 'purchase';
-- WRONG: revenue inflated by JOIN fanout
SELECT SUM(o.amount) AS total_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- Order with 3 items → amount counted 3 times!
-- CORRECT: aggregate at right grain
SELECT SUM(amount) AS total_revenue FROM orders;
The Golden Rule
Before every query, ask yourself: "what does one row of the result represent?" If you cannot answer, you have not thought through the grain.
Metric Definition Patterns
DAU (Daily Active Users)
SELECT DATE(event_time) AS dt, COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type IN ('page_view', 'click', 'purchase')
GROUP BY DATE(event_time);
Key decision: what counts as "active"? A page view? Any event? Purchases only? The definition must align with the business meaning.
Retention
Day-N retention: of the users who signed up on day 0, what fraction came back on day N?
WITH signups AS (
SELECT user_id, DATE(created_at) AS signup_date FROM users
),
day7_active AS (
SELECT DISTINCT user_id, DATE(event_time) AS active_date FROM events
)
SELECT
s.signup_date,
COUNT(DISTINCT s.user_id) AS cohort_size,
COUNT(DISTINCT d7.user_id) AS retained_day7,
ROUND(COUNT(DISTINCT d7.user_id) * 1.0 / COUNT(DISTINCT s.user_id), 4) AS day7_retention
FROM signups s
LEFT JOIN day7_active d7
ON s.user_id = d7.user_id
AND d7.active_date = s.signup_date + INTERVAL '7 days'
GROUP BY s.signup_date
ORDER BY s.signup_date;
Common Retention Variants
Day-N (classic): counted if the user returns on exactly day N. Bounded: returns between day N and day N+K (more stable). Unbounded: returns any time on or after day N. Confirm which definition is expected before writing SQL.
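For the bounded variant, only the join condition of the query above changes. A sketch with N = 7 and a 7-day window (K = 6), keeping the same signups/day7_active CTEs:

```sql
LEFT JOIN day7_active d7
  ON s.user_id = d7.user_id
  AND d7.active_date BETWEEN s.signup_date + INTERVAL '7 days'
                         AND s.signup_date + INTERVAL '13 days'
```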
Conversion Funnel
WITH funnel AS (
SELECT user_id,
MAX(CASE WHEN event_type = 'page_view' THEN 1 ELSE 0 END) AS viewed,
MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS added,
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchased
FROM events WHERE DATE(event_time) = '2024-01-15'
GROUP BY user_id
)
SELECT
SUM(viewed) AS viewed, SUM(added) AS added_to_cart, SUM(purchased) AS purchased,
ROUND(SUM(added) * 1.0 / NULLIF(SUM(viewed), 0), 4) AS view_to_cart,
ROUND(SUM(purchased) * 1.0 / NULLIF(SUM(added), 0), 4) AS cart_to_purchase
FROM funnel;
NULLIF(x, 0) prevents division by zero, an essential habit in interviews.
Year-over-Year Growth
WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
FROM orders GROUP BY DATE_TRUNC('month', order_date)
)
SELECT curr.month, curr.revenue,
prev.revenue AS prev_year,
ROUND((curr.revenue - prev.revenue) * 100.0 / NULLIF(prev.revenue, 0), 2) AS yoy_pct
FROM monthly_revenue curr
LEFT JOIN monthly_revenue prev ON curr.month = prev.month + INTERVAL '1 year'
ORDER BY curr.month;
Deduplication Patterns
-- ROW_NUMBER: keep most recent record per user
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
FROM user_profiles
) deduped WHERE rn = 1;
-- PostgreSQL shorthand
SELECT DISTINCT ON (user_id) * FROM user_profiles ORDER BY user_id, updated_at DESC;
Advanced Patterns
Self Join: Finding Consecutive Events
-- Users who purchased within 24 hours of signing up
SELECT DISTINCT s.user_id
FROM events s
JOIN events p ON s.user_id = p.user_id
AND p.event_type = 'purchase'
AND p.event_time BETWEEN s.event_time AND s.event_time + INTERVAL '24 hours'
WHERE s.event_type = 'signup';
CASE WHEN for Pivoting
-- Revenue by category, pivoted into columns
SELECT DATE(order_date) AS dt,
SUM(CASE WHEN category = 'electronics' THEN amount ELSE 0 END) AS electronics,
SUM(CASE WHEN category = 'clothing' THEN amount ELSE 0 END) AS clothing,
SUM(CASE WHEN category = 'food' THEN amount ELSE 0 END) AS food
FROM orders GROUP BY DATE(order_date);
EXISTS vs IN
-- EXISTS (often faster for large subqueries — stops at first match)
SELECT u.user_id, u.name FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id);
-- IN (equivalent but can be slower)
SELECT user_id, name FROM users
WHERE user_id IN (SELECT DISTINCT user_id FROM orders);
Gaps and Islands
-- Find consecutive login streaks per user
WITH daily_logins AS (
SELECT DISTINCT user_id, DATE(login_time) AS login_date FROM logins
),
streaks AS (
SELECT user_id, login_date,
login_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) AS streak_group
FROM daily_logins
)
SELECT user_id, MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS streak_length
FROM streaks
GROUP BY user_id, streak_group
HAVING COUNT(*) >= 3 -- streaks of 3+ days
ORDER BY streak_length DESC;
Sessionization
-- Define sessions: gap > 30 min = new session
WITH events_with_gap AS (
SELECT *, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM events
),
session_starts AS (
SELECT *,
CASE WHEN prev_time IS NULL
OR event_time - prev_time > INTERVAL '30 minutes'
THEN 1 ELSE 0 END AS is_new_session
FROM events_with_gap
),
sessions AS (
SELECT *, SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY event_time
) AS session_id
FROM session_starts
)
SELECT user_id, session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS events_in_session
FROM sessions
GROUP BY user_id, session_id;
Performance in Interviews
Interviews rarely probe optimization deeply, but you should know: (1) avoid SELECT *; (2) filter on indexed columns in WHERE; (3) filter before joining to shrink the data; (4) replace large IN subqueries with EXISTS; (5) avoid functions on indexed columns (WHERE DATE(ts) = ... forces a full scan). Mentioning these proactively earns extra credit.
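Point (5) in practice: rewrite function-wrapped date filters as half-open ranges so an index on the raw column can be used:

```sql
-- Likely full scan: the function hides the index on event_time
SELECT COUNT(*) FROM events WHERE DATE(event_time) = '2024-01-15';

-- Index-friendly: half-open range on the raw column
SELECT COUNT(*) FROM events
WHERE event_time >= '2024-01-15'
  AND event_time < '2024-01-16';
```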
Debugging Weird Numbers
When SQL output looks wrong, check in this order:
| Step | Check | Common Issue |
|---|---|---|
| 1 | Grain | Did a JOIN fan out? What does one row represent? |
| 2 | Duplicates | Need DISTINCT? Duplicate events? |
| 3 | NULLs | NULLs silently dropped by a JOIN or WHERE? |
| 4 | Date boundaries | Inclusive or exclusive filter? Timezone? |
| 5 | Denominator | Is "total users" the right population? |
-- Quick sanity check template
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT user_id) AS unique_users,
MIN(event_time) AS earliest,
MAX(event_time) AS latest,
COUNT(*) - COUNT(user_id) AS null_user_count
FROM events
WHERE DATE(event_time) = '2024-01-15';
Real-World Use Cases
Case 1: Credit Card Fraud Detection (Feature Engineering in SQL)
-- Create fraud detection features from transaction history
WITH user_stats AS (
SELECT user_id,
COUNT(*) AS txn_count_30d,
AVG(amount) AS avg_amount_30d,
MAX(amount) AS max_amount_30d,
COUNT(DISTINCT merchant_id) AS unique_merchants_30d,
SUM(CASE WHEN is_international = 1 THEN 1 ELSE 0 END) AS intl_txn_count
FROM transactions
WHERE txn_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
)
SELECT t.*, u.avg_amount_30d,
t.amount / NULLIF(u.avg_amount_30d, 0) AS amount_ratio, -- unusually high?
u.txn_count_30d, u.intl_txn_count
FROM transactions t
LEFT JOIN user_stats u ON t.user_id = u.user_id
WHERE t.txn_date = CURRENT_DATE;
amount_ratio = this transaction's amount divided by the user's 30-day average. A very high ratio suggests an anomaly, which is a fraud signal.
Case 2: Recommender Systems (Collaborative Filtering Data)
-- Item co-purchase matrix for item-based collaborative filtering
WITH user_purchases AS (
SELECT DISTINCT user_id, product_id
FROM orders WHERE order_date >= '2024-01-01'
)
SELECT a.product_id AS item_a, b.product_id AS item_b,
COUNT(DISTINCT a.user_id) AS co_purchase_count
FROM user_purchases a
JOIN user_purchases b ON a.user_id = b.user_id AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.user_id) >= 5 -- minimum support
ORDER BY co_purchase_count DESC;
Case 3: Customer Segmentation (RFM Analysis)
-- RFM (Recency, Frequency, Monetary) segmentation
WITH rfm AS (
SELECT user_id,
CURRENT_DATE - MAX(order_date) AS recency_days,
COUNT(DISTINCT order_id) AS frequency,
SUM(amount) AS monetary
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY user_id
),
rfm_scored AS (
SELECT *,
NTILE(4) OVER (ORDER BY recency_days DESC) AS r_score, -- most recent buyers score 4 (higher = better)
NTILE(4) OVER (ORDER BY frequency ASC) AS f_score, -- most frequent buyers score 4
NTILE(4) OVER (ORDER BY monetary ASC) AS m_score -- highest spenders score 4
FROM rfm
FROM rfm
)
SELECT *,
CASE
WHEN r_score >= 3 AND f_score >= 3 THEN 'VIP'
WHEN r_score >= 3 AND f_score <= 2 THEN 'New/Promising'
WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk'
ELSE 'Dormant'
END AS segment
FROM rfm_scored;
Interview Signals
What interviewers listen for:
- You have grain awareness: you know the grain of every table and every metric
- You define the business logic before writing SQL
- You can explain why the same conversion rate can mean different things across cohorts
- When numbers look wrong, you follow a systematic debugging process
- You proactively use NULLIF to avoid division by zero and DISTINCT to avoid double counting
Practice
Flashcards
Why confirm the grain before computing a metric?
Answering a user-level question directly with event-level data causes double counting or a distorted denominator. A user with 5 events gives COUNT(*) = 5, not 1. A JOIN can also turn a 1-to-1 grain into 1-to-many, inflating results. Always ask first: "what does one row represent?"
Quiz
What is the most direct SQL approach to get each user's most recent purchase date?