返回 筑基・数据元府藏真
高级查询技巧:窗口函数与递归查询
博主
大约 14 分钟
高级查询技巧:窗口函数与递归查询
问题引入:复杂的排名需求
-- 需求:查询每个用户最近3笔订单
-- 传统方式:子查询 + JOIN,性能差
SELECT o1.*
FROM orders o1
WHERE (
SELECT COUNT(*)
FROM orders o2
WHERE o2.user_id = o1.user_id
AND o2.created_at >= o1.created_at
) <= 3;
-- 执行时间:15秒,全表扫描多次
-- 窗口函数方案:简洁高效
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
FROM orders
) t
WHERE rn <= 3;
-- 执行时间:500ms,单次扫描
窗口函数让复杂的分析查询变得简单高效,是现代SQL的必备技能。
现象描述:传统方案的痛点
场景1:分组Top N问题
-- 需求:每个班级成绩前3名的学生
-- 传统方案:关联子查询,性能极差
SELECT s1.*
FROM students s1
WHERE (
SELECT COUNT(DISTINCT score)
FROM students s2
WHERE s2.class_id = s1.class_id
AND s2.score > s1.score
) < 3;
-- 执行时间:30秒(10万学生)
-- 窗口函数方案
SELECT *
FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY class_id ORDER BY score DESC) as rank
FROM students
) t
WHERE rank <= 3;
-- 执行时间:200ms
场景2:累计求和
-- 需求:计算每日累计销售额
-- 传统方案:自连接,复杂度O(n²)
SELECT
a.date,
SUM(b.amount) as cumulative_amount
FROM sales a
JOIN sales b ON b.date <= a.date
GROUP BY a.date;
-- 执行时间:10秒(365天数据)
-- 窗口函数方案
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) as cumulative_amount
FROM sales;
-- 执行时间:50ms
场景3:递归查询层级数据
-- 需求:查询部门树形结构
-- 传统方案:多次查询或应用层组装
-- 需要查询多次数据库,代码复杂
-- 递归CTE方案
WITH RECURSIVE dept_tree AS (
-- 锚点:顶级部门
SELECT id, name, parent_id, 0 as level
FROM departments
WHERE parent_id IS NULL
UNION ALL
-- 递归:查询子部门
SELECT d.id, d.name, d.parent_id, dt.level + 1
FROM departments d
JOIN dept_tree dt ON d.parent_id = dt.id
)
SELECT * FROM dept_tree;
-- 单次查询,简洁高效
原因分析:窗口函数与递归原理
1. 窗口函数原理
执行过程
窗口函数执行流程:
1. 数据分区(PARTITION BY)
输入数据 -> [分区1] [分区2] [分区3] ...
2. 窗口排序(ORDER BY)
每个分区内部排序
3. 窗口计算
在当前行的窗口范围内计算
4. 返回结果
每行都返回计算结果
与传统聚合函数的区别:
- 聚合函数:多行输入,单行输出
- 窗口函数:多行输入,每行都输出(保留原始行)
窗口范围
-- 窗口范围定义
ROWS BETWEEN ... AND ...
RANGE BETWEEN ... AND ...
-- 常用范围
UNBOUNDED PRECEDING -- 分区第一行
CURRENT ROW -- 当前行
UNBOUNDED FOLLOWING -- 分区最后一行
-- 示例:计算移动平均
SELECT
date,
amount,
AVG(amount) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7d
FROM sales;
2. 递归CTE原理
执行过程
递归CTE执行流程:
1. 执行锚点成员(非递归部分)
-> 产生初始结果集(种子)
2. 执行递归成员
-> 基于上一次的结果继续查询
-> 产生新的结果集
3. 重复步骤2
-> 直到递归成员返回空结果
4. 合并所有结果
-> 返回最终数据集
注意:必须有终止条件,否则无限循环
递归限制
-- MySQL递归限制参数
SHOW VARIABLES LIKE 'cte_max_recursion_depth';
-- 默认1000层
-- 临时增加限制
SET SESSION cte_max_recursion_depth = 10000;
-- 在CTE中设置限制
WITH RECURSIVE tree AS (
SELECT id, parent_id, 0 as level
FROM categories
WHERE id = 1
UNION ALL
SELECT c.id, c.parent_id, t.level + 1
FROM categories c
JOIN tree t ON c.parent_id = t.id
WHERE t.level < 10 -- 终止条件
)
SELECT * FROM tree;
解决方案:窗口函数实战
1. 排名函数
ROW_NUMBER()
-- 特点:连续排名,无重复
-- 用途:精确排名、分页
-- 示例:用户订单排名
SELECT
user_id,
order_id,
total_amount,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY total_amount DESC
) as amount_rank
FROM orders;
-- 结果:
-- user_id | order_id | total_amount | amount_rank
-- 1 | 1001 | 1000.00 | 1
-- 1 | 1002 | 800.00 | 2
-- 1 | 1003 | 600.00 | 3
-- 2 | 2001 | 1200.00 | 1
-- 2 | 2002 | 900.00 | 2
RANK() 和 DENSE_RANK()
-- RANK():跳跃排名
-- 1, 2, 2, 4, 5(有并列时跳号)
-- DENSE_RANK():密集排名
-- 1, 2, 2, 3, 4(有并列不跳号)
-- 示例:学生成绩排名
SELECT
student_name,
score,
RANK() OVER (ORDER BY score DESC) as rank_jump,
DENSE_RANK() OVER (ORDER BY score DESC) as rank_dense,
ROW_NUMBER() OVER (ORDER BY score DESC) as rank_row
FROM scores;
-- 结果:
-- student_name | score | rank_jump | rank_dense | rank_row
-- 张三 | 100 | 1 | 1 | 1
-- 李四 | 95 | 2 | 2 | 2
-- 王五 | 95 | 2 | 2 | 3
-- 赵六 | 90 | 4 | 3 | 4
2. 聚合窗口函数
累计计算
-- 累计求和
SELECT
date,
daily_amount,
SUM(daily_amount) OVER (ORDER BY date) as cumulative_amount,
SUM(daily_amount) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_explicit
FROM daily_sales;
-- 年度累计(分区累计)
SELECT
year,
month,
amount,
SUM(amount) OVER (
PARTITION BY year
ORDER BY month
) as year_cumulative
FROM monthly_sales;
移动计算
-- 7天移动平均
SELECT
date,
amount,
AVG(amount) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as avg_7d,
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as sum_7d
FROM daily_sales;
-- 前后值比较
SELECT
date,
amount,
LAG(amount, 1) OVER (ORDER BY date) as prev_day_amount,
LEAD(amount, 1) OVER (ORDER BY date) as next_day_amount,
amount - LAG(amount, 1) OVER (ORDER BY date) as day_over_day_change
FROM daily_sales;
3. 取值窗口函数
FIRST_VALUE 和 LAST_VALUE
-- 分组内第一行和最后一行的值
SELECT
user_id,
order_id,
total_amount,
FIRST_VALUE(total_amount) OVER (
PARTITION BY user_id
ORDER BY created_at
) as first_order_amount,
LAST_VALUE(total_amount) OVER (
PARTITION BY user_id
ORDER BY created_at
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as last_order_amount
FROM orders;
LAG 和 LEAD
-- 前后偏移取值
SELECT
date,
amount,
-- 前一天
LAG(amount, 1, 0) OVER (ORDER BY date) as prev_amount,
-- 前两天
LAG(amount, 2) OVER (ORDER BY date) as prev_2_amount,
-- 后一天
LEAD(amount, 1) OVER (ORDER BY date) as next_amount,
-- 环比增长率
(amount - LAG(amount, 1) OVER (ORDER BY date)) /
LAG(amount, 1) OVER (ORDER BY date) * 100 as growth_rate
FROM daily_sales;
4. 高级窗口函数
NTILE 分桶
-- 将数据分成N个桶
SELECT
student_name,
score,
NTILE(4) OVER (ORDER BY score DESC) as quartile,
CASE NTILE(4) OVER (ORDER BY score DESC)
WHEN 1 THEN '优秀'
WHEN 2 THEN '良好'
WHEN 3 THEN '及格'
ELSE '不及格'
END as grade_level
FROM scores;
PERCENT_RANK 和 CUME_DIST
-- 百分比排名
SELECT
student_name,
score,
PERCENT_RANK() OVER (ORDER BY score) as percentile,
CUME_DIST() OVER (ORDER BY score) as cumulative_distribution,
ROUND(PERCENT_RANK() OVER (ORDER BY score) * 100, 2) as percentile_rank
FROM scores;
解决方案:递归查询实战
1. 树形结构查询
查询完整树
-- 部门树查询
WITH RECURSIVE dept_tree AS (
-- 锚点:顶级部门
SELECT
id,
name,
parent_id,
0 as level,
CAST(name AS CHAR(1000)) as path
FROM departments
WHERE parent_id IS NULL
UNION ALL
-- 递归:查询子部门
SELECT
d.id,
d.name,
d.parent_id,
dt.level + 1,
CONCAT(dt.path, ' > ', d.name)
FROM departments d
JOIN dept_tree dt ON d.parent_id = dt.id
)
SELECT
id,
name,
level,
path,
REPEAT(' ', level) || name as tree_display
FROM dept_tree
ORDER BY path;
查询子树
-- 查询某个部门的所有下级
WITH RECURSIVE sub_dept AS (
-- 锚点:指定部门
SELECT id, name, parent_id, 0 as level
FROM departments
WHERE id = 5
UNION ALL
-- 递归:查询下级
SELECT d.id, d.name, d.parent_id, sd.level + 1
FROM departments d
JOIN sub_dept sd ON d.parent_id = sd.id
)
SELECT * FROM sub_dept;
查询上级路径
-- 查询某个部门的所有上级
WITH RECURSIVE parent_dept AS (
-- 锚点:指定部门
SELECT id, name, parent_id, 0 as level
FROM departments
WHERE id = 10
UNION ALL
-- 递归:查询上级
SELECT d.id, d.name, d.parent_id, pd.level + 1
FROM departments d
JOIN parent_dept pd ON d.id = pd.parent_id
)
SELECT * FROM parent_dept ORDER BY level DESC;
2. 层级数据处理
生成数字序列
-- 生成1到1000的数字序列
WITH RECURSIVE numbers AS (
SELECT 1 as n
UNION ALL
SELECT n + 1
FROM numbers
WHERE n < 1000
)
SELECT * FROM numbers;
-- 生成日期序列
WITH RECURSIVE dates AS (
SELECT '2023-01-01' as date
UNION ALL
SELECT DATE_ADD(date, INTERVAL 1 DAY)
FROM dates
WHERE date < '2023-12-31'
)
SELECT * FROM dates;
字符串拆分
-- 将逗号分隔的字符串拆分成多行
WITH RECURSIVE split AS (
SELECT
'apple,banana,orange' as str,
SUBSTRING_INDEX('apple,banana,orange', ',', 1) as val,
SUBSTRING('apple,banana,orange',
LENGTH(SUBSTRING_INDEX('apple,banana,orange', ',', 1)) + 2
) as remainder
UNION ALL
SELECT
remainder,
SUBSTRING_INDEX(remainder, ',', 1),
CASE
WHEN LOCATE(',', remainder) > 0 THEN
SUBSTRING(remainder, LENGTH(SUBSTRING_INDEX(remainder, ',', 1)) + 2)
ELSE ''
END
FROM split
WHERE remainder != ''
)
SELECT val FROM split;
3. 图遍历
最短路径查询
-- 查找两个节点之间的路径(简化版)
WITH RECURSIVE paths AS (
-- 锚点:起点
SELECT
from_node,
to_node,
CAST(from_node AS CHAR(1000)) as path,
1 as hops
FROM edges
WHERE from_node = 'A'
UNION ALL
-- 递归:扩展路径
SELECT
p.from_node,
e.to_node,
CONCAT(p.path, '->', e.to_node),
p.hops + 1
FROM paths p
JOIN edges e ON p.to_node = e.from_node
WHERE p.hops < 5 -- 限制路径长度
AND LOCATE(e.to_node, p.path) = 0 -- 避免循环
)
SELECT * FROM paths WHERE to_node = 'D' ORDER BY hops;
实施步骤:高级查询应用
步骤1:识别适用场景
窗口函数适用场景: □ 分组Top N查询 □ 累计/移动计算 □ 行间比较(环比、同比) □ 分组内排序和编号 □ 百分比和分位数计算 递归CTE适用场景: □ 树形/层级结构查询 □ 图遍历 □ 序列生成 □ 字符串处理 □ 层级聚合
步骤2:编写和优化
-- 1. 先写基础查询
SELECT * FROM orders WHERE created_at > '2023-01-01';
-- 2. 添加窗口函数
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
FROM orders
WHERE created_at > '2023-01-01';
-- 3. 添加过滤条件
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
FROM orders
WHERE created_at > '2023-01-01'
) t
WHERE rn <= 3;
-- 4. 性能优化:确保分区列和排序列有索引
CREATE INDEX idx_user_created ON orders(user_id, created_at);
步骤3:性能验证
-- 查看执行计划
EXPLAIN ANALYZE
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) as rn
FROM orders
) t
WHERE rn <= 3;
-- 关注指标:
-- - 是否使用了索引
-- - 是否需要临时表
-- - 执行时间
效果验证:优化前后对比
案例1:分组Top N
| 方案 | 执行时间 | 扫描次数 | 复杂度 |
|---|---|---|---|
| 子查询方案 | 15s | 多次全表扫描 | O(n²) |
| 窗口函数方案 | 500ms | 单次扫描 | O(n) |
| 提升 | 97% | 大幅减少 | 显著优化 |
案例2:累计计算
| 方案 | 执行时间 | 内存使用 | 可读性 |
|---|---|---|---|
| 自连接方案 | 10s | 高 | 差 |
| 窗口函数方案 | 50ms | 低 | 好 |
| 提升 | 99.5% | 降低 | 显著提升 |
案例3:树形查询
| 方案 | 查询次数 | 代码复杂度 | 维护性 |
|---|---|---|---|
| 应用层组装 | 多次 | 高 | 差 |
| 递归CTE方案 | 1次 | 低 | 好 |
| 提升 | 单次查询 | 简化 | 显著提升 |
经验总结:高级查询最佳实践
✅ 应该做的
- 优先使用窗口函数替代子查询
- 为分区列和排序列创建索引
- 递归查询设置终止条件
- 使用CTE提高可读性
- 注意窗口函数的执行顺序
- 大数据量时测试性能
- 理解不同排名函数的区别
- 合理使用窗口范围定义
❌ 不应该做的
- 忘记处理递归终止条件
- 在窗口函数后错误使用WHERE
- 忽视索引对窗口函数的影响
- 过度使用递归(深度过大)
- 混淆ROW_NUMBER和RANK
- 忘记窗口函数后的过滤需要子查询
- 在递归中忘记去重或循环检测
- 忽视窗口函数的内存开销
深度案例分析:用户行为分析
项目背景
某电商平台需要分析用户购买行为:
- 用户首次购买时间
- 用户购买频次变化
- 用户流失预警
解决方案
-- 1. 计算用户首次购买和最近一次购买
WITH user_purchase_stats AS (
SELECT
user_id,
MIN(created_at) as first_purchase,
MAX(created_at) as last_purchase,
COUNT(*) as total_orders,
SUM(total_amount) as total_amount
FROM orders
GROUP BY user_id
),
-- 2. 计算用户购买频次趋势
user_frequency AS (
SELECT
user_id,
DATE_FORMAT(created_at, '%Y-%m') as month,
COUNT(*) as monthly_orders,
LAG(COUNT(*)) OVER (
PARTITION BY user_id
ORDER BY DATE_FORMAT(created_at, '%Y-%m')
) as prev_month_orders
FROM orders
GROUP BY user_id, DATE_FORMAT(created_at, '%Y-%m')
),
-- 3. 识别流失用户(3个月未购买)
churned_users AS (
SELECT
user_id,
last_purchase,
DATEDIFF(NOW(), last_purchase) as days_since_last,
CASE
WHEN DATEDIFF(NOW(), last_purchase) > 90 THEN 'Churned'
WHEN DATEDIFF(NOW(), last_purchase) > 60 THEN 'At Risk'
ELSE 'Active'
END as user_status
FROM user_purchase_stats
)
-- 最终查询
SELECT
c.user_id,
c.user_status,
c.days_since_last,
s.total_orders,
s.total_amount,
s.first_purchase
FROM churned_users c
JOIN user_purchase_stats s ON c.user_id = s.user_id
ORDER BY c.days_since_last DESC;
优化效果
| 指标 | 优化前 | 优化后 | 提升 |
|---|---|---|---|
| 查询时间 | 30s | 2s | 93% ↓ |
| 代码行数 | 200行 | 50行 | 75% ↓ |
| 可读性 | 差 | 好 | 显著提升 |
| 维护成本 | 高 | 低 | 显著降低 |
窗口函数与递归决策树
需求分析
↓
需要分组计算?
↓ 是
使用窗口函数
↓
需要排名?
↓ 是
使用ROW_NUMBER/RANK/DENSE_RANK
↓
需要累计计算?
↓ 是
使用SUM/AVG OVER
↓
需要层级查询?
↓ 是
使用递归CTE
↓
需要树遍历?
↓ 是
使用RECURSIVE WITH
↓
完成
常见误区与避坑指南
❌ 误区1:窗口函数后直接使用WHERE
问题:
-- 错误:窗口函数后直接过滤
SELECT
user_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) as rn
FROM orders
WHERE rn <= 3; -- 错误!rn还未生成
正确做法:
-- 正确:使用子查询或CTE
SELECT *
FROM (
SELECT
user_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) as rn
FROM orders
) t
WHERE rn <= 3;
❌ 误区2:递归没有终止条件
问题:递归无限循环直到耗尽资源。
正确做法:
-- 始终设置终止条件
WITH RECURSIVE tree AS (
SELECT id, parent_id, 0 as level
FROM categories
WHERE id = 1
UNION ALL
SELECT c.id, c.parent_id, t.level + 1
FROM categories c
JOIN tree t ON c.parent_id = t.id
WHERE t.level < 10 -- 终止条件
)
SELECT * FROM tree;
❌ 误区3:混淆排名函数
问题:
- ROW_NUMBER():即使值相同也有不同排名
- RANK():值相同排名相同,但会跳号
- DENSE_RANK():值相同排名相同,不跳号
正确做法:根据业务需求选择合适的排名函数。
高级查询检查清单
窗口函数检查: □ 分区列是否正确? □ 排序列是否正确? □ 窗口范围是否合适? □ 是否需要在子查询中过滤? □ 相关列是否有索引? 递归CTE检查: □ 锚点成员是否正确? □ 递归成员是否正确? □ 是否有终止条件? □ 是否会无限循环? □ 递归深度是否合理?
附录
A. 窗口函数速查表
| 函数 | 用途 | 示例 |
|---|---|---|
| ROW_NUMBER() | 连续排名 | ROW_NUMBER() OVER (ORDER BY score DESC) |
| RANK() | 跳跃排名 | RANK() OVER (ORDER BY score DESC) |
| DENSE_RANK() | 密集排名 | DENSE_RANK() OVER (ORDER BY score DESC) |
| LAG() | 前N行取值 | LAG(amount, 1) OVER (ORDER BY date) |
| LEAD() | 后N行取值 | LEAD(amount, 1) OVER (ORDER BY date) |
| FIRST_VALUE() | 第一行值 | FIRST_VALUE(amount) OVER (ORDER BY date) |
| LAST_VALUE() | 最后一行值 | LAST_VALUE(amount) OVER (ORDER BY date) |
| SUM() OVER | 累计求和 | SUM(amount) OVER (ORDER BY date) |
| AVG() OVER | 移动平均 | AVG(amount) OVER (ROWS 6 PRECEDING) |
| NTILE() | 分桶 | NTILE(4) OVER (ORDER BY score) |
B. 递归CTE模板
-- 树形查询模板
WITH RECURSIVE tree AS (
-- 锚点
SELECT id, parent_id, 0 as level, name as path
FROM table
WHERE parent_id IS NULL
UNION ALL
-- 递归
SELECT t.id, t.parent_id, tr.level + 1,
CONCAT(tr.path, ' > ', t.name)
FROM table t
JOIN tree tr ON t.parent_id = tr.id
WHERE tr.level < 10
)
SELECT * FROM tree;
-- 序列生成模板
WITH RECURSIVE series AS (
SELECT 1 as n
UNION ALL
SELECT n + 1 FROM series WHERE n < 1000
)
SELECT * FROM series;
读者练习
-
思考题:分析你项目中的复杂查询,找出可以使用窗口函数优化的场景。
-
实践题:使用窗口函数实现以下查询:
-- 查询每个用户连续登录的最大天数
-- 表结构:user_log(user_id, login_date)
- 挑战题:使用递归CTE查询组织架构中每个员工的完整汇报路径。
系列上一篇:查询优化器揭秘:统计信息与执行计划
系列下一篇:事务隔离级别与并发问题深度剖析
知识点测试
读完文章了?来测试一下你对知识点的掌握程度吧!
评论区
使用 GitHub 账号登录后即可发表评论,支持 Markdown 格式。
如果评论系统无法加载,请确保:
- 您的网络可以访问 GitHub
- giscus GitHub App 已安装到仓库
- 仓库已启用 Discussions 功能