发现公司业务在时序类、流水类业务场景越来越多,对数据治理的需求越来越大,这里整理介绍写 PostgreSQL 分区表的应用方法。
一、分区表的作用
1. 将数据按指定的方法打算到子分区,提高 SQL 性能。
2. 解决时序类、流水类业务大表在进行老旧数据清理时 delete 引起的性能及磁盘空间碎片问题。
3. 利用子分区卸载、重新挂载功能,对数据进行暂时性的隐藏、维护。
4. 数据归档治理业务场景:定期增加子分区、定期删除不需要的子分区来对数据进行滑窗处理,保持业务系统瘦身。
5. 添加子分区对业务透明,业务逻辑上只需要访问父表即可。
二、业务场景举例
以大型电商平台为例,通常订单类的信息都比较庞大,假设订单表 tab_orders 的数据量是 100G,单表 10 亿数据量,业务需要统计某个区域内订单的平均额度,往往会消耗比较漫长的时间:
select avg(total_amount) from tab_orders where state_code=1;
如果我们能够把大表分拆成小表,查询数据的时候,只扫描数据所属的小表,就能大大降低扫描时间,提高查询速度。
如果采用分布式架构,比如分 10 个分片,那么单个分片依旧有 1 亿条数据,对于常规数据库来说,依然容易出现严重性能问题。
此时,我们可以在分布式架构的基础上,对业务大表再进行分区,那么单个分片的数据就会进一步被打散。
PostgreSQL 的分区表可以用来解决此类问题,适用于集中式和分布式架构。解决方式是:
创建一个表 tab_orders,作为分区表的父表,再创建 50 个子分区:
tab_orders_1, tab_orders_2, …, tab_orders_50,
这样每一个分区对应一个城市的数据,分区的数据量平均是 2G,如果是分布式架构,那么单个分片内,单个子分区就是 0.1G,200w 的数据量,如果单表是百亿数据量,如果还嫌子分区数据量太大,我们可以继续进行二级、三级、四级..... 多级分区
注:pg 分区表分区方法和分区层级不限。
在本例中,这 50 分区联合在一起,组成分区父表 tab_orders。
这里的分区父表和子分区表都是实实在在的表,和传统的分库分表不一样,分区表可以保持原普通表的查询语句保持不变,对业务透明,如下:
select avg(total_amount) from tab_orders where state_code=1;
PostgreSQL 通过对执行语句的分析处理,最终把扫描的任务定位在分区 tab_order_1 上,自动把查询语句转换成下面的语句,其他分区根本不需要扫描,这就是分区裁剪技术。
select avg(total_amount) from tab_orders_1;
三、分区表分区方法
1.pg 分区表支持 range、list、hash(pg11 版本及以上)三种主要分区方法
2.pg 分区表的分区级数不限、方法不限:即一级分区下面可以进行二级分区,二级分区下面还可以进行 3 及分区........
四、分区表使用注意事项及技巧
1. 分区表中分区键和分布式的分布键一样,不允许对分区键字段进行 update 操作。
2. 分区表中建议规范所有唯一性约束必须包含分区键。
1)分区父表的主键必须包含分区字段;
2)唯一索引必须包含分区字段
3. 分区键的选择技巧:让分区键尽可能地出现在 select、delete、update 语句的 where 条件,以发挥分区裁剪的作用来加速 SQL 性能。
4. 在 insert 语句中,需在字段列表中指定分区键,如:insert into tab_aken (id,part_col) values (1,'2021-10-16');
5. 子分区数量不宜过多,现网使用中发现子分区 1000 个和子分区 300 个两者的性能有较大差别。
五、分区表创建方法
1.range 范围分区表例子:PARTITION BY RANGE (分区键字段);
1). 创建父表:(pg-12 版本)
-- 如下使用时间字段 info_time 作为分区键,使用范围分区方法进行分区
CREATE TABLE tab_aken ( uid integer NOT NULL, info_time timestamp NOT NULL, money decimal(5,2) NOT NULL, primary key (uid,info_time) ) PARTITION BY RANGE (info_time);
2). 按月分区 (pg-12 版本)
-- 方法 1:直接添加。如下添加 3 个子分区
CREATE TABLE aken_2020_1 PARTITION of tab_aken FOR VALUES FROM ('2020-1-01') TO ('2020-1-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_2 PARTITION of tab_aken FOR VALUES FROM ('2020-2-01') TO ('2020-2-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_3 PARTITION of tab_aken FOR VALUES FROM ('2020-3-01') TO ('2020-3-01'::timestamp + interval '1 month');
-- 方法 2:使用 generate_series 函数,按月创建 12 个子表,拼接 SQL 如下:
psql -At -h 9.22.xx.xxx -p xxx -U dbmgr -d akendb -c 'SELECT 'CREATE TABLE aken_2020_' || p_month || ' PARTITION of tab_aken FOR VALUES FROM (''2020-'||p_month||'-01'') TO (''2020-'||p_month||'-01''::timestamp + interval ''1 month'');' FROM generate_series(1,12) as p_month ;' | psql -h 9.22.xx.xxx -p xxx -U dbmgr -d akendb ?column? ------------------------------------------------------------------------------------------------------------------------------- CREATE TABLE aken_2020_1 PARTITION of tab_aken FOR VALUES FROM ('2020-1-01') TO ('2020-1-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_2 PARTITION of tab_aken FOR VALUES FROM ('2020-2-01') TO ('2020-2-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_3 PARTITION of tab_aken FOR VALUES FROM ('2020-3-01') TO ('2020-3-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_4 PARTITION of tab_aken FOR VALUES FROM ('2020-4-01') TO ('2020-4-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_5 PARTITION of tab_aken FOR VALUES FROM ('2020-5-01') TO ('2020-5-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_6 PARTITION of tab_aken FOR VALUES FROM ('2020-6-01') TO ('2020-6-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_7 PARTITION of tab_aken FOR VALUES FROM ('2020-7-01') TO ('2020-7-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_8 PARTITION of tab_aken FOR VALUES FROM ('2020-8-01') TO ('2020-8-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_9 PARTITION of tab_aken FOR VALUES FROM ('2020-9-01') TO ('2020-9-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_10 PARTITION of tab_aken FOR VALUES FROM ('2020-10-01') TO ('2020-10-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_11 PARTITION of tab_aken FOR VALUES FROM ('2020-11-01') TO ('2020-11-01'::timestamp + interval '1 month'); CREATE TABLE aken_2020_12 PARTITION of tab_aken FOR VALUES FROM ('2020-12-01') TO ('2020-12-01'::timestamp + interval '1 month'); (12 rows) akendb=#
3). 查看分区表结构:可以看到父表下有 12 张分区表
akendb=# \d+ tab_aken Partitioned table 'public.aken' Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -------------+-----------------------------+-----------+----------+---------+---------+--------------+------------- sensor_id | integer | | not null | | plain | | ptime | timestamp without time zone | | not null | | plain | | temperature | numeric(5,2) | | not null | | main | | Partition key: RANGE (ptime) Indexes: 'aken_pkey' PRIMARY KEY, btree (sensor_id, ptime) Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'), aken_2020_10 FOR VALUES FROM ('2020-10-01 00:00:00') TO ('2020-11-01 00:00:00'), aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'), aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'), aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'), aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'), aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'), aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'), aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00'), aken_2020_7 FOR VALUES FROM ('2020-07-01 00:00:00') TO ('2020-08-01 00:00:00'), aken_2020_8 FOR VALUES FROM ('2020-08-01 00:00:00') TO ('2020-09-01 00:00:00'), aken_2020_9 FOR VALUES FROM ('2020-09-01 00:00:00') TO ('2020-10-01 00:00:00')
4)查询分区表数据
akendb=# select * from tab_aken where time_col >= '2020-05-08 11:20:16'::timestamp and time_col = '2020-06-01 00:00:00'::timestamp without time zone) AND (ptime < '2020-07-01 00:00:00'::timestamp without time zone)) Indexes: 'aken_2020_6_pkey' PRIMARY KEY, btree (sensor_id, ptime) 'aken_2020_6_sensor_id_idx' btree (sensor_id) Access method: heap akendb=#
七、利用分区表进行数据维护:删除分区、添加分区、卸载分区(隐藏分区)、重新挂载分区
1. 删除子分区
-- 大表数据维护,mysql、PostgreSQL、Oracle 等关系型 DB 不建议使用 delete 操作,对性能影响较大。
-- 后期如果不需要某个时间段的数据,直接 drop 对应的子分区即可,不影响全表,对业务透明。
-- 当需要清理冷旧数据时,直接 drop 子分区即可,无需使用 delete 这种比较损耗性能的操作。
1) 首先,查看父表 tab_aken 当前有哪些子分区:
akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric)from pg_class where relname like 'tab_aken_part_%' order by 2; relname | split_part -----------------+------------ tab_aken_part_1 | 1 tab_aken_part_2 | 2 tab_aken_part_3 | 3 tab_aken_part_4 | 4 tab_aken_part_5 | 5 tab_aken_part_6 | 6 tab_aken_part_7 | 7 tab_aken_part_8 | 8 tab_aken_part_9 | 9 (9 rows)
2). 删除目标子分区:
-- 拼接方法,可以放到定期任务里面,如每次删除前面 N 个子分区(limit N)
-- 如下拼接删除 2 个最早的子分区
akendb=# select 'drop table '||string_agg(relname, ',')||';' as drop_target_child_partitions from ( select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part%' order by 2 limit 2 ) as aaa; drop_target_partitions --------------------------------------------- drop table tab_aken_part_1,tab_aken_part_2; (2 row) akendb=# drop table tab_aken_part_1,tab_aken_part_2; DROP TABLE akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric)from pg_class where relname like 'tab_aken_part_%' order by 2; relname | split_part -----------------+------------ tab_aken_part_3 | 3 tab_aken_part_4 | 4 tab_aken_part_5 | 5 tab_aken_part_6 | 6 tab_aken_part_7 | 7 tab_aken_part_8 | 8 tab_aken_part_9 | 9 (7 rows) akendb=# select 'drop table '||string_agg(relname, ',')||';' as drop_target_child_partitions from ( select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part%' order by 2 limit 2 ) as aaa; drop_target_partitions --------------------------------------------- drop table tab_aken_part_3,tab_aken_part_4; (1 row) akendb=# drop table tab_aken_part_3,tab_aken_part_4; DROP TABLE akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part_%' order by 2; relname | split_part -----------------+------------ tab_aken_part_5 | 5 tab_aken_part_6 | 6 tab_aken_part_7 | 7 tab_aken_part_8 | 8 tab_aken_part_9 | 9 (5 rows) akendb=#
2. 添加子分区
-- 增加子分区主要是为了承接超出已有子分区范围的业务新数据入库
如下当前父表已有子分区:
akendb=# select * from pg_tables where tablename like 'tab_aken%' order by tablename; schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity ------------+-----------------+------------+------------+------------+----------+-------------+------------- public | tab_aken | dbmgr | | t | f | f | f public | tab_aken_part_1 | dbmgr | | t | f | f | f public | tab_aken_part_2 | dbmgr | | t | f | f | f public | tab_aken_part_3 | dbmgr | | t | f | f | f (3rows) akendb=#
添加子分区方法 1:从最大子分区后面直接添加
-- 如下给按 range 分区的父表添加 3 个月的子分区
akendb=# alter table tab_aken add partitions 3; --默认会自动从最大的子分区后面添加3个子分区 ALTER TABLE akendb=# select * from pg_tables where tablename like 'tab_aken%' order by tablename; schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity ------------+-----------------+------------+------------+------------+----------+-------------+------------- public | tab_aken | dbmgr | | t | f | f | f public | tab_aken_part_1 | dbmgr | | t | f | f | f public | tab_aken_part_2 | dbmgr | | t | f | f | f public | tab_aken_part_3 | dbmgr | | t | f | f | f public | tab_aken_part_4 | dbmgr | | t | f | f | f public | tab_aken_part_5 | dbmgr | | t | f | f | f public | tab_aken_part_6 | dbmgr | | t | f | f | f (6rows) akendb=#
添加子分区方法 2:指定分区范围添加
CREATE TABLE aken_2020_7 PARTITION of aken FOR VALUES FROM ('2022-6-01') TO ('2020-6-01'::timestamp + interval '1 month');
3. 卸载子分区(隐藏子分区)、解绑子分区、重新绑定子分区
相对于 drop 子分区,推荐先暂时将子分区从父表中移除的方式,当后续发现还需要子分区的数据,重新将子分区挂载回来即可。
1)卸载子分区(或叫解绑子分区)
akendb=# alter table tab_aken detach partition aken_2020_6; ALTER TABLE akendb=# \d+ tab_aken Partitioned table 'public.tab_aken' Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -----------+-------------+-----------+----------+---------+---------+--------------+------------- sensor_id | integer | | not null | | plain | | ptime | timestamp without time zone | | not null | | plain | | temperature | numeric(5,2) | | not null | | main | | Partition key: RANGE (ptime) Indexes: 'aken_pkey' PRIMARY KEY, btree (sensor_id, ptime) 'idx_sensor_id' btree (sensor_id) Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'), aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'), aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'), aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'), aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'), aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'), aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'), aken_2020_7 FOR VALUES FROM ('2020-07-01 00:00:00') TO ('2020-08-01 00:00:00'), aken_2020_8 FOR VALUES FROM ('2020-08-01 00:00:00') TO ('2020-09-01 00:00:00'), aken_2020_9 FOR VALUES FROM ('2020-09-01 00:00:00') TO ('2020-10-01 00:00:00') akendb=#
和直接 DROP 相比,该方式仅仅是使子表脱离了原有的主表,而存储在子表中的数据仍然可以得到访问,因为此时该子表变成了一个普通的数据表:select * from tab_xxx (子分区表名)。
这样无论对 DBA 还是业务来说,就可以在此时对该表进行必要的维护操作,如数据清理、归档等。
在完成诸多例行性的操作之后,可以考虑是否直接删除该表 (DROP TABLE),还是先清空该表的数据 (TRUNCATE TABLE),或者让该表重新绑定主表。
2)重新挂载子分区
akendb=# ALTER TABLE tab_aken ATTACH PARTITION aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00');
ALTER TABLE
akendb=# \d+ tab_aken
Partitioned table 'public.tab_aken'
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
sensor_id | integer | | not null | | plain | |
ptime | timestamp without time zone | | not null | | plain | |
temperature | numeric(5,2) | | not null | | main | |
Partition key: RANGE (ptime)
Indexes:
'aken_pkey' PRIMARY KEY, btree (sensor_id, ptime)
'idx_sensor_id' btree (sensor_id)
Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'),
aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'),
aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'),
aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'),
aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'),
aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'),
aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'),
aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00'),