PostgreSQL Partitioned Tables: Usage and Tips

Data Operations 2023-07-11 张二河

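As more and more of our business involves time-series and transaction-log workloads, the need for data governance keeps growing. This article summarizes how to use PostgreSQL partitioned tables.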

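I. What partitioned tables are for

1. Spread data across child partitions by a chosen method to improve SQL performance.

2. Avoid the performance and disk-fragmentation problems that delete causes when old data is purged from large time-series or transaction-log tables.

3. Detach and later re-attach child partitions to temporarily hide or maintain data.

4. Data archiving: periodically add new child partitions and drop those no longer needed, so that data moves through a sliding window and the business system stays lean.

5. Adding child partitions is transparent to the application, which only ever needs to access the parent table.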

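II. A business scenario

Take a large e-commerce platform as an example. Order data is usually huge. Suppose the order table tab_orders holds 100 GB in 1 billion rows, and the business needs the average order amount for one region. Such a query tends to take a long time: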

select avg(total_amount) from tab_orders where state_code=1;

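If we can split the large table into smaller tables and scan only the small table that holds the relevant data, scan time drops sharply and queries get faster.

With a distributed architecture of, say, 10 shards, each shard still holds 100 million rows, which can still cause serious performance problems for a conventional database.

In that case we can partition the large table on top of the distributed architecture, so the data in each shard is spread out further.

PostgreSQL partitioned tables address this kind of problem in both centralized and distributed architectures. The approach is:

Create a table tab_orders as the partitioned parent table, then create 50 child partitions:

tab_orders_1, tab_orders_2, …, tab_orders_50,

so that each partition holds the data of one city and averages 2 GB. In a distributed architecture with 10 shards, a single child partition within one shard is about 0.2 GB, or 2 million rows. If the table instead holds 10 billion rows, or the child partitions still seem too large, we can keep partitioning into second, third, fourth, and deeper levels.

Note: PostgreSQL does not limit the partitioning method or the number of partition levels.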
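
The sizing in this scenario is simple arithmetic and can be checked quickly. The sketch below is only an illustration that assumes data is spread perfectly evenly; the function name per_partition is hypothetical, and the figures (100 GB, 1 billion rows, 50 partitions, 10 shards) are the ones from the example.

```python
def per_partition(total_gb, total_rows, partitions, shards=1):
    """Return (size in GB, row count) of one child partition on one shard,
    assuming the data is spread evenly."""
    pieces = partitions * shards
    return total_gb / pieces, total_rows // pieces

# Centralized: 50 partitions of the 100 GB / 1 billion row table.
print(per_partition(100, 1_000_000_000, 50))
# Distributed: the same table spread over 10 shards first.
print(per_partition(100, 1_000_000_000, 50, shards=10))
```

Real data is rarely this uniform (some cities have far more orders than others), so these numbers are averages rather than guarantees.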

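In this example, the 50 partitions together make up the partitioned parent table tab_orders.

Both the parent table and its child partitions are real tables. Unlike traditional database and table sharding, a partitioned table lets the queries written for the original plain table stay unchanged, so it is transparent to the application: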

select avg(total_amount) from tab_orders where state_code=1;

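By analyzing the statement, PostgreSQL narrows the scan down to partition tab_orders_1, which is in effect the same as running the statement below. The other partitions are not scanned at all. This technique is called partition pruning.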

select avg(total_amount) from tab_orders_1;
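
To make the idea of partition pruning concrete, here is a toy model in Python. It is not how PostgreSQL implements pruning; it only assumes a hypothetical list-partitioned layout in which partition tab_orders_n holds the rows with state_code = n, and shows which partitions a given where condition would need to touch.

```python
# Hypothetical layout: list partition tab_orders_<n> holds rows with state_code = n.
PARTITIONS = {f"tab_orders_{n}": {n} for n in range(1, 51)}

def prune(state_codes):
    """Return the partitions whose value set overlaps the requested state_codes."""
    wanted = set(state_codes)
    return [name for name, values in PARTITIONS.items() if values & wanted]

print(prune([1]))      # partitions scanned for: WHERE state_code = 1
print(prune([3, 7]))   # partitions scanned for: WHERE state_code IN (3, 7)
```

On a real system, running EXPLAIN on the query against the parent table shows which child partitions the planner actually kept.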

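III. Partitioning methods

1. PostgreSQL partitioned tables support three main partitioning methods: range, list, and hash (hash requires PostgreSQL 11 or later).

2. Neither the number of partition levels nor the method used at each level is restricted: a first-level partition can itself be partitioned into second-level partitions, a second-level partition into third-level partitions, and so on.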
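
As an illustration of multi-level partitioning, the sketch below builds the DDL for one monthly range partition that is itself split by hash. Everything here is an assumption for the example: the parent table tab_demo is hypothetical and is taken to be partitioned BY RANGE on a timestamp column and to have an integer column uid, and the helper name two_level_ddl and the _h suffix are made up.

```python
def two_level_ddl(parent, month_start, month_end, hash_parts):
    """Build DDL for one RANGE partition that is itself split by HASH (uid)."""
    suffix = month_start[:7].replace("-", "_")   # '2020-01-01' -> '2020_01'
    child = f"{parent}_{suffix}"
    # First level: a range partition that is declared as partitioned again.
    stmts = [
        f"CREATE TABLE {child} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{month_start}') TO ('{month_end}') "
        f"PARTITION BY HASH (uid);"
    ]
    # Second level: one hash partition per remainder.
    for r in range(hash_parts):
        stmts.append(
            f"CREATE TABLE {child}_h{r} PARTITION OF {child} "
            f"FOR VALUES WITH (MODULUS {hash_parts}, REMAINDER {r});"
        )
    return stmts

for stmt in two_level_ddl("tab_demo", "2020-01-01", "2020-02-01", 4):
    print(stmt)
```

Each extra level multiplies the number of leaf tables, so deeper hierarchies should be weighed against the cost of managing many partitions.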

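IV. Notes and tips for using partitioned tables

1. Treat the partition key like the distribution key of a distributed table and do not update it. Some distributed PostgreSQL variants reject such updates outright; community PostgreSQL 11 and later accepts them, but an update that changes the target partition has to move the row, which amounts to a delete plus an insert.

2. As a convention, every uniqueness constraint on a partitioned table must include the partition key:

1) the primary key of the parent table must include the partition column;

2) unique indexes must include the partition column.

3. Choosing the partition key: pick a column that appears as often as possible in the where clauses of select, delete, and update statements, so that partition pruning can speed up the SQL.

4. In insert statements, name the partition key in the column list, for example: insert into tab_aken (id,part_col) values (1,'2021-10-16');

5. Do not create too many child partitions. In production we observed a large performance difference between 1000 child partitions and 300 child partitions.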
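
The rule about uniqueness constraints can be checked mechanically before any DDL is run. The helper below is a hypothetical illustration (the name missing_partition_columns is made up): it reports which partition-key columns a proposed primary key or unique index leaves out.

```python
def missing_partition_columns(unique_columns, partition_key):
    """Return the partition-key columns that a PRIMARY KEY / UNIQUE definition omits."""
    return [col for col in partition_key if col not in unique_columns]

# primary key (uid, info_time) on a table partitioned by info_time: nothing missing
print(missing_partition_columns(["uid", "info_time"], ["info_time"]))
# primary key (uid) alone would leave out the partition column
print(missing_partition_columns(["uid"], ["info_time"]))
```

PostgreSQL itself refuses to create a primary key or unique constraint on a partitioned table when partition-key columns are missing, so a check like this mainly helps catch the problem at review time.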

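V. Creating partitioned tables

1. Range partitioning example: PARTITION BY RANGE (partition key column);

1) Create the parent table (PostgreSQL 12):

-- Use the timestamp column info_time as the partition key, with range partitioning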

CREATE TABLE tab_aken (
  uid   integer  NOT NULL,
  info_time     timestamp NOT NULL, 
  money  decimal(5,2) NOT NULL,
  primary key (uid,info_time)
) PARTITION BY RANGE (info_time);

2) Partition by month (PostgreSQL 12)

-- Method 1: add partitions directly. The following adds 3 child partitions

CREATE TABLE aken_2020_1 PARTITION of tab_aken FOR VALUES FROM ('2020-1-01') TO ('2020-1-01'::timestamp + interval '1 month');
CREATE TABLE aken_2020_2 PARTITION of tab_aken FOR VALUES FROM ('2020-2-01') TO ('2020-2-01'::timestamp + interval '1 month');
CREATE TABLE aken_2020_3 PARTITION of tab_aken FOR VALUES FROM ('2020-3-01') TO ('2020-3-01'::timestamp + interval '1 month');

-- Method 2: use generate_series to build the SQL for 12 monthly child tables and pipe it into psql:

psql -At -h 9.22.xx.xxx -p xxx -U dbmgr -d akendb -c "SELECT 'CREATE TABLE aken_2020_' || p_month || ' PARTITION of tab_aken FOR VALUES FROM (''2020-'||p_month||'-01'') TO (''2020-'||p_month||'-01''::timestamp + interval ''1 month'');' FROM generate_series(1,12) as p_month;" | psql -h 9.22.xx.xxx -p xxx -U dbmgr -d akendb

                              ?column?                               
-------------------------------------------------------------------------------------------------------------------------------
 CREATE TABLE aken_2020_1 PARTITION of tab_aken FOR VALUES FROM ('2020-1-01') TO ('2020-1-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_2 PARTITION of tab_aken FOR VALUES FROM ('2020-2-01') TO ('2020-2-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_3 PARTITION of tab_aken FOR VALUES FROM ('2020-3-01') TO ('2020-3-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_4 PARTITION of tab_aken FOR VALUES FROM ('2020-4-01') TO ('2020-4-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_5 PARTITION of tab_aken FOR VALUES FROM ('2020-5-01') TO ('2020-5-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_6 PARTITION of tab_aken FOR VALUES FROM ('2020-6-01') TO ('2020-6-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_7 PARTITION of tab_aken FOR VALUES FROM ('2020-7-01') TO ('2020-7-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_8 PARTITION of tab_aken FOR VALUES FROM ('2020-8-01') TO ('2020-8-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_9 PARTITION of tab_aken FOR VALUES FROM ('2020-9-01') TO ('2020-9-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_10 PARTITION of tab_aken FOR VALUES FROM ('2020-10-01') TO ('2020-10-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_11 PARTITION of tab_aken FOR VALUES FROM ('2020-11-01') TO ('2020-11-01'::timestamp + interval '1 month');
 CREATE TABLE aken_2020_12 PARTITION of tab_aken FOR VALUES FROM ('2020-12-01') TO ('2020-12-01'::timestamp + interval '1 month');
(12 rows)

akendb=#
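
The same twelve statements can also be produced by a small script instead of a SQL one-liner. The sketch below is one possible way to do it, not the method used above: the function name monthly_partitions is hypothetical, and it writes both bounds as plain date literals rather than as an expression with interval '1 month'. Literal bounds also work on PostgreSQL versions before 12, which accept only constants in partition bounds.

```python
from datetime import date

def monthly_partitions(parent, prefix, year):
    """Yield one CREATE TABLE ... PARTITION OF statement per month of `year`."""
    for month in range(1, 13):
        start = date(year, month, 1)
        # The exclusive upper bound is the first day of the following month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield (
            f"CREATE TABLE {prefix}_{year}_{month} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start}') TO ('{end}');"
        )

for stmt in monthly_partitions("tab_aken", "aken", 2020):
    print(stmt)
```

The printed statements can be reviewed and then piped into psql in the same way as the generated SQL above.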

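3) Inspect the partitioned table: the parent table now has 12 partitions (the sample output was captured from a table whose columns are named sensor_id, ptime, and temperature)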

akendb=# \d+ tab_aken
                     Partitioned table "public.aken"
  Column  |      Type       | Collation | Nullable | Default | Storage | Stats target | Description 
-------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 sensor_id  | integer           |      | not null |     | plain  |       | 
 ptime    | timestamp without time zone |      | not null |     | plain  |       | 
 temperature | numeric(5,2)        |      | not null |     | main  |       | 
Partition key: RANGE (ptime)
Indexes:
  "aken_pkey" PRIMARY KEY, btree (sensor_id, ptime)
Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'),
      aken_2020_10 FOR VALUES FROM ('2020-10-01 00:00:00') TO ('2020-11-01 00:00:00'),
      aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'),
      aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'),
      aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'),
      aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'),
      aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'),
      aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'),
      aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00'),
      aken_2020_7 FOR VALUES FROM ('2020-07-01 00:00:00') TO ('2020-08-01 00:00:00'),
      aken_2020_8 FOR VALUES FROM ('2020-08-01 00:00:00') TO ('2020-09-01 00:00:00'),
      aken_2020_9 FOR VALUES FROM ('2020-09-01 00:00:00') TO ('2020-10-01 00:00:00')

4) Query data in the partitioned table

akendb=# select * from tab_aken where time_col >= '2020-05-08 11:20:16'::timestamp and time_col ...

... = '2020-06-01 00:00:00'::timestamp without time zone) AND (ptime < '2020-07-01 00:00:00'::timestamp without time zone))
Indexes:
  "aken_2020_6_pkey" PRIMARY KEY, btree (sensor_id, ptime)
  "aken_2020_6_sensor_id_idx" btree (sensor_id)
Access method: heap

akendb=#

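VII. Data maintenance with partitioned tables: dropping, adding, detaching (hiding), and re-attaching partitions

1. Dropping child partitions

-- For maintaining large tables, delete is discouraged in relational databases such as MySQL, PostgreSQL, and Oracle because of its performance impact.

-- When the data for a time range is no longer needed, simply drop the corresponding child partition; the rest of the table is unaffected and the change is transparent to the application.

-- The same applies to purging cold, old data: drop the child partition instead of running a costly delete.

1) First, list the child partitions that the parent table tab_aken currently has: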

akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric)from pg_class where relname like 'tab_aken_part_%' order by 2;
   relname   | split_part 
-----------------+------------
 tab_aken_part_1 |     1
 tab_aken_part_2 |     2
 tab_aken_part_3 |     3
 tab_aken_part_4 |     4
 tab_aken_part_5 |     5
 tab_aken_part_6 |     6
 tab_aken_part_7 |     7
 tab_aken_part_8 |     8
 tab_aken_part_9 |     9
(9 rows)

2) Drop the target child partitions:

-- The statement can be assembled by a query and run from a scheduled job, for example dropping the first N child partitions each time (limit N)

-- The following assembles a statement that drops the 2 oldest child partitions

akendb=# select 'drop table '||string_agg(relname, ',')||';' as drop_target_partitions from ( select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part%' order by 2 limit 2 ) as aaa;
     drop_target_partitions          
---------------------------------------------
 drop table tab_aken_part_1,tab_aken_part_2;
(1 row)

akendb=# drop table tab_aken_part_1,tab_aken_part_2;
DROP TABLE
akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric)from pg_class where relname like 'tab_aken_part_%' order by 2;
   relname   | split_part 
-----------------+------------
 tab_aken_part_3 |     3
 tab_aken_part_4 |     4
 tab_aken_part_5 |     5
 tab_aken_part_6 |     6
 tab_aken_part_7 |     7
 tab_aken_part_8 |     8
 tab_aken_part_9 |     9
(7 rows)

akendb=# select 'drop table '||string_agg(relname, ',')||';' as drop_target_partitions from ( select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part%' order by 2 limit 2 ) as aaa;
      drop_target_partitions          
---------------------------------------------
 drop table tab_aken_part_3,tab_aken_part_4;
(1 row)

akendb=# drop table tab_aken_part_3,tab_aken_part_4;
DROP TABLE
akendb=# select relname, cast(split_part(relname,'tab_aken_part_', 2) as numeric) from pg_class where relname like 'tab_aken_part_%' order by 2;
   relname   | split_part 
-----------------+------------
 tab_aken_part_5 |     5
 tab_aken_part_6 |     6
 tab_aken_part_7 |     7
 tab_aken_part_8 |     8
 tab_aken_part_9 |     9
(5 rows)

akendb=#
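
The selection logic used in the session above (sort the child partitions by their numeric suffix, take the first N, emit one DROP TABLE) can also live in a scheduled script. The following is a minimal sketch under the assumption that partition names follow the tab_aken_part_<number> pattern; the function name drop_oldest is hypothetical, and the list of names would come from a catalog query such as the one on pg_class shown earlier.

```python
def drop_oldest(partition_names, prefix, n):
    """Build one DROP TABLE statement for the n partitions with the smallest suffix."""
    numbered = sorted(
        (int(name[len(prefix):]), name)
        for name in partition_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    )
    targets = [name for _, name in numbered[:n]]
    # Return None when there is nothing to drop, so the caller can skip the run.
    return f"DROP TABLE {', '.join(targets)};" if targets else None

names = [f"tab_aken_part_{i}" for i in range(1, 10)]
print(drop_oldest(names, "tab_aken_part_", 2))
```

Sorting on the integer suffix rather than on the name keeps tab_aken_part_10 after tab_aken_part_9, which a plain string sort would get wrong.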

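2. Adding child partitions

-- Child partitions are added mainly to receive new business data that falls outside the range of the existing child partitions

The parent table currently has these child partitions: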

akendb=# select * from pg_tables where tablename like 'tab_aken%' order by tablename;
 schemaname |  tablename  | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity 
------------+-----------------+------------+------------+------------+----------+-------------+-------------
 public   | tab_aken    | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_1 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_2 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_3 | dbmgr   |      | t     | f    | f      | f
(4 rows)

akendb=#

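Adding child partitions, method 1: append after the highest existing child partition

-- The following adds 3 monthly child partitions to the range-partitioned parent table. The add partitions syntax is not part of community PostgreSQL; it is provided by some PostgreSQL-derived distributions, so use method 2 where it is unavailable.

akendb=# alter table tab_aken add partitions 3;  -- by default, 3 child partitions are appended after the highest existing one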
ALTER TABLE
akendb=# select * from pg_tables where tablename like 'tab_aken%' order by tablename;
 schemaname |  tablename  | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity 
------------+-----------------+------------+------------+------------+----------+-------------+-------------
 public   | tab_aken    | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_1 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_2 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_3 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_4 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_5 | dbmgr   |      | t     | f    | f      | f
 public   | tab_aken_part_6 | dbmgr   |      | t     | f    | f      | f
(7 rows)

akendb=#

Adding child partitions, method 2: specify the partition range

CREATE TABLE aken_2020_7 PARTITION of tab_aken FOR VALUES FROM ('2020-7-01') TO ('2020-7-01'::timestamp + interval '1 month');
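
When partitions are added on a schedule, the bounds of the next one follow from the newest existing partition: its exclusive upper bound becomes the new lower bound. The sketch below illustrates that step for monthly partitions. The function name next_month_partition is hypothetical, and it assumes the upper bound passed in is the first day of a month.

```python
from datetime import date

def next_month_partition(parent, prefix, last_upper):
    """Build DDL for the month starting at `last_upper`, the exclusive upper
    bound of the newest existing partition."""
    start = last_upper
    # First day of the following month, rolling over into the next year after December.
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return (
        f"CREATE TABLE {prefix}_{start.year}_{start.month} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

# If the newest partition ends at 2021-01-01, the next one covers January 2021.
print(next_month_partition("tab_aken", "aken", date(2021, 1, 1)))
```

Because range bounds are inclusive at the lower end and exclusive at the upper end, reusing the previous upper bound as the new lower bound leaves neither a gap nor an overlap.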

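3. Detaching (hiding), unbinding, and re-attaching child partitions

Rather than dropping a child partition outright, we recommend first detaching it from the parent table temporarily. If its data turns out to be needed later, simply attach it again.

1) Detach a child partition (also called unbinding)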

akendb=# alter table tab_aken detach partition aken_2020_6;
ALTER TABLE
akendb=# \d+ tab_aken    
                     Partitioned table "public.tab_aken"
  Column  |      Type       | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+-------------+-----------+----------+---------+---------+--------------+-------------
 sensor_id  | integer           |      | not null |     | plain  |       | 
 ptime    | timestamp without time zone |      | not null |     | plain  |       | 
 temperature | numeric(5,2)        |      | not null |     | main  |       | 
Partition key: RANGE (ptime)
Indexes:
  "aken_pkey" PRIMARY KEY, btree (sensor_id, ptime)
  "idx_sensor_id" btree (sensor_id)
Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'),
      aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'),
      aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'),
      aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'),
      aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'),
      aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'),
      aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'),
      aken_2020_7 FOR VALUES FROM ('2020-07-01 00:00:00') TO ('2020-08-01 00:00:00'),
      aken_2020_8 FOR VALUES FROM ('2020-08-01 00:00:00') TO ('2020-09-01 00:00:00'),
      aken_2020_9 FOR VALUES FROM ('2020-09-01 00:00:00') TO ('2020-10-01 00:00:00')

akendb=#

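Compared with a direct DROP, this only removes the child table from its parent. The data stored in the child table remains accessible, because the child is now an ordinary table: select * from tab_xxx (the child partition's table name).

At this point both DBAs and application teams can perform whatever maintenance the table needs, such as data cleanup or archiving.

Once that routine work is done, decide whether to drop the table (DROP TABLE), empty it first (TRUNCATE TABLE), or attach it to the parent table again.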
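
The three possible outcomes after a detach can be written down as a small statement builder. This is only a sketch: the function name maintenance_plan and the outcome labels are hypothetical, and the generated statements use the standard DETACH PARTITION, ATTACH PARTITION, DROP TABLE, and TRUNCATE TABLE forms.

```python
def maintenance_plan(parent, child, lower, upper, outcome):
    """Return the statements for detaching `child` and then applying `outcome`:
    'drop', 'truncate', or 'reattach'."""
    detach = f"ALTER TABLE {parent} DETACH PARTITION {child};"
    follow_up = {
        "drop": f"DROP TABLE {child};",
        "truncate": f"TRUNCATE TABLE {child};",
        "reattach": (
            f"ALTER TABLE {parent} ATTACH PARTITION {child} "
            f"FOR VALUES FROM ('{lower}') TO ('{upper}');"
        ),
    }
    if outcome not in follow_up:
        raise ValueError(f"unknown outcome: {outcome}")
    return [detach, follow_up[outcome]]

for stmt in maintenance_plan("tab_aken", "aken_2020_6",
                             "2020-06-01", "2020-07-01", "reattach"):
    print(stmt)
```

Keeping the detach as a separate first step leaves time between the two statements to inspect or archive the detached table before the final decision is made.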

2) Re-attach a child partition

akendb=# ALTER TABLE tab_aken ATTACH PARTITION aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00');
ALTER TABLE
akendb=# \d+ tab_aken
                     Partitioned table "public.tab_aken"
  Column  |      Type       | Collation | Nullable | Default | Storage | Stats target | Description 
-------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 sensor_id  | integer           |      | not null |     | plain  |       | 
 ptime    | timestamp without time zone |      | not null |     | plain  |       | 
 temperature | numeric(5,2)        |      | not null |     | main  |       | 
Partition key: RANGE (ptime)
Indexes:
  "aken_pkey" PRIMARY KEY, btree (sensor_id, ptime)
  "idx_sensor_id" btree (sensor_id)
Partitions: aken_2020_1 FOR VALUES FROM ('2020-01-01 00:00:00') TO ('2020-02-01 00:00:00'),
      aken_2020_11 FOR VALUES FROM ('2020-11-01 00:00:00') TO ('2020-12-01 00:00:00'),
      aken_2020_12 FOR VALUES FROM ('2020-12-01 00:00:00') TO ('2021-01-01 00:00:00'),
      aken_2020_2 FOR VALUES FROM ('2020-02-01 00:00:00') TO ('2020-03-01 00:00:00'),
      aken_2020_3 FOR VALUES FROM ('2020-03-01 00:00:00') TO ('2020-04-01 00:00:00'),
      aken_2020_4 FOR VALUES FROM ('2020-04-01 00:00:00') TO ('2020-05-01 00:00:00'),
      aken_2020_5 FOR VALUES FROM ('2020-05-01 00:00:00') TO ('2020-06-01 00:00:00'),
      aken_2020_6 FOR VALUES FROM ('2020-06-01 00:00:00') TO ('2020-07-01 00:00:00'),