MySQL 中的压缩技术

数据运维 2024-01-29 大白菜程序猿手机阅读

为什么要有这篇文章？

MySQL 中数据压缩技术主要有三种：表压缩，页压缩，列压缩。

在互联网上，关于页压缩的源码解析文章比较多，但是关于表压缩，列压缩的源码解析的文章处于空白状态，没有相关资料，这就为一些对压缩技术比较感兴趣的同学提出了一些挑战。

本文旨在通过对表压缩，页压缩，列压缩的源码进行解析，同时做出使用上的说明，以填补这部分的空白。

1. 综述

数据压缩可以减少存储空间，降低存储成本，增加 IO 效率，是降低数据库整体使用成本的重要手段，MySQL 目前具备的压缩能力，包括 InnoDB 存储引擎层提供的表数据的压缩，以及在 Server 层实现的 binlog 日志压缩两种。

MySQL 中有一个分支 MyRocks 对压缩的支持比较多 IO 也比较优秀，被 Facebook 大规模使用，本文不做重点分析，对 MyRocks 感兴趣可以参考网易 MyRocks 使用和优化。

本文主要介绍 MySQL 中 InnoDB 存储引擎层的压缩，本文所有代码基于 MySQL 8.0.28。

1.1 MySQL 压缩要解决的问题

笔者认为，MySQL 的数据压缩要解决两个问题，第一个问题是通过压缩把数据需要的存储空间减少；第二个问题是对压缩后剩余空间的利用。

MySQL 中的表压缩解决第一个问题使用的方法是通过 zlib 压缩算法提供的接口，通过 Zlib 算法的压缩以及解压使得数据占有的存储空间减少；通过 KEY_BLOCK_SIZE 值的设置，如果成功压缩，就可以把一个页面中 KEY_BLOCK_SIZE 之外的空间通过 MySQL 的调度得以使用。

MySQL 中的页压缩解决第一个问题的方法是通过 zlib 以及 lz4 压缩算法提供的接口来实现，一个 MySQL 页面可以占用更少的操作系统页面；通过操作系统的 punching hole 功能把剩余空间得以调度以及使用。

2 MySQL 中的表压缩

本节主要包括两部分：表压缩的使用与表压缩的代码实现。

2.1 表压缩的使用

2.1.1 如何创建一个压缩表

在 file_per_table 的表空间或者 general 表空间里，可以使用表压缩，系统表不支持对表级别的压缩；用户在设置好 innodb_file_format 之后（仅支持 Barracuda ），再把 ROW_FORMAT （COMPRESSED）与 KEY_BLOCK_SIZE 都设置为对应的值，可以启用表压缩。

file_per_table space 创建压缩表：

mysql> SET GLOBAL innodb_file_per_table=1;


mysql> ## 5.7 设置,  8.0 取消此参数，5.7 默认值 Barracuda

mysql> SET GLOBAL innodb_file_format=Barracuda;

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY) mysql> ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

general space 创建压缩表：

mysql> CREATE TABLESPACE `ts2` ADD DATAFILE 'ts2.ibd' mysql> FILE_BLOCK_SIZE = 8192 Engine=InnoDB;

mysql> CREATE TABLE t4 (c1 INT PRIMARY KEY) mysql> TABLESPACE ts2 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

官方文档中对表压缩参数设置限制了限制，最常用的是 KEY_BLOCK_SIZE = 8K.

2.1.2 压缩表使用的限制

1.KEY_BLOCK_SIZE 原则上不超过 innodb_page_size；但也不能过小，如果 KEY_BLOCK_SIZE 如果指定的值太小，则当数据值无法压缩到足以容纳每页中的多行时，重新组织页面会产生额外的开销；因此会有硬性的规定，KEY_BLOCK_SIZE 的值如果太小，会导致 CREATE 或者 ALTER SQL 执行失败。

2.在使用压缩表的时候，可以适当调大 buffer_pool_size ，以增强性能。

3.FILE_BLOCK_SIZE 没有设置的时候，默认值是 innodb_page_size ，这时候不允许使用 COMPREESSION 功能。

2.1.3 表压缩参数

参数		参数说明
innodb_compression_failure_threshold_pct	GLOBAL	当压缩失败的次数 / 总压缩次数达到该值时，MySQL 会动态增加每个页面的额外可用空间，以此来提高压缩成功率
innodb_compression_pad_pct_max	GLOBAL	每个页面中预留空间占总空间的比例
innodb_compression_level	GLOBAL	压缩时使用的 zlib 算法的压缩级别（0-9 , default 6），更高的压缩级别意味着更高的压缩率与更高的 CPU 消耗
innodb_log_compressed_pages	GLOBAL	是否在 redolog 中记载页面的 re-compression 信息，default ON

2.1.4 表压缩监测

全部压缩表的性能监测可以在 INNODB_CMP 以及 INNODB_CMP_RESET 表中看到；单独压缩表的性能监测数据可以在 INNODB_CMP_PER_INDEX 以及 INNODB_CMP_PER_INDEX_RESET 表中看到。

INNODB_CMP 表跟 INNODB_CMP_RESET 表存储的信息都是全部压缩表的数据，这两个表里面的所有字段均一致，不同的是 INNODB_CMP_RESET 表在每次查询之后都会进行一次清零操作，所以 INNODB_CMP_RESET 表查询到的是一段时间的数据，INNODB_CMP 表查询到的是 MySQL 服务启动至今的数据。

INNODB_CMP_PER_INDEX 表提供更细粒度的表级别的压缩数据，本文会给出查询对应表的示例。

2.2 表压缩的实现

2.2.1 buffer_pool 中对表压缩的支持

从buffer_pool中获取一个压缩页的过程是：从磁盘上把压缩页取出 (buf_buddy_alloc)，取出之后，把page的状态置为 BUF_BLOCK_ZIP_PAGE。page被解压之后，状态被置为 BUF_BLOCK_FILE_PAGE，同时加入unzip_LRU中。如果内存资源紧张，解压页将会被回收，如果这个时候page无更新，状态是BUF_BLOCK_ZIP_PAGE，否则状态是BUF_BLOCK_ZIP_DIRTY。

struct buf_pool_t { /** zip_hash mutex */ BufListMutex zip_hash_mutex;


  /** 压缩页链表 */

  UT_LIST_BASE_NODE_T(buf_page_t, zip_list) zip_list;
  /** Hash table of buf_block_t blocks whose frames are allocated to the zip

  buddy system, indexed by block->frame */

  hash_table_t *zip_hash;
  /** 解压后的压缩页链表 */

  UT_LIST_BASE_NODE_T(buf_block_t, unzip_LRU) unzip_LRU;
  /** 记录没被修改过的压缩页，debug 模式下才有 */

  UT_LIST_BASE_NODE_T(buf_page_t, list) zip_clean;

/** 支持压缩表的 free lists, 不同页面大小对应数组的不同值 */ UT_LIST_BASE_NODE_T(buf_buddy_free_t, list) zip_free[BUF_BUDDY_SIZES_MAX]; }

struct buf_block_t { /* page 页的元信息 */ buf_page_t page; /* 真正存储数据的 page */ byte *frame; };

/* page 的一些元信息 */ class buf_page_t { /* 压缩页 */ page_zip_des_t zip; /* 是否在 buf_pool->zip_hash 链表中 */ bool in_zip_hash; };

/** 描述压缩页结构体 */ struct page_zip_des_t { /** 压缩页数据 */ page_zip_t *data; /** modification log 起始偏移量 */ uint16_t m_start; /** modification log 终结偏移量 */ uint16_t m_end; /** modification log 是否为空 */ bool m_nonempty; /** 表示压缩页大小 */ uint8_t ssize; };

2.2.2 B-tree 对压缩表的支持

B-tree 压缩表的数据插入的函数逻辑跟非压缩表的数据插入逻辑在 page_cur_tuple_insert函数之前都是一样的，不同的是，在 page_cur_tuple_insert函数的执行过程中，压缩表的情况会进入 page_cur_insert_rec_zip函数，这个函数会同时更新压缩页与非压缩页的数据。更新压缩页数据的时候，不会直接更新记录，而是会将更新信息记载到压缩页的modifition log（压缩页中负责记录修改信息的log，存储在压缩页的末尾）中。

具体的压缩表数据插入逻辑：

数据插入主要函数： /* 向 clust_index 中插入一条数据 */ -->row_ins_clust_index_entry_low /* 乐观插入 */ --> btr_cur_optimistic_insert /* 向页面中插入 record, 成功的话返回这条record, 失败的话返回 NULL */ --> page_cur_tuple_insert /* 同时在压缩页与非压缩页插入一条记录 */ --> page_cur_insert_rec_zip /* 向 record 中写入 offset */ --> page_zip_dir_insert /* 写入压缩页面 */ --> page_zip_write_rec

rec_t *page_cur_insert_rec_zip( page_cur_t *cursor, /*!< in/out: page cursor */ dict_index_t *index, /*!< in: record descriptor */ const rec_t *rec, /*!< in: pointer to a physical record */ ulint *offsets, /*!< in/out: rec_get_offsets(rec, index) */ mtr_t *mtr) /*!rec != insert_rec);


  {

    /* next record after current before the insertion */

    const rec_t *next_rec = page_rec_get_next_low(cursor->rec, TRUE);

    ...
    page_rec_set_next(insert_rec, next_rec);

    page_rec_set_next(cursor->rec, insert_rec);

  }
  page_header_set_field(page, page_zip, PAGE_N_RECS, 1 + page_get_n_recs(page));
  /* 5. Set the n_owned field in the inserted record to zero,

  and set the heap_no field */

  rec_set_n_owned_new(insert_rec, nullptr, 0);

  rec_set_heap_no_new(insert_rec, heap_no);
  UNIV_MEM_ASSERT_RW(rec_get_start(insert_rec, offsets),

                     rec_offs_size(offsets));
  page_zip_dir_insert(page_zip, cursor->rec, free_rec, insert_rec);
  /* 6. Update the last insertion info in page header */
  last_insert = page_header_get_ptr(page, PAGE_LAST_INSERT);

  ut_ad(!last_insert || rec_get_node_ptr_flag(last_insert) ==

                            rec_get_node_ptr_flag(insert_rec));
  if (!dict_index_is_spatial(index)) {

    if (UNIV_UNLIKELY(last_insert == nullptr)) {

      page_header_set_field(page, page_zip, PAGE_DIRECTION, PAGE_NO_DIRECTION);

      page_header_set_field(page, page_zip, PAGE_N_DIRECTION, 0);
    } else if ((last_insert == cursor->rec) &&

               (page_header_get_field(page, PAGE_DIRECTION) != PAGE_LEFT)) {

      page_header_set_field(page, page_zip, PAGE_DIRECTION, PAGE_RIGHT);

      page_header_set_field(page, page_zip, PAGE_N_DIRECTION,

                            page_header_get_field(page, PAGE_N_DIRECTION) + 1);
    } else if ((page_rec_get_next(insert_rec) == last_insert) &&

               (page_header_get_field(page, PAGE_DIRECTION) != PAGE_RIGHT)) {

      page_header_set_field(page, page_zip, PAGE_DIRECTION, PAGE_LEFT);

      page_header_set_field(page, page_zip, PAGE_N_DIRECTION,

                            page_header_get_field(page, PAGE_N_DIRECTION) + 1);

    } else {

      page_header_set_field(page, page_zip, PAGE_DIRECTION, PAGE_NO_DIRECTION);

      page_header_set_field(page, page_zip, PAGE_N_DIRECTION, 0);

    }

  }
  page_header_set_ptr(page, page_zip, PAGE_LAST_INSERT, insert_rec);
  /* 7. It remains to update the owner record. */

  {

    rec_t *owner_rec = page_rec_find_owner_rec(insert_rec);

    ulint n_owned;
    n_owned = rec_get_n_owned_new(owner_rec);

    rec_set_n_owned_new(owner_rec, page_zip, n_owned + 1);
    /* 8. Now we have incremented the n_owned field of the owner

    record. If the number exceeds PAGE_DIR_SLOT_MAX_N_OWNED,

    we have to split the corresponding directory slot in two. */
    if (UNIV_UNLIKELY(n_owned == PAGE_DIR_SLOT_MAX_N_OWNED)) {

      page_dir_split_slot(page, page_zip, page_dir_find_owner_slot(owner_rec));

    }

  }
  /* 9. write compressed page */

  page_zip_write_rec(page_zip, insert_rec, index, offsets, 1);
  /* 10. Write log record of the insert */

  if (UNIV_LIKELY(mtr != nullptr)) {

    page_cur_insert_rec_write_log(insert_rec, rec_size, cursor->rec, index,

                                  mtr);

  }

return (insert_rec); }

创建压缩页的代码逻辑：

压缩页创建主要函数： /* 创建一个压缩页 */ -->page_create_zip /* 创建一个页的底层函数，填充文件头等 */ -->page_create_low /* 写入一条 MLOG_ZIP_PAGE_COMPRESS 类型的 redolog */ --> page_zip_compress

/** Compress a page. @return true on success, false on failure; page_zip will be left intact on failure. */ ibool page_zip_compress(page_zip_des_t *page_zip, /*!< in: size; out: data, n_blobs, m_start, m_end, m_nonempty */ const page_t *page, /*!< in: uncompressed page */ dict_index_t *index, /*!< in: index tree */ ulint level, /*!ssize - 1].compressed_ok++; page_zip_stat[page_zip->ssize - 1].compress_time += time_diff; if (cmp_per_index_enabled) { mutex_enter(&page_zip_stat_per_index_mutex); page_zip_stat_per_index[ind_id].compressed_ok++; page_zip_stat_per_index[ind_id].compress_time += time_diff; mutex_exit(&page_zip_stat_per_index_mutex); }

return (TRUE); }

2.2.3 redolog 对压缩表的支持

redolog 冲对压缩页面的处理跟普通的页面有所区别，压缩页面在 redo 中的处理是：把系统信息列存入固定的位置，比如 trx_id 等信息；然后把数据列的修改写入 modified log 中，部分元数据信息也会经过 mlog_open_and_write_index 存储在 redolog 中，列数，定长列信息，变长列信息等。

redolog 类型
MLOG_ZIP_WRITE_NODE_PTR	在非叶结点上写入别的 page 的指针
MLOG_ZIP_WRITE_BLOB_PTR	写入 blob 页的指针
MLOG_ZIP_WRITE_HEADER	写页头部
MLOG_ZIP_PAGE_COMPRESS	压缩页
MLOG_ZIP_PAGE_COMPRESS_NO_DATA	压缩空的页
MLOG_ZIP_PAGE_REORGANIZE	重新组织页面

static byte *recv_parse_or_apply_log_rec_body( mlog_id_t type, byte *ptr, byte *end_ptr, space_id_t space_id, page_no_t page_no, buf_block_t *block, mtr_t *mtr, ulint parsed_bytes, lsn_t start_lsn) {


...

...
    case MLOG_ZIP_WRITE_NODE_PTR:
      ut_ad(!page || fil_page_type_is_index(page_type));
      ptr = page_zip_parse_write_node_ptr(ptr, end_ptr, page, page_zip);
      break;
    case MLOG_ZIP_WRITE_BLOB_PTR:
      ut_ad(!page || fil_page_type_is_index(page_type));
      ptr = page_zip_parse_write_blob_ptr(ptr, end_ptr, page, page_zip);
      break;
    case MLOG_ZIP_WRITE_HEADER:
      ut_ad(!page || fil_page_type_is_index(page_type));
      ptr = page_zip_parse_write_header(ptr, end_ptr, page, page_zip);
      break;
    case MLOG_ZIP_PAGE_COMPRESS:
      /* Allow anything in page_type when creating a page. */

      ptr = page_zip_parse_compress(ptr, end_ptr, page, page_zip);

      break;
    case MLOG_ZIP_PAGE_COMPRESS_NO_DATA:
      if (nullptr != (ptr = mlog_parse_index(ptr, end_ptr, true, &index))) {

        ut_a(!page || ((ibool) !!page_is_comp(page) ==

                       dict_table_is_comp(index->table)));
        ptr = page_zip_parse_compress_no_data(ptr, end_ptr, page, page_zip,

                                              index);

      }
      break;

...

}

2.2.4 压缩算法

/* zip 算法默认的压缩级别 */ #define DEFAULT_COMPRESSION_LEVEL 6

/* zip 算法的压缩级别，默认是上面的 6 ，可以取 0-9 */ uint page_zip_level = DEFAULT_COMPRESSION_LEVEL;

/* 值得关注的是，表压缩跟页压缩使用相同的参数控制 zip 算法的压缩级别 */ static MYSQL_SYSVAR_UINT( compression_level, page_zip_level, PLUGIN_VAR_RQCMDARG, "Compression level used for compressed row format. 0 is no compression" ", 1 is fastest, 9 is best compression and default is 6.", nullptr, nullptr, DEFAULT_COMPRESSION_LEVEL, 0, 9, 0);

3. MySQL 中的页压缩

3.1 页压缩的使用

在 file_per_table 的表空间里，在设置 innodb_file_per_table 参数设置为 ON 之后，InnoDB 支持表空间中的表的页级别的压缩，这个功能叫页压缩，也叫透明压缩。用户的表结构中有一个选项 COMPRESSION ，可以通过设置该值来控制该表对应的页面是否会进行压缩；目前 MySQL 对透明压缩的支持算法包括 zlib，lz4 两种压缩算法。

3.1.1 页压缩的创建与禁用

页压缩的使用有两种方式，一种方式是通过 CREATE TABLE 的方式，另一种方式是通过 ALTER TABLE 的方式；只需要把 COMPRESSION 选项设置为 NONE，就可以禁用页压缩功能。

3.1.2 页压缩的监控

/* 通过 CREATE TABLE 的方式来创建页压缩表 */ CREATE TABLE t1 (c1 INT) COMPRESSION="zlib";


/* 通过 ALTER TABLE 的方式来创建压缩表 */

ALTER TABLE t1 COMPRESSION="zlib";

OPTIMIZE TABLE t1;

/* 通过设置 COMPRESSION 的方式禁用压缩表*/ ALTER TABLE t1 COMPRESSION="None"; OPTIMIZE TABLE t1;

监控项	含义
FS_BLOCK_SIZE	打孔使用的单位大小
FILE_SIZE	表示文件的最大大小，未压缩
ALLOCATED_SIZE	磁盘上分配的空间量

# Create the employees table with Zlib page compression


CREATE TABLE employees (

    emp_no      INT             NOT NULL,

    birth_date  DATE            NOT NULL,

    first_name  VARCHAR(14)     NOT NULL,

    last_name   VARCHAR(16)     NOT NULL,

    gender      ENUM ('M','F')  NOT NULL,

    hire_date   DATE            NOT NULL,

    PRIMARY KEY (emp_no)

) COMPRESSION="zlib";
# Insert data (not shown)
# Query page compression metadata in INFORMATION_SCHEMA.INNODB_TABLESPACES

mysql> SELECT SPACE, NAME, FS_BLOCK_SIZE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME='employees/employees'\G *************************** 1. row *************************** SPACE: 45 NAME: employees/employees FS_BLOCK_SIZE: 4096 FILE_SIZE: 23068672 ALLOCATED_SIZE: 19415040

3.2 页压缩的实现

页压缩的主要代码逻辑是在 file read 以及 file wirte 的时候完成的，对原来的代码侵入性不大。 3.2.1 页面压缩的逻辑

页面压缩的整体逻辑是先计算页面压缩后的大小，如果压缩后的大小比未压缩的大小至少小 1 个 block，执行压缩。否则，不会执行压缩。如果需要执行压缩，需要把压缩后的页面大小，以及压缩信息记录到页面头部。同时页面压缩，只压缩数据本身跟数据尾部，不压缩数据头部。

/** Compress a data page @param[in] compression Compression algorithm @param[in] block_size File system block size @param[in] src Source contents to compress @param[in] src_len Length in bytes of the source @param[out] dst Compressed page contents @param[out] dst_len Length in bytes of dst contents @return buffer data, dst_len will have the length of the data */ byte *os_file_compress_page(Compression compression, ulint block_size, byte *src, ulint src_len, byte *dst, ulint *dst_len) { ulint len = 0; ulint compression_level = page_zip_level; ulint page_type = mach_read_from_2(src + FIL_PAGE_TYPE);


  /* Must compress to = LOG_FILE_HDR_SIZE) {

      Encryption encryption(type.encryption_algorithm());
      ret = encryption.decrypt_log(type, buf, src_len, scratch, len);

    }
    return (ret);

  } else if (type.is_read()) {

    ut_ad(!type.is_row_log());

    Encryption encryption(type.encryption_algorithm());
    ret = encryption.decrypt(type, buf, src_len, scratch, len);
    if (ret == DB_SUCCESS) {

      return (os_file_decompress_page(type.is_dblwr(), buf, scratch, len));

    } else {

      return (ret);

    }

  } else if (type.punch_hole()) {

    ut_ad(len