基于ClickHouse实现Topurl功能

2023年 9月 27日 72.8k 0

背景

最近接到了mcdn团队topurl的需求,线上有一份hulk-nginx日志,需要分别按照分钟,小时,天粒度计算每个域名url的pv排序,并且给出url的省份和range分布信息。
把需求拆解,需要计算每个域名分钟,小时,天粒度,按照url,area,range的pv排序。
给的示例如下:

{
  "url_list":[ // Url列表
    {
      "pros":[ // 省份列表
        {
          "pv":32, // 省份Pv
          "pro":"shanghai" // 省份
        },
        {
          "pv":61,
          "pro":"fujian"
        },
        {
          "pv":3,
          "pro":"ningxia"
        }
      ],
      "pv":1943,  // Url Pv
      "range":"", // 分片信息
      "uri":"/vod/product/xxxxx"
    },
    {
      "pros":[
        {
          "pv":32,
          "pro":"shanghai"
        },
        {
          "pv":61,
          "pro":"fujian"
        },
        {
          "pv":3,
          "pro":"ningxia"
        }
      ],
      "pv":1943,
      "range":"",
      "uri":"/vod/product/xxx"
    }
  ],
  "total_url":314121, // Url总个数
  "total_pv":1842272, // Url总Pv
  "top_url":253,      // topUrl总个数
  "top_pv":92142,     // topUrl总Pv
  "top_percent":5,    // topUrl比例
  "domain":"ww.jd.com", // 域名
  "date":"20230530"   // 查询日期
}

方案设计

结合现有的架构,设计出基于ck实现topurl的方案:

未命名文件.png
SparkStreaming
使用SparkStreaming对日志进行分钟级别的聚合,维度为域名,url,area,range,指标为pv
ClickHouse
1. 分钟粒度表
底表为分钟级粒度的表,此表spark直接写入,引擎为ReplicatedSummingMergeTree会对数据进行汇总,保存7天。
本地表:

CREATE TABLE default.mcdn_topurl_min_local
(
    `time` DateTime,
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/mcdn_topurl_min_local/{shard}', '{replica}')
PARTITION BY toStartOfHour(time)
ORDER BY (time, domain, uri, range, area)
TTL time + toIntervalDay(7), time TO VOLUME 'hot', time + toIntervalHour(12) TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold_police', index_granularity = 8192, max_suspicious_broken_parts = 1000

分布式表

│ CREATE TABLE default.mcdn_topurl_min
(
    `time` DateTime,
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = Distributed('rawlog_cluster', 'default', 'mcdn_topurl_min_local', rand()) │

2. 小时粒度表
物化视图,基于分钟粒度表生成
本地表

CREATE MATERIALIZED VIEW default.mcdn_topurl_hour_local
(
    `time` DateTime('Asia/Shanghai'),
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/mcdn_topurl_hour_local/{shard}', '{replica}')
PARTITION BY toStartOfHour(time)
ORDER BY (time, domain, uri, area, range)
TTL time + toIntervalDay(7), time TO VOLUME 'hot', time + toIntervalHour(12) TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold_police', index_granularity = 8192, max_suspicious_broken_parts = 1000 AS
SELECT
    toStartOfHour(time) AS time,
    domain,
    uri,
    range,
    area,
    sum(pv) AS pv
FROM default.mcdn_topurl_min_local
GROUP BY (time, domain, uri, range, area)

分布式表

CREATE TABLE default.mcdn_topurl_hour
(
    `time` DateTime,
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = Distributed('rawlog_cluster', 'default', 'mcdn_topurl_hour_local', rand())

3. 天粒度表
物化视图,基于分钟粒度表生成
本地表

CREATE MATERIALIZED VIEW default.mcdn_topurl_day_local
(
    `time` DateTime('Asia/Shanghai'),
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/mcdn_topurl_day_local/{shard}', '{replica}')
PARTITION BY toStartOfDay(time)
ORDER BY (time, domain, uri, range, area)
TTL time + toIntervalDay(7), time TO VOLUME 'hot', time + toIntervalHour(12) TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold_police', index_granularity = 8192, max_suspicious_broken_parts = 1000 AS
SELECT
    toStartOfDay(time) AS time,
    domain,
    uri,
    range,
    area,
    sum(pv) AS pv
FROM default.mcdn_topurl_min_local
GROUP BY (time, domain, uri, range, area)

分布式表

CREATE TABLE default.mcdn_topurl_day
(
    `time` DateTime,
    `domain` String,
    `uri` String,
    `range` String,
    `area` String,
    `pv` UInt64
)
ENGINE = Distributed('rawlog_cluster', 'default', 'mcdn_topurl_day_local', rand())

性能测试

通过打量的方式进行性能测试,使用的测试工具为clickhouse-benchmark

报告说明(clickhouse.com/docs/zh/ope…):
QPS:服务端每秒处理的查询数量
RPS:服务器每秒读取多少行
MiB/s:服务器每秒读取多少字节的数据
result RPS:服务端每秒生成多少行的结果集数据
result MiB/s.服务端每秒生成多少字节的结果集数据
单条SQL多次测试

echo "select uri, area, range, sum(pv) as pv from mcdn_topurl_min where time >= '2023-09-11
19:00:00' and time < '2023-09-11 19:10:00' and domain='pid-onlinetest35.jcloud.com' group by
uri, area, range order by pv desc limit 850000, 10000;" | clickhouse-benchmark -i 10
Queries executed: 10.

localhost:9000, queries 10, QPS: 0.790, RPS: 1993929.073, MiB/s: 226.094, result RPS: 4044.542, result MiB/s: 0.308.

0.000%		1.097 sec.	
10.000%		1.097 sec.	
20.000%		1.101 sec.	
30.000%		1.126 sec.	
40.000%		1.155 sec.	
50.000%		1.167 sec.	
60.000%		1.167 sec.	
70.000%		1.256 sec.	
80.000%		1.265 sec.	
90.000%		1.693 sec.	
95.000%		1.698 sec.	
99.000%		1.698 sec.	
99.900%		1.698 sec.	
99.990%		1.698 sec.	

指定多条sql测试

clickhouse-benchmark -i 5 < ./muti-sqls
Queries executed: 5 (250.000%).

localhost:9000, queries 5, QPS: 1.413, RPS: 2378507.424, MiB/s: 357.696, result RPS: 14130.295, result MiB/s: 1.761.

0.000%		0.649 sec.	
10.000%		0.649 sec.	
20.000%		0.654 sec.	
30.000%		0.654 sec.	
40.000%		0.688 sec.	
50.000%		0.688 sec.	
60.000%		0.688 sec.	
70.000%		0.730 sec.	
80.000%		0.730 sec.	
90.000%		0.817 sec.	
95.000%		0.817 sec.	
99.000%		0.817 sec.	
99.900%		0.817 sec.	
99.990%		0.817 sec.	

SparkStreaming写入CK方式

SparkStreaming通过负载均衡域名,均匀的写入ck的本地表中。

image.png

需要注意的是,ck推荐将数据直接写入本地表,原因如下:
ck的分布式表写入流程:

  • 写入本地分片数据
  • 将其余分片数据存入临时目录
  • 和远端建立连接
  • 发送远端数据
  • 在这个过程中:

  • 写入part数变多,容易merge不过来,too many parts
  • 数据分发比较慢
  • 写放大 tcp连接数变多
  • 一般推荐使用直接写入本地表的方式。

    优化

    Seatunnel的ClickHouse File
    基于这种方案,可以降低ck写入的压力
    segmentfault.com/a/119000004…

    Projection
    预聚合方式降低查询压力
    相比来讲,物化视图不支持同步删除。
    zhuanlan.zhihu.com/p/404851831

    相关文章

    JavaScript2024新功能:Object.groupBy、正则表达式v标志
    PHP trim 函数对多字节字符的使用和限制
    新函数 json_validate() 、randomizer 类扩展…20 个PHP 8.3 新特性全面解析
    使用HTMX为WordPress增效:如何在不使用复杂框架的情况下增强平台功能
    为React 19做准备:WordPress 6.6用户指南
    如何删除WordPress中的所有评论

    发布评论