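Preface
In this article you will learn:
- How to apply for access to the latest AI features in TiDB AI
- A demo of TiDB Vector in use
- An introduction to TiDB Vector
- How TiDB Vector compares with PG Vector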
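Applying for the TiDB AI preview
Prerequisite: you can access tidb.cloud/ai
If your application is approved, you will receive an email like the one below. Click Getting Started Guide; under Build AI Apps with TiDB Vector Search you will find instructions for creating a TiDB AI cluster.
As prompted, select the designated region, Frankfurt (eu-central-1).
Once created, a TiDB AI cluster differs from a regular TiDB cluster in the web console as follows: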
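TiDB AI overview
Concepts
TiDB now supports a vector data type and can process vector data. This powers generative AI workloads with semantic search or similarity search over text, images, video, audio, or any other kind of data. Vector search lets you search by the meaning of the data rather than by the data itself, for example to perform semantic search.
Vector embeddings represent data (almost any kind: text, images, video, users, music, and so on) as points in a space, where the position of each point is semantically meaningful. Vector search uses distance in the embedding space to represent similarity. Once vector embeddings are stored in TiDB, finding related data only requires searching for the nearest neighbors of a given embedding.
In short, unstructured data such as images, audio, and video is represented in the TiDB database as follows:
TiDB exposes vector search through its own vector functions. Four distance functions are currently supported: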
SELECT vec_cosine_distance('[1,1,1]', '[1,2,3]');
SELECT vec_l1_distance('[1,1,1]', '[1,2,3]');
SELECT vec_l2_distance('[1,1,1]', '[1,2,3]');
SELECT vec_negative_inner_product('[1,1,1]', '[1,2,3]');
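To make the four functions concrete, here is a minimal Python sketch that computes the same quantities for the two vectors used above. It assumes the usual textbook definitions (cosine distance as 1 minus cosine similarity, L1 as Manhattan distance, L2 as Euclidean distance); the helper names are illustrative and are not part of any TiDB client library.

```python
import math

def parse_vector(text):
    """Parse a '[1,2,3]' style literal into a list of floats."""
    return [float(x) for x in text.strip().strip("[]").split(",")]

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l1_distance(a, b):
    # Manhattan distance: sum of absolute component differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # Negated dot product, so that "smaller is more similar" still holds.
    return -sum(x * y for x, y in zip(a, b))

a = parse_vector("[1,1,1]")
b = parse_vector("[1,2,3]")

for name, fn in [
    ("cosine", cosine_distance),
    ("l1", l1_distance),
    ("l2", l2_distance),
    ("negative inner product", negative_inner_product),
]:
    print(name, fn(a, b))
```

All four return a value where smaller means more similar, which is why each can be used directly in an ascending ORDER BY.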
Usage example
# Create a table with a column of type VECTOR
CREATE TABLE vector_table(embedding VECTOR);
# Insert some sample data
INSERT INTO vector_table VALUES
('[5.3, 6.2, 4.7, 9.4, 3.2]'),
('[7.4, 8.3, 3.6, 9.5, 1.5]'),
('[1.6, 5.3, 3.9, 4.9, 3.4]'),
('[4.6, 6.2, 2.9, 5.5, 2.4]'),
('[8.2, 2.7, 5.9, 4.5, 1.1]');
# Query the rows, ordered by cosine distance to [1,2,3,4,5]
SELECT
embedding,
VEC_Cosine_Distance(embedding, '[1,2,3,4,5]') AS d
FROM vector_table
ORDER BY d;
# A vector index can also be declared on the table
CREATE TABLE vector_table_with_index (
id INT PRIMARY KEY, doc TEXT,
embedding VECTOR(3) COMMENT "hnsw(distance=cosine)"
);
# For more features, see https://docs.google.com/document/d/15eAO0xrvEd6_tTxW_zEko4CECwnnSwQg8GGrqK1Caiw/edit
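The ORDER BY d query above is a brute-force nearest-neighbor search: compute the distance from every row to the query vector, then sort. The following Python sketch reproduces that logic in memory for the same five rows, assuming cosine distance is defined as 1 minus cosine similarity. It illustrates the idea only and says nothing about how TiDB executes the query internally.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# The same five embeddings inserted into vector_table above.
rows = [
    [5.3, 6.2, 4.7, 9.4, 3.2],
    [7.4, 8.3, 3.6, 9.5, 1.5],
    [1.6, 5.3, 3.9, 4.9, 3.4],
    [4.6, 6.2, 2.9, 5.5, 2.4],
    [8.2, 2.7, 5.9, 4.5, 1.1],
]
query = [1, 2, 3, 4, 5]

# Equivalent of: SELECT embedding, distance AS d ... ORDER BY d
ranked = sorted(rows, key=lambda emb: cosine_distance(emb, query))
distances = [cosine_distance(emb, query) for emb in ranked]

for emb, d in zip(ranked, distances):
    print(emb, round(d, 4))
```

A full scan like this costs one distance computation per row, which is the cost a vector index such as HNSW is meant to reduce on large tables.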
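TiDB Vector compared with PG Vector
The author previously ran a test based on PG Vector; see the article 2024 ,pgvector如何使你龙年识别真龙.
For that test, the author took 22 photos with a phone. The set includes dragon, dog, bear, rabbit, and snail plush toys. All the data was stored in PostgreSQL, converted to the vector data type, and then compared using the vector algorithms that PG provides. Using the image with ID 9 (long9) as the baseline, the query returned the 9 images most similar to long9.
Below we run the same test on TiDB to see where the two differ.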
TiDB Vector: creating the table and importing data
# Create the table in TiDB
CREATE TABLE imgsearch (
id integer,
info text,
emb vector
);
# The image data has to be exported from PG and cleaned up before it can be used in TiDB
/usr/pgsql-14/bin/pg_dump -U postgres -h 127.0.0.1 -p 5432 -f /tmp/img.sql -d aigc -t img -W
# Import the data into TiDB
insert into imgsearch values (11,'long11,jpg','[0.736297,1.075938,0.42861,0.892686,1.116836,0.762016,0.916824,1.259509,0.395746,0.934469,0.377053,0.394685,0.101372,1.012071,0.033544,0.513227]'),
(12,'chang1,jpg','[0.725221,0.518129,0.301916,1.072228,0.493588,0.310527,1.067409,0.32858,0.565257,0.573178,0.349785,0.38567,1.773137,0.158718,0.191309,0.551427]'),
(13,'hu1,jpg', '[0.539411,0.498615,0.222136,0.544601,0.586316,0.348909,1.068628,0.645802,0.39286,0.526826,1.071487,0.228127,0.003894,0.777899,0.489787,0.518695]'),
(14,'mao1,jpg',' [0.827598,0.762516,0.461345,0.712426,0.761116,0.663931,1.129789,0.611228,0.552839,0.465407,0.875965,0.580499,0.515468,0.977421,0.328147,0.554762]'),
(15,'she1,jpg',' [0.71958,0.581352,0.341945,0.74294,0.937697,0.52459,0.937106,1.554514,0.640871,1.022986,1.541015,0.504912,0.59253,0.62851,0.700355,0.461511]'),
(16,'tian1,jpg','[0.705562,0.575841,0.317712,1.07174,0.545053,0.465813,1.149812,0.640039,0.396544,0.402276,0.895212,0.422421,0.26145,0.998564,0.115623,0.557303]'),
(17,'wo1,jpg','[0.718501,0.78452,0.441315,0.698562,0.881231,0.380895,0.65969,1.165185,0.414465,0.667443,0.769969,0.352301,0.455007,0.801981,0.184673,0.602418]'),
(18,'xiong1,jpg','[0.460106,0.619717,0.302368,0.528348,0.705491,0.298183,0.585181,0.836597,0.381536,0.582833,0.224188,0.257927,0.637693,0.474295,0.418517,0.497981]'),
(19,'tu1,jpg','[0.752204,0.490604,0.297339,0.623052,0.777398,0.299141,0.480937,0.382444,0.300806,0.537903,0.302091,0.311521,0.22277,0.159951,0.125384,0.559185]'),
(20,'tu2,jpg','[0.773247,0.71391,0.393946,0.981045,0.92682,0.488609,0.941678,0.761343,0.408509,0.549109,1.004021,0.282817,0.325018,2.32993,0.276134,0.5435]'),
(21,'tu3,jpg','[0.375829,0.449242,0.224704,0.424777,0.403536,0.32574,0.288067,0.257558,0.301552,0.389855,0.688895,0.22314,0.051274,0.848036,0.02786,0.656272]');
select * from imgsearch;
TiDB Vector: computing similarity
# Vector of the baseline image (ID 9): '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]'
SELECT id, info, vec_cosine_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;
SELECT id, info, vec_l1_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;
SELECT id, info, vec_l2_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;
SELECT id, info, vec_negative_inner_product(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;
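These four queries rank the same rows under different metrics, and the resulting orders do not have to agree. The toy Python sketch below uses two made-up 2-dimensional candidates (not the image data) to show why: cosine distance and negative inner product favor a vector pointing in the same direction as the query even when it is far away, while L1 and L2 favor the one that is physically closest. The definitions are the usual textbook ones and the candidate names are purely illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

METRICS = {
    "cosine": lambda a, b: 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b)),
    "l1": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "l2": lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "negative_inner_product": lambda a, b: -dot(a, b),
}

query = [1.0, 1.0]
# Hypothetical candidates chosen to make the metrics disagree.
candidates = {
    "far_same_direction": [10.0, 10.0],
    "near_other_direction": [1.0, 0.5],
}

def rank(metric_name):
    """Return candidate names ordered by ascending distance."""
    metric = METRICS[metric_name]
    return sorted(candidates, key=lambda name: metric(candidates[name], query))

for metric_name in METRICS:
    print(metric_name, rank(metric_name))
```

Which metric is appropriate depends on how the embeddings were produced, for example whether vector length carries meaning or only direction does.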
The results of the four queries are as follows:
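Summary and comparison
We compare TiDB Vector and PG Vector on three dimensions: usability, functionality, and performance.
- Installation: TiDB Vector beats PG Vector. While installing PG Vector the author hit an error that required changing the source configuration, and the fix was only found on GitHub. Cloud-native TiDB Vector has no such problem.
- Usage: TiDB Vector uses a function-call style, vec_l2_distance(value1, value2), whereas PG Vector uses a chained style built on operators. The author prefers the function style because it is more concise.
- Operation: TiDB Vector already ships API demos that Python can call directly. Given its head start and worldwide adoption, however, pgvector's APIs cover more ground and are more widely used.
- Ecosystem: PostgreSQL already supports processing image data end to end: store the image in the database, then convert it to text and to a vector in one pass. To complete this test the author had to export the data from PG and import it into TiDB, so TiDB has work to do here.
- Compute performance, looking at software only: TiDB Vector adopts the HNSW index. For similarity computation it provides Vec_L1_Distance (Manhattan distance), Vec_L2_Distance (Euclidean distance), and Vec_Cosine_Distance (cosine distance), each with a pgvector counterpart. TiDB Vector also supports Vec_Negative_Inner_Product as an optimized option.
The author believes AI capability is an important part of what a database offers. AI workloads need the client and the server to cooperate: a database cannot perform every AI processing step and can only cover the server side, while the client side still has to be custom-built by the application for its business needs. TiDB Vector delivers the foundation TiDB needs to move into the AI field, and the author hopes it keeps improving from here.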
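To illustrate the function style versus operator style contrast from the usage bullet above, here is a small Python sketch that renders the same distance expression in both dialects. The operator mapping follows pgvector's documented distance operators; the `<+>` operator for L1 distance is assumed to require a recent pgvector release (0.7.0 or later). The helper names `tidb_expr` and `pgvector_expr` are hypothetical and exist only for this example.

```python
# TiDB function name -> pgvector distance operator.
PGVECTOR_OPERATORS = {
    "vec_l2_distance": "<->",
    "vec_cosine_distance": "<=>",
    "vec_negative_inner_product": "<#>",
    "vec_l1_distance": "<+>",  # assumption: pgvector 0.7.0 or later
}

def tidb_expr(func, column, literal):
    """Render a TiDB-style function call, e.g. vec_l2_distance(emb, '[1,2,3]')."""
    return f"{func}({column}, '{literal}')"

def pgvector_expr(func, column, literal):
    """Render the equivalent pgvector operator expression, e.g. emb <-> '[1,2,3]'."""
    return f"{column} {PGVECTOR_OPERATORS[func]} '{literal}'"

for func in PGVECTOR_OPERATORS:
    print(tidb_expr(func, "emb", "[1,2,3]"), "|", pgvector_expr(func, "emb", "[1,2,3]"))
```

The sketch only builds expression text for side-by-side reading. Real queries should pass the vector as a bound parameter rather than formatting it into the SQL string.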