Probing AI Features: TiDB Vector vs. PG Vector

May 23, 2024

Preface

In this article you will learn:

  • How to apply for access to TiDB's latest AI features
  • A TiDB Vector usage demo
  • An introduction to TiDB Vector
  • A comparison of TiDB Vector and PG Vector

Applying for Access to the TiDB AI Features

Prerequisite: you must be able to reach tidb.cloud/ai

[Image 1]

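If your application is approved, you will receive an email like the one below. Click "Getting Started Guide"; under "Build AI Apps with TiDB Vector Search" you will find the instructions for creating an AI cluster.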

[Image 2]

As instructed, select the designated region, Frankfurt (eu-central-1).

[Image 3]

In the web console, a successfully created TiDB AI cluster differs from a regular TiDB cluster as follows:

[Image 4]

INSERT INTO vector_table VALUES
   ('[5.3, 6.2, 4.7, 9.4, 3.2]'),
   ('[7.4, 8.3, 3.6, 9.5, 1.5]'),
   ('[1.6, 5.3, 3.9, 4.9, 3.4]'),
   ('[4.6, 6.2, 2.9, 5.5, 2.4]'),
   ('[8.2, 2.7, 5.9, 4.5, 1.1]');
   
   
   SELECT
    embedding,
    VEC_Cosine_Distance(embedding, '[1,2,3,4,5]') AS d
    FROM vector_table
    ORDER BY d;

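TiDB AI Overview

Concepts

TiDB now supports a vector data type and can process vector data. This powers generative AI applications by enabling semantic search or similarity search over text, images, video, audio, or any other type of data. Vector search lets you search by the meaning of the data rather than by the data itself.

A vector embedding represents data (almost any kind: text, images, video, users, music, and so on) as a point in a space, where the position of the point is semantically meaningful. Vector search uses distance in the embedding space to represent similarity. Once the embeddings are stored in TiDB, finding related data only requires searching for the nearest neighbors of a given embedding.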
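The nearest-neighbor idea above can be sketched in a few lines of Python. This is only an illustration: the three 2-D "embeddings" and their labels are made up, and real embeddings usually have hundreds of dimensions.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, items, k=2):
    """Return the k (label, distance) pairs closest to the query vector."""
    scored = [(label, cosine_distance(query, vec)) for label, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical embeddings: two similar items and one unrelated item.
items = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "truck": [0.1, 0.9],
}

result = nearest_neighbors([1.0, 0.0], items, k=2)
print(result)  # "cat" and "kitten" come first, "truck" is left out
```

A database with vector support does the same thing on the server side, which is what the SQL in the rest of this article relies on.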

[Image 5]

In short, unstructured data such as images, audio, and video is represented in a TiDB database as shown below:

[Image 6]

TiDB provides its own vector functions. The following four distance functions are currently supported:

SELECT vec_cosine_distance('[1,1,1]', '[1,2,3]');
SELECT vec_l1_distance('[1,1,1]', '[1,2,3]');
SELECT vec_l2_distance('[1,1,1]', '[1,2,3]');
SELECT vec_negative_inner_product('[1,1,1]', '[1,2,3]');
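To make the four measures concrete, here is a small Python re-implementation using the textbook definitions, applied to the same two vectors as the SQL above. The Python function names merely mirror the SQL ones; this sketch assumes TiDB follows the standard definitions and is not TiDB's actual implementation.

```python
import math

def vec_l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def vec_l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vec_negative_inner_product(a, b):
    """Negated dot product, so that a smaller value means more similar."""
    return -sum(x * y for x, y in zip(a, b))

def vec_cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

a, b = [1, 1, 1], [1, 2, 3]
print(vec_cosine_distance(a, b))         # 1 - 6/sqrt(42), about 0.0742
print(vec_l1_distance(a, b))             # 0 + 1 + 2 = 3
print(vec_l2_distance(a, b))             # sqrt(5), about 2.236
print(vec_negative_inner_product(a, b))  # -(1 + 2 + 3) = -6
```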

Usage Demo

# Create a table with a column of type VECTOR
CREATE TABLE vector_table(embedding VECTOR);

# Insert some data
INSERT INTO vector_table VALUES
   ('[5.3, 6.2, 4.7, 9.4, 3.2]'),
   ('[7.4, 8.3, 3.6, 9.5, 1.5]'),
   ('[1.6, 5.3, 3.9, 4.9, 3.4]'),
   ('[4.6, 6.2, 2.9, 5.5, 2.4]'),
   ('[8.2, 2.7, 5.9, 4.5, 1.1]');

# Query the rows, ordered by cosine distance to '[1,2,3,4,5]'
SELECT
    embedding,
    VEC_Cosine_Distance(embedding, '[1,2,3,4,5]') AS d
FROM vector_table
ORDER BY d;
    
# Vector indexes on tables are also supported

CREATE TABLE vector_table_with_index (
    id INT PRIMARY KEY, doc TEXT,
    embedding VECTOR(3) COMMENT "hnsw(distance=cosine)"
);

# For more features, see https://docs.google.com/document/d/15eAO0xrvEd6_tTxW_zEko4CECwnnSwQg8GGrqK1Caiw/edit
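When the embeddings come from application code rather than hand-written SQL, they have to be turned into the '[v1,v2,...]' string form used in the INSERT statements above. The following is a minimal sketch of that step; the helper name to_vector_literal, the sample row, and the use of %s placeholders (the style used by common MySQL-compatible Python drivers) are assumptions for illustration, not part of any TiDB API.

```python
def to_vector_literal(values, expected_dims=None):
    """Format a sequence of numbers as a '[v1,v2,...]' vector string.

    expected_dims is optional; pass 3 for a VECTOR(3) column to catch
    dimension mismatches before the statement reaches the database.
    """
    floats = [float(v) for v in values]
    if expected_dims is not None and len(floats) != expected_dims:
        raise ValueError(
            f"expected {expected_dims} dimensions, got {len(floats)}"
        )
    return "[" + ",".join(repr(v) for v in floats) + "]"

# Parameterized statement for the vector_table_with_index table above.
sql = "INSERT INTO vector_table_with_index (id, doc, embedding) VALUES (%s, %s, %s)"
params = (1, "apple", to_vector_literal([1, 2, 3], expected_dims=3))

print(sql)
print(params)  # (1, 'apple', '[1.0,2.0,3.0]')
```

The pair (sql, params) would then be handed to the driver's cursor.execute call; keeping the vector as a bound parameter avoids building SQL by string concatenation.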

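TiDB Vector vs. PG Vector

For my earlier test based on PG Vector, see "2024: How pgvector Helps You Recognize a Real Dragon in the Year of the Dragon" (2024 ,pgvector如何使你龙年识别真龙).

I took 22 photos with my phone. The set contains dragon, dog, bear, rabbit, and snail plush toys. I stored all of the data in PostgreSQL, converted it to the vector data type, and then ran the computation with the vector algorithms that PG provides. Using the image with ID 9 (long9) as the baseline, I searched for the 9 images most similar to long9.

Below we run the same test on TiDB and look at where the two differ.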

[Image 7]

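Creating the Table and Importing Data in TiDB Vector

# Create the table in TiDB
CREATE TABLE imgsearch (
    id integer,
    info text,
    emb vector
);

# Export the image data from PG; it needs some cleaning before it can be used in TiDB
/usr/pgsql-14/bin/pg_dump  -U postgres  -h 127.0.0.1  -p 5432   -f  /tmp/img.sql  -d aigc  -t img  -W

# Import the data into TiDB
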
insert into  imgsearch values (11,'long11,jpg','[0.736297,1.075938,0.42861,0.892686,1.116836,0.762016,0.916824,1.259509,0.395746,0.934469,0.377053,0.394685,0.101372,1.012071,0.033544,0.513227]'),
(12,'chang1,jpg','[0.725221,0.518129,0.301916,1.072228,0.493588,0.310527,1.067409,0.32858,0.565257,0.573178,0.349785,0.38567,1.773137,0.158718,0.191309,0.551427]'),
(13,'hu1,jpg', '[0.539411,0.498615,0.222136,0.544601,0.586316,0.348909,1.068628,0.645802,0.39286,0.526826,1.071487,0.228127,0.003894,0.777899,0.489787,0.518695]'),
(14,'mao1,jpg',' [0.827598,0.762516,0.461345,0.712426,0.761116,0.663931,1.129789,0.611228,0.552839,0.465407,0.875965,0.580499,0.515468,0.977421,0.328147,0.554762]'),
(15,'she1,jpg',' [0.71958,0.581352,0.341945,0.74294,0.937697,0.52459,0.937106,1.554514,0.640871,1.022986,1.541015,0.504912,0.59253,0.62851,0.700355,0.461511]'),
(16,'tian1,jpg','[0.705562,0.575841,0.317712,1.07174,0.545053,0.465813,1.149812,0.640039,0.396544,0.402276,0.895212,0.422421,0.26145,0.998564,0.115623,0.557303]'),
(17,'wo1,jpg','[0.718501,0.78452,0.441315,0.698562,0.881231,0.380895,0.65969,1.165185,0.414465,0.667443,0.769969,0.352301,0.455007,0.801981,0.184673,0.602418]'),
(18,'xiong1,jpg','[0.460106,0.619717,0.302368,0.528348,0.705491,0.298183,0.585181,0.836597,0.381536,0.582833,0.224188,0.257927,0.637693,0.474295,0.418517,0.497981]'),
(19,'tu1,jpg','[0.752204,0.490604,0.297339,0.623052,0.777398,0.299141,0.480937,0.382444,0.300806,0.537903,0.302091,0.311521,0.22277,0.159951,0.125384,0.559185]'),
(20,'tu2,jpg','[0.773247,0.71391,0.393946,0.981045,0.92682,0.488609,0.941678,0.761343,0.408509,0.549109,1.004021,0.282817,0.325018,2.32993,0.276134,0.5435]'),
(21,'tu3,jpg','[0.375829,0.449242,0.224704,0.424777,0.403536,0.32574,0.288067,0.257558,0.301552,0.389855,0.688895,0.22314,0.051274,0.848036,0.02786,0.656272]');
select * from imgsearch;
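The article does not show the cleaning step between the pg_dump export and the INSERT above. One possible version is sketched below, under loud assumptions: the dump is in pg_dump's default plain-text format, where table rows appear as tab-separated lines between a COPY ... FROM stdin; line and a terminating \. line, and the exported table has exactly three columns (id, info, emb). The function name and the sample rows are hypothetical, and COPY escape sequences such as \N for NULL are not handled.

```python
def copy_rows_to_insert(dump_text, table="imgsearch"):
    """Convert the rows of the first COPY block in a plain-text dump
    into a single multi-row INSERT statement."""
    rows = []
    in_copy = False
    for line in dump_text.splitlines():
        if not in_copy:
            # Data rows start right after the COPY header line.
            in_copy = line.startswith("COPY ")
            continue
        if line == "\\.":
            break  # end-of-data marker written by pg_dump
        ident, info, emb = line.split("\t")
        info = info.replace("'", "''")  # escape single quotes for SQL
        rows.append(f"({int(ident)},'{info}','{emb}')")
    return f"INSERT INTO {table} VALUES\n" + ",\n".join(rows) + ";"

# Hypothetical two-row dump fragment, not the real image data.
sample = (
    "COPY public.img (id, info, emb) FROM stdin;\n"
    "1\tlong1.jpg\t[0.1,0.2]\n"
    "2\tgou1.jpg\t[0.3,0.4]\n"
    "\\.\n"
)
statement = copy_rows_to_insert(sample)
print(statement)
```

pg_dump also has an --inserts option that writes INSERT statements instead of COPY blocks, which may reduce the amount of cleaning needed, although PostgreSQL-specific statements elsewhere in the dump would still have to be removed.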

Computing Similarity with TiDB Vector

# The embedding of the baseline image (ID 9, long9) is '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]'

SELECT id, info, vec_cosine_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;

SELECT id, info, vec_l1_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;

SELECT id, info, vec_l2_distance(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;

SELECT id, info, vec_negative_inner_product(emb, '[0.778041,1.039809,0.496359,0.907833,1.083662,0.703667,1.082397,0.913184,0.507747,0.525088,0.622534,0.844312,0.933426,0.63143,0.43012,0.522419]') AS distance
FROM imgsearch ORDER BY distance LIMIT 10;

The results of the four queries are shown below:

[Image 8]
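The four queries do not have to return rows in the same order, because the measures treat vector length differently: cosine distance ignores it, while L2 distance and negative inner product do not. The sketch below shows this with three made-up 2-D vectors (not the image embeddings from this test) and also shows that the rankings by cosine, L2, and negative inner product coincide once every vector is scaled to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

METRICS = {
    "cosine": lambda a, b: 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b)),
    "l2": lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "negative_inner_product": lambda a, b: -dot(a, b),
}

def rank(query, rows, distance):
    """Row names ordered by ascending distance, like ORDER BY distance."""
    return [name for name, vec in sorted(rows.items(), key=lambda kv: distance(query, kv[1]))]

# Hypothetical data: "b" is long, "a" and "c" are short.
rows = {"a": [1.0, 0.2], "b": [3.0, 3.0], "c": [0.2, 1.0]}
query = [1.0, 0.0]

raw = {name: rank(query, rows, fn) for name, fn in METRICS.items()}

unit_rows = {name: normalize(vec) for name, vec in rows.items()}
unit = {name: rank(normalize(query), unit_rows, fn) for name, fn in METRICS.items()}

print(raw)   # three different orderings on the raw vectors
print(unit)  # identical orderings after normalization
```

If the embedding model does not already produce unit-length vectors, the choice of distance function therefore changes which images count as "most similar".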

Summary and Comparison

TiDB Vector and PG Vector are compared here on three dimensions: usability, functionality, and performance.

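  • Installation: TiDB Vector is ahead of PG Vector. When I installed PG Vector I hit an error that required changing a source-code configuration, and I only found the fix through GitHub. The cloud-native TiDB Vector has no such problem.
  • Ease of use: TiDB Vector uses a function-call style, vec_l2_distance(value1, value2), whereas PG Vector uses an operator-based, chained style. I personally prefer the function style, which is more concise.
  • Operability: TiDB Vector already has API demos that can be called directly from Python. However, in terms of worldwide adoption and having arrived first, pgvector's API coverage is broader and it is more widely used.
  • Ecosystem: PostgreSQL already supports processing image data: storing the image in the database, converting it to text, and converting it to vectors, all in one go. I had to export the data from PG and import it into TiDB to complete this test, so TiDB has more work to do here.
  • Computation (software only): TiDB Vector adopts the advanced HNSW index. For similarity computation it offers Vec_L1_Distance (Manhattan distance), Vec_L2_Distance (Euclidean distance), and Vec_Cosine_Distance (cosine distance), each of which has a pgvector counterpart. TiDB Vector also supports Vec_Negative_Inner_Product; pgvector exposes the same measure through its <#> operator.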
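The difference in style between the two can be seen by generating the same query for both systems. The sketch below maps each TiDB function used in this article to the pgvector operator for the same measure. It assumes, purely for illustration, that a table named imgsearch with columns id, info, and emb exists on both sides; the helper names are hypothetical.

```python
# TiDB distance function -> pgvector operator for the same measure.
PGVECTOR_OPERATORS = {
    "vec_l1_distance": "<+>",  # the L1 operator needs pgvector 0.7.0 or later
    "vec_l2_distance": "<->",
    "vec_cosine_distance": "<=>",
    "vec_negative_inner_product": "<#>",
}

def tidb_query(func, vector_literal, k=10):
    """Function-call style, as used in the TiDB queries of this article."""
    return (
        f"SELECT id, info, {func}(emb, '{vector_literal}') AS distance "
        f"FROM imgsearch ORDER BY distance LIMIT {k};"
    )

def pgvector_query(func, vector_literal, k=10):
    """Operator style: the same measure written with a pgvector operator."""
    op = PGVECTOR_OPERATORS[func]
    return (
        f"SELECT id, info, emb {op} '{vector_literal}' AS distance "
        f"FROM imgsearch ORDER BY distance LIMIT {k};"
    )

print(tidb_query("vec_l2_distance", "[1,2,3]", k=3))
print(pgvector_query("vec_l2_distance", "[1,2,3]", k=3))
```

The literals are embedded directly only to keep the two statements easy to compare; application code would normally bind the vector as a query parameter.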

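I believe AI capability is an important part of what a database should offer. AI business scenarios require the client and the server to work together: a database cannot take on all of the AI processing, only the server-side work, while the client side still has to be custom-developed by the application according to business needs. TiDB Vector delivers the basic functionality TiDB needs to move into the AI field, and I hope it keeps getting better from here.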
