How to Use OpenAI Embeddings for Semantic Search in Elasticsearch

DevOps 2023-09-30 三掌柜

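With the arrival of powerful GPT models, extracting the semantics of text has improved considerably. In this article, we search documents using embedding vectors instead of traditional keyword matching.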

What is an embedding?

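In deep learning terms, an embedding is a numerical representation of content such as text or images. Every deep learning model takes numbers as input, so to train a model on text we first have to convert the text into a numerical format.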

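There are several algorithms for converting text into an n-dimensional numeric array. The simplest is called "Bag of Words", where n is the number of unique words in the corpus. It simply counts how many times each word occurs in a text and builds an array from those counts.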



>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

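This representation is not rich enough to capture the semantics and meaning of a text. Thanks to the power of transformers, a model can learn embeddings instead. OpenAI provides an embeddings API that computes the embedding array for a text. That representation can then be stored in a vector database for searching.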
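To make the limitation of word counts concrete, here is a minimal sketch in plain Python. The three sentences and the helper names `bag_of_words` and `cosine_similarity` are made up for illustration. Two sentences that mean roughly the same thing but share no words get a similarity of zero, while two sentences with opposite meaning that share most of their words score high.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# s1 and s2 are paraphrases with no shared words;
# s1 and s3 share three of four words but mean the opposite.
s1 = "the film was great"
s2 = "a fantastic movie"
s3 = "the film was terrible"

vocabulary = sorted(set(" ".join([s1, s2, s3]).split()))
v1, v2, v3 = (bag_of_words(s, vocabulary) for s in (s1, s2, s3))

sim_paraphrase = cosine_similarity(v1, v2)
sim_opposite = cosine_similarity(v1, v3)
print(sim_paraphrase, sim_opposite)
```

A learned embedding is meant to behave the other way around: the paraphrase pair should end up closer together than the pair that merely shares vocabulary.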

The OpenAI embeddings API

To use OpenAI, we need to generate an API key on the OpenAI website. To do so, sign up and create a new key on the "View API Keys" page.

The OpenAI API keys page

Remember: the key is shown only once, so save it for later use.

To retrieve the embedding of a text, we call the OpenAI embeddings API with a model name and the text.



{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002"
}

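The input is the text whose embedding array we want to compute, and the model is the name of the embedding model. OpenAI offers several embedding models to choose from. In this article we use the default, "text-embedding-ada-002". To call the API, we use the following Python script.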



import os
import requests

# The API key is read from the OPENAI_API_KEY environment variable.
headers = {
    'Authorization': 'Bearer ' + os.getenv('OPENAI_API_KEY', ''),
    'Content-Type': 'application/json',
}

# The text to embed and the name of the embedding model.
json_data = {
    'input': 'This is the test text',
    'model': 'text-embedding-ada-002',
}

response = requests.post('https://api.openai.com/v1/embeddings',
                         headers=headers,
                         json=json_data)
response.raise_for_status()  # fail early on an invalid key or a bad request
result = response.json()

The embeddings response will look similar to this:



{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        .... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

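Because "data" is a list, result['data'][0]['embedding'] is the embedding vector of the given text. For the ada-002 model, the vector has 1536 floats and the maximum input length is 8191 tokens.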
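As a small sketch of reading that response, the helper below (a hypothetical name, `extract_embeddings`) pulls the vectors out of the `data` list and orders them by their `index` field. That ordering matters when `input` is a list of strings rather than a single string. The sample response is made up and uses 3 floats per vector only to keep it short.

```python
def extract_embeddings(result):
    """Return the embedding vectors from an embeddings response, in input order."""
    items = sorted(result["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# A shortened, made-up response for illustration;
# real ada-002 vectors have 1536 floats each.
sample_result = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [0.4, 0.5, 0.6], "index": 1},
        {"object": "embedding", "embedding": [0.1, 0.2, 0.3], "index": 0},
    ],
    "model": "text-embedding-ada-002",
}

vectors = extract_embeddings(sample_result)
print(vectors[0])
```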

Storing and searching

There are several database options for storing embedding vectors. In this article we explore Elasticsearch for storing and searching them.

Elasticsearch has a built-in vector data type called dense_vector. To store embedding vectors, we create an index that contains a text field and an embedding vector field.



PUT my_vector_index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1536
      },
      "text": {
        "type": "keyword"
      }
    }
  }
}
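With the index in place, each document needs to be stored together with its embedding. The following is a minimal sketch using only the standard library. The cluster URL `http://localhost:9200`, the document id, and the helper name `build_index_request` are assumptions for illustration, and the 3-float vector is a placeholder: a real document must carry all 1536 floats to match the mapping.

```python
import json
import urllib.request


def build_index_request(es_url, index, doc_id, text, embedding):
    """Build (but do not send) a PUT request that stores one document."""
    body = json.dumps({"text": text, "embedding": embedding}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}/_doc/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical local cluster and a toy vector, for illustration only.
request = build_index_request(
    "http://localhost:9200", "my_vector_index", "1",
    "This is the test text", [0.1, 0.2, 0.3],
)
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

For more than a handful of documents, the bulk API or an official client library is likely a better fit than one request per document.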

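For the ada-002 model, the vector dimension must be 1536. To query this index, we need to be familiar with the different vector similarity scores. Cosine similarity is one of the scores available in Elasticsearch. We first compute the embedding vector of the search phrase, then query the index with it and take the top-k results.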



POST my_vector_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": {
          "query_vector": [0.230, -0.120, 0.389, ...]
        }
      }
    }
  }
}
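The same request body can be assembled in Python once the search phrase has been embedded. This is a sketch, not a complete client: the helper names `build_script_score_query` and `texts_from_hits` are made up, and the sample response is a hand-written fragment in the shape of a `_search` response. The `size` field limits the result to the top k hits, and the `+ 1.0` in the script keeps the score non-negative, since cosine similarity ranges from -1 to 1 and Elasticsearch does not accept negative scores.

```python
import json


def build_script_score_query(query_vector, k=10):
    """Build a _search body that ranks all documents by cosine similarity."""
    return {
        "size": k,  # number of top hits to return
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


def texts_from_hits(search_response):
    """Pull (score, text) pairs out of a _search response."""
    return [(hit["_score"], hit["_source"]["text"])
            for hit in search_response["hits"]["hits"]]


body = build_script_score_query([0.1, 0.2, 0.3], k=5)
print(json.dumps(body, indent=2))

# A made-up response fragment, for illustration only.
sample_response = {
    "hits": {"hits": [
        {"_score": 1.93, "_source": {"text": "This is the test text"}},
        {"_score": 1.41, "_source": {"text": "Another document"}},
    ]}
}
results = texts_from_hits(sample_response)
```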

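This query returns the texts that are semantically similar to the query text.

Because it scores every document in the index, large-scale deployments need approximate nearest neighbor (aNN) search instead. For details, see "Elasticsearch: Introducing approximate nearest neighbor search in Elastic Stack 8.0".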
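For orientation, an approximate nearest neighbor setup might look roughly like the sketch below, written as Python dictionaries. It assumes an Elasticsearch 8.x release that supports the top-level `knn` option in `_search` and indexed vectors of 1536 dimensions, so check the documentation for your version. The `k` and `num_candidates` values are arbitrary examples, and the 3-float query vector is a placeholder for a real 1536-float embedding.

```python
# Mapping: the vector field must be indexed with a similarity metric
# so that it can be searched approximately.
knn_mapping = {
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "index": True,
                "similarity": "cosine",
            },
            "text": {"type": "keyword"},
        }
    }
}


def build_knn_query(query_vector, k=10, num_candidates=100):
    """Build a _search body that uses approximate kNN on the embedding field."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,                            # hits to return
            "num_candidates": num_candidates,  # candidates examined per shard
        },
        "_source": ["text"],  # return only the text, not the large vector
    }


knn_body = build_knn_query([0.1, 0.2, 0.3], k=5, num_candidates=50)
print(knn_body)
```

A larger `num_candidates` generally trades speed for accuracy.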

Conclusion

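In this article, we explored how the new embedding models can capture the meaning of documents. You can apply the same approach to any kind of document, such as PDFs, images, or audio, once its content has been turned into embeddings, and use Elasticsearch as the search engine for semantic similarity. This is useful for semantic search and recommender systems.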
