Elasticsearch: Chat with multiple PDFs | LangChain Python app tutorial (free LLMs and embeddings)

September 25, 2023

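In this blog post, you will learn how to build a LangChain application that chats with multiple PDF files using the ChatGPT API and Hugging Face language models.

As the diagram above shows, we ingest the PDF files on the far left, concatenate their text, and split it into chunks. We use Hugging Face to turn the chunks into embeddings and store those embeddings in Elasticsearch, which serves as the vector database. At search time, LangChain vectorizes the question and Elasticsearch runs the vector search. Finally, a large language model answers the question that was asked. The finished interface looks like this:

As shown above, the app can answer questions about our documents. Further reading: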

  • AI search over private data with LangChain and Elasticsearch
  • Privacy-first AI search using LangChain and Elasticsearch

All of the source code can be downloaded from GitHub - liu-xiao-guo/ask-multiple-pdfs: A Langchain app that allows you to chat with multiple PDFs.

Installation

If you have not yet installed Elasticsearch and Kibana, please refer to the following articles:

  • How to install Elasticsearch on Linux, macOS, and Windows

  • Kibana: How to install Kibana from the Elastic Stack on Linux, macOS, and Windows

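For the installation, follow the Elastic Stack 8.x guide. By default, an Elasticsearch 8.x cluster is accessed securely over HTTPS.

After installation, the certificate file http_ca.crt can be found in the following Elasticsearch directory: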



$ pwd
/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs
$ ls
http.p12      http_ca.crt   transport.p12


We need to copy this certificate to the root directory of the project:



$ tree -L 3
.
├── app.py
├── docs
│   └── PDF-LangChain.jpg
├── htmlTemplates.py
├── http_ca.crt
├── lib_embeddings.py
├── lib_indexer.py
├── lib_llm.py
├── lib_vectordb.py
├── myapp.py
├── pdf_files
│   ├── sample1.pdf
│   └── sample2.pdf
├── readme.md
├── requirements.txt
└── simple.cfg


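As shown above, http_ca.crt has been copied to the application's root directory. The pdf_files directory holds two PDF files for testing; you can use your own PDF files instead. We configure simple.cfg as follows: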



ES_SERVER: "localhost"
ES_PASSWORD: "vXDWYtL*my3vnKY9zCfL"
ES_FINGERPRINT: "e2c1512f617f432ddf242075d3af5177b28f6497fecaaa0eea11429369bb7b00"


Here, ES_SERVER is the address of the Elasticsearch cluster, and ES_PASSWORD is the password of the Elasticsearch superuser elastic. The ES_FINGERPRINT value is printed on screen the first time Elasticsearch starts:

You can also find the fingerprint in the Kibana configuration file config/kibana.yml:
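If neither of those is at hand, the value can usually be recomputed from the certificate itself. The sketch below assumes that ES_FINGERPRINT is the SHA-256 digest of the DER-encoded http_ca.crt, which matches the lowercase hex format shown in simple.cfg. The helper name cert_fingerprint is hypothetical, and only the Python standard library is used.

```python
import hashlib
import ssl


def cert_fingerprint(pem_text: str) -> str:
    """Return the lowercase hex SHA-256 digest of a PEM-encoded certificate."""
    # Convert the PEM text (base64 between BEGIN/END markers) to raw DER bytes.
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der_bytes).hexdigest()
```

To use it, read http_ca.crt as text, pass the contents to cert_fingerprint, and compare the result with the value printed by Elasticsearch before trusting it.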

The project directory also contains a file named .env.example. Rename it to .env with the following command:

mv .env.example .env

In .env, enter the token obtained from the huggingface.co website:



$ cat .env
OPENAI_API_KEY=your_openai_key
HUGGINGFACEHUB_API_TOKEN=your_huggingface_key


In this example we test with Hugging Face. If you want to use OpenAI, you also need to configure its key. A Hugging Face developer token can be obtained from the huggingface.co website.
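A quick way to confirm that the keys are visible to Python before launching the app is sketched below. The helper name missing_keys is hypothetical; it only inspects a mapping such as os.environ and does not contact Hugging Face or OpenAI.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(env: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Only the Hugging Face token is needed for the example in this post.
missing = missing_keys(os.environ, ["HUGGINGFACEHUB_API_TOKEN"])
if missing:
    print("Missing settings:", ", ".join(missing))
```

Inside the app, a check like this belongs after load_dotenv(), since that call is what copies the values from .env into the environment.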

Running the project

Before running the project, install the dependencies:



python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt


Creating the UI

The application's UI is built with Streamlit, which makes it very simple to create. The first version of myapp.py looks like this:

myapp.py



import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from htmlTemplates import css, bot_template, user_template

def get_pdf_texts(pdf_docs):
    # Concatenate the extracted text of every page of every uploaded PDF
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)
    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        pass  # question handling is added in the final version

    # Placeholder chat messages rendered from the HTML templates
    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a side bar
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and press on click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Get the raw text from the uploaded PDFs and show it in the page
                raw_text = get_pdf_texts(pdf_docs)
                st.write(raw_text)

if __name__ == "__main__":
    main()


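In the code above, I created a sidebar for selecting the PDF files. Clicking the Process button displays the text extracted from the PDFs. We can run the application with the following command: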

(venv) $ streamlit run myapp.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8502
  Network URL: http://198.18.1.13:8502


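After running the command above, we can open the application in a browser:

Click Browse files and select the PDF files:

Click Process above, and we see:

Here, for convenience, I use st.write to write the result directly into the browser page. Next, we need to split this long text into chunks. Each chunk must stay within the maximum input length that the model allows.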
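To make the idea of chunking concrete, here is a minimal, library-free sketch of fixed-size splitting with overlap. It is not how LangChain's splitters work internally (they also honor separators). The function name split_with_overlap is hypothetical, and the default sizes of 1000 and 200 mirror the values passed to CharacterTextSplitter in the complete myapp.py.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two neighbors.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return chunks
```

A smaller chunk_size keeps each piece well inside the model's limit at the cost of more pieces to embed and index.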

That was a brief look at how the UI is constructed. The final, complete myapp.py is as follows:

myapp.py



import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from htmlTemplates import css, bot_template, user_template

import lib_indexer
import lib_llm
import lib_embeddings
import lib_vectordb

index_name = "pdf_docs"

def get_pdf_text(pdf):
    # Extract the text of a single PDF
    text = ""
    pdf_reader = PdfReader(pdf)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

def get_pdf_texts(pdf_docs):
    # Concatenate the extracted text of every page of every uploaded PDF
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def get_text_chunks(text):
    # Split on newlines into chunks of up to 1000 characters with 200 overlapping
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # chunks = text_splitter.split_documents(text)
    return chunks

def get_text_chunks1(text):
    # Alternative splitter: smaller chunks, no overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    chunks = text_splitter.split_text(text)
    return chunks

def handle_userinput(db, llm_chain_informed, user_question):
    # Retrieve the passages most similar to the question
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask Local LLM context informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context= similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context,question=user_question)

    # Show the question, the answer, and the passage the answer is based on
    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content),unsafe_allow_html=True)

def main():

    # # Huggingface embedding setup
    hf = lib_embeddings.setup_embeddings()

    # # # ## Elasticsearch as a vector db
    db, url = lib_vectordb.setup_vectordb(hf, index_name)

    # # # ## set up the conversational LLM
    llm_chain_informed= lib_llm.make_the_llm()

    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)
    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        handle_userinput(db, llm_chain_informed, user_question)

    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a side bar
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and press on click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Get pdf text from
                # raw_text = get_pdf_text(pdf_docs[0])
                raw_text = get_pdf_texts(pdf_docs)
                # st.write(raw_text)
                print(raw_text)

                # Get the text chunks
                text_chunks = get_text_chunks(raw_text)
                # st.write(text_chunks)

                # Create vector store
                lib_indexer.loadPdfChunks(text_chunks, url, hf, db, index_name)

if __name__ == "__main__":
    main()


Creating the embedding model

lib_embeddings.py



## for embeddings
from langchain.embeddings import HuggingFaceEmbeddings

def setup_embeddings():
    # Huggingface embedding setup
    print(">> Prep. Huggingface embedding setup")
    model_name = "sentence-transformers/all-mpnet-base-v2"
    return HuggingFaceEmbeddings(model_name=model_name)
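Embeddings are useful because texts with similar meaning map to nearby vectors. The sketch below shows cosine similarity, a measure commonly used to compare such vectors. The three short vectors are made-up stand-ins, not real model output (all-mpnet-base-v2 produces 768-dimensional vectors), and the metric Elasticsearch actually applies depends on how the index is configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: the question points in almost the same direction as passage_a.
question = [0.9, 0.1, 0.0]
passage_a = [0.8, 0.2, 0.1]
passage_b = [0.0, 0.1, 0.9]
best = max([passage_a, passage_b], key=lambda p: cosine_similarity(question, p))
```

Here best ends up being passage_a, the vector that points in nearly the same direction as the question.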


Creating the vector store

lib_vectordb.py



import os
from config import Config

## for vector store
from langchain.vectorstores import ElasticVectorSearch

def setup_vectordb(hf,index_name):
    # Elasticsearch URL setup
    print(">> Prep. Elasticsearch config setup")

    with open('simple.cfg') as f:
        cfg = Config(f)

    endpoint = cfg['ES_SERVER']
    username = "elastic"
    password = cfg['ES_PASSWORD']

    ssl_verify = {
        "verify_certs": True,
        "basic_auth": (username, password),
        "ca_certs": "./http_ca.crt",
    }

    url = f"https://{username}:{password}@{endpoint}:9200"

    return ElasticVectorSearch( embedding = hf,
                                elasticsearch_url = url,
                                index_name = index_name,
                                ssl_verify = ssl_verify), url
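One caveat with embedding credentials in the URL: characters such as @, : or / in a password change how the URL is parsed. A hedged sketch of percent-encoding the credentials first, using only the standard library, is shown below. build_es_url is a hypothetical helper and the sample password is invented; whether the client decodes the result as expected should be verified against your installed versions.

```python
from urllib.parse import quote


def build_es_url(username: str, password: str, host: str, port: int = 9200) -> str:
    """Build an https URL with percent-encoded credentials."""
    # safe="" forces every reserved character, including "/" and "@", to be escaped.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"https://{user}:{pwd}@{host}:{port}"


url = build_es_url("elastic", "p@ss:w/rd", "localhost")
```

With the invented password above, the credentials part of the URL no longer contains a raw @ or /, so the host is still parsed as localhost.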


Creating an offline LLM that uses a prompt template with context and question variables

lib_llm.py



## for conversation LLM
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

def make_the_llm():
    # Get Offline flan-t5-large ready to go, in CPU mode
    print(">> Prep. Get Offline flan-t5-large ready to go, in CPU mode")
    model_id = 'google/flan-t5-large'# go for a smaller model if you dont have the VRAM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id) #load_in_8bit=True, device_map='auto'
    pipe = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=100
    )
    local_llm = HuggingFacePipeline(pipeline=pipe)
    # template_informed = """
    # I know the following: {context}
    # Question: {question}
    # Answer: """

    template_informed = """
    I know: {context}
    when asked: {question}
    my response is: """

    prompt_informed = PromptTemplate(template=template_informed, input_variables=["context", "question"])

    return LLMChain(prompt=prompt_informed, llm=local_llm)
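To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain str.format as a stand-in for PromptTemplate, so it runs without LangChain. The function name render_prompt is hypothetical, and the context and question strings are invented.

```python
TEMPLATE_INFORMED = """
I know: {context}
when asked: {question}
my response is: """


def render_prompt(context: str, question: str) -> str:
    """Substitute the retrieved passage and the user question into the template."""
    return TEMPLATE_INFORMED.format(context=context, question=question)


prompt = render_prompt(
    context="The car will wait at Harrogate Station.",
    question="Where will the car wait?",
)
print(prompt)
```

The prompt ends with "my response is: ", which nudges the model to continue with the answer itself.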


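Writing the PDF files as vectors

Below is my chunking and vector-store code. It needs the assembled Elasticsearch URL, the Hugging Face embedding model, the vector database, and the name of the target index in Elasticsearch.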

lib_indexer.py

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

## for vector store
from langchain.vectorstores import ElasticVectorSearch
from elasticsearch import Elasticsearch
from config import Config

with open('simple.cfg') as f:
    cfg = Config(f)

fingerprint = cfg['ES_FINGERPRINT']
endpoint = cfg['ES_SERVER']
username = "elastic"
password = cfg['ES_PASSWORD']
ssl_verify = {
    "verify_certs": True,
    "basic_auth": (username, password),
    "ca_certs": "./http_ca.crt"
}

url = f"https://{username}:{password}@{endpoint}:9200"

def parse_book(filepath):
    # Load a text file and split it into 384-character documents
    loader = TextLoader(filepath)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    return docs

def parse_triplets(filepath):
    # Join every three consecutive chunks into one overlapping passage (plain strings)
    docs = parse_book(filepath)
    result = []
    for i in range(len(docs) - 2):
        concat_str = docs[i].page_content + " " + docs[i+1].page_content + " " + docs[i+2].page_content
        result.append(concat_str)
    return result
    #db.from_texts(docs, embedding=hf, elasticsearch_url=url, index_name=index_name)

## load book utility
## params
##  filepath: where to get the book txt ... should be utf-8
##  url: the full Elasticsearch url with username password and port embedded
##  hf: hugging face transformer for sentences
##  db: the VectorStore LangChain object ready to go with embedding thing already set up
##  index_name: name of index to use in ES
##
##  will check if the index_name exists already in ES url before attempting split and load
def loadBookTriplets(filepath, url, hf, db, index_name):
    with open('simple.cfg') as f:
        cfg = Config(f)

    fingerprint = cfg['ES_FINGERPRINT']
    es = Elasticsearch( [ url ],
                    basic_auth = ("elastic", cfg['ES_PASSWORD']),
                    ssl_assert_fingerprint = fingerprint,
                    http_compress = True  )

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        results = parse_triplets(filepath)

        print(">> 2. Index the chunks into Elasticsearch")

        # parse_triplets returns plain strings, so index them with from_texts
        elastic_vector_search= ElasticVectorSearch.from_texts( results,
                                embedding = hf,
                                elasticsearch_url = url,
                                index_name = index_name,
                                ssl_verify = ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")

def loadBookBig(filepath, url, hf, db, index_name):
    es = Elasticsearch( [ url ],
                       basic_auth = ("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint = fingerprint,
                       http_compress = True  )

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        docs = parse_book(filepath)

        # print(docs)

        print(">> 2. Index the chunks into Elasticsearch")

        elastic_vector_search= ElasticVectorSearch.from_documents( docs,
                                embedding = hf,
                                elasticsearch_url = url,
                                index_name = index_name,
                                ssl_verify = ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")

def loadPdfChunks(chunks, url, hf, db, index_name):
    es = Elasticsearch( [ url ],
                       basic_auth = ("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint = fingerprint,
                       http_compress = True  )

    ## Index the PDF chunks only if the index does not exist yet
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 2. Index the chunks into Elasticsearch")

        print("url: ", url)
        print("index_name", index_name)

        elastic_vector_search = db.from_texts( chunks,
                                embedding = hf,
                                elasticsearch_url = url,
                                index_name = index_name,
                                ssl_verify = ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")
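The parse_triplets helper in lib_indexer.py joins every three consecutive chunks into one overlapping passage. The same sliding-window idea on plain strings, as a small sketch (the name sliding_join and the window parameter are hypothetical):

```python
from typing import List


def sliding_join(chunks: List[str], window: int = 3) -> List[str]:
    """Join each run of `window` consecutive chunks with single spaces."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        " ".join(chunks[i:i + window])
        for i in range(len(chunks) - window + 1)
    ]
```

Note that an input with fewer chunks than the window yields an empty list, just as the loop in parse_triplets does, so very short documents would produce nothing to index with this approach.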

Asking questions

We use a Streamlit text input to ask questions:

    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        handle_userinput(db, llm_chain_informed, user_question)

When we press the ENTER key, the code above calls handle_userinput(db, llm_chain_informed, user_question):



def handle_userinput(db, llm_chain_informed, user_question):
    # Retrieve the passages most similar to the question
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask Local LLM context informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context= similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context,question=user_question)

    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content),unsafe_allow_html=True)


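It first runs a similarity search with db and takes the top-ranked passage as context, and then uses the large language model to produce the answer we want.

Results

We run the code with the following command: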

streamlit run myapp.py

In the browser, we select the two PDF files in pdf_files:

Above, we type in the question we want to ask:

The question above is:

what do I make all the same and put a cup next to him on the desk?

We ask another question:

The question above is:

when should you come? I will send a car to meet you from the half past four arrival at Harrogate Station.

The question above is:

what will I send to meet you from the half past four arrival at Harrogate Station?

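You can try many other questions. Happy journey 🙂

Using ChatGPT works in much the same way. You just need to use a ChatGPT model and its corresponding key, so I won't go into the details here.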
