In this blog post, you will learn how to build a LangChain application that lets you chat with multiple PDF files using the ChatGPT API and Hugging Face language models.
As shown above, we ingest the PDF files on the far left, concatenate their text, and split it into chunks. We then run the chunks through a Hugging Face embedding model to produce embeddings, which we write to an Elasticsearch vector database for storage. At search time, LangChain vectorizes the question and Elasticsearch performs the vector search. Finally, we pass the retrieved context to a large language model, which answers the question. The final interface looks like this:
As shown above, the app can answer our questions.

Further reading
- AI search on private data using LangChain and Elasticsearch
- Privacy-first AI search using LangChain and Elasticsearch
All of the source code can be downloaded from GitHub - liu-xiao-guo/ask-multiple-pdfs: A Langchain app that allows you to chat with multiple PDFs.
Installation
If you have not yet installed your own Elasticsearch and Kibana, please refer to the following links:
- How to install Elasticsearch on Linux, MacOS, and Windows
- Kibana: How to install Kibana from the Elastic Stack on Linux, MacOS, and Windows
When installing, follow the installation guide for Elastic Stack 8.x. By default, access to the Elasticsearch cluster is secured with HTTPS.
During installation, we can find the certificate file http_ca.crt at the following location inside the Elasticsearch installation directory:
$ pwd
/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs
$ ls
http.p12      http_ca.crt   transport.p12
We need to copy this certificate into the root directory of the project:
$ tree -L 3
.
├── app.py
├── docs
│   └── PDF-LangChain.jpg
├── htmlTemplates.py
├── http_ca.crt
├── lib_embeddings.py
├── lib_indexer.py
├── lib_llm.py
├── lib_vectordb.py
├── myapp.py
├── pdf_files
│   ├── sample1.pdf
│   └── sample2.pdf
├── readme.md
├── requirements.txt
└── simple.cfg
As shown above, we copy http_ca.crt into the application's root directory. We have placed two PDF files for testing in pdf_files; you can use your own PDF files instead. We make the following configuration in simple.cfg:
ES_SERVER: "localhost"
ES_PASSWORD: "vXDWYtL*my3vnKY9zCfL"
ES_FINGERPRINT: "e2c1512f617f432ddf242075d3af5177b28f6497fecaaa0eea11429369bb7b00"
Above, we need to configure ES_SERVER, which is the address of the Elasticsearch cluster. ES_PASSWORD is the password of the Elasticsearch superuser elastic. We can find ES_FINGERPRINT in the output shown when Elasticsearch starts up for the very first time:
You can also find the fingerprint in Kibana's configuration file config/kibana.yml:
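If you no longer have the first-start output handy, you can also compute the SHA-256 certificate fingerprint yourself. The following is a minimal sketch of my own (not part of the app), assuming Elasticsearch is reachable at localhost:9200:

import ssl, hashlib

# Fetch the server certificate; validation is skipped because it is self-signed
pem = ssl.get_server_certificate(("localhost", 9200))
der = ssl.PEM_cert_to_DER_cert(pem)
# The hex digest should match the ES_FINGERPRINT value in simple.cfg
print(hashlib.sha256(der).hexdigest())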
In the project directory, we can also see a file called .env.example. We can rename it to .env with the following command:
mv .env.example .env
In .env, we enter the token obtained from the huggingface.co website:
$ cat .env
OPENAI_API_KEY=your_openai_key
HUGGINGFACEHUB_API_TOKEN=your_huggingface_key
In this example, we will use Hugging Face for testing. If you want to use OpenAI instead, you need to configure its key as well. You can obtain a Hugging Face developer token from the huggingface.co website.
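As a quick sanity check (my own, not part of the app), you can confirm that python-dotenv picks up the token:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
# Prints True once HUGGINGFACEHUB_API_TOKEN is present in .env
print(os.environ.get("HUGGINGFACEHUB_API_TOKEN") is not None)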
Running the project
Before running the project, you need to perform the following installation steps:
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
Creating the interface
We use Streamlit to create the application's interface, which makes it very simple to build. We can see the following code in myapp.py:
myapp.py
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from htmlTemplates import css, bot_template, user_template

def get_pdf_texts(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)
    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        pass

    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a sidebar for uploading PDFs
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Extract the raw text from the uploaded PDFs
                raw_text = get_pdf_texts(pdf_docs)
                st.write(raw_text)

if __name__ == "__main__":
    main()
In the code above, we create a sidebar used to select the required PDF files. We can click the Process button to display the extracted PDF text. We can run the application with the following command:
(venv) $ streamlit run myapp.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8502
  Network URL: http://198.18.1.13:8502
After running the command above, we can open the application in the browser:

We click Browse files and select the PDF files:

After clicking Process above, we can see:
Above, for ease of display, we use st.write to render the extracted text directly on the page. Next, we need to split this long text into individual chunks; as the model requires, each chunk must not exceed the maximum input size the model allows. A minimal sketch of this chunking step follows.
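This sketch mirrors the get_text_chunks function in the full listing below; the sample string is my own placeholder:

from langchain.text_splitter import CharacterTextSplitter

# In the app, this is the raw_text returned by get_pdf_texts()
sample_text = "first paragraph\nsecond paragraph\nthird paragraph"
text_splitter = CharacterTextSplitter(
    separator="\n",     # split on newlines first
    chunk_size=1000,    # at most ~1000 characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
    length_function=len
)
chunks = text_splitter.split_text(sample_text)
print(len(chunks))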
The above briefly describes how the UI is constructed. The final, complete myapp.py looks like this:
myapp.py
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from htmlTemplates import css, bot_template, user_template

import lib_indexer
import lib_llm
import lib_embeddings
import lib_vectordb

index_name = "pdf_docs"

def get_pdf_text(pdf):
    text = ""
    pdf_reader = PdfReader(pdf)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

def get_pdf_texts(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # chunks = text_splitter.split_documents(text)
    return chunks

def get_text_chunks1(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    chunks = text_splitter.split_text(text)
    return chunks

def handle_userinput(db, llm_chain_informed, user_question):
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask Local LLM context informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context = similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context, question=user_question)

    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)

def main():
    # Huggingface embedding setup
    hf = lib_embeddings.setup_embeddings()

    ## Elasticsearch as a vector db
    db, url = lib_vectordb.setup_vectordb(hf, index_name)

    ## set up the conversational LLM
    llm_chain_informed = lib_llm.make_the_llm()

    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)
    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        handle_userinput(db, llm_chain_informed, user_question)

    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a sidebar
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Get the raw pdf text
                # raw_text = get_pdf_text(pdf_docs[0])
                raw_text = get_pdf_texts(pdf_docs)
                # st.write(raw_text)
                print(raw_text)

                # Get the text chunks
                text_chunks = get_text_chunks(raw_text)
                # st.write(text_chunks)

                # Create vector store
                lib_indexer.loadPdfChunks(text_chunks, url, hf, db, index_name)

if __name__ == "__main__":
    main()
Creating the embedding model
lib_embeddings.py
## for embeddings
from langchain.embeddings import HuggingFaceEmbeddings

def setup_embeddings():
    # Huggingface embedding setup
    print(">> Prep. Huggingface embedding setup")
    model_name = "sentence-transformers/all-mpnet-base-v2"
    return HuggingFaceEmbeddings(model_name=model_name)
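As a quick hedged check of my own (not part of the app): all-mpnet-base-v2 maps every text to a 768-dimensional vector, which is what gets stored in Elasticsearch:

from lib_embeddings import setup_embeddings

hf = setup_embeddings()
vector = hf.embed_query("What is Elasticsearch?")
print(len(vector))  # 768 for all-mpnet-base-v2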
Creating the vector store
lib_vectordb.py
import os
from config import Config

## for vector store
from langchain.vectorstores import ElasticVectorSearch

def setup_vectordb(hf, index_name):
    # Elasticsearch URL setup
    print(">> Prep. Elasticsearch config setup")

    with open('simple.cfg') as f:
        cfg = Config(f)

    endpoint = cfg['ES_SERVER']
    username = "elastic"
    password = cfg['ES_PASSWORD']

    ssl_verify = {
        "verify_certs": True,
        "basic_auth": (username, password),
        "ca_certs": "./http_ca.crt",
    }

    url = f"https://{username}:{password}@{endpoint}:9200"

    return ElasticVectorSearch(embedding=hf,
                               elasticsearch_url=url,
                               index_name=index_name,
                               ssl_verify=ssl_verify), url
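A minimal usage sketch of my own, assuming Elasticsearch is running, simple.cfg is filled in as above, and the index has already been populated:

from lib_embeddings import setup_embeddings
from lib_vectordb import setup_vectordb

hf = setup_embeddings()
db, url = setup_vectordb(hf, "pdf_docs")
# similarity_search returns LangChain Documents, most relevant first
docs = db.similarity_search("What is this document about?")
print(docs[0].page_content)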
Creating an offline LLM that uses a prompt template with context and question variables
lib_llm.py
## for conversation LLM
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

def make_the_llm():
    # Get Offline flan-t5-large ready to go, in CPU mode
    print(">> Prep. Get Offline flan-t5-large ready to go, in CPU mode")
    model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # load_in_8bit=True, device_map='auto'
    pipe = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=100
    )
    local_llm = HuggingFacePipeline(pipeline=pipe)

    # template_informed = """
    # I know the following: {context}
    # Question: {question}
    # Answer: """

    template_informed = """
    I know: {context}
    when asked: {question}
    my response is: """

    prompt_informed = PromptTemplate(template=template_informed, input_variables=["context", "question"])

    return LLMChain(prompt=prompt_informed, llm=local_llm)
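A minimal sketch of how this chain is invoked; the context and question strings are my own examples:

from lib_llm import make_the_llm

chain = make_the_llm()
# The prompt template fills {context} and {question} before calling flan-t5-large
answer = chain.run(
    context="Elasticsearch is a distributed search and analytics engine.",
    question="What is Elasticsearch?")
print(answer)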
Writing the PDF files as vectors
Below is my chunking and vector-storage code. It needs the composed Elasticsearch URL, the Hugging Face embedding model, the vector database object, and the name of the target index in Elasticsearch.
lib_indexer.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

## for vector store
from langchain.vectorstores import ElasticVectorSearch
from elasticsearch import Elasticsearch
from config import Config

with open('simple.cfg') as f:
    cfg = Config(f)

fingerprint = cfg['ES_FINGERPRINT']
endpoint = cfg['ES_SERVER']
username = "elastic"
password = cfg['ES_PASSWORD']
ssl_verify = {
    "verify_certs": True,
    "basic_auth": (username, password),
    "ca_certs": "./http_ca.crt"
}

url = f"https://{username}:{password}@{endpoint}:9200"

def parse_book(filepath):
    loader = TextLoader(filepath)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    return docs

def parse_triplets(filepath):
    docs = parse_book(filepath)
    result = []
    for i in range(len(docs) - 2):
        concat_str = docs[i].page_content + " " + docs[i+1].page_content + " " + docs[i+2].page_content
        result.append(concat_str)
    return result
# db.from_texts(docs, embedding=hf, elasticsearch_url=url, index_name=index_name)

## load book utility
## params
##   filepath: where to get the book txt ... should be utf-8
##   url: the full Elasticsearch url with username password and port embedded
##   hf: hugging face transformer for sentences
##   db: the VectorStore LangChain object ready to go with embedding thing already set up
##   index_name: name of index to use in ES
##
## will check if the index_name exists already in ES url before attempting split and load
def loadBookTriplets(filepath, url, hf, db, index_name):
    with open('simple.cfg') as f:
        cfg = Config(f)

    fingerprint = cfg['ES_FINGERPRINT']
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        # parse_triplets returns plain strings, so index them with from_texts
        results = parse_triplets(filepath)

        print(">> 2. Index the chunks into Elasticsearch")

        elastic_vector_search = ElasticVectorSearch.from_texts(results,
                                                               embedding=hf,
                                                               elasticsearch_url=url,
                                                               index_name=index_name,
                                                               ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")

def loadBookBig(filepath, url, hf, db, index_name):
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        docs = parse_book(filepath)

        # print(docs)

        print(">> 2. Index the chunks into Elasticsearch")

        elastic_vector_search = ElasticVectorSearch.from_documents(docs,
                                                                   embedding=hf,
                                                                   elasticsearch_url=url,
                                                                   index_name=index_name,
                                                                   ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")

def loadPdfChunks(chunks, url, hf, db, index_name):
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 2. Index the chunks into Elasticsearch")

        print("url: ", url)
        print("index_name", index_name)

        elastic_vector_search = db.from_texts(chunks,
                                              embedding=hf,
                                              elasticsearch_url=url,
                                              index_name=index_name,
                                              ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")
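A minimal sketch of my own tying the pieces together, calling loadPdfChunks the same way myapp.py does; the chunk strings are placeholders:

import lib_embeddings, lib_indexer, lib_vectordb

hf = lib_embeddings.setup_embeddings()
db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")
chunks = ["first chunk of text", "second chunk of text"]  # normally from get_text_chunks()
lib_indexer.loadPdfChunks(chunks, url, hf, db, "pdf_docs")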
Asking questions
We use a Streamlit text input for asking questions:
user_question = st.text_input("Ask a question about your documents")
if user_question:
    handle_userinput(db, llm_chain_informed, user_question)
When we press the ENTER key, the code above calls handle_userinput(db, llm_chain_informed, user_question):
def handle_userinput(db, llm_chain_informed, user_question):
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask Local LLM context informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context = similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context, question=user_question)

    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)
It first performs a similarity search against db, and then uses the large language model to produce the answer we want.
Results
We run the code with the following command:
streamlit run myapp.py
In the browser, we select the two PDF files in pdf_files:
Above, we enter the question we want to ask:
The question above is:
what do I make all the same and put a cup next to him on the desk?
We ask another question:
The question above is:
when should you come? I will send a car to meet you from the half past four arrival at Harrogate Station.
The question above is:
what will I send to meet you from the half past four arrival at Harrogate Station?
You can try out other questions as many times as you like. Happy journey 🙂
Using ChatGPT works in essentially the same way: you just need to use a ChatGPT model and its corresponding key. We will not go into detail here, but a minimal sketch follows.
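For reference, here is a hedged sketch of what make_the_llm could look like with OpenAI instead of the local flan-t5 model; the model name is just an example, and it assumes OPENAI_API_KEY is set in .env:

from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI

def make_the_llm_openai():
    template_informed = """
    I know: {context}
    when asked: {question}
    my response is: """
    prompt_informed = PromptTemplate(template=template_informed,
                                     input_variables=["context", "question"])
    # ChatOpenAI reads OPENAI_API_KEY from the environment
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    return LLMChain(prompt=prompt_informed, llm=llm)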