RAG: Boosting Similarity Search with Document Expansion by Query Prediction, and Measuring the Results with MRR (Mean Reciprocal Rank) Using LLM-as-a-Judge

Information retrieval and natural language processing (NLP) have advanced rapidly in recent years, especially with the arrival of AI products such as ChatGPT and Google Gemini and other specialized applications built on LLMs (Large Language Models). They let us search across text, images, and audio more conveniently and more accurately.

Many organizations have therefore started applying LLMs internally, for example to build chatbots that answer internal questions, to summarize documents, or to answer in-depth questions from the organization's own specialized data.

The problem is that most of an organization's data was never part of these LLMs' pre-training, because it is internal or not publicly available.

So how do we get these models to answer using our own data?

One very popular approach is RAG (Retrieval-Augmented Generation).

RAG combines retrieval with answer generation.

The system first retrieves information from the organization's knowledge base or documents, then feeds what it found to the LLM so that the model can use that up-to-date context when generating an answer.

Put simply, instead of having the LLM guess only from what it already knows, we let it "open the book" before answering.

This idea underlies many enterprise chatbot systems, and it leads to the question this article is about:

How do we make the retrieval step return the most accurate and most relevant information?

After practicing building a RAG (Retrieval-Augmented Generation) system and a Similarity Search over internal text data, I learned that there are many ways to improve search quality.

One common problem is that the query a user types often does not match the wording in the actual documents. This is especially true when relying on embedding-based search alone, that is, converting the query and the documents into vectors and using cosine similarity to find the closest documents.

Even when the meanings are close, if the vector space is not well aligned or does not capture domain-specific terms well enough, the search misses documents that are actually relevant.

Today I want to share one technique that can give Similarity Search a real boost: Document Expansion by Query Prediction. We will also measure the results with MRR (Mean Reciprocal Rank) using LLM-as-a-Judge.

🤖📝 Document Expansion by Query Prediction (Doc2Query)

The method is simple but effective: instead of expanding the user's query (query expansion), we expand the documents.

We can use an LLM or a smaller model such as T5 (this article uses an LLM) to generate synthetic queries that are likely to be used to search for a given document, then append those queries to the document itself.

This lets the vector that represents the document cover a wider range of phrasings. When it is compared with the vector of a real query, a match becomes more likely, and the search results improve accordingly.

For the full details on Doc2Query, see: Document Expansion by Query Prediction
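To make the idea concrete, here is a minimal sketch of the expansion step. It is only an illustration: `expand_document` and `fake_query_generator` are hypothetical names, and the generator returns fixed strings where a real setup would call an LLM.

```python
from typing import Callable, List


def expand_document(
    document: str,
    generate_queries: Callable[[str], List[str]],
) -> str:
    """Append predicted queries to the document text (Doc2Query-style)."""
    predicted_queries = generate_queries(document)
    # The expanded text is what gets embedded; the original text is kept as-is.
    return "\n".join([document, *predicted_queries])


def fake_query_generator(document: str) -> List[str]:
    # Stand-in for an LLM call; returns fixed queries for illustration only.
    return ["how do I ask for time off?", "vacation request steps"]


expanded = expand_document(
    "Guide to submitting a leave request form.", fake_query_generator
)
print(expanded)
```

Passing the generator in as a function keeps the expansion logic independent of whichever model produces the queries.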

🔍 What is Similarity Search?

Similarity Search is a basic technique for finding documents that are close to the user's question, not just by matching keywords but by comparing meaning in a vector space.

We convert all texts or documents into vectors with an embedding model such as sentence-transformers. At query time we convert the query into a vector and find the documents whose vectors are nearest to it.

Matching accuracy depends on the quality of the embeddings and on the document content that the vectors are built from, which is where Doc2Query helps.

How do we measure whether it really improved? One popular metric is:

🥇📈 MRR (Mean Reciprocal Rank)

MRR looks at the rank at which the truly relevant document appears in our results. If the right document shows up at rank 1 the score is 1.0, at rank 2 it is 0.5 (that is, 1/rank), and so on down the list. Averaging these scores over all queries gives an overall picture of how accurately the system retrieves.

https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/
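The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming each query is summarized by the 1-based rank of its first relevant result (or `None` when nothing relevant was retrieved); the function names are my own.

```python
from typing import List, Optional


def reciprocal_rank(first_relevant_rank: Optional[int]) -> float:
    """Return 1/rank for a 1-based rank, or 0.0 when nothing relevant was found."""
    if first_relevant_rank is None:
        return 0.0
    return 1.0 / first_relevant_rank


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """Average the reciprocal ranks over all queries."""
    if not first_relevant_ranks:
        return 0.0
    total = sum(reciprocal_rank(rank) for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


# Four queries: relevant hit at rank 1, rank 2, not found, rank 4
ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```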

But how do we know whether a search result is relevant to the question, and how relevant? Have a human check the results every time? (That would be exhausting.)

In this article we use LLM-as-a-Judge, that is, we let an AI do the checking.

The idea is to pass the search results (for example the top 5 documents) together with the original query to an LLM and ask it to assess how well each document matches the question, by scoring or ranking the documents by relevance.

The advantage is that no human has to read and score every case.

The drawback is that the LLM itself may carry biases, so its judgments need to be treated with care.

Overall, though, it is an appealing, labor-saving approach when we need to evaluate search quality automatically.
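As a rough sketch of how a judge plugs into a rank-based metric: the judge is just a function that says whether a document is relevant to a query, and we record the rank of the first document it accepts. Here `keyword_judge` is a hypothetical placeholder for the LLM call, used only so the example runs on its own.

```python
from typing import Callable, List, Optional


def first_relevant_rank(
    query: str,
    ranked_documents: List[str],
    is_relevant: Callable[[str, str], bool],
) -> Optional[int]:
    """Return the 1-based rank of the first document the judge accepts."""
    for rank, document in enumerate(ranked_documents, start=1):
        if is_relevant(query, document):
            return rank
    return None


def keyword_judge(query: str, document: str) -> bool:
    # Placeholder for an LLM call: "relevant" if any query word is in the document.
    return any(word in document.lower() for word in query.lower().split())


results = ["Office parking rules", "Company leave policy explained."]
rank = first_relevant_rank("leave policy", results, keyword_judge)
print(rank)  # 2
```

The returned rank (or `None`) is exactly the input a reciprocal-rank calculation needs, so swapping the placeholder for a real LLM judge does not change the rest of the evaluation.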

🎯 Summary: this article walks through an experiment that uses Document Expansion by Query Prediction (Doc2Query) to improve Similarity Search, measuring the results with MRR and LLM-as-a-Judge to see whether the technique really helps, and by how much.

🧰 Tools and technologies used

Before starting the experiment, let's get to know the main tools and libraries used to boost Similarity Search with Document Expansion by Query Prediction (Doc2Query), and to measure the results with MRR (Mean Reciprocal Rank) using LLM-as-a-Judge.

💻 VS Code

The main IDE for writing and managing the Python code throughout the project.

🐍 Python

The main programming language, used to:

• process text

• call models

• talk to the vector database

• control the logic of each step

⚡️ FastEmbed

A client-side library for generating text embeddings quickly.

Supports a variety of models, such as BAAI/bge-small-en-v1.5.

🧲 Qdrant

A vector database for storing and searching document embeddings.

Supports dense vectors, sparse vectors, and hybrid search.

🤖 Google Gemini (2.0 Flash / 1.5 Flash)

Used as the main LLM, both to:

• generate the expansion queries (Doc2Query)
• act as the judge that evaluates the search results

Highlights: free to use within a quota, fast responses, well suited to experiments.

How this article is organized

🧩 Part 1: Understanding the code and the basic concepts

In this part we walk through the code and explain at a basic level how each piece works, to show the mechanics behind the whole process that we will use later.

It covers:

• Creating text embeddings
• Using a vector database
• The Similarity Search step
• Calling an LLM to generate or evaluate

🚀 Part 2: Using Doc2Query to boost search, and evaluating with LLM-as-a-Judge

In this part we apply what we learned in Part 1 to the main topic of the article: improving Similarity Search with Document Expansion (Doc2Query), and testing the results with these metrics:

🥇 MRR (Mean Reciprocal Rank)
👨‍⚖️ Evaluation with LLM-as-a-Judge

🧩 Part 1: Understanding the code and the basic concepts

Before getting hands-on, let's look at the overall project structure, so it is clear where each file belongs in the folder tree and there is less confusion while writing code. The files that appear in this part include main_example.py, docker-compose.yml, and modules under src/ (constants, embeddings, retrieval, generation, models).

💡 Note: feel free to adapt the structure to your own style; there is no need to follow it exactly.

🔢 Text Embedding (converting text into numbers so that a computer can work with its meaning)

Before starting, here is a brief introduction to dense and sparse text embeddings.

🧠 What is a Dense Vector?

Suppose we want a computer to "understand the meaning" of a piece of text.

We pass the text to a model (such as BERT or BGE) that converts it into a list of numbers such as [0.12, -0.55, 0.87, …]. This is a dense vector.

💡 Key points:

• The numbers do not represent individual words; they represent the overall meaning of the text
• Sentences that use different words but have similar meanings get similar dense vectors

Advantage: dense vectors let the system capture meaning more deeply than just checking whether words match.

Example

Sentence A: "Can I take leave tomorrow?"  
Sentence B: "I will take tomorrow off work"

→ With dense vectors, these two sentences end up with vectors that are close together  
→ so the system can tell that both are about taking leave, even though the wording differs
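"Close together" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors, not real model output, purely to show that two vectors pointing in a similar direction score higher than two that do not.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (assumed values, not produced by any model)
leave_request = [0.9, 0.1, 0.0]
day_off = [0.8, 0.2, 0.1]
salary = [0.0, 0.1, 0.9]

similar = cosine_similarity(leave_request, day_off)
different = cosine_similarity(leave_request, salary)
print(similar > different)  # True
```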

🧾 What is a Sparse Vector?

This is the traditional way of representing text.

It does not capture meaning directly; it records which words appear and how often.

Think of a sparse vector as a "word attendance sheet":

• If a word appears in the text → put 1
• If it does not appear → put 0
• Or use a weight instead, such as TF-IDF or BM25

💡 Key points:

• It looks at words, not meaning
• It is fast and lightweight, but does not capture meaning the way a dense vector does

Advantages of sparse vectors:

• Fast and easy to use
• Very good for finding text that uses exactly the same words

Limitation:

• If the user's wording differs from the document (for example a synonym or a near-equivalent phrase), it will not find the document

Example

Sentence: "Request leave from work tomorrow"

The system has a list of every word it knows, such as ["leave", "request", "work", "meeting", "salary", ...]  
and checks which of them occur in our sentence

→ The vector looks roughly like this:
{"leave": 1, "request": 1, "work": 1, "meeting": 0, "salary": 0, ...}
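The "attendance sheet" above can be sketched as a tiny binary bag-of-words function. This is an illustration only: it uses English stand-ins for the words in the example above, and the token list is assumed to come from some tokenizer.

```python
from typing import Dict, List


def binary_bag_of_words(tokens: List[str], vocabulary: List[str]) -> Dict[str, int]:
    """Mark each vocabulary word with 1 if it occurs in the tokens, else 0."""
    present = set(tokens)
    return {word: int(word in present) for word in vocabulary}


vocabulary = ["leave", "request", "work", "meeting", "salary"]
tokens = ["request", "leave", "work", "tomorrow"]  # assumed tokenizer output

vector = binary_bag_of_words(tokens, vocabulary)
print(vector)
# {'leave': 1, 'request': 1, 'work': 1, 'meeting': 0, 'salary': 0}
```

Note that "tomorrow" is silently dropped because it is not in the vocabulary, which is the same reason a synonym the vocabulary has never seen cannot match.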

🧠 Creating dense embeddings from text with FastEmbed

We start by converting text, such as documents, articles, or FAQs that have already been chunked, into dense embeddings with a library called FastEmbed.

FastEmbed's highlights:

⚡️ Easy to use, with very little code

🤗 Supports a range of models hosted on Hugging Face

📦 Install the library (if it is not installed yet)

pip install fastembed

⚙️ Example: creating dense embeddings

File src/embeddings/dense_text_embedding.py
Add code that embeds text as dense vectors

from fastembed import TextEmbedding

# Create encoder (you can specify the model name you want to use)
embedding_model: TextEmbedding = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Sample text list
documents = [
    "How to request vacation leave?",
    "Company leave policy explained.",
    "Guide to submitting a leave request form.",
]

# Convert text to dense embedding
embeddings = list(embedding_model.embed(documents))

# Display results
for i, doc in enumerate(documents):
    print(f"Document {i}: {doc}")
    print(f"Embedding: {embeddings[i][:5]}")
    print()

# Example output (first 5 values of embedding)
# Document 0: How to request vacation leave?
# Embedding: [-0.03820893 -0.02452314  0.00697538 -0.05776384  0.0553261, ...]

# Document 1: Company leave policy explained.
# Embedding: [-0.02869109  0.01727294 -0.01712723 -0.05012331  0.05791203, ...]

# Document 2: Guide to submitting a leave request form.
# Embedding: [-0.07710513 -0.0187871   0.04183139 -0.02038685  0.04148764, ...]

We now have example code for creating dense vector embeddings.

💡 Notes:

This project uses the BAAI/bge-small-en-v1.5 model because it is small, fast, and good enough for retrieval work, and it is widely used in RAG systems.

For Thai text, a multilingual model is recommended, although it may be somewhat larger. For the model names that FastEmbed can use, see: FastEmbed Supported Models

🧾 Creating sparse embeddings with BM25 (via FastEmbed)

Although FastEmbed is best known for dense embeddings, it also supports sparse embeddings with BM25, which is convenient because the API is almost identical.

⚙️ Example: creating sparse embeddings

File src/embeddings/sparse_text_embedding.py
Add code that embeds text as sparse vectors

from fastembed import SparseTextEmbedding

# Create encoder (you can specify the model name you want to use)
embedding_model: SparseTextEmbedding = SparseTextEmbedding(model_name="Qdrant/bm25")

# Sample text list
documents = [
    "How to request vacation leave?",
    "Company leave policy explained.",
    "Guide to submitting a leave request form.",
]


# Convert text to sparse embedding
embeddings = list(embedding_model.embed(documents))


# Display results
for i, doc in enumerate(documents):
    print(f"Document {i}: {doc}")
    print(f"Embedding: {embeddings[i]}")
    print()

# Example output
# Document 0: How to request vacation leave?
# Embedding: SparseEmbedding(values=array([1.67868852, 1.67868852, 1.67868852]), indices=array([2064885619, 1253263062, 1943443462]))

# Document 1: Company leave policy explained.
# Embedding: SparseEmbedding(values=array([1.67419738, 1.67419738, 1.67419738, 1.67419738]), indices=array([1442710396, 1943443462,  203139330, 1971389377]))

# Document 2: Guide to submitting a leave request form.
# Embedding: SparseEmbedding(values=array([1.66973021, 1.66973021, 1.66973021, 1.66973021, 1.66973021]), indices=array([1209733406, 1546460417, 1943443462, 2064885619,   14784032]))

🔍 Further explanation

• values: the weight of each term, computed with the BM25 formula
• indices: the position of each term in the overall vocabulary (usually shown as a hash ID)

Even though the terms are not visible as strings, this sparse embedding can be used directly with

📦 a vector DB such as Qdrant, or in a document retrieval pipeline
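To see why the indices/values pair is enough for retrieval, here is a minimal sketch of scoring two sparse vectors with a dot product: only the indices they share contribute. The numbers are made up, and `sparse_dot` is a name of my own, not a FastEmbed or Qdrant function.

```python
from typing import Sequence


def sparse_dot(
    indices_a: Sequence[int],
    values_a: Sequence[float],
    indices_b: Sequence[int],
    values_b: Sequence[float],
) -> float:
    """Dot product of two sparse vectors given as parallel index/value arrays."""
    weights_b = dict(zip(indices_b, values_b))
    # Indices missing from the other vector contribute nothing to the score
    return sum(
        (value * weights_b.get(index, 0.0) for index, value in zip(indices_a, values_a)),
        0.0,
    )


query = ([7, 42], [1.0, 1.0])
document = ([42, 99, 7], [2.0, 0.5, 1.5])
print(sparse_dot(*query, *document))  # 1.0 * 1.5 + 1.0 * 2.0 = 3.5
```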

⚔️ Dense vs Sparse Vectors

🧠 Dense Embedding (FastEmbed)

✅ The vector is a list of floating-point numbers
✅ Captures the overall meaning of the text
✅ Suited to queries with a broad meaning
✅ Works well alongside an LLM or fuzzy matching
❌ Does not rely on exact word matches

🧾 Sparse Embedding (BM25 via FastEmbed)

✅ The vector is a pair of arrays (values + indices)
✅ Relies on exact word matches only
✅ Suited to queries whose keywords match the documents
✅ Fast, and precise when searching for exact terms
❌ Does not capture overall meaning

🚀 Ready to go! Building embedding functions for real use

Now that we understand both

🔹 Dense vectors, which capture the meaning of text

🔹 Sparse vectors, which focus on exact words

it is time to get hands-on!

We will build functions that convert text into embedding vectors, for use in the Similarity Search step or for sending to a vector database such as Qdrant.

⚙️ Example: text embedding functions

We start by creating separate files for the constants and for the embedding functions, to keep the project tidy and easy to reuse.

File src/constants/constants.py

from typing import Final


FASTEMBED_DENSE_MODEL_NAME: Final[str] = "BAAI/bge-small-en-v1.5"
FASTEMBED_BM25_MODEL_NAME: Final[str] = "Qdrant/bm25"

File src/embeddings/text_embeddings.py

from fastembed import TextEmbedding, SparseTextEmbedding
import src.constants.constants as constants


# Create the encoders once at import time so each model is loaded only once
dense_text_embedding = TextEmbedding(model_name=constants.FASTEMBED_DENSE_MODEL_NAME)
sparse_text_embedding = SparseTextEmbedding(
    model_name=constants.FASTEMBED_BM25_MODEL_NAME
)


# Dense embedding for a single text
def dense_embedding(document: str) -> list[float]:
    # embed() yields one numpy array per input text. We pass a single text,
    # take the first (only) result, and convert it to a plain list of floats.
    return list(dense_text_embedding.embed([document]))[0].tolist()


# Dense embedding for multiple texts
def dense_embedding_list(documents: list[str]) -> list[list[float]]:
    return [vector.tolist() for vector in dense_text_embedding.embed(documents)]


# Sparse embedding for a single text
def sparse_embedding(document: str):
    # embed() yields one SparseEmbedding (values + indices) per input text.
    # We pass a single text, so we take the first (only) result.
    return list(sparse_text_embedding.embed([document]))[0]


# Sparse embedding for multiple texts
def sparse_embedding_list(documents: list[str]) -> list:
    return list(sparse_text_embedding.embed(documents))

🧪 Testing the text embedding functions with main_example.py

We can create a main_example.py file to call or test the functions we have written.

⚙️ Example main_example.py

File main_example.py

from src.embeddings.text_embeddings import (
    dense_embedding,
    dense_embedding_list,
    sparse_embedding,
    sparse_embedding_list,
)


example_documents = [
    "How to request vacation leave?",
    "Company leave policy explained.",
    "Guide to submitting a leave request form.",
]


def test_dense_embedding():
    print("Testing dense embedding...")
    for doc in example_documents:
        print(f"Document: {doc}")
        print(f"Dense embedding: {dense_embedding(doc)}")


def test_dense_embedding_list():
    print("Testing dense embedding list...")
    embeddings = dense_embedding_list(example_documents)
    for doc, embedding in zip(example_documents, embeddings):
        print(f"Document: {doc}")
        print(f"Dense embedding: {embedding}")


def test_sparse_embedding():
    print("Testing sparse embedding...")
    for doc in example_documents:
        print(f"Document: {doc}")
        print(f"Sparse embedding: {sparse_embedding(doc)}")


def test_sparse_embedding_list():
    print("Testing sparse embedding list...")
    embeddings = sparse_embedding_list(example_documents)
    for doc, embedding in zip(example_documents, embeddings):
        print(f"Document: {doc}")
        print(f"Sparse embedding: {embedding}")


def main():
    print("Running main.py...")
    # Uncomment the tests you want to run
    # test_dense_embedding()
    # test_dense_embedding_list()
    # test_sparse_embedding()
    # test_sparse_embedding_list()


if __name__ == "__main__":
    main()

✅ Expected result

If the library and the models are installed correctly, calling any of the functions above prints the embedding of the text (dense: a vector of floats; sparse: a SparseEmbedding object holding values and indices).

🗂️ Next step

Now that we have embedding functions for both dense and sparse vectors,

👉 next we will store these vectors in a vector database (Qdrant)

to get ready for real Similarity Search 🔍

🗃️ Storing embeddings in a vector database with Qdrant (local)

Once the text has been converted into embeddings (dense or sparse),

📌 the next step is to store those embeddings in a vector database so that they can be searched efficiently later.

❓ What is Qdrant?

Qdrant (pronounced "quadrant") is a database system for storing vectors.

It is suited to workloads that search documents by semantic closeness (vector similarity).

✨ Qdrant highlights:

✅ Supports both dense and sparse embeddings

✅ Runs both locally and in the cloud

✅ Easy-to-use API (Python SDK, REST API, and gRPC)

✅ Suited to RAG, semantic search, recommendation, and AI agents

🐳 Installing Qdrant locally with Docker Compose

We use a docker-compose.yml file to run Qdrant on our own machine.

⚙️ Example Docker Compose file

File docker-compose.yml

version: '3.7'

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"  # HTTP API
      - "6334:6334"  # gRPC API
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT_ALLOW_ORIGIN=*

volumes:
  qdrant_data:

▶️ How to run

Open a terminal in the folder that contains docker-compose.yml and run:

docker-compose up -d

If everything is correct, Qdrant starts running in the background,

ready to serve its API on port 6333 (HTTP) and 6334 (gRPC).

🌐 Testing the Qdrant connection

After starting Qdrant with docker-compose,

we can check whether it is reachable through the web UI.

🔗 Open a browser and go to:

http://localhost:6333/dashboard#/welcome

📌 If everything is working,

you will see the Qdrant Dashboard UI with basic information about the system.


📍 If the page does not load:

• Check that the Qdrant Docker container is running (docker ps)
• Port 6333 may already be in use or blocked
• Try docker-compose down, then docker-compose up -d again

🔌 Installing the client and connecting to Qdrant from Python

With the Qdrant server running in Docker,

the next step is to write Python code that connects to it.

📦 Install the Qdrant client library

pip install qdrant-client

⚙️ Example: connecting to Qdrant

It is a good idea to keep QDRANT_HOST and QDRANT_PORT in constants.py.

File src/constants/constants.py

# ... existing code ...

QDRANT_HTTP_HOST: Final[str] = "http://localhost"
QDRANT_HOST: Final[str] = "localhost"
# HTTP port exposed by the Qdrant container in docker-compose.yml
QDRANT_PORT: Final[int] = 6333

File src/retrieval/qdrant_store.py

from qdrant_client import QdrantClient
from qdrant_client.http import models

import src.constants.constants as constants

# Create client to connect to Qdrant server running through Docker
qdrant_client = QdrantClient(host=constants.QDRANT_HOST, port=constants.QDRANT_PORT)

🧪 Testing the connection (in main_example.py)

# ... existing code ...
from src.retrieval.qdrant_store import  qdrant_client

# Test connection to Qdrant
def test_qdrant_connection():
    try:
        # Check connection
        qdrant_client.get_collections()
        print("Qdrant connection is successful.")
    except Exception as e:
        print(f"Error connecting to Qdrant: {e}")

def main():
    print("Running main.py...")
    test_qdrant_connection()

if __name__ == "__main__":
    main()

🖥️ Expected output

Running main.py...
Qdrant connection is successful.

Once we have confirmed that the connection to Qdrant works ✅,

the next step is 👉 creating a collection and adding vector data to the database.

📦 Creating a collection and adding data to the vector database (Qdrant)

After converting the text into embeddings,

the next steps are:

1. 🧱 Create a collection in Qdrant to hold the vectors

2. 📥 Add the embeddings, together with their metadata, to the collection

⚙️ Example: functions for creating a collection

File src/retrieval/qdrant_store.py

from qdrant_client import QdrantClient
from qdrant_client.http import models
import src.constants.constants as constants

# Create client to connect to Qdrant server running through Docker
qdrant_client = QdrantClient(host=constants.QDRANT_HOST, port=constants.QDRANT_PORT)

# Function to create collection in Qdrant
def create_collection(collection_name: str):
    try:
        # Check if collection already exists
        qdrant_client.get_collection(collection_name=collection_name)
    except Exception:
        # If collection doesn't exist, create new one
        qdrant_client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(
                distance=models.Distance.COSINE,
                size=constants.VECTOR_SIZE,
            ),
        )

# Function to delete collection in Qdrant
def delete_collection(collection_name: str):
    try:
        # Check if collection already exists
        qdrant_client.get_collection(collection_name=collection_name)
        # If exists, delete collection
        qdrant_client.delete_collection(collection_name=collection_name)
    except Exception:
        # If collection doesn't exist, do nothing
        pass

📌 VECTOR_SIZE (a constant to define in constants.py) must match the dimension of the embedding vectors in use,

for example 384, 768, or 1024 depending on the model (BAAI/bge-small-en-v1.5 produces 384-dimensional vectors).
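A mismatch between the collection's configured size and the vectors you send is an easy mistake to make when switching models. As a small optional safeguard, a check like the sketch below could run before inserting; `check_dimension` is a hypothetical helper, and 384 is assumed here because it is the size noted above for BAAI/bge-small-en-v1.5.

```python
from typing import Sequence

VECTOR_SIZE = 384  # assumed: dimension of BAAI/bge-small-en-v1.5 embeddings


def check_dimension(vector: Sequence[float], expected_size: int = VECTOR_SIZE) -> None:
    """Raise ValueError if the vector length differs from the collection size."""
    if len(vector) != expected_size:
        raise ValueError(
            f"Expected {expected_size} dimensions, got {len(vector)}"
        )


check_dimension([0.0] * 384)  # passes silently
```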

🆔 Using uuid to generate unique IDs

The uuid module is part of the Python standard library, so nothing needs to be installed.

⚙️ Example: function for adding an embedding to Qdrant

File src/retrieval/qdrant_store.py

# ... existing code ...
from uuid import uuid4

def generate_id():
    # Generate ID for Qdrant
    return str(uuid4())

# Function to upload embedded text to Qdrant
# We will store original text, metadata, and embedding vector together
# So that the data can be used in the future
def add_embedding(
    collection_name: str,
    vector: list,
    original_text: str,
    expanded_text: str,
    metadata: dict,
):
    # Create collection if it doesn't exist yet
    create_collection(collection_name)

    # Add data to Qdrant
    qdrant_client.upsert(
        collection_name=collection_name,
        points=[
            models.PointStruct(
                id=generate_id(),
                vector=vector,
                payload={
                    "original_text": original_text,
                    "expanded_text": expanded_text,
                    "metadata": metadata,
                },
            )
        ],
    )

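📝 Further explanation

• vector: the vector produced from the text (the collection created above is configured for dense vectors)
• original_text: the original text before embedding
• expanded_text: the document text after expansion (Doc2Query)
• metadata: extra information such as source, topic, chunk_index, etc.
• payload: can hold arbitrary data (it is returned along with query results)

Example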

{
  "original_text": "Company leave policy explained.",
  "metadata": {
    "source": "HR_FAQ.md",
    "topic": "leave_policy"
  }
}

🧪 Testing adding data (embedding + metadata) to Qdrant (in main_example.py)

We will:

• Convert an example text into an embedding
• Send the embedding + metadata to the vector database (Qdrant)
• Check the result in the Qdrant UI

# ... existing code ...
from src.retrieval.qdrant_store import add_embedding, qdrant_client
import src.constants.constants as constants

example_documents = [
    "How to request vacation leave?",
    "Company leave policy explained.",
    "Guide to submitting a leave request form.",
]

# Test adding data to Qdrant
# Assumes TEST_COLLECTION_NAME is defined in constants.py, e.g. "test_collection"
def test_insert_documents():
    embedding = dense_embedding(example_documents[2])
    add_embedding(
        collection_name=constants.TEST_COLLECTION_NAME,
        vector=embedding,
        original_text=example_documents[2],
        # No expansion has been done yet, so store the original text unchanged
        expanded_text=example_documents[2],
        metadata={"doc_info": "Example document info"},
    )

def main():
    print("Running main.py...")
    # ...
    test_insert_documents()

if __name__ == "__main__":
    main()

🔗 Open a browser and go to:

http://localhost:6333/dashboard#/collections

You will see the name of the collection that was just created, for example:

test_collection

Click into it and you will see:

• The stored vector values (embedding)
• The original text (original_text)
• Metadata (such as doc_info)

Alternatively, we can retrieve the data in code by adding a function to qdrant_store.py.

⚙️ Example: retrieving data with Python

File src/retrieval/qdrant_store.py

# ... existing code ...

def get_all_embedding(collection_name: str, limit: int = 10):
    try:
        # Check if collection already exists
        qdrant_client.get_collection(collection_name=collection_name)
        # If exists, retrieve all data
        response = qdrant_client.scroll(
            collection_name=collection_name,
            limit=limit,
        )
        return response
    except Exception as e:
        # If collection doesn't exist
        print(f"Error getting data from Qdrant: {e}")
        return []

🧪 Testing data retrieval from Qdrant (in main_example.py)

⚙️ Example usage

File main_example.py

# ... existing code ...
from src.retrieval.qdrant_store import get_all_embedding,add_embedding, qdrant_client
import src.constants.constants as constants

# Test retrieving all data from Qdrant
def test_get_all_embedding():
    print(f"All embeddings in {constants.TEST_COLLECTION_NAME}:")
    data = get_all_embedding(constants.TEST_COLLECTION_NAME)
    print(data)

def main():
    print("Running main.py...")
    # ...
    test_get_all_embedding()

if __name__ == "__main__":
    main()

# Example output
# Running main.py...
# All embeddings in test_collection:
# ([Record(id='902d56ad-69f4-4217-ad85-f3e6e7d970aa', payload={'original_text': 'How to request vacation leave?', 'metadata': {'doc_info': 'Example document info'}}, vector=None, shard_key=None, order_value=None)], None)

✅ What you have done so far

• Text Embedding: converted text into dense or sparse vectors with FastEmbed
• Vector Database: created a collection and stored vectors + metadata in Qdrant
• Query/Read API: read data back from Qdrant to inspect it or use it further

Searching the vector database with Similarity Search

📌 How it works: Similarity Search with Qdrant

1. Take the user's text, for example: "How to apply for vacation?"

2. Convert that text into an embedding (vector)

3. Send the vector to Qdrant as the search query

4. Qdrant finds the stored vectors that are most similar to it (using the cosine similarity or dot product configured when the collection was created)

5. The most relevant results are returned
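Conceptually, steps 3 to 5 amount to scoring every stored vector against the query and keeping the best few. The sketch below does this by brute force over toy two-dimensional vectors. It is only a mental model: Qdrant uses an index rather than scanning everything, and `top_k` and the sample IDs are names I made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vector: Sequence[float],
    stored_vectors: Dict[str, Sequence[float]],
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the best `limit`."""
    scored = [
        (point_id, cosine(query_vector, vector))
        for point_id, vector in stored_vectors.items()
    ]
    # Highest similarity first, like the ranked hits a vector database returns
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


stored = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.7, 0.7]}
hits = top_k([1.0, 0.1], stored, limit=2)
print([point_id for point_id, _ in hits])  # ['doc-a', 'doc-c']
```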

First, let's add a Similarity Search function to qdrant_store.py.

⚙️ Example: Similarity Search function

File src/retrieval/qdrant_store.py

# ... existing code ...

def similarity_search(collection_name: str, query_vector: list, limit: int = 10):
    try:
        # Check if collection already exists
        qdrant_client.get_collection(collection_name=collection_name)

        # If exists, search for similar data
        response = qdrant_client.search(
            collection_name=collection_name,
            query_vector=query_vector,
            limit=limit,
        )
        return response
    except Exception as e:
        # If collection doesn't exist,
        print(f"Error searching data in Qdrant: {e}")
        return []

🧪 Testing Similarity Search in main_example.py

⚙️ Example usage

File main_example.py

# ... existing code ...
from src.retrieval.qdrant_store import (
    get_all_embedding,
    add_embedding,
    qdrant_client,
    similarity_search,
)

# Test similarity search
def test_similarity_search():
    query = "How to request vacation leave?"
    query_vector = dense_embedding(query)
    search_result = similarity_search(constants.TEST_COLLECTION_NAME, query_vector)
    print(f"Similarity search result for query : {query}")
    print("Search results:")
    print(search_result)


def main():
    print("Running main.py...")
    # ...
    test_similarity_search()

if __name__ == "__main__":
    main()

📈 Sample output (illustrative, formatted for readability)

Search results for query: 'How do I take time off?'
Rank 1:
- ID: 902d56ad...
- Score: 0.91
- Text: How to request vacation leave?
- Metadata: {'doc_info': 'Example document info'}
----------------------------------------
...

📌 Note:

• The closer the score is to 1 (for cosine similarity), the more similar the result
• The logic can be extended to filter on metadata, for example "doc_type": "faq", using Qdrant's Filter()

✨ Using an LLM (Large Language Model) with Google Gemini

To test text generation with an LLM we use Google Gemini, specifically gemini-1.5-flash or gemini-2.0-flash, which have a free quota and are suitable for initial experiments.

🔑 Prepare an API key

Using the Gemini API requires an API key. If you do not have one yet, you can create it at:

👉 GOOGLE GEMINI API KEY

📦 Install and configure

Install the Google Gemini AI library for Python

pip install google-generativeai

⚙️ Example: calling Google Gemini

Update the constants file (load_dotenv comes from the python-dotenv package; install it with pip install python-dotenv if needed)

File src/constants/constants.py

import os
from dotenv import load_dotenv

# ... existing code ...

# Load environment variables from .env file
load_dotenv(override=False)

# ML, AI Model names, Configs
GOOGLE_GEMINI_API_KEY: Final[str] = os.getenv("GOOGLE_GEMINI_API_KEY", "")
GOOGLE_GEMINI_MODEL_NAME: Final[str] = "gemini-2.0-flash"

Function for calling Gemini

File src/generation/google_gemini_ai.py

import google.generativeai as genai
import src.constants.constants as constants


genai.configure(api_key=constants.GOOGLE_GEMINI_API_KEY)
gemini_client = genai.GenerativeModel(constants.GOOGLE_GEMINI_MODEL_NAME)


def gemini_generate_content(
    prompt: str,
    max_output_tokens: int = 8192,
    response_schema=None,
):
    if not constants.GOOGLE_GEMINI_API_KEY:
        raise ValueError(
            "Google Gemini API key is not set. "
            "Please set GOOGLE_GEMINI_API_KEY in your .env file."
        )
    if not prompt:
        raise ValueError("Prompt cannot be empty.")

    # Ask for JSON output; response_schema (optional) constrains its structure
    generation_config = {
        "max_output_tokens": max_output_tokens,
        "response_mime_type": "application/json",
        "response_schema": response_schema,
    }
    # GenerativeModel exposes generate_content(); the token limit is already
    # part of generation_config, so it is not passed separately
    response = gemini_client.generate_content(
        prompt,
        generation_config=generation_config,
    )
    return response.text

🧪 Testing the call from main_example.py

File main_example.py

# ... existing code ...
from src.generation.google_gemini_ai import gemini_generate_content

# Test content generation from LLM
# Using Google Gemini
def test_llm():
    query = "Who is Albert Einstein?"
    response = gemini_generate_content(query)
    print(f"LLM response for query '{query}': {response}")


def main():
    print("Running main.py...")
    # ... existing code ...
    test_llm()

if __name__ == "__main__":
    main()

📈 Sample output

{
  "name": "Albert Einstein",
  "description": "A German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics..."
}

🧩 Using Gemini with Pydantic to control the structure of the response

We can have Gemini reply in a format we define in advance by using Pydantic, a library for working with schema-based data.

📦 Install pydantic

pip install pydantic

Then create an example model in example.py

⚙️ Example Pydantic model

File src/models/example.py

from pydantic import BaseModel


class ExampleModel(BaseModel):
    name: str
    answer: str
    characteristics: str

🚀 Testing a Gemini call with a schema

⚙️ Example usage

File main_example.py

# ... existing code ...
from src.generation.google_gemini_ai import gemini_generate_content
from src.models.example import ExampleModel

def test_llm_with_model():
    query = "Who is Albert Einstein and His/Her characteristics?"
    response = gemini_generate_content(query, response_schema=ExampleModel)

    example_parser = ExampleModel.model_validate_json(response)

    print(f"Name: {example_parser.name}")
    print(f"Answer: {example_parser.answer}")
    print(f"Characteristics: {example_parser.characteristics}")


def main():
    print("Running main.py...")
    # ... existing code ...
    test_llm_with_model()

if __name__ == "__main__":
    main()

📈 Example output matching the model's structure

Name: Albert Einstein  
Answer: Albert Einstein was a German-born theoretical physicist...  
Characteristics: Intellectual curiosity, perseverance, independent thinking, ...

File: main_example.py in its final form

from src.embeddings.text_embeddings import (
    dense_embedding,
    dense_embedding_list,
    sparse_embedding,
    sparse_embedding_list,
)

from src.retrieval.qdrant_store import (
    get_all_embedding,
    add_embedding,
    qdrant_client,
    similarity_search,
)
from src.generation.google_gemini_ai import gemini_generate_content
from src.models.example import ExampleModel
import src.constants.constants as constants

example_documents = [
    "How to request vacation leave?",
    "Company leave policy explained.",
    "Guide to submitting a leave request form.",
]


def test_dense_embedding():
    print("Testing dense embedding...")
    for doc in example_documents:
        print(f"Document: {doc}")
        print(f"Dense embedding: {dense_embedding(doc)}")


def test_dense_embedding_list():
    print("Testing dense embedding list...")
    embeddings = dense_embedding_list(example_documents)
    for doc, embedding in zip(example_documents, embeddings):
        print(f"Document: {doc}")
        print(f"Dense embedding: {embedding}")


def test_sparse_embedding():
    print("Testing sparse embedding...")
    for doc in example_documents:
        print(f"Document: {doc}")
        print(f"Sparse embedding: {sparse_embedding(doc)}")


def test_sparse_embedding_list():
    print("Testing sparse embedding list...")
    embeddings = sparse_embedding_list(example_documents)
    for doc, embedding in zip(example_documents, embeddings):
        print(f"Document: {doc}")
        print(f"Sparse embedding: {embedding}")


# Test connection to Qdrant
def test_qdrant_connection():
    try:
        # Check connection
        qdrant_client.get_collections()
        print("Qdrant connection is successful.")
    except Exception as e:
        print(f"Error connecting to Qdrant: {e}")


# Test adding data to Qdrant
def test_insert_documents():
    embedding = dense_embedding(example_documents[2])
    add_embedding(
        constants.TEST_COLLECTION_NAME,
        embedding,
        example_documents[2],
        {"doc_info": "Example document info"},
    )


# Test retrieving all data from Qdrant
def test_get_all_embedding():
    print(f"All embeddings in {constants.TEST_COLLECTION_NAME}:")
    data = get_all_embedding(constants.TEST_COLLECTION_NAME)
    print(data)


# Test similarity search
def test_similarity_search():
    query = "How to request vacation leave?"
    query_vector = dense_embedding(query)
    search_result = similarity_search(constants.TEST_COLLECTION_NAME, query_vector)
    print(f"Similarity search result for query : {query}")
    print("Search results:")
    print(search_result)


# Test content generation from LLM
# Using Google Gemini
def test_llm():
    query = "Who is Albert Einstein?"
    response = gemini_generate_content(query)
    print(f"LLM response for query '{query}': {response}")


def test_llm_with_model():
    query = "Who is Albert Einstein and His/Her characteristics?"
    response = gemini_generate_content(query, response_schema=ExampleModel)

    example_parser = ExampleModel.model_validate_json(response)
    print(f"Name: {example_parser.name}")
    print(f"Answer: {example_parser.answer}")
    print(f"Characteristics: {example_parser.characteristics}")


def main():
    print("Running main.py...")
    # Uncomment the tests you want to run
    # test_dense_embedding()
    # test_dense_embedding_list()
    # test_sparse_embedding()
    # test_sparse_embedding_list()

    # test_qdrant_connection()
    # test_insert_documents()
    # test_get_all_embedding()
    # test_similarity_search()
    test_llm_with_model()


if __name__ == "__main__":
    main()

At this point we can:

Call Google Gemini from Python
Send a question and receive an answer structured the way we specify
Use Pydantic to control the schema of the result flexibly

📌 Overview so far

✅ A system that creates embeddings and stores them in Qdrant

✅ Search with Similarity Search

✅ An LLM layer connected to Google Gemini

✅ The ability to convert LLM output into a structured form through Pydantic

🚀 Part 2: Using Doc2Query to boost search, and evaluating the results with LLM-as-a-Judge

Creating test data to store in the Vector Database

We will split the data into 3 collections for testing:

1. Original text embeddings

2. Original + Expanded text embeddings

3. Only Expanded text embeddings

🔹 Step 1: Preparing the data

We will have the LLM (Google Gemini) generate all of the data, with the target set at 100 chunks. (The SUBTOPICS list in the script below has 5 entries at 10 documents each, so a run as shown produces 50; add more subtopics to reach the target.)

Each batch covers a different subtopic, such as Leave Policy or Benefits.

💡 In real use, you can pull text from company files such as .pdf, .ppt, or .csv, as long as the text can be extracted.

To stay within the rate limit of the Gemini Free Tier, we generate 10 documents per subtopic in each request (if you are on a paid plan, you can generate everything in a single request).
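
A fixed pause between batches works when the request volume is predictable. As a complementary sketch (not part of the original project), a small retry helper with exponential backoff can absorb an occasional rate-limit error. The names `call_with_backoff` and `flaky` and all parameters are hypothetical, chosen for illustration; a real Gemini client raises its own exception type, which you would pass as `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky() -> str:
        # Hypothetical stand-in for an LLM request that fails twice, then succeeds
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("rate limited")
        return "ok"

    print(call_with_backoff(flaky, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.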

First, we add a file_utils module for writing and reading files, so it can be reused in the later steps.

⚙️ Example File Utility functions

File: src/utils/file_utils.py

import json
import os
import time


def write_file(dir_path, file_name, content) -> str:
    # Create the target directory if needed; mode "w" overwrites any existing file
    os.makedirs(dir_path, exist_ok=True)
    file_path = os.path.join(dir_path, file_name)
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(content)
    print(f"Content written to {file_path}")
    return file_path


def read_jsonl_file(file_path: str) -> list:
    # One JSON object per line; skip blank lines such as a trailing newline
    with open(file_path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


def read_json_file(file_path: str) -> dict:
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)


def get_unique_name_for_file_name(file_name: str) -> str:
    # Prefix with the current Unix timestamp (seconds) to avoid name clashes
    return f"{int(time.time())}_{file_name}"
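
As a usage sketch, the round trip below writes a couple of records as JSONL and reads them back. It declares its own minimal versions of `write_file` and `read_jsonl_file` so that it runs standalone, and it writes to a temporary directory. The `round_trip` helper and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def write_file(dir_path, file_name, content) -> str:
    os.makedirs(dir_path, exist_ok=True)
    file_path = os.path.join(dir_path, file_name)
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(content)
    return file_path


def read_jsonl_file(file_path: str) -> list:
    with open(file_path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


def round_trip(records: list) -> list:
    # JSONL: one JSON object per line, joined with newlines
    jsonl_content = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = write_file(tmp_dir, "sample.jsonl", jsonl_content)
        return read_jsonl_file(path)


if __name__ == "__main__":
    sample = [
        {"title": "Leave Policy", "original_text": "Employees accrue leave monthly."},
        {"title": "Benefits", "original_text": "Health insurance is provided."},
    ]
    print(round_trip(sample) == sample)
```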

Data Model: the structure of the data the LLM generates

File: src/models/generated_document.py

from pydantic import BaseModel


class GeneratedDocument(BaseModel):
    title: str
    original_text: str

🤖🚀 Generating the test data set

We will use Google Gemini to generate the documents as .jsonl files (JSON Lines: one JSON object per line)

File: src/executions/generate_data_set.py

import json
import time
from pathlib import Path

from src.generation.google_gemini_ai import gemini_generate_content
from src.models.generated_document import GeneratedDocument
from src.utils.file_utils import write_file


OUTPUT_DIR = Path("data/data_set")
BATCH_SIZE = 10
RATE_LIMIT_DELAY = 10


SUBTOPICS = [
    "Leave Policy",
    "Salary & Compensation",
    "Benefits",
    "Performance Review",
    "Health and Safety",
]

FILE_NAME_MAPPING = {
    "Leave Policy": "leave_policy",
    "Salary & Compensation": "salary_compensation",
    "Benefits": "benefits",
    "Performance Review": "performance_review",
    "Health and Safety": "health_safety",
}


def generate_data_set():
    for subtopic in SUBTOPICS:
        prompt = f"""
You are a helpful assistant generating realistic internal knowledge base documents for testing an AI document retrieval system.

Topic: HR Knowledge Base 

Objective:
Generate {BATCH_SIZE} unique documents that resemble internal HR or company knowledge base entries.

Instructions:
Each document must include:
 • A short and clear title
 • An original text field (10-15 sentences), written in a professional, informative tone
 • Content should appear as if it were published on a company intranet or internal wiki

Coverage (Subtopics to include):
 • {subtopic}

Constraints:
 • Do not repeat or reuse exact sentences between documents
 • Vary wording, phrasing, and examples across documents
 • Tone should be clear, helpful, and consistent with corporate internal communication
 • Do not generate queries — only documents
"""

        response = gemini_generate_content(
            prompt,
            response_schema=list[GeneratedDocument],
        )

        # Parse the JSON array returned by Gemini and validate each item
        json_data = json.loads(response)
        documents = [GeneratedDocument(**item) for item in json_data]

        jsonl_content = "\n".join(doc.model_dump_json() for doc in documents)

        # Write the JSONL content to a file
        write_file(OUTPUT_DIR, f"{FILE_NAME_MAPPING[subtopic]}.jsonl", jsonl_content)

        # Pause between requests to avoid hitting the rate limit
        time.sleep(RATE_LIMIT_DELAY)


if __name__ == "__main__":
    # Generate the data set
    generate_data_set()

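This gives us one .jsonl file per subtopic under data/data_set (leave_policy.jsonl, salary_compensation.jsonl, benefits.jsonl, performance_review.jsonl, and health_safety.jsonl).

Each file contains the 10 chunks we asked the LLM to create, and looks roughly like this.

Example from benefits.jsonl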

{"title":"Company Benefits Package Overview","original_text":"This document outlines the company's comprehensive benefits package. We are committed to providing our employees with competitive benefits to support their health, wellness, and financial security. Our health insurance plan includes options for medical, dental, and vision coverage. Employees can choose from a variety of plans to best fit their needs and budgets. We also offer a generous paid time off policy, including vacation, sick leave, and holidays. Our retirement savings plan includes a company match to help employees save for their future. Life insurance is also provided as a valuable employee benefit. Short-term and long-term disability coverage is available to help protect employees' income in the event of illness or injury. Employees also have access to employee assistance programs (EAP) to support their overall well-being. We regularly review and update our benefits package to ensure it remains competitive and relevant. To access detailed information and plan documents, please visit the benefits portal on the company intranet."}
{"title":"Understanding Your Health Insurance Options","original_text":"Understanding your health insurance options is crucial. We offer a range of plans to cater to different needs and budgets. The plans vary in premium costs and co-pays. It's essential to carefully review each plan's details before making your selection during open enrollment. Consider factors like your current health status, anticipated healthcare needs, and your budget. You can access detailed plan information, including provider networks, formularies, and cost estimates, through the online benefits portal. Health savings accounts (HSAs) are also available for employees enrolled in high-deductible plans. If you require assistance in understanding the plans, our HR department is available to help. Remember to review your coverage regularly for changes and updates. Taking advantage of preventative care under your health insurance can help you save money in the long run. We encourage you to make informed decisions regarding your healthcare benefits."}
{"title":"Paid Time Off (PTO) Policy","original_text":"The company provides a generous paid time off (PTO) policy. This policy allows employees to take time off for vacation, personal needs, and sick leave. The amount of PTO accrues based on years of service with the company. You can access your PTO balance and submit leave requests through our online HR system. Ensure you submit requests in advance to allow for appropriate scheduling and coverage. For extended periods of leave, such as parental leave, we have specific policies and forms available. If you require assistance in managing your PTO, please contact the HR team. Proper planning and management of your PTO is important for work-life balance. Remember to check the company calendar for holidays that are automatically included as paid time off. Our PTO policy provides flexibility while maintaining business operations."}
{"title":"Retirement Savings Plan","original_text":"Our retirement savings plan offers a significant company match. This match contributes a percentage of your contributions to help grow your retirement savings. The match rate varies based on several factors. Consult the plan document for the most up-to-date details. Regular contributions are key to maximizing your savings and the company match. You can choose from various investment options within the plan to align with your risk tolerance and goals. Our financial advisor is available to provide guidance and answer your questions. Remember to review your investment allocation periodically to ensure it aligns with your long-term goals. We encourage you to start saving early and regularly for a secure retirement. You can access your account statements and make adjustments to your contributions online."}
{"title":"Life Insurance Benefits","original_text":"Life insurance is a valuable benefit provided to all eligible employees. This coverage offers financial protection for your dependents in the event of your passing. The amount of coverage is typically based on your salary and tenure. Details about the life insurance policy, including the beneficiary designation process, are available on the internal benefits portal. You can update your beneficiary information online through the self-service portal. It's important to review and update this information periodically to ensure it reflects your current wishes. Contact HR if you require assistance or have any questions. Your family's financial well-being is important, and this benefit is designed to provide peace of mind."}
{"title":"Disability Benefits","original_text":"Short-term and long-term disability benefits help protect your income. Short-term disability covers temporary illnesses or injuries that prevent you from working. Long-term disability provides income replacement for longer-term conditions. Eligibility criteria and waiting periods apply. Information about the policies and claim procedures is available online and through HR. It is crucial to understand the differences between these two types of disability coverage. These benefits are designed to provide financial security during periods of unexpected illness or injury. We encourage you to familiarize yourself with the policy details and know how to file a claim. Our HR department is here to support you through the claims process."}
{"title":"Employee Assistance Program (EAP)","original_text":"The company provides access to Employee Assistance Programs (EAPs). EAPs offer confidential support for employees and their families. These services are available to address various challenges, including stress, mental health concerns, and work-life balance issues. The EAP provides counseling, resources, and referrals to help employees cope with personal and professional challenges. Access to EAP services is available 24/7, providing convenient support whenever needed. Contact information for the EAP is available on the company intranet and HR materials. Utilizing these services is a sign of strength and proactive self-care. Confidential services are available at no cost to you. Prioritize your well-being and take advantage of these helpful resources."}
{"title":"Flexible Work Arrangements","original_text":"Our flexible work arrangements policy offers various options. Employees can discuss flexible work arrangements with their managers. These options might include flexible work schedules, remote work, or compressed workweeks. Eligibility and approval will depend on the specific role and business needs. The goal is to provide a work environment that supports work-life balance and employee well-being. Open communication with your manager is key to exploring these possibilities. Maintaining productivity and meeting deadlines are essential components of any flexible work arrangement. These arrangements are reviewed and assessed regularly. We encourage employees to discuss these options during performance reviews or as needed."}
{"title":"Parental Leave Policy","original_text":"This document summarizes our parental leave policy. Eligible employees can take paid time off for parental leave. The duration of the leave is outlined in the policy document. This policy covers both mothers and eligible fathers or partners. The policy is designed to support employees as they welcome a new child into their family. We believe that providing paid parental leave is essential for employee well-being. Contact HR for detailed information, eligibility requirements, and application procedures. Providing necessary documentation is essential for a timely processing of your request. We encourage you to plan ahead and submit your request well in advance. Our goal is to make this process as smooth as possible for our employees."}
{"title":"Bereavement Leave Policy","original_text":"This document explains our bereavement leave policy. Employees are eligible for paid time off for bereavement. The amount of paid time off is determined by the specific circumstances. Our goal is to provide support during times of grief and loss. Contact HR to discuss the details and procedures for bereavement leave. Providing necessary documentation may be required for processing. We offer our deepest sympathy to those who have experienced a loss. We are committed to supporting our employees through challenging times. We aim to be sensitive and understanding in these situations."}

📌 Summary

We have set up a system that generates test data with an LLM and stores it as .jsonl files split by subtopic, ready for the next steps: creating embeddings and adding them to the Vector Database.

🔹 Step 2: Embedding the data and storing it in the Vector Database

Adding data, type 1: Original Text Embedding

In this step we read the prepared data set, create an embedding from the original text, and then store the data in the Qdrant Vector Database.

⚙️ Example function for adding data as original text embeddings

File: src/executions/insert_documents.py

import os
import time
import json
from pathlib import Path

from src.embeddings.text_embeddings import dense_embedding, sparse_embedding
from src.retrieval.qdrant_store import add_embedding
from src.models.generated_document import (
    GeneratedDocument,
    ExpandedDocument,
    QueryDocument,
)
from src.generation.google_gemini_ai import gemini_generate_content
from src.utils.file_utils import write_file, read_jsonl_file

import src.constants.constants as constants


DATASET_DIR = Path("data/data_set")
EXPANDED_DATA_DIR = Path("data/expanded")
EXPANDED_FILE_NAME = "expanded_documents.jsonl"
MAX_QUERY_PER_DOCUMENT = 5
RATE_LIMIT_DELAY = 5


def load_and_combine_documents():
    files = [
        f
        for f in os.listdir(DATASET_DIR)
        if os.path.isfile(os.path.join(DATASET_DIR, f)) and f.endswith(".jsonl")
    ]

    documents = []
    for file in files:
        with open(os.path.join(DATASET_DIR, file), "r") as f:
            for line in f:
                documents.append(line.strip())
    return documents

def insert_original_documents():
    collection_name = constants.ORIGINAL_TEXT_COLLECTION_NAME

    documents = load_and_combine_documents()

    for idx, document in enumerate(documents):
        doc = GeneratedDocument.model_validate_json(document)

        embedding_vector = dense_embedding(doc.original_text)
        metadata = {
            "title": doc.title,
        }

        add_embedding(
            collection_name=collection_name,
            vector=embedding_vector,
            original_text=doc.original_text,
            expanded_text="",
            metadata=metadata,
        )

        print(
            f"Inserted document into collection: {collection_name} : Progress {idx+1}/{len(documents)}"
        )

def insert_original_and_expanded_documents():
    print("TODO")


def insert_expanded_documents():
    print("TODO")

🧪 Combining the function calls in main.py

File: main.py

from src.executions.insert_documents import (
    insert_original_documents,
    insert_original_and_expanded_documents,
    insert_expanded_documents,
)


def main():
    options = {
        "0": "Check your options",
        "1": "Insert original text into Qdrant",
        "2": "Insert original + expanded text into Qdrant",
        "3": "Insert only expanded text into Qdrant",
        "99": "Exit",
    }

    for key, value in options.items():
        print(f"{key}: {value}")

    while True:
        choice = input("Enter your choice (0: Check your options): ")

        if choice == "0":
            for key, value in options.items():
                print(f"{key}: {value}")
        elif choice == "1":
            print("Inserting original text into Qdrant...")
            # Call the function to insert original text
            insert_original_documents()
        elif choice == "2":
            print("Inserting original + expanded text into Qdrant...")
            # Call the function to insert original + expanded text
            insert_original_and_expanded_documents()
        elif choice == "3":
            print("Inserting only expanded text into Qdrant...")
            # Call the function to insert only expanded text
            insert_expanded_documents()
        elif choice == "99":
            print("Exiting...")
            break
        else:
            print("Invalid choice. Please try again.")


if __name__ == "__main__":
    main()

🧪 Testing the original text insert

▶️ Run the command

python main.py

Choose menu option 1 to add the data as original text

Wait a moment for the program to finish processing…

🔗 Open a browser and go to:

http://localhost:6333/dashboard#/collections

📌 Summary:

We have added the original text data to Qdrant by creating embeddings and storing them in the designated collection. The next steps build on this with Document Expansion and the other embedding variants.

Adding data, type 2: Original + Expanded Text Embedding

In this step we add data to Qdrant using both the original text and the expanded text produced by the Document Expansion by Query Prediction technique (Doc2Query for short).

If you are not yet familiar with this idea, you can go back to the previous part.

Here we have the LLM create Query Predictions for each document and store the results in a new format made up of:

title
original_text
expanded_text (the queries the LLM generated, joined together)

We will use this file for both the type 2 and type 3 inserts, so that both use the same set of queries.
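
The record format above can be sketched with plain dataclasses. This is an illustration only: the project itself uses Pydantic models, and `ExpandedRecord`, `build_expanded_record`, `to_jsonl_line`, and the sample queries are hypothetical. The sketch joins the predicted queries with ", " (the separator used in the article's own code) and serializes the record as one JSONL line.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExpandedRecord:
    title: str
    original_text: str
    expanded_text: str


def build_expanded_record(title: str, original_text: str, queries: list) -> ExpandedRecord:
    # Drop empty queries and join the rest into a single expanded_text string
    cleaned = [q.strip() for q in queries if q and q.strip()]
    return ExpandedRecord(
        title=title,
        original_text=original_text,
        expanded_text=", ".join(cleaned),
    )


def to_jsonl_line(record: ExpandedRecord) -> str:
    # One record becomes one line of the .jsonl file
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    record = build_expanded_record(
        "Paid Time Off (PTO) Policy",
        "The company provides a generous paid time off (PTO) policy.",
        ["how do I check my PTO balance", "how to request vacation days"],
    )
    print(to_jsonl_line(record))
```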

⚙️ Example code for storing Original + Expanded Text data

1. Prepare the models for Document Expansion

File: src/models/generated_document.py

from pydantic import BaseModel


class GeneratedDocument(BaseModel):
    title: str
    original_text: str


class QueryDocument(BaseModel):
    query: str


class ExpandedDocument(GeneratedDocument):
    expanded_text: str

2. Add the generate_expanded_documents() function

File: src/executions/insert_documents.py
Add the generate_expanded_documents function. Inside it, we first check whether the expanded file has already been created. If it has not, we call the LLM to create the Query Predictions and write the results to a file for later use.

import os
import time
import json
from pathlib import Path

from src.embeddings.text_embeddings import dense_embedding, sparse_embedding
from src.retrieval.qdrant_store import add_embedding
from src.models.generated_document import (
    GeneratedDocument,
    ExpandedDocument,
    QueryDocument,
)
from src.generation.google_gemini_ai import gemini_generate_content
from src.utils.file_utils import write_file, read_jsonl_file

import src.constants.constants as constants


DATASET_DIR = Path("data/data_set")
EXPANDED_DATA_DIR = Path("data/expanded")
EXPANDED_FILE_NAME = "expanded_documents.jsonl"
MAX_QUERY_PER_DOCUMENT = 5
RATE_LIMIT_DELAY = 5

# ...existing code ....

def generate_expanded_documents(documents):
    # If the expanded file already exists, load it
    expanded_file_path = f"{EXPANDED_DATA_DIR}/{EXPANDED_FILE_NAME}"
    if os.path.exists(expanded_file_path):
        data = read_jsonl_file(expanded_file_path)
        expanded_documents = [ExpandedDocument(**item) for item in data]
        return expanded_documents

    # If the expanded file does not exist, generate it via Gemini
    expanded_documents = []
    for document in documents:
        doc = GeneratedDocument.model_validate_json(document)

        prompt = f"""Generate query prediction for {MAX_QUERY_PER_DOCUMENT} queries (Doc2Query technique) for the document: {document}

        When answering
        - Do not include any other text
"""
        response = gemini_generate_content(prompt, response_schema=list[QueryDocument])

        queries: list[QueryDocument] = []
        json_data = json.loads(response)
        queries = [QueryDocument(**item) for item in json_data]

        query_texts = [query.query for query in queries]
        expanded_documents.append(
            ExpandedDocument(
                original_text=doc.original_text,
                title=doc.title,
                expanded_text=", ".join(query_texts),
            )
        )

        print(f"Progressing: {len(expanded_documents)}/{len(documents)}")
        time.sleep(RATE_LIMIT_DELAY)

    jsonl_content = "\n".join(doc.model_dump_json() for doc in expanded_documents)

    # Save the expanded documents to a file
    write_file(EXPANDED_DATA_DIR, EXPANDED_FILE_NAME, jsonl_content)

    return expanded_documents
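
The flow in generate_expanded_documents (load the file if it is present, otherwise generate and save) is a general caching pattern. Below is a minimal standalone sketch of it, where `load_or_generate` and `fake_generate` are hypothetical names and the expensive LLM call is replaced by a local function.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_or_generate(file_path: str, generate: Callable[[], List[dict]]) -> List[dict]:
    # Reuse the cached JSONL file when it exists
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as file:
            return [json.loads(line) for line in file if line.strip()]

    # Otherwise run the expensive step once and cache its result
    records = generate()
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as file:
        file.write("\n".join(json.dumps(r, ensure_ascii=False) for r in records))
    return records


if __name__ == "__main__":
    calls = []

    def fake_generate() -> List[dict]:
        # Stand-in for the LLM call; records how many times it runs
        calls.append(1)
        return [{"title": "Benefits", "expanded_text": "what benefits do I get"}]

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "expanded", "expanded_documents.jsonl")
        first = load_or_generate(path, fake_generate)
        second = load_or_generate(path, fake_generate)
        print(first == second, len(calls))
```

One consequence of this pattern: if the source documents change, the cached file has to be deleted by hand, otherwise the old queries are reused.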

3. Update the insert_original_and_expanded_documents() function

File: src/executions/insert_documents.py

# ... existing code ...

def insert_original_and_expanded_documents():
    collection_name = constants.ORIGINAL_TEXT_AND_EXPANDED_COLLECTION_NAME

    documents = load_and_combine_documents()

    expanded_documents = generate_expanded_documents(documents)

    for idx, document in enumerate(expanded_documents):
        text = f"{document.original_text} {document.expanded_text}"
        embedding_vector = dense_embedding(text)
        metadata = {
            "title": document.title,
        }

        add_embedding(
            collection_name=collection_name,
            vector=embedding_vector,
            original_text=document.original_text,
            expanded_text=document.expanded_text,
            metadata=metadata,
        )

        print(
            f"Inserted document into collection: {collection_name} : Progress {idx+1}/{len(expanded_documents)}"
        )

🧪 Testing the call from main.py

▶️ Run the command

python main.py

Choose menu option 2: Insert original + expanded text into Qdrant

Wait a moment while the system generates the data and adds it to Qdrant…

🔗 Open a browser and go to:

http://localhost:6333/dashboard#/collections

You will see a new collection that stores both the original and the expanded text.

📌 Summary:

We performed Document Expansion by having the LLM create Query Predictions for each document, then combined them with the original text to create the embedding before storing it in Qdrant.

This step is central to comparing how much Document Expansion improves retrieval in a RAG pipeline.

Adding data, type 3: Only Expanded Text Embedding

By now we have completed Document Expansion and have a file containing original_text, title, and expanded_text (from the previous step).

This time we add data to Qdrant using only the expanded text to create the embedding.

This lets us test how much the Query Predictions alone, with no original context at all, affect the system's performance.

⚙️ Example code for storing Expanded Text data

Update the insert_expanded_documents() function

File: src/executions/insert_documents.py

# ... existing code ...

def insert_expanded_documents():
    collection_name = constants.ONLY_EXPANDED_COLLECTION_NAME

    documents = load_and_combine_documents()

    expanded_documents = generate_expanded_documents(documents)

    for idx, document in enumerate(expanded_documents):
        embedding_vector = dense_embedding(document.expanded_text)
        metadata = {
            "title": document.title,
        }

        add_embedding(
            collection_name=collection_name,
            vector=embedding_vector,
            original_text=document.original_text,
            expanded_text=document.expanded_text,
            metadata=metadata,
        )

        print(
            f"Inserted document into collection: {collection_name} : Progress {idx+1}/{len(expanded_documents)}"
        )

🧪 Testing the call from main.py

▶️ Run the command

python main.py

Choose menu option 3: Insert only expanded text into Qdrant

Wait a moment while the system generates the data and adds it to Qdrant…

🔗 Open a browser and go to:

http://localhost:6333/dashboard#/collections

You will see a new collection named only_expanded_text.

Click into it and you will find that every stored document has:

A vector built from expanded_text
A payload that still keeps original_text and expanded_text

🔍 Key point to note

Although the embedding is built from the expanded text alone, we still store the original text as well.

After the system finds documents by vector, we still need the original text to display results or to pass to the LLM for summarizing afterwards.
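
To make this concrete, here is a toy in-memory sketch, not Qdrant itself. The vectors are hand-written stand-ins for embeddings of expanded_text, and `search` and `POINTS` are hypothetical names. The point is only that ranking uses the vector, while the caller reads original_text from the payload.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(points: List[Dict], query_vector: List[float], limit: int = 1) -> List[Dict]:
    # Rank by similarity to the stored vector, then return only the payloads
    ranked = sorted(
        points,
        key=lambda p: cosine_similarity(p["vector"], query_vector),
        reverse=True,
    )
    return [p["payload"] for p in ranked[:limit]]


# Toy points: the vector stands in for an embedding of expanded_text,
# while the payload keeps the text we actually want to show or send to the LLM.
POINTS = [
    {
        "vector": [0.9, 0.1, 0.0],
        "payload": {
            "original_text": "The company provides a generous paid time off (PTO) policy.",
            "expanded_text": "how do I request vacation days",
        },
    },
    {
        "vector": [0.0, 0.2, 0.9],
        "payload": {
            "original_text": "Our retirement savings plan offers a significant company match.",
            "expanded_text": "does the company match retirement contributions",
        },
    },
]

if __name__ == "__main__":
    top = search(POINTS, query_vector=[1.0, 0.0, 0.0])[0]
    print(top["original_text"])
```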

📌 Summary of the data state

We have now prepared the test data for RAG in all 3 main variants and stored them in Qdrant:

1. Original Text

Uses only original_text to create the embedding
Serves as the baseline with no document expansion

2. Original + Expanded Text

Combines original_text with expanded_text before creating the embedding
Uses the Doc2Query approach to help increase recall and semantic coverage

3. Only Expanded Text

Uses only expanded_text (Query Prediction) to create the embedding
Tests how well the predicted queries alone work in place of the document

💡 All 3 variants keep original_text and expanded_text in the stored payload (expanded_text is an empty string in the Original Text collection), so they can be displayed later or summarized by the LLM in the Evaluation step.
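
The three variants differ only in which text is passed to the embedding function. Here is a compact sketch of that choice, where `embedding_input` and the mode names are hypothetical labels for illustration (the article's code uses three separate insert functions instead).

```python
def embedding_input(original_text: str, expanded_text: str, mode: str) -> str:
    """Return the text that would be embedded for a given collection variant."""
    if mode == "original":
        return original_text
    if mode == "original_and_expanded":
        # Same concatenation as insert_original_and_expanded_documents()
        return f"{original_text} {expanded_text}"
    if mode == "only_expanded":
        return expanded_text
    raise ValueError(f"Unknown mode: {mode}")


if __name__ == "__main__":
    original = "Eligible employees can take paid time off for parental leave."
    expanded = "how long is parental leave, who qualifies for parental leave"
    for mode in ("original", "original_and_expanded", "only_expanded"):
        print(f"{mode}: {embedding_input(original, expanded, mode)}")
```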

Testing Similarity Search across all 3 text embedding variants

🔍 Preparing questions to test Similarity Search

Before testing the performance of the search system (Similarity Search), we need a set of questions for measuring how well our vector database finds the content that matches each question.

Our approach is to have the LLM generate questions from the documents we have, producing about 2 questions per document (TOTAL_QUESTIONS in the code below).

⚙️ Example code for generating the question set with an LLM

File: src/executions/generate_question.py

import json
import time
import os
from pathlib import Path

from src.generation.google_gemini_ai import gemini_generate_content
from src.models.generated_document import GeneratedDocument, QuestionDocument
from src.utils.file_utils import read_jsonl_file, write_file
from src.utils.document_utils import load_and_combine_jsonl_documents


DATASET_DIR = Path("data/data_set")
OUTPUT_DIR = Path("data/question")
QUESTION_FILE_NAME = "questions.jsonl"
TOTAL_QUESTIONS = 2
RATE_LIMIT_DELAY = 10


def generate_questions():
    # If file is present, return the file
    questions_file_path = f"{OUTPUT_DIR}/{QUESTION_FILE_NAME}"
    if os.path.exists(questions_file_path):
        data = read_jsonl_file(questions_file_path)
        question_documents = [QuestionDocument(**item) for item in data]
        return question_documents

    # Load and combine documents from the dataset directory
    original_documents = load_and_combine_jsonl_documents(DATASET_DIR)

    question_documents = []
    # Generate questions for each document
    for idx, document in enumerate(original_documents):
        doc = GeneratedDocument.model_validate_json(document)
        prompt = f"""Given the following document, generate {TOTAL_QUESTIONS} high-quality search queries that users might use to find this information. The document content is: {doc.original_text}

        Generate diverse questions following these requirements:
        1. Semantic Variation: Create questions that use different words/synonyms than those in the original text, but mean the same thing
        2. Question Types:
           - Direct questions about specific facts
           - Questions about relationships between concepts
           - Questions that require understanding the context
           - Questions using alternative terminology
        3. Complexity Levels:
           - Include both simple and complex queries
           - Some questions should combine multiple aspects from the document
           - Some questions should test semantic understanding rather than keyword matching
        4. Natural Language Patterns:
           - Use natural language as real users would ask
           - Vary question formats (what, how, why, when, etc.)
           - Include both formal and conversational writing styles
        5. Search Relevance:
           - Questions should have clear answers in the document
           - Avoid questions that are too generic or could apply to any document
           - Include edge cases that test the search system's semantic understanding

        Format each question as a QuestionDocument object with appropriate metadata.
        """

        response = gemini_generate_content(
            prompt,
            response_schema=list[QuestionDocument],
        )

        documents: list[QuestionDocument] = []
        json_data = json.loads(response)
        documents = [QuestionDocument(**item) for item in json_data]

        question_documents.extend(documents)

        print(f"Progressing: {idx + 1}/{len(original_documents)}")
        time.sleep(
            RATE_LIMIT_DELAY
        )  #  Sleep for {RATE_LIMIT_DELAY} second to avoid hitting the rate limit

    jsonl_content = "\n".join([doc.model_dump_json() for doc in question_documents])

    write_file(OUTPUT_DIR, QUESTION_FILE_NAME, jsonl_content)

    return question_documents

Update main.py to call the question generator

File: main.py
Add option 4 to call the generate_questions function

from src.executions.insert_documents import (
    insert_original_documents,
    insert_original_and_expanded_documents,
    insert_expanded_documents,
)

from src.executions.generate_question import generate_questions


def main():
    options = {
        "0": "Check your options",
        "1": "Insert original text into Qdrant",
        "2": "Insert original + expanded text into Qdrant",
        "3": "Insert only expanded text into Qdrant",
        "4": "Generate questions",
        "99": "Exit",
    }

    for key, value in options.items():
        print(f"{key}: {value}")

    while True:
        choice = input("Enter your choice (0: Check your options): ")

        if choice == "0":
            for key, value in options.items():
                print(f"{key}: {value}")
        elif choice == "1":
            print("Inserting original text into Qdrant...")
            # Call the function to insert original text
            insert_original_documents()
        elif choice == "2":
            print("Inserting original + expanded text into Qdrant...")
            # Call the function to insert original + expanded text
            insert_original_and_expanded_documents()
        elif choice == "3":
            print("Inserting only expanded text into Qdrant...")
            # Call the function to insert only expanded text
            insert_expanded_documents()
        elif choice == "4":
            print("Generating questions...")
            # Call the function to generate questions
            generate_questions()
        elif choice == "99":
            print("Exiting...")
            break
        else:
            print("Invalid choice. Please try again.")


if __name__ == "__main__":
    main()

🧪 Test it through main.py

▶️ Run the command

python main.py

Then choose 4 to start generating questions and wait for the process to finish.

📈 Sample output

When the function finishes, a file named questions.jsonl will be in the data/question/ folder.

Sample of the file contents:

{"question":"What is the typical frequency of performance reviews, and are mid-year check-ins allowed?"}
{"question":"How should employees prepare for their performance reviews to effectively showcase their achievements?"}

..100...
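To sanity-check the generated file before moving on, a small reader like the one below can help. This is a minimal sketch using only the standard library. The helper name load_questions is an illustrative assumption and not part of the repository, and it assumes each line holds a JSON object with a question field, as in the sample above.

```python
import json


def load_questions(jsonl_text: str) -> list[str]:
    """Parse JSONL text and return the non-empty `question` values."""
    questions = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        question = record.get("question", "").strip()
        if not question:
            raise ValueError(f"line {line_number} has no question")
        questions.append(question)
    return questions


# Input copied from the sample output above
sample = "\n".join([
    '{"question":"What is the typical frequency of performance reviews, and are mid-year check-ins allowed?"}',
    '{"question":"How should employees prepare for their performance reviews to effectively showcase their achievements?"}',
])

questions = load_questions(sample)
print(len(questions))
```

In a real run, the text would come from reading data/question/questions.jsonl instead of the inline sample.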

🎯 Next goal

Once we have all the questions, the next goals are:

Test the 🔎 Similarity Search function with these questions

Use an LLM to judge whether each retrieved document matches the question, and at which rank it appears

This gives us one way to measure the retrieval stage of the RAG pipeline without relying on human evaluators

🚀 Steps for evaluating Similarity Search across all 3 embedding variants

In this experiment we measure how well Similarity Search retrieves documents on three kinds of embeddings:

1. Original Text

2. Original + Expanded Text

3. Expanded Only Text

📊 The evaluation consists of the following steps

1. Load the prepared questions

We read the questions from the previously generated set one at a time.

2. Run Similarity Search

For each question, we build a vector with the dense_embedding function and search Qdrant with limit = 10 to fetch the 10 most similar documents.

3. Judge the results with an LLM

The retrieved documents are formatted so the LLM can read them easily, with each document's rank clearly labeled, and then sent to the LLM to decide which documents are relevant to the question.

4. Compute MRR (Mean Reciprocal Rank)

We look at the rank at which the first relevant document appears in the results, which measures search accuracy for each embedding variant.

5. Repeat for every collection

These steps are repeated for all three document types, and the results are saved as .jsonl files for comparative analysis in the next stage.
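The steps above can be sketched for a single question as follows. This is only an outline under stated assumptions: evaluate_query and the three fake_* stand-ins are hypothetical names used so the example runs without Qdrant or an LLM, and they are not the functions used in the repository.

```python
from typing import Callable


def evaluate_query(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[str]],
    judge: Callable[[str, list[str]], list[bool]],
    limit: int = 10,
) -> float:
    """Return the reciprocal rank of the first relevant document (0.0 if none)."""
    query_vector = embed(question)            # step 2: embed the question
    documents = search(query_vector, limit)   # step 2: fetch the top `limit` documents
    verdicts = judge(question, documents)     # step 3: one verdict per document
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            return 1.0 / rank                 # step 4: reciprocal rank
    return 0.0


# Stand-ins so the sketch runs without a vector store or an LLM
def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


def fake_search(vector: list[float], limit: int) -> list[str]:
    return ["leave policy", "review schedule", "dress code"][:limit]


def fake_judge(question: str, documents: list[str]) -> list[bool]:
    return ["review" in document for document in documents]


score = evaluate_query("How often are reviews held?", fake_embed, fake_search, fake_judge)
print(score)
```

Averaging this score over every question gives the MRR for one collection, which is step 5 repeated per embedding variant.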

🤖 Preparing the evaluation code

⚙️ Sample code for the evaluation

1. Add models for the data structures used in the evaluation

File: src/models/mrr_document.py

from pydantic import BaseModel
from enum import Enum


class Relevance(Enum):
    RELEVANT = "relevant"
    NOT_RELEVANT = "not_relevant"
    AMBIGUOUS = "ambiguous"


class MRRResult(BaseModel):
    document_number: int
    relevance: Relevance


class MRRDataset(BaseModel):
    query_id: str
    retrieved: list[str]
    relevant: set[str]


class MRRDatasetInfo(BaseModel):
    dataset_name: str
    note: str
    embedding_type: str

2. Add a function that formats the retrieved documents so each one is labeled with its name or rank, letting the LLM tell which part of the context belongs to which document

File: src/utils/document_utils.py

# ... existing code ...

def format_documents_as_context(documents: list):
    formatted = []
    for i, doc in enumerate(documents):

        payload = doc.payload or dict()
        original_text = payload.get("original_text", "")

        formatted.append(
            f"--- Document {i+1} ---\n"
            f"{original_text}\n"
            f"--- End of Document {i+1} ---"
        )

    return "\n\n".join(formatted)
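To see the shape of the context the LLM receives, the function can be exercised with stand-in search hits. In this sketch the function body is repeated so the example runs on its own, SimpleNamespace objects stand in for Qdrant results (only the payload attribute is used), and the document text is made up for illustration.

```python
from types import SimpleNamespace


def format_documents_as_context(documents: list) -> str:
    # Same logic as the function above, repeated so the example is self-contained
    formatted = []
    for i, doc in enumerate(documents):
        payload = doc.payload or dict()
        original_text = payload.get("original_text", "")
        formatted.append(
            f"--- Document {i+1} ---\n"
            f"{original_text}\n"
            f"--- End of Document {i+1} ---"
        )
    return "\n\n".join(formatted)


# Stand-ins for search hits; the text is invented for this example
hits = [
    SimpleNamespace(payload={"original_text": "Performance reviews happen twice a year."}),
    SimpleNamespace(payload=None),  # a hit without a payload yields an empty body
]

context = format_documents_as_context(hits)
print(context)
```

The document number in each delimiter is the 1-based rank from the search, which is what the LLM later echoes back as document_number.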

3. Add the main function for evaluating Similarity Search

The code below is the core of the evaluation: it sends the retrieved context to the LLM, which judges relevance document by document.

File: src/executions/evaluate_similarity_search.py

import time
import json

from src.utils.document_utils import (
    load_and_combine_jsonl_documents,
    format_documents_as_context,
)
from src.generation.google_gemini_ai import gemini_generate_content
from src.retrieval.qdrant_store import similarity_search
from src.embeddings.text_embeddings import dense_embedding
from src.models.generated_document import QuestionDocument
from src.models.mrr_document import MRRResult, MRRDataset, Relevance
from src.utils.file_utils import write_file

QUESTIONS_PATH = "data/question"
MRR_RESULTS_PATH = "data/mrr_results/preprocessing"
RATE_LIMIT = 5


def evaluate_similarity_search(
    collection_name: str,
):

    questions = load_and_combine_jsonl_documents(QUESTIONS_PATH)

    mrr_results: list[MRRDataset] = []

    for idx, question in enumerate(questions):

        question_text = QuestionDocument.model_validate_json(question).question

        query_vector = dense_embedding(question_text)
        similarity_results = similarity_search(
            collection_name=collection_name, query_vector=query_vector
        )

        formatted_documents = format_documents_as_context(similarity_results)

        prompt = f"""
Based on the following query and context, please analyze and evaluate the relevance of each document. 

Evaluation instructions:
1. Examine each document in the context separately and document number is provided in the context
2. For each document, determine if it contains information relevant to answering the query
3. Provide a result for each document indicating its relevance by using the following scale:
    relevant - It is relevant to the query
    not_relevant - It is not relevant to the query
    ambiguous - It is unclear if it is relevant to the query

query: {question_text}
context: {formatted_documents}
"""
        response_list: list[MRRResult] = []
        retrieved: list[str] = []
        relevant: set[str] = set()

        response = gemini_generate_content(
            prompt,
            response_schema=list[MRRResult],
        )

        data = json.loads(response)

        parsed_response = [MRRResult(**item) for item in data]
        response_list.extend(parsed_response)

        sorted_response_list = sorted(response_list, key=lambda x: x.document_number)

        # Use a separate name so the raw LLM `response` above is not shadowed
        for judged in sorted_response_list:
            doc_name = f"doc_{judged.document_number}"
            retrieved.append(doc_name)
            if judged.relevance == Relevance.RELEVANT:
                relevant.add(doc_name)

        mrr_results.append(
            MRRDataset(
                query_id=str(idx + 1),
                retrieved=retrieved,
                relevant=relevant,
            )
        )

        print(f"Progressing {idx + 1}/{len(questions)}")
        time.sleep(RATE_LIMIT)

    jsonl_content = ""
    for item in mrr_results:
        json_str = item.model_dump_json()
        jsonl_content += json_str + "\n"

    write_file(
        MRR_RESULTS_PATH,
        f"{collection_name}_mrr_results.jsonl",
        jsonl_content,
    )
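One thing to watch in the loop above: retrieved is rebuilt from the LLM's answer, so if the judge skips a document or returns a number that was never shown, the ranks used later can shift. A defensive variant is sketched below. The names build_mrr_record and num_results are hypothetical, and plain tuples stand in for MRRResult. It derives retrieved from the number of search hits and only trusts judgments whose document number is in range.

```python
def build_mrr_record(
    query_id: str,
    num_results: int,
    judgments: list[tuple[int, str]],
) -> dict:
    """Build one record from (document_number, relevance) pairs.

    `retrieved` always lists every rank that was shown to the judge, so a
    missing or out-of-range judgment cannot shift the ranking.
    """
    retrieved = [f"doc_{rank}" for rank in range(1, num_results + 1)]
    relevant = {
        f"doc_{number}"
        for number, relevance in judgments
        if 1 <= number <= num_results and relevance == "relevant"
    }
    return {
        "query_id": query_id,
        "retrieved": retrieved,
        # Sort by rank only to make the output stable and easy to read
        "relevant": sorted(relevant, key=lambda name: int(name.split("_")[1])),
    }


# The judge skipped document 2 and invented document 7
record = build_mrr_record(
    query_id="1",
    num_results=4,
    judgments=[(1, "not_relevant"), (3, "relevant"), (7, "relevant")],
)
print(record)
```

Here the out-of-range judgment for document 7 is ignored and document 2 keeps its place at rank 2 even though the judge never mentioned it.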

📈 Sample output

{"query_id":"13","retrieved":["doc_1","doc_2",...],"relevant":["doc_1","doc_2"]}
{"query_id":"24","retrieved":["doc_1","doc_2",...],"relevant":["doc_6","doc_5","doc_9"]}

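In the first record (query 13), the LLM judged the documents at ranks 1 and 2 to be relevant. In the second (query 24), the relevant documents are at ranks 5, 6, and 9. Because relevant is stored as a set, its order in the file is arbitrary.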
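What matters for MRR is only the first relevant rank in each record. The sketch below reads one record and finds it. The record is a hypothetical complete line shaped like the samples above (the real samples are abbreviated, so they cannot be parsed as-is).

```python
import json

# Hypothetical complete record, shaped like the abbreviated samples above
line = (
    '{"query_id":"24",'
    '"retrieved":["doc_1","doc_2","doc_3","doc_4","doc_5",'
    '"doc_6","doc_7","doc_8","doc_9","doc_10"],'
    '"relevant":["doc_6","doc_5","doc_9"]}'
)

record = json.loads(line)
relevant = set(record["relevant"])  # order in the file does not matter

# Rank of the first retrieved document that the judge marked relevant
first_rank = next(
    (rank for rank, doc in enumerate(record["retrieved"], start=1) if doc in relevant),
    None,  # no relevant document at all
)
print(first_rank)
```

For this record the first relevant document sits at rank 5, so its reciprocal rank would be 1/5.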

Update main.py to call the function

File: main.py

from src.executions.insert_documents import (
    insert_original_documents,
    insert_original_and_expanded_documents,
    insert_expanded_documents,
)

from src.executions.generate_question import generate_questions
from src.executions.evaluate_similarity_search import evaluate_similarity_search

import src.constants.constants as constants


def main():
    options = {
        "0": "Check your options",
        "1": "Insert original text into Qdrant",
        "2": "Insert original + expanded text into Qdrant",
        "3": "Insert only expanded text into Qdrant",
        "4": "Generate questions",
        "5": "Evaluate similarity search",
        "99": "Exit",
    }

    collections = [
        constants.ORIGINAL_TEXT_COLLECTION_NAME,
        constants.ORIGINAL_TEXT_AND_EXPANDED_COLLECTION_NAME,
        constants.ONLY_EXPANDED_COLLECTION_NAME,
    ]

    for key, value in options.items():
        print(f"{key}: {value}")

    while True:
        choice = input("Enter your choice (0: Check your options): ")

        if choice == "0":
            for key, value in options.items():
                print(f"{key}: {value}")
        elif choice == "1":
            print("Inserting original text into Qdrant...")
            # Call the function to insert original text
            insert_original_documents()
        elif choice == "2":
            print("Inserting original + expanded text into Qdrant...")
            # Call the function to insert original + expanded text
            insert_original_and_expanded_documents()
        elif choice == "3":
            print("Inserting only expanded text into Qdrant...")
            # Call the function to insert only expanded text
            insert_expanded_documents()
        elif choice == "4":
            print("Generating questions...")
            # Call the function to generate questions
            generate_questions()
        elif choice == "5":
            print("Evaluating similarity search...")
            # Call the function to evaluate similarity search

            evaluate_similarity_search(collection_name=collections[0])
            # evaluate_similarity_search(collection_name=collections[1])
            # evaluate_similarity_search(collection_name=collections[2])
        elif choice == "99":
            print("Exiting...")
            break
        else:
            print("Invalid choice. Please try again.")


if __name__ == "__main__":
    main()

🧪 Test it through main.py

▶️ Run the command

python main.py

Then choose menu option 5 to run evaluate_similarity_search. As written, main.py evaluates only collections[0], so switch to the commented-out calls and rerun until all 3 collections are covered.

📈 Results

After running all 3 embedding variants, we get .jsonl files in the data/mrr_results/preprocessing folder, which can be used to compute MRR and compare the performance of each embedding variant.

🚀 Final step: measuring performance with MRR

Once the LLM has judged the relevance of each set of retrieved documents, the next step is to score Similarity Search for each embedding variant with a metric called MRR (Mean Reciprocal Rank) to find out which approach gives the best results.

🔁 What is MRR?

MRR (Mean Reciprocal Rank) is a metric for search systems that looks at the rank at which the first "relevant" result appears in the retrieved list. The earlier the relevant document appears (rank 1, 2, 3, ...), the higher the score. (There is an example earlier in the article that you can go back to.)
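Written as a formula, using the standard definition of the metric (the symbols Q and rank_i are notation introduced here, not names from the article's code):

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

Here Q is the set of queries and rank_i is the position of the first relevant document for query i. A query with no relevant document in the results contributes 0 to the sum.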

For finer-grained measurement, we evaluate MRR@K at the following cutoffs:

MRR@1 → considers rank 1 only
MRR@3 → considers ranks 1 to 3
MRR@5 → considers ranks 1 to 5
MRR@10 → considers ranks 1 to 10
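A small worked example shows how the cutoff changes the score. This is a standalone sketch with made-up judgments for three queries; the helper name reciprocal_rank_at_k is chosen for this example only.

```python
def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1/rank of the first relevant document within the top k, else 0.0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = [f"doc_{i}" for i in range(1, 11)]  # ranks 1..10

# Made-up judgments for three queries
queries = [
    {"doc_1"},  # first relevant hit at rank 1
    {"doc_4"},  # first relevant hit at rank 4
    set(),      # nothing relevant in the top 10
]

mrr_at_k = {
    k: sum(reciprocal_rank_at_k(retrieved, rel, k) for rel in queries) / len(queries)
    for k in (1, 3, 5, 10)
}
for k, value in mrr_at_k.items():
    print(f"MRR@{k} = {value:.4f}")
```

The second query only starts to count once K reaches 4 or more, which is why MRR@5 and MRR@10 come out higher than MRR@1 and MRR@3 in this toy data.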

🧠 Computing MRR from the dataset

The judgments stored in data/mrr_results/preprocessing/ for each embedding variant are passed to a function that computes the overall MRR.

⚙️ Sample code: building an MRR evaluator

File: src/evaluation/mrr.py
We create a class MRREvaluation for this calculation, made up of 2 main functions:

from typing import List, Dict, Set
import pandas as pd

from src.models.mrr_document import MRRDataset, MRRDatasetInfo

# ---------- CONFIGURABLE PARAMETERS ----------
TOP_Ks = [1, 3, 5, 10]


class MRREvaluation:
    def __init__(self):
        pass

    # ---------- FUNCTION TO CALCULATE MRR@K ----------
    def reciprocal_rank(
        self, retrieved: List[str], relevant: Set[str], k: int
    ) -> float:
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    # ---------- MAIN FUNCTION FOR DATASET ----------
    def evaluate_dataset(
        self,
        dataset_information: MRRDatasetInfo,
        queries: List[MRRDataset],
        total_queries: int,
    ) -> tuple[pd.DataFrame, Dict]:
        results = []

        for q in queries:
            row = {"query_id": q.query_id}

            for k in TOP_Ks:
                row[f"MRR@{k}"] = self.reciprocal_rank(q.retrieved, q.relevant, k)

            results.append(row)

        df = pd.DataFrame(results)
        summary = {
            "Dataset": dataset_information.dataset_name,
            "Embedding": dataset_information.embedding_type,
            "Note": dataset_information.note,
            "Total Queries": total_queries,
        }
        for k in TOP_Ks:
            summary[f"Avg MRR@{k}"] = df[f"MRR@{k}"].mean()
            summary[f"Max MRR@{k}"] = df[f"MRR@{k}"].max()
            summary[f"Min MRR@{k}"] = df[f"MRR@{k}"].min()

        return df, summary

🧪 Running it and summarizing the results

File: src/executions/evaluate_mrr.py
This loads the LLM judgments, computes the overall MRR for each collection, and saves the result to data/mrr_results/processed/ as JSON:

import os
import json

from src.utils.file_utils import read_jsonl_file, write_file
from src.models.mrr_document import MRRDatasetInfo, MRRDataset
from src.evaluation.mrr import MRREvaluation


DATA_DIR = "data/mrr_results/preprocessing"
SUMMARY_DIR = "data/mrr_results/processed"


def evaluate_mrr(
    collection_name: str,
):

    dataset_info = MRRDatasetInfo(
        dataset_name=collection_name,
        note="-",
        embedding_type=collection_name.replace("_", " "),
    )

    dataset: list[MRRDataset] = []

    files = [
        f
        for f in os.listdir(DATA_DIR)
        if os.path.isfile(os.path.join(DATA_DIR, f)) and f.endswith(".jsonl")
    ]
    # filter only file names that start with collection_name
    files = [f for f in files if f.startswith(collection_name)]

    if not files:
        print(f"No files found for collection name: {collection_name}")
        return
    else:
        file_name = files[0]
        data = read_jsonl_file(os.path.join(DATA_DIR, file_name))

        if len(data) == 0:
            print(f"No data found in file: {file_name}")
            return
        else:

            for item in data:
                dataset.append(
                    MRRDataset(
                        query_id=item["query_id"],
                        retrieved=item["retrieved"],
                        relevant=set(item["relevant"]),
                    )
                )

    mrr = MRREvaluation()

    _, summary = mrr.evaluate_dataset(
        dataset_information=dataset_info,
        queries=dataset,
        total_queries=len(dataset),
    )

    write_file(
        SUMMARY_DIR, f"{collection_name}_summary.json", json.dumps(summary, indent=4)
    )

Update main.py to call the function

from src.executions.insert_documents import (
    insert_original_documents,
    insert_original_and_expanded_documents,
    insert_expanded_documents,
)

from src.executions.generate_question import generate_questions
from src.executions.evaluate_similarity_search import evaluate_similarity_search
from src.executions.evaluate_mrr import evaluate_mrr

import src.constants.constants as constants


def main():
    options = {
        "0": "Check your options",
        "1": "Insert original text into Qdrant",
        "2": "Insert original + expanded text into Qdrant",
        "3": "Insert only expanded text into Qdrant",
        "4": "Generate questions",
        "5": "Evaluate similarity search",
        "6": "Evaluate MRR",
        "99": "Exit",
    }

    collections = [
        constants.ORIGINAL_TEXT_COLLECTION_NAME,
        constants.ORIGINAL_TEXT_AND_EXPANDED_COLLECTION_NAME,
        constants.ONLY_EXPANDED_COLLECTION_NAME,
    ]

    for key, value in options.items():
        print(f"{key}: {value}")

    while True:
        choice = input("Enter your choice (0: Check your options): ")

        if choice == "0":
            for key, value in options.items():
                print(f"{key}: {value}")
        elif choice == "1":
            print("Inserting original text into Qdrant...")
            # Call the function to insert original text
            insert_original_documents()
        elif choice == "2":
            print("Inserting original + expanded text into Qdrant...")
            # Call the function to insert original + expanded text
            insert_original_and_expanded_documents()
        elif choice == "3":
            print("Inserting only expanded text into Qdrant...")
            # Call the function to insert only expanded text
            insert_expanded_documents()
        elif choice == "4":
            print("Generating questions...")
            # Call the function to generate questions
            generate_questions()
        elif choice == "5":
            print("Evaluating similarity search...")
            # Call the function to evaluate similarity search

            evaluate_similarity_search(collection_name=collections[0])
            # evaluate_similarity_search(collection_name=collections[1])
            # evaluate_similarity_search(collection_name=collections[2])
        elif choice == "6":
            print("Evaluating MRR...")
            # Call the function to evaluate MRR
            evaluate_mrr(collection_name=collections[0])
            # evaluate_mrr(collection_name=collections[1])
            # evaluate_mrr(collection_name=collections[2])
        elif choice == "99":
            print("Exiting...")
            break
        else:
            print("Invalid choice. Please try again.")


if __name__ == "__main__":
    main()

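🧪 Test it through main.py

▶️ Run the command

python main.py

Then choose menu option 6 to run evaluate_mrr. As with option 5, main.py handles only collections[0] as written, so switch to the commented-out calls and rerun to cover all 3 collections.

The files are created in data/mrr_results/processed

📈 Sample output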

{
    "Dataset": "only_expanded_text",
    "Embedding": "only expanded text",
    "Note": "-",
    "Total Queries": 100,
    "Avg MRR@1": 0.9,
    "Max MRR@1": 1.0,
    "Min MRR@1": 0.0,
    "Avg MRR@3": 0.9333333333333332,
    "Max MRR@3": 1.0,
    "Min MRR@3": 0.0,
    "Avg MRR@5": 0.9333333333333332,
    "Max MRR@5": 1.0,
    "Min MRR@5": 0.0,
    "Avg MRR@10": 0.9333333333333332,
    "Max MRR@10": 1.0,
    "Min MRR@10": 0.0
}

✅ What we get from this step

After evaluating all 3 embedding variants (original, original + expanded, and expanded only), we get summary files such as:

original_text_summary.json
original_and_expanded_text_summary.json
only_expanded_text_summary.json

Each file contains the average (Avg), maximum (Max), and minimum (Min) of MRR@1, MRR@3, MRR@5, and MRR@10 for that variant, ready for further analysis.

📊 Comparing the results: building an MRR report in Markdown

Sample code: merging data from several files and writing it out as Markdown

File: src/utils/report_utils.py
We create the function generate_mrr_comparison_report() to do the following:
Load the MRR summaries produced earlier (*.json)
Combine them into a single table with pandas
Convert it to a Markdown table
Write it to mrr_comparison_report.md
Report which variant has the highest MRR@1

import pandas as pd
import json
from pathlib import Path


def create_merged_summary_to_dataframes(summaries_list, dataset_names):
    dataframes = []

    for i, summary in enumerate(summaries_list):
        # Convert the dictionary to DataFrame if it's not already one
        if isinstance(summary, dict):
            df = pd.DataFrame([summary])  # Convert dict to single-row DataFrame
        else:
            df = summary

        # Add dataset name
        df["Dataset"] = dataset_names[i]
        dataframes.append(df)

    # Now concatenate the list of dataframes
    combined_df = pd.concat(dataframes, ignore_index=True)

    # Reorder columns to put dataset name first
    cols = combined_df.columns.tolist()
    cols = ["Dataset"] + [col for col in cols if col != "Dataset"]
    return combined_df[cols]


def generate_mrr_comparison_report(
    output_path: str = "data/report/mrr_comparison_report.md",
):
    """
    Generate a markdown report comparing MRR results from different approaches.

    Args:
        output_path (str): Path where the markdown report will be saved
    """
    # Read the summary files
    summary_files = [
        "data/mrr_results/processed/original_text_summary.json",
        "data/mrr_results/processed/original_and_expanded_text_summary.json",
        "data/mrr_results/processed/only_expanded_text_summary.json",
    ]

    summaries = []
    for file_path in summary_files:
        with open(file_path, "r") as f:
            summaries.append(json.load(f))

    # Create DataFrame
    df = pd.DataFrame(summaries)

    # Select relevant columns
    mrr_columns = ["Dataset", "Avg MRR@1", "Avg MRR@3", "Avg MRR@5", "Avg MRR@10"]
    df_mrr = df[mrr_columns].copy()  # Create an explicit copy of the DataFrame

    # Format MRR values to 4 decimal places
    for col in mrr_columns[1:]:
        df_mrr[col] = df_mrr[col].apply(lambda x: f"{x:.4f}").astype(str)

    # Generate markdown content
    markdown_content = "# MRR Comparison Report\n\n"
    markdown_content += "## Overview\n"
    markdown_content += "This report compares the Mean Reciprocal Rank (MRR) results for different text representation approaches:\n\n"
    markdown_content += "1. Original text only\n"
    markdown_content += "2. Original + Expanded text\n"
    markdown_content += "3. Expanded text only\n\n"

    markdown_content += "## Results\n\n"
    # DataFrame.to_markdown() requires the optional "tabulate" package
    markdown_content += df_mrr.to_markdown(index=False)

    markdown_content += "\n\n## Analysis\n\n"
    markdown_content += "### Key Findings:\n\n"

    # Get best performing approach
    best_mrr1 = df["Avg MRR@1"].max()
    best_approach = df.loc[df["Avg MRR@1"] == best_mrr1, "Dataset"].iloc[0]

    markdown_content += f"- Best performing approach: **{best_approach}** with MRR@1 of {best_mrr1:.4f}\n"
    markdown_content += "- Performance comparison:\n"

    # Create directory if it doesn't exist
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    # Write the report
    with open(output_path, "w") as f:
        f.write(markdown_content)

    return output_path
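The report relies on pandas, and DataFrame.to_markdown() in turn needs the optional tabulate package. If that dependency is unwanted, the table can be rendered by hand. The following is a minimal standard-library sketch, not the repository's implementation; summaries_to_markdown is a hypothetical helper, and the values come from the sample summary shown earlier.

```python
def summaries_to_markdown(summaries: list[dict], columns: list[str]) -> str:
    """Render selected summary fields as a Markdown table (floats to 4 places)."""

    def fmt(value) -> str:
        return f"{value:.4f}" if isinstance(value, float) else str(value)

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = [
        "| " + " | ".join(fmt(summary[col]) for col in columns) + " |"
        for summary in summaries
    ]
    return "\n".join([header, divider, *rows])


# Values taken from the sample summary shown earlier
summaries = [
    {"Dataset": "only_expanded_text", "Avg MRR@1": 0.9, "Avg MRR@3": 0.9333333333333332},
]
table = summaries_to_markdown(summaries, ["Dataset", "Avg MRR@1", "Avg MRR@3"])
print(table)
```

Passing the three loaded summary dictionaries instead of the single sample row would produce one table row per embedding variant.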

🧪 Calling it from the main menu: update main.py

from src.executions.insert_documents import (
    insert_original_documents,
    insert_original_and_expanded_documents,
    insert_expanded_documents,
)

from src.executions.generate_question import generate_questions
from src.executions.evaluate_similarity_search import evaluate_similarity_search
from src.executions.evaluate_mrr import evaluate_mrr
from src.utils.report_utils import generate_mrr_comparison_report

import src.constants.constants as constants

def main():
    options = {
        "0": "Check your options",
        "1": "Insert original text into Qdrant",
        "2": "Insert original + expanded text into Qdrant",
        "3": "Insert only expanded text into Qdrant",
        "4": "Generate questions",
        "5": "Evaluate similarity search",
        "6": "Evaluate MRR",
        "7": "Generate MRR comparison report",
        "99": "Exit",
    }

    collections = [
        constants.ORIGINAL_TEXT_COLLECTION_NAME,
        constants.ORIGINAL_TEXT_AND_EXPANDED_COLLECTION_NAME,
        constants.ONLY_EXPANDED_COLLECTION_NAME,
    ]

    for key, value in options.items():
        print(f"{key}: {value}")

    while True:
        choice = input("Enter your choice (0: Check your options): ")

        if choice == "0":
            for key, value in options.items():
                print(f"{key}: {value}")
        elif choice == "1":
            print("Inserting original text into Qdrant...")
            # Call the function to insert original text
            insert_original_documents()
        elif choice == "2":
            print("Inserting original + expanded text into Qdrant...")
            # Call the function to insert original + expanded text
            insert_original_and_expanded_documents()
        elif choice == "3":
            print("Inserting only expanded text into Qdrant...")
            # Call the function to insert only expanded text
            insert_expanded_documents()
        elif choice == "4":
            print("Generating questions...")
            # Call the function to generate questions
            generate_questions()
        elif choice == "5":
            print("Evaluating similarity search...")
            # Call the function to evaluate similarity search

            evaluate_similarity_search(collection_name=collections[0])
            # evaluate_similarity_search(collection_name=collections[1])
            # evaluate_similarity_search(collection_name=collections[2])
        elif choice == "6":
            print("Evaluating MRR...")
            # Call the function to evaluate MRR
            evaluate_mrr(collection_name=collections[0])
            # evaluate_mrr(collection_name=collections[1])
            # evaluate_mrr(collection_name=collections[2])
        elif choice == "7":
            print("Generating MRR comparison report...")
            report_path = generate_mrr_comparison_report()
            print(f"Report generated successfully at: {report_path}")
        elif choice == "99":
            print("Exiting...")
            break
        else:
            print("Invalid choice. Please try again.")

if __name__ == "__main__":
    main()

🧪 Test it through main.py

▶️ Run the command

python main.py

Then choose menu option 7 to generate the Markdown file

📈 Sample table: Similarity Search evaluation results using 3 different text embedding variants

We evaluated the document search system (Similarity Search) by measuring Mean Reciprocal Rank (MRR) at several values of K (MRR@1, @3, @5, @10) for documents embedded in each of the three ways. The results were as follows:

💡 Analysis

1. Original + Expanded Text (best result) ✅

Highest MRR at every K, with MRR@1 at 0.96
Combining the original text with the queries the system predicts users will search for (Predicted Queries) lets the embedding capture both the overall meaning and the context of a wider range of queries
This raises the chance that the document vector lands closer to the query vector, so relevant information is retrieved more precisely

2. Original Text Only

Second-best result, with MRR@1 at 0.94
Even without document expansion, the original text preserves the document's true meaning well
However, when a user's query is worded differently from the document, the system may fail to find it

3. Expanded Text Only

Lowest result, with MRR@1 of only 0.90
Although expansion by query prediction covers a wider variety of queries, leaving out the original content can cost the document precision or clear context
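It helps to translate these scores into query counts. MRR@1 is simply the fraction of queries whose top-ranked document was judged relevant, so with a 100-question test set (the Total Queries value in the sample summary, assumed here to apply to all three runs) the gaps amount to a handful of queries. The variant labels below are shorthand for this sketch.

```python
TOTAL_QUERIES = 100  # assumption: the same 100-question set for every variant

mrr_at_1 = {
    "original + expanded": 0.96,
    "original only": 0.94,
    "expanded only": 0.90,
}

# MRR@1 multiplied by the number of queries = queries answered at rank 1
hits_at_rank_1 = {
    name: round(score * TOTAL_QUERIES) for name, score in mrr_at_1.items()
}

best = hits_at_rank_1["original + expanded"]
for name, hits in hits_at_rank_1.items():
    print(f"{name}: {hits} of {TOTAL_QUERIES} queries, {best - hits} behind the best")
```

With gaps of 2 and 6 queries, rerunning the evaluation or enlarging the question set would be a reasonable way to check how stable this ordering is.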

🧠 Conclusion

Expanding documents with Query Prediction and combining the result with the original text was the most effective way to improve vector-based document search (in this experiment)

Keeping the original data alongside the expanded data lets the system capture both the real context and the variety of language users bring to a search

If you want the full sample code, check the link below

GitHub Repository : rag-document-expansion-mrr

If you are looking to build a RAG application or chatbot for use in your organization, PALO IT has a team of specialists ready to help from the first step until the system is running in production, including:

Data Cleaning: preparing clean, ready-to-use data for accurate results
RAG Optimisation: tuning the system to answer quickly and on point, ready for enterprise-scale use
Evaluation: testing and measuring model output so you can be confident your RAG answers well

Whether you are just getting started or already have a system and want to build on it, we are ready to be the partner that helps you go further

Message us on the Facebook page: PALO IT Thailand 🎉
