Building a RAG Chatbot over DataFrames.jl Documentation - Easy Mode


Elevating our RAG chatbot development, this post explores the integration of PromptingTools.jl and RAGTools, showcasing a more efficient approach to building a chatbot using the DataFrames.jl documentation. We'll also delve into system evaluations.

Last time, we crafted a RAG chatbot from scratch. Today, we're taking a leap forward with PromptingTools.jl and its experimental sub-module, RAGTools, for a more streamlined experience in Julia. Ready to dive deeper?

Effortless RAG System Building with RAGTools

Remember, "RAG" is a cornerstone in Generative AI right now. If you're new to "RAG", this article is a great starting point.

Even if you are familiar with RAG, I would strongly recommend watching Jerry Liu's talk on Building Production-Ready RAG Applications. It's 20 minutes well spent and will help you understand the later sections of this article.

using LinearAlgebra, SparseArrays
using PromptingTools
using PromptingTools.Experimental.RAGTools
## Note: The RAGTools module is still experimental and will change in the future. Ideally, it will be cleaned up and moved to a dedicated package
using JSON3, Serialization, DataFramesMeta
using Statistics: mean
const PT = PromptingTools
const RT = PromptingTools.Experimental.RAGTools

RAG in Two Lines

Start by grabbing a few text pages from the DataFrames.jl documentation, saving them as text files in examples/data. Aim for clean content, free of headers and footers. Remember, garbage in, garbage out!

files = [
    joinpath("examples", "data", "database_style_joins.txt"),
    joinpath("examples", "data", "what_is_dataframes.txt"),
]
# Build an index of chunks, embed them, and create a lookup index of metadata/tags for each chunk
index = build_index(files; extract_metadata = false);

Let's ask a question

# Embeds the question, finds the closest chunks in the index, and generates an answer from the closest chunks
answer = airag(index; question = "I like dplyr, what is the equivalent in Julia?")
AIMessage("The equivalent package in Julia to dplyr in R is DataFramesMeta.jl. It provides convenience functions for data manipulation with syntax similar to dplyr.")

And there you have it, a RAG system in just two lines!

What does it do?
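
As the code comments above say, the question is embedded, the closest chunks are looked up in the index, and the answer is generated from those chunks. The retrieval step can be illustrated with a minimal sketch using plain cosine similarity. This is not the RAGTools implementation: `find_closest_sketch`, `normalize_columns`, and the toy embedding values are all hypothetical and only show the idea.

```julia
using LinearAlgebra

# Toy "embeddings": one column per chunk (hypothetical values, not real model output)
chunks = ["joins combine data frames", "DataFrames.jl basics", "plotting with Makie"]
chunk_embeddings = [1.0 0.2 0.0;
                    0.1 1.0 0.1;
                    0.0 0.1 1.0]

# Normalize columns so that a dot product equals cosine similarity
normalize_columns(A) = A ./ sqrt.(sum(abs2, A; dims = 1))

# Return the positions of the `top_k` chunks closest to the question embedding
function find_closest_sketch(embeddings::AbstractMatrix, query::AbstractVector; top_k::Int = 2)
    scores = normalize_columns(embeddings)' * normalize(query)
    order = sortperm(scores; rev = true)
    return order[1:min(top_k, length(order))]
end

question_embedding = [0.9, 0.3, 0.0]   # pretend this came from an embedding model
closest = find_closest_sketch(chunk_embeddings, question_embedding; top_k = 2)
context = chunks[closest]              # these chunks would be pasted into the prompt
```

In this toy setup the first two chunks are selected. In a real pipeline the embeddings come from an embedding model, and the selected chunks are passed to the LLM together with the question.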

You should save the index for later to avoid re-embedding / re-extracting the document chunks!

serialize("examples/index.jls", index)
index = deserialize("examples/index.jls");
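
A common way to use this is a "load if cached, otherwise build" helper. The sketch below assumes nothing about RAGTools: `load_or_build` and `fake_build` are hypothetical names, and a small NamedTuple stands in for the real index so that no API calls are made.

```julia
using Serialization

# Hypothetical helper: load a cached object if the file exists, otherwise build and save it
function load_or_build(build_fn, path::AbstractString)
    isfile(path) && return deserialize(path)
    obj = build_fn()
    serialize(path, obj)
    return obj
end

# Stand-in for the expensive `build_index(files)` call (no API requests are made here)
build_calls = Ref(0)
function fake_build()
    build_calls[] += 1
    return (chunks = ["chunk A", "chunk B"], sources = ["a.txt", "b.txt"])
end

cache_path = joinpath(mktempdir(), "index.jls")
index_first = load_or_build(fake_build, cache_path)    # builds and writes the file
index_second = load_or_build(fake_build, cache_path)   # reads the file, no rebuild
```

Keep in mind that files written by `serialize` are tied to the Julia version and type definitions in use, so they work best as a local cache rather than a long-term storage format.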

Evaluations: Assessing Quality

To gauge the effectiveness of our system, we need a golden set of quality Q&A pairs. Creating these manually is ideal but can be labor-intensive. Instead, let's generate them from our index:

Generate Q&A Pairs

Here's how to create evaluation sets from your index (we need the text chunks and corresponding file paths/sources).

evals = build_qa_evals(RT.chunks(index),
    RT.sources(index);
    instructions = "None.",
    verbose = true);
[ Info: Q&A Sets built! (cost: $0.102)

[!TIP] In practice, you would review each item in this golden evaluation set (and delete any generic/poor questions). It will determine the future success of your app, so you need to make sure it's good!

Save your evaluation sets for later use (and ideally review them manually).

JSON3.write("examples/evals.json", evals)
evals = JSON3.read("examples/evals.json", Vector{RT.QAEvalItem});

Explore a Q&A pair

Here's a sample Q&A pair to illustrate the process (it's not the best quality but gives you the idea):

 source: examples/data/database_style_joins.txt
 context: Database-Style Joins
Introduction to joins
We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

julia> using DataFrames
 question: What is the purpose of joining two or more data sets together?
 answer: The purpose of joining two or more data sets together is to provide a complete picture of the topic being studied.

Judging a Q&A Pair

Suppose we use this Q&A pair to evaluate our system: we plug the question into airag and get a new answer back. But how do we know it's good?

We use a "judge model" (like GPT-4) for evaluation (we extract a final_rating):

# Note that we used the same question, but generated a different context and answer via `airag`
msg, ctx = airag(index; evals[1].question, return_context = true);
# ctx is a RAGContext object that keeps all intermediate states of the RAG pipeline for easy evaluation
# The judge template needs the retrieved context, the question, and the generated answer
judged = aiextract(:RAGJudgeAnswerFromContext;
    ctx.context, ctx.question, ctx.answer,
    return_type = RT.JudgeAllScores)
judged.content
Dict{Symbol, Any} with 6 entries:
  :final_rating => 4.8
  :clarity => 5
  :completeness => 4
  :relevance => 5
  :consistency => 5
  :helpfulness => 5
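
In this particular output, the final_rating matches the plain average of the five criteria. Whether the judge always derives it this way is an assumption; the arithmetic below only checks it for the numbers shown above.

```julia
using Statistics: mean

# Criterion scores copied from the judged output above
scores = Dict(:clarity => 5, :completeness => 4, :relevance => 5,
    :consistency => 5, :helpfulness => 5)

# Averaging the five criteria reproduces the reported final_rating
final_rating = mean(values(scores))
```

Here (5 + 4 + 5 + 5 + 5) / 5 = 4.8, so the single "completeness" score of 4 is what pulls the rating below 5.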

But final_rating for the generated answer is not the only metric we should watch. We should also judge the quality of the provided context ("retrieval_score") and a few others.

This generation+evaluation loop and a few common metrics are available in run_qa_evals:

x = run_qa_evals(evals[10], ctx;
    parameters_dict = Dict(:top_k => 3), verbose = true, model_judge = "gpt4t")
 source: examples/data/database_style_joins.txt
 context: outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.
semijoin: Like an inner join, but output is restricted to columns from the first (left) argument.
 question: What is the difference between outer join and semi join?
 answer: The purpose of joining two or more data sets together is to combine them in order to provide a complete picture or analysis of a specific topic or dataset. By joining data sets, we can combine information from multiple sources to gain more insights and make more informed decisions.
 retrieval_score: 0.0
 retrieval_rank: nothing
 answer_score: 5
 parameters: Dict(:top_k => 3)

QAEvalResult is a simple struct that holds the evaluation results for a single Q&A pair. It becomes useful when we evaluate a whole set of Q&A pairs (see below).
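
To build intuition for the retrieval_score and retrieval_rank fields, here is a minimal sketch of one plausible way to score retrieval: check whether the chunk that the question was generated from shows up among the retrieved chunks, and at which position. `retrieval_hit` and `retrieval_position` are hypothetical helpers and are not taken from RAGTools.

```julia
# Hypothetical scoring helpers: they mirror the idea of `retrieval_score` and
# `retrieval_rank`, but they are not the RAGTools implementation.

# 1.0 if the original chunk overlaps with any retrieved chunk, else 0.0
function retrieval_hit(original::AbstractString, retrieved::Vector{<:AbstractString})
    return any(c -> occursin(original, c) || occursin(c, original), retrieved) ? 1.0 : 0.0
end

# Position of the first overlapping chunk, or `nothing` when there is no hit
function retrieval_position(original::AbstractString, retrieved::Vector{<:AbstractString})
    return findfirst(c -> occursin(original, c) || occursin(c, original), retrieved)
end

original = "semijoin: Like an inner join"
retrieved = ["Introduction to joins", "semijoin: Like an inner join, but ...", "outerjoin: ..."]

hit = retrieval_hit(original, retrieved)                  # 1.0
rank = retrieval_position(original, retrieved)            # 2
miss = retrieval_position(original, ["unrelated text"])   # nothing
```

Under this reading, a result with retrieval_score 0.0 and retrieval_rank nothing, as in the example above, would mean that the source chunk of the question was not among the retrieved chunks.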

Evaluate the Whole Set

Fortunately, we don't have to do the evaluations manually one by one.

Let's run each question & answer pair through our eval loop asynchronously (we do it only for the first 10 to save time). See ?airag for the parameters you can tweak, e.g., top_k.

results = asyncmap(evals[1:10]) do qa_item
    # Generate an answer -- often you want the model_judge to be the highest quality possible, eg, "GPT-4 Turbo" (alias "gpt4t")
    msg, ctx = airag(index; qa_item.question, return_context = true,
        top_k = 3, verbose = false, model_judge = "gpt4t")
    # Evaluate the response
    # Note: you can log key parameters for easier analysis later
    run_qa_evals(qa_item, ctx; parameters_dict = Dict(:top_k => 3), verbose = false)
end
## Note that the "failed" evals can show as "nothing" (failed as in there was some API error or parsing error), so make sure to handle them.
results = filter(x->!isnothing(x.answer_score), results);

Note: You could also use the vectorized version results = run_qa_evals(evals) to evaluate all items at once and skip the above code block.
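
If you write the loop yourself, it helps to make sure that one failed request does not abort the whole batch. A minimal sketch of that pattern with `asyncmap` from Base follows; `flaky_eval` is a hypothetical stand-in for one generate-and-judge round trip and makes no API calls.

```julia
# Hypothetical stand-in for one generate-and-judge round trip; it fails on purpose
# for one item to imitate an API or parsing error.
function flaky_eval(item::Int)
    item == 3 && error("simulated API failure")
    return (item = item, answer_score = 4.0 + 0.1 * item)
end

# Wrap each call so that one failure does not abort the whole batch
results = asyncmap(1:5; ntasks = 2) do item
    try
        flaky_eval(item)
    catch err
        @warn "Evaluation failed for item $item" exception=err
        nothing
    end
end

failed = count(isnothing, results)        # how many items failed
results_ok = filter(!isnothing, results)  # keep only the successful ones
```

The `ntasks` keyword limits how many calls run concurrently, which can be useful when the API enforces rate limits.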

# Let's take a simple average to calculate our score
@info "RAG Evals: $(length(results)) results, Avg. score: $(round(mean(x->x.answer_score, results);digits=1)), Retrieval score: $(round(Int,100*mean(x->x.retrieval_score,results)))%"
[ Info: RAG Evals: 10 results, Avg. score: 4.6, Retrieval score: 100%

Note: The retrieval score is 100% only because we have two small documents and are running on only 10 items. In practice, you would have a much larger document set and a much larger eval set, which would result in a more representative retrieval score.
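
Logging the parameters with each result pays off once you compare several configurations. The sketch below groups results by top_k and summarizes each group. The field names follow the printed result shown earlier, but the records themselves are made-up NamedTuples for illustration, not real evaluation output.

```julia
using Statistics: mean

# Hypothetical result records; real runs would come from the eval loop
# (field names follow the printed result shown earlier, the values are made up)
results = [
    (answer_score = 5.0, retrieval_score = 1.0, parameters = Dict(:top_k => 3)),
    (answer_score = 4.0, retrieval_score = 0.0, parameters = Dict(:top_k => 3)),
    (answer_score = 5.0, retrieval_score = 1.0, parameters = Dict(:top_k => 5)),
    (answer_score = 4.5, retrieval_score = 1.0, parameters = Dict(:top_k => 5)),
]

# Summarize scores separately for each value of `top_k`
summary = Dict{Int, NamedTuple}()
for k in unique(r.parameters[:top_k] for r in results)
    group = filter(r -> r.parameters[:top_k] == k, results)
    summary[k] = (n = length(group),
        avg_answer = round(mean(r -> r.answer_score, group); digits = 2),
        retrieval_pct = round(Int, 100 * mean(r -> r.retrieval_score, group)))
end
```

Note that the percentage is computed by multiplying by 100 before rounding, so a hit rate of 0.5 is reported as 50 rather than being rounded to 0 or 100.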

If you prefer, you can also analyze the results in a DataFrame:

df = DataFrame(results)

We're done for today!

Where to Go From Here?

... and much more! See some ideas in the Anyscale RAG tutorial.

If you want to learn more about helpful patterns and advanced evaluation techniques (eg, how to measure "hallucinations"), check out this talk by Eugene Yan: Building Blocks for LLM Systems & Products.

CC BY-SA 4.0 Jan Siml. Last modified: April 30, 2024.