Reference
InteractiveUtils.edit
JuliaLLMLeaderboard.evaluate_1shot
JuliaLLMLeaderboard.find_definitions
JuliaLLMLeaderboard.load_conversation_from_eval
JuliaLLMLeaderboard.load_definition
JuliaLLMLeaderboard.load_evals
JuliaLLMLeaderboard.preview
JuliaLLMLeaderboard.preview
JuliaLLMLeaderboard.run_benchmark
JuliaLLMLeaderboard.run_code_blocks_additive
JuliaLLMLeaderboard.run_code_main
JuliaLLMLeaderboard.save_definition
JuliaLLMLeaderboard.score_eval
JuliaLLMLeaderboard.score_eval
JuliaLLMLeaderboard.timestamp_now
JuliaLLMLeaderboard.tmapreduce
JuliaLLMLeaderboard.validate_definition
InteractiveUtils.edit
— Function
InteractiveUtils.edit(conversation::AbstractVector{<:PT.AbstractMessage}, bookmark::Int=-1)
Opens the conversation in a preview window formatted as markdown (in VSCode, right-click on the tab and select "Open Preview" to format it nicely).
See also: preview (for rendering as markdown in the REPL)
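A minimal usage sketch, assuming the conversation comes from aigenerate with return_all=true (the prompt below is just a hypothetical placeholder):
using JuliaLLMLeaderboard, PromptingTools
using InteractiveUtils: edit
conversation = aigenerate(:JuliaExpertAsk; ask="Write a function to sum two numbers", return_all=true)
edit(conversation)  # opens the conversation as markdown in your editor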
JuliaLLMLeaderboard.evaluate_1shot
— Method
evaluate_1shot(; conversation, fn_definition, definition, model, prompt_label, schema, parameters::NamedTuple=NamedTuple(), device="UNKNOWN", timestamp=timestamp_now(), version_pt=string(pkgversion(PromptingTools)), prompt_strategy="1SHOT", verbose::Bool=false,
    auto_save::Bool=true, save_dir::AbstractString=dirname(fn_definition), experiment::AbstractString="",
    execution_timeout::Int=60, capture_stdout::Bool=true)
Runs evaluation for a single test case (parse, execute, run examples, run unit tests), including saving the files.
If auto_save=true, it saves the following files into a sub-folder of where the definition file was stored:
- <model-name>/evaluation__PROMPTABC__1SHOT__TIMESTAMP.json
- <model-name>/conversation__PROMPTABC__1SHOT__TIMESTAMP.json
Keyword Arguments
- conversation: the conversation to evaluate (vector of messages), eg, from aigenerate when return_all=true
- fn_definition: path to the definition file (eg, joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml"))
- definition: the test case definition dict loaded from the definition file. It's subset to only the relevant keys for code generation, eg, definition=load_definition(fn_definition)["code_generation"]
- model: the model name, eg, model="gpt4t"
- prompt_label: the prompt label, eg, prompt_label="JuliaExpertAsk"
- schema: the schema used for the prompt, eg, schema="-" or schema="OllamaManagedSchema()"
- parameters: the parameters used for the generation like temperature or top_p, eg, parameters=(; top_p=0.9)
- device: the device used for the generation, eg, device="Apple-MacBook-Pro-M1"
- timestamp: the timestamp used for the generation. Defaults to timestamp=timestamp_now(), which is like "20231201_120000"
- version_pt: the version of PromptingTools used for the generation, eg, version_pt="0.1.0"
- prompt_strategy: the prompt strategy used for the generation, eg, prompt_strategy="1SHOT". Fixed for now!
- verbose: if verbose=true, it will print out more information about the evaluation process, eg, the evaluation errors
- auto_save: if auto_save=true, it will save the evaluation and conversation files into a sub-folder of where the definition file was stored
- save_dir: the directory where the evaluation and conversation files are saved. Defaults to dirname(fn_definition)
- experiment: the experiment name, eg, experiment="my_experiment" (eg, when you're doing a parameter search). Defaults to "" for a standard benchmark run
- execution_timeout: the timeout for the AICode code execution in seconds. Defaults to 60s
- capture_stdout: if capture_stdout=true, AICode will capture the stdout of the code execution. Set to false if you're evaluating with multithreading (stdout capture is not thread-safe). Defaults to true to avoid polluting the benchmark
- remove_tests: if remove_tests=true, AICode will remove any @testset blocks and unit tests from the main code definition (shields against the model inadvertently defining wrong unit tests)
Examples
using JuliaLLMLeaderboard
using PromptingTools
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
d = load_definition(fn_definition)
msg = aigenerate(:JuliaExpertAsk; ask=d["code_generation"]["prompt"], model="gpt4t", return_all=true)
# Try evaluating it -- auto_save=false so as not to pollute our benchmark
evals = evaluate_1shot(; conversation=msg, fn_definition, definition=d["code_generation"], model="gpt4t", prompt_label="JuliaExpertAsk", timestamp=timestamp_now(), device="Apple-MacBook-Pro-M1", schema="-", prompt_strategy="1SHOT", verbose=true, auto_save=false)
JuliaLLMLeaderboard.find_definitions
— Function
Finds all definition.toml filenames in the given path. Returns a list of filenames to load.
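For example, to collect all test cases in the code_generation folder (as in the run_benchmark example below):
fn_definitions = find_definitions("code_generation/")
# returns paths like "code_generation/utility_functions/event_scheduler/definition.toml"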
JuliaLLMLeaderboard.load_conversation_from_eval
— Method
Loads the conversation from the corresponding evaluation file.
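A hypothetical usage sketch, assuming the method takes the path to an evaluation__... JSON file (the filename pattern shown under evaluate_1shot); the path below is made up:
fn_eval = joinpath("code_generation", "utility_functions", "event_scheduler", "gpt4t", "evaluation__JuliaExpertAsk__1SHOT__20231201_120000.json")
conversation = load_conversation_from_eval(fn_eval)
preview(conversation)  # render the loaded conversation as markdown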
JuliaLLMLeaderboard.load_definition
— Method
Loads the test case definition from a TOML file under filename.
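For example, loading a definition and subsetting it to the code generation section:
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
definition = load_definition(fn_definition)["code_generation"]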
JuliaLLMLeaderboard.load_evals
— Method
load_evals(base_dir::AbstractString; score::Bool=true, max_history::Int=5, new_columns::Vector{Symbol}=Symbol[], kwargs...)
Loads all evaluation JSONs from a given directory into a DataFrame (one evaluation per row). The directory is searched recursively, and all files starting with the prefix evaluation__ are loaded.
Keyword Arguments
- score::Bool=true: If score=true, the function will also call score_eval on the resulting DataFrame.
- max_history::Int=5: Only the max_history most recent evaluations are loaded. If max_history=0, all evaluations are loaded.
Returns: DataFrame
Note: It loads a fixed set of columns (set in a local variable eval_cols), so if you added some new columns, you'll need to pass them via the new_columns::Vector{Symbol} argument.
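A usage sketch, assuming your evaluations are saved under the code_generation folder:
df = load_evals("code_generation"; max_history=5)
# df holds the (scored) evaluations; pass new_columns=[:my_column] if you saved extra fields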
JuliaLLMLeaderboard.preview
— Method
preview(conversation::AbstractVector{<:PT.AbstractMessage})
Render a conversation, which is a vector of AbstractMessage objects, as a single markdown-formatted string. Each message is rendered individually and concatenated with separators for clear readability.
This function is particularly useful for displaying the flow of a conversation in a structured and readable format. It leverages the PT.preview method for individual messages to create a cohesive view of the entire conversation.
Arguments
- conversation::AbstractVector{<:PT.AbstractMessage}: A vector of messages representing the conversation.
Returns
- String: A markdown-formatted string representing the entire conversation.
Example
conversation = [
PT.SystemMessage("Welcome"),
PT.UserMessage("Hello"),
PT.AIMessage("Hi, how can I help you?")
]
println(PT.preview(conversation))
This will output:
# System Message
Welcome
---
# User Message
Hello
---
# AI Message
Hi, how can I help you?
---
JuliaLLMLeaderboard.preview
— Method
preview(msg::PT.AbstractMessage)
Render a single AbstractMessage as a markdown-formatted string, highlighting the role of the message sender and the content of the message.
This function identifies the type of the message (User, Data, System, AI, or Unknown) and formats it with a header indicating the sender's role, followed by the content of the message. The output is suitable for nicer rendering, especially in REPL or markdown environments.
Arguments
- msg::PT.AbstractMessage: The message to be rendered.
Returns
- String: A markdown-formatted string representing the message.
Example
msg = PT.UserMessage("Hello, world!")
println(PT.preview(msg))
This will output:
# User Message
Hello, world!
JuliaLLMLeaderboard.run_benchmark
— Method
run_benchmark(; fn_definitions::Vector{<:AbstractString}=find_definitions(joinpath(@__DIR__, "..", "code_generation")),
    models::Vector{String}=["gpt-3.5-turbo-1106"], model_suffix::String="", prompt_labels::Vector{<:AbstractString}=["JuliaExpertCoTTask", "JuliaExpertAsk", "InJulia", "AsIs", "JuliaRecapTask", "JuliaRecapCoTTask"],
    api_kwargs::NamedTuple=NamedTuple(), http_kwargs::NamedTuple=(; readtimeout=300),
    experiment::AbstractString="", save_dir::AbstractString="", auto_save::Bool=true, verbose::Union{Int,Bool}=true, device::AbstractString="-",
    num_samples::Int=1, schema_lookup::AbstractDict{String,<:Any}=Dict{String,<:Any}())
Runs the code generation benchmark with specified models and prompts for the specified number of samples.
It will generate the response, evaluate it and, optionally, also save it.
Note: This benchmark is not parallelized, because locally hosted models would be overwhelmed. However, you can take this code and apply Threads.@spawn to it. The evaluation itself should be thread-safe.
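A hypothetical parallelization sketch (only advisable for hosted APIs, not local models), spawning one task per model and concatenating the resulting evals; the model list is a placeholder:
using Base.Threads: @spawn
fn_definitions = find_definitions("code_generation/")
models_to_run = ["gpt-3.5-turbo-1106", "gpt-4-1106-preview"]  # hypothetical model list
tasks = [@spawn run_benchmark(; fn_definitions, models=[m], auto_save=true) for m in models_to_run]
evals = reduce(vcat, fetch.(tasks))  # flatten the per-model vectors of evals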
Keyword Arguments
- fn_definitions: a vector of paths to definitions of test cases to run (definition.toml). If not specified, it will run all definition files in the code_generation folder.
- models: a vector of models to run. If not specified, it will run only gpt-3.5-turbo-1106.
- model_suffix: a string to append to the model name in the evaluation data, eg, "--optim" if you provide some tuned API kwargs.
- prompt_labels: a vector of prompt labels to run. If not specified, it will run all available.
- num_samples: an integer to specify the number of samples to generate for each model/prompt combination. If not specified, it will generate 1 sample.
- api_kwargs: a named tuple of API kwargs to pass to the aigenerate function. If not specified, it will use the default values. Example: (; temperature=0.5, top_p=0.5)
- http_kwargs: a named tuple of HTTP.jl kwargs to pass to the aigenerate function. Defaults to a 300s timeout. Example: http_kwargs = (; readtimeout=300)
- schema_lookup: a dictionary of schemas to use for each model. If not specified, it will use the default schemas in the registry. Example: Dict("mistral-tiny" => PT.MistralOpenAISchema())
- codefixing_num_rounds: an integer to specify the number of rounds of codefixing to run (with AICodeFixer). If not specified, it will NOT be used.
- codefixing_prompt_labels: a vector of prompt labels to use for codefixing. If not specified, it will default to only "CodeFixerTiny".
- experiment: a string to save in the evaluation data. Useful for future analysis.
- device: specify the device the benchmark is running on, eg, "Apple-MacBook-Pro-M1" or "NVIDIA-GTX-1080Ti", broadly "manufacturer-model".
- save_dir: a string of the path where to save the evaluation data. If not specified, it will save in the same folder as the definition file. Useful for small experiments not to pollute the main benchmark.
- auto_save: a boolean whether to automatically save the evaluation data. If not specified, it will save the data. Otherwise, it only returns the vector of evals.
- verbose: a boolean or integer to print progress. If not specified, it will print high-level progress (verbose=1 or true); for more details, set verbose=2 or verbose=3.
- execution_timeout: an integer to specify the timeout in seconds for the execution of the generated code. If not specified, it will use 10s (per run/test item).
Return
- Vector of evals, ie, a dictionary of evaluation data for each model/prompt combination and each sample.
Notes
- In general, use HTTP timeouts because both local models and APIs can get stuck, eg, http_kwargs=(; readtimeout=150)
- On Mac M1 with Ollama, you want to set api_kwargs=(; options=(; num_gpu=99)) for Ollama to have normal performance (ie, offload all model layers to the GPU)
- For commercial providers (MistralAI, OpenAI), we automatically inject a different random seed for each run to avoid caching
Example
fn_definitions = find_definitions("code_generation/") # all test cases
evals = run_benchmark(; fn_definitions, models=["gpt-3.5-turbo-1106"], prompt_labels=["JuliaExpertAsk"],
experiment="my-first-run", save_dir="temp", auto_save=true, verbose=true, device="Apple-MacBook-Pro-M1",
num_samples=1);
# not using `schema_lookup` as it's not needed for OpenAI models
Or, if you want only one test case, use: fn_definitions = [joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")]
JuliaLLMLeaderboard.run_code_blocks_additive
— Method
run_code_blocks_additive(cb::AICode, code_blocks::AbstractVector{<:AbstractString};
    verbose::Bool = false,
    setup_code::AbstractString = "", teardown_code::AbstractString = "",
    capture_stdout::Bool = true, execution_timeout::Int = 60)
Runner for the additional code_blocks (can be either unit tests or examples); returns the count of examples executed without an error.
code_blocks should be a vector of strings, each of which is a valid Julia expression that can be evaluated without an error thrown. Each successful run (no error thrown) is counted as a successful example.
Keyword Arguments
- verbose=true will provide more information about the test failures.
- setup_code is a string that will be prepended to each code block before it's evaluated. Useful for setting up the environment/test objects.
- teardown_code is a string that will be appended to each code block before it's evaluated. Useful for cleaning up the environment/test objects.
- capture_stdout is a boolean whether to capture the stdout of the code execution. Set to false if you're evaluating with multithreading (stdout capture is not thread-safe).
- execution_timeout is the timeout for the AICode code execution in seconds. Defaults to 60s.
Returns
- count_successful: the number of examples that were executed without an error thrown.
Example
using JuliaLLMLeaderboard: run_code_blocks
using PromptingTools: AICode
cb = AICode("mysum(a,b)=a+b")
code = "mysum(1,2)"
run_code_blocks(cb, [code])
# Output: 1 (= 1 example executed without an error thrown)
JuliaLLMLeaderboard.run_code_main
— Method
run_code_main(msg::PT.AIMessage; verbose::Bool = true, function_name::AbstractString = "",
    prefix::String = "",
    execution_timeout::Int = 60,
    capture_stdout::Bool = true,
    expression_transform::Symbol = :remove_all_tests)
Runs the code block in the message msg and returns the result as an AICode object.
Logic:
- Always execute with a timeout
- Always execute in a "safe mode" (inside a custom module, safe_eval=true)
- Skip any package imports or environment changes (skip_unsafe=true)
- Skip invalid/broken lines (skip_invalid=true)
- Remove any unit tests (expression_transform=:remove_all_tests), because the model might have added some without being asked for it explicitly
- First, evaluate the code block as a whole, and if it fails, try to extract the function definition and evaluate it separately (fallback)
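A usage sketch, assuming msg is the AIMessage returned by aigenerate (field names follow PromptingTools.AICode; the prompt is a placeholder):
msg = aigenerate(:JuliaExpertAsk; ask="Write a function mysum(a,b) that adds two numbers")
cb = run_code_main(msg; verbose=true, execution_timeout=60)
cb.success  # true if the main code block parsed and executed without an error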
JuliaLLMLeaderboard.save_definition
— Method
Saves the test case definition to a TOML file under filename.
JuliaLLMLeaderboard.score_eval
— Method
score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)
Score the evaluation result by distributing max_points equally across the available criteria.
Example
df=@rtransform df :score = score_eval(:parsed, :executed, :unit_tests_passed / :unit_tests_count, :examples_executed / :examples_count)
JuliaLLMLeaderboard.score_eval
— Method
score_eval(eval::AbstractDict; max_points::Int=100)
score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)
Scores the evaluation result eval by distributing max_points equally across the available criteria. Alternatively, you can provide the individual scores as arguments (see above) with values in the 0-1 range.
Eg, if all 4 criteria are available, each will be worth 25% of points:
- parsed (25% if true)
- executed (25% if true)
- unit_tests (25% if all unit tests passed)
- examples (25% if all examples executed without an error thrown)
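For example, with all four criteria available, a solution that parses, executes, passes all unit tests, but succeeds on only half of the examples should score 25 + 25 + 25 + 12.5 points:
score_eval(true, true, 1.0, 0.5)  # 87.5 out of max_points=100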
JuliaLLMLeaderboard.timestamp_now
— Method
Provide a current timestamp in the format yyyymmdd_HHMMSS. If add_random is true, a random number between 100 and 999 is appended to avoid overrides.
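For example:
ts = timestamp_now()  # eg, "20231201_120000" (matches the timestamp keyword of evaluate_1shot)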
JuliaLLMLeaderboard.tmapreduce
— Method
tmapreduce(f, op, itr; tasks_per_thread::Int = 2, kwargs...)
A parallelized version of the mapreduce function leveraging multi-threading.
The function f is applied to each element of itr, and then the results are reduced using an associative two-argument function op.
Arguments
- f: A function to apply to each element of itr.
- op: An associative two-argument reduction function.
- itr: An iterable collection of data.
Keyword Arguments
- tasks_per_thread::Int = 2: The number of tasks spawned per thread. Determines the granularity of parallelism.
- kwargs...: Additional keyword arguments to pass to the inner mapreduce calls.
Implementation Details
The function divides itr into chunks, spawning tasks for processing each chunk in parallel. The size of each chunk is determined by tasks_per_thread and the number of available threads (nthreads). The results from each task are then aggregated using the op function.
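A sketch of this chunking pattern, following the task-per-chunk approach from the Julia blog post referenced below (named tmapreduce_sketch here to make clear it is an illustration, not necessarily the exact implementation):
using Base.Threads: nthreads, @spawn

function tmapreduce_sketch(f, op, itr; tasks_per_thread::Int = 2, kwargs...)
    # aim for tasks_per_thread tasks per available thread
    chunk_size = max(1, length(itr) ÷ (tasks_per_thread * nthreads()))
    # spawn one task per chunk; each task reduces its chunk with a serial mapreduce
    tasks = map(Iterators.partition(itr, chunk_size)) do chunk
        @spawn mapreduce(f, op, chunk; kwargs...)
    end
    # fetch the per-chunk results and combine them with op
    mapreduce(fetch, op, tasks; kwargs...)
end

tmapreduce_sketch(x -> x^2, +, 1:10)  # 385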
Notes
This implementation serves as a general replacement for older patterns. The goal is to introduce this function or a version of it to base Julia in the future.
Example
using Base.Threads: nthreads, @spawn
result = tmapreduce(x -> x^2, +, 1:10)
The above example squares each number in the range 1 through 10 and then sums them up in parallel.
Source: Julia Blog post
JuliaLLMLeaderboard.validate_definition
— Method
validate_definition(definition::AbstractDict; evaluate::Bool=true, verbose::Bool=true)
Validates the definition.toml file for the code generation benchmark.
Returns true if the definition is valid.
Keyword Arguments
- evaluate: a boolean whether to evaluate the definition. If not specified, it will evaluate the definition.
- verbose: a boolean whether to print progress during the evaluation. If not specified, it will print progress.
- kwargs: keyword arguments to pass to the code parsing function (PT.AICode).
Example
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
definition = load_definition(fn_definition)
validate_definition(definition)
# output: true