Reference
- InteractiveUtils.edit
- JuliaLLMLeaderboard.evaluate_1shot
- JuliaLLMLeaderboard.find_definitions
- JuliaLLMLeaderboard.load_conversation_from_eval
- JuliaLLMLeaderboard.load_definition
- JuliaLLMLeaderboard.load_evals
- JuliaLLMLeaderboard.preview
- JuliaLLMLeaderboard.preview
- JuliaLLMLeaderboard.run_benchmark
- JuliaLLMLeaderboard.run_code_blocks_additive
- JuliaLLMLeaderboard.run_code_main
- JuliaLLMLeaderboard.save_definition
- JuliaLLMLeaderboard.score_eval
- JuliaLLMLeaderboard.score_eval
- JuliaLLMLeaderboard.timestamp_now
- JuliaLLMLeaderboard.tmapreduce
- JuliaLLMLeaderboard.validate_definition
InteractiveUtils.edit — Function

InteractiveUtils.edit(conversation::AbstractVector{<:PT.AbstractMessage}, bookmark::Int=-1)

Opens the conversation in a preview window formatted as markdown (in VSCode, right-click on the tab and select "Open Preview" to format it nicely).
See also: preview (for rendering as markdown in REPL)
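A minimal usage sketch (the conversation here is built by hand; in practice it typically comes from `aigenerate` with `return_all=true`, and the editor/preview that opens depends on your environment):

```julia
using JuliaLLMLeaderboard, InteractiveUtils
using PromptingTools
const PT = PromptingTools

conversation = [PT.SystemMessage("You are a helpful assistant."),
    PT.UserMessage("Hello"), PT.AIMessage("Hi, how can I help you?")]
edit(conversation)  # opens the conversation rendered as markdown
```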
JuliaLLMLeaderboard.evaluate_1shot — Method

evaluate_1shot(; conversation, fn_definition, definition, model, prompt_label, schema,
    parameters::NamedTuple=NamedTuple(), device="UNKNOWN", timestamp=timestamp_now(),
    version_pt=string(pkgversion(PromptingTools)), prompt_strategy="1SHOT", verbose::Bool=false,
    auto_save::Bool=true, save_dir::AbstractString=dirname(fn_definition), experiment::AbstractString="",
    execution_timeout::Int=60, capture_stdout::Bool=true)

Runs evaluation for a single test case (parse, execute, run examples, run unit tests), including saving the files.
If `auto_save=true`, it saves the following files:

- `<model-name>/evaluation__PROMPTABC__1SHOT__TIMESTAMP.json`
- `<model-name>/conversation__PROMPTABC__1SHOT__TIMESTAMP.json`

into a sub-folder of where the definition file was stored.
Keyword Arguments
- `conversation`: the conversation to evaluate (vector of messages), eg, from `aigenerate` when `return_all=true`
- `fn_definition`: path to the definition file (eg, `joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")`)
- `definition`: the test case definition dict loaded from the definition file. It's subset to only the relevant keys for code generation, eg, `definition=load_definition(fn_definition)["code_generation"]`
- `model`: the model name, eg, `model="gpt4t"`
- `prompt_label`: the prompt label, eg, `prompt_label="JuliaExpertAsk"`
- `schema`: the schema used for the prompt, eg, `schema="-"` or `schema="OllamaManagedSchema()"`
- `parameters`: the parameters used for the generation like `temperature` or `top_p`, eg, `parameters=(; top_p=0.9)`
- `device`: the device used for the generation, eg, `device="Apple-MacBook-Pro-M1"`
- `timestamp`: the timestamp used for the generation. Defaults to `timestamp=timestamp_now()`, which looks like "20231201_120000"
- `version_pt`: the version of PromptingTools used for the generation, eg, `version_pt="0.1.0"`
- `prompt_strategy`: the prompt strategy used for the generation, eg, `prompt_strategy="1SHOT"`. Fixed for now!
- `verbose`: if `verbose=true`, it will print out more information about the evaluation process, eg, the evaluation errors
- `auto_save`: if `auto_save=true`, it will save the evaluation and conversation files into a sub-folder of where the definition file was stored.
- `save_dir`: the directory where the evaluation and conversation files are saved. Defaults to `dirname(fn_definition)`.
- `experiment`: the experiment name, eg, `experiment="my_experiment"` (eg, when you're doing a parameter search). Defaults to `""` for a standard benchmark run.
- `execution_timeout`: the timeout for the AICode code execution in seconds. Defaults to 60s.
- `capture_stdout`: if `capture_stdout=true`, AICode will capture the stdout of the code execution. Set to `false` if you're evaluating with multithreading (stdout capture is not thread-safe). Defaults to `true` to avoid polluting the benchmark.
- `remove_tests`: if `remove_tests=true`, AICode will remove any @testset blocks and unit tests from the main code definition (shields against the model inadvertently defining wrong unit tests).
Examples
using JuliaLLMLeaderboard
using PromptingTools
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
d = load_definition(fn_definition)
msg = aigenerate(:JuliaExpertAsk; ask=d["code_generation"]["prompt"], model="gpt4t", return_all=true)
# Try evaluating it -- auto_save=false so we don't pollute our benchmark
evals = evaluate_1shot(; conversation=msg, fn_definition, definition=d["code_generation"], model="gpt4t", prompt_label="JuliaExpertAsk", timestamp=timestamp_now(), device="Apple-MacBook-Pro-M1", schema="-", prompt_strategy="1SHOT", verbose=true, auto_save=false)

JuliaLLMLeaderboard.find_definitions — Function

Finds all definition.toml filenames in the given path. Returns a list of filenames to load.
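For example (mirroring the `run_benchmark` example further below):

```julia
using JuliaLLMLeaderboard

fn_definitions = find_definitions("code_generation")  # paths to all `definition.toml` files under this folder
```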
JuliaLLMLeaderboard.load_conversation_from_eval — Method

Loads the conversation from the corresponding evaluation file.
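A hedged sketch, assuming the method takes the path of a saved `evaluation__...` JSON file (both that detail and the path below are illustrative, following the file layout described for `evaluate_1shot` above):

```julia
using JuliaLLMLeaderboard

# Hypothetical path to an evaluation file saved by `evaluate_1shot`
fn_eval = joinpath("code_generation", "utility_functions", "event_scheduler", "gpt4t",
    "evaluation__JuliaExpertAsk__1SHOT__20231201_120000.json")
conversation = load_conversation_from_eval(fn_eval)
preview(conversation)  # render it as markdown in the REPL
```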
JuliaLLMLeaderboard.load_definition — Method

Loads the test case definition from a TOML file under filename.
JuliaLLMLeaderboard.load_evals — Method

load_evals(base_dir::AbstractString; score::Bool=true, max_history::Int=5, new_columns::Vector{Symbol}=Symbol[], kwargs...)

Loads all evaluation JSONs from the given directory into a DataFrame, one evaluation per row. The directory is searched recursively, and all files starting with the prefix evaluation__ are loaded.
Keyword Arguments
- `score::Bool=true`: If `score=true`, the function will also call `score_eval` on the resulting DataFrame.
- `max_history::Int=5`: Only the `max_history` most recent evaluations are loaded. If `max_history=0`, all evaluations are loaded.
Returns: DataFrame
Note: It loads a fixed set of columns (set in a local variable `eval_cols`), so if you have added new columns, you'll need to pass them via the `new_columns::Vector{Symbol}` argument.
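A small usage sketch (the directory name assumes the standard benchmark layout used elsewhere in this reference):

```julia
using JuliaLLMLeaderboard

# Load *all* saved evaluations (not just the 5 most recent) and score them (score=true by default)
df = load_evals("code_generation"; max_history = 0)
```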
JuliaLLMLeaderboard.preview — Method

preview(conversation::AbstractVector{<:PT.AbstractMessage})

Render a conversation, which is a vector of AbstractMessage objects, as a single markdown-formatted string. Each message is rendered individually and concatenated with separators for clear readability.
This function is particularly useful for displaying the flow of a conversation in a structured and readable format. It leverages the PT.preview method for individual messages to create a cohesive view of the entire conversation.
Arguments
conversation::AbstractVector{<:PT.AbstractMessage}: A vector of messages representing the conversation.
Returns
String: A markdown-formatted string representing the entire conversation.
Example
conversation = [
PT.SystemMessage("Welcome"),
PT.UserMessage("Hello"),
PT.AIMessage("Hi, how can I help you?")
]
println(PT.preview(conversation))

This will output:
# System Message
Welcome
---
# User Message
Hello
---
# AI Message
Hi, how can I help you?
---

JuliaLLMLeaderboard.preview — Method

preview(msg::PT.AbstractMessage)

Render a single AbstractMessage as a markdown-formatted string, highlighting the role of the message sender and the content of the message.
This function identifies the type of the message (User, Data, System, AI, or Unknown) and formats it with a header indicating the sender's role, followed by the content of the message. The output is suitable for nicer rendering, especially in REPL or markdown environments.
Arguments
msg::PT.AbstractMessage: The message to be rendered.
Returns
String: A markdown-formatted string representing the message.
Example
msg = PT.UserMessage("Hello, world!")
println(PT.preview(msg))

This will output:
# User Message
Hello, world!

JuliaLLMLeaderboard.run_benchmark — Method

run_benchmark(; fn_definitions::Vector{<:AbstractString}=find_definitions(joinpath(@__DIR__, "..", "code_generation")),
    models::Vector{String}=["gpt-3.5-turbo-1106"], model_suffix::String="",
    prompt_labels::Vector{<:AbstractString}=["JuliaExpertCoTTask", "JuliaExpertAsk", "InJulia", "AsIs", "JuliaRecapTask", "JuliaRecapCoTTask"],
    api_kwargs::NamedTuple=NamedTuple(), http_kwargs::NamedTuple=(; readtimeout=300),
    experiment::AbstractString="", save_dir::AbstractString="", auto_save::Bool=true, verbose::Union{Int,Bool}=true, device::AbstractString="-",
    num_samples::Int=1, schema_lookup::AbstractDict{String,<:Any}=Dict{String,<:Any}())

Runs the code generation benchmark with the specified models and prompts for the specified number of samples.
It will generate the response, evaluate it and, optionally, also save it.
Note: This benchmark is not parallelized, because locally hosted models would be overwhelmed. However, you can take this code and apply Threads.@spawn to it. The evaluation itself should be thread-safe.
Keyword Arguments
- `fn_definitions`: a vector of paths to definitions of test cases to run (`definition.toml`). If not specified, it will run all definition files in the `code_generation` folder.
- `models`: a vector of models to run. If not specified, it will run only `gpt-3.5-turbo-1106`.
- `model_suffix`: a string to append to the model name in the evaluation data, eg, "–optim" if you provide some tuned API kwargs.
- `prompt_labels`: a vector of prompt labels to run. If not specified, it will run all available.
- `num_samples`: an integer to specify the number of samples to generate for each model/prompt combination. If not specified, it will generate 1 sample.
- `api_kwargs`: a named tuple of API kwargs to pass to the `aigenerate` function. If not specified, it will use the default values. Example: `(; temperature=0.5, top_p=0.5)`
- `http_kwargs`: a named tuple of HTTP.jl kwargs to pass to the `aigenerate` function. Defaults to a 300s timeout. Example: `http_kwargs = (; readtimeout=300)`
- `schema_lookup`: a dictionary of schemas to use for each model. If not specified, it will use the default schemas in the registry. Example: `Dict("mistral-tiny" => PT.MistralOpenAISchema())`
- `codefixing_num_rounds`: an integer to specify the number of rounds of codefixing to run (with AICodeFixer). If not specified, it will NOT be used.
- `codefixing_prompt_labels`: a vector of prompt labels to use for codefixing. If not specified, it will default to only "CodeFixerTiny".
- `experiment`: a string to save in the evaluation data. Useful for future analysis.
- `device`: the device the benchmark is running on, eg, "Apple-MacBook-Pro-M1" or "NVIDIA-GTX-1080Ti", broadly "manufacturer-model".
- `save_dir`: a string of the path where to save the evaluation data. If not specified, it will save in the same folder as the definition file. Useful for small experiments so as not to pollute the main benchmark.
- `auto_save`: a boolean whether to automatically save the evaluation data. If not specified, it will save the data. Otherwise, it only returns the vector of `evals`.
- `verbose`: a boolean or integer to print progress. If not specified, it will print high-level progress (`verbose = 1` or `true`); for more details, set `verbose=2` or `verbose=3`.
- `execution_timeout`: an integer to specify the timeout in seconds for the execution of the generated code. If not specified, it will use 10s (per run/test item).
Return

- Vector of `evals`, ie, a dictionary of evaluation data for each model/prompt combination and each sample.
Notes
- In general, use HTTP timeouts because both local models and APIs can get stuck, eg, `http_kwargs=(; readtimeout=150)`
- On Mac M1 with Ollama, you want to set `api_kwargs=(; options=(; num_gpu=99))` for Ollama to have normal performance (ie, offload all model layers to the GPU)
- For commercial providers (MistralAI, OpenAI), we automatically inject a different random seed for each run to avoid caching
Example
fn_definitions = find_definitions("code_generation/") # all test cases
evals = run_benchmark(; fn_definitions, models=["gpt-3.5-turbo-1106"], prompt_labels=["JuliaExpertAsk"],
experiment="my-first-run", save_dir="temp", auto_save=true, verbose=true, device="Apple-MacBook-Pro-M1",
num_samples=1);
# not using `schema_lookup` as it's not needed for OpenAI models

Or if you want to run only one test case, use:

fn_definitions = [joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")]
JuliaLLMLeaderboard.run_code_blocks_additive — Method

run_code_blocks_additive(cb::AICode, code_blocks::AbstractVector{<:AbstractString};
    verbose::Bool = false,
    setup_code::AbstractString = "", teardown_code::AbstractString = "",
    capture_stdout::Bool = true, execution_timeout::Int = 60)

Runner for the additional code_blocks (either unit tests or examples); returns the count of examples executed without an error.
code_blocks should be a vector of strings, each of which is a valid Julia expression that can be evaluated without an error thrown. Each successful run (no error thrown) is counted as a successful example.
Keyword Arguments
- `verbose=true` will provide more information about the test failures.
- `setup_code` is a string that will be prepended to each code block before it's evaluated. Useful for setting up the environment/test objects.
- `teardown_code` is a string that will be appended to each code block before it's evaluated. Useful for cleaning up the environment/test objects.
- `capture_stdout` is a boolean whether to capture the stdout of the code execution. Set to `false` if you're evaluating with multithreading (stdout capture is not thread-safe).
- `execution_timeout` is the timeout for the AICode code execution in seconds. Defaults to 60s.
Returns
- `count_successful`: the number of examples that were executed without an error thrown.
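In addition to the basic example below, here is a hedged sketch of `setup_code`/`teardown_code` in action (the function, strings, and values are purely illustrative, based on the keyword descriptions above):

```julia
using JuliaLLMLeaderboard: run_code_blocks_additive
using PromptingTools: AICode

cb = AICode("charcount(s) = length(s)")       # main definition to test against
examples = ["@assert charcount(word) == 5"]   # relies on `word` being defined
# `setup_code` is prepended (defines `word`); `teardown_code` is appended afterwards
run_code_blocks_additive(cb, examples;
    setup_code = "word = \"hello\"", teardown_code = "word = nothing")
# returns 1 if the example ran without throwing
```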
Example
using JuliaLLMLeaderboard: run_code_blocks
using PromptingTools: AICode
cb = AICode("mysum(a,b)=a+b")
code = "mysum(1,2)"
run_code_blocks(cb, [code])
# Output: 1 (= 1 example executed without an error thrown)

JuliaLLMLeaderboard.run_code_main — Method

run_code_main(msg::PT.AIMessage; verbose::Bool = true, function_name::AbstractString = "",
    prefix::String = "",
    execution_timeout::Int = 60,
    capture_stdout::Bool = true,
    expression_transform::Symbol = :remove_all_tests)

Runs the code block in the message msg and returns the result as an AICode object.
Logic:
- Always execute with a timeout
- Always execute in "safe mode" (inside a custom module, `safe_eval=true`)
- Skip any package imports or environment changes (`skip_unsafe=true`)
- Skip invalid/broken lines (`skip_invalid=true`)
- Remove any unit tests (`expression_transform=:remove_all_tests`), because the model might have added some without being asked explicitly
- First, evaluate the code block as a whole; if that fails, try to extract the function definition and evaluate it separately (fallback)
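A hedged usage sketch (the message content is a hypothetical model response; reading `cb.success` assumes PromptingTools' standard AICode fields):

```julia
using JuliaLLMLeaderboard: run_code_main
using PromptingTools
const PT = PromptingTools

# Hypothetical model response containing a fenced Julia code block
msg = PT.AIMessage("```julia\nmysum(a, b) = a + b\n```")

cb = run_code_main(msg; verbose = false, execution_timeout = 10)
cb.success  # true if the extracted code parsed and executed without an error
```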
JuliaLLMLeaderboard.save_definition — Method

Saves the test case definition to a TOML file under filename.
JuliaLLMLeaderboard.score_eval — Method

score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Score the evaluation result by distributing max_points equally across the available criteria.
Example
df = @rtransform df :score = score_eval(:parsed, :executed, :unit_tests_passed / :unit_tests_count, :examples_executed / :examples_count)

JuliaLLMLeaderboard.score_eval — Method

score_eval(eval::AbstractDict; max_points::Int=100)
score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Scores the evaluation result eval by distributing max_points equally across the available criteria. Alternatively, you can provide the individual scores as arguments (see above) with values in the 0-1 range.
Eg, if all 4 criteria are available, each will be worth 25% of points:
- `parsed` (25% if true)
- `executed` (25% if true)
- `unit_tests` (25% if all unit tests passed)
- `examples` (25% if all examples executed without an error thrown)
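For instance, a worked example under this scheme (assuming partially met criteria earn a proportional share of their 25 points, per the 0-1 ratios accepted above):

```julia
# parsed (25) + executed (25) + 0.75 * 25 (unit tests) + 0.5 * 25 (examples)
score_eval(true, true, 3 / 4, 1 / 2)  # expected ≈ 81.25 out of max_points=100
```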
JuliaLLMLeaderboard.timestamp_now — Method

Provides the current timestamp in the format yyyymmdd_HHMMSS. If `addrandom` is true, a random number between 100 and 999 is appended to avoid overwriting files created with the same timestamp.
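For example (the shape matches the default timestamp shown for `evaluate_1shot` above; any random suffix depends on `addrandom`):

```julia
timestamp_now()  # e.g. "20231201_120000"
```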
JuliaLLMLeaderboard.tmapreduce — Method

tmapreduce(f, op, itr; tasks_per_thread::Int = 2, kwargs...)

A parallelized version of the mapreduce function leveraging multi-threading.
The function f is applied to each element of itr, and then the results are reduced using an associative two-argument function op.
Arguments
- `f`: A function to apply to each element of `itr`.
- `op`: An associative two-argument reduction function.
- `itr`: An iterable collection of data.
Keyword Arguments
- `tasks_per_thread::Int = 2`: The number of tasks spawned per thread. Determines the granularity of parallelism.
- `kwargs...`: Additional keyword arguments to pass to the inner `mapreduce` calls.
Implementation Details
The function divides itr into chunks, spawning tasks for processing each chunk in parallel. The size of each chunk is determined by tasks_per_thread and the number of available threads (nthreads). The results from each task are then aggregated using the op function.
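A minimal sketch of that chunking pattern (adapted from the cited Julia blog post; it may differ from this package's exact implementation):

```julia
using Base.Threads: nthreads, @spawn

function tmapreduce_sketch(f, op, itr; tasks_per_thread::Int = 2, kwargs...)
    # Split the work into roughly tasks_per_thread * nthreads() chunks
    chunk_size = max(1, length(itr) ÷ (tasks_per_thread * nthreads()))
    tasks = map(Iterators.partition(itr, chunk_size)) do chunk
        @spawn mapreduce(f, op, chunk; kwargs...)
    end
    # Reduce the per-chunk results as the tasks finish
    mapreduce(fetch, op, tasks)
end
```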
Notes
This implementation serves as a general replacement for older patterns. The goal is to introduce this function or a version of it to base Julia in the future.
Example
using Base.Threads: nthreads, @spawn
result = tmapreduce(x -> x^2, +, 1:10)

The above example squares each number in the range 1 through 10 and then sums them up in parallel.
Source: Julia Blog post
JuliaLLMLeaderboard.validate_definition — Method

validate_definition(definition::AbstractDict; evaluate::Bool=true, verbose::Bool=true)

Validates the definition.toml file for the code generation benchmark.
Returns true if the definition is valid.
Keyword Arguments
- `evaluate`: a boolean whether to evaluate the definition. If not specified, it will evaluate the definition.
- `verbose`: a boolean whether to print progress during the evaluation. If not specified, it will print progress.
- `kwargs`: keyword arguments to pass to the code parsing function (`PT.AICode`).
Example
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
definition = load_definition(fn_definition)
validate_definition(definition)
# output: true