Reference
- InteractiveUtils.edit
- JuliaLLMLeaderboard.evaluate_1shot
- JuliaLLMLeaderboard.find_definitions
- JuliaLLMLeaderboard.load_conversation_from_eval
- JuliaLLMLeaderboard.load_definition
- JuliaLLMLeaderboard.load_evals
- JuliaLLMLeaderboard.preview
- JuliaLLMLeaderboard.preview
- JuliaLLMLeaderboard.run_benchmark
- JuliaLLMLeaderboard.run_code_blocks_additive
- JuliaLLMLeaderboard.run_code_main
- JuliaLLMLeaderboard.save_definition
- JuliaLLMLeaderboard.score_eval
- JuliaLLMLeaderboard.score_eval
- JuliaLLMLeaderboard.timestamp_now
- JuliaLLMLeaderboard.tmapreduce
- JuliaLLMLeaderboard.validate_definition
InteractiveUtils.edit — Function

InteractiveUtils.edit(conversation::AbstractVector{<:PT.AbstractMessage}, bookmark::Int=-1)

Opens the conversation in a preview window formatted as markdown (in VSCode, right-click on the tab and select "Open Preview" to format it nicely).
See also: preview (for rendering as markdown in REPL)
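A minimal usage sketch (the conversation here is built by hand; in practice it typically comes from `aigenerate` with `return_all=true`, and the editor/preview that opens depends on your environment):

```julia
using JuliaLLMLeaderboard, InteractiveUtils
using PromptingTools
const PT = PromptingTools

conversation = [PT.SystemMessage("You are a helpful assistant."),
    PT.UserMessage("Hello"), PT.AIMessage("Hi, how can I help you?")]
edit(conversation)  # opens the conversation rendered as markdown
```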
JuliaLLMLeaderboard.evaluate_1shot — Method

evaluate_1shot(; conversation, fn_definition, definition, model, prompt_label, schema,
    parameters::NamedTuple=NamedTuple(), device="UNKNOWN", timestamp=timestamp_now(),
    version_pt=string(pkgversion(PromptingTools)), prompt_strategy="1SHOT", verbose::Bool=false,
    auto_save::Bool=true, save_dir::AbstractString=dirname(fn_definition), experiment::AbstractString="",
    execution_timeout::Int=60, capture_stdout::Bool=true)

Runs evaluation for a single test case (parse, execute, run examples, run unit tests), including saving the files.
If `auto_save=true`, it saves the following files:

- `<model-name>/evaluation__PROMPTABC__1SHOT__TIMESTAMP.json`
- `<model-name>/conversation__PROMPTABC__1SHOT__TIMESTAMP.json`

into a sub-folder of where the definition file was stored.
Keyword Arguments
- `conversation`: the conversation to evaluate (vector of messages), eg, from `aigenerate` when `return_all=true`
- `fn_definition`: path to the definition file (eg, `joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")`)
- `definition`: the test case definition dict loaded from the definition file. It's subset to only the relevant keys for code generation, eg, `definition=load_definition(fn_definition)["code_generation"]`
- `model`: the model name, eg, `model="gpt4t"`
- `prompt_label`: the prompt label, eg, `prompt_label="JuliaExpertAsk"`
- `schema`: the schema used for the prompt, eg, `schema="-"` or `schema="OllamaManagedSchema()"`
- `parameters`: the parameters used for the generation like `temperature` or `top_p`, eg, `parameters=(; top_p=0.9)`
- `device`: the device used for the generation, eg, `device="Apple-MacBook-Pro-M1"`
- `timestamp`: the timestamp used for the generation. Defaults to `timestamp=timestamp_now()`, which looks like "20231201_120000"
- `version_pt`: the version of PromptingTools used for the generation, eg, `version_pt="0.1.0"`
- `prompt_strategy`: the prompt strategy used for the generation, eg, `prompt_strategy="1SHOT"`. Fixed for now!
- `verbose`: if `verbose=true`, it will print out more information about the evaluation process, eg, the evaluation errors
- `auto_save`: if `auto_save=true`, it will save the evaluation and conversation files into a sub-folder of where the definition file was stored.
- `save_dir`: the directory where the evaluation and conversation files are saved. Defaults to `dirname(fn_definition)`.
- `experiment`: the experiment name, eg, `experiment="my_experiment"` (eg, when you're doing a parameter search). Defaults to `""` for a standard benchmark run.
- `execution_timeout`: the timeout for the AICode code execution in seconds. Defaults to 60s.
- `capture_stdout`: if `capture_stdout=true`, AICode will capture the stdout of the code execution. Set to `false` if you're evaluating with multithreading (stdout capture is not thread-safe). Defaults to `true` to avoid polluting the benchmark.
- `remove_tests`: if `remove_tests=true`, AICode will remove any @testset blocks and unit tests from the main code definition (shields against the model inadvertently defining wrong unit tests).
Examples
using JuliaLLMLeaderboard
using PromptingTools
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
d = load_definition(fn_definition)
msg = aigenerate(:JuliaExpertAsk; ask=d["code_generation"]["prompt"], model="gpt4t", return_all=true)
# Try evaluating it -- auto_save=false so we don't pollute our benchmark
evals = evaluate_1shot(; conversation=msg, fn_definition, definition=d["code_generation"], model="gpt4t", prompt_label="JuliaExpertAsk", timestamp=timestamp_now(), device="Apple-MacBook-Pro-M1", schema="-", prompt_strategy="1SHOT", verbose=true, auto_save=false)

JuliaLLMLeaderboard.find_definitions — Function

Finds all definition.toml filenames in the given path. Returns a list of filenames to load.
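For example (mirroring the `run_benchmark` example further below):

```julia
using JuliaLLMLeaderboard

fn_definitions = find_definitions("code_generation")  # paths to all `definition.toml` files under this folder
```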
JuliaLLMLeaderboard.load_conversation_from_eval — Method

Loads the conversation from the corresponding evaluation file.
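A hedged sketch, assuming the method takes the path of a saved `evaluation__...` JSON file (both that detail and the path below are illustrative, following the file layout described for `evaluate_1shot` above):

```julia
using JuliaLLMLeaderboard

# Hypothetical path to an evaluation file saved by `evaluate_1shot`
fn_eval = joinpath("code_generation", "utility_functions", "event_scheduler", "gpt4t",
    "evaluation__JuliaExpertAsk__1SHOT__20231201_120000.json")
conversation = load_conversation_from_eval(fn_eval)
preview(conversation)  # render it as markdown in the REPL
```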
JuliaLLMLeaderboard.load_definition — Method

Loads the test case definition from a TOML file under filename.
JuliaLLMLeaderboard.load_evals — Method

load_evals(base_dir::AbstractString; score::Bool=true, max_history::Int=5, new_columns::Vector{Symbol}=Symbol[], kwargs...)

Loads all evaluation JSONs from the given directory into a DataFrame, one evaluation per row. The directory is searched recursively, and all files starting with the prefix evaluation__ are loaded.
Keyword Arguments
- `score::Bool=true`: If `score=true`, the function will also call `score_eval` on the resulting DataFrame.
- `max_history::Int=5`: Only the `max_history` most recent evaluations are loaded. If `max_history=0`, all evaluations are loaded.
Returns: DataFrame
Note: It loads a fixed set of columns (set in a local variable `eval_cols`), so if you have added new columns, you'll need to pass them via the `new_columns::Vector{Symbol}` argument.
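A small usage sketch (the directory name assumes the standard benchmark layout used elsewhere in this reference):

```julia
using JuliaLLMLeaderboard

# Load *all* saved evaluations (not just the 5 most recent) and score them (score=true by default)
df = load_evals("code_generation"; max_history = 0)
```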
JuliaLLMLeaderboard.preview — Method

preview(conversation::AbstractVector{<:PT.AbstractMessage})

Render a conversation, which is a vector of AbstractMessage objects, as a single markdown-formatted string. Each message is rendered individually and concatenated with separators for clear readability.
This function is particularly useful for displaying the flow of a conversation in a structured and readable format. It leverages the PT.preview method for individual messages to create a cohesive view of the entire conversation.
Arguments
conversation::AbstractVector{<:PT.AbstractMessage}: A vector of messages representing the conversation.
Returns
String: A markdown-formatted string representing the entire conversation.
Example
conversation = [
PT.SystemMessage("Welcome"),
PT.UserMessage("Hello"),
PT.AIMessage("Hi, how can I help you?")
]
println(PT.preview(conversation))

This will output:
# System Message
Welcome
---
# User Message
Hello
---
# AI Message
Hi, how can I help you?
---

JuliaLLMLeaderboard.preview — Method

preview(msg::PT.AbstractMessage)

Render a single AbstractMessage as a markdown-formatted string, highlighting the role of the message sender and the content of the message.
This function identifies the type of the message (User, Data, System, AI, or Unknown) and formats it with a header indicating the sender's role, followed by the content of the message. The output is suitable for nicer rendering, especially in REPL or markdown environments.
Arguments
msg::PT.AbstractMessage: The message to be rendered.
Returns
String: A markdown-formatted string representing the message.
Example
msg = PT.UserMessage("Hello, world!")
println(PT.preview(msg))

This will output:
# User Message
Hello, world!

JuliaLLMLeaderboard.run_benchmark — Method

run_benchmark(; fn_definitions::Vector{<:AbstractString}=find_definitions(joinpath(@__DIR__, "..", "code_generation")),
    models::Vector{String}=["gpt-3.5-turbo-1106"], model_suffix::String="",
    prompt_labels::Vector{<:AbstractString}=["JuliaExpertCoTTask", "JuliaExpertAsk", "InJulia", "AsIs", "JuliaRecapTask", "JuliaRecapCoTTask"],
    api_kwargs::NamedTuple=NamedTuple(), http_kwargs::NamedTuple=(; readtimeout=300),
    experiment::AbstractString="", save_dir::AbstractString="", auto_save::Bool=true, verbose::Union{Int,Bool}=true, device::AbstractString="-",
    num_samples::Int=1, schema_lookup::AbstractDict{String,<:Any}=Dict{String,<:Any}())

Runs the code generation benchmark with the specified models and prompts for the specified number of samples.
It will generate the response, evaluate it and, optionally, also save it.
Note: This benchmark is not parallelized, because locally hosted models would be overwhelmed. However, you can take this code and apply Threads.@spawn to it. The evaluation itself should be thread-safe.
Keyword Arguments
- `fn_definitions`: a vector of paths to definitions of test cases to run (`definition.toml`). If not specified, it will run all definition files in the `code_generation` folder.
- `models`: a vector of models to run. If not specified, it will run only `gpt-3.5-turbo-1106`.
- `model_suffix`: a string to append to the model name in the evaluation data, eg, "–optim" if you provide some tuned API kwargs.
- `prompt_labels`: a vector of prompt labels to run. If not specified, it will run all available.
- `num_samples`: an integer to specify the number of samples to generate for each model/prompt combination. If not specified, it will generate 1 sample.
- `api_kwargs`: a named tuple of API kwargs to pass to the `aigenerate` function. If not specified, it will use the default values. Example: `(; temperature=0.5, top_p=0.5)`
- `http_kwargs`: a named tuple of HTTP.jl kwargs to pass to the `aigenerate` function. Defaults to a 300s timeout. Example: `http_kwargs = (; readtimeout=300)`
- `schema_lookup`: a dictionary of schemas to use for each model. If not specified, it will use the default schemas in the registry. Example: `Dict("mistral-tiny" => PT.MistralOpenAISchema())`
- `codefixing_num_rounds`: an integer to specify the number of rounds of codefixing to run (with AICodeFixer). If not specified, it will NOT be used.
- `codefixing_prompt_labels`: a vector of prompt labels to use for codefixing. If not specified, it will default to only "CodeFixerTiny".
- `experiment`: a string to save in the evaluation data. Useful for future analysis.
- `device`: the device the benchmark is running on, eg, "Apple-MacBook-Pro-M1" or "NVIDIA-GTX-1080Ti", broadly "manufacturer-model".
- `save_dir`: a string of the path where to save the evaluation data. If not specified, it will save in the same folder as the definition file. Useful for small experiments so as not to pollute the main benchmark.
- `auto_save`: a boolean whether to automatically save the evaluation data. If not specified, it will save the data. Otherwise, it only returns the vector of `evals`.
- `verbose`: a boolean or integer to print progress. If not specified, it will print high-level progress (`verbose = 1` or `true`); for more details, set `verbose=2` or `verbose=3`.
- `execution_timeout`: an integer to specify the timeout in seconds for the execution of the generated code. If not specified, it will use 10s (per run/test item).
Return

- Vector of `evals`, ie, a dictionary of evaluation data for each model/prompt combination and each sample.
Notes
- In general, use HTTP timeouts because both local models and APIs can get stuck, eg, `http_kwargs=(; readtimeout=150)`
- On Mac M1 with Ollama, you want to set `api_kwargs=(; options=(; num_gpu=99))` for Ollama to have normal performance (ie, offload all model layers to the GPU)
- For commercial providers (MistralAI, OpenAI), we automatically inject a different random seed for each run to avoid caching
Example
fn_definitions = find_definitions("code_generation/") # all test cases
evals = run_benchmark(; fn_definitions, models=["gpt-3.5-turbo-1106"], prompt_labels=["JuliaExpertAsk"],
experiment="my-first-run", save_dir="temp", auto_save=true, verbose=true, device="Apple-MacBook-Pro-M1",
num_samples=1);
# not using `schema_lookup` as it's not needed for OpenAI models

Or if you want to run only one test case, use:

fn_definitions = [joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")]
JuliaLLMLeaderboard.run_code_blocks_additive — Method

run_code_blocks_additive(cb::AICode, code_blocks::AbstractVector{<:AbstractString};
    verbose::Bool = false,
    setup_code::AbstractString = "", teardown_code::AbstractString = "",
    capture_stdout::Bool = true, execution_timeout::Int = 60)

Runner for the additional code_blocks (either unit tests or examples); returns the count of examples executed without an error.
code_blocks should be a vector of strings, each of which is a valid Julia expression that can be evaluated without an error thrown. Each successful run (no error thrown) is counted as a successful example.
Keyword Arguments
- `verbose=true` will provide more information about the test failures.
- `setup_code` is a string that will be prepended to each code block before it's evaluated. Useful for setting up the environment/test objects.
- `teardown_code` is a string that will be appended to each code block before it's evaluated. Useful for cleaning up the environment/test objects.
- `capture_stdout` is a boolean whether to capture the stdout of the code execution. Set to `false` if you're evaluating with multithreading (stdout capture is not thread-safe).
- `execution_timeout` is the timeout for the AICode code execution in seconds. Defaults to 60s.
Returns
- `count_successful`: the number of examples that were executed without an error thrown.
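In addition to the basic example below, here is a hedged sketch of `setup_code`/`teardown_code` in action (the function, strings, and values are purely illustrative, based on the keyword descriptions above):

```julia
using JuliaLLMLeaderboard: run_code_blocks_additive
using PromptingTools: AICode

cb = AICode("charcount(s) = length(s)")       # main definition to test against
examples = ["@assert charcount(word) == 5"]   # relies on `word` being defined
# `setup_code` is prepended (defines `word`); `teardown_code` is appended afterwards
run_code_blocks_additive(cb, examples;
    setup_code = "word = \"hello\"", teardown_code = "word = nothing")
# returns 1 if the example ran without throwing
```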
Example
using JuliaLLMLeaderboard: run_code_blocks
using PromptingTools: AICode
cb = AICode("mysum(a,b)=a+b")
code = "mysum(1,2)"
run_code_blocks(cb, [code])
# Output: 1 (= 1 example executed without an error thrown)

JuliaLLMLeaderboard.run_code_main — Method

run_code_main(msg::PT.AIMessage; verbose::Bool = true, function_name::AbstractString = "",
    prefix::String = "",
    execution_timeout::Int = 60,
    capture_stdout::Bool = true,
    expression_transform::Symbol = :remove_all_tests)

Runs the code block in the message msg and returns the result as an AICode object.
Logic:
- Always execute with a timeout
- Always execute in "safe mode" (inside a custom module, `safe_eval=true`)
- Skip any package imports or environment changes (`skip_unsafe=true`)
- Skip invalid/broken lines (`skip_invalid=true`)
- Remove any unit tests (`expression_transform=:remove_all_tests`), because the model might have added some without being asked explicitly
- First, evaluate the code block as a whole; if that fails, try to extract the function definition and evaluate it separately (fallback)
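A hedged usage sketch (the message content is a hypothetical model response; reading `cb.success` assumes PromptingTools' standard AICode fields):

```julia
using JuliaLLMLeaderboard: run_code_main
using PromptingTools
const PT = PromptingTools

# Hypothetical model response containing a fenced Julia code block
msg = PT.AIMessage("```julia\nmysum(a, b) = a + b\n```")

cb = run_code_main(msg; verbose = false, execution_timeout = 10)
cb.success  # true if the extracted code parsed and executed without an error
```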
JuliaLLMLeaderboard.save_definition — Method

Saves the test case definition to a TOML file under filename.
JuliaLLMLeaderboard.score_eval — Method

score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Score the evaluation result by distributing max_points equally across the available criteria.
Example
df = @rtransform df :score = score_eval(:parsed, :executed, :unit_tests_passed / :unit_tests_count, :examples_executed / :examples_count)

JuliaLLMLeaderboard.score_eval — Method

score_eval(eval::AbstractDict; max_points::Int=100)
score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Scores the evaluation result eval by distributing max_points equally across the available criteria. Alternatively, you can provide the individual scores as arguments (see above) with values in the 0-1 range.
Eg, if all 4 criteria are available, each will be worth 25% of points:
- `parsed` (25% if true)
- `executed` (25% if true)
- `unit_tests` (25% if all unit tests passed)
- `examples` (25% if all examples executed without an error thrown)
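For instance, a worked example under this scheme (assuming partially met criteria earn a proportional share of their 25 points, per the 0-1 ratios accepted above):

```julia
# parsed (25) + executed (25) + 0.75 * 25 (unit tests) + 0.5 * 25 (examples)
score_eval(true, true, 3 / 4, 1 / 2)  # expected ≈ 81.25 out of max_points=100
```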
JuliaLLMLeaderboard.timestamp_now — Method

Provides the current timestamp in the format yyyymmdd_HHMMSS. If `addrandom` is true, a random number between 100 and 999 is appended to avoid overwriting files created with the same timestamp.
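For example (the shape matches the default timestamp shown for `evaluate_1shot` above; any random suffix depends on `addrandom`):

```julia
timestamp_now()  # e.g. "20231201_120000"
```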
JuliaLLMLeaderboard.tmapreduce — Method

tmapreduce(f, op, itr; tasks_per_thread::Int = 2, kwargs...)

A parallelized version of the mapreduce function leveraging multi-threading.
The function f is applied to each element of itr, and then the results are reduced using an associative two-argument function op.
Arguments
- `f`: A function to apply to each element of `itr`.
- `op`: An associative two-argument reduction function.
- `itr`: An iterable collection of data.
Keyword Arguments
- `tasks_per_thread::Int = 2`: The number of tasks spawned per thread. Determines the granularity of parallelism.
- `kwargs...`: Additional keyword arguments to pass to the inner `mapreduce` calls.
Implementation Details
The function divides itr into chunks, spawning tasks for processing each chunk in parallel. The size of each chunk is determined by tasks_per_thread and the number of available threads (nthreads). The results from each task are then aggregated using the op function.
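A minimal sketch of that chunking pattern (adapted from the cited Julia blog post; it may differ from this package's exact implementation):

```julia
using Base.Threads: nthreads, @spawn

function tmapreduce_sketch(f, op, itr; tasks_per_thread::Int = 2, kwargs...)
    # Split the work into roughly tasks_per_thread * nthreads() chunks
    chunk_size = max(1, length(itr) ÷ (tasks_per_thread * nthreads()))
    tasks = map(Iterators.partition(itr, chunk_size)) do chunk
        @spawn mapreduce(f, op, chunk; kwargs...)
    end
    # Reduce the per-chunk results as the tasks finish
    mapreduce(fetch, op, tasks)
end
```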
Notes
This implementation serves as a general replacement for older patterns. The goal is to introduce this function or a version of it to base Julia in the future.
Example
using Base.Threads: nthreads, @spawn
result = tmapreduce(x -> x^2, +, 1:10)

The above example squares each number in the range 1 through 10 and then sums them up in parallel.
Source: Julia Blog post
JuliaLLMLeaderboard.validate_definition — Method

validate_definition(definition::AbstractDict; evaluate::Bool=true, verbose::Bool=true)

Validates the definition.toml file for the code generation benchmark.
Returns true if the definition is valid.
Keyword Arguments
- `evaluate`: a boolean whether to evaluate the definition. If not specified, it will evaluate the definition.
- `verbose`: a boolean whether to print progress during the evaluation. If not specified, it will print progress.
- `kwargs`: keyword arguments to pass to the code parsing function (`PT.AICode`).
Example
fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
definition = load_definition(fn_definition)
validate_definition(definition)
# output: true