Reference

InteractiveUtils.editFunction
InteractiveUtils.edit(conversation::AbstractVector{<:PT.AbstractMessage}, bookmark::Int=-1)

Opens the conversation in a preview window formatted as markdown (in VSCode, right-click on the tab and select "Open Preview" to render it nicely).

See also: preview (for rendering as markdown in the REPL)
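
Example (a minimal sketch; the conversation below is illustrative and edit requires a configured editor):

using InteractiveUtils
using JuliaLLMLeaderboard
using PromptingTools
const PT = PromptingTools

conversation = [PT.SystemMessage("You are a helpful assistant."),
    PT.UserMessage("Write a haiku about Julia.")]
edit(conversation)  # opens the conversation rendered as markdown in your editor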

source
JuliaLLMLeaderboard.evaluate_1shotMethod
evaluate_1shot(; conversation, fn_definition, definition, model, prompt_label, schema, parameters::NamedTuple=NamedTuple(), device="UNKNOWN", timestamp=timestamp_now(), version_pt=string(pkgversion(PromptingTools)), prompt_strategy="1SHOT", verbose::Bool=false,
auto_save::Bool=true, save_dir::AbstractString=dirname(fn_definition), experiment::AbstractString="",
execution_timeout::Int=60, capture_stdout::Bool=true)

Runs evaluation for a single test case (parse, execute, run examples, run unit tests), including saving the files.

If auto_save=true, it saves the following files:

  • <model-name>/evaluation__PROMPTABC__1SHOT__TIMESTAMP.json
  • <model-name>/conversation__PROMPTABC__1SHOT__TIMESTAMP.json

into a sub-folder of where the definition file was stored.

Keyword Arguments

  • conversation: the conversation to evaluate (vector of messages), eg, from aigenerate when return_all=true
  • fn_definition: path to the definition file (eg, joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml"))
  • definition: the test case definition dict loaded from the definition file. It's subset to only the relevant keys for code generation, eg, definition=load_definition(fn_definition)["code_generation"]
  • model: the model name, eg, model="gpt4t"
  • prompt_label: the prompt label, eg, prompt_label="JuliaExpertAsk"
  • schema: the schema used for the prompt, eg, schema="-" or schema="OllamaManagedSchema()"
  • parameters: the parameters used for the generation like temperature or top_p, eg, parameters=(; top_p=0.9)
  • device: the device used for the generation, eg, device="Apple-MacBook-Pro-M1"
  • timestamp: the timestamp used for the generation. Defaults to timestamp=timestamp_now(), which produces a string like "20231201_120000"
  • version_pt: the version of PromptingTools used for the generation, eg, version_pt="0.1.0"
  • prompt_strategy: the prompt strategy used for the generation, eg, prompt_strategy="1SHOT". Fixed for now!
  • verbose: if verbose=true, it will print out more information about the evaluation process, eg, the evaluation errors
  • auto_save: if auto_save=true, it will save the evaluation and conversation files into a sub-folder of where the definition file was stored.
  • save_dir: the directory where the evaluation and conversation files are saved. Defaults to dirname(fn_definition).
  • experiment: the experiment name, eg, experiment="my_experiment" (eg, when you're doing a parameter search). Defaults to "" for standard benchmark run.
  • execution_timeout: the timeout for the AICode code execution in seconds. Defaults to 60s.
  • capture_stdout: if capture_stdout=true, AICode will capture the stdout of the code execution. Set to false if you're evaluating with multithreading (stdout capture is not thread-safe). Defaults to true to avoid polluting the benchmark.
  • remove_tests: if remove_tests=true, AICode will remove any @testset blocks and unit tests from the main code definition (shields against the model inadvertently defining wrong unit tests).

Examples

using JuliaLLMLeaderboard
using PromptingTools

fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
d = load_definition(fn_definition)

msg = aigenerate(:JuliaExpertAsk; ask=d["code_generation"]["prompt"], model="gpt4t", return_all=true)

# Try evaluating it -- auto_save=false so as not to pollute our benchmark
evals = evaluate_1shot(; conversation=msg, fn_definition, definition=d["code_generation"], model="gpt4t", prompt_label="JuliaExpertAsk", timestamp=timestamp_now(), device="Apple-MacBook-Pro-M1", schema="-", prompt_strategy="1SHOT", verbose=true, auto_save=false)
source
JuliaLLMLeaderboard.load_evalsMethod
load_evals(base_dir::AbstractString; score::Bool=true, max_history::Int=5, new_columns::Vector{Symbol}=Symbol[], kwargs...)

Loads all evaluation JSON files from a given directory into a DataFrame, one evaluation per row. The directory is searched recursively, and all files starting with the prefix evaluation__ are loaded.

Keyword Arguments

  • score::Bool=true: If score=true, the function will also call score_eval on the resulting DataFrame.
  • max_history::Int=5: Only max_history most recent evaluations are loaded. If max_history=0, all evaluations are loaded.

Returns: DataFrame

Note: It loads a fixed set of columns (set in a local variable eval_cols), so if you have added new columns, you'll need to pass them via the new_columns::Vector{Symbol} argument.
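
Example (a minimal sketch, assuming evaluations have already been saved under the code_generation folder):

using JuliaLLMLeaderboard
using DataFrames

df = load_evals("code_generation"; max_history=5)  # only the 5 most recent evaluations
first(df, 5)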

source
JuliaLLMLeaderboard.previewMethod
preview(conversation::AbstractVector{<:PT.AbstractMessage})

Render a conversation, which is a vector of AbstractMessage objects, as a single markdown-formatted string. Each message is rendered individually and concatenated with separators for clear readability.

This function is particularly useful for displaying the flow of a conversation in a structured and readable format. It leverages the PT.preview method for individual messages to create a cohesive view of the entire conversation.

Arguments

  • conversation::AbstractVector{<:PT.AbstractMessage}: A vector of messages representing the conversation.

Returns

  • String: A markdown-formatted string representing the entire conversation.

Example

conversation = [
    PT.SystemMessage("Welcome"),
    PT.UserMessage("Hello"),
    PT.AIMessage("Hi, how can I help you?")
]
println(PT.preview(conversation))

This will output:

# System Message
Welcome
---
# User Message
Hello
---
# AI Message
Hi, how can I help you?
---
source
JuliaLLMLeaderboard.previewMethod
preview(msg::PT.AbstractMessage)

Render a single AbstractMessage as a markdown-formatted string, highlighting the role of the message sender and the content of the message.

This function identifies the type of the message (User, Data, System, AI, or Unknown) and formats it with a header indicating the sender's role, followed by the content of the message. The output is suitable for nicer rendering, especially in REPL or markdown environments.

Arguments

  • msg::PT.AbstractMessage: The message to be rendered.

Returns

  • String: A markdown-formatted string representing the message.

Example

msg = PT.UserMessage("Hello, world!")
println(PT.preview(msg))

This will output:

# User Message
Hello, world!
source
JuliaLLMLeaderboard.run_benchmarkMethod
run_benchmark(; fn_definitions::Vector{<:AbstractString}=find_definitions(joinpath(@__DIR__, "..", "code_generation")),
models::Vector{String}=["gpt-3.5-turbo-1106"], model_suffix::String="", prompt_labels::Vector{<:AbstractString}=["JuliaExpertCoTTask", "JuliaExpertAsk", "InJulia", "AsIs", "JuliaRecapTask", "JuliaRecapCoTTask"],
api_kwargs::NamedTuple=NamedTuple(), http_kwargs::NamedTuple=(; readtimeout=300),
experiment::AbstractString="", save_dir::AbstractString="", auto_save::Bool=true, verbose::Union{Int,Bool}=true, device::AbstractString="-",
num_samples::Int=1, schema_lookup::AbstractDict{String,<:Any}=Dict{String,<:Any}())

Runs the code generation benchmark with specified models and prompts for the specified number of samples.

It will generate the response, evaluate it and, optionally, also save it.

Note: This benchmark is not parallelized, because locally hosted models would be overwhelmed. However, you can take this code and apply Threads.@spawn to it. The evaluation itself should be thread-safe.

Keyword Arguments

  • fn_definitions: a vector of paths to definitions of test cases to run (definition.toml). If not specified, it will run all definition files in the code_generation folder.

  • models: a vector of models to run. If not specified, it will run only gpt-3.5-turbo-1106.

  • model_suffix: a string to append to the model name in the evaluation data, eg, "--optim" if you provide some tuned API kwargs.

  • prompt_labels: a vector of prompt labels to run. If not specified, it will run all available.

  • num_samples: an integer to specify the number of samples to generate for each model/prompt combination. If not specified, it will generate 1 sample.

  • api_kwargs: a named tuple of API kwargs to pass to the aigenerate function. If not specified, it will use the default values. Example: (; temperature=0.5, top_p=0.5)

  • http_kwargs: a named tuple of HTTP.jl kwargs to pass to the aigenerate function. Defaults to 300s timeout. Example: http_kwargs = (; readtimeout=300)

  • schema_lookup: a dictionary of schemas to use for each model. If not specified, it will use the default schemas in the registry. Example: Dict("mistral-tiny" => PT.MistralOpenAISchema())

  • codefixing_num_rounds: an integer to specify the number of rounds of codefixing to run (with AICodeFixer). If not specified, it will NOT be used.

  • codefixing_prompt_labels: a vector of prompt labels to use for codefixing. If not specified, it will default to only "CodeFixerTiny".

  • experiment: a string to save in the evaluation data. Useful for future analysis.

  • device: specify the device the benchmark is running on, eg, "Apple-MacBook-Pro-M1" or "NVIDIA-GTX-1080Ti", broadly "manufacturer-model".

  • save_dir: a string of path where to save the evaluation data. If not specified, it will save in the same folder as the definition file. Useful for small experiments not to pollute the main benchmark.

  • auto_save: a boolean whether to automatically save the evaluation data. If not specified, it will save the data; otherwise, it only returns the vector of evals.

  • verbose: a boolean or integer to print progress. If not specified, it will print high-level progress (verbose=1 or true); for more details, set verbose=2 or verbose=3.

  • execution_timeout: an integer to specify the timeout in seconds for the execution of the generated code. If not specified, it will use 10s (per run/test item).

Return

  • Vector of evals, ie, a dictionary of evaluation data for each model/prompt combination and each sample.

Notes

  • In general, use HTTP timeouts because both local models and APIs can get stuck, eg, http_kwargs=(; readtimeout=150)
  • On Mac M1 with Ollama, you want to set api_kwargs=(; options=(; num_gpu=99)) for Ollama to have normal performance (ie, offload all model layers to the GPU)
  • For commercial providers (MistralAI, OpenAI), we automatically inject a different random seed for each run to avoid caching

Example

fn_definitions = find_definitions("code_generation/") # all test cases

evals = run_benchmark(; fn_definitions, models=["gpt-3.5-turbo-1106"], prompt_labels=["JuliaExpertAsk"],
    experiment="my-first-run", save_dir="temp", auto_save=true, verbose=true, device="Apple-MacBook-Pro-M1",
    num_samples=1);

# not using `schema_lookup` as it's not needed for OpenAI models

Or, to run only a single test case, use: fn_definitions = [joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")]
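
For a non-OpenAI model, pass its schema via schema_lookup. A hedged sketch (the model name and schema are illustrative and require the corresponding API key or local server):

using PromptingTools
const PT = PromptingTools

schema_lookup = Dict{String,Any}("mistral-tiny" => PT.MistralOpenAISchema())
evals = run_benchmark(; fn_definitions, models=["mistral-tiny"], prompt_labels=["InJulia"],
    schema_lookup, experiment="mistral-test", save_dir="temp", auto_save=true,
    http_kwargs=(; readtimeout=300), num_samples=1);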

source
JuliaLLMLeaderboard.run_code_blocks_additiveMethod
run_code_blocks_additive(cb::AICode, code_blocks::AbstractVector{<:AbstractString};
    verbose::Bool = false,
    setup_code::AbstractString = "", teardown_code::AbstractString = "",
    capture_stdout::Bool = true, execution_timeout::Int = 60)

Runner for additional code_blocks (either unit tests or examples); returns the count of blocks executed without an error thrown.

code_blocks should be a vector of strings, each of which is a valid Julia expression that can be evaluated without an error thrown. Each successful run (no error thrown) is counted as a successful example.

Keyword Arguments

  • verbose=true will provide more information about the test failures.
  • setup_code is a string that will be prepended to each code block before it's evaluated. Useful for setting up the environment/test objects.
  • teardown_code is a string that will be appended to each code block before it's evaluated. Useful for cleaning up the environment/test objects.
  • capture_stdout is a boolean whether to capture the stdout of the code execution. Set to false if you're evaluating with multithreading (stdout capture is not thread-safe).
  • execution_timeout is the timeout for the AICode code execution in seconds. Defaults to 60s.

Returns

  • count_successful: the number of examples that were executed without an error thrown.

Example

using JuliaLLMLeaderboard: run_code_blocks
using PromptingTools: AICode

cb = AICode("mysum(a,b)=a+b")
code = "mysum(1,2)"
run_code_blocks(cb, [code])
# Output: 1 (= 1 example executed without an error thrown)
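
A further sketch showing setup_code (hedged: qualify the call as JuliaLLMLeaderboard.run_code_blocks_additive if it is not exported):

cb = AICode("count_evens(v) = count(iseven, v)")
blocks = ["@assert count_evens(data) == 2", "@assert count_evens(reverse(data)) == 2"]
# setup_code is prepended to every block, so `data` is defined in each run
n_ok = run_code_blocks_additive(cb, blocks; setup_code="data = [1, 2, 3, 4]")
# n_ok == 2 if both blocks ran without an error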
source
JuliaLLMLeaderboard.run_code_mainMethod
run_code_main(msg::PT.AIMessage; verbose::Bool = true, function_name::AbstractString = "",
    prefix::String = "",
    execution_timeout::Int = 60,
    capture_stdout::Bool = true,
    expression_transform::Symbol = :remove_all_tests)

Runs the code block in the message msg and returns the result as an AICode object.

Logic:

  • Always execute with a timeout
  • Always execute in a "safe mode" (inside a custom module, safe_eval=true)
  • Skip any package imports or environment changes (skip_unsafe=true)
  • Skip invalid/broken lines (skip_invalid=true)
  • Remove any unit tests (expression_transform=:remove_all_tests), because the model might have added some without being asked explicitly
  • First, evaluate the code block as a whole, and if it fails, try to extract the function definition and evaluate it separately (fallback)
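
Example (a minimal sketch on a hand-crafted message; in practice you would pass the AIMessage returned by aigenerate, and you may need to qualify the call as JuliaLLMLeaderboard.run_code_main):

using JuliaLLMLeaderboard
using PromptingTools
const PT = PromptingTools

msg = PT.AIMessage(; content = "```julia\nmysum(a, b) = a + b\n```")
cb = run_code_main(msg; verbose=false)
isvalid(cb)  # true if the code block parsed and executed without error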
source
JuliaLLMLeaderboard.score_evalMethod
score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Score the evaluation result by distributing max_points equally across the available criteria.

Example

df = @rtransform df :score = score_eval(:parsed, :executed, :unit_tests_passed / :unit_tests_count, :examples_executed / :examples_count)
source
JuliaLLMLeaderboard.score_evalMethod
score_eval(eval::AbstractDict; max_points::Int=100)

score_eval(parsed, executed, unit_tests_success_ratio, examples_success_ratio; max_points::Int=100)

Scores the evaluation result eval by distributing max_points equally across the available criteria. Alternatively, you can provide the individual scores as arguments (see above) with values in the 0-1 range.

Eg, if all 4 criteria are available, each will be worth 25% of points:

  • parsed (25% if true)
  • executed (25% if true)
  • unit_tests (25% if all unit tests passed)
  • examples (25% if all examples executed without an error thrown)
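
Example (using the positional form; assumes each criterion contributes its 25-point share proportionally to the supplied ratio):

# parsed and executed, 3 of 4 unit tests passed, all examples ran
score_eval(true, true, 3 / 4, 1.0)
# ≈ 93.75 out of 100 (25 + 25 + 18.75 + 25)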
source
JuliaLLMLeaderboard.timestamp_nowMethod

Provides the current timestamp in the format yyyymmdd_HHMMSS. If `addrandom` is true, a random number between 100 and 999 is appended to avoid overwriting files that share the same timestamp.
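
Example:

timestamp_now()  # eg, "20231201_120000"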

source
JuliaLLMLeaderboard.tmapreduceMethod
tmapreduce(f, op, itr; tasks_per_thread::Int = 2, kwargs...)

A parallelized version of the mapreduce function leveraging multi-threading.

The function f is applied to each element of itr, and then the results are reduced using an associative two-argument function op.

Arguments

  • f: A function to apply to each element of itr.
  • op: An associative two-argument reduction function.
  • itr: An iterable collection of data.

Keyword Arguments

  • tasks_per_thread::Int = 2: The number of tasks spawned per thread. Determines the granularity of parallelism.
  • kwargs...: Additional keyword arguments to pass to the inner mapreduce calls.

Implementation Details

The function divides itr into chunks, spawning tasks for processing each chunk in parallel. The size of each chunk is determined by tasks_per_thread and the number of available threads (nthreads). The results from each task are then aggregated using the op function.

Notes

This implementation serves as a general replacement for older patterns. The goal is to introduce this function or a version of it to base Julia in the future.

Example

using Base.Threads: nthreads, @spawn
using JuliaLLMLeaderboard: tmapreduce

result = tmapreduce(x -> x^2, +, 1:10)

The above example squares each number in the range 1 through 10 and then sums them up in parallel.

Source: Julia Blog post

source
JuliaLLMLeaderboard.validate_definitionMethod
validate_definition(definition::AbstractDict; evaluate::Bool=true, verbose::Bool=true)

Validates the definition.toml file for the code generation benchmark.

Returns true if the definition is valid.

Keyword Arguments

  • evaluate: whether to evaluate the definition. Defaults to true.
  • verbose: whether to print progress during the evaluation. Defaults to true.
  • kwargs: keyword arguments to pass to the code parsing function (PT.AICode).

Example

fn_definition = joinpath("code_generation", "utility_functions", "event_scheduler", "definition.toml")
definition = load_definition(fn_definition)
validate_definition(definition)
# output: true
source