# Getting Started
There are a few different options to get you started:
## Run the Benchmark
See `examples/code_gen_benchmark.jl` for an example of how to run the benchmark. You can start by launching the benchmark with the default settings:
```julia
using JuliaLLMLeaderboard

# Set your device for reference, broadly "manufacturer-model"
device = "Apple-MacBook-Pro-M1" # eg, "Apple-MacBook-Pro-M1" or "NVIDIA-GTX-1080Ti"

# Which models to benchmark
model_options = [
    "gpt-3.5-turbo",
    "gpt-3.5-turbo-1106",
    "gpt-4-1106-preview",
    "mistral-tiny",
    "mistral-small",
    "mistral-medium",
]

# Which prompt templates to use
prompt_options = [
    "JuliaExpertCoTTask",
    "JuliaExpertAsk",
    "InJulia",
    "JuliaRecapTask",
    "JuliaRecapCoTTask",
]

# Which test cases to run
fn_definitions = find_definitions("code_generation/")

# Name your experiment (for future reference)
experiment = "my-first-benchmark"

# Run the benchmark -- set `auto_save = false` to disable saving results into files
evals = run_benchmark(; fn_definitions, models = model_options,
    prompt_labels = prompt_options,
    experiment, auto_save = true, verbose = true, device,
    num_samples = 1, http_kwargs = (; readtimeout = 150));

# You can then easily score each of these evaluation runs
scores = score_evals.(evals)
```
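For a quick sanity check, you can summarize the run with plain Julia. The snippet below is a minimal sketch that assumes `scores` is a vector with one numeric score per evaluation run; adapt it if your version of `score_evals` returns a different structure.

```julia
using Statistics

# Quick summary of the run (assumes `scores` is a Vector of numbers)
@info "Finished $(length(scores)) evaluations" mean_score = mean(scores) best = maximum(scores) worst = minimum(scores)
```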
## Create Your Analysis
To inspect individual model answers and their associated scores, see `examples/inspect_results.jl` or `examples/debugging_results.jl`. To compare different models, see `examples/summarize_results_paid.jl`.
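If you prefer to stay in the REPL, you can also build a quick comparison table yourself. The snippet below is only a sketch: the `model` and `prompt_label` field names are assumptions about the evaluation records, so inspect them with `propertynames(first(evals))` and adjust as needed.

```julia
using DataFrames, Statistics

# Hypothetical field names -- check `propertynames(first(evals))` and adjust
df = DataFrame(
    model = [e.model for e in evals],          # assumed field
    prompt = [e.prompt_label for e in evals],  # assumed field
    score = scores,
)

# Mean score per model, highest first
leaderboard = sort(combine(groupby(df, :model), :score => mean => :mean_score),
    :mean_score, rev = true)
```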
## Run an Experiment
Want to run some experiments and save the results? Check out `examples/experiment_hyperparameter_scan.jl` for finding the optimal `temperature` and `top_p`!
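As a simplified illustration of what such a scan does, the sketch below varies `temperature` and `top_p` for a single prompt via PromptingTools.jl. The prompt, model, and grid values are placeholders; the actual experiment script drives the full benchmark instead.

```julia
using PromptingTools

# Placeholder prompt and parameter grid (the real script scans the whole benchmark)
prompt = "Write a Julia function `clean_column(s)` that lowercases a string and replaces spaces with underscores."
results = []
for temperature in (0.1, 0.5, 0.9), top_p in (0.5, 1.0)
    msg = aigenerate(prompt; model = "gpt-3.5-turbo",
        api_kwargs = (; temperature, top_p))
    push!(results, (; temperature, top_p, answer = msg.content))
end
```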
## Contributing Results
- Run Your Evaluation: Choose your model and prompt, and run the test.
- Save Results: Store both the conversation and the evaluation.
- Open a PR: Include the relevant part of the code snippet you changed in the PR comments. For validity, we generally require 1-2 independent verifications of your results, or at least 3 samples for each model-prompt combination.