Results for Paid LLM APIs
The table below captures the performance of models from several commercial LLM APIs: OpenAI (GPT-3.5 Turbo, GPT-4, GPT-4o, ...), MistralAI (tiny, small, medium, large), Anthropic (Claude), Google (Gemini), and DeepSeek.
There are many other providers, but OpenAI is the most commonly used. MistralAI's commercial API launched recently and the company has a strong relationship with the open-source community, so we've added it as a challenger for comparing OpenAI's cost-effectiveness ("cost per point", i.e., how many cents you would pay for one point in this benchmark).
Reminder: The scores below are on a scale of 0-100, where 100 is the best possible score and 0 means the generated code was not even parseable.
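For intuition, "cost per point" works out as follows (a hypothetical example with made-up numbers, not data from the tables below):
# Hypothetical "cost per point" calculation
avg_score = 75.0                         # average benchmark score (0-100 pts)
avg_cost_usd = 0.004                     # assumed average cost per query in USD
cost_cents = avg_cost_usd * 100          # convert to US cents
cost_per_point = cost_cents / avg_score  # ≈ 0.0053 cents per point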
# Imports
using JuliaLLMLeaderboard
using CairoMakie, AlgebraOfGraphics
using MarkdownTables, DataFramesMeta
using Statistics: mean, median, quantile, std;
# ! Configuration
SAVE_PLOTS = false
DIR_RESULTS = joinpath(pkgdir(JuliaLLMLeaderboard), "code_generation")
PAID_MODELS_DEFAULT = [
"gpt-3.5-turbo",
"gpt-3.5-turbo-1106",
"gpt-3.5-turbo-0125",
"gpt-4-1106-preview",
"gpt-4-0125-preview",
"gpt-4-turbo-2024-04-09",
"gpt-4o-2024-05-13",
"gpt-4o-mini-2024-07-18",
"gpt-4o-2024-08-06",
"mistral-tiny",
"mistral-small",
"mistral-medium",
"mistral-large",
"mistral-small-2402",
"mistral-medium-2312",
"mistral-large-2402",
"claude-3-opus-20240229",
"claude-3-sonnet-20240229",
"claude-3-haiku-20240307",
"claude-3-5-sonnet-20240620",
"claude-2.1",
"gemini-1.0-pro-latest",
"deepseek-chat",
"deepseek-coder",
"codestral-2405",
"mistral-large-2407"
];
PROMPTS = [
"JuliaExpertCoTTask",
"JuliaExpertAsk",
"InJulia",
"JuliaRecapTask",
"JuliaRecapCoTTask"
];
Load Latest Results
Load only the 10 most recent evaluations available for each definition/model/prompt combination.
df = @chain begin
load_evals(DIR_RESULTS; max_history = 10)
@rsubset :model in PAID_MODELS_DEFAULT && :prompt_label in PROMPTS
end;
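To confirm that the history cap was applied, you can count the evaluations per definition/model/prompt combination (a quick sanity check; :name holds the test-case definition):
# Sanity check: at most `max_history` evaluations per definition/model/prompt
counts = @by df [:model, :prompt_label, :name] :n = length(:score)
@assert maximum(counts.n) <= 10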
Model Comparison
Highest average score by model:
fig = @chain df begin
@by [:model] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score = mean(:score)
end
transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
@orderby -:score
@aside local order_ = _.model
data(_) *
mapping(:model => sorter(order_) => "Model",
:score => "Avg. Score (Max 100 pts)") *
visual(BarPlot; bar_labels = :y,
label_offset = 0, label_rotation = 1)
draw(;
axis = (limits = (nothing, nothing, 0, 100),
xticklabelrotation = 45,
title = "Paid APIs Performance"))
end
SAVE_PLOTS && save("assets/model-comparison-paid.png", fig)
fig
Table (elapsed time in seconds; cost in US cents):
output = @chain df begin
@by [:model] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score = mean(:score)
:score_std_deviation = std(:score)
:count_zero_score = count(iszero, :score)
:count_full_score = count(==(100), :score)
end
transform(_,
[:elapsed, :score, :score_std_deviation] .=> ByRow(x -> round(x, digits = 1)),
renamecols = false)
@rtransform :cost_cents = round(:cost * 100; digits = 2)
select(Not(:cost))
@orderby -:score
rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Elapsed | Score | Score Std Deviation | Count Zero Score | Count Full Score | Cost Cents |
---|---|---|---|---|---|---|
claude-3-5-sonnet-20240620 | 6.3 | 85.8 | 21.1 | 13 | 355 | 0.73 |
claude-3-opus-20240229 | 20.3 | 83.2 | 19.6 | 2 | 329 | 3.9 |
claude-3-sonnet-20240229 | 8.7 | 78.8 | 26.2 | 22 | 308 | 0.73 |
gpt-4o-2024-08-06 | 4.6 | 76.6 | 27.9 | 26 | 310 | 0.0 |
codestral-2405 | 1.9 | 76.3 | 29.3 | 33 | 276 | 0.0 |
gpt-4-turbo-2024-04-09 | 10.8 | 75.3 | 29.6 | 38 | 290 | 1.38 |
claude-3-haiku-20240307 | 4.0 | 74.9 | 27.2 | 9 | 261 | 0.05 |
gpt-4-0125-preview | 30.3 | 74.4 | 30.3 | 39 | 284 | 1.29 |
gpt-4-1106-preview | 22.4 | 74.4 | 29.9 | 19 | 142 | 1.21 |
gpt-4o-mini-2024-07-18 | 5.1 | 74.0 | 29.4 | 32 | 276 | 0.03 |
mistral-large-2407 | 11.3 | 73.6 | 29.5 | 15 | 137 | 0.49 |
gpt-4o-2024-05-13 | 4.3 | 72.9 | 29.1 | 29 | 257 | 0.0 |
deepseek-coder | 13.0 | 71.6 | 32.6 | 39 | 115 | 0.01 |
mistral-large-2402 | 8.5 | 71.6 | 27.2 | 13 | 223 | 0.0 |
deepseek-chat | 17.9 | 71.3 | 32.9 | 30 | 140 | 0.01 |
claude-2.1 | 10.1 | 67.9 | 30.8 | 47 | 229 | 0.8 |
gpt-3.5-turbo-0125 | 1.2 | 61.7 | 36.6 | 125 | 192 | 0.03 |
mistral-medium | 18.1 | 60.8 | 33.2 | 22 | 90 | 0.41 |
mistral-small | 5.9 | 60.1 | 30.2 | 27 | 76 | 0.09 |
mistral-small-2402 | 5.3 | 59.9 | 29.4 | 31 | 169 | 0.0 |
gpt-3.5-turbo-1106 | 2.1 | 58.4 | 39.2 | 82 | 97 | 0.04 |
mistral-tiny | 4.6 | 46.9 | 32.0 | 75 | 42 | 0.02 |
gpt-3.5-turbo | 3.6 | 42.3 | 38.2 | 132 | 54 | 0.04 |
gemini-1.0-pro-latest | 4.2 | 34.8 | 27.4 | 181 | 25 | 0.0 |
While the victory of the frontier models (Claude 3.5 Sonnet and Claude 3 Opus, followed by the GPT-4 family) is not surprising, note that our sample size is small and the standard deviation is quite high.
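To put numbers on that uncertainty, you can compute the standard error of the mean score per model (a minimal sketch reusing the df loaded above):
# Standard error of the mean score for each model (sketch)
uncertainty = @chain df begin
    @by [:model] begin
        :score_mean = mean(:score)
        :score_sem = std(:score) / sqrt(length(:score))  # std. error of the mean
        :n = length(:score)
    end
    @orderby -:score_mean
end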
Overview by Prompt Template
Bar chart with all paid models and various prompt templates
fig = @chain df begin
@by [:model, :prompt_label] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score = mean(:score)
:score_median = median(:score)
:cnt = $nrow
end
@aside local average_ = @by(_, :model, :avg=mean(:score)) |>
x -> @orderby(x, -:avg).model
data(_) *
mapping(:model => sorter(average_) => "Model",
:score => "Avg. Score (Max 100 pts)",
color = :prompt_label => "Prompts",
dodge = :prompt_label) * visual(BarPlot)
draw(;
figure = (; size = (900, 600)),
axis = (xticklabelrotation = 45, title = "Comparison for Paid APIs"))
end
SAVE_PLOTS && save("assets/model-prompt-comparison-paid.png", fig)
fig
Table:
- Surprised by the low performance of some models (e.g., GPT-3.5 Turbo) on the CoT prompts? It's because the model accidentally sends a premature "stop" token before it writes the code; see the quick check below.
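You can surface these premature-stop failures by counting zero-score generations per model and prompt (a quick sketch reusing df; a score of 0 means the output was not parseable):
# Count zero-score (unparseable) generations per model/prompt (sketch)
zero_counts = @chain df begin
    @by [:model, :prompt_label] :zero_scores = count(iszero, :score)
    @orderby -:zero_scores
end
first(zero_counts, 10)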
output = @chain df begin
@by [:model, :prompt_label] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score = mean(:score)
end
@aside average_ = @by _ :model :AverageScore=mean(:score) |> x -> round(x, digits = 1)
unstack(:model, :prompt_label, :score; fill = 0.0)
transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
leftjoin(average_, on = :model)
@orderby -:AverageScore
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
model | InJulia | JuliaExpertAsk | JuliaExpertCoTTask | JuliaRecapCoTTask | JuliaRecapTask | AverageScore |
---|---|---|---|---|---|---|
claude-3-5-sonnet-20240620 | 87.7 | 84.2 | 86.1 | 84.5 | 86.5 | 85.8 |
claude-3-opus-20240229 | 84.1 | 84.0 | 85.1 | 81.6 | 81.2 | 83.2 |
claude-3-sonnet-20240229 | 80.9 | 79.0 | 80.3 | 75.6 | 78.2 | 78.8 |
gpt-4o-2024-08-06 | 74.8 | 79.6 | 82.3 | 74.0 | 72.6 | 76.6 |
codestral-2405 | 78.5 | 78.0 | 74.2 | 77.8 | 72.7 | 76.3 |
gpt-4-turbo-2024-04-09 | 76.5 | 78.5 | 75.6 | 73.4 | 72.3 | 75.3 |
claude-3-haiku-20240307 | 75.1 | 75.0 | 64.1 | 79.1 | 81.3 | 74.9 |
gpt-4-0125-preview | 72.7 | 77.5 | 72.4 | 75.0 | 74.6 | 74.4 |
gpt-4-1106-preview | 74.9 | 79.1 | 71.8 | 72.4 | 73.6 | 74.4 |
gpt-4o-mini-2024-07-18 | 73.0 | 76.6 | 73.9 | 74.1 | 72.5 | 74.0 |
mistral-large-2407 | 69.9 | 78.7 | 77.3 | 72.1 | 70.0 | 73.6 |
gpt-4o-2024-05-13 | 71.2 | 75.7 | 80.6 | 67.9 | 69.2 | 72.9 |
deepseek-coder | 81.1 | 69.9 | 56.8 | 71.9 | 78.1 | 71.6 |
mistral-large-2402 | 67.9 | 71.1 | 71.0 | 74.2 | 73.6 | 71.6 |
deepseek-chat | 76.4 | 56.4 | 75.3 | 72.5 | 75.5 | 71.2 |
claude-2.1 | 64.3 | 65.4 | 72.2 | 69.2 | 68.4 | 67.9 |
gpt-3.5-turbo-0125 | 73.0 | 74.7 | 64.7 | 29.3 | 66.9 | 61.7 |
mistral-medium | 63.1 | 60.5 | 63.4 | 55.9 | 61.2 | 60.8 |
mistral-small | 67.3 | 61.4 | 59.9 | 56.1 | 55.9 | 60.1 |
mistral-small-2402 | 61.7 | 63.0 | 62.1 | 56.6 | 55.9 | 59.9 |
gpt-3.5-turbo-1106 | 74.6 | 73.6 | 73.4 | 15.4 | 55.0 | 58.4 |
mistral-tiny | 51.7 | 44.3 | 41.1 | 50.5 | 47.2 | 47.0 |
gpt-3.5-turbo | 73.1 | 60.9 | 32.8 | 26.2 | 18.4 | 42.3 |
gemini-1.0-pro-latest | 36.0 | 38.6 | 35.2 | 30.8 | 33.3 | 34.8 |
Other Considerations
Comparison of Cost vs Average Score
fig = @chain df begin
@by [:model, :prompt_label] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score = mean(:score)
:score_median = median(:score)
:cnt = $nrow
end
data(_) * mapping(:cost => (x -> x * 100) => "Avg. Cost (US Cents/query)",
:score => "Avg. Score (Max 100 pts)",
color = :model => "Model")
draw(;
axis = (xticklabelrotation = 45,
title = "Cost vs Score for Paid APIs"))
end
SAVE_PLOTS && save("assets/cost-vs-score-scatter-paid.png", fig)
fig
Table:
- Point per cent is the average score divided by the average cost in US cents (e.g., a model averaging 75 points at 0.03 cents per query earns 2500 points per cent); models with no recorded cost show Inf.
output = @chain df begin
@by [:model, :prompt_label] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score_avg = mean(:score)
:score_median = median(:score)
:cnt = $nrow
end
@rtransform :point_per_cent = :score_avg / :cost / 100
@orderby -:point_per_cent
#
transform(_,
names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
renamecols = false)
@rtransform :cost_cents = round(:cost * 100; digits = 2)
select(Not(:cost))
rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Cent | Cost Cents |
---|---|---|---|---|---|---|---|
codestral-2405 | InJulia | 2.0 | 78.5 | 92.5 | 140.0 | Inf | 0.0 |
codestral-2405 | JuliaExpertAsk | 1.6 | 78.0 | 95.0 | 140.0 | Inf | 0.0 |
codestral-2405 | JuliaExpertCoTTask | 1.8 | 74.2 | 95.0 | 140.0 | Inf | 0.0 |
codestral-2405 | JuliaRecapCoTTask | 2.2 | 77.8 | 90.8 | 140.0 | Inf | 0.0 |
codestral-2405 | JuliaRecapTask | 2.2 | 72.7 | 90.0 | 139.0 | Inf | 0.0 |
gemini-1.0-pro-latest | InJulia | 4.1 | 36.0 | 25.0 | 140.0 | Inf | 0.0 |
gemini-1.0-pro-latest | JuliaExpertAsk | 3.9 | 38.6 | 50.0 | 140.0 | Inf | 0.0 |
gemini-1.0-pro-latest | JuliaExpertCoTTask | 4.0 | 35.2 | 25.0 | 140.0 | Inf | 0.0 |
gemini-1.0-pro-latest | JuliaRecapCoTTask | 4.8 | 30.8 | 25.0 | 140.0 | Inf | 0.0 |
gemini-1.0-pro-latest | JuliaRecapTask | 4.3 | 33.3 | 25.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-05-13 | InJulia | 4.2 | 71.2 | 85.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-05-13 | JuliaExpertAsk | 1.6 | 75.7 | 86.7 | 140.0 | Inf | 0.0 |
gpt-4o-2024-05-13 | JuliaExpertCoTTask | 4.3 | 80.6 | 90.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-05-13 | JuliaRecapCoTTask | 5.5 | 67.9 | 64.6 | 140.0 | Inf | 0.0 |
gpt-4o-2024-05-13 | JuliaRecapTask | 5.8 | 69.2 | 73.8 | 140.0 | Inf | 0.0 |
gpt-4o-2024-08-06 | InJulia | 4.9 | 74.8 | 86.7 | 140.0 | Inf | 0.0 |
gpt-4o-2024-08-06 | JuliaExpertAsk | 2.1 | 79.6 | 90.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-08-06 | JuliaExpertCoTTask | 4.2 | 82.3 | 100.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-08-06 | JuliaRecapCoTTask | 5.4 | 74.0 | 75.0 | 140.0 | Inf | 0.0 |
gpt-4o-2024-08-06 | JuliaRecapTask | 6.3 | 72.6 | 75.0 | 139.0 | Inf | 0.0 |
mistral-large-2402 | InJulia | 7.5 | 67.9 | 62.5 | 140.0 | Inf | 0.0 |
mistral-large-2402 | JuliaExpertAsk | 5.3 | 71.1 | 80.0 | 140.0 | Inf | 0.0 |
mistral-large-2402 | JuliaExpertCoTTask | 8.6 | 71.0 | 80.0 | 140.0 | Inf | 0.0 |
mistral-large-2402 | JuliaRecapCoTTask | 10.8 | 74.2 | 83.3 | 140.0 | Inf | 0.0 |
mistral-large-2402 | JuliaRecapTask | 10.5 | 73.6 | 90.0 | 140.0 | Inf | 0.0 |
mistral-small-2402 | InJulia | 4.4 | 61.7 | 50.0 | 140.0 | Inf | 0.0 |
mistral-small-2402 | JuliaExpertAsk | 3.6 | 63.0 | 61.2 | 140.0 | Inf | 0.0 |
mistral-small-2402 | JuliaExpertCoTTask | 4.6 | 62.1 | 61.9 | 140.0 | Inf | 0.0 |
mistral-small-2402 | JuliaRecapCoTTask | 8.2 | 56.6 | 50.0 | 140.0 | Inf | 0.0 |
mistral-small-2402 | JuliaRecapTask | 5.8 | 55.9 | 50.0 | 140.0 | Inf | 0.0 |
deepseek-chat | InJulia | 18.3 | 76.4 | 87.1 | 70.0 | 8337.9 | 0.01 |
deepseek-coder | InJulia | 14.5 | 81.1 | 86.7 | 70.0 | 8030.0 | 0.01 |
deepseek-coder | JuliaExpertAsk | 13.1 | 69.9 | 83.3 | 70.0 | 7075.8 | 0.01 |
deepseek-chat | JuliaExpertCoTTask | 18.3 | 75.3 | 90.0 | 75.0 | 6902.3 | 0.01 |
deepseek-chat | JuliaExpertAsk | 17.1 | 56.4 | 75.0 | 70.0 | 6179.2 | 0.01 |
gpt-4o-mini-2024-07-18 | JuliaExpertAsk | 2.6 | 76.6 | 85.8 | 140.0 | 6150.9 | 0.01 |
deepseek-chat | JuliaRecapTask | 16.9 | 75.5 | 88.8 | 70.0 | 5938.2 | 0.01 |
deepseek-coder | JuliaRecapTask | 12.6 | 78.1 | 83.3 | 70.0 | 5708.3 | 0.01 |
deepseek-coder | JuliaRecapCoTTask | 12.7 | 71.9 | 90.0 | 70.0 | 5282.5 | 0.01 |
deepseek-chat | JuliaRecapCoTTask | 18.9 | 72.5 | 75.0 | 70.0 | 5271.4 | 0.01 |
deepseek-coder | JuliaExpertCoTTask | 12.0 | 56.8 | 67.5 | 70.0 | 5182.2 | 0.01 |
mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 4333.3 | 0.01 |
gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 74.7 | 80.0 | 140.0 | 4119.0 | 0.02 |
mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 2869.4 | 0.02 |
gpt-4o-mini-2024-07-18 | InJulia | 5.3 | 73.0 | 86.7 | 140.0 | 2804.8 | 0.03 |
gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 2747.9 | 0.03 |
gpt-4o-mini-2024-07-18 | JuliaExpertCoTTask | 5.1 | 73.9 | 90.0 | 140.0 | 2732.4 | 0.03 |
gpt-3.5-turbo-0125 | InJulia | 1.6 | 73.0 | 80.0 | 140.0 | 2276.8 | 0.03 |
gpt-4o-mini-2024-07-18 | JuliaRecapCoTTask | 6.0 | 74.1 | 80.0 | 140.0 | 2214.9 | 0.03 |
gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 2177.5 | 0.03 |
gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 64.7 | 82.3 | 140.0 | 2168.8 | 0.03 |
gpt-4o-mini-2024-07-18 | JuliaRecapTask | 6.3 | 72.5 | 85.0 | 140.0 | 2168.2 | 0.03 |
claude-3-haiku-20240307 | JuliaExpertAsk | 2.8 | 75.0 | 80.0 | 140.0 | 2084.8 | 0.04 |
mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 2040.5 | 0.02 |
mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 1957.1 | 0.03 |
gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 66.9 | 75.0 | 140.0 | 1916.1 | 0.03 |
gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 1873.1 | 0.04 |
mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 1783.8 | 0.03 |
gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 1672.1 | 0.04 |
gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 1633.3 | 0.04 |
claude-3-haiku-20240307 | JuliaRecapCoTTask | 4.2 | 79.1 | 90.0 | 140.0 | 1349.7 | 0.06 |
claude-3-haiku-20240307 | InJulia | 4.2 | 75.1 | 85.8 | 140.0 | 1338.2 | 0.06 |
claude-3-haiku-20240307 | JuliaRecapTask | 4.4 | 81.3 | 95.0 | 140.0 | 1296.3 | 0.06 |
claude-3-haiku-20240307 | JuliaExpertCoTTask | 4.2 | 64.1 | 62.5 | 140.0 | 1110.5 | 0.06 |
mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 1078.7 | 0.06 |
gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 1028.1 | 0.05 |
gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 1010.4 | 0.03 |
mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 890.8 | 0.08 |
gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.2 | 29.3 | 0.0 | 140.0 | 850.8 | 0.03 |
mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 706.0 | 0.08 |
gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 585.4 | 0.04 |
mistral-large-2407 | JuliaExpertAsk | 3.7 | 78.7 | 81.2 | 70.0 | 469.3 | 0.17 |
mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 460.0 | 0.12 |
mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 436.2 | 0.13 |
gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 423.5 | 0.04 |
gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 274.4 | 0.06 |
claude-3-5-sonnet-20240620 | JuliaExpertAsk | 3.1 | 84.2 | 100.0 | 140.0 | 262.2 | 0.32 |
mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 230.3 | 0.26 |
mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 187.6 | 0.34 |
mistral-large-2407 | JuliaExpertCoTTask | 9.2 | 77.3 | 86.7 | 70.0 | 183.0 | 0.42 |
mistral-large-2407 | InJulia | 10.8 | 69.9 | 78.1 | 70.0 | 165.3 | 0.42 |
gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 77.5 | 86.7 | 140.0 | 157.7 | 0.49 |
claude-3-sonnet-20240229 | JuliaExpertAsk | 6.3 | 79.0 | 90.0 | 140.0 | 149.7 | 0.53 |
mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 146.8 | 0.43 |
claude-3-sonnet-20240229 | JuliaExpertCoTTask | 7.2 | 80.3 | 95.0 | 140.0 | 129.2 | 0.62 |
claude-3-5-sonnet-20240620 | JuliaExpertCoTTask | 5.8 | 86.1 | 100.0 | 139.0 | 128.4 | 0.67 |
gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 125.2 | 0.63 |
mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 116.0 | 0.53 |
claude-3-5-sonnet-20240620 | InJulia | 6.9 | 87.7 | 100.0 | 140.0 | 111.0 | 0.79 |
mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 110.9 | 0.5 |
claude-2.1 | InJulia | 9.3 | 64.3 | 60.0 | 140.0 | 98.6 | 0.65 |
claude-3-sonnet-20240229 | JuliaRecapCoTTask | 9.4 | 75.6 | 87.5 | 140.0 | 98.5 | 0.77 |
mistral-large-2407 | JuliaRecapCoTTask | 16.5 | 72.1 | 90.0 | 70.0 | 97.9 | 0.74 |
mistral-large-2407 | JuliaRecapTask | 16.2 | 70.0 | 87.5 | 70.0 | 97.7 | 0.72 |
claude-3-sonnet-20240229 | InJulia | 10.0 | 80.9 | 95.0 | 140.0 | 95.8 | 0.84 |
claude-3-5-sonnet-20240620 | JuliaRecapTask | 7.9 | 86.5 | 100.0 | 140.0 | 94.0 | 0.92 |
claude-2.1 | JuliaExpertAsk | 9.6 | 65.4 | 71.2 | 140.0 | 93.8 | 0.7 |
gpt-4-turbo-2024-04-09 | JuliaExpertAsk | 7.0 | 78.5 | 86.7 | 140.0 | 93.5 | 0.84 |
claude-3-5-sonnet-20240620 | JuliaRecapCoTTask | 8.0 | 84.5 | 92.5 | 140.0 | 91.1 | 0.93 |
claude-3-sonnet-20240229 | JuliaRecapTask | 10.6 | 78.2 | 90.0 | 140.0 | 86.1 | 0.91 |
claude-2.1 | JuliaExpertCoTTask | 10.6 | 72.2 | 75.0 | 140.0 | 82.8 | 0.87 |
claude-2.1 | JuliaRecapCoTTask | 10.6 | 69.2 | 75.0 | 140.0 | 78.3 | 0.88 |
claude-2.1 | JuliaRecapTask | 10.6 | 68.4 | 75.0 | 140.0 | 76.3 | 0.9 |
gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 63.9 | 1.12 |
gpt-4-0125-preview | JuliaExpertCoTTask | 28.5 | 72.4 | 95.0 | 140.0 | 60.2 | 1.2 |
gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 57.9 | 1.29 |
gpt-4-turbo-2024-04-09 | JuliaExpertCoTTask | 10.5 | 75.6 | 95.0 | 140.0 | 56.5 | 1.34 |
gpt-4-0125-preview | InJulia | 34.4 | 72.7 | 86.7 | 140.0 | 52.2 | 1.39 |
gpt-4-turbo-2024-04-09 | InJulia | 13.0 | 76.5 | 86.7 | 140.0 | 51.4 | 1.49 |
gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 48.9 | 1.48 |
gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 47.9 | 1.54 |
gpt-4-turbo-2024-04-09 | JuliaRecapCoTTask | 11.8 | 73.4 | 88.8 | 140.0 | 45.6 | 1.61 |
gpt-4-0125-preview | JuliaRecapCoTTask | 37.2 | 75.0 | 90.0 | 140.0 | 44.6 | 1.68 |
gpt-4-turbo-2024-04-09 | JuliaRecapTask | 11.5 | 72.3 | 90.0 | 140.0 | 44.5 | 1.63 |
gpt-4-0125-preview | JuliaRecapTask | 40.8 | 74.6 | 90.0 | 140.0 | 43.7 | 1.71 |
claude-3-opus-20240229 | JuliaExpertAsk | 17.4 | 84.0 | 90.0 | 140.0 | 24.6 | 3.41 |
claude-3-opus-20240229 | JuliaExpertCoTTask | 17.6 | 85.1 | 100.0 | 140.0 | 24.3 | 3.5 |
claude-3-opus-20240229 | JuliaRecapCoTTask | 21.7 | 81.6 | 88.8 | 140.0 | 20.9 | 3.9 |
claude-3-opus-20240229 | JuliaRecapTask | 22.8 | 81.2 | 90.0 | 140.0 | 19.3 | 4.21 |
claude-3-opus-20240229 | InJulia | 22.1 | 84.1 | 100.0 | 140.0 | 18.9 | 4.46 |
Comparison of Time-to-generate vs Average Score
fig = @chain df begin
@aside local xlims = quantile(df.elapsed_seconds, [0.01, 0.99])
@by [:model, :prompt_label] begin
:elapsed = mean(:elapsed_seconds)
:elapsed_median = median(:elapsed_seconds)
:score = mean(:score)
:score_median = median(:score)
:cnt = $nrow
end
data(_) * mapping(:elapsed => "Avg. Elapsed Time (s)",
:score => "Avg. Score (Max 100 pts)",
color = :model => "Model")
draw(; figure = (size = (600, 600),),
axis = (xticklabelrotation = 45,
title = "Elapsed Time vs Score for Paid APIs",
limits = (xlims..., nothing, nothing)),
palettes = (; color = Makie.ColorSchemes.tab20.colors))
end
SAVE_PLOTS && save("assets/elapsed-vs-score-scatter-paid.png", fig)
fig
Table:
- Point per second is the average score divided by the average elapsed time in seconds (e.g., 75 points in 5 seconds is 15 points per second).
output = @chain df begin
@by [:model, :prompt_label] begin
:cost = mean(:cost)
:elapsed = mean(:elapsed_seconds)
:score_avg = mean(:score)
:score_median = median(:score)
:cnt = $nrow
end
@rtransform :point_per_second = :score_avg / :elapsed
@orderby -:point_per_second
#
transform(_,
names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
renamecols = false)
@rtransform :cost_cents = round(:cost * 100; digits = 2)
select(Not(:cost))
rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Second | Cost Cents |
---|---|---|---|---|---|---|---|
gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 74.7 | 80.0 | 140.0 | 80.0 | 0.02 |
gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 66.9 | 75.0 | 140.0 | 57.1 | 0.03 |
gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 64.7 | 82.3 | 140.0 | 52.3 | 0.03 |
codestral-2405 | JuliaExpertAsk | 1.6 | 78.0 | 95.0 | 140.0 | 49.8 | 0.0 |
gpt-4o-2024-05-13 | JuliaExpertAsk | 1.6 | 75.7 | 86.7 | 140.0 | 47.8 | 0.0 |
gpt-3.5-turbo-0125 | InJulia | 1.6 | 73.0 | 80.0 | 140.0 | 46.9 | 0.03 |
gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 45.5 | 0.03 |
codestral-2405 | JuliaExpertCoTTask | 1.8 | 74.2 | 95.0 | 140.0 | 41.8 | 0.0 |
codestral-2405 | InJulia | 2.0 | 78.5 | 92.5 | 140.0 | 39.2 | 0.0 |
gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 38.9 | 0.04 |
gpt-4o-2024-08-06 | JuliaExpertAsk | 2.1 | 79.6 | 90.0 | 140.0 | 37.7 | 0.0 |
codestral-2405 | JuliaRecapCoTTask | 2.2 | 77.8 | 90.8 | 140.0 | 35.8 | 0.0 |
codestral-2405 | JuliaRecapTask | 2.2 | 72.7 | 90.0 | 139.0 | 33.3 | 0.0 |
gpt-4o-mini-2024-07-18 | JuliaExpertAsk | 2.6 | 76.6 | 85.8 | 140.0 | 29.5 | 0.01 |
gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 29.2 | 0.05 |
claude-3-5-sonnet-20240620 | JuliaExpertAsk | 3.1 | 84.2 | 100.0 | 140.0 | 27.2 | 0.32 |
claude-3-haiku-20240307 | JuliaExpertAsk | 2.8 | 75.0 | 80.0 | 140.0 | 26.4 | 0.04 |
gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 25.8 | 0.04 |
gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.2 | 29.3 | 0.0 | 140.0 | 25.4 | 0.03 |
mistral-large-2407 | JuliaExpertAsk | 3.7 | 78.7 | 81.2 | 70.0 | 21.0 | 0.17 |
gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 19.6 | 0.03 |
gpt-4o-2024-08-06 | JuliaExpertCoTTask | 4.2 | 82.3 | 100.0 | 140.0 | 19.5 | 0.0 |
claude-3-haiku-20240307 | JuliaRecapCoTTask | 4.2 | 79.1 | 90.0 | 140.0 | 19.0 | 0.06 |
gpt-4o-2024-05-13 | JuliaExpertCoTTask | 4.3 | 80.6 | 90.0 | 140.0 | 18.7 | 0.0 |
mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 18.7 | 0.01 |
claude-3-haiku-20240307 | JuliaRecapTask | 4.4 | 81.3 | 95.0 | 140.0 | 18.4 | 0.06 |
claude-3-haiku-20240307 | InJulia | 4.2 | 75.1 | 85.8 | 140.0 | 17.8 | 0.06 |
mistral-small-2402 | JuliaExpertAsk | 3.6 | 63.0 | 61.2 | 140.0 | 17.6 | 0.0 |
gpt-4o-2024-05-13 | InJulia | 4.2 | 71.2 | 85.0 | 140.0 | 16.9 | 0.0 |
mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 16.5 | 0.06 |
gpt-4o-2024-08-06 | InJulia | 4.9 | 74.8 | 86.7 | 140.0 | 15.4 | 0.0 |
claude-3-haiku-20240307 | JuliaExpertCoTTask | 4.2 | 64.1 | 62.5 | 140.0 | 15.4 | 0.06 |
claude-3-5-sonnet-20240620 | JuliaExpertCoTTask | 5.8 | 86.1 | 100.0 | 139.0 | 14.7 | 0.67 |
gpt-4o-mini-2024-07-18 | JuliaExpertCoTTask | 5.1 | 73.9 | 90.0 | 140.0 | 14.5 | 0.03 |
gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 14.5 | 0.04 |
mistral-small-2402 | InJulia | 4.4 | 61.7 | 50.0 | 140.0 | 14.0 | 0.0 |
gpt-4o-mini-2024-07-18 | InJulia | 5.3 | 73.0 | 86.7 | 140.0 | 13.9 | 0.03 |
gpt-4o-2024-08-06 | JuliaRecapCoTTask | 5.4 | 74.0 | 75.0 | 140.0 | 13.6 | 0.0 |
mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 13.6 | 0.02 |
mistral-large-2402 | JuliaExpertAsk | 5.3 | 71.1 | 80.0 | 140.0 | 13.5 | 0.0 |
mistral-small-2402 | JuliaExpertCoTTask | 4.6 | 62.1 | 61.9 | 140.0 | 13.4 | 0.0 |
mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 12.7 | 0.08 |
claude-3-5-sonnet-20240620 | InJulia | 6.9 | 87.7 | 100.0 | 140.0 | 12.7 | 0.79 |
claude-3-sonnet-20240229 | JuliaExpertAsk | 6.3 | 79.0 | 90.0 | 140.0 | 12.5 | 0.53 |
gpt-4o-2024-05-13 | JuliaRecapCoTTask | 5.5 | 67.9 | 64.6 | 140.0 | 12.4 | 0.0 |
gpt-4o-mini-2024-07-18 | JuliaRecapCoTTask | 6.0 | 74.1 | 80.0 | 140.0 | 12.3 | 0.03 |
gpt-4o-2024-05-13 | JuliaRecapTask | 5.8 | 69.2 | 73.8 | 140.0 | 11.9 | 0.0 |
gpt-4o-2024-08-06 | JuliaRecapTask | 6.3 | 72.6 | 75.0 | 139.0 | 11.6 | 0.0 |
gpt-4o-mini-2024-07-18 | JuliaRecapTask | 6.3 | 72.5 | 85.0 | 140.0 | 11.5 | 0.03 |
mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 11.4 | 0.08 |
gpt-4-turbo-2024-04-09 | JuliaExpertAsk | 7.0 | 78.5 | 86.7 | 140.0 | 11.2 | 0.84 |
claude-3-sonnet-20240229 | JuliaExpertCoTTask | 7.2 | 80.3 | 95.0 | 140.0 | 11.2 | 0.62 |
claude-3-5-sonnet-20240620 | JuliaRecapTask | 7.9 | 86.5 | 100.0 | 140.0 | 11.0 | 0.92 |
claude-3-5-sonnet-20240620 | JuliaRecapCoTTask | 8.0 | 84.5 | 92.5 | 140.0 | 10.5 | 0.93 |
gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 10.5 | 0.03 |
mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 10.3 | 0.03 |
gemini-1.0-pro-latest | JuliaExpertAsk | 3.9 | 38.6 | 50.0 | 140.0 | 10.0 | 0.0 |
mistral-small-2402 | JuliaRecapTask | 5.8 | 55.9 | 50.0 | 140.0 | 9.6 | 0.0 |
mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 9.3 | 0.03 |
mistral-large-2402 | InJulia | 7.5 | 67.9 | 62.5 | 140.0 | 9.1 | 0.0 |
gemini-1.0-pro-latest | JuliaExpertCoTTask | 4.0 | 35.2 | 25.0 | 140.0 | 8.8 | 0.0 |
gemini-1.0-pro-latest | InJulia | 4.1 | 36.0 | 25.0 | 140.0 | 8.7 | 0.0 |
mistral-large-2407 | JuliaExpertCoTTask | 9.2 | 77.3 | 86.7 | 70.0 | 8.4 | 0.42 |
mistral-large-2402 | JuliaExpertCoTTask | 8.6 | 71.0 | 80.0 | 140.0 | 8.2 | 0.0 |
claude-3-sonnet-20240229 | InJulia | 10.0 | 80.9 | 95.0 | 140.0 | 8.1 | 0.84 |
claude-3-sonnet-20240229 | JuliaRecapCoTTask | 9.4 | 75.6 | 87.5 | 140.0 | 8.0 | 0.77 |
gemini-1.0-pro-latest | JuliaRecapTask | 4.3 | 33.3 | 25.0 | 140.0 | 7.6 | 0.0 |
gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 7.6 | 0.06 |
claude-3-sonnet-20240229 | JuliaRecapTask | 10.6 | 78.2 | 90.0 | 140.0 | 7.4 | 0.91 |
mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 7.4 | 0.12 |
gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 7.4 | 0.04 |
gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 7.2 | 0.63 |
mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 7.2 | 0.13 |
gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 77.5 | 86.7 | 140.0 | 7.2 | 0.49 |
gpt-4-turbo-2024-04-09 | JuliaExpertCoTTask | 10.5 | 75.6 | 95.0 | 140.0 | 7.2 | 1.34 |
mistral-large-2402 | JuliaRecapTask | 10.5 | 73.6 | 90.0 | 140.0 | 7.0 | 0.0 |
claude-2.1 | InJulia | 9.3 | 64.3 | 60.0 | 140.0 | 6.9 | 0.65 |
mistral-small-2402 | JuliaRecapCoTTask | 8.2 | 56.6 | 50.0 | 140.0 | 6.9 | 0.0 |
mistral-large-2402 | JuliaRecapCoTTask | 10.8 | 74.2 | 83.3 | 140.0 | 6.9 | 0.0 |
claude-2.1 | JuliaExpertAsk | 9.6 | 65.4 | 71.2 | 140.0 | 6.8 | 0.7 |
claude-2.1 | JuliaExpertCoTTask | 10.6 | 72.2 | 75.0 | 140.0 | 6.8 | 0.87 |
claude-2.1 | JuliaRecapCoTTask | 10.6 | 69.2 | 75.0 | 140.0 | 6.6 | 0.88 |
mistral-large-2407 | InJulia | 10.8 | 69.9 | 78.1 | 70.0 | 6.5 | 0.42 |
claude-2.1 | JuliaRecapTask | 10.6 | 68.4 | 75.0 | 140.0 | 6.4 | 0.9 |
gemini-1.0-pro-latest | JuliaRecapCoTTask | 4.8 | 30.8 | 25.0 | 140.0 | 6.4 | 0.0 |
gpt-4-turbo-2024-04-09 | JuliaRecapTask | 11.5 | 72.3 | 90.0 | 140.0 | 6.3 | 1.63 |
gpt-4-turbo-2024-04-09 | JuliaRecapCoTTask | 11.8 | 73.4 | 88.8 | 140.0 | 6.2 | 1.61 |
mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 6.2 | 0.02 |
deepseek-coder | JuliaRecapTask | 12.6 | 78.1 | 83.3 | 70.0 | 6.2 | 0.01 |
gpt-4-turbo-2024-04-09 | InJulia | 13.0 | 76.5 | 86.7 | 140.0 | 5.9 | 1.49 |
deepseek-coder | JuliaRecapCoTTask | 12.7 | 71.9 | 90.0 | 70.0 | 5.6 | 0.01 |
deepseek-coder | InJulia | 14.5 | 81.1 | 86.7 | 70.0 | 5.6 | 0.01 |
gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 5.4 | 0.04 |
deepseek-coder | JuliaExpertAsk | 13.1 | 69.9 | 83.3 | 70.0 | 5.3 | 0.01 |
mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 4.9 | 0.26 |
claude-3-opus-20240229 | JuliaExpertCoTTask | 17.6 | 85.1 | 100.0 | 140.0 | 4.8 | 3.5 |
claude-3-opus-20240229 | JuliaExpertAsk | 17.4 | 84.0 | 90.0 | 140.0 | 4.8 | 3.41 |
deepseek-coder | JuliaExpertCoTTask | 12.0 | 56.8 | 67.5 | 70.0 | 4.7 | 0.01 |
deepseek-chat | JuliaRecapTask | 16.9 | 75.5 | 88.8 | 70.0 | 4.5 | 0.01 |
mistral-large-2407 | JuliaRecapCoTTask | 16.5 | 72.1 | 90.0 | 70.0 | 4.4 | 0.74 |
mistral-large-2407 | JuliaRecapTask | 16.2 | 70.0 | 87.5 | 70.0 | 4.3 | 0.72 |
mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 4.3 | 0.34 |
deepseek-chat | InJulia | 18.3 | 76.4 | 87.1 | 70.0 | 4.2 | 0.01 |
deepseek-chat | JuliaExpertCoTTask | 18.3 | 75.3 | 90.0 | 75.0 | 4.1 | 0.01 |
deepseek-chat | JuliaRecapCoTTask | 18.9 | 72.5 | 75.0 | 70.0 | 3.8 | 0.01 |
claude-3-opus-20240229 | InJulia | 22.1 | 84.1 | 100.0 | 140.0 | 3.8 | 4.46 |
claude-3-opus-20240229 | JuliaRecapCoTTask | 21.7 | 81.6 | 88.8 | 140.0 | 3.8 | 3.9 |
claude-3-opus-20240229 | JuliaRecapTask | 22.8 | 81.2 | 90.0 | 140.0 | 3.6 | 4.21 |
gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 3.3 | 1.12 |
deepseek-chat | JuliaExpertAsk | 17.1 | 56.4 | 75.0 | 70.0 | 3.3 | 0.01 |
mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 3.2 | 0.43 |
mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 3.0 | 0.53 |
gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 2.9 | 1.48 |
gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 2.7 | 1.54 |
gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 2.7 | 1.29 |
gpt-4-0125-preview | JuliaExpertCoTTask | 28.5 | 72.4 | 95.0 | 140.0 | 2.5 | 1.2 |
mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 2.4 | 0.5 |
gpt-4-0125-preview | InJulia | 34.4 | 72.7 | 86.7 | 140.0 | 2.1 | 1.39 |
gpt-4-0125-preview | JuliaRecapCoTTask | 37.2 | 75.0 | 90.0 | 140.0 | 2.0 | 1.68 |
gpt-4-0125-preview | JuliaRecapTask | 40.8 | 74.6 | 90.0 | 140.0 | 1.8 | 1.71 |
Test Case Performance
Performance of different models across each test case
output = @chain df begin
@by [:model, :name] begin
:score = mean(:score)
end
#
@aside average_ = @by _ :name :AverageScore=mean(:score) |> x -> round(x, digits = 1)
unstack(:name, :model, :score; fill = 0.0)
transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
leftjoin(average_, on = :name)
@orderby -:AverageScore
end
markdown_table(output)
name | claude-2.1 | claude-3-5-sonnet-20240620 | claude-3-haiku-20240307 | claude-3-opus-20240229 | claude-3-sonnet-20240229 | codestral-2405 | deepseek-chat | deepseek-coder | gemini-1.0-pro-latest | gpt-3.5-turbo | gpt-3.5-turbo-0125 | gpt-3.5-turbo-1106 | gpt-4-0125-preview | gpt-4-1106-preview | gpt-4-turbo-2024-04-09 | gpt-4o-2024-05-13 | gpt-4o-2024-08-06 | gpt-4o-mini-2024-07-18 | mistral-large-2402 | mistral-large-2407 | mistral-medium | mistral-small | mistral-small-2402 | mistral-tiny | AverageScore |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FloatWithUnits | 62.0 | 97.5 | 98.0 | 100.0 | 100.0 | 98.0 | 100.0 | 100.0 | 57.0 | 76.0 | 91.5 | 80.0 | 60.5 | 72.0 | 78.5 | 93.5 | 99.5 | 96.5 | 99.5 | 100.0 | 98.0 | 70.0 | 100.0 | 80.2 | 87.8 |
timezone_bumper | 82.1 | 100.0 | 98.1 | 99.7 | 95.5 | 89.5 | 100.0 | 100.0 | 39.9 | 48.0 | 77.4 | 79.2 | 90.0 | 90.0 | 94.8 | 95.0 | 98.5 | 99.1 | 96.4 | 100.0 | 97.0 | 76.6 | 78.1 | 62.0 | 87.0 |
clean_column | 100.0 | 97.3 | 89.8 | 100.0 | 96.4 | 92.3 | 78.4 | 71.2 | 41.5 | 35.5 | 66.7 | 69.8 | 88.8 | 90.5 | 90.0 | 89.3 | 87.4 | 88.0 | 91.6 | 92.0 | 81.0 | 84.6 | 99.7 | 80.8 | 83.4 |
keeponlynames | 90.1 | 91.6 | 65.0 | 85.3 | 94.9 | 95.4 | 88.4 | 74.4 | 54.0 | 50.8 | 80.6 | 74.2 | 90.9 | 91.0 | 86.2 | 77.5 | 78.7 | 80.9 | 98.7 | 89.4 | 66.2 | 76.6 | 67.9 | 51.0 | 79.2 |
wrap_string | 93.8 | 94.8 | 77.2 | 64.5 | 70.2 | 88.0 | 81.7 | 82.5 | 32.6 | 64.0 | 50.1 | 55.3 | 94.9 | 97.8 | 94.6 | 97.0 | 94.6 | 94.3 | 71.9 | 94.5 | 84.7 | 68.0 | 68.6 | 48.3 | 77.7 |
countmodelrows | 58.0 | 100.0 | 82.6 | 98.8 | 94.8 | 84.4 | 67.2 | 60.7 | 36.6 | 52.8 | 75.7 | 56.2 | 97.4 | 98.4 | 89.3 | 89.0 | 95.4 | 75.5 | 78.6 | 90.2 | 79.0 | 67.2 | 61.7 | 53.2 | 76.8 |
weatherdataanalyzer | 74.1 | 85.0 | 93.3 | 86.8 | 86.8 | 89.3 | 93.0 | 83.8 | 26.5 | 35.2 | 64.2 | 59.0 | 85.4 | 85.0 | 81.0 | 67.4 | 73.5 | 76.5 | 86.0 | 54.6 | 85.4 | 55.4 | 52.6 | 56.8 | 72.4 |
add_yearmonth | 53.8 | 88.5 | 86.2 | 92.0 | 81.0 | 62.5 | 71.2 | 62.5 | 35.8 | 33.0 | 67.6 | 65.2 | 78.6 | 72.8 | 75.9 | 68.0 | 74.9 | 67.2 | 72.2 | 71.2 | 48.0 | 62.2 | 40.2 | 33.2 | 65.2 |
event_scheduler | 86.5 | 84.4 | 76.6 | 90.2 | 77.2 | 56.8 | 76.0 | 82.4 | 37.8 | 29.0 | 44.4 | 42.8 | 87.9 | 66.6 | 82.5 | 73.8 | 67.7 | 37.5 | 57.3 | 32.8 | 36.0 | 59.0 | 38.7 | 37.2 | 60.9 |
ispersonal | 52.0 | 62.0 | 69.0 | 54.0 | 72.0 | 90.0 | 61.0 | 84.0 | 16.0 | 43.0 | 72.0 | 68.6 | 54.3 | 56.0 | 66.5 | 62.0 | 66.3 | 94.0 | 67.2 | 57.0 | 35.0 | 48.0 | 48.0 | 29.5 | 59.5 |
audi_filter | 38.0 | 93.0 | 56.0 | 93.0 | 63.8 | 59.5 | 47.0 | 57.8 | 28.1 | 27.0 | 55.0 | 58.0 | 47.5 | 58.0 | 49.0 | 56.2 | 81.0 | 78.8 | 58.0 | 92.0 | 43.0 | 48.5 | 44.8 | 27.0 | 56.7 |
extractjuliacode | 56.4 | 63.3 | 60.4 | 65.4 | 48.2 | 47.9 | 41.3 | 48.6 | 36.4 | 41.0 | 43.6 | 48.4 | 54.5 | 48.7 | 56.1 | 52.5 | 50.4 | 45.3 | 44.1 | 63.8 | 31.8 | 52.2 | 50.4 | 30.1 | 49.2 |
qanda_extractor | 73.5 | 63.7 | 62.3 | 68.0 | 65.5 | 57.0 | 43.3 | 26.7 | 26.2 | 31.7 | 35.5 | 36.7 | 56.7 | 53.3 | 49.3 | 45.3 | 50.2 | 54.7 | 46.8 | 31.0 | 38.7 | 44.7 | 55.8 | 36.0 | 48.0 |
pig_latinify | 30.6 | 79.8 | 34.6 | 67.1 | 57.0 | 56.5 | 49.0 | 67.1 | 18.7 | 24.7 | 39.8 | 23.1 | 54.7 | 61.4 | 60.1 | 54.2 | 54.8 | 48.0 | 33.6 | 61.7 | 27.8 | 28.8 | 31.6 | 33.1 | 45.7 |
This page was generated using Literate.jl.