Results for Paid LLM APIs
The results below capture the performance of eight models from two commercial LLM APIs: OpenAI (GPT-3.5 Turbo, GPT-4, ...) and MistralAI (tiny, small, medium).
There are many other providers, but OpenAI is the most commonly used. MistralAI's commercial API launched recently and has a very good relationship with the open-source community, so we've added it as a challenger to compare against OpenAI's cost effectiveness ("cost per point", i.e., how many cents you would pay for 1 point in this benchmark).
Reminder: The scores below are on a scale of 0-100, where 100 is the best possible score and 0 means the generated code was not even parseable.
# Imports
using JuliaLLMLeaderboard
using CairoMakie, AlgebraOfGraphics
using MarkdownTables, DataFramesMeta
using Statistics: mean, median, quantile, std;
# ! Configuration
SAVE_PLOTS = false
DIR_RESULTS = joinpath(pkgdir(JuliaLLMLeaderboard), "code_generation")
PAID_MODELS_DEFAULT = [
    "gpt-3.5-turbo",
    "gpt-3.5-turbo-1106",
    "gpt-3.5-turbo-0125",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
    "mistral-tiny",
    "mistral-small",
    "mistral-medium",
];
PROMPTS = [
    "JuliaExpertCoTTask",
    "JuliaExpertAsk",
    "InJulia",
    "JuliaRecapTask",
    "JuliaRecapCoTTask",
];
Load Latest Results
Use only the 5 most recent evaluations available for each definition/model/prompt combination.
df = @chain begin
    load_evals(DIR_RESULTS; max_history = 5)
    @rsubset :model in PAID_MODELS_DEFAULT && :prompt_label in PROMPTS
end;
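How `load_evals` applies `max_history` internally is not shown on this page. The following is only a minimal sketch of the idea of keeping the N most recent evaluations per definition/model/prompt group, using plain Julia. The function `keep_latest`, the record fields, and the sample data are all hypothetical and not part of the package.

```julia
# Hypothetical evaluation records; the field names are illustrative only
# and are not the package's actual schema.
group_a = [(name = "wrap_string", model = "model-a", prompt = "InJulia",
            timestamp = t, score = 10.0 * t) for t in 1:8]
group_b = [(name = "wrap_string", model = "model-b", prompt = "InJulia",
            timestamp = t, score = 20.0 * t) for t in 1:3]
records = vcat(group_a, group_b)

# Keep at most `max_history` of the newest records in each
# (definition, model, prompt) group.
function keep_latest(records; max_history = 5)
    T = eltype(records)
    groups = Dict{Tuple{String, String, String}, Vector{T}}()
    for r in records
        push!(get!(groups, (r.name, r.model, r.prompt), T[]), r)
    end
    kept = T[]
    for rs in values(groups)
        newest_first = sort(rs; by = r -> r.timestamp, rev = true)
        append!(kept, newest_first[1:min(length(newest_first), max_history)])
    end
    return kept
end

kept = keep_latest(records; max_history = 5)
println("Kept ", length(kept), " of ", length(records), " records")
```

In this sketch, groups with fewer than `max_history` records are kept whole, so the per-group counts can differ.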
Model Comparison
Highest average score by model:
fig = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    @orderby -:score
    @aside local order_ = _.model
    data(_) *
    mapping(:model => sorter(order_) => "Model",
        :score => "Avg. Score (Max 100 pts)") * visual(BarPlot; bar_labels = :y,
        label_offset = 0)
    draw(;
        axis = (limits = (nothing, nothing, 0, 100),
            xticklabelrotation = 45,
            title = "Paid APIs Performance"))
end
Table:
output = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_std_deviation = std(:score)
        :count_zero_score = count(iszero, :score)
        :count_full_score = count(==(100), :score)
    end
    transform(_,
        [:elapsed, :score, :score_std_deviation] .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    @orderby -:score
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Elapsed | Score | Score Std Deviation | Count Zero Score | Count Full Score | Cost Cents |
---|---|---|---|---|---|---|
gpt-4-1106-preview | 22.4 | 74.4 | 29.9 | 19 | 142 | 1.21 |
gpt-4-0125-preview | 30.2 | 73.1 | 31.7 | 26 | 140 | 1.3 |
gpt-3.5-turbo-0125 | 1.2 | 62.1 | 36.5 | 62 | 95 | 0.03 |
mistral-medium | 18.1 | 60.8 | 33.2 | 22 | 90 | 0.41 |
mistral-small | 5.9 | 60.1 | 30.2 | 27 | 76 | 0.09 |
gpt-3.5-turbo-1106 | 2.1 | 58.4 | 39.2 | 82 | 97 | 0.04 |
mistral-tiny | 4.6 | 46.9 | 32.0 | 75 | 42 | 0.02 |
gpt-3.5-turbo | 3.6 | 42.3 | 38.2 | 132 | 54 | 0.04 |
While the victory of GPT-4 is not surprising, note that our sample size is small and the standard deviation is quite high.
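To get a rough sense of how much the high standard deviation matters, the sketch below estimates the standard error of the mean for the two GPT-4 rows of the table. It rests on two assumptions that are not stated in the source: about 350 scored samples per model (5 prompts times roughly 70 samples each, read off the Cnt column in the later tables), and independent samples, which is a simplification because all models are run on the same test cases.

```julia
# Standard error of the mean for `n` observations with sample std `s`.
standard_error(s, n) = s / sqrt(n)

n = 5 * 70  # ASSUMPTION: ~70 samples for each of the 5 prompts

se_1106 = standard_error(29.9, n)  # gpt-4-1106-preview, std from the table
se_0125 = standard_error(31.7, n)  # gpt-4-0125-preview, std from the table

# Gap between the two average scores and the standard error of that gap
# (valid only under the independence assumption above).
gap = 74.4 - 73.1
se_gap = sqrt(se_1106^2 + se_0125^2)

println("gap = ", round(gap, digits = 2), ", standard error of gap = ",
    round(se_gap, digits = 2))
```

Under these assumptions the gap between the two GPT-4 versions is smaller than its own standard error, so their ordering should not be read as meaningful.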
Overview by Prompt Template
Bar chart with all paid models and various prompt templates
fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @aside local average_ = @by(_, :model, :avg=mean(:score)) |>
                            x -> @orderby(x, -:avg).model
    data(_) *
    mapping(:model => sorter(average_) => "Model",
        :score => "Avg. Score (Max 100 pts)",
        color = :prompt_label => "Prompts",
        dodge = :prompt_label) * visual(BarPlot)
    draw(;
        axis = (xticklabelrotation = 45, title = "Comparison for Paid APIs"))
end
SAVE_PLOTS && save("assets/model-prompt-comparison-paid.png", fig)
fig
Table:
- Surprised by the low performance of some models (eg, GPT 3.5 Turbo) on the CoT prompts? It's because the model accidentally sends a "stop" token before it writes the code.
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    @aside average_ = @by _ :model :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:model, :prompt_label, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :model)
    @orderby -:AverageScore
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
model | InJulia | JuliaExpertAsk | JuliaExpertCoTTask | JuliaRecapCoTTask | JuliaRecapTask | AverageScore |
---|---|---|---|---|---|---|
gpt-4-1106-preview | 74.9 | 79.1 | 71.8 | 72.4 | 73.6 | 74.4 |
gpt-4-0125-preview | 70.6 | 79.2 | 70.8 | 72.0 | 73.0 | 73.1 |
gpt-3.5-turbo-0125 | 73.9 | 72.9 | 67.6 | 28.9 | 67.1 | 62.1 |
mistral-medium | 63.1 | 60.5 | 63.4 | 55.9 | 61.2 | 60.8 |
mistral-small | 67.3 | 61.4 | 59.9 | 56.1 | 55.9 | 60.1 |
gpt-3.5-turbo-1106 | 74.6 | 73.6 | 73.4 | 15.4 | 55.0 | 58.4 |
mistral-tiny | 51.7 | 44.3 | 41.1 | 50.5 | 47.2 | 47.0 |
gpt-3.5-turbo | 73.1 | 60.9 | 32.8 | 26.2 | 18.4 | 42.3 |
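The AverageScore column here is computed from the per-prompt averages, so it can differ slightly from the per-model Score in the earlier table, which averages all samples at once (for example, mistral-tiny shows 47.0 here and 46.9 there). A likely explanation is that the prompts do not all have the same number of samples. The sketch below reproduces the effect from the rounded values printed on this page; the counts are taken from the Cnt column of the later tables, and since the inputs are already rounded this is an illustration rather than an exact recomputation.

```julia
using Statistics: mean

# mistral-tiny per-prompt averages, in the column order of the table above.
prompt_means = [51.7, 44.3, 41.1, 50.5, 47.2]
# Number of samples per prompt (Cnt column in the later tables; InJulia has 68).
prompt_counts = [68, 70, 70, 70, 70]

# Every prompt gets equal weight, regardless of its sample count.
mean_of_means = mean(prompt_means)

# Every sample gets equal weight, so larger groups count for more.
pooled_mean = sum(prompt_means .* prompt_counts) / sum(prompt_counts)

println("mean of prompt means: ", round(mean_of_means, digits = 1))
println("pooled mean:          ", round(pooled_mean, digits = 1))
```

The two numbers coincide only when every prompt has the same number of samples.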
Other Considerations
Comparison of Cost vs Average Score
fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:cost => (x -> x * 100) => "Avg. Cost (US Cents/query)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(;
        axis = (xticklabelrotation = 45,
            title = "Cost vs Score for Paid APIs"))
end
SAVE_PLOTS && save("assets/cost-vs-score-scatter-paid.png", fig)
fig
Table:
- Point per cent is the average score divided by the average cost in US cents
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_cent = :score_avg / :cost / 100
    @orderby -:point_per_cent
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Cent | Cost Cents |
---|---|---|---|---|---|---|---|
mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 4333.3 | 0.01 |
gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 72.9 | 80.0 | 70.0 | 4074.3 | 0.02 |
mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 2869.4 | 0.02 |
gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 2747.9 | 0.03 |
gpt-3.5-turbo-0125 | InJulia | 1.5 | 73.9 | 81.7 | 70.0 | 2347.7 | 0.03 |
gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 67.6 | 86.7 | 70.0 | 2292.0 | 0.03 |
gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 2177.5 | 0.03 |
mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 2040.5 | 0.02 |
mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 1957.1 | 0.03 |
gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 67.1 | 67.5 | 70.0 | 1889.6 | 0.04 |
gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 1873.1 | 0.04 |
mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 1783.8 | 0.03 |
gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 1672.1 | 0.04 |
gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 1633.3 | 0.04 |
mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 1078.7 | 0.06 |
gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 1028.1 | 0.05 |
gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 1010.4 | 0.03 |
mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 890.8 | 0.08 |
gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.1 | 28.9 | 0.0 | 70.0 | 875.4 | 0.03 |
mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 706.0 | 0.08 |
gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 585.4 | 0.04 |
mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 460.0 | 0.12 |
mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 436.2 | 0.13 |
gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 423.5 | 0.04 |
gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 274.4 | 0.06 |
mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 230.3 | 0.26 |
mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 187.6 | 0.34 |
gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 79.2 | 90.0 | 70.0 | 158.4 | 0.5 |
mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 146.8 | 0.43 |
gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 125.2 | 0.63 |
mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 116.0 | 0.53 |
mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 110.9 | 0.5 |
gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 63.9 | 1.12 |
gpt-4-0125-preview | JuliaExpertCoTTask | 27.8 | 70.8 | 95.4 | 70.0 | 59.5 | 1.19 |
gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 57.9 | 1.29 |
gpt-4-0125-preview | InJulia | 34.3 | 70.6 | 84.0 | 70.0 | 49.9 | 1.42 |
gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 48.9 | 1.48 |
gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 47.9 | 1.54 |
gpt-4-0125-preview | JuliaRecapCoTTask | 38.0 | 72.0 | 84.4 | 70.0 | 43.2 | 1.67 |
gpt-4-0125-preview | JuliaRecapTask | 39.9 | 73.0 | 90.0 | 70.0 | 42.9 | 1.7 |
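The Point Per Cent column and the "cost per point" idea from the introduction are reciprocals of each other. As a small worked example, the sketch below recomputes both for the gpt-4-0125-preview / JuliaExpertAsk row. It assumes the `cost` column is in US dollars (the pipeline multiplies it by 100 to get cents) and uses the rounded values printed in the table; the helper names are hypothetical.

```julia
# Points earned per US cent spent: average score / average cost in cents.
# `cost_usd` is the average cost per query in US dollars.
points_per_cent(score, cost_usd) = score / (cost_usd * 100)

# The reciprocal view: how many cents one benchmark point costs.
cents_per_point(score, cost_usd) = (cost_usd * 100) / score

# gpt-4-0125-preview with JuliaExpertAsk: score 79.2, cost 0.5 cents = 0.005 USD.
score, cost_usd = 79.2, 0.005

ppc = points_per_cent(score, cost_usd)
cpp = cents_per_point(score, cost_usd)

println("points per cent: ", round(ppc, digits = 1))
println("cents per point: ", round(cpp, digits = 4))
```

For the cheapest models the printed cost is rounded to a single significant digit (for example, 0.01 cents), so recomputing Point Per Cent from the table gives only an approximate match for those rows.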
Comparison of Time-to-generate vs Average Score
fig = @chain df begin
    @aside local xlims = quantile(df.elapsed_seconds, [0.01, 0.99])
    @by [:model, :prompt_label] begin
        :elapsed = mean(:elapsed_seconds)
        :elapsed_median = median(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:elapsed => "Avg. Elapsed Time (s)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(; figure = (size = (600, 600),),
        axis = (xticklabelrotation = 45,
            title = "Elapsed Time vs Score for Paid APIs",
            limits = (xlims..., nothing, nothing)),
        palettes = (; color = Makie.ColorSchemes.tab20.colors))
end
SAVE_PLOTS && save("assets/elapsed-vs-score-scatter-paid.png", fig)
fig
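The plot above sets its x-axis limits from the 1st and 99th percentiles of the raw `elapsed_seconds` values, which keeps a few extreme timings from stretching the axis. The sketch below shows the same idea on made-up data; the `elapsed` vector is hypothetical and only `quantile` from the Statistics standard library is used.

```julia
using Statistics: quantile

# Hypothetical timings: mostly around 2 seconds, plus one very fast
# and one very slow outlier.
elapsed = vcat(fill(2.0, 98), [0.01, 500.0])

# Percentile-based axis limits, as in the `xlims` line of the plot code.
lo, hi = quantile(elapsed, [0.01, 0.99])

println("raw range:      ", extrema(elapsed))
println("1%-99% limits:  ", (lo, hi))
```

Points that fall outside the chosen limits are simply not visible in the plot, so this is a display choice and does not change the averages themselves.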
Table:
- Point per second is the average score divided by the average elapsed time
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_second = :score_avg / :elapsed
    @orderby -:point_per_second
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Second | Cost Cents |
---|---|---|---|---|---|---|---|
gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 72.9 | 80.0 | 70.0 | 84.1 | 0.02 |
gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 67.6 | 86.7 | 70.0 | 57.1 | 0.03 |
gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 67.1 | 67.5 | 70.0 | 56.7 | 0.04 |
gpt-3.5-turbo-0125 | InJulia | 1.5 | 73.9 | 81.7 | 70.0 | 49.4 | 0.03 |
gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 45.5 | 0.03 |
gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 38.9 | 0.04 |
gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 29.2 | 0.05 |
gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.1 | 28.9 | 0.0 | 70.0 | 27.4 | 0.03 |
gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 25.8 | 0.04 |
gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 19.6 | 0.03 |
mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 18.7 | 0.01 |
mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 16.5 | 0.06 |
gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 14.5 | 0.04 |
mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 13.6 | 0.02 |
mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 12.7 | 0.08 |
mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 11.4 | 0.08 |
gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 10.5 | 0.03 |
mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 10.3 | 0.03 |
mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 9.3 | 0.03 |
gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 7.6 | 0.06 |
mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 7.4 | 0.12 |
gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 7.4 | 0.04 |
gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 79.2 | 90.0 | 70.0 | 7.3 | 0.5 |
gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 7.2 | 0.63 |
mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 7.2 | 0.13 |
mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 6.2 | 0.02 |
gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 5.4 | 0.04 |
mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 4.9 | 0.26 |
mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 4.3 | 0.34 |
gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 3.3 | 1.12 |
mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 3.2 | 0.43 |
mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 3.0 | 0.53 |
gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 2.9 | 1.48 |
gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 2.7 | 1.54 |
gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 2.7 | 1.29 |
gpt-4-0125-preview | JuliaExpertCoTTask | 27.8 | 70.8 | 95.4 | 70.0 | 2.5 | 1.19 |
mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 2.4 | 0.5 |
gpt-4-0125-preview | InJulia | 34.3 | 70.6 | 84.0 | 70.0 | 2.1 | 1.42 |
gpt-4-0125-preview | JuliaRecapCoTTask | 38.0 | 72.0 | 84.4 | 70.0 | 1.9 | 1.67 |
gpt-4-0125-preview | JuliaRecapTask | 39.9 | 73.0 | 90.0 | 70.0 | 1.8 | 1.7 |
Test Case Performance
Performance of different models across each test case
output = @chain df begin
    @by [:model, :name] begin
        :score = mean(:score)
    end
    #
    @aside average_ = @by _ :name :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:name, :model, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :name)
    @orderby -:AverageScore
end
Row | name | gpt-3.5-turbo | gpt-3.5-turbo-0125 | gpt-3.5-turbo-1106 | gpt-4-0125-preview | gpt-4-1106-preview | mistral-medium | mistral-small | mistral-tiny | AverageScore |
---|---|---|---|---|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64? |
1 | FloatWithUnits | 76.0 | 94.0 | 80.0 | 56.0 | 72.0 | 98.0 | 70.0 | 80.2 | 78.3 |
2 | timezone_bumper | 48.0 | 74.8 | 79.2 | 93.0 | 90.0 | 97.0 | 76.6 | 62.0 | 77.6 |
3 | clean_column | 35.5 | 62.7 | 69.8 | 84.5 | 90.5 | 81.0 | 84.6 | 80.8 | 73.7 |
4 | count_model_rows | 52.8 | 73.8 | 56.2 | 98.4 | 98.4 | 79.0 | 67.2 | 53.2 | 72.4 |
5 | keep_only_names | 50.8 | 80.2 | 74.2 | 85.2 | 91.0 | 66.2 | 76.6 | 51.0 | 71.9 |
6 | wrap_string | 64.0 | 57.5 | 55.3 | 93.7 | 97.8 | 84.7 | 68.0 | 48.3 | 71.2 |
7 | weather_data_analyzer | 35.2 | 64.2 | 59.0 | 87.8 | 85.0 | 85.4 | 55.4 | 56.8 | 66.1 |
8 | add_yearmonth | 33.0 | 63.8 | 65.2 | 76.2 | 72.8 | 48.0 | 62.2 | 33.2 | 56.8 |
9 | event_scheduler | 29.0 | 47.8 | 42.8 | 94.4 | 66.6 | 36.0 | 59.0 | 37.2 | 51.6 |
10 | ispersonal | 43.0 | 74.0 | 68.6 | 58.6 | 56.0 | 35.0 | 48.0 | 29.5 | 51.6 |
11 | audi_filter | 27.0 | 60.0 | 58.0 | 46.0 | 58.0 | 43.0 | 48.5 | 27.0 | 45.9 |
12 | extract_julia_code | 41.0 | 43.6 | 48.4 | 51.6 | 48.7 | 31.8 | 52.2 | 30.1 | 43.4 |
13 | q_and_a_extractor | 31.7 | 36.3 | 36.7 | 44.7 | 53.3 | 38.7 | 44.7 | 36.0 | 40.2 |
14 | pig_latinify | 24.7 | 36.4 | 23.1 | 53.6 | 61.4 | 27.8 | 28.8 | 33.1 | 36.1 |
This page was generated using Literate.jl.