Results for Paid LLM APIs

The table below captures the performance of 8 models from two commercial LLM API providers: OpenAI (GPT-3.5 Turbo, GPT-4, ...) and MistralAI (tiny, small, medium).

There are many other providers, but OpenAI is the most commonly used. MistralAI's commercial API launched recently, and the company has a strong relationship with the open-source community, so we have added it as a challenger against which to compare OpenAI's cost-effectiveness ("cost per point", i.e., how many US cents you would pay for one point in this benchmark).
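As a quick illustration of the "cost per point" idea, here is a minimal sketch with hypothetical numbers (not benchmark results):

```julia
# Toy illustration of the "cost per point" metric (hypothetical values).
avg_score = 60.0      # average benchmark score (0-100)
avg_cost_usd = 0.0004 # average cost per query in USD (assumed)

cost_cents = avg_cost_usd * 100          # convert USD to US cents
cents_per_point = cost_cents / avg_score # cents paid per benchmark point
points_per_cent = avg_score / cost_cents # points earned per cent spent
```

The tables below report the reciprocal form ("point per cent"), which is easier to rank by: higher is better.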

Reminder: The scores below are on a scale of 0-100, where 100 is the best possible score and 0 means the generated code was not even parseable.

# Imports
using JuliaLLMLeaderboard
using CairoMakie, AlgebraOfGraphics
using MarkdownTables, DataFramesMeta
using Statistics: mean, median, quantile, std;

# ! Configuration
SAVE_PLOTS = false
DIR_RESULTS = joinpath(pkgdir(JuliaLLMLeaderboard), "code_generation")
PAID_MODELS_DEFAULT = [
    "gpt-3.5-turbo",
    "gpt-3.5-turbo-1106",
    "gpt-3.5-turbo-0125",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
    "mistral-tiny",
    "mistral-small",
    "mistral-medium",
];
PROMPTS = [
    "JuliaExpertCoTTask",
    "JuliaExpertAsk",
    "InJulia",
    "JuliaRecapTask",
    "JuliaRecapCoTTask",
];

Load Latest Results

Use only the 5 most recent evaluations available for each definition/model/prompt

df = @chain begin
    load_evals(DIR_RESULTS; max_history = 5)
    @rsubset :model in PAID_MODELS_DEFAULT && :prompt_label in PROMPTS
end;

Model Comparison

Highest average score by model:

fig = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    @orderby -:score
    @aside local order_ = _.model
    data(_) *
    mapping(:model => sorter(order_) => "Model",
        :score => "Avg. Score (Max 100 pts)") * visual(BarPlot; bar_labels = :y,
        label_offset = 0)
    draw(;
        axis = (limits = (nothing, nothing, 0, 100),
            xticklabelrotation = 45,
            title = "Paid APIs Performance"))
end

Table:

output = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_std_deviation = std(:score)
        :count_zero_score = count(iszero, :score)
        :count_full_score = count(==(100), :score)
    end
    transform(_,
        [:elapsed, :score, :score_std_deviation] .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    @orderby -:score
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
| Model | Elapsed | Score | Score Std Deviation | Count Zero Score | Count Full Score | Cost Cents |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-4-1106-preview | 22.4 | 74.4 | 29.9 | 19 | 142 | 1.21 |
| gpt-4-0125-preview | 30.2 | 73.1 | 31.7 | 26 | 140 | 1.3 |
| gpt-3.5-turbo-0125 | 1.2 | 62.1 | 36.5 | 62 | 95 | 0.03 |
| mistral-medium | 18.1 | 60.8 | 33.2 | 22 | 90 | 0.41 |
| mistral-small | 5.9 | 60.1 | 30.2 | 27 | 76 | 0.09 |
| gpt-3.5-turbo-1106 | 2.1 | 58.4 | 39.2 | 82 | 97 | 0.04 |
| mistral-tiny | 4.6 | 46.9 | 32.0 | 75 | 42 | 0.02 |
| gpt-3.5-turbo | 3.6 | 42.3 | 38.2 | 132 | 54 | 0.04 |

While the victory of GPT-4 is not surprising, note that our sample size is small and the standard deviation is quite high.
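To make the sample-size caveat concrete, here is a minimal sketch of the standard error of a per-cell (model/prompt/test-case) mean, assuming 5 runs per cell (`max_history = 5`) and a score standard deviation of ~30 points, as in the table above:

```julia
# Standard error of a mean score from a small sample (assumed values).
n = 5             # runs per test case/prompt (max_history = 5)
score_std = 29.9  # score standard deviation from the table above

se = score_std / sqrt(n)  # standard error of a per-cell mean score
# se ≈ 13.4 points, so individual model/prompt averages are quite noisy;
# differences of a few points between models should not be over-interpreted
```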

Overview by Prompt Template

Bar chart with all paid models and various prompt templates

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @aside local average_ = @by(_, :model, :avg=mean(:score)) |>
                            x -> @orderby(x, -:avg).model
    data(_) *
    mapping(:model => sorter(average_) => "Model",
        :score => "Avg. Score (Max 100 pts)",
        color = :prompt_label => "Prompts",
        dodge = :prompt_label) * visual(BarPlot)
    draw(;
        axis = (xticklabelrotation = 45, title = "Comparison for Paid APIs"))
end
SAVE_PLOTS && save("assets/model-prompt-comparison-paid.png", fig)
fig

Table:

  • Surprised by the low performance of some models (eg, GPT 3.5 Turbo) on the CoT prompts? It's because the model accidentally sends a "stop" token before it writes the code.
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    @aside average_ = @by _ :model :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:model, :prompt_label, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :model)
    @orderby -:AverageScore
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
| model | InJulia | JuliaExpertAsk | JuliaExpertCoTTask | JuliaRecapCoTTask | JuliaRecapTask | AverageScore |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-4-1106-preview | 74.9 | 79.1 | 71.8 | 72.4 | 73.6 | 74.4 |
| gpt-4-0125-preview | 70.6 | 79.2 | 70.8 | 72.0 | 73.0 | 73.1 |
| gpt-3.5-turbo-0125 | 73.9 | 72.9 | 67.6 | 28.9 | 67.1 | 62.1 |
| mistral-medium | 63.1 | 60.5 | 63.4 | 55.9 | 61.2 | 60.8 |
| mistral-small | 67.3 | 61.4 | 59.9 | 56.1 | 55.9 | 60.1 |
| gpt-3.5-turbo-1106 | 74.6 | 73.6 | 73.4 | 15.4 | 55.0 | 58.4 |
| mistral-tiny | 51.7 | 44.3 | 41.1 | 50.5 | 47.2 | 47.0 |
| gpt-3.5-turbo | 73.1 | 60.9 | 32.8 | 26.2 | 18.4 | 42.3 |

Other Considerations

Comparison of Cost vs Average Score

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:cost => (x -> x * 100) => "Avg. Cost (US Cents/query)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(;
        axis = (xticklabelrotation = 45,
            title = "Cost vs Score for Paid APIs"))
end
SAVE_PLOTS && save("assets/cost-vs-score-scatter-paid.png", fig)
fig

Table:

  • Point per cent is the average score divided by the average cost in US cents
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_cent = :score_avg / :cost / 100
    @orderby -:point_per_cent
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
| Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Cent | Cost Cents |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 4333.3 | 0.01 |
| gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 72.9 | 80.0 | 70.0 | 4074.3 | 0.02 |
| mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 2869.4 | 0.02 |
| gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 2747.9 | 0.03 |
| gpt-3.5-turbo-0125 | InJulia | 1.5 | 73.9 | 81.7 | 70.0 | 2347.7 | 0.03 |
| gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 67.6 | 86.7 | 70.0 | 2292.0 | 0.03 |
| gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 2177.5 | 0.03 |
| mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 2040.5 | 0.02 |
| mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 1957.1 | 0.03 |
| gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 67.1 | 67.5 | 70.0 | 1889.6 | 0.04 |
| gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 1873.1 | 0.04 |
| mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 1783.8 | 0.03 |
| gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 1672.1 | 0.04 |
| gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 1633.3 | 0.04 |
| mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 1078.7 | 0.06 |
| gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 1028.1 | 0.05 |
| gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 1010.4 | 0.03 |
| mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 890.8 | 0.08 |
| gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.1 | 28.9 | 0.0 | 70.0 | 875.4 | 0.03 |
| mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 706.0 | 0.08 |
| gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 585.4 | 0.04 |
| mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 460.0 | 0.12 |
| mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 436.2 | 0.13 |
| gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 423.5 | 0.04 |
| gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 274.4 | 0.06 |
| mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 230.3 | 0.26 |
| mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 187.6 | 0.34 |
| gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 79.2 | 90.0 | 70.0 | 158.4 | 0.5 |
| mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 146.8 | 0.43 |
| gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 125.2 | 0.63 |
| mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 116.0 | 0.53 |
| mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 110.9 | 0.5 |
| gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 63.9 | 1.12 |
| gpt-4-0125-preview | JuliaExpertCoTTask | 27.8 | 70.8 | 95.4 | 70.0 | 59.5 | 1.19 |
| gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 57.9 | 1.29 |
| gpt-4-0125-preview | InJulia | 34.3 | 70.6 | 84.0 | 70.0 | 49.9 | 1.42 |
| gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 48.9 | 1.48 |
| gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 47.9 | 1.54 |
| gpt-4-0125-preview | JuliaRecapCoTTask | 38.0 | 72.0 | 84.4 | 70.0 | 43.2 | 1.67 |
| gpt-4-0125-preview | JuliaRecapTask | 39.9 | 73.0 | 90.0 | 70.0 | 42.9 | 1.7 |

Comparison of Time-to-generate vs Average Score

fig = @chain df begin
    @aside local xlims = quantile(df.elapsed_seconds, [0.01, 0.99])
    @by [:model, :prompt_label] begin
        :elapsed = mean(:elapsed_seconds)
        :elapsed_median = median(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:elapsed => "Avg. Elapsed Time (s)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(; figure = (size = (600, 600),),
        axis = (xticklabelrotation = 45,
            title = "Elapsed Time vs Score for Paid APIs",
            limits = (xlims..., nothing, nothing)),
        palettes = (; color = Makie.ColorSchemes.tab20.colors))
end
SAVE_PLOTS && save("assets/elapsed-vs-score-scatter-paid.png", fig)
fig

Table:

  • Point per second is the average score divided by the average elapsed time
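The same metric can be sketched on its own with toy numbers (illustrative only, not benchmark results):

```julia
# Toy "points per second" computation (hypothetical values).
score_avg = 60.0    # average score
elapsed_avg = 5.0   # average elapsed seconds per query
points_per_second = score_avg / elapsed_avg
```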
output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_second = :score_avg / :elapsed
    @orderby -:point_per_second
    #
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)
| Model | Prompt Label | Elapsed | Score Avg | Score Median | Cnt | Point Per Second | Cost Cents |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo-0125 | JuliaExpertAsk | 0.9 | 72.9 | 80.0 | 70.0 | 84.1 | 0.02 |
| gpt-3.5-turbo-0125 | JuliaExpertCoTTask | 1.2 | 67.6 | 86.7 | 70.0 | 57.1 | 0.03 |
| gpt-3.5-turbo-0125 | JuliaRecapTask | 1.2 | 67.1 | 67.5 | 70.0 | 56.7 | 0.04 |
| gpt-3.5-turbo-0125 | InJulia | 1.5 | 73.9 | 81.7 | 70.0 | 49.4 | 0.03 |
| gpt-3.5-turbo-1106 | JuliaExpertAsk | 1.6 | 73.6 | 80.0 | 70.0 | 45.5 | 0.03 |
| gpt-3.5-turbo-1106 | JuliaExpertCoTTask | 1.9 | 73.4 | 95.0 | 69.0 | 38.9 | 0.04 |
| gpt-3.5-turbo-1106 | JuliaRecapTask | 1.9 | 55.0 | 62.5 | 69.0 | 29.2 | 0.05 |
| gpt-3.5-turbo-0125 | JuliaRecapCoTTask | 1.1 | 28.9 | 0.0 | 70.0 | 27.4 | 0.03 |
| gpt-3.5-turbo-1106 | InJulia | 2.9 | 74.6 | 83.3 | 70.0 | 25.8 | 0.04 |
| gpt-3.5-turbo | JuliaExpertAsk | 3.1 | 60.9 | 60.0 | 70.0 | 19.6 | 0.03 |
| mistral-tiny | JuliaExpertAsk | 2.4 | 44.3 | 50.0 | 70.0 | 18.7 | 0.01 |
| mistral-small | JuliaExpertAsk | 3.7 | 61.4 | 52.5 | 70.0 | 16.5 | 0.06 |
| gpt-3.5-turbo | InJulia | 5.0 | 73.1 | 67.5 | 70.0 | 14.5 | 0.04 |
| mistral-tiny | InJulia | 3.8 | 51.7 | 50.0 | 68.0 | 13.6 | 0.02 |
| mistral-small | InJulia | 5.3 | 67.3 | 60.0 | 70.0 | 12.7 | 0.08 |
| mistral-small | JuliaExpertCoTTask | 5.3 | 59.9 | 55.0 | 70.0 | 11.4 | 0.08 |
| gpt-3.5-turbo | JuliaExpertCoTTask | 3.1 | 32.8 | 0.0 | 70.0 | 10.5 | 0.03 |
| mistral-tiny | JuliaRecapCoTTask | 4.9 | 50.5 | 50.0 | 70.0 | 10.3 | 0.03 |
| mistral-tiny | JuliaRecapTask | 5.1 | 47.2 | 50.0 | 70.0 | 9.3 | 0.03 |
| gpt-3.5-turbo-1106 | JuliaRecapCoTTask | 2.0 | 15.4 | 0.0 | 70.0 | 7.6 | 0.06 |
| mistral-small | JuliaRecapCoTTask | 7.6 | 56.1 | 57.5 | 70.0 | 7.4 | 0.12 |
| gpt-3.5-turbo | JuliaRecapCoTTask | 3.6 | 26.2 | 0.0 | 70.0 | 7.4 | 0.04 |
| gpt-4-0125-preview | JuliaExpertAsk | 10.8 | 79.2 | 90.0 | 70.0 | 7.3 | 0.5 |
| gpt-4-1106-preview | JuliaExpertAsk | 10.9 | 79.1 | 90.8 | 70.0 | 7.2 | 0.63 |
| mistral-small | JuliaRecapTask | 7.7 | 55.9 | 55.0 | 70.0 | 7.2 | 0.13 |
| mistral-tiny | JuliaExpertCoTTask | 6.6 | 41.1 | 50.0 | 70.0 | 6.2 | 0.02 |
| gpt-3.5-turbo | JuliaRecapTask | 3.4 | 18.4 | 0.0 | 70.0 | 5.4 | 0.04 |
| mistral-medium | JuliaExpertAsk | 12.3 | 60.5 | 55.0 | 70.0 | 4.9 | 0.26 |
| mistral-medium | InJulia | 14.8 | 63.1 | 60.0 | 70.0 | 4.3 | 0.34 |
| gpt-4-1106-preview | JuliaExpertCoTTask | 21.7 | 71.8 | 92.5 | 70.0 | 3.3 | 1.12 |
| mistral-medium | JuliaExpertCoTTask | 20.0 | 63.4 | 62.5 | 70.0 | 3.2 | 0.43 |
| mistral-medium | JuliaRecapTask | 20.2 | 61.2 | 65.0 | 70.0 | 3.0 | 0.53 |
| gpt-4-1106-preview | JuliaRecapCoTTask | 25.0 | 72.4 | 85.6 | 70.0 | 2.9 | 1.48 |
| gpt-4-1106-preview | JuliaRecapTask | 26.9 | 73.6 | 77.5 | 70.0 | 2.7 | 1.54 |
| gpt-4-1106-preview | InJulia | 27.4 | 74.9 | 86.7 | 70.0 | 2.7 | 1.29 |
| gpt-4-0125-preview | JuliaExpertCoTTask | 27.8 | 70.8 | 95.4 | 70.0 | 2.5 | 1.19 |
| mistral-medium | JuliaRecapCoTTask | 23.3 | 55.9 | 50.0 | 70.0 | 2.4 | 0.5 |
| gpt-4-0125-preview | InJulia | 34.3 | 70.6 | 84.0 | 70.0 | 2.1 | 1.42 |
| gpt-4-0125-preview | JuliaRecapCoTTask | 38.0 | 72.0 | 84.4 | 70.0 | 1.9 | 1.67 |
| gpt-4-0125-preview | JuliaRecapTask | 39.9 | 73.0 | 90.0 | 70.0 | 1.8 | 1.7 |

Test Case Performance

Performance of different models across each test case

output = @chain df begin
    @by [:model, :name] begin
        :score = mean(:score)
    end
    #
    @aside average_ = @by _ :name :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:name, :model, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :name)
    @orderby -:AverageScore
end
| name | gpt-3.5-turbo | gpt-3.5-turbo-0125 | gpt-3.5-turbo-1106 | gpt-4-0125-preview | gpt-4-1106-preview | mistral-medium | mistral-small | mistral-tiny | AverageScore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FloatWithUnits | 76.0 | 94.0 | 80.0 | 56.0 | 72.0 | 98.0 | 70.0 | 80.2 | 78.3 |
| timezone_bumper | 48.0 | 74.8 | 79.2 | 93.0 | 90.0 | 97.0 | 76.6 | 62.0 | 77.6 |
| clean_column | 35.5 | 62.7 | 69.8 | 84.5 | 90.5 | 81.0 | 84.6 | 80.8 | 73.7 |
| count_model_rows | 52.8 | 73.8 | 56.2 | 98.4 | 98.4 | 79.0 | 67.2 | 53.2 | 72.4 |
| keep_only_names | 50.8 | 80.2 | 74.2 | 85.2 | 91.0 | 66.2 | 76.6 | 51.0 | 71.9 |
| wrap_string | 64.0 | 57.5 | 55.3 | 93.7 | 97.8 | 84.7 | 68.0 | 48.3 | 71.2 |
| weather_data_analyzer | 35.2 | 64.2 | 59.0 | 87.8 | 85.0 | 85.4 | 55.4 | 56.8 | 66.1 |
| add_yearmonth | 33.0 | 63.8 | 65.2 | 76.2 | 72.8 | 48.0 | 62.2 | 33.2 | 56.8 |
| event_scheduler | 29.0 | 47.8 | 42.8 | 94.4 | 66.6 | 36.0 | 59.0 | 37.2 | 51.6 |
| ispersonal | 43.0 | 74.0 | 68.6 | 58.6 | 56.0 | 35.0 | 48.0 | 29.5 | 51.6 |
| audi_filter | 27.0 | 60.0 | 58.0 | 46.0 | 58.0 | 43.0 | 48.5 | 27.0 | 45.9 |
| extract_julia_code | 41.0 | 43.6 | 48.4 | 51.6 | 48.7 | 31.8 | 52.2 | 30.1 | 43.4 |
| q_and_a_extractor | 31.7 | 36.3 | 36.7 | 44.7 | 53.3 | 38.7 | 44.7 | 36.0 | 40.2 |
| pig_latinify | 24.7 | 36.4 | 23.1 | 53.6 | 61.4 | 27.8 | 28.8 | 33.1 | 36.1 |

This page was generated using Literate.jl.