Results for Paid LLM APIs

The results below capture the performance of several models from two commercial LLM APIs: OpenAI (GPT-3.5 Turbo, GPT-4, ...) and MistralAI (tiny, small, medium).

There are many other providers, but OpenAI is the most commonly used. MistralAI's commercial API launched recently and has a very good relationship with the open-source community, so we've added it as a challenger to compare against OpenAI's cost-effectiveness ("cost per point", i.e., how many cents you would pay for 1 point in this benchmark).

Reminder: The scores below are on a scale of 0-100, where 100 is the best possible score and 0 means the generated code was not even parseable.

# Imports
using JuliaLLMLeaderboard
using CairoMakie, AlgebraOfGraphics
using MarkdownTables, DataFramesMeta
using Statistics: mean, median, quantile, std;

# ! Configuration
SAVE_PLOTS = false
DIR_RESULTS = joinpath(pkgdir(JuliaLLMLeaderboard), "code_generation")
PAID_MODELS_DEFAULT = [
    "gpt-3.5-turbo",
    "gpt-3.5-turbo-1106",
    "gpt-3.5-turbo-0125",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
    "mistral-tiny",
    "mistral-small",
    "mistral-medium",
];
PROMPTS = [
    "JuliaExpertCoTTask",
    "JuliaExpertAsk",
    "InJulia",
    "JuliaRecapTask",
    "JuliaRecapCoTTask",
];

Load Latest Results

Use only the 5 most recent evaluations available for each definition/model/prompt

df = @chain begin
    load_evals(DIR_RESULTS; max_history = 5)
    @rsubset :model in PAID_MODELS_DEFAULT && :prompt_label in PROMPTS
end;
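The effect of `max_history = 5` can be pictured as a group-and-truncate step. The following is a minimal sketch in plain Julia, not the actual `load_evals` implementation: the record fields (`name`, `model`, `prompt_label`, `timestamp`) and the helper `keep_latest` are hypothetical and chosen only for illustration.

```julia
# Hypothetical evaluation records: seven runs of one definition/model/prompt
# combination plus a single run of another. Field names are assumptions.
records = map(1:7) do t
    (name = "wrap_string", model = "gpt-4-1106-preview",
        prompt_label = "InJulia", timestamp = t, score = 50.0 + t)
end
push!(records,
    (name = "wrap_string", model = "mistral-tiny",
        prompt_label = "InJulia", timestamp = 1, score = 40.0))

# Keep at most `max_history` newest records per (name, model, prompt_label) group.
function keep_latest(records; max_history = 5)
    T = eltype(records)
    groups = Dict{Any, Vector{T}}()
    for r in records
        key = (r.name, r.model, r.prompt_label)
        push!(get!(groups, key, T[]), r)
    end
    kept = T[]
    for group in values(groups)
        newest_first = sort(group; by = r -> r.timestamp, rev = true)
        append!(kept, first(newest_first, max_history))
    end
    return kept
end

latest = keep_latest(records; max_history = 5)
println(length(latest), " records kept")
```

Groups with fewer than five runs are kept whole, so the number of samples per model/prompt can differ slightly between groups.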

Model Comparison

Highest average score by model:

fig = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    @orderby -:score
    @aside local order_ = _.model
    data(_) *
    mapping(:model => sorter(order_) => "Model",
        :score => "Avg. Score (Max 100 pts)") * visual(BarPlot; bar_labels = :y,
        label_offset = 0)
    draw(;
        axis = (limits = (nothing, nothing, 0, 100),
            xticklabelrotation = 45,
            title = "Paid APIs Performance"))
end

Table:

output = @chain df begin
    @by [:model] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_std_deviation = std(:score)
        :count_zero_score = count(iszero, :score)
        :count_full_score = count(==(100), :score)
    end
    transform(_,
        [:elapsed, :score, :score_std_deviation] .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    @orderby -:score
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Elapsed	Score	Score Std Deviation	Count Zero Score	Count Full Score	Cost Cents
gpt-4-1106-preview	22.4	74.4	29.9	19	142	1.21
gpt-4-0125-preview	30.2	73.1	31.7	26	140	1.3
gpt-3.5-turbo-0125	1.2	62.1	36.5	62	95	0.03
mistral-medium	18.1	60.8	33.2	22	90	0.41
mistral-small	5.9	60.1	30.2	27	76	0.09
gpt-3.5-turbo-1106	2.1	58.4	39.2	82	97	0.04
mistral-tiny	4.6	46.9	32.0	75	42	0.02
gpt-3.5-turbo	3.6	42.3	38.2	132	54	0.04

While the victory of GPT-4 is not surprising, note that our sample size is small and the standard deviation is quite high.
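To get a rough sense of how much the high standard deviation matters, one can estimate the standard error of each mean score. This is a back-of-the-envelope sketch under two assumptions: roughly 350 samples per model (5 prompts x 70 runs, per the `Cnt` column of the per-prompt tables further down), and independent samples, which is not strictly true because the same test cases repeat.

```julia
# Values copied from the table above: (mean score, score standard deviation).
stats = Dict(
    "gpt-4-1106-preview" => (74.4, 29.9),
    "gpt-4-0125-preview" => (73.1, 31.7),
)
n = 350  # assumption: 5 prompts x 70 samples per model

# Standard error of the mean: std / sqrt(n)
standard_error(std_dev, n) = std_dev / sqrt(n)

se_1106 = standard_error(stats["gpt-4-1106-preview"][2], n)
se_0125 = standard_error(stats["gpt-4-0125-preview"][2], n)

# Gap between the two means and the standard error of that gap
# (treating the two means as independent).
gap = stats["gpt-4-1106-preview"][1] - stats["gpt-4-0125-preview"][1]
se_gap = sqrt(se_1106^2 + se_0125^2)

println("gap = ", round(gap, digits = 1), " pts, standard error of gap = ",
    round(se_gap, digits = 1), " pts")
```

Under these assumptions the 1.3-point gap between the two GPT-4 snapshots is smaller than its standard error (about 2.3 points), so their relative ranking should not be over-interpreted.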

Overview by Prompt Template

Bar chart with all paid models and various prompt templates

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @aside local average_ = @by(_, :model, :avg=mean(:score)) |>
                            x -> @orderby(x, -:avg).model
    data(_) *
    mapping(:model => sorter(average_) => "Model",
        :score => "Avg. Score (Max 100 pts)",
        color = :prompt_label => "Prompts",
        dodge = :prompt_label) * visual(BarPlot)
    draw(;
        axis = (xticklabelrotation = 45, title = "Comparison for Paid APIs"))
end
SAVE_PLOTS && save("assets/model-prompt-comparison-paid.png", fig)
fig

Table:

Surprised by the low performance of some models (eg, GPT 3.5 Turbo) on the CoT prompts? It's because the model accidentally sends a "stop" token before it writes the code.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
    end
    @aside average_ = @by _ :model :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:model, :prompt_label, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :model)
    @orderby -:AverageScore
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

model	InJulia	JuliaExpertAsk	JuliaExpertCoTTask	JuliaRecapCoTTask	JuliaRecapTask	AverageScore
gpt-4-1106-preview	74.9	79.1	71.8	72.4	73.6	74.4
gpt-4-0125-preview	70.6	79.2	70.8	72.0	73.0	73.1
gpt-3.5-turbo-0125	73.9	72.9	67.6	28.9	67.1	62.1
mistral-medium	63.1	60.5	63.4	55.9	61.2	60.8
mistral-small	67.3	61.4	59.9	56.1	55.9	60.1
gpt-3.5-turbo-1106	74.6	73.6	73.4	15.4	55.0	58.4
mistral-tiny	51.7	44.3	41.1	50.5	47.2	47.0
gpt-3.5-turbo	73.1	60.9	32.8	26.2	18.4	42.3

Other Considerations

Comparison of Cost vs Average Score

fig = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:cost => (x -> x * 100) => "Avg. Cost (US Cents/query)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(;
        axis = (xticklabelrotation = 45,
            title = "Cost vs Score for Paid APIs"))
end
SAVE_PLOTS && save("assets/cost-vs-score-scatter-paid.png", fig)
fig

Table:

Point per cent is the average score divided by the average cost in US cents (higher is better). Rows are sorted from best to worst.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_cent = :score_avg / :cost / 100
    @orderby -:point_per_cent
    # Round the numeric columns for display and report cost in US cents
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Prompt Label	Elapsed	Score Avg	Score Median	Cnt	Point Per Cent	Cost Cents
mistral-tiny	JuliaExpertAsk	2.4	44.3	50.0	70.0	4333.3	0.01
gpt-3.5-turbo-0125	JuliaExpertAsk	0.9	72.9	80.0	70.0	4074.3	0.02
mistral-tiny	InJulia	3.8	51.7	50.0	68.0	2869.4	0.02
gpt-3.5-turbo-1106	JuliaExpertAsk	1.6	73.6	80.0	70.0	2747.9	0.03
gpt-3.5-turbo-0125	InJulia	1.5	73.9	81.7	70.0	2347.7	0.03
gpt-3.5-turbo-0125	JuliaExpertCoTTask	1.2	67.6	86.7	70.0	2292.0	0.03
gpt-3.5-turbo	JuliaExpertAsk	3.1	60.9	60.0	70.0	2177.5	0.03
mistral-tiny	JuliaExpertCoTTask	6.6	41.1	50.0	70.0	2040.5	0.02
mistral-tiny	JuliaRecapCoTTask	4.9	50.5	50.0	70.0	1957.1	0.03
gpt-3.5-turbo-0125	JuliaRecapTask	1.2	67.1	67.5	70.0	1889.6	0.04
gpt-3.5-turbo-1106	JuliaExpertCoTTask	1.9	73.4	95.0	69.0	1873.1	0.04
mistral-tiny	JuliaRecapTask	5.1	47.2	50.0	70.0	1783.8	0.03
gpt-3.5-turbo-1106	InJulia	2.9	74.6	83.3	70.0	1672.1	0.04
gpt-3.5-turbo	InJulia	5.0	73.1	67.5	70.0	1633.3	0.04
mistral-small	JuliaExpertAsk	3.7	61.4	52.5	70.0	1078.7	0.06
gpt-3.5-turbo-1106	JuliaRecapTask	1.9	55.0	62.5	69.0	1028.1	0.05
gpt-3.5-turbo	JuliaExpertCoTTask	3.1	32.8	0.0	70.0	1010.4	0.03
mistral-small	InJulia	5.3	67.3	60.0	70.0	890.8	0.08
gpt-3.5-turbo-0125	JuliaRecapCoTTask	1.1	28.9	0.0	70.0	875.4	0.03
mistral-small	JuliaExpertCoTTask	5.3	59.9	55.0	70.0	706.0	0.08
gpt-3.5-turbo	JuliaRecapCoTTask	3.6	26.2	0.0	70.0	585.4	0.04
mistral-small	JuliaRecapCoTTask	7.6	56.1	57.5	70.0	460.0	0.12
mistral-small	JuliaRecapTask	7.7	55.9	55.0	70.0	436.2	0.13
gpt-3.5-turbo	JuliaRecapTask	3.4	18.4	0.0	70.0	423.5	0.04
gpt-3.5-turbo-1106	JuliaRecapCoTTask	2.0	15.4	0.0	70.0	274.4	0.06
mistral-medium	JuliaExpertAsk	12.3	60.5	55.0	70.0	230.3	0.26
mistral-medium	InJulia	14.8	63.1	60.0	70.0	187.6	0.34
gpt-4-0125-preview	JuliaExpertAsk	10.8	79.2	90.0	70.0	158.4	0.5
mistral-medium	JuliaExpertCoTTask	20.0	63.4	62.5	70.0	146.8	0.43
gpt-4-1106-preview	JuliaExpertAsk	10.9	79.1	90.8	70.0	125.2	0.63
mistral-medium	JuliaRecapTask	20.2	61.2	65.0	70.0	116.0	0.53
mistral-medium	JuliaRecapCoTTask	23.3	55.9	50.0	70.0	110.9	0.5
gpt-4-1106-preview	JuliaExpertCoTTask	21.7	71.8	92.5	70.0	63.9	1.12
gpt-4-0125-preview	JuliaExpertCoTTask	27.8	70.8	95.4	70.0	59.5	1.19
gpt-4-1106-preview	InJulia	27.4	74.9	86.7	70.0	57.9	1.29
gpt-4-0125-preview	InJulia	34.3	70.6	84.0	70.0	49.9	1.42
gpt-4-1106-preview	JuliaRecapCoTTask	25.0	72.4	85.6	70.0	48.9	1.48
gpt-4-1106-preview	JuliaRecapTask	26.9	73.6	77.5	70.0	47.9	1.54
gpt-4-0125-preview	JuliaRecapCoTTask	38.0	72.0	84.4	70.0	43.2	1.67
gpt-4-0125-preview	JuliaRecapTask	39.9	73.0	90.0	70.0	42.9	1.7
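The "cost per point" mentioned in the introduction is simply the reciprocal of the `Point Per Cent` column. Recomputing either from the displayed columns gives slightly different numbers than the table, because the table divides by the unrounded cost while `Cost Cents` is shown rounded to two decimals. A small sketch using the first row (`mistral-tiny`, `JuliaExpertAsk`); the helper names are illustrative and not part of the package:

```julia
# Displayed values for mistral-tiny / JuliaExpertAsk, copied from the table above.
score_avg = 44.3      # points
cost_cents = 0.01     # US cents per query, rounded to 2 decimals
table_ppc = 4333.3    # Point Per Cent as reported in the table

points_per_cent(score, cents) = score / cents
cents_per_point(score, cents) = cents / score

# Ratio recomputed from the rounded, displayed columns.
recomputed = points_per_cent(score_avg, cost_cents)

# Unrounded cost (in cents) implied by the reported ratio.
implied_cents = score_avg / table_ppc

println("recomputed points per cent: ", round(recomputed, digits = 1))
println("implied cost: ", round(implied_cents, digits = 4), " cents")
println("cents per point: ", cents_per_point(score_avg, implied_cents))
```

The recomputed ratio is about 4430 rather than the reported 4333.3, which corresponds to an unrounded cost of roughly 0.0102 cents per query. For the cheapest models, small rounding differences in cost therefore move the ratio noticeably.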

Comparison of Time-to-generate vs Average Score

fig = @chain df begin
    @aside local xlims = quantile(df.elapsed_seconds, [0.01, 0.99])
    @by [:model, :prompt_label] begin
        :elapsed = mean(:elapsed_seconds)
        :elapsed_median = median(:elapsed_seconds)
        :score = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    data(_) * mapping(:elapsed => "Avg. Elapsed Time (s)",
        :score => "Avg. Score (Max 100 pts)",
        color = :model => "Model")
    draw(; figure = (size = (600, 600),),
        axis = (xticklabelrotation = 45,
            title = "Elapsed Time vs Score for Paid APIs",
            limits = (xlims..., nothing, nothing)),
        palettes = (; color = Makie.ColorSchemes.tab20.colors))
end
SAVE_PLOTS && save("assets/elapsed-vs-score-scatter-paid.png", fig)
fig

Table:

Point per second is the average score divided by the average elapsed time in seconds (higher is better). Rows are sorted from best to worst.

output = @chain df begin
    @by [:model, :prompt_label] begin
        :cost = mean(:cost)
        :elapsed = mean(:elapsed_seconds)
        :score_avg = mean(:score)
        :score_median = median(:score)
        :cnt = $nrow
    end
    @rtransform :point_per_second = :score_avg / :elapsed
    @orderby -:point_per_second
    # Round the numeric columns for display and report cost in US cents
    transform(_,
        names(_, Not(:model, :prompt_label, :cost)) .=> ByRow(x -> round(x, digits = 1)),
        renamecols = false)
    @rtransform :cost_cents = round(:cost * 100; digits = 2)
    select(Not(:cost))
    rename(_, names(_) .|> unscrub_string)
end
# markdown_table(output, String) |> clipboard
markdown_table(output)

Model	Prompt Label	Elapsed	Score Avg	Score Median	Cnt	Point Per Second	Cost Cents
gpt-3.5-turbo-0125	JuliaExpertAsk	0.9	72.9	80.0	70.0	84.1	0.02
gpt-3.5-turbo-0125	JuliaExpertCoTTask	1.2	67.6	86.7	70.0	57.1	0.03
gpt-3.5-turbo-0125	JuliaRecapTask	1.2	67.1	67.5	70.0	56.7	0.04
gpt-3.5-turbo-0125	InJulia	1.5	73.9	81.7	70.0	49.4	0.03
gpt-3.5-turbo-1106	JuliaExpertAsk	1.6	73.6	80.0	70.0	45.5	0.03
gpt-3.5-turbo-1106	JuliaExpertCoTTask	1.9	73.4	95.0	69.0	38.9	0.04
gpt-3.5-turbo-1106	JuliaRecapTask	1.9	55.0	62.5	69.0	29.2	0.05
gpt-3.5-turbo-0125	JuliaRecapCoTTask	1.1	28.9	0.0	70.0	27.4	0.03
gpt-3.5-turbo-1106	InJulia	2.9	74.6	83.3	70.0	25.8	0.04
gpt-3.5-turbo	JuliaExpertAsk	3.1	60.9	60.0	70.0	19.6	0.03
mistral-tiny	JuliaExpertAsk	2.4	44.3	50.0	70.0	18.7	0.01
mistral-small	JuliaExpertAsk	3.7	61.4	52.5	70.0	16.5	0.06
gpt-3.5-turbo	InJulia	5.0	73.1	67.5	70.0	14.5	0.04
mistral-tiny	InJulia	3.8	51.7	50.0	68.0	13.6	0.02
mistral-small	InJulia	5.3	67.3	60.0	70.0	12.7	0.08
mistral-small	JuliaExpertCoTTask	5.3	59.9	55.0	70.0	11.4	0.08
gpt-3.5-turbo	JuliaExpertCoTTask	3.1	32.8	0.0	70.0	10.5	0.03
mistral-tiny	JuliaRecapCoTTask	4.9	50.5	50.0	70.0	10.3	0.03
mistral-tiny	JuliaRecapTask	5.1	47.2	50.0	70.0	9.3	0.03
gpt-3.5-turbo-1106	JuliaRecapCoTTask	2.0	15.4	0.0	70.0	7.6	0.06
mistral-small	JuliaRecapCoTTask	7.6	56.1	57.5	70.0	7.4	0.12
gpt-3.5-turbo	JuliaRecapCoTTask	3.6	26.2	0.0	70.0	7.4	0.04
gpt-4-0125-preview	JuliaExpertAsk	10.8	79.2	90.0	70.0	7.3	0.5
gpt-4-1106-preview	JuliaExpertAsk	10.9	79.1	90.8	70.0	7.2	0.63
mistral-small	JuliaRecapTask	7.7	55.9	55.0	70.0	7.2	0.13
mistral-tiny	JuliaExpertCoTTask	6.6	41.1	50.0	70.0	6.2	0.02
gpt-3.5-turbo	JuliaRecapTask	3.4	18.4	0.0	70.0	5.4	0.04
mistral-medium	JuliaExpertAsk	12.3	60.5	55.0	70.0	4.9	0.26
mistral-medium	InJulia	14.8	63.1	60.0	70.0	4.3	0.34
gpt-4-1106-preview	JuliaExpertCoTTask	21.7	71.8	92.5	70.0	3.3	1.12
mistral-medium	JuliaExpertCoTTask	20.0	63.4	62.5	70.0	3.2	0.43
mistral-medium	JuliaRecapTask	20.2	61.2	65.0	70.0	3.0	0.53
gpt-4-1106-preview	JuliaRecapCoTTask	25.0	72.4	85.6	70.0	2.9	1.48
gpt-4-1106-preview	JuliaRecapTask	26.9	73.6	77.5	70.0	2.7	1.54
gpt-4-1106-preview	InJulia	27.4	74.9	86.7	70.0	2.7	1.29
gpt-4-0125-preview	JuliaExpertCoTTask	27.8	70.8	95.4	70.0	2.5	1.19
mistral-medium	JuliaRecapCoTTask	23.3	55.9	50.0	70.0	2.4	0.5
gpt-4-0125-preview	InJulia	34.3	70.6	84.0	70.0	2.1	1.42
gpt-4-0125-preview	JuliaRecapCoTTask	38.0	72.0	84.4	70.0	1.9	1.67
gpt-4-0125-preview	JuliaRecapTask	39.9	73.0	90.0	70.0	1.8	1.7

Test Case Performance

Performance of different models across each test case

output = @chain df begin
    @by [:model, :name] begin
        :score = mean(:score)
    end
    # Average across models per test case; joined back after pivoting to wide format
    @aside average_ = @by _ :name :AverageScore=mean(:score) |> x -> round(x, digits = 1)
    unstack(:name, :model, :score; fill = 0.0)
    transform(_, names(_, Number) .=> ByRow(x -> round(x, digits = 1)), renamecols = false)
    leftjoin(average_, on = :name)
    @orderby -:AverageScore
end

14×10 DataFrame

Row	name	gpt-3.5-turbo	gpt-3.5-turbo-0125	gpt-3.5-turbo-1106	gpt-4-0125-preview	gpt-4-1106-preview	mistral-medium	mistral-small	mistral-tiny	AverageScore
	String	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64?
1	FloatWithUnits	76.0	94.0	80.0	56.0	72.0	98.0	70.0	80.2	78.3
2	timezone_bumper	48.0	74.8	79.2	93.0	90.0	97.0	76.6	62.0	77.6
3	clean_column	35.5	62.7	69.8	84.5	90.5	81.0	84.6	80.8	73.7
4	count_model_rows	52.8	73.8	56.2	98.4	98.4	79.0	67.2	53.2	72.4
5	keep_only_names	50.8	80.2	74.2	85.2	91.0	66.2	76.6	51.0	71.9
6	wrap_string	64.0	57.5	55.3	93.7	97.8	84.7	68.0	48.3	71.2
7	weather_data_analyzer	35.2	64.2	59.0	87.8	85.0	85.4	55.4	56.8	66.1
8	add_yearmonth	33.0	63.8	65.2	76.2	72.8	48.0	62.2	33.2	56.8
9	event_scheduler	29.0	47.8	42.8	94.4	66.6	36.0	59.0	37.2	51.6
10	ispersonal	43.0	74.0	68.6	58.6	56.0	35.0	48.0	29.5	51.6
11	audi_filter	27.0	60.0	58.0	46.0	58.0	43.0	48.5	27.0	45.9
12	extract_julia_code	41.0	43.6	48.4	51.6	48.7	31.8	52.2	30.1	43.4
13	q_and_a_extractor	31.7	36.3	36.7	44.7	53.3	38.7	44.7	36.0	40.2
14	pig_latinify	24.7	36.4	23.1	53.6	61.4	27.8	28.8	33.1	36.1

This page was generated using Literate.jl.