However, if you need quick responses and high-quality outputs, your best choice is "gpt-3.5-turbo-1106" (the "1106" version is important! The default GPT-3.5 Turbo ranks much lower).
For more plots and a table summary, visit Results for Paid APIs.
Open-source models, though not yet as robust as the best paid APIs, are rapidly catching up, with some like Magicoder, Phind CodeLlama, and DeepSeek showing notable results. My personal pick would be "magicoder:7b-s-cl-q6_K" served via Ollama.ai: with only 7 billion parameters, it's quite fast, and its performance is solid.
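If you want to try that model locally, a minimal setup sketch with the Ollama CLI might look like this (assuming you already have Ollama installed; the model tag is the one mentioned above):

```shell
# One-time download of the quantized Magicoder model
ollama pull magicoder:7b-s-cl-q6_K

# Start an interactive chat session with it
ollama run magicoder:7b-s-cl-q6_K
```

Once pulled, Ollama also serves the model over its local HTTP API (on `localhost:11434` by default), which is how downstream tooling can talk to it.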
See more detail here.
Moreover, the benchmark evaluates the effectiveness of different prompting strategies. It turns out that even simple prompts can be quite effective, while longer prompts may sometimes confuse smaller models.
We used the prompting templates available in PromptingTools.jl 0.6.0, except for "AsIs", which represents the raw task without any mention of the Julia language (to see if the LLMs can infer it from the context).
| Prompt Template | Elapsed (s, average) | Elapsed (s, median) | Avg. Score (max 100 pts) | Median Score (max 100 pts) |
|---|---|---|---|---|
| InJulia | 16.7 | 11.7 | 50.6 | 50.0 |
| JuliaExpertAsk | 11.8 | 7.8 | 47.6 | 50.0 |
| JuliaRecapTask | 20.9 | 15.9 | 45.6 | 50.0 |
| JuliaExpertCoTTask | 19.7 | 14.9 | 43.9 | 50.0 |
| JuliaRecapCoTTask | 19.7 | 15.2 | 42.5 | 50.0 |
| AsIs | 36.3 | 11.2 | 13.0 | 0.0 |
Main takeaways:

- Always explicitly mention that you want Julia code (case in point: the "AsIs" prompt, which performed poorly).
- Just prepending "In Julia, ..." can be enough to get a good trade-off between speed and performance.
- In many cases, "JuliaExpertAsk" was quite successful. It doesn't hurt to stroke the AI's ego :)
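To make the comparison concrete, here is a small sketch of how these two prompting styles can be tried with PromptingTools.jl (assuming an OpenAI API key is configured via the `OPENAI_API_KEY` environment variable; the task string is just an illustrative example):

```julia
using PromptingTools

# "InJulia" style: a plain question with the "In Julia, ..." prefix
msg = aigenerate("In Julia, write a function that sums the squares of a vector.")

# "JuliaExpertAsk" style: the same task via the built-in template,
# which wraps it in an expert-persona system prompt
msg = aigenerate(:JuliaExpertAsk; ask = "Write a function that sums the squares of a vector.")

println(msg.content)
```

You can browse the available templates (including the Julia-specific ones used in this benchmark) with `aitemplates("Julia")`.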
These insights are just the tip of the iceberg. The full repository includes detailed documentation of the methodology and results. We invite the Julia community and AI enthusiasts to dive into the Julia LLM Leaderboard, contribute test cases, and explore the fascinating world of AI-generated code.
Stay tuned for more in-depth analysis and findings from this project!