Julia LLM Leaderboard: Benchmarking GenAI In Julia Coding


TL;DR

The Julia LLM Leaderboard is a new benchmarking project that evaluates and compares the Julia code generation capabilities of various Large Language Models, revealing that, unsurprisingly, paid APIs like GPT-4 perform exceptionally well, but the locally-hosted models are quickly closing the gap.

Announcing the Julia LLM Leaderboard: A Benchmark for AI-Generated Julia Code

We're excited to announce the launch of the Julia LLM Leaderboard, a comprehensive benchmark project dedicated to evaluating the Julia language generation capabilities of various Large Language Models (LLMs). This unique repository is designed with a focus on practicality, simplicity, and the Julia community's needs.

The project presents a comparative analysis across multiple AI models, assessing their proficiency in generating Julia code that is syntactically correct and actually runs. Our evaluation methodology uses simple, practical criteria: whether the generated code parses, whether it executes the provided examples without errors, and whether it passes unit tests. Each model can score up to 100 points based on these criteria, providing a clear and standardized measure of its capabilities.
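To make the scoring idea concrete, here is a minimal sketch in Python (chosen only for illustration; the benchmark itself is written in Julia). The equal 25-point weighting per criterion, the `Evaluation` record, and the `score` function are assumptions made for this example, not the leaderboard's actual implementation. The repository's methodology documentation has the exact rules.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Outcome of checking one generated solution (hypothetical record)."""
    parsed: bool           # did the generated code parse?
    executed: bool         # did it run without errors?
    examples_passed: int   # provided examples that ran correctly
    examples_total: int
    tests_passed: int      # unit tests that passed
    tests_total: int


def fraction(passed: int, total: int) -> float:
    """Share of checks passed; 0.0 when there is nothing to check."""
    return passed / total if total > 0 else 0.0


def score(ev: Evaluation, weight: float = 25.0) -> float:
    """Score out of 100, ASSUMING four equally weighted criteria."""
    points = 0.0
    points += weight if ev.parsed else 0.0
    points += weight if ev.executed else 0.0
    points += weight * fraction(ev.examples_passed, ev.examples_total)
    points += weight * fraction(ev.tests_passed, ev.tests_total)
    return points


# A solution that parses and runs, passes 2 of 4 examples and 1 of 2 unit tests.
partial = Evaluation(True, True, 2, 4, 1, 2)
print(score(partial))  # 75.0
```

Under these assumed weights, a solution earns partial credit for every step it gets right, so a model that produces parseable but incorrect code still scores above one whose output does not parse at all.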

Initial findings reveal that paid APIs like GPT-4 and the MistralAI models show impressive performance, with "GPT-4-Turbo-1106" consistently ranking among the highest.

Figure: Performance of paid APIs across different prompts

However, if you need both quick responses and high-quality output, your best choice is "gpt-3.5-turbo-1106". The "1106" version matters: the default GPT-3.5 Turbo ranks much lower.

For more plots and a table summary, visit Results for Paid APIs.

Locally-Hosted Models

Open-source models, though not as robust as the best paid APIs, are rapidly catching up, with some like Magicoder, Phind CodeLlama, and DeepSeek showing notable results. My personal pick would be "magicoder:7b-s-cl-q6_K" served via Ollama.ai: with 7 billion parameters, it is quite fast and its performance is solid.

Figure: Performance of locally-hosted models

See more detail here.

Prompts, Prompts, Prompts

The benchmark also examines the effectiveness of different prompting strategies. It turns out that even simple prompts can be quite effective, and larger prompts may sometimes confuse smaller models.

We used the prompting templates available in PromptingTools.jl 0.6.0, except for "AsIs", which presents the raw task without any mention of the Julia language (to see if the LLMs can infer it from the context).

| Prompt Template | Elapsed (s, average) | Elapsed (s, median) | Avg. Score (Max 100 pts) | Median Score (Max 100 pts) |
|---|---|---|---|---|
| InJulia | 16.7 | 11.7 | 50.6 | 50.0 |
| JuliaExpertAsk | 11.8 | 7.8 | 47.6 | 50.0 |
| JuliaRecapTask | 20.9 | 15.9 | 45.6 | 50.0 |
| JuliaExpertCoTTask | 19.7 | 14.9 | 43.9 | 50.0 |
| JuliaRecapCoTTask | 19.7 | 15.2 | 42.5 | 50.0 |
| AsIs | 36.3 | 11.2 | 13.0 | 0.0 |
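The table can also be explored programmatically. The sketch below, in Python purely for illustration, copies the rows as reported and looks up the highest-scoring template, the fastest template by median time, and the ratio of average to median elapsed time. The variable names are hypothetical, and reading a high ratio as a sign of a few very slow runs is an interpretation, not a finding stated by the benchmark.

```python
# Rows copied from the prompt template table:
# (template, avg elapsed s, median elapsed s, avg score, median score)
rows = [
    ("InJulia", 16.7, 11.7, 50.6, 50.0),
    ("JuliaExpertAsk", 11.8, 7.8, 47.6, 50.0),
    ("JuliaRecapTask", 20.9, 15.9, 45.6, 50.0),
    ("JuliaExpertCoTTask", 19.7, 14.9, 43.9, 50.0),
    ("JuliaRecapCoTTask", 19.7, 15.2, 42.5, 50.0),
    ("AsIs", 36.3, 11.2, 13.0, 0.0),
]

# Template with the highest average score.
best_by_score = max(rows, key=lambda r: r[3])[0]

# Template with the lowest median elapsed time.
fastest_by_median = min(rows, key=lambda r: r[2])[0]

# Ratio of average to median elapsed time per template. A ratio well
# above 1 means the average is pulled up by some slow responses.
skew = {name: round(avg_s / med_s, 2) for name, avg_s, med_s, _, _ in rows}
most_skewed = max(skew, key=skew.get)

print(best_by_score)                    # InJulia
print(fastest_by_median)                # JuliaExpertAsk
print(most_skewed, skew[most_skewed])   # AsIs 3.24
```

Note that every template except "AsIs" shares the same median score of 50.0, so the ranking among them rests on the average score alone.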

Main takeaways:

Conclusion

These insights are just the tip of the iceberg. The full repository includes detailed documentation of the methodology and results. We invite the Julia community and AI enthusiasts to dive into the Julia LLM Leaderboard, contribute test cases, and explore the fascinating world of AI-generated code.

Stay tuned for more in-depth analysis and findings from this project!

CC BY-SA 4.0 Jan Siml. Last modified: February 13, 2024.