OpenAI's o1 Series: Initial Reactions from a Julia Developer

15 September 2024

TL;DR

OpenAI's o1 series shows promise in code review and improvement suggestions, but its high cost and inconsistent performance across tasks make it a situational tool. It excels in thoughtful critique but may not always justify its premium over alternatives like Claude 3.5 Sonnet for coding tasks.

Introduction

As a Julia enthusiast, I couldn't resist putting OpenAI's new o1 series through its paces. With pricing at $15/60 USD per million tokens for input/output compared to Claude 3.5 Sonnet's $3/15, I was curious to see if the performance boost could justify the cost. Let's dive into my initial impressions and see where this new model shines - and where it might need some polishing.

Methodology

My approach was admittedly more anecdotal than scientific. I tested o1 preview, o1 mini, GPT-4o, and Anthropic's Sonnet 3.5 on a range of tasks, from code review to code generation and unit-test writing.

To keep things fair, I used the same prompts across all models. You can check out the full code-review conversations in these GitHub gists.

Highlights from the Code Review

Where o1 really flexed its muscles was in code review and constructive criticism. It caught nuances that other models missed, providing thoughtful and comprehensive feedback.

One standout moment was when o1 flagged a non-idiomatic dispatch on a Bool value:

Original code:

```julia
function format2str(obj, style::FormatStyle=DefaultStyle(); kwargs...)
    format2str(obj, style, Tables.istable(obj); kwargs...)
end

format2str(obj, ::FormatStyle, ::Bool; kwargs...) = string(obj)
```
Observation: The current implementation uses dispatch on Bool values, which is non-idiomatic in Julia.

This level of insight into Julia's type system was impressive. o1 also added a recommendation and a clear snippet showing how to fix it:

```julia
function format2str(obj, style::FormatStyle=DefaultStyle(); kwargs...)
    if Tables.istable(obj)
        return format_table(obj, style; kwargs...)
    else
        return format_non_table(obj, style; kwargs...)
    end
end
```
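For contrast, and not part of o1's review: if you did want to keep dispatch rather than the if/else, the usual Julia idiom is to lift the runtime Bool into the type domain with Val. A minimal sketch, where the name format2str_valdispatch is made up and format_table / format_non_table are the helpers from the snippet above:

```julia
# Sketch only, not from o1's review: dispatching on Val instead of a raw Bool.
format2str_valdispatch(obj, style::FormatStyle=DefaultStyle(); kwargs...) =
    format2str_valdispatch(obj, style, Val(Tables.istable(obj)); kwargs...)

format2str_valdispatch(obj, style::FormatStyle, ::Val{true}; kwargs...) =
    format_table(obj, style; kwargs...)
format2str_valdispatch(obj, style::FormatStyle, ::Val{false}; kwargs...) =
    format_non_table(obj, style; kwargs...)
```

The if/else o1 suggested is arguably the simpler choice here; the Val pattern mostly pays off when each branch needs its own method specializations.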

There were a few other interesting observations. The suggestion to use conditional dependencies for PrettyTables.jl showed a deep understanding of package management best practices (if only it knew about the newer weakdeps and package extensions!).

Observation: The code uses PrettyTables.jl, which might not be essential for all users.

Recommendation: 
1) Make Dependencies Optional:
Use Requires.jl to load optional dependencies.

```julia
@require PrettyTables="08abe8d2-0d0c-5749-adfa-8a2ac140af0d" begin
    # Code that uses PrettyTables
end
```

2) Specify Dependencies in Project.toml:
Ensure that all dependencies are correctly listed with version bounds.
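As an aside (my note, not part of o1's output): since Julia 1.9, weak dependencies plus package extensions in Project.toml are generally preferred over Requires.jl for exactly this kind of optional dependency. Roughly, with a made-up extension module name:

```toml
# Sketch only - not part of o1's recommendation; the extension name is hypothetical.
[weakdeps]
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"

[extensions]
PrettyTablesExt = "PrettyTables"
```

The PrettyTables-specific code then lives in ext/PrettyTablesExt.jl and is only loaded once a user imports PrettyTables themselves.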

There were several other notable highlights along these lines as well.

Downsides

While o1's review capabilities are impressive, they come with some drawbacks:

  1. Higher price point, especially for hobby projects or frequent use (it might still be worth it though!)

  2. Long waiting times for responses, which can disrupt coding flow

  3. Inconsistent performance in actual coding tasks (or rather, not as impressive as its code review)

Interestingly, when it came to writing code, the difference between o1 and other models wasn't as pronounced as I'd hoped. In some cases, I even preferred the code style and solutions provided by Sonnet 3.5:

```julia
# Generated by OpenAI's o1 preview
function format_table(rows, column_names, ::MarkdownStyle; max_rows=100, truncation="...")
    io = IOBuffer()
    print(io, "| ", join(column_names, " | "), " |\n")
    print(io, "|", join(repeat(["---"], length(column_names)), "|"), "\n")

    for (i, row) in enumerate(rows)
        if i > max_rows
            print(io, "| ", truncation, " |\n")
            break
        end
        values = [format2str(getproperty(row, col)) for col in column_names]
        print(io, "| ", join(values, " | "), " |\n")
    end

    return String(take!(io))
end
```

Notice how Sonnet 3.5 handles the potential missingness of values in the table:

```julia
# Generated by Anthropic's Claude 3.5 Sonnet
function format_table(rows, column_names, ::MarkdownStyle; max_rows::Int=100, truncation::String="...")
    io = IOBuffer()
    println(io, "| ", join(column_names, " | "), " |")
    println(io, "| ", join(repeat(["-"], length(column_names)), " | "), " |")

    for (i, row) in enumerate(rows)
        if i > max_rows
            println(io, "| ", truncation, " |")
            break
        end
        println(io, "| ", join([get(row, col, missing) for col in column_names], " | "), " |")
    end

    String(take!(io))
end
```
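That get(row, col, missing) call is the more defensive choice: a row that simply lacks a column falls back to missing instead of throwing, as getproperty would. A tiny illustration (mine, not from either model's output):

```julia
# Illustration only - not from either model's output.
row = (name = "Ada", age = 36)   # a NamedTuple standing in for a table row

get(row, :email, missing)        # -> missing (graceful fallback)
# getproperty(row, :email)       # -> throws: NamedTuple has no field :email
```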

It was a bit disappointing that o1 didn't follow its own advice on handling missing values and allowing user customization (perhaps because I had asked it to keep things simple at first?).

```julia
# Generated by OpenAI's o1 preview
# Missing value formatting
function format2str(::Missing, ::FormatStyle; kwargs...)
    return "missing"
end

# Nothing value formatting
function format2str(::Nothing, ::FormatStyle; kwargs...)
    return "nothing"
end
```

I did like Sonnet's conciseness:

```julia
# Generated by Anthropic's Claude 3.5 Sonnet
format2str(x::Missing, ::FormatStyle; kwargs...) = "missing"
format2str(x::Nothing, ::FormatStyle; kwargs...) = "nothing"
```
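Either set of definitions behaves the same under dispatch; a quick hypothetical usage sketch, assuming DefaultStyle() as in the earlier snippets:

```julia
# Hypothetical usage - relies on the format2str definitions shown above.
format2str(missing, DefaultStyle())   # -> "missing" (the ::Missing method is more specific)
format2str(nothing, DefaultStyle())   # -> "nothing"
```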

However, o1 did have moments of unexpected brilliance!

When asked to write unit tests, it not only provided comprehensive tests but also explained how to run them and suggested adding new dependencies to my Project.toml [extras], since it had included downstream behavior tests for DataFrames.jl - a package not even used in the original code!

```toml
# snippet of suggested Project.toml

[extras]
Test = "8dfed614-57bb-5a28-81f9-0d6897e040b8"
DataFrames = "a93c6f00-3ec4-538f-9b9a-9b6c0d9decc9"

[targets]
test = ["Test", "DataFrames"]
```

Of course, the UUIDs were wrong! :-) But at least the suggestion was crystal clear.
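To make that concrete, here is roughly the test layout the suggestion implies - my reconstruction, not o1's verbatim output, and it assumes the package exports format2str, format_table, and MarkdownStyle:

```julia
# test/runtests.jl - a reconstruction of the kind of layout o1 proposed, not its verbatim output.
using Test
using DataFrames   # available at test time via [extras]/[targets] above (with corrected UUIDs)
# plus `using YourPackage` (hypothetical name) for format2str, format_table, MarkdownStyle

@testset "format2str basics" begin
    @test format2str(missing) == "missing"
    @test format2str(nothing) == "nothing"
end

@testset "downstream behavior with DataFrames" begin
    df = DataFrame(a = [1, 2], b = ["x", "y"])
    out = format_table(eachrow(df), [:a, :b], MarkdownStyle())
    @test occursin("| a | b |", out)
    @test occursin("| 1 | x |", out)
end
```

With the [targets] entry in place, Pkg.test() (or ] test in the REPL) pulls in Test and DataFrames and runs this file.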

Conclusion

After my whirlwind tour of the o1 series, I'm left with mixed feelings. It's clear that o1 represents a step forward in certain areas, particularly in its ability to provide constructive critique and suggest improvements to existing code or text. The depth and thoughtfulness of its reviews are truly impressive.

However, the question remains: is it worth 4-5 times more than alternatives like Sonnet 3.5? For now, my answer is: it depends. I'll likely use o1 for challenging critique, review, editing, validation, or judgment tasks, but probably only for that initial step before continuing with other models.

The ideal workflow might be using o1 for planning and final validation, with Sonnet 3.5 handling the bulk of the execution. This approach could leverage o1's strengths while mitigating its cost.

It's not always easy to predict where o1 will excel, which adds to the challenge of deciding when to use it. However, for tasks requiring thoughtful analysis and improvement suggestions, it's definitely worth considering.

Looking ahead, I'm excited to see how o1 evolves. If OpenAI can maintain this level of insight while improving speed and reducing costs, it could become a game-changer for developers. For now, it's a powerful but niche tool in my development arsenal, best used strategically for maximum benefit.

CC BY-SA 4.0 Jan Siml. Last modified: December 09, 2024. Website built with Franklin.jl and the Julia programming language.