27 November 2023
Transform messy data columns into clear, understandable names, making datasets like ggplot2's mpg (cars) or Austin's Community Survey easier to manage and analyze.
As a data enthusiast, you've likely experienced the sinking feeling when opening a data file only to be greeted by a jungle of messy, unclear column names. It's a universal pain point — these unwieldy columns make everything from simple data transformations to creating plots unnecessarily complicated. But what if we could delegate this mundane task to GenAI?
Take the classic cars dataset (Link: ggplot mpg). Its columns might be 'clean' (lowercased), but they're cryptic enough to trip you up.
using DataFramesMeta, CSV
using PromptingTools
const PT = PromptingTools # alias used later for PT.MODEL_COSTS / PT.MODEL_ALIASES
## Load
df_cars = CSV.File("mpg.csv") |> DataFrame
names(df_cars)
11-element Vector{String}:
"manufacturer"
"model"
"displ"
"year"
"cyl"
"trans"
"drv"
"cty"
"hwy"
"fl"
"class"
Normally, you'd resort to regex or manual renaming, but GenAI offers a smarter solution. Using Julia's PromptingTools.jl, we asked GenAI to give these columns more intuitive names.
# aitemplates("Detail") # optional: check the available templates and their placeholders
msg = aigenerate(:DetailOrientedTask; task="This is a dataset about car consumptions. Make the columns more descriptive, snakecase, 2-3 words max. Return Julia vector", data=names(df_cars))
[ Info: Tokens: 168 @ Cost: $0.0003 in 1.4 seconds
AIMessage("["car_make", "car_model", "engine_size", "year", "num_cylinders", "transmission_type", "drive_type", "city_mpg", "highway_mpg", "fuel_type", "vehicle_class"]")
The result? It took 45 seconds end-to-end and not only did the columns become self-explanatory, but they also made the entire dataset more approachable. Concerned about the longer names? Don't be. Modern tools like autosuggest and double-click selection in the REPL make handling them a breeze (as long as the words are separated by underscores).
cols=["car_make", "car_model", "engine_size", "year", "num_cylinders", "transmission_type", "drive_type", "city_mpg", "highway_mpg", "fuel_type", "vehicle_class"]
df_cars = rename(df_cars, cols)
first(df_cars)
Row │ car_make car_model engine_size year num_cylinders transmission_type drive_type city_mpg highway_mpg fuel_type vehicle_class
│ String15 String31 Float64 Int64 Int64 String15 String1 Int64 Int64 String1 String15
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
No more "fl", "hwy", "displ", ...
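To see the payoff, here's a quick sketch (not part of the original workflow) of a summary that now reads naturally with the descriptive names; it assumes the renamed df_cars from above and loads Statistics for the mean:
using Statistics
# Average fuel economy by vehicle class, using the new, self-explanatory names
@chain df_cars begin
    groupby(:vehicle_class)
    @combine(:avg_city_mpg = mean(:city_mpg), :avg_highway_mpg = mean(:highway_mpg))
    sort(:avg_city_mpg, rev = true)
end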
Next up, let's tackle Austin's Community Survey dataset (link). Survey data is a notorious headache due to its plethora of long-form questions or statements used as column names. And with more than 250 such columns, it's not a task you'd want to tackle manually.
Let's start by scanning the column names to see what we're dealing with:
df_survey = CSV.File("Community_Survey_cItyofaustin.csv") |> DataFrame
first(names(df_survey), 8)
8-element Vector{String}:
"ID"
"Year"
"Method"
"The City of Austin as a place to live"
"The City of Austin as a place to work"
"The City of Austin as a place to raise children"
"The City of Austin as a place to retire"
"The City of Austin as a place where I feel welcome"
We could use the previous method and hope that the LLM generates all 250 columns in exactly the right order, but that's not a good idea. Instead, we'll use the aiextract function, which lets us specify the output format (no more regex post-processing to strip the chit-chat from the output!).
Let's define the output type (which fields, their types, whether they're optional, etc.) and provide it via the return_type keyword argument. Note: you can add instructions to the struct's docstring.
I used two column names as examples to test and tweak the instructions (= the prompt). It worked well on the 3rd iteration, using the more powerful GPT-4 Turbo model (aliased as gpt4t).
struct RenameColumn3 # simply increment the number if you iterate the struct definition
better_name::String
end
column_prompt = """# Instructions
You get a column name from the City of Austin survey data. Create a better, clean, and descriptive column name.
# Guidelines
- Better Name should be brief, descriptive, snakecase, 2-3 words max. IT MUST BE LOWERCASED.
- Do not mention that it's from Austin as it's obvious
- If some Question ID is included in the name (eg, "Q25 Some Topic"), include it in the Better Name (eg, q25_some_topic)
# Old Column Name
- {{old_column}}
"""
# We use double handlebar templating with {{old_column}}
# test it on a few columns and tweak the instructions
col = "Access to quality mental health care you can afford"
col = "Q25 - Share with the Mayor Topics"
msg = aiextract(column_prompt; old_column=col, return_type=RenameColumn3, model="gpt4t")
msg.content # we can access the string like this: msg.content.better_name (see the Struct above)
# [ Info: Tokens: 184 @ Cost: $0.002 in 0.8 seconds
# PromptingTools.DataMessage(RenameColumn3)
# RenameColumn3("q25_share_topics")
This works well.
Let's run it for each column in parallel with asyncmap (we'll also disable logging with verbose=false to avoid cluttering the REPL). With asyncmap, the API calls for all column names fire concurrently, so we don't have to wait long despite making ~250 calls to the "slow" GPT-4 Turbo model.
@time msgs = asyncmap(col -> aiextract(column_prompt; old_column=col, return_type=RenameColumn3, model="gpt4t", verbose=false), names(df_survey))
# 13.061459 seconds (414.83 k allocations: 33.739 MiB, 0.54% compilation time)
# 251-element Vector{PromptingTools.DataMessage{RenameColumn3}}:
# PromptingTools.DataMessage(RenameColumn3)
# ...
# Let's sum up the cost of each message
# PT.MODEL_COSTS and PT.MODEL_ALIASES are Dicts with the cost per 1K tokens and model alias to proper name mapping, respectively
cost = [sum(msg.tokens ./ 1000 .* PT.MODEL_COSTS[PT.MODEL_ALIASES["gpt4t"]]) for msg in msgs] |> sum # $0.50 !!
# Let's define a renaming dictionary
rename_dict = Dict(old_name => msg.content.better_name for (msg, old_name) in zip(msgs, names(df_survey)))
It looks pretty good:
rename_dict
Dict{String, String} with 251 entries:
"Which THREE items in Health and Environment do you th… => "top_three_health_environment_priorities"
"Overall quality of wastewater services provided by Au… => "wastewater_services_quality"
"Traffic flow on major highways (IH-35, Mopac Expy, US… => "major_highway_traffic_flow"
"Condition of major city streets (Congress Ave, Lamar,… => "major_streets_condition"
But when we try to apply it, we get an error:
rename(df_survey, rename_dict)
ERROR: ArgumentError: Tried renaming to :affordable_housing_access multiple times.
Let's find the duplicate!
[k => v for (k, v) in pairs(rename_dict) if v == "affordable_housing_access"]
# 2-element Vector{Pair{String, String}}:
# "Access to quality housing you can afford" => "affordable_housing_access"
# "Access to affordable quality housing" => "affordable_housing_access"
Oof, this is hard even for a human! Let's just add a "_2" suffix to the duplicates:
let visited_values = Set()
    for (k, v) in pairs(rename_dict)
        if v in visited_values
            # this new name is already taken -> disambiguate with a suffix
            rename_dict[k] = v * "_2"
        else
            push!(visited_values, v)
        end
    end
end
# Voila! It works
rename(df_survey, rename_dict)
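As an aside, the loop above only handles a name that appears twice. If you expected three or more collisions, a numbered-suffix variant would be more robust; here's a quick sketch (an alternative to the loop above, applied to the original, un-deduplicated dictionary):
let counts = Dict{String,Int}()
    for (old, new) in pairs(rename_dict)
        n = get(counts, new, 0) + 1
        counts[new] = n
        # the first occurrence keeps the plain name; later ones get _2, _3, ...
        n > 1 && (rename_dict[old] = string(new, "_", n))
    end
end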
A few practical lessons from this exercise:
- Never start by making all 250 API calls at once. Pick a few tricky examples and tune the prompt/setup until it works well for them. You'll save a lot of time and $$$ that way (AI engineering is all about iterating quickly and effectively).
- Provide the desired output format as a struct via the return_type keyword argument.
- The main argument of aiextract is the text to use for the extraction. You can add instructions there as well; format them into markdown-style sections, eg, "# Instructions\n\nDo ABC\n\n\n\n# Data\n\n<your_data>".
- You can use the docstring of the return_type struct to provide additional instructions for the individual fields. Docstrings are sent to the LLM as well.
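For illustration, here's a hypothetical RenameColumn4 (following the numbering convention above) with the field instructions carried in the docstring:
"Improved column name. `better_name` must be brief, descriptive, snakecase, lowercased, 2-3 words."
struct RenameColumn4
    better_name::String
end
# The docstring travels to the LLM together with the field names and types,
# so it acts as additional instructions for the extraction.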
It did cost us roughly $0.50, but it was worth it. We got the results after ~5 minutes of tinkering. How long would it have taken otherwise?
let cols = names(df_survey)
    # reading speed of ~900 chars/minute
    reading = length.(cols) |> sum |> x -> x / 900
    # typing speed of ~200 chars/minute (assume the new names are ~3x shorter)
    typing = length.(cols) |> sum |> x -> x / 3 / 200
    @info "Reading time: $(round(reading; digits=1)) minutes, typing time: $(round(typing; digits=1)) minutes"
end
# [ Info: Reading time: 17.6 minutes, typing time: 26.3 minutes
This simple estimate suggests ~45 minutes of just reading and typing, with no thinking time included. From personal experience, I know that it would take me at least 2 hours to do it manually at a comparable quality.
The best part? We didn't have to do any manual work, so we can now focus on the analysis rather than the data cleaning. By making column names more intuitive and datasets more user-friendly, GenAI turns what was once a tedious chore into a quick, automated process, bringing us one step closer to focusing on what truly matters in our projects.
For the curious minds:
Part of the problem with duplicates above was that we sent each column name separately, so the LLM never saw the other columns. What if we sent all the columns at once? Would it be faster? Would we still have to de-duplicate the results?
Let's find out in part 2 of this blog!