Example 4: Classify Documents
Simetimes you need to assign some specific labels to each document. You could use aiclassify
and process each document separately, but that would be costly. train_classifier
is a good alternative leveraging the embeddings in your document index
.
Necessary imports
using Downloads, CSV, DataFrames
import PlotlyJS, PlotlyDocumenter ## Only for the documentation, not needed for users!
using LLMTextAnalysis
Prepare the Data
For this tutorial, we will use the City of Austin's Community Survey.
We will pick one open-ended question. Download the survey data
Downloads.download(
"https://data.austintexas.gov/api/views/s2py-ceb7/rows.csv?accessType=DOWNLOAD",
joinpath(@__DIR__, "cityofaustin.csv"));
Read the survey data into a DataFrame
df = CSV.read(joinpath(@__DIR__, "cityofaustin.csv"), DataFrame);
Let's select one of the open-ended questions, eg,
col = "Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?"
docs = df[!, col] |> skipmissing |> collect;
Build the Index
Index the documents (ie, embed them)
index = build_index(docs)
DocIndex(Documents: 2933, PlotData: None, Topic Levels: None)
Classification
Sometimes you need to assign some specific labels to each document.
For these situations, LLMTextAnalysis
offers train_classifier
to train a classifier that will assign provided labels
to new documents based on their content.
The labels can be anything you want, but they should be descriptive of the content of the documents (there is a field labels_description
to be able to provide more verbose descriptions).
The resulting return type is a TrainedClassifier
and when you score
a document, you will get a score for each label (ie, a vector). If you score multiple documents, you'll get a matrix of scores, where each row is a document and each column is a label -> the best label is the one with the highest score in each row.
Tip 1: Be careful that if all the scores are similar, it means that the classifier is not very confident about the classification. Ie, look out for scores around 1/number_of_labels
.
Tip 2: When you provide a vector of labels, try to add some "catch all" category like "Other" or "Not sure" to catch the documents that don't fit any of the provided labels.
Classification Based on Labeled Examples
Let's create a few labels inspired by the automatic topic detection
labels = ["Improving traffic situation", "Taxes and public funding",
"Safety and community", "Other"]
4-element Vector{String}:
"Improving traffic situation"
"Taxes and public funding"
"Safety and community"
"Other"
Let's say we have labeled a few documents - ideally, you should have 5-10 examples for EACH label
docs_ids = [1, 2674, 4, 17, 23, 69, 2669, 6]
docs_labels = [1, 1, 2, 2, 3, 3, 4, 4]
8-element Vector{Int64}:
1
1
2
2
3
3
4
4
Train the classifier
cls = train_classifier(index, labels; docs_ids, docs_labels)
TrainedClassifier("Classifier with 4 labels", Docs: 8, Embeddings: OK, Coeffs: OK)
Get scores for all documents
scores = score(index, cls)
scores[1:3, :]
3×4 Matrix{Float32}:
0.968895 0.0153789 0.00994706 0.00577909
0.671424 0.100494 0.0905272 0.137554
0.260434 0.390128 0.208029 0.141408
Note: Watch out for scores around 1/number_of_labels
- it means the classifier is not very confident about the classification
Best label for each document
label_ids = argmax(scores, dims = 2) |> vec |> x -> map(i -> i[2], x)
best_labels = cls.labels[label_ids]
best_labels[1:3]
3-element Vector{String}:
"Improving traffic situation"
"Improving traffic situation"
"Taxes and public funding"
Or do it in one line with return_labels=true
best_labels = score(index, cls; return_labels = true)
best_labels[1:3]
3-element Vector{String}:
"Improving traffic situation"
"Improving traffic situation"
"Taxes and public funding"
Classification Without Labeled Examples
When we don't have any examples, we can ask an AI model to generate some potential examples for us. It might be less precise, but it can save us a lot of time.
Adding label descriptions will improve the quality of generated documents:
labels_description = [
"Survey responses around infrastructure, improving traffic situation and related",
"Decreasing taxes and giving more money to the community",
"Survey responses around Homelessness, general safety and community related topics",
"Any other topics like environment, education, governance, etc."]
4-element Vector{String}:
"Survey responses around infrastructure, improving traffic situation and related"
"Decreasing taxes and giving more money to the community"
"Survey responses around Homelessness, general safety and community related topics"
"Any other topics like environment, education, governance, etc."
Train the classifier - it will generate 20 document examples (5 for each label x 4 labels)
cls = train_classifier(index, labels; labels_description)
TrainedClassifier("Classifier with 4 labels", Docs: 20, Embeddings: OK, Coeffs: OK)
Get scores for all documents
scores = score(index, cls)
scores[1:3, :]
3×4 Matrix{Float32}:
0.871411 0.0425926 0.0259541 0.0600426
0.590365 0.218336 0.0204527 0.170847
0.133558 0.231185 0.109889 0.525368
Best label for each document
best_labels = score(index, cls; return_labels = true)
best_labels[1:3]
3-element Vector{String}:
"Improving traffic situation"
"Improving traffic situation"
"Other"
Adding Custom Topic Level to the Index
Let's say we want to add a custom topic level to the index. We can do it by providing the trained classifier cls
to the function build_clusters!
.
build_clusters!(index, cls; topic_level = "MyClusters")
DocIndex(Documents: 2933, PlotData: None, Topic Levels: MyClusters)
Note: If not topic_level
is provided, it will default to "Custom_1".
Check what topic_levels are available
topic_levels(index) |> keys
KeySet for a Dict{Union{Int64, AbstractString}, Vector{TopicMetadata}} with 1 entry. Keys:
"MyClusters"
Plotting
Whether you have auto-generated topics or custom topics, you can plot them with plot
by leveraging the keyword argument topic_level
.
Let's plot our clusters:
PlotlyJS.plot(index; topic_level = "MyClusters", title = "My Custom Clusters")
This page was generated using Literate.jl.