API Reference
LLMTextAnalysis.DocIndex
LLMTextAnalysis.NoStemmer
LLMTextAnalysis.TopicMetadata
LLMTextAnalysis.TopicTreeNode
LLMTextAnalysis.TrainedClassifier
LLMTextAnalysis.TrainedClassifier
LLMTextAnalysis.TrainedConcept
LLMTextAnalysis.TrainedConcept
LLMTextAnalysis.TrainedSpectrum
LLMTextAnalysis.TrainedSpectrum
LLMTextAnalysis.build_clusters!
LLMTextAnalysis.build_clusters!
LLMTextAnalysis.build_clusters!
LLMTextAnalysis.build_index
LLMTextAnalysis.build_keywords
LLMTextAnalysis.build_topic
LLMTextAnalysis.create_folds
LLMTextAnalysis.cross_validate_accuracy
LLMTextAnalysis.nunique
LLMTextAnalysis.prepare_plot!
LLMTextAnalysis.score
LLMTextAnalysis.score
LLMTextAnalysis.score
LLMTextAnalysis.topic_tree
LLMTextAnalysis.train!
LLMTextAnalysis.train!
LLMTextAnalysis.train!
LLMTextAnalysis.train_classifier
LLMTextAnalysis.train_concept
LLMTextAnalysis.train_spectrum
LLMTextAnalysis.DocIndex
— Type
DocIndex{T1<:AbstractString, T2<:AbstractMatrix} <: AbstractDocumentIndex
A struct for maintaining an index of documents, their embeddings, and related information.
Fields
- id::Symbol: Unique identifier for the document index.
- docs::Vector{T1}: Collection of documents.
- embeddings::Matrix{Float32}: Embeddings of the documents. Documents are columns.
- distances::Matrix{Float32}: Pairwise distances between document embeddings. Documents are columns.
- keywords_ids::T2: Sparse matrix representing keywords in documents. Keywords in keywords_vocab are rows, documents are columns.
- keywords_vocab::Vector{<:AbstractString}: Vocabulary of keywords.
- plot_data::Union{Nothing, Matrix{Float32}}: 2D embedding data for plotting. Rows are dimensions, columns are documents.
- clustering::Any: Results of clustering the documents.
- topic_levels::Dict{Union{AbstractString, Int}, Vector{TopicMetadata}}: Metadata for topics at different levels. Indexed by k = number of topics if autogenerated, or by the topic_level keyword if set manually via build_clusters! (eg, from a classifier).
Example
docs = ["Document 1 text", "Document 2 text"]
index = DocIndex(docs=docs, embeddings=rand(Float32, (10, 2)), distances=rand(Float32, (2, 2)))
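The column-per-document layout of embeddings and distances can be illustrated with plain arrays. The following is a minimal sketch, not package code: it assumes cosine distance (the metric the package uses is not stated here) and uses made-up 3-dimensional embeddings.

```julia
using LinearAlgebra

# Toy embeddings: 3 dimensions (rows) x 2 documents (columns)
embeddings = Float32[1 0;
                     0 1;
                     0 0]

# Normalize each column to unit length; cosine distance = 1 - cosine similarity
unit = embeddings ./ mapslices(norm, embeddings; dims = 1)
distances = 1 .- unit' * unit

# `distances` has one row and one column per document
n_docs = size(embeddings, 2)
```

With this layout, size(embeddings, 2) and both dimensions of distances equal the number of documents.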
LLMTextAnalysis.NoStemmer
— Type
NoStemmer
A dummy stemmer used as a workaround for bypassing stemming in keyword extraction.
Example
Snowball.stem(NoStemmer(), ["running", "jumps"]) # returns ["running", "jumps"]
LLMTextAnalysis.TopicMetadata
— Type
TopicMetadata <: AbstractTopicMetadata
A struct representing the metadata of a specific topic extracted from a collection of documents.
Fields
- index_id::Symbol: Identifier for the topic.
- topic_level::Union{AbstractString, Int}: The level of the topic in the hierarchy.
- topic_idx::Int: Index of the topic.
- label::AbstractString: Human-readable label of the topic. Defaults to "".
- summary::AbstractString: Brief summary of the topic. Defaults to "".
- docs_idx::Vector{Int}: Indices of documents belonging to this topic. Corresponds to positions in index.docs.
- center_doc_idx::Int: Index of the central document in this topic. Corresponds to a position in docs_idx (not index!)
- samples_doc_idx::Vector{Int}: Indices of representative documents. Corresponds to positions in docs_idx (not index!)
- keywords_idx::Vector{Int}: Indices of specific keywords associated with this topic. Corresponds to positions in index.keywords_vocab.
Example
topic = TopicMetadata(topic_level=1, topic_idx=5)
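Because center_doc_idx and samples_doc_idx point into docs_idx rather than into index.docs, retrieving the actual documents takes a two-step lookup. The sketch below uses plain vectors as hypothetical stand-ins for index.docs and the fields of one TopicMetadata.

```julia
# Hypothetical stand-ins for index.docs and the fields of one TopicMetadata
docs = ["apples", "bananas", "cars", "trains", "pears"]
docs_idx = [1, 2, 5]        # positions in `docs` that belong to the topic
center_doc_idx = 2          # position within `docs_idx`, not within `docs`
samples_doc_idx = [1, 3]    # positions within `docs_idx`

topic_docs = docs[docs_idx]                      # all documents in the topic
center_doc = docs[docs_idx[center_doc_idx]]      # two-step lookup
sample_docs = docs[docs_idx[samples_doc_idx]]
```

Indexing docs directly with center_doc_idx would happen to return the same document here only by coincidence; in general it returns the wrong one.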
LLMTextAnalysis.TopicTreeNode
— Type
TopicTreeNode
A node in the topic hierarchy of the index.
Fields
- topic::TopicMetadata: The metadata of the topic.
- total_docs::Int: The total number of documents in the index.
- children::Vector{TopicTreeNode}: Child nodes.
LLMTextAnalysis.TrainedClassifier
— Type
TrainedClassifier
The TrainedClassifier struct is used for representing and working with a trained classification model, ie, it selects the most appropriate label for a given document.
It encapsulates all the necessary information required to analyze and score documents based on their relevance to a specific concept.
Fields
- index_id: A unique identifier for the AbstractDocumentIndex associated with this concept.
- source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the concept model. Corresponds to index.docs if provided.
- concept: The specific concept (as a string) that this model is trained to analyze.
- docs: The collection of rewritten documents, which are modified to reflect the concept.
- embeddings: The embeddings of the rewritten documents, used for training the model. Columns are documents, rows are dimensions.
- coeffs: The coefficients of the trained logistic regression model. Maps to each dimension in embeddings.
Example
index = build_index(...)
# Training a classifier (the labels below are illustrative)
classifier = train_classifier(index, ["sports", "politics"])
# Using TrainedClassifier for scoring
scores = score(index, classifier)
# or use it as a functor: `scores = classifier(index)`
# Accessing the model details
println("Labels: ", classifier.labels)
println("Coefficients: ", classifier.coeffs)
println("Source Document IDs: ", classifier.source_doc_ids)
println("Training Documents: ", classifier.docs) # good for debugging if results are poor
LLMTextAnalysis.TrainedClassifier
— Method
(classifier::TrainedClassifier)(
index::AbstractDocumentIndex; check_index::Bool = true)
A method definition that allows a TrainedClassifier object to be called as a function to score documents in an index. This method delegates to the score function.
The score reflects how closely each document aligns with each label in the trained classifier (classifier.labels).
The resulting scores will be a matrix of probabilities for each document and each label.
Scores dimension: num_documents x num_labels, ie, position [1,3] would correspond to the probability of the first document corresponding to the 3rd label/class.
To pick the best label for each document, you can use argmax(scores, dims=2).
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- return_labels::Bool (optional): If true, returns the most probable labels instead of the scores. Defaults to false.
- check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the classifier training. Defaults to true.
Returns
- A matrix of scores in the range [0, 1], with one row per document in the index and one column per label.
Example
# Assuming `index` and `classifier` are predefined
scores = classifier(index)
Pick the highest scoring label for each document:
best_labels = score(index, classifier; return_labels = true)
This method provides a convenient and intuitive way to apply a trained classifier model to a document index for scoring.
LLMTextAnalysis.TrainedConcept
— Type
TrainedConcept
The TrainedConcept struct is used for representing and working with a trained concept model.
It encapsulates all the necessary information required to analyze and score documents based on their relevance to a specific concept.
Fields
- index_id: A unique identifier for the AbstractDocumentIndex associated with this concept.
- source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the concept model. Corresponds to index.docs.
- concept: The specific concept (as a string) that this model is trained to analyze.
- docs: The collection of rewritten documents, which are modified to reflect the concept.
- embeddings: The embeddings of the rewritten documents, used for training the model. Columns are documents, rows are dimensions.
- coeffs: The coefficients of the trained logistic regression model. Maps to each dimension in embeddings.
Example
index = build_index(...)
# Training a concept
concept = train_concept(index, "sustainability")
# Using TrainedConcept for scoring
scores = score(index, concept)
# or use it as a functor: `scores = concept(index)`
# Accessing the model details
println("Concept: ", concept.concept)
println("Coefficients: ", concept.coeffs)
println("Source Document IDs: ", concept.source_doc_ids)
println("Re-written Documents: ", concept.docs) # good for debugging if results are poor
LLMTextAnalysis.TrainedConcept
— Method
(concept::TrainedConcept)(index::AbstractDocumentIndex; check_index::Bool = true)
A method definition that allows a TrainedConcept object to be called as a function to score documents in an index. This method delegates to the score function.
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the concept training. Defaults to true.
Returns
- A vector of scores in the range [0, 1], each corresponding to a document in the index.
Example
# Assuming `index` and `concept` are predefined
scores = concept(index)
This method provides a convenient and intuitive way to apply a trained concept model to a document index for scoring, facilitating thematic analysis and concept relevance studies.
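The functor pattern described above is ordinary Julia: a method is attached to the type itself so that instances become callable. The sketch below uses hypothetical toy types (ToyIndex, ToyConcept, toy_score) and a placeholder linear scoring rule; it only illustrates the delegation and the index ID check, not the package's internals.

```julia
# Hypothetical toy types; the real structs carry more fields
struct ToyIndex
    id::Symbol
    embeddings::Matrix{Float64}   # rows are dimensions, columns are documents
end

struct ToyConcept
    index_id::Symbol
    coeffs::Vector{Float64}
end

function toy_score(index::ToyIndex, concept::ToyConcept; check_index::Bool = true)
    if check_index && index.id != concept.index_id
        throw(ArgumentError("index ID does not match the one used in training"))
    end
    # Placeholder scoring rule: one raw linear score per document
    return index.embeddings' * concept.coeffs
end

# Make the struct callable: `concept(index)` forwards to `toy_score`
(concept::ToyConcept)(index::ToyIndex; kwargs...) = toy_score(index, concept; kwargs...)

index = ToyIndex(:demo, [1.0 -1.0; 0.5 0.0])
concept = ToyConcept(:demo, [2.0, 0.0])
scores = concept(index)
```

Calling the toy concept on an index with a different id throws unless check_index = false is passed, which mirrors the purpose of the check_index keyword.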
LLMTextAnalysis.TrainedSpectrum
— Type
TrainedSpectrum
The TrainedSpectrum is used to score documents across a spectrum defined by two contrasting concepts.
It encapsulates the essential information required to evaluate and score documents based on their alignment with the specified spectrum.
Note: TrainedSpectrum supports functor behavior, allowing it to be used as a function to score documents in an AbstractDocumentIndex based on their alignment with the spectrum.
Fields
- index_id: A unique identifier for the AbstractDocumentIndex associated with the spectrum.
- source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the spectrum model. Corresponds to index.docs.
- spectrum: A tuple containing the two contrasting concepts (as strings) that define the spectrum.
- docs: A collection of rewritten documents, modified to align with the two ends of the spectrum.
- embeddings: The embeddings of the rewritten documents, used for training the model. Columns are documents, rows are dimensions.
- coeffs: Coefficients of the trained logistic regression model, corresponding to each dimension in embeddings.
Example
index = build_index(...)
# Create and train a spectrum model
spectrum = train_spectrum(index, ("innovation", "tradition"))
# Using TrainedSpectrum for scoring
scores = score(index, spectrum)
# or use it as a functor: `scores = spectrum(index)`
# Accessing the model details
println("Spectrum: ", spectrum.spectrum)
println("Coefficients: ", spectrum.coeffs)
println("Source Document IDs: ", spectrum.source_doc_ids)
println("Re-written Documents: ", spectrum.docs) # good for debugging if results are poor
LLMTextAnalysis.TrainedSpectrum
— Method
(spectrum::TrainedSpectrum)(index::AbstractDocumentIndex; check_index::Bool = true)
A method definition that allows a TrainedSpectrum object to be called as a function to score documents in an index. This method delegates to the score function.
The score reflects how closely each document aligns with each of the ends of the trained spectrum. Scores are left-to-right, ie, a score closer to 0 indicates a higher alignment to spectrum.spectrum[1] and a score closer to 1 indicates a higher alignment to spectrum.spectrum[2].
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the spectrum training. Defaults to true.
Returns
- A vector of scores in the range [0, 1], each corresponding to a document in the index.
Example
# Assuming `index` and `spectrum` are predefined
scores = spectrum(index)
This method provides a convenient and intuitive way to apply a trained spectrum model to a document index for scoring.
LLMTextAnalysis.build_clusters!
— Method
build_clusters!(index::AbstractDocumentIndex, cls::TrainedClassifier;
topic_level::AbstractString = "",
verbose::Bool = true, add_label::Bool = false, add_summary::Bool = false,
labeler_kwargs::NamedTuple = NamedTuple())
Builds topics based on the classifier's labels and labels all documents in the index with the highest probability label.
Example
# Assume `index` is already built and so is classifier `cls`
build_clusters!(index, cls; topic_level = "MyClusters")
# Check that our new topic level is available
index.topic_levels |> keys
LLMTextAnalysis.build_clusters!
— Method
build_clusters!(index::AbstractDocumentIndex, assignments::Vector{Int};
topic_level::AbstractString = "",
labels::Union{Vector{String}, Nothing} = nothing,
verbose::Bool = true, add_label::Bool = false, add_summary::Bool = false,
labeler_kwargs::NamedTuple = NamedTuple())
Builds custom topics based on the provided labels and assignments (vector of which topic each document in the index belongs to).
Arguments
- index: The document index.
- assignments: Vector of topic assignments for each document (eg, [2,3] -> first document will be assigned to topic 2, second document will be assigned to topic 3).
- topic_level: The name of the topic level to create. If not provided, it will be autogenerated (eg, Custom_1).
- labels: Vector of labels for each topic if known. Otherwise, can be generated if you set add_label=true.
- verbose: Flag to enable INFO logging.
- add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
- add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
- labeler_kwargs: Keyword arguments to pass to the LLM labeler. See ?build_topic for more details on available arguments.
Example
Simple example with two clusters:
assignments = ones(Int,length(index.docs)) # which cluster each document belongs to
assignments[1:5] .= 2 # first 5 documents belong to cluster 2 (note the broadcasted `.=`)
build_clusters!(index, assignments; topic_level="MyDualCluster", labels = ["Cluster 1", "Cluster 2"])
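To make the meaning of assignments concrete, the sketch below builds such a vector for a hypothetical list of seven documents and groups document positions by topic. It uses only Base Julia; docs stands in for index.docs and nothing here calls the package.

```julia
docs = ["a", "b", "c", "d", "e", "f", "g"]   # stand-in for index.docs
assignments = ones(Int, length(docs))        # everything starts in topic 1
assignments[1:5] .= 2                        # broadcasted assignment to a range

labels = ["Cluster 1", "Cluster 2"]

# Group document positions by topic ID; labels[t] names topic t
members = Dict(labels[t] => findall(==(t), assignments) for t in unique(assignments))
```

Each entry of assignments is a topic ID, so labels must contain one label per distinct ID, in ID order.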
LLMTextAnalysis.build_clusters!
— Method
build_clusters!(index::AbstractDocumentIndex; k::Union{Int, Nothing} = nothing,
h::Union{Float64, Nothing} = nothing,
verbose::Bool = true, add_label::Bool = true, add_summary::Bool = false,
labeler_kwargs::NamedTuple = NamedTuple(),
cluster_kwargs...)
Performs automatic clustering on the document index and builds topics at different levels.
Arguments
- index: The document index.
- k: Number of clusters to cut at.
- h: Height to cut the dendrogram at. See ?Clustering.hclust for more details.
- verbose: Flag to enable INFO logging.
- add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
- add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
- labeler_kwargs: Keyword arguments to pass to the LLM labeler. See ?build_topic for more details on available arguments.
- cluster_kwargs: All remaining arguments will be passed to Clustering.hclust. See ?Clustering.hclust for more details on available arguments.
Returns
- The updated index with clustering information and topic metadata.
Example
index = build_index(["Doc 1", "Doc 2"])
clustered_index = build_clusters!(index, k=2)
LLMTextAnalysis.build_index
— Method
build_index(docs::Vector{<:AbstractString}; verbose::Bool = true,
index_id::Symbol = gensym("DocIndex"), aiembed_kwargs::NamedTuple = NamedTuple(),
keyword_kwargs::NamedTuple = NamedTuple(), kwargs...)
Builds an index of the given documents, including their embeddings and extracted keywords.
Arguments
- docs: Collection of documents to index. If you have only one large document, consider splitting it into smaller chunks with PromptingTools.split_by_length.
- verbose: Flag to enable INFO logging.
- index_id: Identifier for the document index. Useful if there will be multiple indices.
- aiembed_kwargs: Additional arguments for PromptingTools.aiembed. See ?aiembed for more details.
- keyword_kwargs: Additional arguments for keyword extraction. See ?build_keywords for more details.
Returns
- An instance of DocIndex containing information about the documents, embeddings, keywords, etc.
Example
docs = ["First document text", "Second document text"]
index = build_index(docs)
LLMTextAnalysis.build_keywords
— Function
build_keywords(docs::Vector{<:AbstractString},
return_type::Type = String;
min_length::Int = 2,
stopwords::Vector{String} = stopwords(Languages.English()),
stemmer_language::Union{Nothing, String} = "english")
Extracts and returns keywords from a collection of documents.
Arguments
- docs: Collection of documents from which to extract keywords. If you have only one large document, consider splitting it into smaller chunks with PromptingTools.split_by_length.
- return_type: Element type of the returned keywords. Defaults to String.
- min_length: Minimum length of keywords to consider. Keywords shorter than this are dropped.
- stopwords: List of stopwords to exclude from keyword extraction. Defaults to English stopwords (stopwords(Languages.English())).
- stemmer_language: Language for stemming, if applicable. Set to nothing to disable stemming.
Returns
- A sparse matrix where each column represents a document and each row a keyword, weighted by its frequency.
- A vector of unique keywords across all documents.
Example
docs = ["Sample document text.", "Another document."]
keywords_ids, keywords_vocab = build_keywords(docs)
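The shape of the two return values (keywords as rows, documents as columns, plus a vocabulary vector) can be reproduced with the SparseArrays standard library. This is a simplified sketch under stated assumptions: the tokenize helper is hypothetical, there is no stemming or stopword removal, and raw counts are used as weights. It is not how build_keywords is implemented.

```julia
using SparseArrays

docs = ["Sample document text.", "Another document."]

# Naive tokenizer (assumption): lowercase, split on non-letters, drop short tokens
function tokenize(doc; min_length = 2)
    words = String.(split(lowercase(doc), r"[^a-z]+"; keepempty = false))
    return filter(w -> length(w) >= min_length, words)
end

tokens = tokenize.(docs)
vocab = sort(unique(reduce(vcat, tokens)))
row_of = Dict(word => i for (i, word) in enumerate(vocab))

# Collect (row, column, count) triplets; `sparse` sums duplicate entries
rows = Int[]; cols = Int[]; vals = Int[]
for (j, doc_tokens) in enumerate(tokens), word in doc_tokens
    push!(rows, row_of[word]); push!(cols, j); push!(vals, 1)
end
keywords_ids = sparse(rows, cols, vals, length(vocab), length(docs))
```

Row i of keywords_ids corresponds to vocab[i], and column j to docs[j], matching the orientation described for DocIndex.keywords_ids.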
LLMTextAnalysis.build_topic
— Method
build_topic(
index::AbstractDocumentIndex, assignments::Vector{Int}, topic_idx::Int;
topic_level::Union{AbstractString, Int} = nunique(assignments),
verbose::Bool = false, add_label::Bool = true, add_summary::Bool = false,
label_template::Union{Nothing, Symbol} = :TopicLabelerBasic,
label_default::AbstractString = "",
summary_template::Union{Nothing, Symbol} = :TopicSummarizerBasic,
summary_default::AbstractString = "",
num_samples::Int = 8, num_keywords::Int = 10,
cost_tracker::Union{Nothing, Threads.Atomic{Float64}} = nothing, aikwargs...)
Builds the metadata for a specific topic in the document index.
Arguments
- index: The document index.
- assignments: Vector of topic assignments for each document.
- topic_idx: Index of the topic to build metadata for.
Keyword Arguments
- topic_level: The level of the topic in the hierarchy. Corresponds to k::Int in build_clusters! or a String-based label for custom topics.
- verbose: Flag to enable INFO logging.
- add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
- add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
- label_default: Default label to use if no label is generated. It can be used to directly provide a label.
- summary_default: Default summary to use if no summary is generated. It can be used to directly provide a summary.
- label_template: The LLM template to use for topic labeling. See ?aitemplates for more details on templates.
- summary_template: The LLM template to use for topic summarization. See ?aitemplates for more details on templates.
- num_samples: Number of diverse samples to show to the LLM for each topic.
- num_keywords: Number of top keywords to show to the LLM for each topic.
- cost_tracker: An Atomic to track the cost of the LLM calls, if we trigger multiple calls asynchronously.
Returns
- TopicMetadata instance for the specified topic.
Example
index = build_index(["Document 1", "Document 2"])
assignments = [1, 1]
metadata = build_topic(index, assignments, 1)
LLMTextAnalysis.create_folds
— Method
create_folds(k::Int, data_size::Int)
Create k random folds from a dataset of size data_size.
Arguments
- k::Int: Number of folds to create.
- data_size::Int: Total number of observations in the dataset.
Returns
- Vector{SubArray}: A vector of k vectors, each containing indices for a fold.
Examples
# Create 4 folds from a dataset Xt
n_obs = size(Xt, 1)
folds = create_folds(4, n_obs)
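One plausible way to produce k random folds as index views is to shuffle the indices once and take strided slices. The helper below (toy_folds) is hypothetical and only illustrates the idea and the Vector{SubArray} return shape; the package's own implementation may differ.

```julia
using Random

# Hypothetical helper, not the package's implementation
function toy_folds(k::Int, data_size::Int; rng = Random.default_rng())
    perm = shuffle(rng, 1:data_size)
    # Fold i takes every k-th element of the shuffled indices, starting at i
    return [view(perm, i:k:data_size) for i in 1:k]
end

folds = toy_folds(4, 10; rng = MersenneTwister(42))
```

Every observation index lands in exactly one fold, and fold sizes differ by at most one.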
LLMTextAnalysis.cross_validate_accuracy
— Method
cross_validate_accuracy(X::AbstractMatrix{<:Number},
y::AbstractVector{<:Integer};
verbose::Bool = true,
k::Int = 4,
lambda::Real = 1e-5) -> Float64
Perform k-fold cross-validation on the dataset (X, y) using logistic regression and return the average classification accuracy.
Arguments
- X::AbstractMatrix: The feature matrix (observations x features).
- y::AbstractVector{<:Integer}: The target vector, with labels encoded as +1 or -1.
- verbose::Bool (optional): If true, prints the accuracy of each fold. Defaults to true.
- k::Int (optional): The number of folds for cross-validation. Defaults to 4.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-5.
Returns
- Float64: The average classification accuracy across all folds.
Example
acc = cross_validate_accuracy(Xt, y; k = 4, lambda = 1e-2)
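The procedure (split into k folds, fit on k-1 of them, measure accuracy on the held-out fold, average) can be sketched end to end with the standard library. Everything below is an assumption made for illustration: fit_logistic is a bare gradient-descent fit of an L2-regularized logistic loss without an intercept, toy_cv_accuracy is a hypothetical name, and the data is synthetic. The package's solver and fold construction may differ.

```julia
using Random, Statistics

sigmoid(z) = 1 / (1 + exp(-z))

# Plain gradient descent on the L2-regularized logistic loss, labels in {-1, +1}
function fit_logistic(X, y; lambda = 1e-5, iters = 200, lr = 0.1)
    w = zeros(size(X, 2))
    for _ in 1:iters
        margins = y .* (X * w)
        grad = -(X' * (y .* sigmoid.(-margins))) ./ length(y) .+ lambda .* w
        w .-= lr .* grad
    end
    return w
end

function toy_cv_accuracy(X, y; k = 4, lambda = 1e-5, rng = Random.default_rng())
    n = size(X, 1)
    perm = shuffle(rng, 1:n)
    accuracies = Float64[]
    for i in 1:k
        test_idx = perm[i:k:n]                  # held-out fold
        train_idx = setdiff(perm, test_idx)     # remaining observations
        w = fit_logistic(X[train_idx, :], y[train_idx]; lambda = lambda)
        predictions = ifelse.(X[test_idx, :] * w .>= 0, 1, -1)
        push!(accuracies, mean(predictions .== y[test_idx]))
    end
    return mean(accuracies)
end

rng = MersenneTwister(1)
X = vcat(randn(rng, 40, 2) .+ 3, randn(rng, 40, 2) .- 3)   # two well-separated blobs
y = vcat(fill(1, 40), fill(-1, 40))
acc = toy_cv_accuracy(X, y; k = 4, lambda = 1e-2, rng = rng)
```

Note that X is laid out as observations x features here, matching the argument description above, which is the transpose of the embeddings layout used in DocIndex.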
LLMTextAnalysis.nunique
— Method
Counts the number of unique elements in a vector.
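A behaviorally equivalent one-liner, shown only as an illustration (toy_nunique is a hypothetical name, and the package's implementation is not shown here):

```julia
# Hypothetical equivalent, shown only to illustrate the behavior
toy_nunique(v::AbstractVector) = length(unique(v))

n = toy_nunique([1, 1, 2, 3, 3])   # distinct values are 1, 2 and 3
```

This is the quantity build_topic uses as the default topic_level, ie, the number of distinct topic IDs in assignments.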
LLMTextAnalysis.prepare_plot!
— Method
prepare_plot!(index::AbstractDocumentIndex; verbose::Bool=true, kwargs...) -> AbstractDocumentIndex
Prepares the 2D UMAP plot data for a given document index.
Arguments
- index: The document index to prepare plot data for.
- verbose: Flag to enable INFO logging.
Returns
- The updated index with plot_data field populated.
Example
index = build_index(["Some text", "More text"])
prepared_index = prepare_plot!(index)
LLMTextAnalysis.score
— Method
score(index::AbstractDocumentIndex,
classifier::TrainedClassifier;
check_index::Bool = true)
Scores all documents in the provided index based on the TrainedClassifier.
The score reflects how closely each document aligns with each label in the trained classifier (classifier.labels).
The resulting scores will be a matrix of probabilities for each document and each label.
Scores dimension: num_documents x num_labels, ie, position [1,3] would correspond to the probability of the first document corresponding to the 3rd label/class.
To pick the best label for each document, you can use argmax(scores, dims=2).
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- classifier::TrainedClassifier: The trained classifier model used for scoring.
- return_labels::Bool (optional): If true, returns the most probable labels instead of the scores. Defaults to false.
- check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the classifier training. Defaults to true.
Returns
- A matrix of scores, each row corresponding to a document in the index and each column corresponding to the probability of that label.
Example
# Assuming `index` and `classifier` are predefined
scores = score(index, classifier)
Pick the highest scoring label for each document:
scores = score(index, classifier)
label_ids = argmax(scores, dims = 2) |> vec |> x -> map(i -> i[2], x)
best_labels = classifier.labels[label_ids]
Or, instead, you can simply provide return_labels=true to get the best labels directly:
score(index, classifier; return_labels = true)
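The argmax step is easy to get wrong because argmax(scores, dims = 2) returns CartesianIndex values rather than column numbers. The toy example below uses a made-up score matrix and made-up labels (stand-ins for the output of score and for classifier.labels) to show the conversion.

```julia
labels = ["sports", "politics", "science"]   # stand-in for classifier.labels

# Toy probabilities: 2 documents (rows) x 3 labels (columns)
scores = [0.1 0.2 0.7;
          0.6 0.3 0.1]

# argmax with dims = 2 returns one CartesianIndex per row; keep the column part
best = argmax(scores, dims = 2)
label_ids = [idx[2] for idx in vec(best)]
best_labels = labels[label_ids]
```

Here the first document is assigned the third label and the second document the first label.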
LLMTextAnalysis.score
— Method
score(index::AbstractDocumentIndex, concept::TrainedConcept; check_index::Bool = true)
Scores all documents in the provided index based on the TrainedConcept.
The score quantifies the relevance or alignment of each document with the trained concept, with a score closer to 1 indicating a higher relevance.
The function uses a sigmoid function to map the scores to a range between 0 and 1, providing a probability-like interpretation.
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- concept::TrainedConcept: The trained concept model used for scoring.
- check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the concept training. Defaults to true.
Returns
- A vector of scores, each corresponding to a document in the index, in the range [0, 1].
Example
# Assuming `index` and `concept` are predefined
scores = score(index, concept)
You can show the top 5 highest scoring documents for the concept:
index.docs[first(sortperm(scores, rev = true), 5)]
This function is particularly useful for analyzing the presence, intensity, or relevance of a specific concept within a collection of documents.
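The sigmoid mapping mentioned above can be worked through on toy numbers. This sketch assumes the score is a sigmoid of a plain linear combination of embedding dimensions and coefficients, with no intercept term; the values are made up and the package's exact computation is not shown in this reference.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Toy data: 2 dimensions (rows) x 3 documents (columns)
embeddings = [1.0  0.0  -1.0;
              0.5  0.0  -0.5]
coeffs = [2.0, 2.0]   # one coefficient per embedding dimension

# Linear score per document (3, 0 and -3 here), squashed into (0, 1)
scores = sigmoid.(embeddings' * coeffs)

# Rank documents from most to least relevant
ranking = sortperm(scores, rev = true)
```

A linear score of 0 maps to exactly 0.5, so values above 0.5 lean towards the concept and values below lean away from it.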
LLMTextAnalysis.score
— Method
score(index::AbstractDocumentIndex,
spectrum::TrainedSpectrum;
check_index::Bool = true)
Scores all documents in the provided index based on the TrainedSpectrum.
The score reflects how closely each document aligns with each of the ends of the trained spectrum. Scores are left-to-right, ie, a score closer to 0 indicates a higher alignment to spectrum.spectrum[1] and a score closer to 1 indicates a higher alignment to spectrum.spectrum[2].
Arguments
- index::AbstractDocumentIndex: The index containing the documents to be scored.
- spectrum::TrainedSpectrum: The trained spectrum model used for scoring.
- check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the spectrum training. Defaults to true.
Returns
- A vector of scores, each corresponding to a document in the index, in the range [0, 1].
Example
# Assuming `index` and `spectrum` are predefined
scores = score(index, spectrum)
You can show the 5 highest scoring documents, ie, those closest to the second end of the spectrum (spectrum.spectrum[2]):
index.docs[first(sortperm(scores, rev = true), 5)]
# Use rev=false if you want to see documents closest to spectrum 1 (opposite end)
This function is useful for ranking all documents along the chosen spectrum.
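Since low scores mean the first end of the spectrum and high scores the second, both ends can be read off the same score vector by flipping the sort order. The documents and scores below are made up for illustration and stand in for index.docs and the output of score.

```julia
docs = ["embrace new ideas", "keep the old ways", "a bit of both"]  # stand-in for index.docs
spectrum = ("innovation", "tradition")
scores = [0.1, 0.9, 0.5]   # made-up scores, one per document

# Highest scores first: closest to spectrum[2]
closest_to_second = docs[first(sortperm(scores, rev = true), 2)]
# Lowest scores first: closest to spectrum[1]
closest_to_first = docs[first(sortperm(scores), 2)]
```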
LLMTextAnalysis.topic_tree
— Method
topic_tree(
index::DocIndex, levels::AbstractVector{<:Union{Integer, AbstractString}};
sorted::Bool = true)
Builds a topic tree from the index for the provided levels. Levels must be present in the index, eg, run build_clusters! first.
Arguments
- index::DocIndex: The document index.
- levels::AbstractVector{<:Union{Integer, AbstractString}}: The levels to include in the tree, eg, [4, 10, 20] (they must be present in index.topic_levels).
- sorted::Bool: Whether to sort the children by the number of documents in each topic. Defaults to true.
Example
# Create topic tree for levels k=4, k=10, k=20
root = topic_tree(index, [4, 10, 20])
# Display it
print_tree(root)
# example output
# "All Documents (N: 10, Share: 100.0%, Level: root, Topic ID: 0)"
# ├─ "Topic1 (N: 5, Share: 50.0%, Level: 4, Topic ID: 1)"
# │ └─ "Topic1 (N: 5, Share: 50.0%, Level: 10, Topic ID: 1)"
# ...
# └─ "Topic2 (N: 5, Share: 50.0%, Level: 4, Topic ID: 2)"
# └─ "Topic2 (N: 5, Share: 50.0%, Level: 10, Topic ID: 2)"
# ...
LLMTextAnalysis.train!
— Method
train!(index::AbstractDocumentIndex,
classifier::TrainedClassifier;
verbose::Bool = true,
overwrite::Bool = false,
writer_template::Symbol = :TextWriterFromLabel,
lambda::Real = 1e-3, num_samples::Int = 5,
aigenerate_kwargs::NamedTuple = NamedTuple(),
aiembed_kwargs::NamedTuple = NamedTuple())
Refine or retrain a previously trained TrainedClassifier model.
This function can be used to update the classifier model with new data, adjust parameters, or completely retrain it.
See also: train_classifier, score
Arguments
- index::AbstractDocumentIndex: The document index containing the documents for analysis.
- classifier::TrainedClassifier: The trained classifier object to be refined or retrained.
- verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
- overwrite::Bool (optional): If true, existing training data in the classifier will be overwritten. Defaults to false.
- writer_template::Symbol (optional): The template used for writing synthetic documents. Defaults to :TextWriterFromLabel.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
- num_samples::Int (optional): The number of examples to generate for each topic label. Defaults to 5.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function.
Returns
- The updated TrainedClassifier object with refined or new training.
Example
# Assuming `index` and `classifier` are pre-existing objects
train!(index, classifier, verbose = true, overwrite = true)
This function allows for continuous improvement and adaptation of a classifier model to new data.
LLMTextAnalysis.train!
— Method
train!(index::AbstractDocumentIndex,
concept::TrainedConcept;
verbose::Bool = true,
overwrite::Bool = false,
rewriter_template::Symbol = :StatementRewriter,
lambda::Real = 1e-3, negatives_samples::Int = 1,
aigenerate_kwargs::NamedTuple = NamedTuple(),
aiembed_kwargs::NamedTuple = NamedTuple(),)
Refine or retrain a previously trained TrainedConcept model.
This function can be used to update the concept model with new data, adjust parameters, or completely retrain it.
See also: train_concept, score
Arguments
- index::AbstractDocumentIndex: The document index containing the documents for analysis.
- concept::TrainedConcept: The trained concept object to be refined or retrained.
- verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
- overwrite::Bool (optional): If true, existing training data in the concept will be overwritten. Defaults to false.
- rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
- negatives_samples::Int (optional): The number of negative examples to use for training per each positive sample. Defaults to 1.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function.
Returns
- The updated TrainedConcept object with refined or new training.
Example
# Assuming `index` and `concept` are pre-existing objects
concept = train!(index, concept, verbose = true, overwrite = true)
This function allows for continuous improvement and adaptation of a concept model to new data or analysis perspectives. It is particularly useful in dynamic environments where the underlying data or the concept of interest may evolve over time.
LLMTextAnalysis.train!
— Method
train!(index::AbstractDocumentIndex,
spectrum::TrainedSpectrum;
verbose::Bool = true,
overwrite::Bool = false,
rewriter_template::Symbol = :StatementRewriter,
lambda::Real = 1e-5,
aigenerate_kwargs::NamedTuple = NamedTuple(),
aiembed_kwargs::NamedTuple = NamedTuple(),)
Finish a partially trained Spectrum or retrain an existing one (with overwrite=true).
See also: train_spectrum, train_concept, score
Arguments
- index::AbstractDocumentIndex: The document index containing the documents to be analyzed.
- spectrum::TrainedSpectrum: The previously created spectrum object to be trained or retrained.
- verbose::Bool (optional): If true, prints logs during the process. Defaults to true.
- overwrite::Bool (optional): If true, existing training data in the spectrum will be overwritten. Defaults to false.
- rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-5. Reduce if your cross-validated accuracy is too low.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.
Returns
- The updated TrainedSpectrum object containing the trained model (coeffs), along with other relevant information like the rewritten documents (docs) and embeddings (embeddings).
Example
# Assuming `index` and `spectrum` are pre-existing objects
trained_spectrum = train!(index, spectrum, verbose = true, overwrite = true)
This function allows for iterative improvement of a spectrum model, adapting to new data or refinements in the analysis framework.
LLMTextAnalysis.train_classifier
— Methodtrain_classifier(index::AbstractDocumentIndex,
labels::AbstractVector{<:AbstractString};
docs_ids::AbstractVector{<:Integer} = Int[],
docs_labels::AbstractVector{<:Integer} = Int[],
labels_description::Union{Nothing, AbstractVector{<:AbstractString}} = nothing,
num_samples::Int = 5, verbose::Bool = true,
writer_template::Symbol = :TextWriterFromLabel,
lambda::Real = 1e-3,
aigenerate_kwargs::NamedTuple = NamedTuple(),
aiembed_kwargs::NamedTuple = NamedTuple())
Train a model to classify each document into one of several specific topics based on labels (detailed in labels_description).
If the user provides documents from the index and their corresponding labels (docs_ids and docs_labels, respectively), the model will be trained on those documents. Aim for a balanced dataset (all labels must be present) with a minimum of 5 documents per label (ideally more).
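Before training on hand-labeled documents, it can help to count how many documents each label has. The snippet below is a minimal sketch, not part of LLMTextAnalysis: the helper name label_counts is hypothetical, and the labels and docs_labels values are the same illustrative ones used in the Example section.

```julia
# Hypothetical helper (not part of LLMTextAnalysis): count labeled documents per label
label_counts(docs_labels, labels) = [count(==(i), docs_labels) for i in eachindex(labels)]

labels = ["Improving traffic situation", "Taxes and public funding",
    "Safety and community", "Other"]
docs_labels = [1, 1, 2, 2, 3, 3, 4, 4] # position of each document's label in `labels`

counts = label_counts(docs_labels, labels)

# Labels that have no documents at all (every label must be present)
missing_labels = labels[counts .== 0]
# Labels below the suggested minimum of 5 documents
underrepresented = labels[counts .< 5]

isempty(missing_labels) || @warn "Labels without any documents" missing_labels
isempty(underrepresented) || @warn "Labels with fewer than 5 documents" underrepresented
```

With only two documents per label, this toy set covers every label but falls short of the suggested minimum, so the second warning fires.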
Otherwise, we will first generate num_samples synthetic documents for each label in labels, ie, num_samples x length(labels) generated documents in total. If labels_description is provided, these descriptions are passed to the AI model to generate more diverse and relevant documents for each label (descriptions are more informative than one-word labels).
Under the hood, we train a multi-label classifier on top of the embeddings of the documents.
The resulting scores will be a matrix of probabilities for each document and each label. Scores dimension: num_documents x num_labels, ie, position [1,3] is the probability that the first document belongs to the 3rd label/class. To pick the best label for each document, you can use argmax(scores, dims=2).
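Note that argmax with dims=2 returns CartesianIndex values rather than plain label positions. The following sketch shows one way to turn them into label names; the score matrix and label names are made up for illustration.

```julia
# Toy score matrix: 2 documents (rows) x 3 labels (columns); each row sums to 1
scores = [0.1 0.7 0.2;
          0.6 0.3 0.1]
labels = ["A", "B", "C"]

# argmax with dims = 2 returns one CartesianIndex(row, column) per document
idx = argmax(scores, dims = 2)

# Keep only the column (label) position of each index
best_ids = vec(getindex.(idx, 2))
best_labels = labels[best_ids]
```

An equivalent form that avoids CartesianIndex altogether is [argmax(row) for row in eachrow(scores)].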
See also: score, train!, train_spectrum, train_concept
Arguments
- index::AbstractDocumentIndex: An index containing the documents to be analyzed.
- labels::AbstractVector{<:AbstractString}: A vector of labels to be used for training the classifier (documents will be assigned to one of these labels).
- docs_ids::AbstractVector{<:Integer} (optional): The IDs of the documents in the index to be used for training. Defaults to an empty vector, in which case synthetic documents are generated.
- docs_labels::AbstractVector{<:Integer} (optional): The labels corresponding to the documents in docs_ids, given as integer positions in labels. Defaults to an empty vector.
- labels_description::Union{Nothing, AbstractVector{<:AbstractString}} (optional): A vector of descriptions for each label. If provided, it will be used to generate more diverse and relevant documents for each label. Defaults to nothing.
- num_samples::Int (optional): The number of documents to generate for each label in labels. Defaults to 5.
- verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
- writer_template::Symbol (optional): The template used for writing synthetic documents. Defaults to :TextWriterFromLabel.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.
Returns
- A TrainedClassifier object containing the trained model, along with relevant information such as the generated documents (docs), embeddings (embeddings), and model coefficients (coeffs).
Example
Create a classifier for a set of labeled documents in our index (ie, we know the labels for some documents):
# Assuming `index` is an existing document index
# Provide the names of the topics and corresponding labeled documents
labels = ["Improving traffic situation", "Taxes and public funding",
"Safety and community", "Other"]
# Let's say we have labeled a few documents - ideally, you should have 5-10 examples for EACH label
docs_ids = [1, 2674, 4, 17, 23, 69, 2669, 6]
docs_labels = [1, 1, 2, 2, 3, 3, 4, 4] # what topic each doc belongs to
# Train the classifier
cls = train_classifier(index, labels; docs_ids, docs_labels)
# Score the documents in the index
score(index, cls) # or cls(index)
If you do not have any labeled documents, you can ask an AI model to generate some potential examples for you (num_samples for each topic/label). It helps to provide label descriptions to improve the quality of the generated documents:
# Assuming `index` is an existing document index
labels_description = [
"Survey responses around infrastructure, improving traffic situation and related",
"Decreasing taxes and giving more money to the community",
"Survey responses around Homelessness, general safety and community related topics",
"Any other topics like environment, education, governance, etc."]
# Train the classifier - it will generate 20 document examples (5 for each label x 4 labels)
cls = train_classifier(index, labels; labels_description, num_samples=5)
# Get scores for all documents
scores = score(index, cls)
# Get labels for all documents in the index
best_labels = score(index, cls; return_labels = true)
LLMTextAnalysis.train_concept — Method
train_concept(index::AbstractDocumentIndex,
    concept::String;
    num_samples::Int = 100, verbose::Bool = true,
    rewriter_template::Symbol = :StatementRewriter,
    lambda::Real = 1e-3, negatives_samples::Int = 1,
    aigenerate_kwargs::NamedTuple = NamedTuple(),
    aiembed_kwargs::NamedTuple = NamedTuple(),)
Train a model to identify and score a specific Concept (defined by the string concept) based on num_samples documents from index.
We effectively identify the "direction" in the embedding space that represents this concept and develop a model that can score our documents against it.
This function focuses on a single Concept, as opposed to a Spectrum (see train_spectrum), to gauge its presence, strength, or manifestations in the documents.
See also: train_spectrum, train!, score
Arguments
- index::AbstractDocumentIndex: An index containing the documents to be analyzed.
- concept::String: The concept to be analyzed within the documents.
- num_samples::Int (optional): The number of documents to sample from the index for training. Defaults to 100.
- verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
- rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
- lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
- negatives_samples::Int (optional): The number of negative examples to use for training per each positive sample. Defaults to 1.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.
Returns
- A TrainedConcept object containing the trained model, along with relevant information such as rewritten documents (docs), embeddings (embeddings), and model coefficients (coeffs).
Example
# Assuming `index` is an existing document index
my_concept = "sustainability"
concept_model = train_concept(index, my_concept)
Show the top 5 highest scoring documents for the concept:
scores = score(index, concept_model)
index.docs[first(sortperm(scores, rev = true), 5)]
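The sortperm idiom used above can be tried on plain vectors without an index or a trained model. The sketch below uses made-up documents and scores purely to illustrate how the top (and bottom) entries are selected.

```julia
# Toy documents and scores (made-up values, for illustration only)
docs = ["doc A", "doc B", "doc C", "doc D", "doc E", "doc F"]
scores = [0.12, 0.85, 0.40, 0.91, 0.05, 0.66]

# sortperm returns the positions that would sort the scores;
# rev = true puts the highest score first
order = sortperm(scores, rev = true)

# Keep the first 3 positions and use them to index the documents
top3 = docs[first(order, 3)]

# Without rev = true, the lowest-scoring documents come first
bottom3 = docs[first(sortperm(scores), 3)]
```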
You can customize the training by passing additional arguments to the AI generation and embedding functions. For example, you can specify the model to use for generation and how many samples to use:
concept = train_concept(index,
"action-oriented";
num_samples = 50,
aigenerate_kwargs = (; model = "gpt3t"))
This function leverages large language models to extract and analyze the presence and variations of a specific concept within a document corpus. It can be particularly useful in thematic studies, sentiment analysis, or trend identification in large collections of text.
For further analysis, you can inspect the rewritten documents and their embeddings:
# Concept-related rewritten documents
concept_model.docs
# Embeddings of the rewritten documents
concept_model.embeddings
LLMTextAnalysis.train_spectrum — Method
train_spectrum(index::AbstractDocumentIndex,
    spectrum::Tuple{String, String};
    num_samples::Int = 100, verbose::Bool = true,
    rewriter_template::Symbol = :StatementRewriter,
    lambda::Real = 1e-5,
    aigenerate_kwargs::NamedTuple = NamedTuple(),
    aiembed_kwargs::NamedTuple = NamedTuple(),)
Train a Spectrum, ie, a two-sided axis of polar opposite concepts.
We effectively identify the "directions" in the embedding space that represent the two concepts that you selected as the opposite ends of the spectrum.
Practically, it takes num_samples documents from index, rewrites them through the specified lenses (the two ends of the spectrum), embeds the rewritten documents, and finally trains a logistic regression model to classify the documents according to the spectrum.
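To build intuition for the last step, here is a minimal, self-contained sketch of fitting an L2-regularized logistic regression on toy 2-dimensional "embeddings" with labels -1 and +1. It is not the package's implementation: the function fit_logistic, the gradient-descent settings, and the data are all assumptions made for illustration. The learned weight vector plays the role of the "direction" separating the two ends of the spectrum.

```julia
# Logistic sigmoid
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical helper: L2-regularized logistic regression via plain gradient descent.
# X is samples x features, y holds labels in {-1, +1}.
function fit_logistic(X::AbstractMatrix, y::AbstractVector;
        lambda = 1e-5, lr = 0.5, iters = 500)
    n = size(X, 1)
    w = zeros(size(X, 2))
    for _ in 1:iters
        margins = y .* (X * w)
        # Gradient of mean(log(1 + exp(-margin))) plus the L2 penalty (lambda / 2) * |w|^2
        grad = -(X' * (y .* sigmoid.(-margins))) ./ n .+ lambda .* w
        w .-= lr .* grad
    end
    return w
end

# Toy data: first three rows stand for one end of the spectrum, last three for the other
X = [-1.0 0.2; -0.8 -0.1; -1.2 0.0; 1.0 0.1; 0.9 -0.2; 1.1 0.0]
y = [-1, -1, -1, 1, 1, 1]

w = fit_logistic(X, y)

# Projecting each sample onto the learned direction gives a signed score
projections = X * w
```

In this toy data the two groups differ mainly along the first feature, so the first weight dominates and the sign of each projection matches its label.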
See also: train!, train_concept, score
Arguments
- index::AbstractDocumentIndex: An index containing the documents to be analyzed. This index should have been previously built using build_index.
- spectrum::Tuple{String, String}: A pair of strings representing the two lenses through which the documents will be rewritten. For example, ("optimistic", "pessimistic") could be a spectrum.
- num_samples::Int (optional): The number of documents to sample from the index for training. Defaults to 100.
- verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
- rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
- lambda::Real (optional): Regularization parameter for the logistic regression. Defaults to 1e-5. Adjust if your cross-validated accuracy is too low.
- aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
- aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.
Returns
- A TrainedSpectrum object containing the trained model (coeffs), along with other relevant information such as the rewritten documents (docs) and their embeddings (embeddings).
Example
# Assuming `index` is an existing document index
my_spectrum = ("pessimistic", "optimistic")
spectrum = train_spectrum(index, my_spectrum)
Show the 5 highest-scoring documents for the second end of the spectrum (spectrum.spectrum[2], which is "optimistic" in this example):
scores = score(index, spectrum)
index.docs[first(sortperm(scores, rev = true), 5)]
# Use rev=false to get the highest scoring documents for spectrum 1 (opposite end)
You can customize the analysis by passing additional arguments to the AI generation and embedding functions. For example, you can specify the model to use for generation and how many samples to use:
spectrum = train_spectrum(index,
("forward-looking", "dwelling in the past");
num_samples = 50, aigenerate_kwargs = (; model = "gpt3t"))
This function utilizes large language models to rewrite and analyze the text, providing insights based on the specified spectrum. The output includes embeddings and a model capable of projecting new documents onto this spectrum for analysis.
For troubleshooting, you can fit the model manually and inspect the accuracy:
X = spectrum.embeddings'
# First half is spectrum 1, second half is spectrum 2
y = vcat(-1ones(length(spectrum.source_doc_ids)), ones(length(spectrum.source_doc_ids))) .|> Int
accuracy = cross_validate_accuracy(X, y; k = 4, lambda = 1e-8)
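The shapes involved in this manual fit can be checked on toy data. The sketch below uses a made-up embeddings matrix and made-up source_doc_ids as stand-ins for the fields of a trained spectrum; it only illustrates the transpose (documents become rows) and the label layout described in the comment above (first half is one end, second half the other).

```julia
# Toy stand-in for `spectrum.embeddings`: 3 dimensions x 4 rewritten documents (columns)
embeddings = Float32[0.1 0.2 0.9 0.8;
                     0.5 0.4 0.1 0.2;
                     0.3 0.3 0.3 0.3]
# Toy stand-in for `spectrum.source_doc_ids`: 2 source documents, each rewritten once per end
source_doc_ids = [11, 42]

# Transpose so that documents become rows (samples x features)
X = embeddings'

n = length(source_doc_ids)
# First n rewritten documents belong to end 1 (-1), the next n to end 2 (+1)
y = vcat(fill(-1, n), fill(1, n))
```

The fill-based construction yields the same integer labels as the vcat(-1ones(n), ones(n)) .|> Int expression.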
Or explore the source documents and re-written documents:
# source documents
index.docs[spectrum.source_doc_ids]
# re-written documents
spectrum.docs