API Reference

LLMTextAnalysis.DocIndex - Type
DocIndex{T1<:AbstractString, T2<:AbstractMatrix} <: AbstractDocumentIndex

A struct for maintaining an index of documents, their embeddings, and related information.

Fields

  • id::Symbol: Unique identifier for the document index.
  • docs::Vector{T1}: Collection of documents.
  • embeddings::Matrix{Float32}: Embeddings of the documents. Documents are columns.
  • distances::Matrix{Float32}: Pairwise distances between document embeddings. Documents are columns.
  • keywords_ids::T2: Sparse matrix representing keywords in documents. Keywords in keywords_vocab are rows, documents are columns.
  • keywords_vocab::Vector{<:AbstractString}: Vocabulary of keywords.
  • plot_data::Union{Nothing, Matrix{Float32}}: 2D embedding data for plotting. Rows are dimensions, columns are documents.
  • clustering::Any: Results of clustering the documents.
  • topic_levels::Dict{Union{AbstractString, Int}, Vector{TopicMetadata}}: Metadata for topics at different levels. Indexed by k = number of topics if autogenerated or via topic_level keyword if manually set via build_clusters! (eg, from a classifier).
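
The layout conventions above (documents as columns) can be illustrated with plain arrays. The following is a minimal sketch, not the library's implementation: the toy embeddings are made up, and cosine distance is an assumption, since this section does not state which distance metric the index uses.

```julia
using LinearAlgebra

# Toy embeddings: 4 dimensions (rows) x 3 documents (columns)
embeddings = Float32[1 0 1;
                     0 1 1;
                     0 0 0;
                     0 0 0]

# Scale every column (document) to unit length
column_norms = [norm(col) for col in eachcol(embeddings)]
normalized = embeddings ./ column_norms'

# Pairwise cosine distances (assumed metric); entry [i, j] compares documents i and j
distances = 1 .- normalized' * normalized

println(size(distances))   # one row and one column per document
```

Documents 1 and 2 share no dimensions, so their distance is 1, while document 3 overlaps with both and sits closer to each of them.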

Example

docs = ["Document 1 text", "Document 2 text"]
index = DocIndex(docs=docs, embeddings=rand(Float32, (10, 2)), distances=rand(Float32, (2, 2)))
source
LLMTextAnalysis.NoStemmer - Type
NoStemmer

A dummy stemmer used as a workaround for bypassing stemming in keyword extraction.

Example

Snowball.stem(NoStemmer(), ["running", "jumps"]) # returns ["running", "jumps"]
source
LLMTextAnalysis.TopicMetadata - Type
TopicMetadata <: AbstractTopicMetadata

A struct representing the metadata of a specific topic extracted from a collection of documents.

Fields

  • index_id::Symbol: Identifier for the topic.
  • topic_level::Union{AbstractString, Int}: The level of the topic in the hierarchy.
  • topic_idx::Int: Index of the topic.
  • label::AbstractString: Human-readable label of the topic. Defaults to "".
  • summary::AbstractString: Brief summary of the topic. Defaults to "".
  • docs_idx::Vector{Int}: Indices of documents belonging to this topic. Corresponds to positions in index.docs.
  • center_doc_idx::Int: Index of the central document in this topic. Corresponds to a position in docs_idx (not index!)
  • samples_doc_idx::Vector{Int}: Indices of representative documents. Corresponds to positions in docs_idx (not index!)
  • keywords_idx::Vector{Int}: Indices of specific keywords associated with this topic. Corresponds to positions in index.keywords_vocab.
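
The two levels of indexing described above are easy to mix up: docs_idx points into index.docs, while center_doc_idx and samples_doc_idx point into docs_idx. A minimal sketch with plain variables standing in for the struct fields (the documents and index values are made up for illustration):

```julia
# Stand-in for index.docs
docs = ["doc A", "doc B", "doc C", "doc D", "doc E"]

# Stand-ins for the TopicMetadata fields
docs_idx = [2, 4, 5]       # positions in `docs` that belong to this topic
center_doc_idx = 2         # position in `docs_idx`, not in `docs`
samples_doc_idx = [1, 3]   # positions in `docs_idx`, not in `docs`

# Resolve through `docs_idx` first, then into `docs`
center_doc = docs[docs_idx[center_doc_idx]]
sample_docs = docs[docs_idx[samples_doc_idx]]

println(center_doc)    # doc D
println(sample_docs)   # ["doc B", "doc E"]
```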

Example

topic = TopicMetadata(topic_level=1, topic_idx=5)
source
LLMTextAnalysis.TopicTreeNode - Type
TopicTreeNode

A node in the topic hierarchy of the index.

Fields

  • topic::TopicMetadata: The metadata of the topic
  • total_docs::Int: The total number of documents in the index
  • children::Vector{TopicTreeNode}: Children nodes
source
LLMTextAnalysis.TrainedClassifier - Type
TrainedClassifier

The TrainedClassifier struct is used for representing and working with a trained classification model, ie, it selects the most appropriate label for a given document.

It encapsulates all the information required to score documents against each of the classifier's labels.

Fields

  • index_id: A unique identifier for the AbstractDocumentIndex associated with this classifier.
  • source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the classifier. Corresponds to index.docs if provided.
  • labels: The labels (as strings) that this model is trained to assign to documents.
  • docs: The collection of training documents (synthetic documents are generated for each label when no source documents are provided).
  • embeddings: The embeddings of the training documents. Columns are documents, rows are dimensions.
  • coeffs: The coefficients of the trained logistic regression model. Maps to each dimension in embeddings.

Example

index = build_index(...)

# Training a classifier (the labels are illustrative)
classifier = train_classifier(index, ["sports", "politics", "science"])

# Using TrainedClassifier for scoring
scores = score(index, classifier)
# or use it as a functor: `scores = classifier(index)`

# Accessing the model details
println("Labels: ", classifier.labels)
println("Coefficients: ", classifier.coeffs)
println("Source Document IDs: ", classifier.source_doc_ids)
println("Training Documents: ", classifier.docs) # good for debugging if results are poor
source
LLMTextAnalysis.TrainedClassifier - Method
(classifier::TrainedClassifier)(
    index::AbstractDocumentIndex; check_index::Bool = true)

A method definition that allows a TrainedClassifier object to be called as a function to score documents in an index. This method delegates to the score function.

The score reflects how closely each document aligns to each label in the trained classifier (classifier.labels).

The resulting scores will be a matrix of probabilities for each document and each label.

Scores dimension: num_documents x num_labels, ie, position [1,3] would correspond to the probability of the first document corresponding to the 3rd label/class.

To pick the best label for each document, you can use argmax(scores, dims=2).
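
As a small illustration of the argmax step, the sketch below uses a hand-written score matrix in place of real classifier output; the `scores` values and `labels` are made up. Note that `argmax` with `dims=2` returns `CartesianIndex` values, so the column (label) position has to be extracted from each one.

```julia
# Made-up scores: 2 documents (rows) x 3 labels (columns)
scores = [0.7 0.2 0.1;
          0.1 0.3 0.6]
labels = ["sports", "politics", "science"]

# One CartesianIndex per row, pointing at the highest-scoring column
best = argmax(scores, dims = 2)

# Keep only the column component, ie, the label position
label_ids = [idx[2] for idx in vec(best)]
best_labels = labels[label_ids]

println(best_labels)   # ["sports", "science"]
```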

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • return_labels::Bool (optional): If true, returns the most probable labels instead of the scores. Defaults to false.
  • check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the classifier training. Defaults to true.

Returns

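  • A matrix of scores in the range [0, 1] with dimensions num_documents x num_labels; each row corresponds to a document in the index and each column to a label. If return_labels=true, the most probable label for each document is returned instead.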

Example

# Assuming `index` and `classifier` are predefined
scores = classifier(index)

Pick the highest scoring label for each document:

best_labels = score(index, classifier; return_labels = true)

This method provides a convenient and intuitive way to apply a trained classifier model to a document index for scoring.

source
LLMTextAnalysis.TrainedConcept - Type
TrainedConcept

The TrainedConcept struct is used for representing and working with a trained concept model.

It encapsulates all the necessary information required to analyze and score documents based on their relevance to a specific concept.

Fields

  • index_id: A unique identifier for the AbstractDocumentIndex associated with this concept.
  • source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the concept model. Corresponds to index.docs
  • concept: The specific concept (as a string) that this model is trained to analyze.
  • docs: The collection of rewritten documents, which are modified to reflect the concept.
  • embeddings: The embeddings of the rewritten documents, used for training the model. Columns are documents, rows are dimensions.
  • coeffs: The coefficients of the trained logistic regression model. Maps to each dimension in embeddings.

Example

index = build_index(...)

# Training a concept
concept = train_concept(index, "sustainability")

# Using TrainedConcept for scoring
scores = score(index, concept)
# or use it as a functor: `scores = concept(index)`

# Accessing the model details
println("Concept: ", concept.concept)
println("Coefficients: ", concept.coeffs)
println("Source Document IDs: ", concept.source_doc_ids)
println("Re-written Documents: ", concept.docs) # good for debugging if results are poor
source
LLMTextAnalysis.TrainedConcept - Method
(concept::TrainedConcept)(index::AbstractDocumentIndex; check_index::Bool = true)

A method definition that allows a TrainedConcept object to be called as a function to score documents in an index. This method delegates to the score function.

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the concept training. Defaults to true.

Returns

  • A vector of scores in the range [0, 1], each corresponding to a document in the index.

Example

# Assuming `index` and `concept` are predefined
scores = concept(index)

This method provides a convenient and intuitive way to apply a trained concept model to a document index for scoring, facilitating thematic analysis and concept relevance studies.

source
LLMTextAnalysis.TrainedSpectrum - Type
TrainedSpectrum

The TrainedSpectrum is used to score documents across a spectrum defined by two contrasting concepts.

It encapsulates the essential information required to evaluate and score documents based on their alignment with the specified spectrum.

Note: TrainedSpectrum supports functor behavior, allowing it to be used as a function to score documents in an AbstractDocumentIndex based on their alignment with the spectrum.

Fields

  • index_id: A unique identifier for the AbstractDocumentIndex associated with the spectrum.
  • source_doc_ids: Indices of the documents from the AbstractDocumentIndex used for training the spectrum model. Corresponds to index.docs.
  • spectrum: A tuple containing the two contrasting concepts (as strings) that define the spectrum.
  • docs: A collection of rewritten documents, modified to align with the two ends of the spectrum.
  • embeddings: The embeddings of the rewritten documents, used for training the model. Columns are documents, rows are dimensions.
  • coeffs: Coefficients of the trained logistic regression model, corresponding to each dimension in embeddings.

Example


index = build_index(...)

# Create and train a spectrum model
spectrum = train_spectrum(index, ("innovation", "tradition"))

# Using TrainedSpectrum for scoring
scores = score(index, spectrum)
# or use it as a functor: `scores = spectrum(index)`

# Accessing the model details
println("Spectrum: ", spectrum.spectrum)
println("Coefficients: ", spectrum.coeffs)
println("Source Document IDs: ", spectrum.source_doc_ids)
println("Re-written Documents: ", spectrum.docs) # good for debugging if results are poor
source
LLMTextAnalysis.TrainedSpectrum - Method
(spectrum::TrainedSpectrum)(index::AbstractDocumentIndex; check_index::Bool = true)

A method definition that allows a TrainedSpectrum object to be called as a function to score documents in an index. This method delegates to the score function.

The score reflects how closely each document aligns with each of the ends of the trained spectrum. Scores are left-to-right, ie, a score closer to 0 indicates a higher alignment to spectrum.spectrum[1] and a score closer to 1 indicates a higher alignment to spectrum.spectrum[2].

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • check_index::Bool (optional): If true, performs a check to ensure that the index ID matches the one used in the spectrum training. Defaults to true.

Returns

  • A vector of scores in the range [0, 1], each corresponding to a document in the index.

Example

# Assuming `index` and `spectrum` are predefined
scores = spectrum(index)

This method provides a convenient and intuitive way to apply a trained spectrum model to a document index for scoring.

source
LLMTextAnalysis.build_clusters! - Method
build_clusters!(index::AbstractDocumentIndex, cls::TrainedClassifier;
    topic_level::AbstractString = "",
    verbose::Bool = true, add_label::Bool = false, add_summary::Bool = false,
    labeler_kwargs::NamedTuple = NamedTuple())

Builds topics based on the classifier's labels and labels all documents in the index with the highest probability label.

Example

# Assume `index` is already built and so is classifier `cls`
build_clusters!(index, cls; topic_level = "MyClusters")

# Check that our new topic level is available
topic_levels(index) |> keys
source
LLMTextAnalysis.build_clusters! - Method
build_clusters!(index::AbstractDocumentIndex, assignments::Vector{Int};
    topic_level::AbstractString = "",
    labels::Union{Vector{String}, Nothing} = nothing,
    verbose::Bool = true, add_label::Bool = false, add_summary::Bool = false,
    labeler_kwargs::NamedTuple = NamedTuple())

Builds custom topics based on the provided labels and assignments (vector of which topic each document index belongs to).

Arguments

  • index: The document index.
  • assignments: Vector of topic assignments for each document (eg, [2,3] -> first document will be assigned to topic 2, second document will be assigned to topic 3).
  • topic_level: The name of the topic level to create. If not provided, it will be autogenerated (eg, Custom_1).
  • labels: Vector of labels for each topic if known. Otherwise, can be generated if you set add_label=true.
  • verbose: Flag to enable INFO logging.
  • add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
  • add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
  • labeler_kwargs: Keyword arguments to pass to the LLM labeler. See ?build_topic for more details on available arguments.
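
To make the assignments format concrete, the sketch below groups document positions by topic using only base Julia. It shows how an assignments vector can be read, not how build_clusters! processes it internally; the values are made up.

```julia
# One entry per document: document 1 -> topic 2, document 2 -> topic 3, ...
assignments = [2, 3, 2, 1]

# Topic ids that actually occur, in ascending order
topic_ids = sort(unique(assignments))

# For each topic, the positions of the documents assigned to it
docs_by_topic = Dict(t => findall(==(t), assignments) for t in topic_ids)

for t in topic_ids
    println("Topic ", t, ": documents ", docs_by_topic[t])
end
```

If you pass labels, it is presumably safest to provide one label per topic id in this ascending order.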

Example

Simple example with two clusters:

assignments = ones(Int,length(index.docs)) # which cluster each document belongs to
assignments[1:5] .= 2 # first 5 documents belong to cluster 2 (broadcast assignment is required for a range)

build_clusters!(index, assignments; topic_level="MyDualCluster", labels = ["Cluster 1", "Cluster 2"])
source
LLMTextAnalysis.build_clusters! - Method
build_clusters!(index::AbstractDocumentIndex; k::Union{Int, Nothing} = nothing,
    h::Union{Float64, Nothing} = nothing,
    verbose::Bool = true, add_label::Bool = true, add_summary::Bool = false,
    labeler_kwargs::NamedTuple = NamedTuple(),
    cluster_kwargs...)

Performs automatic clustering on the document index and builds topics at different levels.

Arguments

  • index: The document index.
  • k: Number of clusters to cut at.
  • h: Height to cut the dendrogram at. See ?Clustering.hclust for more details.
  • verbose: Flag to enable INFO logging.
  • add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
  • add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
  • labeler_kwargs: Keyword arguments to pass to the LLM labeler. See ?build_topic for more details on available arguments.
  • cluster_kwargs: All remaining arguments will be passed to Clustering.hclust. See ?Clustering.hclust for more details on available arguments.

Returns

  • The updated index with clustering information and topic metadata.

Example

index = build_index(["Doc 1", "Doc 2"])
clustered_index = build_clusters!(index, k=2)
source
LLMTextAnalysis.build_index - Method
build_index(docs::Vector{<:AbstractString}; verbose::Bool = true,
    index_id::Symbol = gensym("DocIndex"), aiembed_kwargs::NamedTuple = NamedTuple(),
    keyword_kwargs::NamedTuple = NamedTuple(), kwargs...)

Builds an index of the given documents, including their embeddings and extracted keywords.

Arguments

  • docs: Collection of documents to index. If you have only one large document, consider splitting it into smaller chunks with PromptingTools.split_by_length.
  • verbose: Flag to enable INFO logging.
  • index_id: Identifier for the document index. Useful if there will be multiple indices.
  • aiembed_kwargs: Additional arguments for PromptingTools.aiembed. See ?aiembed for more details.
  • keyword_kwargs: Additional arguments for keyword extraction. See ?build_keywords for more details.

Returns

  • An instance of DocIndex containing information about the documents, embeddings, keywords, etc.

Example

docs = ["First document text", "Second document text"]
index = build_index(docs)
source
LLMTextAnalysis.build_keywords - Function
build_keywords(docs::Vector{<:AbstractString},
    return_type::Type = String;
    min_length::Int = 2,
    stopwords::Vector{String} = stopwords(Languages.English()),
    stemmer_language::Union{Nothing, String} = "english")

Extracts and returns keywords from a collection of documents.

Arguments

  • docs: Collection of documents from which to extract keywords. If you have only one large document, consider splitting it into smaller chunks with PromptingTools.split_by_length.
  • return_type: Element type of the returned keywords. Defaults to String.
  • min_length: Minimum length of keywords to consider. Will be dropped if they are shorter than this.
  • stopwords: List of stopwords to exclude from keyword extraction. Defaults to English stopwords (stopwords(Languages.English())).
  • stemmer_language: Language for stemming, if applicable. Set to nothing to disable stemming.

Returns

  • A sparse matrix where each column represents a document and each row a keyword, weighted by its frequency.
  • A vector of unique keywords across all documents.
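
The shape of the two return values can be sketched with the SparseArrays standard library. This is a simplified stand-in, not what build_keywords does: it skips stopword removal, stemming and min_length filtering, and uses raw counts as weights, which may differ from the library's weighting.

```julia
using SparseArrays

docs = ["red apple red", "green apple"]

# Naive whitespace tokenization (the real function does more preprocessing)
tokens = [String.(split(lowercase(d))) for d in docs]

# Vocabulary: unique keywords across all documents
vocab = sort(unique(reduce(vcat, tokens)))
row_of = Dict(word => i for (i, word) in enumerate(vocab))

# Collect (keyword row, document column) pairs
rows = Int[]
cols = Int[]
for (j, doc_tokens) in enumerate(tokens), word in doc_tokens
    push!(rows, row_of[word])
    push!(cols, j)
end

# Duplicate (row, column) pairs are summed, which yields per-document counts
keywords_ids = sparse(rows, cols, ones(Int, length(rows)), length(vocab), length(docs))

println(vocab)                # ["apple", "green", "red"]
println(size(keywords_ids))   # (3, 2): keywords x documents
```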

Example

docs = ["Sample document text.", "Another document."]
keywords_ids, keywords_vocab = build_keywords(docs)
source
LLMTextAnalysis.build_topic - Method
build_topic(
    index::AbstractDocumentIndex, assignments::Vector{Int}, topic_idx::Int;
    topic_level::Union{AbstractString, Int} = nunique(assignments),
    verbose::Bool = false, add_label::Bool = true, add_summary::Bool = false,
    label_template::Union{Nothing, Symbol} = :TopicLabelerBasic,
    label_default::AbstractString = "",
    summary_template::Union{Nothing, Symbol} = :TopicSummarizerBasic,
    summary_default::AbstractString = "",
    num_samples::Int = 8, num_keywords::Int = 10,
    cost_tracker::Union{Nothing, Threads.Atomic{Float64}} = nothing, aikwargs...)

Builds the metadata for a specific topic in the document index.

Arguments

  • index: The document index.
  • assignments: Vector of topic assignments for each document.
  • topic_idx: Index of the topic to build metadata for.

Keyword Arguments

  • topic_level: The level of the topic in the hierarchy. Corresponds to k::Int in build_clusters! or a String-based label for custom topics.
  • verbose: Flag to enable INFO logging.
  • add_label: Flag to enable topic labeling, ie, call LLM to generate topic label.
  • add_summary: Flag to enable topic summarization, ie, call LLM to generate topic summary.
  • label_default: Default label to use if no label is generated. It can be used to directly provide a label.
  • summary_default: Default summary to use if no summary is generated. It can be used to directly provide a summary.
  • label_template: The LLM template to use for topic labeling. See ?aitemplates for more details on templates.
  • summary_template: The LLM template to use for topic summarization. See ?aitemplates for more details on templates.
  • num_samples: Number of diverse samples to show to the LLM for each topic.
  • num_keywords: Number of top keywords to show to the LLM for each topic.
  • cost_tracker: An Atomic to track the cost of the LLM calls, if we trigger multiple calls asynchronously.

Returns

  • TopicMetadata instance for the specified topic.

Example

index = build_index(["Document 1", "Document 2"])
assignments = [1, 1]
metadata = build_topic(index, assignments, 1)
source
LLMTextAnalysis.create_folds - Method
create_folds(k::Int, data_size::Int)

Create k random folds from a dataset of size data_size.

Arguments

  • k::Int: Number of folds to create.
  • data_size::Int: Total number of observations in the dataset.

Returns

  • Vector{SubArray}: A vector of k vectors, each containing indices for a fold.
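
One common way to build random folds is to shuffle the indices and deal them out round-robin. The sketch below follows that approach; `sketch_folds` is a hypothetical helper, not the library's create_folds, and it returns plain index vectors rather than SubArrays.

```julia
using Random

# Hypothetical helper: shuffle 1:data_size and split it into k folds
function sketch_folds(k::Int, data_size::Int; rng = Random.default_rng())
    shuffled = shuffle(rng, 1:data_size)
    # Taking every k-th index keeps fold sizes within one of each other
    return [shuffled[i:k:end] for i in 1:k]
end

folds = sketch_folds(4, 10; rng = MersenneTwister(42))

println(length.(folds))   # [3, 3, 2, 2]
```

Every index lands in exactly one fold, so each observation is used for validation once.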

Examples

# Create 4 folds from a dataset Xt
n_obs = size(Xt, 1)
folds = create_folds(4, n_obs)
source
LLMTextAnalysis.cross_validate_accuracy - Method
cross_validate_accuracy(X::AbstractMatrix{<:Number},
                        y::AbstractVector{<:Integer};
                        verbose::Bool = true,
                        k::Int = 4,
                        lambda::Real = 1e-5) -> Float64

Perform k-fold cross-validation on the dataset (X, y) using logistic regression and return the average classification accuracy.

Arguments

  • X::AbstractMatrix: The feature matrix (observations x features).
  • y::AbstractVector{<:Integer}: The target vector (each entry is +1 or -1).
  • verbose::Bool (optional): If true, prints the accuracy of each fold. Defaults to true.
  • k::Int (optional): The number of folds for cross-validation. Defaults to 4.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-5.

Returns

  • Float64: The average classification accuracy across all folds.
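
The k-fold procedure itself can be sketched without any modelling packages. To keep the example short, the sketch swaps logistic regression for a simple nearest-centroid rule and uses deterministic folds; `centroid_predict` and `sketch_cv_accuracy` are hypothetical names and the data is made up.

```julia
using LinearAlgebra, Statistics

# Stand-in classifier (NOT logistic regression): split along the line between class centroids
function centroid_predict(Xtrain, ytrain, Xtest)
    pos = vec(mean(Xtrain[ytrain .== 1, :], dims = 1))
    neg = vec(mean(Xtrain[ytrain .== -1, :], dims = 1))
    w = pos - neg
    b = -dot(w, (pos + neg) / 2)
    return ifelse.(Xtest * w .+ b .>= 0, 1, -1)
end

# Hold out each fold once, train on the rest, and average the fold accuracies
function sketch_cv_accuracy(X, y; k = 4)
    n = size(X, 1)
    folds = [collect(i:k:n) for i in 1:k]   # deterministic folds, for illustration only
    accuracies = Float64[]
    for test_idx in folds
        train_idx = setdiff(1:n, test_idx)
        preds = centroid_predict(X[train_idx, :], y[train_idx], X[test_idx, :])
        push!(accuracies, mean(preds .== y[test_idx]))
    end
    return mean(accuracies)
end

# Two well-separated classes: observations x features, targets in {+1, -1}
X = [1.0 1.0; 1.2 0.9; 0.9 1.1; 1.1 1.0; -1.0 -1.0; -1.2 -0.9; -0.9 -1.1; -1.1 -1.0]
y = [1, 1, 1, 1, -1, -1, -1, -1]

acc = sketch_cv_accuracy(X, y; k = 4)
println(acc)   # 1.0 for this separable toy data
```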

Example

acc = cross_validate_accuracy(Xt, y; k = 4, lambda = 1e-2)
source
LLMTextAnalysis.prepare_plot! - Method
prepare_plot!(index::AbstractDocumentIndex; verbose::Bool=true, kwargs...) -> AbstractDocumentIndex

Prepares the 2D UMAP plot data for a given document index.

Arguments

  • index: The document index to prepare plot data for.
  • verbose: Flag to enable INFO logging.

Returns

  • The updated index with plot_data field populated.

Example

index = build_index(["Some text", "More text"])
prepared_index = prepare_plot!(index)
source
LLMTextAnalysis.score - Method
score(index::AbstractDocumentIndex,
    classifier::TrainedClassifier;
    check_index::Bool = true,
    return_labels::Bool = false)

Scores all documents in the provided index based on the TrainedClassifier.

The score reflects how closely each document aligns to each label in the trained classifier (classifier.labels).

The resulting scores will be a matrix of probabilities for each document and each label.

Scores dimension: num_documents x num_labels, ie, position [1,3] would correspond to the probability of the first document corresponding to the 3rd label/class.

To pick the best label for each document, you can use argmax(scores, dims=2).

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • classifier::TrainedClassifier: The trained classifier model used for scoring.
  • return_labels::Bool (optional): If true, returns the most probable labels instead of the scores. Defaults to false.
  • check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the classifier training. Defaults to true.

Returns

  • A matrix of scores, with each row corresponding to a document in the index and each column to the probability of that label. If return_labels=true, the most probable label for each document is returned instead.

Example

# Assuming `index` and `classifier` are predefined
scores = score(index, classifier)

Pick the highest scoring label for each document:

scores = score(index, classifier)
label_ids = argmax(scores, dims = 2) |> vec |> x -> map(i -> i[2], x)
best_labels = classifier.labels[label_ids]

Or, instead, you can simply provide return_labels=true to get the best labels directly:

score(index, classifier; return_labels = true)
source
LLMTextAnalysis.score - Method
score(index::AbstractDocumentIndex, concept::TrainedConcept; check_index::Bool = true)

Scores all documents in the provided index based on the TrainedConcept.

The score quantifies the relevance or alignment of each document with the trained concept, with a score closer to 1 indicating a higher relevance.

The function uses a sigmoid function to map the scores to a range between 0 and 1, providing a probability-like interpretation.
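
The sigmoid mapping can be sketched as follows. This assumes the raw score is a plain linear combination of the embedding dimensions and the coefficients with no intercept, which this section does not confirm; the `embeddings` and `coeffs` values are made up.

```julia
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid(x) = 1 / (1 + exp(-x))

# Made-up data: 2 dimensions (rows) x 3 documents (columns)
embeddings = [1.0 0.0 -1.0;
              0.5 0.0 -0.5]
coeffs = [2.0, 1.0]   # one coefficient per embedding dimension

# Assumed scoring rule: linear score per document, then sigmoid
scores = sigmoid.(embeddings' * coeffs)

# Documents ordered from most to least aligned with the concept
ranking = sortperm(scores, rev = true)

println(round.(scores, digits = 3))   # [0.924, 0.5, 0.076]
```

A raw score of zero maps to exactly 0.5, so values above 0.5 lean towards the concept and values below lean away from it.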

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • concept::TrainedConcept: The trained concept model used for scoring.
  • check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the concept training. Defaults to true.

Returns

  • A vector of scores, each corresponding to a document in the index, in the range [0, 1].

Example

# Assuming `index` and `concept` are predefined
scores = score(index, concept)

You can show the top 5 highest scoring documents for the concept:

index.docs[first(sortperm(scores, rev = true), 5)]

This function is particularly useful for analyzing the presence, intensity, or relevance of a specific concept within a collection of documents.

source
LLMTextAnalysis.score - Method
score(index::AbstractDocumentIndex,
    spectrum::TrainedSpectrum;
    check_index::Bool = true)

Scores all documents in the provided index based on the TrainedSpectrum.

The score reflects how closely each document aligns with each of the ends of the trained spectrum. Scores are left-to-right, ie, a score closer to 0 indicates a higher alignment to spectrum.spectrum[1] and a score closer to 1 indicates a higher alignment to spectrum.spectrum[2].
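
The left-to-right reading of spectrum scores can be shown with made-up numbers. In this sketch, `spectrum_ends` stands in for spectrum.spectrum, and the 0.5 cutoff for picking the nearer end is an illustrative assumption, not something the library applies.

```julia
spectrum_ends = ("innovation", "tradition")   # stand-in for spectrum.spectrum
docs = ["doc A", "doc B", "doc C"]
scores = [0.9, 0.1, 0.6]                      # made-up spectrum scores

# Order documents from the first end of the spectrum to the second
left_to_right = docs[sortperm(scores)]

# Nearer end per document; the 0.5 cutoff is an assumption for illustration
nearer_end = [s < 0.5 ? spectrum_ends[1] : spectrum_ends[2] for s in scores]

println(left_to_right)   # ["doc B", "doc C", "doc A"]
println(nearer_end)      # ["tradition", "innovation", "tradition"]
```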

Arguments

  • index::AbstractDocumentIndex: The index containing the documents to be scored.
  • spectrum::TrainedSpectrum: The trained spectrum model used for scoring.
  • check_index::Bool (optional): If true, checks for index ID matching between the provided index and the one used in the spectrum training. Defaults to true.

Returns

  • A vector of scores, each corresponding to a document in the index, in the range [0, 1].

Example

# Assuming `index` and `spectrum` are predefined
scores = score(index, spectrum)

You can show the 5 highest scoring documents, ie, those closest to the second end of the spectrum (spectrum.spectrum[2]):

index.docs[first(sortperm(scores, rev = true), 5)]

# Use rev=false if you want to see documents closest to spectrum 1 (opposite end)

This function is useful for ranking all documents along the chosen spectrum.

source
LLMTextAnalysis.topic_tree - Method
topic_tree(
    index::DocIndex, levels::AbstractVector{<:Union{Integer, AbstractString}};
    sorted::Bool = true)

Builds a topic tree from the index for the provided levels. Levels must be present in the index, eg, run build_clusters! first.

Arguments

  • index::DocIndex: The document index
  • levels::AbstractVector{<:Union{Integer, AbstractString}}: The levels to include in the tree, eg, [4, 10, 20] (they must be present in index.topic_levels)
  • sorted::Bool: Whether to sort the children by the number of documents in each topic. Defaults to true.

Example


# Create topic tree for levels k=4, k=10, k=20
root = topic_tree(index, [4, 10, 20])

# Display it
print_tree(root)
# example output
# "All Documents (N: 10, Share: 100.0%, Level: root, Topic ID: 0)"
# ├─ "Topic1 (N: 5, Share: 50.0%, Level: 4, Topic ID: 1)"
# │  └─ "Topic1 (N: 5, Share: 50.0%, Level: 10, Topic ID: 1)"
# ...
# └─ "Topic2 (N: 5, Share: 50.0%, Level: 4, Topic ID: 2)"
#    └─ "Topic2 (N: 5, Share: 50.0%, Level: 10, Topic ID: 2)"
# ...
source
LLMTextAnalysis.train! - Method
train!(index::AbstractDocumentIndex,
    classifier::TrainedClassifier;
    verbose::Bool = true,
    overwrite::Bool = false,
    writer_template::Symbol = :TextWriterFromLabel,
    lambda::Real = 1e-3, num_samples::Int = 5,
    aigenerate_kwargs::NamedTuple = NamedTuple(),
    aiembed_kwargs::NamedTuple = NamedTuple())

Refine or retrain a previously trained TrainedClassifier model.

This function can be used to update the classifier model with new data, adjust parameters, or completely retrain it.

See also: train_classifier, score

Arguments

  • index::AbstractDocumentIndex: The document index containing the documents for analysis.
  • classifier::TrainedClassifier: The trained classifier object to be refined or retrained.
  • verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
  • overwrite::Bool (optional): If true, existing training data in the classifier will be overwritten. Defaults to false.
  • writer_template::Symbol (optional): The template used for writing synthetic documents. Defaults to :TextWriterFromLabel.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
  • num_samples::Int (optional): The number of examples to generate for each topic label. Defaults to 5.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function.

Returns

  • The updated TrainedClassifier object with refined or new training.

Example

# Assuming `index` and `classifier` are pre-existing objects
train!(index, classifier, verbose = true, overwrite = true)

This function allows for continuous improvement and adaptation of a classifier model to new data.

source
LLMTextAnalysis.train! - Method
train!(index::AbstractDocumentIndex,
       concept::TrainedConcept;
       verbose::Bool = true,
       overwrite::Bool = false,
       rewriter_template::Symbol = :StatementRewriter,
       lambda::Real = 1e-3, negatives_samples::Int = 1,
       aigenerate_kwargs::NamedTuple = NamedTuple(),
       aiembed_kwargs::NamedTuple = NamedTuple(),)

Refine or retrain a previously trained TrainedConcept model.

This function can be used to update the concept model with new data, adjust parameters, or completely retrain it.

See also: train_concept, score

Arguments

  • index::AbstractDocumentIndex: The document index containing the documents for analysis.
  • concept::TrainedConcept: The trained concept object to be refined or retrained.
  • verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
  • overwrite::Bool (optional): If true, existing training data in the concept will be overwritten. Defaults to false.
  • rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
  • negatives_samples::Int (optional): The number of negative examples to use for training per each positive sample. Defaults to 1.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function.

Returns

  • The updated TrainedConcept object with refined or new training.

Example

# Assuming `index` and `concept` are pre-existing objects
concept = train!(index, concept, verbose = true, overwrite = true)

This function allows for continuous improvement and adaptation of a concept model to new data or analysis perspectives. It is particularly useful in dynamic environments where the underlying data or the concept of interest may evolve over time.

source
LLMTextAnalysis.train! - Method
train!(index::AbstractDocumentIndex,
       spectrum::TrainedSpectrum;
       verbose::Bool = true,
       overwrite::Bool = false,
       rewriter_template::Symbol = :StatementRewriter,
       lambda::Real = 1e-5,
       aigenerate_kwargs::NamedTuple = NamedTuple(),
       aiembed_kwargs::NamedTuple = NamedTuple(),)

Finish a partially trained Spectrum or retrain an existing one (with overwrite=true).

See also: train_spectrum, train_concept, score

Arguments

  • index::AbstractDocumentIndex: The document index containing the documents to be analyzed.
  • spectrum::TrainedSpectrum: The partially or previously trained spectrum object to be finished or retrained.
  • verbose::Bool (optional): If true, prints logs during the process. Defaults to true.
  • overwrite::Bool (optional): If true, existing training data in the spectrum will be overwritten. Defaults to false.
  • rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-5. Reduce if your cross-validated accuracy is too low.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.

Returns

  • The updated TrainedSpectrum object containing the trained model (coeffs), along with other relevant information like the rewritten documents (docs) and their embeddings (embeddings).

Example

# Assuming `index` and `spectrum` are pre-existing objects
trained_spectrum = train!(index, spectrum, verbose = true, overwrite = true)

This function allows for iterative improvement of a spectrum model, adapting to new data or refinements in the analysis framework.

source
LLMTextAnalysis.train_classifier - Method
train_classifier(index::AbstractDocumentIndex,
    labels::AbstractVector{<:AbstractString};
    docs_ids::AbstractVector{<:Integer} = Int[],
    docs_labels::AbstractVector{<:Integer} = Int[],
    labels_description::Union{Nothing, AbstractVector{<:AbstractString}} = nothing,
    num_samples::Int = 5, verbose::Bool = true,
    writer_template::Symbol = :TextWriterFromLabel,
    lambda::Real = 1e-3,
    aigenerate_kwargs::NamedTuple = NamedTuple(),
    aiembed_kwargs::NamedTuple = NamedTuple())

Train a model to classify each document into one of several specific topics based on labels (detailed in labels_description).

If the user provides documents from the index and their corresponding labels (docs_ids and docs_labels, respectively), the model will be trained on those documents. Aim for a balanced dataset (all labels must be present) with a minimum of 5 documents per label (ideally more).
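Before training on your own labels, it can help to check the balance advice above. The following is a minimal sketch in plain Julia; label_counts is a hypothetical helper written for this illustration, not a function of LLMTextAnalysis, and the ids and labels are toy values.

```julia
# Hypothetical helper (not part of LLMTextAnalysis): count labeled documents per label
function label_counts(docs_labels::AbstractVector{<:Integer}, num_labels::Integer)
    counts = zeros(Int, num_labels)
    for l in docs_labels
        1 <= l <= num_labels || throw(ArgumentError("label id $l is outside 1:$num_labels"))
        counts[l] += 1
    end
    return counts
end

labels = ["Improving traffic situation", "Taxes and public funding",
    "Safety and community", "Other"]
docs_ids = [1, 2674, 4, 17, 23, 69, 2669, 6]
docs_labels = [1, 1, 2, 2, 3, 3, 4, 4]

# Each document id needs exactly one label id
@assert length(docs_ids) == length(docs_labels)

counts = label_counts(docs_labels, length(labels))
all_present = all(>(0), counts)   # every label has at least one document
enough_docs = all(>=(5), counts)  # the recommended minimum of 5 documents per label

for (label, c) in zip(labels, counts)
    println(rpad(label, 30), c)
end
```

With these toy values every label is present, but each has only 2 documents, so enough_docs is false and more labeled examples would be advisable.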

Otherwise, we will first generate num_samples of synthetic documents for each label in labels, ie, in total there will be num_samples x length(labels) generated documents. If labels_description is provided, these descriptions will be provided to the AI model to generate more diverse and relevant documents for each label (more informative than just one word labels).

Under the hood, we train a multi-label classifier on top of the embeddings of the documents.

The resulting scores will be a matrix of probabilities for each document and each label. Scores dimension: num_documents x num_labels, ie, position [1,3] would correspond to the probability of the first document corresponding to the 3rd label/class. To pick the best label for each document, you can use argmax(scores, dims=2).
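To make the scores layout concrete, here is a small sketch with a made-up score matrix (the numbers are illustrative and do not come from a trained classifier). It shows how the result of argmax(scores, dims=2) can be turned into label positions and label names.

```julia
# Toy scores: 3 documents (rows) x 4 labels (columns); values are made up
labels = ["Improving traffic situation", "Taxes and public funding",
    "Safety and community", "Other"]
scores = [0.70 0.10 0.15 0.05;
          0.05 0.15 0.20 0.60;
          0.10 0.65 0.20 0.05]

# argmax over dims=2 returns a 3x1 matrix of CartesianIndex (row, column)
best = argmax(scores, dims = 2)

# Keep only the column (label) position for each document
best_ids = [idx[2] for idx in vec(best)]   # [1, 4, 2]

# Map positions back to label names
best_names = labels[best_ids]
```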

See also: score, train!, train_spectrum, train_concept

Arguments

  • index::AbstractDocumentIndex: An index containing the documents to be analyzed.
  • labels::AbstractVector{<:AbstractString}: A vector of labels to be used for training the classifier (documents will be assigned to one of these labels).
  • docs_ids::AbstractVector{<:Integer} (optional): The IDs of the documents in the index to be used for training. Defaults to an empty vector, in which case synthetic documents are generated instead.
  • docs_labels::AbstractVector{<:Integer} (optional): The labels corresponding to the documents in docs_ids. Defaults to an empty vector.
  • labels_description::Union{Nothing, AbstractVector{<:AbstractString}} (optional): A vector of descriptions for each label. If provided, it will be used to generate more diverse and relevant documents for each label. Defaults to nothing.
  • num_samples::Int (optional): The number of documents to generate for each label in labels. Defaults to 5.
  • verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
  • writer_template::Symbol (optional): The template used for writing synthetic documents. Defaults to :TextWriterFromLabel.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.

Returns

  • A TrainedClassifier object containing the trained model, along with relevant information such as the generated documents (docs), embeddings (embeddings), and model coefficients (coeffs).

Example

Create a classifier for a set of labeled documents in our index (ie, we know the labels for some documents):

# Assuming `index` is an existing document index

# Provide the names of the topics and corresponding labeled documents
labels = ["Improving traffic situation", "Taxes and public funding",
    "Safety and community", "Other"]

# Let's say we have labeled a few documents - ideally, you should have 5-10 examples for EACH label
docs_ids = [1, 2674, 4, 17, 23, 69, 2669, 6]
docs_labels = [1, 1, 2, 2, 3, 3, 4, 4] # what topic each doc belongs to

# Train the classifier
cls = train_classifier(index, labels; docs_ids, docs_labels)

# Score the documents in the index
score(index, cls) # or cls(index)

If you do not have any labeled documents, you can ask an AI model to generate some potential examples for you (num_samples per each topic/label). It helps to provide label descriptions to improve the quality of generated documents:

# Assuming `index` is an existing document index and `labels` is defined as in the previous example

labels_description = [
    "Survey responses around infrastructure, improving traffic situation and related",
    "Decreasing taxes and giving more money to the community",
    "Survey responses around Homelessness, general safety and community related topics",
    "Any other topics like environment, education, governance, etc."]

# Train the classifier - it will generate 20 document examples (5 for each label x 4 labels)
cls = train_classifier(index, labels; labels_description, num_samples=5)

# Get scores for all documents
scores = score(index, cls)

# Get labels for all documents in the index
best_labels = score(index, cls; return_labels = true)
LLMTextAnalysis.train_conceptMethod
train_concept(index::AbstractDocumentIndex,
              concept::String;
              num_samples::Int = 100, verbose::Bool = true,
              rewriter_template::Symbol = :StatementRewriter,
              lambda::Real = 1e-3, negatives_samples::Int = 1,
              aigenerate_kwargs::NamedTuple = NamedTuple(),
              aiembed_kwargs::NamedTuple = NamedTuple(),)

Train a model to identify and score a specific Concept (defined by the string concept) based on num_samples documents from index.

We effectively identify the "direction" in the embedding space that represents this concept and develop a model to score our documents against it.
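As a rough intuition for what a "direction" in the embedding space means, the sketch below projects toy embeddings onto a made-up direction vector and squashes the result with a sigmoid. This is an assumption made for illustration only; the actual model inside TrainedConcept may differ (for example, in how the coefficients are fitted or whether an intercept is used).

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Toy embeddings: 3 dimensions (rows) x 4 documents (columns)
embeddings = Float32[1.0 0.0 -1.0 0.5;
                     0.0 1.0  0.0 0.5;
                     0.0 0.0  0.0 0.0]

# Hypothetical concept direction (in practice this would be learned, not hand-written)
direction = Float32[2.0, 0.0, 0.0]

# Project each document onto the direction, then map to the (0, 1) range
scores = sigmoid.(embeddings' * direction)

# Documents ordered from most to least aligned with the direction
ranking = sortperm(scores, rev = true)   # [1, 4, 2, 3]
```

Document 1 points along the direction and ranks first, document 3 points the opposite way and ranks last, and document 2 is orthogonal to it and scores exactly 0.5.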

This function focuses on a single Concept, as opposed to a Spectrum (see train_spectrum), to gauge its presence, strength, or manifestations in the documents.

See also: train_spectrum, train!, score

Arguments

  • index::AbstractDocumentIndex: An index containing the documents to be analyzed.
  • concept::String: The concept to be analyzed within the documents.
  • num_samples::Int (optional): The number of documents to sample from the index for training. Defaults to 100.
  • verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
  • rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
  • lambda::Real (optional): Regularization parameter for logistic regression. Defaults to 1e-3.
  • negatives_samples::Int (optional): The number of negative examples to use for training per each positive sample. Defaults to 1.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.

Returns

  • A TrainedConcept object containing the trained model, along with relevant information such as rewritten documents (docs), embeddings (embeddings), and model coefficients (coeffs).

Example

# Assuming `index` is an existing document index
my_concept = "sustainability"
concept_model = train_concept(index, my_concept)

Show the top 5 highest scoring documents for the concept:

scores = score(index, concept_model)
index.docs[first(sortperm(scores, rev = true), 5)]

You can customize the training by passing additional arguments to the AI generation and embedding functions. For example, you can specify the model to use for generation and how many samples to use:

concept = train_concept(index,
    "action-oriented";
    num_samples = 50,
    aigenerate_kwargs = (; model = "gpt3t"))

This function leverages large language models to extract and analyze the presence and variations of a specific concept within a document corpus. It can be particularly useful in thematic studies, sentiment analysis, or trend identification in large collections of text.

For further analysis, you can inspect the rewritten documents and their embeddings:

# Concept-related rewritten documents
concept_model.docs

# Embeddings of the rewritten documents
concept_model.embeddings
LLMTextAnalysis.train_spectrumMethod
train_spectrum(index::AbstractDocumentIndex,
    spectrum::Tuple{String, String};
    num_samples::Int = 100, verbose::Bool = true,
    rewriter_template::Symbol = :StatementRewriter,
    lambda::Real = 1e-5,
    aigenerate_kwargs::NamedTuple = NamedTuple(),
    aiembed_kwargs::NamedTuple = NamedTuple(),)

Train a Spectrum, ie, a two-sided axis of polar opposite concepts.

We effectively identify the "directions" in the embedding space that represent the two concepts that you selected as the opposite ends of the spectrum.

Practically, it takes num_samples documents from index, rewrites them through the specified lenses (the two ends of the spectrum), embeds these rewritten documents, and finally trains a logistic regression model to classify the documents according to the spectrum.
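The last step of this pipeline can be sketched without any AI calls. The example below uses random vectors as a stand-in for the embeddings of the rewritten documents and fits an L2-regularized logistic regression with plain gradient descent. It is an illustrative sketch under those assumptions, not the package's implementation: fit_logistic, the learning rate, and the synthetic data are all hypothetical.

```julia
using Random, Statistics

# Stand-in for embedded rewrites: two clusters in 8 dimensions (rows are documents)
rng = MersenneTwister(42)
n, d = 50, 8
shift = vcat(2.5, zeros(d - 1))
X = vcat(randn(rng, n, d) .- shift',   # first half: rewrites for spectrum end 1
         randn(rng, n, d) .+ shift')   # second half: rewrites for spectrum end 2
y = vcat(fill(-1, n), fill(1, n))

# Hypothetical helper: L2-regularized logistic regression via gradient descent
function fit_logistic(X, y; lambda = 1e-5, lr = 0.1, iters = 500)
    w = zeros(size(X, 2))
    for _ in 1:iters
        margins = y .* (X * w)
        # Gradient of mean(log(1 + exp(-margin))) + lambda / 2 * sum(abs2, w)
        grad = -(X' * (y ./ (1 .+ exp.(margins)))) ./ length(y) .+ lambda .* w
        w .-= lr .* grad
    end
    return w
end

w = fit_logistic(X, y)

# Positive scores lean towards end 2, negative scores towards end 1
scores = X * w
accuracy = mean(sign.(scores) .== y)
```

A larger lambda shrinks the coefficients more strongly, which is the trade-off the lambda keyword controls.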

See also: train!, train_concept, score

Arguments

  • index::AbstractDocumentIndex: An index containing the documents to be analyzed. This index should have been previously built using build_index.
  • spectrum::Tuple{String, String}: A pair of strings representing the two lenses through which the documents will be rewritten. For example, ("optimistic", "pessimistic") could be a spectrum.
  • num_samples::Int (optional): The number of documents to sample from the index for training. Defaults to 100.
  • verbose::Bool (optional): If true, prints detailed logs during the process. Defaults to true.
  • rewriter_template::Symbol (optional): The template used for rewriting statements. Defaults to :StatementRewriter.
  • lambda::Real (optional): Regularization parameter for the logistic regression. Defaults to 1e-5. Adjust if your cross-validated accuracy is too low.
  • aigenerate_kwargs::NamedTuple (optional): Additional arguments for the aigenerate function. See ?aigenerate for more details.
  • aiembed_kwargs::NamedTuple (optional): Additional arguments for the aiembed function. See ?aiembed for more details.

Returns

  • A TrainedSpectrum object containing the trained model (coeffs), along with other relevant information such as the rewritten documents (docs) and their embeddings (embeddings).

Example

# Assuming `index` is an existing document index
my_spectrum = ("pessimistic", "optimistic")
spectrum = train_spectrum(index, my_spectrum)

Show the top 5 highest-scoring documents for the second end of the spectrum (spectrum.spectrum[2], which is "optimistic" in this example):

scores = score(index, spectrum)
index.docs[first(sortperm(scores, rev = true), 5)]

# Use rev=false to get the highest scoring documents for spectrum 1 (opposite end)

You can customize the analysis by passing additional arguments to the AI generation and embedding functions. For example, you can specify the model to use for generation and how many samples to use:

spectrum = train_spectrum(index,
    ("forward-looking", "dwelling in the past");
    num_samples = 50, aigenerate_kwargs = (; model = "gpt3t"))

This function utilizes large language models to rewrite and analyze the text, providing insights based on the specified spectrum. The output includes embeddings and a model capable of projecting new documents onto this spectrum for analysis.

For troubleshooting, you can fit the model manually and inspect the accuracy:

X = spectrum.embeddings'
# First half is spectrum 1, second half is spectrum 2
n = length(spectrum.source_doc_ids)
y = vcat(fill(-1, n), fill(1, n))
accuracy = cross_validate_accuracy(X, y; k = 4, lambda = 1e-8)

Or explore the source documents and re-written documents:

# source documents
index.docs[spectrum.source_doc_ids]

# re-written documents
spectrum.docs