Text Utilities
Working with Generative AI (and in particular with the text modality), requires a lot of text manipulation. PromptingTools.jl provides a set of utilities to make this process easier and more efficient.
Highlights
The main functions to be aware of are
recursive_splitter
to split the text into sentences and words (of a desired lengthmax_length
)replace_words
to mask some sensitive words in your text before sending it to AIwrap_string
for wrapping the text into a desired length by adding newlines (eg, to fit some large text into your terminal width)length_longest_common_subsequence
to find the length of the longest common subsequence between two strings (eg, to compare the similarity between the context provided and generated text)distance_longest_common_subsequence
a companion utility forlength_longest_common_subsequence
to find the normalized distance between two strings. Always returns a number between 0-1, where 0 means the strings are identical and 1 means they are completely different.
You can import them simply via:
using PromptingTools: recursive_splitter, replace_words, wrap_string, length_longest_common_subsequence, distance_longest_common_subsequence
There are many more (especially in the AgentTools and RAGTools experimental modules)!
RAGTools module contains the following text utilities:
split_into_code_and_sentences
to split a string into code and sentencestokenize
to tokenize a string (eg, a sentence) into wordstrigrams
to generate trigrams from a string (eg, a word)text_to_trigrams
to generate trigrams from a larger string (ie, effectively wraps the three functions above)STOPWORDS
a set of common stopwords (very brief)
Feel free to open an issue or ask in the #generative-ai
channel in the JuliaLang Slack if you have a specific need.
References
recursive_splitter(text::String; separator::String=" ", max_length::Int=35000) -> Vector{String}
Split a given string text
into chunks of a specified maximum length max_length
. This is particularly useful for splitting larger documents or texts into smaller segments, suitable for models or systems with smaller context windows.
There is a method for dispatching on multiple separators, recursive_splitter(text::String, separators::Vector{String}; max_length::Int=35000) -> Vector{String}
that mimics the logic of Langchain's RecursiveCharacterTextSplitter
.
Arguments
text::String
: The text to be split.separator::String=" "
: The separator used to split the text into minichunks. Defaults to a space character.max_length::Int=35000
: The maximum length of each chunk. Defaults to 35,000 characters, which should fit within 16K context window.
Returns
Vector{String}
: A vector of strings, each representing a chunk of the original text that is smaller than or equal to max_length
.
Notes
The function ensures that each chunk is as close to
max_length
as possible without exceeding it.If the
text
is empty, the function returns an empty array.The
separator
is re-added to the text chunks after splitting, preserving the original structure of the text as closely as possible.
Examples
Splitting text with the default separator (" "):
text = "Hello world. How are you?"
chunks = recursive_splitter(text; max_length=13)
length(chunks) # Output: 2
Using a custom separator and custom max_length
text = "Hello,World," ^ 2900 # length 34900 chars
recursive_splitter(text; separator=",", max_length=10000) # for 4K context window
length(chunks[1]) # Output: 4
recursive_splitter(text::AbstractString, separators::Vector{String}; max_length::Int=35000) -> Vector{String}
Split a given string text
into chunks recursively using a series of separators, with each chunk having a maximum length of max_length
(if it's achievable given the separators
provided). This function is useful for splitting large documents or texts into smaller segments that are more manageable for processing, particularly for models or systems with limited context windows.
It was previously known as split_by_length
.
This is similar to Langchain's RecursiveCharacterTextSplitter
. To achieve the same behavior, use separators=["\n\n", "\n", " ", ""]
.
Arguments
text::AbstractString
: The text to be split.separators::Vector{String}
: An ordered list of separators used to split the text. The function iteratively applies these separators to split the text. Recommend to use["\n\n", ". ", "\n", " "]
max_length::Int
: The maximum length of each chunk. Defaults to 35,000 characters. This length is considered after each iteration of splitting, ensuring chunks fit within specified constraints.
Returns
Vector{String}
: A vector of strings, where each string is a chunk of the original text that is smaller than or equal to max_length
.
Usage Tips
I tend to prefer splitting on sentences (
". "
) before splitting on newline characters ("\n"
) to preserve the structure of the text.What's the difference between
separators=["\n"," ",""]
andseparators=["\n"," "]
? The former will split down to character level (""
), so it will always achieve themax_length
but it will split words (bad for context!) I prefer to instead set slightly smallermax_length
but not split words.
How It Works
The function processes the text iteratively with each separator in the provided order. It then measures the length of each chunk and splits it further if it exceeds the
max_length
. If the chunks is "short enough", the subsequent separators are not applied to it.Each chunk is as close to
max_length
as possible (unless we cannot split it any further, eg, if the splitters are "too big" / there are not enough of them)If the
text
is empty, the function returns an empty array.Separators are re-added to the text chunks after splitting, preserving the original structure of the text as closely as possible. Apply
strip
if you do not need them.The function provides
separators
as the second argument to distinguish itself from its single-separator counterpart dispatch.
Examples
Splitting text using multiple separators:
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", ". ", "\n"] # split by paragraphs, sentences, and newlines (not by words)
chunks = recursive_splitter(text, separators, max_length=20)
Splitting text using multiple separators - with splitting on words:
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", ". ", "\n", " "] # split by paragraphs, sentences, and newlines, words
chunks = recursive_splitter(text, separators, max_length=10)
Using a single separator:
text = "Hello,World," ^ 2900 # length 34900 characters
chunks = recursive_splitter(text, [","], max_length=10000)
To achieve the same behavior as Langchain's RecursiveCharacterTextSplitter
, use separators=["\n\n", "\n", " ", ""]
.
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", "\n", " ", ""]
chunks = recursive_splitter(text, separators, max_length=10)
replace_words(text::AbstractString, words::Vector{<:AbstractString}; replacement::AbstractString="ABC")
Replace all occurrences of words in words
with replacement
in text
. Useful to quickly remove specific names or entities from a text.
Arguments
text::AbstractString
: The text to be processed.words::Vector{<:AbstractString}
: A vector of words to be replaced.replacement::AbstractString="ABC"
: The replacement string to be used. Defaults to "ABC".
Example
text = "Disney is a great company"
replace_words(text, ["Disney", "Snow White", "Mickey Mouse"])
# Output: "ABC is a great company"
wrap_string(str::String,
text_width::Int = 20;
newline::Union{AbstractString, AbstractChar} = '
')
Breaks a string into lines of a given text_width
. Optionally, you can specify the newline
character or string to use.
Example:
wrap_string("Certainly, here's a function in Julia that will wrap a string according to the specifications:", 10) |> print
length_longest_common_subsequence(itr1::AbstractString, itr2::AbstractString)
Compute the length of the longest common subsequence between two string sequences (ie, the higher the number, the better the match).
Arguments
itr1
: The first sequence, eg, a String.itr2
: The second sequence, eg, a String.
Returns
The length of the longest common subsequence.
Examples
text1 = "abc-abc----"
text2 = "___ab_c__abc"
longest_common_subsequence(text1, text2)
# Output: 6 (-> "abcabc")
It can be used to fuzzy match strings and find the similarity between them (Tip: normalize the match)
commands = ["product recommendation", "emotions", "specific product advice", "checkout advice"]
query = "Which product can you recommend for me?"
let pos = argmax(length_longest_common_subsequence.(Ref(query), commands))
dist = length_longest_common_subsequence(query, commands[pos])
norm = dist / min(length(query), length(commands[pos]))
@info "The closest command to the query: "$(query)" is: "$(commands[pos])" (distance: $(dist), normalized: $(norm))"
end
But it might be easier to use directly the convenience wrapper distance_longest_common_subsequence
!
[source](https://github.com/svilupp/PromptingTools.jl/blob/698a594f514eeaaffc170aae71009fb7b26b8746/src/utils.jl#L252-L288)
</div>
<br>
<div style='border-width:1px; border-style:solid; border-color:black; padding: 1em; border-radius: 25px;'>
<a id='PromptingTools.distance_longest_common_subsequence-extra_tools-text_utilities_intro' href='#PromptingTools.distance_longest_common_subsequence-extra_tools-text_utilities_intro'>#</a> <b><u>PromptingTools.distance_longest_common_subsequence</u></b> — <i>Function</i>.
```julia
distance_longest_common_subsequence(
input1::AbstractString, input2::AbstractString)
distance_longest_common_subsequence(
input1::AbstractString, input2::AbstractVector{<:AbstractString})
Measures distance between two strings using the length of the longest common subsequence (ie, the lower the number, the better the match). Perfect match is distance = 0.0
Convenience wrapper around length_longest_common_subsequence
to normalize the distances to 0-1 range. There is a also a dispatch for comparing a string vs an array of strings.
Notes
Use
argmin
andminimum
to find the position of the closest match and the distance, respectively.Matching with an empty string will always return 1.0 (worst match), even if the other string is empty as well (safety mechanism to avoid division by zero).
Arguments
input1::AbstractString
: The first string to compare.input2::AbstractString
: The second string to compare.
Example
You can also use it to find the closest context for some AI generated summary/story:
context = ["The enigmatic stranger vanished as swiftly as a wisp of smoke, leaving behind a trail of unanswered questions.",
"Beneath the shimmering moonlight, the ocean whispered secrets only the stars could hear.",
"The ancient tree stood as a silent guardian, its gnarled branches reaching for the heavens.",
"The melody danced through the air, painting a vibrant tapestry of emotions.",
"Time flowed like a relentless river, carrying away memories and leaving imprints in its wake."]
story = """
Beneath the shimmering moonlight, the ocean whispered secrets only the stars could hear.
Under the celestial tapestry, the vast ocean whispered its secrets to the indifferent stars. Each ripple, a murmured confidence, each wave, a whispered lament. The glittering celestial bodies listened in silent complicity, their enigmatic gaze reflecting the ocean's unspoken truths. The cosmic dance between the sea and the sky, a symphony of shared secrets, forever echoing in the ethereal expanse.
"""
dist = distance_longest_common_subsequence(story, context)
@info "The closest context to the query: "$(first(story,20))..." is: "$(context[argmin(dist)])" (distance: $(minimum(dist)))"