Example 2: Label Arbitrary Concepts in the City of Austin Community Survey

For this tutorial, we will use the City of Austin's Community Survey.

We will pick one open-ended question. Let's say we want to help the mayor to prioritize ideas, so we will lay out the verbatims against the concepts of being "action-oriented" and "forward-looking".

You can choose any concepts that you want.

Necessary imports

using Downloads, CSV, DataFrames
import PlotlyJS, PlotlyDocumenter
using LLMTextAnalysis

Prepare the Data

Download the survey data

Downloads.download("https://data.austintexas.gov/api/views/s2py-ceb7/rows.csv?accessType=DOWNLOAD",
    joinpath(@__DIR__, "cityofaustin.csv"));

Read the survey data into a DataFrame

df = CSV.read(joinpath(@__DIR__, "cityofaustin.csv"), DataFrame);

Let's select one of the open-ended questions, eg,

col = "Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?"
docs = df[!, col] |> skipmissing |> collect;

Build the Index

Index the documents (ie, embed them)

index = build_index(docs)
DocIndex(Documents: 2933, PlotData: None, Topic Levels: None)

Sometimes you know what you're looking for, but it's hard to define the exact keywords. For example, you might want to identify documents that are "action-oriented" or "pessimistic" or "forward-looking".

For these situations, LLMTextAnalysis offers two distinct functions for document analysis: train_concept and train_spectrum. Each serves a different purpose in text analysis:

  • train_concept: Focuses on analyzing a single, specific concept within documents (eg, "action-oriented")
  • train_spectrum: Analyzes documents in the context of two opposing concepts (eg, "optimistic" vs. "pessimistic" or "forward-looking" vs. "backward-looking")

The resulting return types are TrainedConcept and TrainedSpectrum, respectively. Both can be used to score documents on the presence of the concept or their position on the spectrum.

Why do we need train_spectrum and not simply use two TrainedConcepts? It's because opposite of "forward-looking" can be many things, eg, "short-sighted", "dwelling in the past", or simply "not-forward looking".

train_spectrum allows you to define the opposite concept that you need and score documents on the spectrum between the two.

Score Documents against a Concept

Let's say we want to identify documents that are "action-oriented". We can use train_concept to train a model to identify documents that are "action-oriented" and score the documents against the concept.

Let's show the top 5 documents that are most "action-oriented".

concept = train_concept(index,
    "action-oriented";
    aigenerate_kwargs = (; model = "gpt3t"))

scores = score(index, concept)
index.docs[first(sortperm(scores, rev = true), 5)]
5-element Vector{String}:
 "stop delaying important decisions and make the best one available and move on to solutions"
 "We need real transportation options, not just more lip service and endless studies that do not achieve implementation."
 "We need a comprehensive plan/vision for managing the growth of this city!"
 "The homeless problem is overwhelming for everyone. We MUST find solutions!"
 "we need to plan for global climate change, water and energy programs must be robust"

Score Documents along a Spectrum

We may want to define an arbitrary "spectrum" (axis/polar opposites) and score documents on it. Let's introduce a spectrum for "dwelling in the past" vs "forward-looking". The higher the score (eg, 100%), the more "forward-looking" the document/text is.

Let's show the top 5 documents that are most "forward-looking".

spectrum = train_spectrum(index,
    ("dwelling in the past", "forward-looking");
    aigenerate_kwargs = (; model = "gpt3t"))

scores = score(index, spectrum)
index.docs[first(sortperm(scores, rev = true), 5)]
5-element Vector{String}:
 "MANAGING GROWTH IN A WAY THAT IMPROVES QUALITY OF LIFE FOR ALL CITIZENS"
 "Enhance affordability and ensure a diverse, healthy population."
 "LEAD IN HOW WE PLAN FOR GROWTH AND BECOME 100 PERCENT RELIABLE ON RENEWABLE ENERGY"
 "He is doing a great job. Setting planning for growth, transportation and mobility together is an excellent approach."
 "PLAN FOR ACCELERATED GROWTH. CLIMATE CHANGES PROMISES TO DELIVER MORE COASTAL CRISIS AND POPULATION DISPLACEMENT. AUSTIN WILL EXPAND AS A RESULT. THINK BIG. THANK YOU FOR PRIORITIZING SMART GROWTH AND A DENSE URBAN LANDSCAPE."

And how about the ones "dwelling in the past" (set rev=false)?

index.docs[first(sortperm(scores, rev = false), 5)]
5-element Vector{String}:
 "I wish the City still felt welcoming to it's residents."
 "The cost of living in Austin and the surrounding v areas is 100% absurd. Beyond ridiculous. I would do anything to have the \"old Austin\" back."
 "When I was five, I told my Auntie \"one day\" I'm going to move back to this city because I was born here and this is where I belong\". When I was 23 I did move back. Now I can't figure out why it meant so much to me. Why? Because this city has lost it's authenticity and now I'm not sure why I would choose to stay. Please help me feel happy in my home."
 "I'm really sad about what Austin has become.  It used to feel like home.  Now, it just feels like a big, impersonal city with not much personality, and a lot of traffic (and high taxes).  The city that I loved is gone, which makes me sad and angry."
 "Process to contest taxes was difficult, time consuming and unfair. Feels like city is not supportive."

Summarize via Plot

Let's interactively explore our results.

We can use plot to plot the documents along the trained concepts and spectrums (simple scatter plot). The positions of args concept and spectrum are important, as they determine the position of the concepts in the plot (x-axis, y-axis)

pl = PlotlyJS.plot(index, spectrum, concept;
    title = "Prioritizing Action-Oriented and Forward-Looking Ideas (Top-right Corner)")

What if you need to add some additional information to the tooltip for each data point? You can do that with hoverdata argument, see ?plot for more details.


This page was generated using Literate.jl.