Hands-On Exercise

Run a free local topic model and interpret the results

This exercise is for students with no prior coding experience.

Students will download the original dataset, use the provided scripts and reference outputs, and run BERTopic on a 10% sample of the collection on their own computer.

What Students Will Make

By the end, students will have:

  • a spreadsheet of topic labels
  • a spreadsheet showing which passage belongs to which topic
  • a review table with topic labels, top words, years, and contributing documents
  • a short interpretation of what the topics and charts suggest

How This Setup Works

There are two pieces:

  1. Original dataset: downloaded from the original Documenting the American South website.
  2. Project files: scripts, instructions, and reference results.

Download the original dataset separately so the source of the texts and citation information are clear.

Reference outputs vs. your outputs

The repository already contains a complete set of pre-generated results in:

outputs/bertopic_sample_nomic_sensitive_lemmatized/

Do not write to that folder. It contains the reference run used to build the charts and download links on this site. You can use those files to compare against your own results.

When you run the scripts in this exercise, your results will go to a separate folder:

outputs/my_run/

Keep these two folders separate so you can always compare your results to the reference.

Reminder: All commands in this exercise must be run from inside the project folder (topic-modeling-slave-narratives/). Open a terminal there before running anything — see Open a Terminal below. Your CSV results and HTML charts will appear in outputs/my_run/ when the scripts finish.

Download the Original Dataset

Go to the Documenting the American South North American Slave Narratives collection:

https://docsouth.unc.edu/neh/

Download the North American Slave Narratives data package from the collection page.

After downloading, unzip the data. The dataset should include:

data/texts
data/xml
data/toc.csv
data/readme.txt

The scripts expect the dataset’s data folder to be inside the same main project folder as the scripts.

Get the GitHub Repository

The scripts and reference outputs are in this GitHub repository:

https://github.com/jinghanlib/topic-modeling-slave-narratives

Option A — Download as ZIP (no Git required):

Go to the repository page, click the green Code button, choose Download ZIP, and unzip the downloaded file somewhere on your computer.

Option B — Clone with Git:

If you have Git installed, open a terminal and run:

git clone https://github.com/jinghanlib/topic-modeling-slave-narratives.git

Whichever option you use, copy or move the downloaded dataset’s data folder into the repository folder afterward.

Your project folder should look like this:

topic-modeling-slave-narratives/
  data/
    texts/
    xml/
    toc.csv
  scripts/
    run_bertopic_sample.py
    visualize_topic_metadata.py
  instructions/
  outputs/
  requirements.txt
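If you want to confirm the layout before installing anything, a short check like the one below can help. It is only a sketch: the list of paths is copied from the folder tree above, and the function names (missing_paths, report) are made up for this example and are not part of the repository's scripts. It needs Python, so come back to it after the Install Python step if Python is not installed yet.

```python
from pathlib import Path

# Paths copied from the folder tree above, relative to the project folder.
EXPECTED_PATHS = [
    "data/texts",
    "data/xml",
    "data/toc.csv",
    "scripts/run_bertopic_sample.py",
    "scripts/visualize_topic_metadata.py",
    "requirements.txt",
]


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in expected if not (root / p).exists()]


def report(project_root="."):
    """Print a one-line summary of what is missing, if anything."""
    missing = missing_paths(project_root)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Folder layout looks complete.")


if __name__ == "__main__":
    # Run this from inside topic-modeling-slave-narratives/.
    report(".")
```

If the report lists data/texts, data/xml, or data/toc.csv, the dataset's data folder has probably not been moved into the project folder yet.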

Local Compute: What It Means

This exercise runs on the student’s own computer.

That means:

  • the topic modeling pipeline runs on your computer — the embedding, clustering, and labeling steps all happen locally
  • no paid API is needed
  • enough disk space is needed for Python packages and AI models
  • the first run may take several minutes
  • the computer may become warm or use more battery

Approximate space needed:

Python environment and packages: several GB
nomic-embed-text model: about 300 MB
llama3.1 model: about 5 GB
project data and outputs: hundreds of MB to a few GB

Approximate time:

10% sample: about 5 to 15 minutes on a modern laptop
full collection: much longer, possibly hours

Times vary by computer.
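To see how much free disk space you have, your operating system's file browser is usually enough. As an alternative, the sketch below uses Python's standard library. The 15 GB threshold is an assumption chosen to roughly cover the estimates listed above; it is not a measured requirement of this project.

```python
import shutil

# Assumption: a rough allowance based on the estimates above, not a measured figure.
NEEDED_GB = 15


def free_space_gb(path="."):
    """Free space, in gigabytes, on the disk that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    free = free_space_gb(".")
    print(f"Free space: {free:.1f} GB")
    if free < NEEDED_GB:
        print(f"This is below the rough {NEEDED_GB} GB allowance; consider freeing space first.")
```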

Open a Terminal

Several steps below require typing commands into a terminal (also called a command line or shell).

The easiest way is to right-click the project folder itself:

  • Mac: Right-click the folder in Finder and choose New Terminal at Folder (it may appear under Quick Actions or Services). If you do not see that option, go to System Settings → Keyboard → Keyboard Shortcuts → Services → Files and Folders and enable New Terminal at Folder.
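  • Windows: Hold Shift and right-click the folder in File Explorer, then choose Open PowerShell window here or Open Command window here. The multi-line Windows commands in this exercise use Command Prompt syntax, with ^ at the end of each line. In PowerShell, type the whole command on one line and leave out the ^ characters.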

This opens a terminal already pointed at the right location. You do not need to type any folder path. If you close and reopen the terminal later, right-click the folder again before running any commands.

Install the Tools

Install Python

Download Python from:

https://www.python.org/downloads/

Use Python 3.11 or newer.

To check Python, open a terminal and run:

python3 --version

If it prints a version number, Python is installed.

Windows note: If python3 --version gives an error, try python --version instead. Windows often uses python rather than python3.
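The version check can also be done from inside Python, which avoids having to read the version number by eye. This is a minimal sketch; the function name is_supported is made up for this example.

```python
import sys

# The exercise asks for Python 3.11 or newer.
MINIMUM = (3, 11)


def is_supported(version=None):
    """Return True if the given (major, minor, ...) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MINIMUM


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("OK for this exercise" if is_supported() else "Too old: install 3.11 or newer")
```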

Install Ollama and Local Models

Download Ollama from:

https://ollama.com/

Then open Terminal and run:

ollama pull nomic-embed-text
ollama pull llama3.1

What these models do:

  • nomic-embed-text turns passages into numerical meaning representations
  • llama3.1 writes readable topic labels

These models run locally. Your texts are not sent to a paid service.
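To get a feel for what "numerical meaning representations" means, the toy example below compares three tiny vectors. The numbers are invented for illustration only; real embeddings from nomic-embed-text have hundreds of numbers per passage and are produced by the model, not written by hand. Cosine similarity is one common way to compare such vectors and is used here as an illustration, not as a description of what the project scripts do internally.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1 means similar direction, close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 3-number "embeddings" for three imaginary passages.
passage_about_whipping = [0.9, 0.1, 0.0]
passage_about_flogging = [0.8, 0.2, 0.1]
passage_about_a_sermon = [0.1, 0.9, 0.2]

if __name__ == "__main__":
    print("whipping vs flogging:", round(cosine_similarity(passage_about_whipping, passage_about_flogging), 2))
    print("whipping vs sermon:  ", round(cosine_similarity(passage_about_whipping, passage_about_a_sermon), 2))
```

The two punishment passages score much higher with each other than either does with the sermon passage. Clustering works on the same idea: passages whose vectors sit close together tend to end up in the same topic.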

Set Up the Python Environment

Create a Python environment:

python3 -m venv .venv

Activate it:

source .venv/bin/activate

Windows note: On Windows, use this command instead:

.venv\Scripts\activate

Install packages:

python -m pip install --upgrade pip
python -m pip install -r requirements.txt

This may take several minutes.

Run the 10% Topic Model

This command runs the full topic modeling pipeline on a 10% sample of the collection. It cleans and chunks the texts, converts each chunk into embeddings using the local nomic-embed-text model, clusters the chunks into topics using BERTopic, and generates readable topic labels using the local llama3.1 model. Results are saved to outputs/my_run/. No paid API key is needed. The run takes roughly 5–15 minutes on a modern laptop.

Note: If you run the script more than once with the same --output-dir, the previous results will be overwritten. To keep multiple runs separate, change the folder name each time — for example outputs/my_run_attempt2 or outputs/my_run_seed99. Each folder will contain its own independent set of CSV files and visualizations.

Mac or Linux:

python -u scripts/run_bertopic_sample.py \
  --output-dir outputs/my_run \
  --embedding-backend ollama \
  --ollama-embedding-model nomic-embed-text \
  --representation-backend ctfidf \
  --clustering sensitive \
  --label-backend ollama \
  --ollama-model llama3.1:latest

Windows (Command Prompt):

python -u scripts/run_bertopic_sample.py ^
  --output-dir outputs/my_run ^
  --embedding-backend ollama ^
  --ollama-embedding-model nomic-embed-text ^
  --representation-backend ctfidf ^
  --clustering sensitive ^
  --label-backend ollama ^
  --ollama-model llama3.1:latest

Each option in the command tells the script how to behave:

Option                                      What it does
--output-dir outputs/my_run                 Where to save your results; keeps them separate from the reference outputs
--embedding-backend ollama                  Use the local Ollama server to generate embeddings (no internet or API key needed)
--ollama-embedding-model nomic-embed-text   Which local model to use for converting text into numbers
--representation-backend ctfidf             How to find the most distinctive words for each topic
--clustering sensitive                      Use more sensitive clustering settings to find more, smaller topics
--label-backend ollama                      Use the local Ollama server to generate topic labels
--ollama-model llama3.1:latest              Which local language model to use for writing topic labels
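Because the line-continuation character differs between Mac/Linux (\) and Windows Command Prompt (^), some people prefer to launch the script from a small Python helper that works the same way everywhere. The sketch below is one way to do that. It only uses options that appear in this exercise (--seed and --sample-frac come up in the later sections), and the helper name build_command is made up for this example.

```python
import sys


def build_command(output_dir="outputs/my_run", seed=None, sample_frac=None):
    """Build the argument list for scripts/run_bertopic_sample.py."""
    cmd = [sys.executable, "-u", "scripts/run_bertopic_sample.py"]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if sample_frac is not None:
        cmd += ["--sample-frac", str(sample_frac)]
    cmd += [
        "--output-dir", output_dir,
        "--embedding-backend", "ollama",
        "--ollama-embedding-model", "nomic-embed-text",
        "--representation-backend", "ctfidf",
        "--clustering", "sensitive",
        "--label-backend", "ollama",
        "--ollama-model", "llama3.1:latest",
    ]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be checked before anything runs.
    print(" ".join(build_command()))
```

To actually start the run, add import subprocess and call subprocess.run(build_command(), check=True) from inside the project folder, with the virtual environment active.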

How To Know It Is Working

Students should see lines like:

Sampled 29 documents into 1543 chunks
Ollama embedded 25/1543 chunks
Ollama embedded 50/1543 chunks
...
Labeled topic 0: Whipping and Plantation Punishment
...
Done. Results written to outputs/my_run

If they see Done, the run worked.

sample_documents.csv is ready as soon as the first line appears

The line Sampled 29 documents into 1543 chunks means the file has already been written to outputs/my_run/. Students can open it immediately in a spreadsheet to see which 29 documents were selected, without waiting for the rest of the pipeline to finish. There is no need to interrupt the run; the script will keep running in the terminal and continue to the embedding, clustering, and labeling steps on its own.

If you do interrupt the run (by closing the terminal or pressing Ctrl+C) and then run the command again with the same --output-dir, the script will restart from the beginning. Embeddings take the most time, so an interrupted run may mean waiting through the embedding step again. To avoid losing progress, let the run finish once it starts.
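The progress lines shown above can be turned into a percentage, which makes it easier to judge how much of the embedding step is left. This sketch assumes the line format matches the example output exactly ("Ollama embedded 50/1543 chunks"); the function name embedding_progress is made up for this example.

```python
import re

# Matches progress lines such as "Ollama embedded 50/1543 chunks".
PROGRESS = re.compile(r"Ollama embedded (\d+)/(\d+) chunks")


def embedding_progress(line):
    """Return (done, total, percent) for a progress line, or None for any other line."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    return done, total, 100 * done / total


if __name__ == "__main__":
    done, total, percent = embedding_progress("Ollama embedded 50/1543 chunks")
    print(f"{done} of {total} chunks embedded ({percent:.1f}%)")
```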

Generate the Review Table

Run this after the topic model finishes. It reads the topic assignments from outputs/my_run/ and produces topic_review_table.csv — the easiest file to start with when reading your results. It also generates the metadata visualizations (charts comparing topics by publication decade and document).

Mac or Linux:

python scripts/visualize_topic_metadata.py \
  --output-dir outputs/my_run \
  --top-n 15

Windows (Command Prompt):

python scripts/visualize_topic_metadata.py ^
  --output-dir outputs/my_run ^
  --top-n 15

Each option in the command tells the script how to behave:

Option                        What it does
--output-dir outputs/my_run   Where to find your topic results and where to save the review table
--top-n 15                    How many topics to include in the visualizations (the 15 largest by chunk count)

Which 10% Sample This Uses

The script uses:

sample fraction = 10%
random seed = 42

That produces the same 29-document sample each time.

What the seed number does

A random seed is a starting number that controls how the script picks documents from the collection. The same seed always picks the same documents — so seed 42 will always select the same 29 files, on any computer, every time. This makes runs reproducible and comparable.
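The idea can be tried out with Python's built-in random module. This is an illustration of how seeds behave, not the project script's own sampling code, so the document numbers it prints will not match the 29 documents the script selects. The collection size of 294 documents comes from this exercise; the function name draw_sample and the rounding-down rule are assumptions made for this sketch.

```python
import random


def draw_sample(n_documents=294, fraction=0.10, seed=42):
    """Pick a fraction of document numbers using a seeded random generator."""
    rng = random.Random(seed)           # the seed fixes the sequence of random choices
    k = int(n_documents * fraction)     # 10% of 294, rounded down, is 29
    return sorted(rng.sample(range(n_documents), k))


if __name__ == "__main__":
    first = draw_sample(seed=42)
    second = draw_sample(seed=42)
    other = draw_sample(seed=9)
    print("Same seed, same sample:", first == second)
    print("Different seed, same sample:", first == other)
    print("Documents shared by the two seeds:", len(set(first) & set(other)))
```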

A different seed usually picks a mostly different set of documents, but some overlap is possible. Because the script is drawing 29 documents at random from 294, any two draws are likely to share a few documents by chance — the same way two people asked to pick 10% of a card deck often pull some of the same cards. To find out exactly which documents your run used, open sample_documents.csv in your output folder.

You can use any whole number as a seed — --seed 9, --seed 100, --seed 2024 — and each one produces its own 10% sample with its own topics. The number itself has no special meaning; what matters is that two runs with the same seed will always match.

Note: The Second 10% Sample page on this site used seed 9, which happens to have zero overlap with seed 42. Zero overlap is not guaranteed for every pair of seeds; it depends on which documents each draw happens to land on.
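How much overlap should you expect between two seeds? The arithmetic can be worked out directly, assuming each run draws 29 of the 294 documents uniformly at random and independently of the other run. Under that assumption the expected number of shared documents is 29 × 29 / 294, and the chance of no shared documents is the number of ways to pick 29 from the 265 unused documents divided by the number of ways to pick 29 from all 294. The function names below are made up for this sketch.

```python
from math import comb


def expected_overlap(total, k):
    """Average number of documents shared by two independent samples of size k."""
    return k * k / total


def prob_no_overlap(total, k):
    """Chance that two independent samples of size k share no documents at all."""
    return comb(total - k, k) / comb(total, k)


if __name__ == "__main__":
    print(f"Expected shared documents: {expected_overlap(294, 29):.2f}")
    print(f"Chance of zero overlap:    {prob_no_overlap(294, 29):.1%}")
```

On average two seeds share about three documents, and zero overlap happens only a few percent of the time, which is why the seed 9 and seed 42 pair is a lucky case rather than the norm.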

To try a different sample, add --seed to the command with any number you choose, and use a new output folder so your results are not overwritten:

python -u scripts/run_bertopic_sample.py \
  --seed 9 \
  --output-dir outputs/my_run_seed9 \
  --embedding-backend ollama \
  --ollama-embedding-model nomic-embed-text \
  --representation-backend ctfidf \
  --clustering sensitive \
  --label-backend ollama \
  --ollama-model llama3.1:latest

Open and Read the Results

Start with:

  • topic_review_table.csv
  • topic_labels_llm.csv
  • topic_info.csv
  • topic_assignments.csv

The easiest file for students is topic_review_table.csv. It shows the topic number, current label, number of chunks, topic words, publication-year range, and the documents that contribute most strongly to the topic.

topic_assignments.csv shows which topic each chunk was assigned to. Use it when a topic label needs to be checked against actual passages.

Here is a small preview of the topic review table:

Topic  Label                                            Chunks  Top Words
0      Whipping and Plantation Punishment               76      whip, flog, overseer, tie, plantation
1      Limitations of Liberty in the Post-War Era       66      government, constitution, liberty, american, war
2      Separation and Reunion                           55      dice, aunt dice, aunt, mos, riverside
3      Plantation Christianity and Slave Spirituality   45      meeting, preach, prayer, spirit, pray

Students can download the reference CSV files (topic review table, topic labels, topic info, and topic assignments) from the download buttons on this page. If a CSV opens in the browser instead of downloading, use File > Save Page As, or right-click the button and choose Save Link As.

Review and Improve Topic Labels

The topic labels are suggestions.

The current workflow uses:

BERTopic topic words + representative passages + local Llama label

Students can replace or improve those labels after reading the topic words and example passages.

That is human interpretation, and it is expected. Renaming a topic does not change the computer’s topic assignments. It only changes the words we use to describe the topic.

If a topic label seems too broad, open topic_review_table.csv, inspect the top_words column, then open topic_assignments.csv and read several chunks from that topic. After that, write a clearer human label.
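Reading several chunks from one topic is easier with a small filter than by scrolling through the whole spreadsheet. The sketch below pulls the first few passages for one topic from a CSV file. The column names "topic" and "text" are assumptions made for this example; open topic_assignments.csv first, check its actual header row, and pass the real column names in their place.

```python
import csv


def chunks_for_topic(csv_path, topic_id, topic_column="topic", text_column="text", limit=5):
    """Return up to `limit` passages assigned to topic_id.

    topic_column and text_column are placeholders; replace them with the
    column names that actually appear in your CSV file.
    """
    passages = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row[topic_column].strip() == str(topic_id):
                passages.append(row[text_column])
                if len(passages) == limit:
                    break
    return passages
```

For example, chunks_for_topic("outputs/my_run/topic_assignments.csv", 0) would return the first five passages assigned to topic 0, provided the column names match.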

If a label is manually revised, record that it was revised. A good note might say:

Label revised after reviewing top words and representative chunks.

Where Human Interpretation Comes In

Human interpretation shapes the results throughout the pipeline — from choosing the dataset and cleaning rules to inspecting topic words, renaming labels, and deciding what a visualization means in historical context. The computer helps organize the texts, but the researcher decides what the patterns mean.

For a full walkthrough of each decision point, see the Simple Explanation page.

Read the Charts

Running the two commands above also generates HTML visualizations in your output folder:

outputs/my_run/visualizations/
  topics.html                  # 2D topic cluster map
  topic_barchart.html          # Top words per topic
  topic_hierarchy.html         # Hierarchical topic clustering
  topics_over_time.html        # Topic prevalence as stacked area chart

outputs/my_run/metadata_visualizations/
  topic_prevalence_grouped_bars.html   # Topic shares by publication decade
  topic_prevalence_by_decade.html      # Topic trends by decade (faceted lines)
  topic_decade_heatmap.html            # Topics × decades heatmap
  document_topic_heatmap.html          # Documents × topics heatmap
  sample_documents_timeline.html       # Publication year scatter of sampled documents

Open any of these files by double-clicking them in your file browser. They open directly in a web browser with no internet connection needed.
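If you would rather list or open the charts from Python, the sketch below finds every HTML file in the two visualization folders named above. The function name list_charts is made up for this example.

```python
import webbrowser
from pathlib import Path


def list_charts(output_dir="outputs/my_run"):
    """Return the HTML charts in visualizations/ and metadata_visualizations/."""
    # "*visualizations" matches both folder names listed above.
    return sorted(Path(output_dir).glob("*visualizations/*.html"))


def open_chart(path):
    """Open one chart in the default web browser."""
    webbrowser.open(Path(path).resolve().as_uri())


if __name__ == "__main__":
    for chart in list_charts():
        print(chart)
```

Calling open_chart on any path printed by the script opens that chart in the browser, the same as double-clicking the file.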

All nine visualizations — with descriptions and interpretation notes — are on the Simple Explanation page. Refer to that page to understand what each chart shows, then compare the reference charts there against your own results in outputs/my_run/.

How To Read Topic Numbers

BERTopic uses topic numbers.

Important:

Topic -1 = outliers
Topic 0 = first regular topic
Topic 1 = second regular topic

Topic 0 is not the outlier. Only topic -1 is the outlier.
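When counting how many chunks each topic has, it helps to set the outliers aside first so that topic -1 is not mistaken for the largest topic. A minimal sketch, assuming you have already read the topic numbers into a list (the function name topic_sizes is made up for this example):

```python
from collections import Counter


def topic_sizes(topic_ids):
    """Count chunks per regular topic and report the outliers (topic -1) separately."""
    counts = Counter(int(t) for t in topic_ids)
    outliers = counts.pop(-1, 0)   # remove topic -1 from the regular counts
    return counts, outliers


if __name__ == "__main__":
    counts, outliers = topic_sizes([-1, 0, 0, 1, 0, -1, 2])
    print("Regular topics:", dict(counts))
    print("Outlier chunks:", outliers)
```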

Exercise Questions

After opening topic_labels_llm.csv or topic_review_table.csv, answer:

  1. Which topic label is easiest to understand?
  2. Which topic label seems too vague?
  3. Which topic words support that label?
  4. Which label would you revise after reading the topic words?
  5. Do any topics look similar but not identical?
  6. Which topic has the most chunks?
  7. Which topics seem connected to religion?
  8. Which topics seem connected to violence or punishment?
  9. Which topics seem connected to education or Booker T. Washington?

After opening the metadata visualizations, answer:

  1. Which topics appear mostly in earlier publication decades?
  2. Which topics appear mostly in later publication decades?
  3. Which topics are dominated by one document?
  4. Which topics appear across several documents?
  5. Do the time charts show event dates or publication dates? How can you tell?

Optional: Try a Different Sample

Once your first run works, you can draw a different 10% sample by changing the seed number. The More Analyses → Second 10% Sample tab on this site shows a pre-run reference using seed 9:

Pre-run reference results:  outputs/bertopic_sample2_nomic_sensitive_lemmatized/
Your results will go to:    outputs/my_run_seed9/

Mac or Linux:

python -u scripts/run_bertopic_sample.py \
  --seed 9 \
  --output-dir outputs/my_run_seed9 \
  --embedding-backend ollama \
  --ollama-embedding-model nomic-embed-text \
  --representation-backend ctfidf \
  --clustering sensitive \
  --label-backend ollama \
  --ollama-model llama3.1:latest

Windows (Command Prompt):

python -u scripts/run_bertopic_sample.py ^
  --seed 9 ^
  --output-dir outputs/my_run_seed9 ^
  --embedding-backend ollama ^
  --ollama-embedding-model nomic-embed-text ^
  --representation-backend ctfidf ^
  --clustering sensitive ^
  --label-backend ollama ^
  --ollama-model llama3.1:latest

Check outputs/my_run_seed9/sample_documents.csv to see which 29 documents were selected. Then run the visualization script to generate the review table and metadata charts:

Mac or Linux:

python scripts/visualize_topic_metadata.py \
  --output-dir outputs/my_run_seed9 \
  --top-n 15

Windows (Command Prompt):

python scripts/visualize_topic_metadata.py ^
  --output-dir outputs/my_run_seed9 ^
  --top-n 15

Compare your topic labels and charts against the reference on the Second 10% Sample page.

Optional: Run the Full Collection

Only do this after the 10% sample works. The full run processes all 294 documents and may take several hours. The More Analyses → Full Collection tab on this site shows a pre-run reference:

Pre-run reference results:  outputs/bertopic_full_nomic_sensitive_lemmatized/
Your results will go to:    outputs/my_run_full/

Mac or Linux:

python -u scripts/run_bertopic_sample.py \
  --sample-frac 1.0 \
  --output-dir outputs/my_run_full \
  --embedding-backend ollama \
  --ollama-embedding-model nomic-embed-text \
  --representation-backend ctfidf \
  --clustering sensitive \
  --label-backend ollama \
  --ollama-model llama3.1:latest

Windows (Command Prompt):

python -u scripts/run_bertopic_sample.py ^
  --sample-frac 1.0 ^
  --output-dir outputs/my_run_full ^
  --embedding-backend ollama ^
  --ollama-embedding-model nomic-embed-text ^
  --representation-backend ctfidf ^
  --clustering sensitive ^
  --label-backend ollama ^
  --ollama-model llama3.1:latest

When the run finishes, generate the review table and metadata charts:

Mac or Linux:

python scripts/visualize_topic_metadata.py \
  --output-dir outputs/my_run_full \
  --top-n 15

Windows (Command Prompt):

python scripts/visualize_topic_metadata.py ^
  --output-dir outputs/my_run_full ^
  --top-n 15

Once the charts are generated, compare your results against the Full Collection page.

Common Problems

ollama command not found

Ollama is not installed, or Terminal cannot find it. Install Ollama, then close and reopen Terminal.
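To check whether the terminal can find a command at all, Python's standard library can search the same locations the terminal searches. This sketch uses shutil.which; the function name find_tool is made up for this example.

```python
import shutil


def find_tool(name):
    """Return the full path of a command if the terminal can find it, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool("ollama")
    if location is None:
        print("ollama was not found. Install it, then close and reopen the terminal.")
    else:
        print("ollama found at", location)
```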

Python says packages are missing

Make sure the virtual environment is active.

Mac or Linux:

source .venv/bin/activate

Windows (Command Prompt):

.venv\Scripts\activate

Then reinstall packages:

python -m pip install -r requirements.txt
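If you are not sure whether the virtual environment is active, Python can report it. Inside a virtual environment created with venv, sys.prefix points at the .venv folder while sys.base_prefix points at the main Python installation, so the two differ. The function name in_virtual_environment is made up for this sketch.

```python
import sys


def in_virtual_environment():
    """Return True when Python is running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment is active:", sys.prefix)
    else:
        print("No virtual environment is active. Run the activate command first.")
```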

Ollama connection error

Open the Ollama app, or run:

ollama list

If Ollama is working, it should show installed models.

The run is slow

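That is normal. Creating embeddings is the slowest step. Later runs may be able to reuse embeddings from a completed run, but an interrupted run starts the embedding step over, so let the run finish once it starts.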

A topic label looks wrong

That is also normal. Topic labels are suggestions.

Open topic_review_table.csv, topic_info.csv, and topic_assignments.csv to inspect the actual words and passages behind the label.