Hands-On Exercise
Run a free local topic model and interpret the results
This exercise is for students with no prior coding experience.
Students will download the original dataset, use the provided scripts and reference outputs, and run a 10% BERTopic sample on their own computer.
What Students Will Make
By the end, students will have:
- a spreadsheet of topic labels
- a spreadsheet showing which passage belongs to which topic
- a review table with topic labels, top words, years, and contributing documents
- a short interpretation of what the topics and charts suggest
How This Setup Works
There are two pieces:
- Original dataset: downloaded from the original Documenting the American South website.
- Project files: scripts, instructions, and reference results.
Download the original dataset separately so the source of the texts and citation information are clear.
Reference outputs vs. your outputs
The repository already contains a complete set of pre-generated results in:
outputs/bertopic_sample_nomic_sensitive_lemmatized/
Do not write to that folder. It contains the reference run used to build the charts and download links on this site. You can use those files to compare against your own results.
When you run the scripts in this exercise, your results will go to a separate folder:
outputs/my_run/
Keep these two folders separate so you can always compare your results to the reference.
Reminder: All commands in this exercise must be run from inside the project folder (topic-modeling-slave-narratives/). Open a terminal there before running anything; see Open a Terminal below. Your CSV results and HTML charts will appear in outputs/my_run/ when the scripts finish.
Download the Original Dataset
Go to the Documenting the American South North American Slave Narratives collection:
https://docsouth.unc.edu/neh/
Download the North American Slave Narratives data package from the collection page.
After downloading, unzip the data. The dataset should include:
data/texts
data/xml
data/toc.csv
data/readme.txt
The scripts expect the dataset’s data folder to be inside the same main project folder as the scripts.
Get the GitHub Repository
The scripts and reference outputs are in this GitHub repository:
https://github.com/jinghanlib/topic-modeling-slave-narratives
Option A — Download as ZIP (no Git required):
Go to the repository page, click the green Code button, choose Download ZIP, and unzip the downloaded file somewhere on your computer.
Option B — Clone with Git:
If you have Git installed, open a terminal and run:
git clone https://github.com/jinghanlib/topic-modeling-slave-narratives.git
Then copy or move the downloaded dataset’s data folder into the repository folder.
Your project folder should look like this:
topic-modeling-slave-narratives/
  data/
    texts/
    xml/
    toc.csv
  scripts/
    run_bertopic_sample.py
    visualize_topic_metadata.py
  instructions/
  outputs/
  requirements.txt
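Once Python is installed (see Install the Tools below), a short script can confirm that the folder layout above is in place before anything else is run. This is only a sketch: the function name missing_paths is made up for this example, and the list of paths is copied from the layout shown above.

```python
from pathlib import Path

# Paths copied from the project folder layout shown above.
EXPECTED_PATHS = [
    "data/texts",
    "data/xml",
    "data/toc.csv",
    "scripts/run_bertopic_sample.py",
    "scripts/visualize_topic_metadata.py",
    "requirements.txt",
]


def missing_paths(project_root):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # "." means the folder the terminal is currently pointed at.
    missing = missing_paths(".")
    if missing:
        print("Missing from the project folder:")
        for p in missing:
            print("  " + p)
    else:
        print("Project folder layout looks complete.")
```

If the script lists data/texts or data/toc.csv as missing, the dataset’s data folder has probably not been moved into the project folder yet.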
Local Compute: What It Means
This exercise runs on the student’s own computer.
That means:
- the topic modeling pipeline runs on your computer — the embedding, clustering, and labeling steps all happen locally
- no paid API is needed
- enough disk space is needed for Python packages and AI models
- the first run may take several minutes
- the computer may become warm or use more battery
Approximate space needed:
Python environment and packages: several GB
nomic-embed-text model: about 300 MB
llama3.1 model: about 5 GB
project data and outputs: hundreds of MB to a few GB
Approximate time:
10% sample: about 5 to 15 minutes on a modern laptop
full collection: much longer, possibly hours
Times vary by computer.
Open a Terminal
Several steps below require typing commands into a terminal (also called a command line or shell).
The easiest way is to right-click the project folder itself:
- Mac: Right-click the folder in Finder and choose New Terminal at Folder (it may appear under Services). If you do not see that option, go to System Settings → Keyboard → Keyboard Shortcuts → Services → Files and Folders and enable New Terminal at Folder.
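- Windows: Hold Shift and right-click the folder in File Explorer, then choose Open PowerShell window here or Open Command window here. The multi-line Windows commands in this exercise are written for Command Prompt, where a ^ at the end of a line continues the command. PowerShell does not treat ^ that way, so in PowerShell type each command on a single line without the ^ characters.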
This opens a terminal already pointed at the right location. You do not need to type any folder path. If you close and reopen the terminal later, right-click the folder again before running any commands.
Install the Tools
Install Python
Download Python from:
https://www.python.org/downloads/
Use Python 3.11 or newer.
To check Python, open a terminal and run:
python3 --version
If it prints a version number, Python is installed.
Windows note: If python3 --version gives an error, try python --version instead. Windows often uses python rather than python3.
Install Ollama and Local Models
Download Ollama from:
https://ollama.com/
Then open Terminal and run:
ollama pull nomic-embed-text
ollama pull llama3.1
What these models do:
- nomic-embed-text turns passages into numerical meaning representations
- llama3.1 writes readable topic labels
These models run locally. Your texts are not uploaded to a paid service.
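To confirm that both models were pulled, the text printed by ollama list can be checked with a few lines of Python. This is a sketch under one assumption: that the output is a header row followed by one row per model, with the model name (such as llama3.1:latest) in the first column. The function name missing_models and the example text are made up for illustration.

```python
REQUIRED_MODELS = ["nomic-embed-text", "llama3.1"]


def missing_models(ollama_list_output, required=REQUIRED_MODELS):
    """Return required model names that do not appear in `ollama list` output.

    Assumes each row starts with a model name such as "llama3.1:latest";
    the tag after the colon is ignored.
    """
    installed = set()
    for line in ollama_list_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if parts:
            installed.add(parts[0].split(":")[0])
    return [name for name in required if name not in installed]


# Illustrative text only, not real output from any machine.
example_output = """NAME                      ID      SIZE    MODIFIED
nomic-embed-text:latest   abc123  274 MB  2 days ago
"""
print(missing_models(example_output))  # prints ['llama3.1']
```

An empty list means both models are present. Any name that is printed still needs an ollama pull.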
Set Up the Python Environment
Create a Python environment:
python3 -m venv .venv
Activate it:
source .venv/bin/activate
Windows note: On Windows, use this command instead:
.venv\Scripts\activate
Install packages:
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
This may take several minutes.
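If it is unclear whether the environment is active, Python can report it. The sketch below relies on a documented behavior of virtual environments: inside one, sys.prefix differs from sys.base_prefix. The function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when Python is running from a virtual environment.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the system-wide Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment is active:", sys.prefix)
    else:
        print("No virtual environment is active. Activate .venv first.")
```

When the environment is active, the printed folder should end in .venv.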
Run the 10% Topic Model
This command runs the full topic modeling pipeline on a 10% sample of the collection. It cleans and chunks the texts, converts each chunk into embeddings using the local nomic-embed-text model, clusters the chunks into topics using BERTopic, and generates readable topic labels using the local llama3.1 model. Results are saved to outputs/my_run/. No paid API key is needed. The run takes roughly 5–15 minutes on a modern laptop.
Note: If you run the script more than once with the same --output-dir, the previous results will be overwritten. To keep multiple runs separate, change the folder name each time, for example outputs/my_run_attempt2 or outputs/my_run_seed99. Each folder will contain its own independent set of CSV files and visualizations.
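One way to avoid overwriting an earlier run is to let a small helper suggest a folder name that is not in use yet. This is a sketch, not part of the project scripts: next_free_output_dir is a hypothetical helper, and the _attempt2, _attempt3 naming simply follows the example in the note above.

```python
from pathlib import Path


def next_free_output_dir(base="outputs/my_run"):
    """Return base if it does not exist yet, else base_attempt2, base_attempt3, ..."""
    candidate = Path(base)
    attempt = 2
    while candidate.exists():
        candidate = Path(f"{base}_attempt{attempt}")
        attempt += 1
    return candidate


if __name__ == "__main__":
    # Only checks which folders exist; nothing is created or deleted.
    print(next_free_output_dir())
```

The printed name can then be passed to --output-dir in the command below.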
Mac or Linux:
python -u scripts/run_bertopic_sample.py \
--output-dir outputs/my_run \
--embedding-backend ollama \
--ollama-embedding-model nomic-embed-text \
--representation-backend ctfidf \
--clustering sensitive \
--label-backend ollama \
--ollama-model llama3.1:latest
Windows (Command Prompt):
python -u scripts/run_bertopic_sample.py ^
--output-dir outputs/my_run ^
--embedding-backend ollama ^
--ollama-embedding-model nomic-embed-text ^
--representation-backend ctfidf ^
--clustering sensitive ^
--label-backend ollama ^
--ollama-model llama3.1:latest
Each option in the command tells the script how to behave:
| Option | What it does |
|---|---|
| --output-dir outputs/my_run | Where to save your results; keeps them separate from the reference outputs |
| --embedding-backend ollama | Use the local Ollama server to generate embeddings (no internet or API key needed) |
| --ollama-embedding-model nomic-embed-text | Which local model to use for converting text into numbers |
| --representation-backend ctfidf | How to find the most distinctive words for each topic |
| --clustering sensitive | Use more sensitive clustering settings to find more, smaller topics |
| --label-backend ollama | Use the local Ollama server to generate topic labels |
| --ollama-model llama3.1:latest | Which local language model to use for writing topic labels |
How To Know It Is Working
Students should see lines like:
Sampled 29 documents into 1543 chunks
Ollama embedded 25/1543 chunks
Ollama embedded 50/1543 chunks
...
Labeled topic 0: Whipping and Plantation Punishment
...
Done. Results written to outputs/my_run
If they see Done, the run worked.
sample_documents.csv is ready as soon as the first line appears — Sampled 29 documents into 1543 chunks means the file has already been written to outputs/my_run/. Students can open it immediately in a spreadsheet to see which 29 documents were selected, without waiting for the rest of the pipeline to finish. There is no need to interrupt the run — the script will keep running in the terminal and continue to the embedding, clustering, and labeling steps on its own.
If you do interrupt the run (by closing the terminal or pressing Ctrl+C) and then run the command again with the same --output-dir, the script will restart from the beginning. Embeddings take the most time, so an interrupted run may mean waiting through the embedding step again. To avoid losing progress, let the run finish once it starts.
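The progress lines shown earlier in this section can be turned into a percentage, which helps when judging how much of the embedding step is left. This sketch assumes the lines look exactly like the example "Ollama embedded 25/1543 chunks"; the function name embedding_progress is made up for illustration.

```python
import re

# Matches progress lines such as "Ollama embedded 25/1543 chunks".
PROGRESS_PATTERN = re.compile(r"Ollama embedded (\d+)/(\d+) chunks")


def embedding_progress(line):
    """Return the fraction of chunks embedded so far, or None for other lines."""
    match = PROGRESS_PATTERN.search(line)
    if match is None:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    return done / total if total else None


print(f"{embedding_progress('Ollama embedded 50/1543 chunks'):.1%}")  # prints 3.2%
```

Lines that are not embedding progress, such as the "Labeled topic" lines, return None and can be ignored.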
Generate the Review Table
Run this after the topic model finishes. It reads the topic assignments from outputs/my_run/ and produces topic_review_table.csv — the easiest file to start with when reading your results. It also generates the metadata visualizations (charts comparing topics by publication decade and document).
Mac or Linux:
python scripts/visualize_topic_metadata.py \
--output-dir outputs/my_run \
--top-n 15
Windows (Command Prompt):
python scripts/visualize_topic_metadata.py ^
--output-dir outputs/my_run ^
--top-n 15
| Option | What it does |
|---|---|
| --output-dir outputs/my_run | Where to find your topic results and where to save the review table |
| --top-n 15 | How many topics to include in the visualizations (the 15 largest by chunk count) |
Which 10% Sample This Uses
The script uses:
sample fraction = 10%
random seed = 42
That produces the same 29-document sample each time.
What the seed number does
A random seed is a starting number that controls how the script picks documents from the collection. The same seed always picks the same documents — so seed 42 will always select the same 29 files, on any computer, every time. This makes runs reproducible and comparable.
A different seed usually picks a mostly different set of documents, but some overlap is possible. Because the script is drawing 29 documents at random from 294, any two draws are likely to share a few documents by chance — the same way two people asked to pick 10% of a card deck often pull some of the same cards. To find out exactly which documents your run used, open sample_documents.csv in your output folder.
You can use any whole number as a seed — --seed 9, --seed 100, --seed 2024 — and each one produces its own 10% sample with its own topics. The number itself has no special meaning; what matters is that two runs with the same seed will always match.
Note: The Second 10% Sample page on this site used seed 9, which happens to have zero overlap with seed 42. Zero overlap is not guaranteed for every pair of seeds; it depends on which documents each draw happens to land on.
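The behavior of seeds and overlap described above can be tried out with Python's built-in random module. This is a sketch of the idea only: the project script may use a different sampling routine, so the indices drawn here will not match the documents listed in sample_documents.csv. The function name draw_sample is made up for this example.

```python
import random

COLLECTION_SIZE = 294  # documents in the full collection
SAMPLE_SIZE = 29       # documents in a 10% sample


def draw_sample(seed):
    """Draw SAMPLE_SIZE document indices using a seeded random generator."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(COLLECTION_SIZE), SAMPLE_SIZE))


# The same seed reproduces the same sample.
assert draw_sample(42) == draw_sample(42)

# Average overlap between two independent draws: each of the 29 documents
# in one sample has a 29/294 chance of also being in the other.
expected_overlap = SAMPLE_SIZE * SAMPLE_SIZE / COLLECTION_SIZE
print(round(expected_overlap, 2))  # prints 2.86

# Probability that two independent draws share no documents at all.
p_zero = 1.0
for i in range(SAMPLE_SIZE):
    p_zero *= (COLLECTION_SIZE - SAMPLE_SIZE - i) / (COLLECTION_SIZE - i)
print(round(p_zero, 3))
```

The first number is the average count of shared documents between two random 10% samples. The second is the chance that two samples share nothing.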
To try a different sample, add --seed to the command with any number you choose, and use a new output folder so your results are not overwritten:
python -u scripts/run_bertopic_sample.py \
--seed 9 \
--output-dir outputs/my_run_seed9 \
--embedding-backend ollama \
--ollama-embedding-model nomic-embed-text \
--representation-backend ctfidf \
--clustering sensitive \
--label-backend ollama \
--ollama-model llama3.1:latest
Open and Read the Results
Start with:
- topic_review_table.csv
- topic_labels_llm.csv
- topic_info.csv
- topic_assignments.csv
The easiest file for students is topic_review_table.csv. It shows the topic number, current label, number of chunks, topic words, publication-year range, and the documents that contribute most strongly to the topic.
topic_assignments.csv shows which topic each chunk was assigned to. Use it when a topic label needs to be checked against actual passages.
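Reading passages for one topic can also be done with Python's built-in csv module instead of a spreadsheet. This is a sketch under a loud assumption: the column names topic and text are hypothetical, so check the header row of topic_assignments.csv and pass the real names. The function name chunks_for_topic is also made up for this example.

```python
import csv
import io


def chunks_for_topic(csv_file, topic_id, topic_col="topic", text_col="text", limit=5):
    """Return up to `limit` passages assigned to topic_id.

    topic_col and text_col are hypothetical column names; replace them
    with the headers that actually appear in topic_assignments.csv.
    """
    passages = []
    for row in csv.DictReader(csv_file):
        if int(row[topic_col]) == topic_id:
            passages.append(row[text_col])
            if len(passages) == limit:
                break
    return passages


# Tiny made-up table standing in for topic_assignments.csv.
demo = io.StringIO("topic,text\n0,first passage\n3,second passage\n0,third passage\n")
print(chunks_for_topic(demo, 0))  # prints ['first passage', 'third passage']
```

To use it on a real run, open the file with open("outputs/my_run/topic_assignments.csv", newline="", encoding="utf-8") and pass the opened file in place of demo.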
Here is a small preview of the topic review table:
| Topic | Label | Chunks | Top Words |
|---|---|---|---|
| 0 | Whipping and Plantation Punishment | 76 | whip, flog, overseer, tie, plantation |
| 1 | Limitations of Liberty in the Post-War Era | 66 | government, constitution, liberty, american, war |
| 2 | Separation and Reunion | 55 | dice, aunt dice, aunt, mos, riverside |
| 3 | Plantation Christianity and Slave Spirituality | 45 | meeting, preach, prayer, spirit, pray |
Students can download the CSV files from the buttons below. If a CSV opens in the browser instead of downloading, use File > Save Page As or right-click the button and choose Save Link As.
Download topic review table Download topic labels Download topic info Download topic assignments
Review and Improve Topic Labels
The topic labels are suggestions.
The current workflow uses:
BERTopic topic words + representative passages + local Llama label
Students can replace or improve those labels after reading the topic words and example passages.
That is human interpretation, and it is expected. Renaming a topic does not change the computer’s topic assignments. It only changes the words we use to describe the topic.
If a topic label seems too broad, open topic_review_table.csv, inspect the top_words column, then open topic_assignments.csv and read several chunks from that topic. After that, write a clearer human label.
If a label is manually revised, record that it was revised. A good note might say:
Label revised after reviewing top words and representative chunks.
Where Human Interpretation Comes In
Human interpretation shapes the results throughout the pipeline — from choosing the dataset and cleaning rules to inspecting topic words, renaming labels, and deciding what a visualization means in historical context. The computer helps organize the texts, but the researcher decides what the patterns mean.
For a full walkthrough of each decision point, see the Simple Explanation page.
Read the Charts
Running the two commands above also generates HTML visualizations in your output folder:
outputs/my_run/visualizations/
  topics.html # 2D topic cluster map
  topic_barchart.html # Top words per topic
  topic_hierarchy.html # Hierarchical topic clustering
  topics_over_time.html # Topic prevalence as stacked area chart
outputs/my_run/metadata_visualizations/
  topic_prevalence_grouped_bars.html # Topic shares by publication decade
  topic_prevalence_by_decade.html # Topic trends by decade (faceted lines)
  topic_decade_heatmap.html # Topics × decades heatmap
  document_topic_heatmap.html # Documents × topics heatmap
  sample_documents_timeline.html # Publication year scatter of sampled documents
Open any of these files by double-clicking them in your file browser. They open directly in a web browser with no internet connection needed.
All nine visualizations — with descriptions and interpretation notes — are on the Simple Explanation page. Refer to that page to understand what each chart shows, then compare the reference charts there against your own results in outputs/my_run/.
How To Read Topic Numbers
BERTopic uses topic numbers.
Important:
Topic -1 = outliers
Topic 0 = first regular topic
Topic 1 = second regular topic
Topic 0 is not the outlier. Only topic -1 is the outlier.
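When counting how many chunks each topic has, the outlier group should be set aside first so that it is not mistaken for a topic. The sketch below shows one way to do that with a plain list of topic numbers; the function name topic_sizes and the example numbers are made up for illustration.

```python
from collections import Counter

OUTLIER_TOPIC = -1  # BERTopic's label for chunks that fit no topic


def topic_sizes(topic_ids):
    """Count chunks per regular topic and report the outlier count separately."""
    counts = Counter(topic_ids)
    outliers = counts.pop(OUTLIER_TOPIC, 0)
    return counts.most_common(), outliers


# Made-up assignments for eight chunks.
sizes, outliers = topic_sizes([0, 0, -1, 1, 0, -1, 2, 1])
print(sizes)     # prints [(0, 3), (1, 2), (2, 1)]
print(outliers)  # prints 2
```

Each pair in the first line is (topic number, chunk count), largest first. The second line is the number of outlier chunks.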
Exercise Questions
After opening topic_labels_llm.csv or topic_review_table.csv, answer:
- Which topic label is easiest to understand?
- Which topic label seems too vague?
- Which topic words support that label?
- Which label would you revise after reading the topic words?
- Do any topics look similar but not identical?
- Which topic has the most chunks?
- Which topics seem connected to religion?
- Which topics seem connected to violence or punishment?
- Which topics seem connected to education or Booker T. Washington?
After opening the metadata visualizations, answer:
- Which topics appear mostly in earlier publication decades?
- Which topics appear mostly in later publication decades?
- Which topics are dominated by one document?
- Which topics appear across several documents?
- Do the time charts show event dates or publication dates? How can you tell?
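The question about topics dominated by one document can be checked with a count as well as by eye. The sketch below takes one (topic, document) pair per chunk and reports, for each topic, the document that contributes the most chunks and its share. It is an illustration only: top_document_share is a hypothetical helper, and the pairs would have to be read from topic_assignments.csv using whatever column names that file actually has.

```python
from collections import Counter, defaultdict


def top_document_share(pairs):
    """For each topic, return (document, share) for its largest contributor.

    pairs is an iterable of (topic_id, document_id) tuples, one per chunk.
    """
    per_topic = defaultdict(Counter)
    for topic_id, document_id in pairs:
        per_topic[topic_id][document_id] += 1
    result = {}
    for topic_id, counts in per_topic.items():
        document_id, n = counts.most_common(1)[0]
        result[topic_id] = (document_id, n / sum(counts.values()))
    return result


# Made-up pairs: topic 0 draws on two documents, topic 1 on only one.
demo_pairs = [(0, "docA"), (0, "docA"), (0, "docB"), (1, "docC")]
shares = top_document_share(demo_pairs)
print(shares[1])  # prints ('docC', 1.0)
```

A share close to 1.0 means a single document supplies nearly all of a topic's chunks, which is worth noting before treating the topic as a pattern across the collection.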
Optional: Try a Different Sample
Once your first run works, you can draw a different 10% sample by changing the seed number. The More Analyses → Second 10% Sample tab on this site shows a pre-run reference using seed 9:
Pre-run reference results: outputs/bertopic_sample2_nomic_sensitive_lemmatized/
Your results will go to: outputs/my_run_seed9/
Mac or Linux:
python -u scripts/run_bertopic_sample.py \
--seed 9 \
--output-dir outputs/my_run_seed9 \
--embedding-backend ollama \
--ollama-embedding-model nomic-embed-text \
--representation-backend ctfidf \
--clustering sensitive \
--label-backend ollama \
--ollama-model llama3.1:latest
Windows (Command Prompt):
python -u scripts/run_bertopic_sample.py ^
--seed 9 ^
--output-dir outputs/my_run_seed9 ^
--embedding-backend ollama ^
--ollama-embedding-model nomic-embed-text ^
--representation-backend ctfidf ^
--clustering sensitive ^
--label-backend ollama ^
--ollama-model llama3.1:latest
Check outputs/my_run_seed9/sample_documents.csv to see which 29 documents were selected. Then run the visualization script to generate the review table and metadata charts:
Mac or Linux:
python scripts/visualize_topic_metadata.py \
--output-dir outputs/my_run_seed9 \
--top-n 15
Windows (Command Prompt):
python scripts/visualize_topic_metadata.py ^
--output-dir outputs/my_run_seed9 ^
--top-n 15
Compare your topic labels and charts against the reference on the Second 10% Sample page.
Optional: Run the Full Collection
Only do this after the 10% sample works. The full run processes all 294 documents and may take several hours. The More Analyses → Full Collection tab on this site shows a pre-run reference:
Pre-run reference results: outputs/bertopic_full_nomic_sensitive_lemmatized/
Your results will go to: outputs/my_run_full/
Mac or Linux:
python -u scripts/run_bertopic_sample.py \
--sample-frac 1.0 \
--output-dir outputs/my_run_full \
--embedding-backend ollama \
--ollama-embedding-model nomic-embed-text \
--representation-backend ctfidf \
--clustering sensitive \
--label-backend ollama \
--ollama-model llama3.1:latest
Windows (Command Prompt):
python -u scripts/run_bertopic_sample.py ^
--sample-frac 1.0 ^
--output-dir outputs/my_run_full ^
--embedding-backend ollama ^
--ollama-embedding-model nomic-embed-text ^
--representation-backend ctfidf ^
--clustering sensitive ^
--label-backend ollama ^
--ollama-model llama3.1:latest
When the run finishes, generate the review table and metadata charts:
Mac or Linux:
python scripts/visualize_topic_metadata.py \
--output-dir outputs/my_run_full \
--top-n 15
Windows (Command Prompt):
python scripts/visualize_topic_metadata.py ^
--output-dir outputs/my_run_full ^
--top-n 15
Then compare your results against the reference on the Full Collection page.
Common Problems
ollama command not found
Ollama is not installed, or Terminal cannot find it. Install Ollama, then close and reopen Terminal.
Python says packages are missing
Make sure the virtual environment is active.
Mac or Linux:
source .venv/bin/activate
Windows (Command Prompt):
.venv\Scripts\activate
Then reinstall packages:
python -m pip install -r requirements.txt
Ollama connection error
Open the Ollama app, or run:
ollama list
If Ollama is working, it should show installed models.
The run is slow
That is normal. The first run creates embeddings. Later runs reuse them.
A topic label looks wrong
That is also normal. Topic labels are suggestions.
Open topic_review_table.csv, topic_info.csv, and topic_assignments.csv to inspect the actual words and passages behind the label.