Topic Modeling Slave Narratives

A computational approach to historical texts

What We Set Out to Do

What themes appear across the texts?
Which passages seem related to each other?
Do themes change across publication dates?
Can we make the computer’s topic names readable?

The Collection

North American Slave Narratives

294 texts from Documenting the American South, University of North Carolina
First-person accounts, abolitionist testimonies, autobiographies
Published across more than a century: 1770s–1940s
Full texts in plain text and XML

For this exercise: a reproducible 10% sample — 29 documents, same seed each time.

Why Sample?

Full collection

294 documents
Likely more than 15,000 chunks, if the sample's rate of about 53 chunks per document holds
Hours to embed and cluster
Needs a powerful machine

10% sample

29 documents → 1,543 chunks
5–15 minutes on a modern laptop
Same pipeline, same outputs
Reproducible with seed = 42
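
A minimal sketch of how a seeded 10% sample can be drawn. The document ids and the function name are hypothetical; the exercise's own sampling code may differ.

```python
import random

def sample_documents(doc_ids, fraction=0.10, seed=42):
    """Return a reproducible random sample of document ids."""
    k = round(len(doc_ids) * fraction)   # 294 * 0.10 = 29.4, rounded to 29
    rng = random.Random(seed)            # private generator: same seed, same sample
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(doc_ids), k))

doc_ids = [f"doc_{i:03d}" for i in range(294)]  # hypothetical ids
sample = sample_documents(doc_ids)
print(len(sample))  # 29
```

Running it twice with the same seed returns the same 29 ids, which is what makes the exercise comparable across machines.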

Why Topic Modeling?

Two Approaches

Word counting

Counts how often each word appears
Treats every passage independently
Lists whip, flog, punishment separately
Answers: which words are most frequent?

Topic modeling

Groups passages by shared meaning
Recognizes related passages even with different words
Groups whip, flog, overseer, punishment as one theme
Answers: which passages discuss similar things?

Example — The Punishment Theme

A word count would list these words separately:

whip · flog · overseer · punishment · cruelty · tie · plantation

Topic modeling groups the passages that use these words together into one topic:

Topic 0: Whipping and Plantation Punishment
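
To see why counting alone misses the theme, here is a small sketch with two invented passages (not quotations from the collection). Both concern punishment, yet they share no words at all.

```python
from collections import Counter

# Two invented passages for illustration only.
passages = [
    "the overseer would whip any hand who was late",
    "they threatened to flog him as punishment",
]

# Word counting: one tally per word, with no notion of related meaning.
counts = Counter(word for p in passages for word in p.split())
theme_words = {"whip", "flog", "overseer", "punishment"}
for word in sorted(theme_words):
    print(word, counts[word])

# Words the two passages have in common.
shared = set(passages[0].split()) & set(passages[1].split())
print(shared)  # set()
```

A word count reports four separate words with one occurrence each. Grouping the two passages requires a representation of meaning, which is what the embedding step supplies.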

Preparing the Texts

Cleaning — Every Choice Is a Research Decision

Remove image markers like [Cover Image]
Remove frontmatter words: PREFACE, CONTENTS, CHAPTER
Remove numbers and symbols
Remove common English stopwords: the, and, of, to
Remove collection-specific words: page, image, chapter

What we keep is as important as what we remove.
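
The cleaning steps above can be sketched as one small function. The word lists are tiny illustrative subsets, and two choices are assumptions not stated above: every bracketed span is removed, and text is lowercased before filtering.

```python
import re

STOPWORDS = {"the", "and", "of", "to"}  # tiny illustrative subset
COLLECTION_WORDS = {"page", "image", "chapter", "preface", "contents"}
REMOVE = STOPWORDS | COLLECTION_WORDS

def clean_text(text):
    """Apply the cleaning choices in order and return the remaining words."""
    text = re.sub(r"\[[^\]]*\]", " ", text)   # bracketed markers such as [Cover Image]
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # numbers and symbols
    words = text.lower().split()              # assumption: lowercase before filtering
    return " ".join(w for w in words if w not in REMOVE)

raw = "[Cover Image] CHAPTER 1. The sale of my mother, and the journey to Georgia."
print(clean_text(raw))  # sale my mother journey georgia
```

Each line is a research decision: removing all bracketed text, for example, would also remove any editorial notes printed in brackets.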

Chunking

Long narratives contain many different themes. Giving a whole narrative to the model as one unit loses that internal variation.

We divided each text into chunks:

About 750 words per chunk
Small overlap between neighboring chunks — so no passage falls between two chunks
29 documents → 1,543 chunks

The topic model grouped these chunks, not whole narratives.
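
A minimal sketch of overlapping chunks. The 750-word size comes from the description above; the 50-word overlap is an assumed value, since only a "small overlap" is stated.

```python
def chunk_words(words, size=750, overlap=50):
    """Split a list of words into chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this chunk already reaches the end
            break
    return chunks

words = [f"w{i}" for i in range(2000)]  # stand-in for a 2,000-word narrative
chunks = chunk_words(words)
print([len(c) for c in chunks])  # [750, 750, 600]
```

The last 50 words of each chunk reappear as the first 50 words of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.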

Tokenization and Lemmatization

These two steps standardize how words are represented:

Tokenization — breaking text into individual units

Master and master may be treated as different words
Hyphenated or contracted words may split unexpectedly

Lemmatization — reducing variants to a shared base form

whip, whipped, whipping  →  whip
slaves                   →  slave
preached, preaching      →  preach
slaveholders             →  slaveholder

Both are choices made by the researcher.
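
The two steps can be illustrated with a toy example. The lookup table below contains only the word pairs listed above; a real lemmatizer relies on a full lexicon and part-of-speech information, so treat this as a sketch of the idea rather than the method used in the exercise.

```python
import re

# Toy lookup table built from the examples above.
LEMMAS = {
    "whipped": "whip", "whipping": "whip",
    "slaves": "slave",
    "preached": "preach", "preaching": "preach",
    "slaveholders": "slaveholder",
}

def tokenize(text, lowercase=True):
    """Break text into runs of letters; apostrophes and hyphens split words."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z]+", text)

def lemmatize(tokens):
    """Replace each token with its base form when the table knows one."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("The slaveholders preached while whipping slaves.")
print(lemmatize(tokens))
print(tokenize("Master", lowercase=False), tokenize("master", lowercase=False))
print(tokenize("didn't"))  # the contraction splits into two tokens
```

Without lowercasing, Master and master stay distinct tokens. With this simple pattern, a contraction such as didn't splits into two pieces, which is the kind of unexpected behavior worth checking in the output.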

The Pipeline

Pipeline Overview

Raw texts → Clean & chunk → Embed → Reduce dims → Cluster → Topic words → Label → Results

Step 1 — Embeddings

An embedding is a list of 768 numbers representing the meaning of a passage.

Passages with similar meanings get similar numbers — even when they use different words.

Two passages about being sold away from family → similar number lists
One passage about plantation labor and one about railroad travel → very different number lists

BERTopic uses that structure to find groups of passages that cluster together.

We used nomic-embed-text — runs locally, no paid API, no internet required.
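
One common way to compare two number lists is cosine similarity; using it here is an assumption for illustration. The vectors below are invented and have 4 numbers each instead of 768, so only the pattern matters, not the values.

```python
import math

def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point the same way, near 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 4-number vectors standing in for real 768-number embeddings.
sold_from_family_1 = [0.9, 0.1, 0.0, 0.2]
sold_from_family_2 = [0.8, 0.2, 0.1, 0.3]
railroad_travel = [0.0, 0.1, 0.9, 0.1]

sim_close = cosine_similarity(sold_from_family_1, sold_from_family_2)
sim_far = cosine_similarity(sold_from_family_1, railroad_travel)
print(round(sim_close, 2), round(sim_far, 2))
```

The two family-separation vectors score close to 1, while the family and railroad vectors score close to 0.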

Step 2 — Reduce Dimensions

768 numbers per chunk is too many to cluster directly: with that many dimensions, the distances between chunks become hard to tell apart.

UMAP compresses 768 dimensions → 5 dimensions

Like flattening a 3D globe onto a 2D map: some detail is lost, but the overall shape is preserved.

Step 3 — Clustering

HDBSCAN looks for dense regions in the compressed space.

Each dense region → one topic.

Chunks that don’t fit clearly anywhere → topic −1 (outliers)

In our sample: 59 topics, 18% outliers

Topic 0 is the first regular topic. Only topic −1 is the outlier.
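
Steps 2 and 3 can be sketched as a BERTopic configuration. Only the 5 target dimensions come from the description above; `n_neighbors`, `min_dist`, `min_cluster_size`, and the use of 42 as UMAP's `random_state` are assumed values, and the exercise's actual settings may differ. The second function uses only the standard library and shows how the topic count and outlier share can be read from the assignments.

```python
from collections import Counter

def build_topic_model(seed=42):
    """Wire UMAP and HDBSCAN into BERTopic (needs bertopic, umap-learn, hdbscan installed)."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # Step 2: compress each embedding to 5 dimensions (other values are assumptions).
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    # Step 3: look for dense regions; min_cluster_size is an assumed setting.
    hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

def summarize_topics(topic_ids):
    """Return (number of regular topics, share of chunks assigned to topic -1)."""
    counts = Counter(topic_ids)
    n_topics = len([t for t in counts if t != -1])
    outlier_share = counts[-1] / len(topic_ids) if topic_ids else 0.0
    return n_topics, outlier_share

# Invented assignments for ten chunks.
print(summarize_topics([0, 0, 1, -1, 2, -1, 0, 1, 1, 2]))  # (3, 0.2)
```

With the libraries installed, `build_topic_model().fit_transform(chunks, embeddings)` returns one topic id per chunk, and that list can be passed to `summarize_topics`.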

Step 4 — Topic Words

After clustering, which words are most distinctive for each group?

c-TF-IDF: which words appear much more often in this topic than in the rest of the collection?

Topic	Top words
0	whip, flog, overseer, tie, plantation
1	government, constitution, liberty, american, war
2	dice, aunt, mos, riverside
3	meeting, preach, prayer, spirit, pray
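
A simplified sketch of the c-TF-IDF idea, following the published formula: a word's count within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the word's count across all topics. The token lists are invented, and BERTopic's own implementation adds normalization that this sketch omits.

```python
import math
from collections import Counter

def c_tf_idf(topic_tokens):
    """topic_tokens maps topic id -> all tokens from the chunks in that topic."""
    tf = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)                      # f: count of each word across all topics
    avg_words = sum(total.values()) / len(tf)     # A: average number of words per topic
    return {
        topic: {w: c * math.log(1 + avg_words / total[w]) for w, c in counts.items()}
        for topic, counts in tf.items()
    }

# Invented token lists for two topics.
topic_tokens = {
    0: "whip flog overseer the the plantation whip".split(),
    1: "meeting preach prayer the the spirit pray".split(),
}
scores = c_tf_idf(topic_tokens)
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked[0], ranked[-1])  # whip the
```

The word "the" is frequent in both topics, so it is scored down; "whip" appears only in topic 0, so it ranks first there.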

Step 5 — Topic Labels

Labels built from top words alone read mechanically:

whip_flog_overseer_tie

We used llama3.1 — a local language model — to write clearer labels.

The model saw:

The topic number
The top words
A few representative passages from that topic

Output: Whipping and Plantation Punishment
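
A sketch of how the three inputs might be assembled into a prompt. The wording, the function name, and the passage text are hypothetical; the actual prompt sent to llama3.1 is not shown in these slides, and the call to the model itself is left out.

```python
def build_label_prompt(topic_id, top_words, passages, max_chars=300):
    """Combine topic number, top words, and short excerpts into one prompt string."""
    excerpts = "\n".join(f"- {p[:max_chars]}" for p in passages)
    return (
        f"Topic {topic_id}\n"
        f"Top words: {', '.join(top_words)}\n"
        f"Representative passages:\n{excerpts}\n"
        "Write a short, readable label for this topic."
    )

prompt = build_label_prompt(
    0,
    ["whip", "flog", "overseer", "tie"],
    ["An invented passage about an overseer and a whipping."],  # placeholder text
)
print(prompt)
```

Truncating each passage keeps the prompt short enough for a local model; the 300-character limit is an assumed value.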

Results

Example Topic Labels

Topic	Label	Chunks
0	Whipping and Plantation Punishment	76
1	Limitations of Liberty in the Post-War Era	66
2	Separation from Family and Emotional Trauma	55
3	Slave-led Christian Services and Meetings	45
4	Ranch Life and Conflict with Native Americans	44
5	Treatment by Slaveholders	43
6	Racial Identity and Prejudice	41

Nine Interactive Visualizations

Metadata charts

Topic shares by publication decade
Topic trends by decade (faceted)
Topics × decades heatmap
Documents × topics heatmap
Sample documents timeline

BERTopic charts

Topic prevalence area chart
Topic word bars (top words per topic)
Topic hierarchy
2D topic cluster map
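
The decade charts rest on a simple table: for each publication decade, the share of chunks in each topic. A minimal sketch, assuming one (publication year, topic id) pair per chunk and assuming outliers are left out. The rows are invented.

```python
from collections import Counter, defaultdict

def decade(year):
    """Map a year to the start of its decade, for example 1839 -> 1830."""
    return year // 10 * 10

def topic_shares_by_decade(rows):
    """rows: (publication_year, topic_id) pairs, one per chunk."""
    counts = defaultdict(Counter)
    for year, topic in rows:
        if topic != -1:                      # assumption: outliers are excluded
            counts[decade(year)][topic] += 1
    return {
        d: {t: n / sum(c.values()) for t, n in c.items()}
        for d, c in sorted(counts.items())
    }

# Invented rows for illustration.
rows = [(1839, 0), (1839, 0), (1837, 5), (1845, 0), (1907, 1), (1901, 1), (1839, -1)]
shares = topic_shares_by_decade(rows)
print(shares)
```

The decade is taken from the publication year, so these shares describe when texts were published, not when the events in them took place.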

What the Charts Suggest

1830s: dominated by abolitionist testimony — punishment, treatment, witness accounts
1900s–1910s: Booker T. Washington, Tuskegee, education, race leadership, public life
Ranch life topic: tied almost entirely to Nat Love’s narrative (one document)
Whipping topic: concentrated in American Slavery As It Is (1839)
Some topics span many documents; others are dominated by one long text
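
One way to check whether a topic is dominated by a single text is to compute, per topic, the share of chunks that come from its largest document. A sketch with invented assignments; the 0.8 threshold and the document ids are assumptions.

```python
from collections import Counter, defaultdict

def dominant_documents(assignments, threshold=0.8):
    """assignments: (document_id, topic_id) pairs, one per chunk.

    Returns {topic_id: (document_id, share)} for topics where one document
    supplies at least `threshold` of the chunks.
    """
    per_topic = defaultdict(Counter)
    for doc, topic in assignments:
        per_topic[topic][doc] += 1
    flagged = {}
    for topic, docs in per_topic.items():
        doc, n = docs.most_common(1)[0]
        share = n / sum(docs.values())
        if share >= threshold:
            flagged[topic] = (doc, share)
    return flagged

# Invented assignments: topic 4 is nearly all one document, topic 0 is spread out.
assignments = [("nat_love", 4)] * 9 + [("other", 4)] + [("a", 0), ("b", 0), ("c", 0)]
flagged = dominant_documents(assignments)
print(flagged)  # {4: ('nat_love', 0.9)}
```

A flagged topic may describe one narrative more than a theme shared across the collection, which matters when reading the decade charts.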

Human Interpretation

Where Judgment Enters

The computer organizes — people interpret. Decisions happen at every step:

Choosing which texts belong in the collection
Deciding what to remove during cleaning
Choosing chunk size and overlap
Deciding to use lemmatization
Choosing the embedding model and clustering settings
Inspecting outliers
Accepting, revising, merging, or rejecting topic labels
Interpreting visualizations in historical context

How to Revise a Label

Topic labels are suggestions, not final claims.

To revise a label:

Open topic_review_table.csv — inspect the top words
Open topic_assignments.csv — read several actual passages from that topic
Decide whether the label fits the passages
Write a clearer human label if needed

Revising a label does not move chunks. It only changes the description.
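
The point can be made concrete with a small sketch in which labels and assignments are kept as separate structures. The function name, the chunk ids, and the revised label text are hypothetical.

```python
def apply_label_revisions(labels, revisions):
    """Return a new label table with human revisions applied; unknown topics are rejected."""
    unknown = set(revisions) - set(labels)
    if unknown:
        raise KeyError(f"no such topic(s): {sorted(unknown)}")
    return {**labels, **revisions}

labels = {0: "Whipping and Plantation Punishment", 5: "Treatment by Slaveholders"}
assignments = [("chunk_0001", 0), ("chunk_0002", 5), ("chunk_0003", 5)]  # hypothetical ids

revised = apply_label_revisions(labels, {5: "Daily Treatment by Slaveholders"})
print(revised[5])
print(assignments)  # unchanged: the same chunks still belong to topic 5
```

Because the original table is left untouched, the model's suggestion and the human revision can be compared side by side.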

The Exercise

What You Will Do

Run the same pipeline on your own computer — free, local, no cloud:

Download the dataset and the repository
Install Python and Ollama
Run the pipeline on the 10% sample (5–15 minutes)
Generate the review table and charts
Open and interpret your results

What You Will Produce

CSV files

topic_labels_llm.csv — topic labels
topic_review_table.csv — labels, top words, years, documents
topic_assignments.csv — which passage → which topic
topic_info.csv — BERTopic internal statistics

HTML visualizations

4 BERTopic charts in visualizations/
5 metadata charts in metadata_visualizations/

Open any .html file by double-clicking — no internet needed.

Then: compare your results against the reference outputs on the website.

Questions to Think About

Reading the topics

Which label is easiest to understand?
Which label seems too vague?
Which label would you revise?
Do any topics look similar to each other?

Reading the charts

Which topics appear in earlier decades?
Which topics appear in later decades?
Which topics are dominated by one document?
Do the time charts show event dates or publication dates?

Full instructions, visualizations, and CSV downloads:

jinghanlib.github.io/topic-modeling-slave-narratives

Simple Explanation · Hands-On Exercise