Topic Modeling Slave Narratives

A computational approach to historical texts

What We Set Out to Do

  • What themes appear across the texts?
  • Which passages seem related to each other?
  • Do themes change across publication dates?
  • Can we make the computer’s topic names readable?

The Collection

North American Slave Narratives

  • 294 texts from Documenting the American South, University of North Carolina
  • First-person accounts, abolitionist testimonies, autobiographies
  • Published across more than a century: 1770s–1940s
  • Full texts in plain text and XML

For this exercise: a reproducible 10% sample — 29 documents, same seed each time.

Why Sample?

Full collection

  • 294 documents
  • An estimated 15,000+ chunks (extrapolated from the sample's rate)
  • Hours to embed and cluster
  • Needs a powerful machine

10% sample

  • 29 documents → 1,543 chunks
  • 5–15 minutes on a modern laptop
  • Same pipeline, same outputs
  • Reproducible with seed = 42
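
A fixed seed is what makes the sample repeatable. A minimal sketch, assuming the documents are identified by file names (the names below are placeholders, not the repository's actual files, and draw_sample is a hypothetical helper):

```python
import random

# Placeholder identifiers standing in for the 294 narrative files.
all_docs = [f"narrative_{i:03d}.txt" for i in range(294)]

def draw_sample(docs, fraction=0.10, seed=42):
    """Return a reproducible random sample of the documents."""
    k = int(len(docs) * fraction)      # 294 * 0.10 -> 29
    rng = random.Random(seed)          # private generator with a fixed seed
    return rng.sample(sorted(docs), k) # sort first so input order cannot matter

sample = draw_sample(all_docs)
print(len(sample))                     # 29
```

Calling draw_sample again with the same seed returns the same 29 names; changing the seed gives a different sample of the same size.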

Why Topic Modeling?

Two Approaches

Word counting

  • Counts how often each word appears
  • Treats every passage independently
  • Lists whip, flog, punishment separately
  • Answers: which words are most frequent?

Topic modeling

  • Groups passages by shared meaning
  • Recognizes related passages even with different words
  • Groups whip, flog, overseer, punishment as one theme
  • Answers: which passages discuss similar things?

Example — The Punishment Theme

A word count would list these words separately:

whip · flog · overseer · punishment · cruelty · tie · plantation

Topic modeling groups the passages that use these words together into one topic:

Topic 0: Whipping and Plantation Punishment
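
To see what "listed separately" means in practice, here is a small word-count sketch. The three passages are invented for illustration and are not quotations from the narratives:

```python
from collections import Counter

# Invented example passages (not taken from the collection).
passages = [
    "the overseer would whip any hand who fell behind",
    "they threatened to flog him before the whole plantation",
    "punishment came without warning or reason",
]

# Count every word across all passages, ignoring which passage it came from.
counts = Counter(word for p in passages for word in p.split())

for word in ["whip", "flog", "punishment"]:
    print(word, counts[word])
```

Each of the three words is counted once, as its own entry. Nothing in the counts says that the three passages describe the same theme; that grouping is what the topic model adds.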

Preparing the Texts

Cleaning — Every Choice Is a Research Decision

  • Remove image markers like [Cover Image]
  • Remove frontmatter words: PREFACE, CONTENTS, CHAPTER
  • Remove numbers and symbols
  • Remove common English stopwords: the, and, of, to
  • Remove collection-specific words: page, image, chapter

What we keep is as important as what we remove.
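
The cleaning rules above could be sketched as follows. This is an illustration, not the repository's actual code: the word lists are tiny subsets, and treating every bracketed span as an image marker is an assumption.

```python
import re

STOPWORDS = {"the", "and", "of", "to"}     # tiny subset for illustration
COLLECTION_WORDS = {"page", "image", "chapter", "preface", "contents"}
REMOVE = STOPWORDS | COLLECTION_WORDS

def clean(text):
    """Return the lowercased words that survive the cleaning rules."""
    text = re.sub(r"\[[^\]]*\]", " ", text)    # bracketed markers like [Cover Image]
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # numbers and symbols (also accented letters)
    return [w for w in text.lower().split() if w not in REMOVE]

print(clean("[Cover Image] CHAPTER 1. The sale of the family, 1839."))
# ['sale', 'family']
```

Every set and pattern here is a decision: adding a word to REMOVE makes it invisible to every later step.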

Chunking

Long narratives contain many different themes. Giving a whole narrative to the model as one unit loses that internal variation.

We divided each text into chunks:

  • About 750 words per chunk
  • Small overlap between neighboring chunks, so a passage that straddles a boundary still appears whole in one chunk
  • 29 documents → 1,543 chunks

The topic model grouped these chunks, not whole narratives.
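
A minimal sketch of overlapping chunks. The 750-word size comes from the slide above; the 50-word overlap is an assumed value, since the slides say only that the overlap is small:

```python
def chunk_words(words, size=750, overlap=50):
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):     # this window already reached the end
            break
    return chunks

words = [f"w{i}" for i in range(2000)]     # stand-in for a 2,000-word narrative
chunks = chunk_words(words)
print([len(c) for c in chunks])            # [750, 750, 600]
```

The last 50 words of each chunk are repeated as the first 50 words of the next one, and the final chunk is simply shorter.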

Tokenization and Lemmatization

These two steps standardize how words are represented:

Tokenization — breaking text into individual units

  • Master and master are treated as different words unless the text is lowercased first
  • Hyphenated or contracted words may split unexpectedly

Lemmatization — reducing variants to a shared base form

whip, whipped, whipping  →  whip
slaves                   →  slave
preached, preaching      →  preach
slaveholders             →  slaveholder

Both are choices made by the researcher.
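
A toy sketch of both steps. The lookup table covers only the examples above; a real pipeline would more likely rely on a lemmatizer from a library such as spaCy or NLTK, and nothing here is claimed to match the repository's code:

```python
import re

# Hand-written lookup for the examples above. A real lemmatizer uses a
# dictionary and part-of-speech information instead.
LEMMAS = {
    "whipped": "whip", "whipping": "whip",
    "slaves": "slave",
    "preached": "preach", "preaching": "preach",
    "slaveholders": "slaveholder",
}

def tokenize(text):
    """Lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(tokens):
    """Replace each token with its base form when the lookup knows one."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("Master preached on Sunday; the slaves weren't there.")
print(tokens)
print(lemmatize(tokens))
```

Lowercasing merges Master with master, and the contraction weren't splits into weren and t. Both effects follow from the tokenizer rule chosen here, which is the kind of decision the slide refers to.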

The Pipeline

Pipeline Overview

Raw texts → Clean & chunk → Embed → Reduce dims → Cluster → Topic words → Label → Results

Step 1 — Embeddings

An embedding is a list of 768 numbers representing the meaning of a passage.

Passages with similar meanings get similar numbers — even when they use different words.

Two passages about being sold away from family → similar number lists

One passage about plantation labor and one about railroad travel → very different number lists

BERTopic uses that structure to find groups of passages that cluster together.

We used nomic-embed-text, which runs locally: no paid API, and no internet connection needed once the model has been downloaded.
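
"Similar numbers" is usually measured with cosine similarity. A sketch with invented three-number vectors standing in for real 768-number embeddings (the values are made up and do not come from nomic-embed-text):

```python
import math

def cosine(a, b):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, three numbers each instead of 768.
sold_from_family_1 = [0.9, 0.1, 0.0]
sold_from_family_2 = [0.8, 0.2, 0.1]
railroad_travel = [0.0, 0.2, 0.9]

print(cosine(sold_from_family_1, sold_from_family_2))  # close to 1
print(cosine(sold_from_family_1, railroad_travel))     # close to 0
```

The two family-separation vectors point in nearly the same direction, while the railroad vector points elsewhere.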

Step 2 — Reduce Dimensions

Clustering 768 numbers per chunk directly works poorly: in that many dimensions, distances between passages become too alike for dense groups to stand out.

UMAP compresses 768 dimensions → 5 dimensions

Like flattening a 3D globe onto a 2D map: some detail is lost, but the overall shape is preserved.
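
A sketch of how this step might be configured with the umap-learn package. Only the target of 5 dimensions comes from the slide; every other setting is an assumption chosen for illustration, and make_reducer is a hypothetical helper name:

```python
# Settings for the dimension-reduction step. Only n_components=5 comes from
# the slide above; the other values are assumptions for this sketch.
UMAP_SETTINGS = {
    "n_components": 5,     # 768 dimensions -> 5 dimensions
    "n_neighbors": 15,     # assumed: how many nearby chunks shape the local layout
    "min_dist": 0.0,       # assumed: lets similar chunks pack tightly for clustering
    "metric": "cosine",    # assumed: compare embeddings by angle
    "random_state": 42,    # assumed: fixed seed so reruns give the same layout
}

def make_reducer(settings=UMAP_SETTINGS):
    """Build a UMAP reducer; requires the third-party umap-learn package."""
    from umap import UMAP  # imported here so the settings can be read without it
    return UMAP(**settings)

# Usage, given `embeddings` as an array of shape (n_chunks, 768):
#   reduced = make_reducer().fit_transform(embeddings)   # shape (n_chunks, 5)
```

Smaller n_neighbors values emphasize fine local structure and tend to yield more, smaller groups; larger values emphasize the overall shape.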

Step 3 — Clustering

HDBSCAN looks for dense regions in the compressed space.

Each dense region → one topic.

Chunks that don’t fit clearly anywhere → topic −1 (outliers)

In our sample: 59 topics, 18% outliers

Topic 0 is the first regular topic. Only topic −1 holds the outliers.
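
Given one topic number per chunk, the two summary figures can be computed as below. The assignments are invented for illustration; the real run has 1,543 of them:

```python
from collections import Counter

# Invented assignments: one topic number per chunk, -1 marks an outlier.
assignments = [0, 0, 1, -1, 2, 0, 1, -1, 2, 0]

sizes = Counter(assignments)
n_topics = len([t for t in sizes if t != -1])    # count regular topics only
outlier_share = sizes[-1] / len(assignments)     # fraction of chunks in topic -1

print(n_topics)                   # 3
print(f"{outlier_share:.0%}")     # 20%
```

Topic −1 is excluded from the topic count but included in the denominator of the outlier share.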

Step 4 — Topic Words

After clustering, which words are most distinctive for each group?

c-TF-IDF: which words appear much more often in this topic than in the rest of the collection?

Topic  Top words
0      whip, flog, overseer, tie, plantation
1      government, constitution, liberty, american, war
2      dice, aunt, mos, riverside
3      meeting, preach, prayer, spirit, pray
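
A simplified sketch of the c-TF-IDF idea: treat all chunks of a topic as one long document, then weight each word by how frequent it is inside the topic and how rare it is overall. It follows the commonly cited form, term frequency × log(1 + A / f), where A is the average number of words per topic and f is the word's total frequency. The token lists are invented, and BERTopic's own implementation may differ in detail:

```python
import math
from collections import Counter

def c_tf_idf(topic_tokens):
    """topic_tokens maps topic id -> list of all tokens in that topic's chunks."""
    tf = {t: Counter(tokens) for t, tokens in topic_tokens.items()}
    total = Counter()                           # word frequency across all topics
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)   # average number of words per topic
    scores = {}
    for t, counts in tf.items():
        n = sum(counts.values())
        scores[t] = {w: (c / n) * math.log(1 + avg_words / total[w])
                     for w, c in counts.items()}
    return scores

# Invented token lists; "day" appears in both topics.
topic_tokens = {
    0: "whip overseer whip plantation day".split(),
    1: "liberty war liberty government day".split(),
}
scores = c_tf_idf(topic_tokens)
for t, s in scores.items():
    print(t, max(s, key=s.get), min(s, key=s.get))
```

In each topic the repeated, topic-specific word scores highest, while day, which appears in both topics, scores lowest.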

Step 5 — Topic Labels

A label built from top words alone reads mechanically:

whip_flog_overseer_tie

We used llama3.1 — a local language model — to write clearer labels.

The model saw:

  • The topic number
  • The top words
  • A few representative passages from that topic

Output: Whipping and Plantation Punishment
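
A sketch of how the three inputs might be assembled into a prompt. The wording of the instruction and the function name are hypothetical, since the slides do not show the actual prompt, and the step that sends the text to llama3.1 is left out:

```python
def build_label_prompt(topic_id, top_words, passages):
    """Assemble the text a local model would receive for one topic."""
    lines = [
        f"Topic number: {topic_id}",
        "Top words: " + ", ".join(top_words),
        "Representative passages:",
    ]
    lines += [f"- {p}" for p in passages]
    # Hypothetical instruction; the real prompt is not given in the slides.
    lines.append("Write a short, readable label for this topic.")
    return "\n".join(lines)

prompt = build_label_prompt(
    0,
    ["whip", "flog", "overseer", "tie", "plantation"],
    ["<representative passage 1>", "<representative passage 2>"],
)
print(prompt)
```

Because the model sees passages as well as words, its label can reflect context that the top words alone do not carry.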

Results

Example Topic Labels

Topic  Label                                           Chunks
0      Whipping and Plantation Punishment              76
1      Limitations of Liberty in the Post-War Era      66
2      Separation from Family and Emotional Trauma     55
3      Slave-led Christian Services and Meetings       45
4      Ranch Life and Conflict with Native Americans   44
5      Treatment by Slaveholders                       43
6      Racial Identity and Prejudice                   41

Nine Interactive Visualizations

Metadata charts

  • Topic shares by publication decade
  • Topic trends by decade (faceted)
  • Topics × decades heatmap
  • Documents × topics heatmap
  • Sample documents timeline

BERTopic charts

  • Topic prevalence area chart
  • Topic word bars (top words per topic)
  • Topic hierarchy
  • 2D topic cluster map
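
The decade-based charts listed above rest on a simple grouping: each chunk inherits the publication year of its source document, and years are rounded down to decades. A sketch with invented rows:

```python
from collections import Counter, defaultdict

# Invented rows: (publication year of the source document, topic of the chunk).
chunks = [(1839, 0), (1839, 0), (1839, 5), (1901, 1), (1907, 1), (1907, 6)]

by_decade = defaultdict(Counter)
for year, topic in chunks:
    decade = year // 10 * 10           # 1839 -> 1830, 1907 -> 1900
    by_decade[decade][topic] += 1

for decade in sorted(by_decade):
    total = sum(by_decade[decade].values())
    shares = {t: round(n / total, 2) for t, n in sorted(by_decade[decade].items())}
    print(f"{decade}s", shares)
```

The year used is the publication year, so a chart built this way shows when narratives were published, not when the events they describe took place.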

What the Charts Suggest

  • 1830s: dominated by abolitionist testimony — punishment, treatment, witness accounts
  • 1900s–1910s: Booker T. Washington, Tuskegee, education, race leadership, public life
  • Ranch life topic: tied almost entirely to Nat Love’s narrative (one document)
  • Whipping topic: concentrated in American Slavery As It Is (1839)
  • Some topics span many documents; others are dominated by one long text
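
One way to check whether a topic is dominated by a single text is to compute the share of its chunks that come from the largest contributing document. A sketch with invented rows and placeholder document names:

```python
from collections import Counter

# Invented rows: (topic, source document) for each chunk.
rows = [(4, "doc_a"), (4, "doc_a"), (4, "doc_a"), (4, "doc_b"),
        (3, "doc_a"), (3, "doc_b"), (3, "doc_c"), (3, "doc_d")]

def top_document_share(rows, topic):
    """Return (document, share) for the document contributing most chunks to a topic."""
    docs = Counter(doc for t, doc in rows if t == topic)
    doc, n = docs.most_common(1)[0]
    return doc, n / sum(docs.values())

print(top_document_share(rows, 4))     # ('doc_a', 0.75)
print(top_document_share(rows, 3)[1])  # 0.25
```

A share near 1 suggests the topic describes one book rather than a theme of the collection; a low share suggests the theme recurs across authors.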

Human Interpretation

Where Judgment Enters

The computer organizes — people interpret. Decisions happen at every step:

  • Choosing which texts belong in the collection
  • Deciding what to remove during cleaning
  • Choosing chunk size and overlap
  • Deciding to use lemmatization
  • Choosing the embedding model and clustering settings
  • Inspecting outliers
  • Accepting, revising, merging, or rejecting topic labels
  • Interpreting visualizations in historical context

How to Revise a Label

Topic labels are suggestions, not final claims.

To revise a label:

  1. Open topic_review_table.csv — inspect the top words
  2. Open topic_assignments.csv — read several actual passages from that topic
  3. Decide whether the label fits the passages
  4. Write a clearer human label if needed

Revising a label does not move chunks. It only changes the description.
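
A sketch of recording a revised label next to the model's label. The column names are assumptions about topic_review_table.csv, and the revised label is only an illustration of what a reader might write:

```python
import csv
import io

# Stand-in for topic_review_table.csv; the column names are assumptions.
review_csv = (
    "topic,label,top_words\n"
    "0,Whipping and Plantation Punishment,whip flog overseer tie plantation\n"
    "3,Slave-led Christian Services and Meetings,meeting preach prayer spirit pray\n"
)

# Illustrative revision chosen by a human reader: topic number -> new label.
human_labels = {3: "Prayer Meetings and Preaching"}

rows = list(csv.DictReader(io.StringIO(review_csv)))
for row in rows:
    # Keep the model's label and add the human one beside it.
    row["human_label"] = human_labels.get(int(row["topic"]), row["label"])

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["topic", "label", "human_label", "top_words"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Only the description changes. The topic numbers, and therefore the chunk assignments, stay exactly as the model produced them.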

The Exercise

What You Will Do

Run the same pipeline on your own computer — free, local, no cloud:

  1. Download the dataset and the repository
  2. Install Python and Ollama
  3. Run the pipeline on the 10% sample (5–15 minutes)
  4. Generate the review table and charts
  5. Open and interpret your results

What You Will Produce

CSV files

  • topic_labels_llm.csv — topic labels
  • topic_review_table.csv — labels, top words, years, documents
  • topic_assignments.csv — which passage → which topic
  • topic_info.csv — BERTopic internal statistics

HTML visualizations

  • 4 BERTopic charts in visualizations/
  • 5 metadata charts in metadata_visualizations/

Open any .html file by double-clicking — no internet needed.

Then: compare your results against the reference outputs on the website.

Questions to Think About

Reading the topics

  • Which label is easiest to understand?
  • Which label seems too vague?
  • Which label would you revise?
  • Do any topics look similar to each other?

Reading the charts

  • Which topics appear in earlier decades?
  • Which topics appear in later decades?
  • Which topics are dominated by one document?
  • Do the time charts show event dates or publication dates?

Full instructions, visualizations, and CSV downloads:

jinghanlib.github.io/topic-modeling-slave-narratives
