Simple Explanation

Topic modeling, preprocessing, local models, and interpretation

Project Overview

We wanted to look across a group of slave narratives and ask:

  • What themes appear across the texts?
  • Which passages seem related to each other?
  • Do themes change across publication dates?
  • Can we make the computer’s topic names easier for people to understand?

Instead of reading every document one by one, we used topic modeling. Topic modeling is a way of asking a computer to group passages that seem to discuss similar themes.

This does not replace historical interpretation. It helps us find patterns that we can investigate more carefully.

Data and Sampling

The Collection

The full collection has 294 text files.

For testing, we used a reproducible 10% sample:

29 documents
sample fraction = 10%
random seed = 42

The random seed matters because it makes the sample repeatable. If another person uses the same data, same code, and same seed, they should get the same 29 documents.
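
Here is a minimal sketch of how a seeded sample can be drawn with Python's standard library. The function name and the file names are placeholders. The project's own sampling code is not shown in this explanation, so this sketch illustrates repeatability only and will not necessarily pick the same 29 files.

```python
import random

def draw_sample(file_names, fraction=0.10, seed=42):
    """Return a reproducible random sample of file names."""
    ordered = sorted(file_names)          # fixed order, so the seed alone decides the result
    k = round(len(ordered) * fraction)    # 294 * 0.10 = 29.4, which rounds to 29
    return sorted(random.Random(seed).sample(ordered, k))

# Placeholder names standing in for the 294 real files.
all_files = [f"narrative-{i:03d}.xml" for i in range(294)]
sample_a = draw_sample(all_files)
sample_b = draw_sample(all_files)
print(len(sample_a), sample_a == sample_b)  # 29 True
```

Running the function twice with the same seed gives the same list, which is what makes the sample repeatable.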

Which 10% of the Data We Used

The 29 sampled files were:

fpn-bruce-bruce.xml
fpn-hughes-hughes.xml
fpn-jackson-jackson.xml
nc-jones85-jones85.xml
neh-aaron-aaron.xml
neh-arter-arter.xml
neh-boxbrown-boxbrown.xml
neh-bragg-bragg.xml
neh-brownj-brownj.xml
neh-bruner-bruner.xml
neh-charlton-charlton.xml
neh-detroit-detroit.xml
neh-hammon-hammon.xml
neh-hammond-hammond.xml
neh-henson58-henson58.xml
neh-hopper-hopper.xml
neh-iwilliams-iwilliams.xml
neh-jeter-jeter.xml
neh-mars-mars.xml
neh-natlove-natlove.xml
neh-nicholson-nicholson.xml
neh-oneal-oneal.xml
neh-robinsonn-robinson.xml
neh-scott-scott.xml
neh-stedman-stedman.xml
neh-troy-troy.xml
neh-webb-webb.xml
neh-weld-weld.xml
neh-williams-williams.xml

Data Cleaning and Preprocessing

Why We Cleaned the Texts

The texts include useful narrative material, but they also include things that can confuse topic modeling, such as image labels, title-page text, chapter headings, and repeated formatting words.

We cleaned the text by:

  • removing image markers like [Cover Image]
  • removing some frontmatter words like PREFACE, CONTENTS, and CHAPTER
  • removing numbers and extra symbols
  • removing common English stopwords such as the, and, of, and to
  • removing collection-specific stopwords such as page, image, and chapter

Cleaning is not neutral. A researcher decides what counts as useful text and what counts as distracting text. Those choices affect the results.
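
The cleaning steps listed above can be sketched roughly as follows. This is an illustration, not the project's actual code: the stopword sets are tiny made-up subsets, and removing every bracketed marker is an assumption that goes beyond the single [Cover Image] example.

```python
import re

# Tiny illustrative subsets; a real run would use much longer lists.
STOPWORDS = {"the", "and", "of", "to"}
COLLECTION_STOPWORDS = {"page", "image", "chapter", "preface", "contents"}

def clean_text(text):
    """Return the lowercase words that survive the cleaning steps."""
    text = re.sub(r"\[[^\]]*\]", " ", text)        # drop markers such as [Cover Image]
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop numbers and symbols
    return [
        w for w in text.split()
        if w not in STOPWORDS and w not in COLLECTION_STOPWORDS
    ]

cleaned = clean_text("[Cover Image] CHAPTER 1. The overseer came to the field.")
print(cleaned)  # ['overseer', 'came', 'field']
```

Each line in the function is a choice. For example, this sketch also splits hyphenated words, which a researcher might or might not want.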

Chunks

Many of the narratives are long. If we give an entire narrative to the topic model as a single unit, the model may miss smaller themes within it.

So we divided each text into smaller pieces called chunks.

A chunk is a passage of text. In this project, each chunk was about 750 words, with a small overlap between neighboring chunks.

The overlap helps because an important passage might fall near the edge of a chunk. Overlap gives the model a better chance to keep nearby context together.

For the 10% sample, the 29 documents became:

1,543 chunks

The topic model grouped these chunks, not whole narratives.
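
A minimal sketch of chunking with overlap. The chunk size of 750 words comes from the description above. The overlap of 50 words is an assumed value, since the text only says the overlap was small, and the function name is hypothetical.

```python
def chunk_words(words, size=750, overlap=50):
    """Split a list of words into chunks of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):   # this chunk reached the end of the text
            break
    return chunks

words = [f"w{i}" for i in range(2000)]   # stand-in for a 2,000-word narrative
chunks = chunk_words(words)
print([len(c) for c in chunks])          # [750, 750, 600]
```

The last 50 words of one chunk are the first 50 words of the next, so a passage near a boundary appears whole in at least one chunk.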

Tokenization and Lemmatization

Before the topic model can find important words for each topic, it needs to standardize how words are represented. This happens in two steps.

Tokenization is how the computer breaks text into individual units. Tokens are usually words, but not always. A computer may treat master and Master as different words because of the capital letter, split a hyphenated word like well-known into two pieces, or handle contractions and older spellings in unexpected ways.

Lemmatization then reduces word variants to a shared base form:

whip, whipped, whipping -> whip
slaves -> slave
slaveholders -> slaveholder
preached, preaching -> preach

Without lemmatization, the model might identify whip, whipped, and whipping as separate signals. Reducing them to whip makes the topic words easier to read and the topics easier to interpret.
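
The two steps can be sketched with a toy example. The lookup table below only covers the word forms listed above. A real project would normally use a lemmatizer from an NLP library, and this explanation does not say which one was used, so treat the code as an illustration of the idea.

```python
import re

# Toy lookup table built from the examples above; not a real lemmatizer.
LEMMAS = {
    "whipped": "whip", "whipping": "whip",
    "slaves": "slave",
    "slaveholders": "slaveholder",
    "preached": "preach", "preaching": "preach",
}

def tokenize(text):
    """Lowercase the text and keep hyphenated words together as one token."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def lemmatize(tokens):
    """Replace each token with its base form when the table knows one."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = tokenize("The Master whipped the slaves; well-known slaveholders preached.")
lemmas = lemmatize(tokens)
print(lemmas)
# ['the', 'master', 'whip', 'the', 'slave', 'well-known', 'slaveholder', 'preach']
```

Here the tokenizer makes two of the choices described above: it treats Master and master as the same word, and it keeps well-known as a single token.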

Topic Modeling Compared With Word Counting

Students often ask: why use topic modeling instead of just counting words?

Word counting answers the question: which words appear most often? That is useful, but a word count treats each word as a separate item. It cannot recognize that two passages discuss the same theme unless they share the same words.

Topic modeling asks a different question: which passages seem to be about similar themes? It groups passages together based on meaning, even when different passages use different vocabulary.

For example, a topic model can group passages about physical punishment even if they use different words — whip, flog, overseer, punishment, cruelty. A word count would list those words separately. Topic modeling tries to recognize that they point to a shared theme.

In our results, that shared theme became Whipping and Plantation Punishment.

BERTopic still uses words when describing topics. After grouping similar chunks together, it identifies which words are especially distinctive within each group. But the grouping itself is based on meaning, not word frequency.
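
A small sketch of the limit of word counting. The two sentences below are invented for illustration and are not quotations from the narratives. Both describe punishment, yet a word count finds no shared vocabulary between them.

```python
from collections import Counter

STOPWORDS = {"the", "and", "him", "she", "was", "with", "a", "to"}

def content_words(passage):
    """Lowercase a passage and drop a few common words."""
    return [w for w in passage.lower().split() if w not in STOPWORDS]

# Invented example sentences, not quotations.
passage_a = "The overseer tied him to a post and used the whip"
passage_b = "She was flogged with great cruelty"

counts = Counter(content_words(passage_a) + content_words(passage_b))
shared = set(content_words(passage_a)) & set(content_words(passage_b))
print(counts.most_common(3))
print(shared)  # set()
```

Every word is counted once and nothing links the two passages. An embedding-based method is meant to place passages like these near each other anyway.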

The BERTopic Pipeline

The pipeline has five steps. Each step feeds the next.

Step | What happens | Tool
1 | Embed: convert each passage into 768 numbers representing its meaning | nomic-embed-text
2 | Reduce dimensions: compress 768 numbers down to 5 | UMAP
3 | Cluster: group similar passages into topics | HDBSCAN
4 | Find topic words: identify the most distinctive words per topic | c-TF-IDF
5 | Generate labels: write a human-readable name for each topic | llama3.1
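
Steps 2 to 4 can be wired together in BERTopic roughly as sketched here. This is a sketch under assumptions, not the project's actual configuration. Only the 5 output dimensions come from the table above; every other setting is an assumed value, and fit_topics is a hypothetical helper name. The sketch assumes the chunk embeddings from step 1 were computed beforehand and are passed in as a NumPy array with one row of 768 numbers per chunk.

```python
# Only n_components=5 comes from the pipeline table; the other values are assumptions.
SETTINGS = {
    "umap": {"n_neighbors": 15, "n_components": 5, "min_dist": 0.0,
             "metric": "cosine", "random_state": 42},
    "hdbscan": {"min_cluster_size": 10, "metric": "euclidean",
                "prediction_data": True},
}

def fit_topics(lemmatized_chunks, embeddings):
    """Run steps 2-4 on chunks whose embeddings (step 1) already exist."""
    # Imported inside the function so the settings can be read
    # even where the libraries are not installed.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    topic_model = BERTopic(
        umap_model=UMAP(**SETTINGS["umap"]),           # step 2: reduce dimensions
        hdbscan_model=HDBSCAN(**SETTINGS["hdbscan"]),  # step 3: cluster
    )
    # Step 4 (c-TF-IDF topic words) runs inside fit_transform on the text passed in.
    topics, _ = topic_model.fit_transform(lemmatized_chunks, embeddings=embeddings)
    return topic_model, topics
```

Passing precomputed embeddings lets the embeddings come from one version of the text while the topic words are computed from the cleaned, lemmatized version. Step 5, the LLM label, happens after this function returns.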

Step 1: Turn Text Into Embeddings

An embedding is a list of numbers that represents the meaning of a passage. Every chunk gets converted into its own list — in our case, 768 numbers.

The key insight is how those numbers are assigned. The embedding model was trained on enormous amounts of text, and it learned to place passages with similar meanings close together in that 768-dimensional space, and passages with different meanings farther apart. Two passages that both describe being sold away from family will end up with similar number lists, even if they share few of the same words. Two passages about completely different subjects — say, plantation labor and railroad travel — will end up with very different number lists.

The computer does not understand language the way a reader does, but the embedding model translates language into a form where closeness means relatedness. BERTopic uses that structure to find groups of passages that cluster together.
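
The idea that closeness means relatedness can be shown with a toy calculation. The three short lists below are made-up numbers, not real embeddings, and cosine similarity is used here as one common way to measure closeness. This explanation does not state which measure the project relied on.

```python
import math

def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point in a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-number "embeddings"; real ones from nomic-embed-text have 768 numbers.
sold_from_family_1 = [0.9, 0.1, 0.2]
sold_from_family_2 = [0.8, 0.2, 0.1]
railroad_travel = [0.1, 0.9, 0.3]

close = cosine_similarity(sold_from_family_1, sold_from_family_2)
far = cosine_similarity(sold_from_family_1, railroad_travel)
print(round(close, 2), round(far, 2))
```

The two passages on the same subject score much higher than the pair on different subjects, which is the structure the clustering step relies on.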

In our project, we used a local embedding model through Ollama:

nomic-embed-text

This model runs entirely on your own computer, requires no paid API, and is designed specifically for producing embeddings from text. It is the same kind of model used in larger commercial systems, but it can run locally without a GPU.

Step 2: Reduce the Dimensions

Each embedding is a list of 768 numbers. Clustering works poorly in that many dimensions, because distances between points become less informative as the number of dimensions grows.

BERTopic uses a method called UMAP to compress those numbers into a much smaller set (5 in our pipeline, which is also BERTopic's default) while trying to keep track of which passages are similar to each other. Think of it as flattening a globe into a flat map: some detail is lost, but the overall shape is preserved.

Step 3: Cluster Similar Chunks

After the embeddings are compressed, BERTopic groups similar chunks together using a clustering method called HDBSCAN. It looks for dense regions in the compressed space and treats each region as a topic.

Chunks that do not fit clearly into any region become outliers, assigned to topic -1.

Step 4: Find Important Words for Each Topic

After chunks are grouped into topics, BERTopic looks for words that are especially important in each topic using a scoring method called c-TF-IDF. In short: which words appear much more often in this topic than in the rest of the collection?

We used cleaned and lemmatized text for this step so the topic words would be easier to read.
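
The scoring idea can be sketched in a few lines. The formula follows the c-TF-IDF definition as described in BERTopic's documentation: a word's share within the topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is how often the word appears across all topics. The word lists are toy data and the function is a simplified stand-in for the library's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(topic_tokens):
    """topic_tokens maps a topic id to the list of all words in that topic's chunks."""
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}
    totals = Counter()
    for topic_counts in counts.values():
        totals.update(topic_counts)
    avg_words = sum(totals.values()) / len(counts)   # A: average words per topic
    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (n / n_words) * math.log(1 + avg_words / totals[word])
            for word, n in topic_counts.items()
        }
    return scores

# Toy data, not project output.
toy_topics = {
    0: ["whip", "overseer", "slavery", "whip", "flog", "slavery"],
    1: ["prayer", "preach", "slavery", "meeting", "slavery", "pray"],
}
scores = c_tf_idf(toy_topics)
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked[0])  # whip
```

In topic 0, overseer appears once and slavery appears twice, yet overseer scores higher because slavery is spread across both topics.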

Step 5: Give Each Topic a Human-Readable Label

BERTopic can produce topic names from top words, but those names are often mechanical.

For example:

whip_flog_overseer_tie

So we used a local LLM through Ollama:

llama3.1

The LLM looked at:

  • the topic number
  • the top words for that topic
  • a few representative passages from that topic

Then it wrote a clearer label, such as:

Whipping and Plantation Punishment

The label is a draft, not a final answer. A different LLM, or even the same model run again, might write a different label for the same topic. One model might call a cluster “Whipping and Plantation Punishment” while another calls it “Physical Violence Under Slavery” or “Overseer Brutality and Forced Labor.” The top words and the passages do not change, but the phrasing does.

This is where human interpretation becomes essential: a researcher who reads the top words and a few representative passages can judge whether the label is accurate, too broad, too narrow, or missing the historical point entirely.
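
A sketch of how the three inputs might be assembled into a prompt for the LLM. The wording of the prompt and the function name are hypothetical, since the project's actual prompt is not shown here, and the passages are placeholders. The sketch stops before the model is called.

```python
def build_label_prompt(topic_id, top_words, passages, max_passages=3):
    """Combine topic number, top words, and sample passages into one prompt string."""
    lines = [
        f"Topic number: {topic_id}",
        "Top words: " + ", ".join(top_words),
        "Representative passages:",
    ]
    lines += [f"- {passage}" for passage in passages[:max_passages]]
    # Hypothetical instruction; the real prompt may be worded differently.
    lines.append("Write a short, descriptive label (at most eight words) for this topic.")
    return "\n".join(lines)

prompt = build_label_prompt(
    0,
    ["whip", "flog", "overseer", "tie"],
    ["<representative passage 1>", "<representative passage 2>"],
)
print(prompt)
```

Keeping the prompt in one function makes it easy to rerun the labeling with a different model or different wording and compare the drafts.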

Topic Words, Labels, and Human Interpretation

Topic words are important evidence. They are shown in:

  • topic_info.csv
  • topic_labels_llm.csv
  • topic_review_table.csv

Students should use these words to check whether a topic label makes sense.

Here is a small preview of the topic review table:

Topic | Label | Chunks | Top Words
0 | Whipping and Plantation Punishment | 76 | whip, flog, overseer, tie, plantation
1 | Limitations of Liberty in the Post-War Era | 66 | government, constitution, liberty, american, war
2 | Separation and Reunion | 55 | dice, aunt dice, aunt, mos, riverside
3 | Plantation Christianity and Slave Spirituality | 45 | meeting, preach, prayer, spirit, pray

Students can download the CSV files from the links below. If a file opens in the browser instead of downloading, use File > Save Page As or right-click the link and choose Save Link As.

  • Download topic review table
  • Download topic labels
  • Download topic info
  • Download topic assignments

The LLM label is a draft interpretation. It can be replaced or improved by a human reader.

This is a normal part of topic modeling: the computer suggests patterns, and people interpret whether the labels are historically and textually accurate.

Outliers

Some chunks do not fit clearly into any topic. BERTopic puts those chunks into an outlier group.

In BERTopic, the outlier topic is always:

topic -1

This is important: topic 0 is not the outlier. Topic 0 is the first regular topic.

In our final sample run, BERTopic found:

59 non-outlier topics
277 outlier chunks
18.0% outliers

An outlier is not necessarily bad. It can mean the passage is unusual, mixed, short on shared vocabulary, or not similar enough to other chunks.
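
The outlier percentage can be checked from the topic assignments. This short sketch rebuilds the reported counts. The non-outlier chunks are all lumped under topic 0 as a placeholder, because only the totals matter for this arithmetic.

```python
def outlier_share(topic_ids):
    """Fraction of chunks assigned to BERTopic's outlier topic, -1."""
    return sum(1 for topic in topic_ids if topic == -1) / len(topic_ids)

# Reported counts: 277 outlier chunks out of 1,543 chunks in total.
# The remaining 1,266 chunks are placed in topic 0 purely as a placeholder.
topic_ids = [-1] * 277 + [0] * (1543 - 277)
share = outlier_share(topic_ids)
print(f"{share:.1%}")  # 18.0%
```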

Example Results

Some topic labels from this sample:

  • Whipping and Plantation Punishment
  • Limitations of Liberty in the Post-War Era
  • Separation and Reunion
  • Plantation Christianity and Slave Spirituality
  • Life on the Range
  • Life in Missouri and Experiences with Plantation Owners
  • Divine Providence and Racial Identity
  • Racial Identity and Prejudice
  • Finding Freedom and Employment in the North
  • Whipping and Plantation Punishment by Female Supervisors
  • Traveling by Railroad
  • Booker Washington’s Leadership and National Organization
  • Negro Slave Markets and Sales
  • Food Rations and Hunger
  • Joanna’s Relationship with Captain Stedman

Note that similar labels — such as multiple “Whipping and Plantation Punishment” topics — can appear in the same run. These are distinct clusters that the model kept separate because the passages within them differ in vocabulary or context, even though they share a broad theme. This is expected behavior, not an error. A human reader can inspect the top words and representative passages to understand what distinguishes them.

Reading the Charts

All charts are interactive — hover over bars and cells to see exact values, and click legend items to show or hide topics. Each chart can also be downloaded as a standalone HTML file from the GitHub repository under outputs/bertopic_sample_nomic_sensitive_lemmatized/. Open any .html file directly in a browser to explore it.

Topic Shares by Publication Decade

This chart shows, for each publication decade, how the chunks published in that decade are divided among the top 8 topics. Each bar represents one topic’s share of all modeled chunks published in that decade.

What it suggests in this sample:

  • The 1830s are strongly shaped by topics from abolitionist testimony, especially physical punishment and treatment by slaveholders.
  • The 1900s and 1910s show more topics connected to Booker T. Washington, Tuskegee, education, race leadership, churches, and public life.
  • The ranch-life topic appears mainly in the early 1900s because it is strongly connected to Nat Love’s narrative.
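
The numbers behind a chart like this can be sketched as follows. The records are toy data, and the sketch assumes one (publication year, topic) pair per chunk. Whether outlier chunks are counted in the denominator is a choice this explanation does not specify.

```python
from collections import Counter, defaultdict

def topic_shares_by_decade(records):
    """records: iterable of (publication_year, topic_id) pairs, one per chunk."""
    per_decade = defaultdict(Counter)
    for year, topic in records:
        per_decade[year // 10 * 10][topic] += 1   # 1839 falls in the 1830s
    return {
        decade: {topic: n / sum(counts.values()) for topic, n in counts.items()}
        for decade, counts in per_decade.items()
    }

# Toy records, not project data.
records = [(1839, 0), (1839, 0), (1839, 0), (1837, 3), (1907, 4), (1907, 4)]
shares = topic_shares_by_decade(records)
print(shares[1830])  # {0: 0.75, 3: 0.25}
```

A share of 1.0 in a decade, as for topic 4 in the toy 1900s, is a sign that one document may be driving the pattern.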

Topic-by-Decade Heatmap

This heatmap shows which topics concentrate in which publication decades. Darker cells mean more chunks from that topic appear in that decade. It shows concentration at a glance across all top topics simultaneously.

What it suggests in this sample:

  • Some topics are concentrated in one document or decade, not spread evenly across the whole sample.
  • “Ranch Life and Conflict with Native Americans” is concentrated around the 1900s.
  • “Booker T. Washington and Tuskegee Institute” is concentrated around the 1910s.
  • “Whipping and Plantation Punishment” is concentrated in the 1830s because many chunks come from anti-slavery testimony published in 1839.

Document-by-Topic Heatmap

This heatmap shows which documents contribute most strongly to each topic. A topic can look large because it appears across many documents, or because one long document contributes many chunks. This chart tells you which.

What it suggests in this sample:

  • “Whipping and Plantation Punishment” is heavily influenced by American Slavery As It Is.
  • “Ranch Life and Conflict with Native Americans” is heavily influenced by Nat Love’s narrative.
  • Booker T. Washington and Tuskegee topics are heavily influenced by Booker T. Washington, Builder of a Civilization.
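
One way to check how strongly a single document drives a topic is to count where the topic's chunks come from. The document names and counts in this sketch are placeholders, and the function name is hypothetical.

```python
from collections import Counter

def dominant_document(assignments, topic_id):
    """assignments: iterable of (document_name, topic_id) pairs, one per chunk."""
    counts = Counter(doc for doc, topic in assignments if topic == topic_id)
    doc, n = counts.most_common(1)[0]
    return doc, n / sum(counts.values())

# Toy assignments with placeholder document names.
assignments = [("doc_a", 0)] * 8 + [("doc_b", 0)] * 2 + [("doc_b", 1)] * 5
doc, share = dominant_document(assignments, 0)
print(doc, share)  # doc_a 0.8
```

A high share suggests the topic reflects one document's vocabulary more than a theme spread across the collection.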

Sample Documents Timeline

This chart shows the publication years of the 29 sampled documents. Each point is one document. Use it to see how the sample is distributed across time and whether certain decades are over- or under-represented.

Topic Prevalence Over Time (Area Chart)

This chart shows topic prevalence as a stacked area chart across publication decades. Each colored band represents one topic. It gives an overview of how the overall composition of topics shifts across the sample’s time span.

Topic Word Bars

This chart shows the top words for every topic as horizontal bar charts, one panel per topic.

Top words are not simply the most frequent words in a topic — they are the words that appear much more often in that topic than in the rest of the collection. A word like “slavery” might appear in every topic and therefore rank low; a word like “overseer” that appears heavily in one topic but rarely in others will rank high. This scoring method is called c-TF-IDF.

The bar length shows the c-TF-IDF score — a longer bar means the word is more distinctive to that specific topic. Words with the longest bars are the strongest signals for what that topic is about.

Use this to check whether the top words for a topic actually support its label, or whether the label needs revision.

Topic Hierarchy

This chart shows a hierarchical clustering of topics — which topics are most similar to each other and could potentially be merged. Topics that branch close together share more vocabulary and meaning than topics that branch far apart.

Topic Cluster Map

This chart shows all 59 topics plotted in a 2D space based on their embedding positions. Topics that appear close together are semantically similar. Each bubble represents one topic, and the size reflects how many chunks were assigned to it.

How To Interpret the Results

Each topic is a suggested grouping of passages.

A topic label is not a final historical claim. It is a guide.

When interpreting a topic, we should ask:

  • What words define this topic?
  • Which passages were assigned to it?
  • Which documents dominate it?
  • Does the label accurately describe the passages?
  • Is this topic meaningful, or is it an artifact of the data?
  • Are two topics really part of the same larger theme?

Where Human Interpretation Enters

Human interpretation is part of the workflow from the beginning, not only at the end.

People make decisions when they:

  • choose which texts belong in the collection
  • decide whether a sample is appropriate
  • decide what to remove during cleaning
  • decide whether words such as names, page labels, or headings are meaningful or distracting
  • choose chunk size and overlap
  • decide whether to use lemmatization
  • choose the embedding model and clustering settings
  • inspect outliers
  • compare topic words with representative passages
  • accept, revise, merge, split, or reject topic labels
  • interpret visualizations in historical context

The model helps organize the material, but it does not remove the need for judgment.