Second 10% Sample Results

A different random sample — seed 9, 29 documents, 1,752 chunks

About This Sample

This page presents results from a second 10% sample of the collection, drawn with a different random seed. Seed 9 was chosen because it produces a sample with zero overlap with the first sample (seed 42): all 29 documents are different. This is not guaranteed for every seed. Because the script draws 29 documents at random from 294, two independently drawn samples share about three documents on average (29 × 29 / 294 ≈ 2.9), so some overlap is the usual outcome. Seed 9 happens to avoid it entirely. Comparing the two runs shows how topic modeling results vary across samples and which themes are stable enough to appear in both.
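How unusual a zero-overlap draw is can be checked with a short calculation. The sketch below assumes that both samples are simple random draws of 29 documents, without replacement, from the same 294 documents; that is an assumption about how the script samples, and the function names are illustrative rather than part of the script.

```python
from math import comb


def prob_zero_overlap(population: int, sample_size: int) -> float:
    """Probability that two independent samples of the same size,
    each drawn without replacement, share no documents."""
    # The second sample must fall entirely among the documents
    # that the first sample did not pick.
    return comb(population - sample_size, sample_size) / comb(population, sample_size)


def expected_overlap(population: int, sample_size: int) -> float:
    """Expected number of documents shared by the two samples."""
    # Each document in the second sample is in the first sample
    # with probability sample_size / population.
    return sample_size * sample_size / population


if __name__ == "__main__":
    print(f"P(zero overlap)  = {prob_zero_overlap(294, 29):.3f}")
    print(f"Expected overlap = {expected_overlap(294, 29):.2f} documents")
```

Under these assumptions a zero-overlap pair is the exception rather than the rule, which is why the seed had to be picked deliberately.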

To reproduce this run yourself, use the same script and the same command from the Hands-On Exercise, but add --seed 9 to the command. No changes to the script code are needed, because --seed is a command-line flag. For example:

python -u scripts/run_bertopic_sample.py \
  --seed 9 \
  --output-dir outputs/my_run_seed9 \
  --embedding-backend ollama \
  --ollama-embedding-model nomic-embed-text \
  --representation-backend ctfidf \
  --clustering sensitive \
  --label-backend ollama \
  --ollama-model llama3.1:latest

Sample fraction: 10%
Random seed:     9
Documents:       29
Chunks:          1,752
Topics found:    60
Outliers:        ~25%
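
The outlier figure is the share of chunks that were not assigned to any topic. BERTopic gives such chunks the topic id -1. A minimal sketch of the calculation, assuming the topic ids have already been read into a list (for example from topic_assignments.csv, whose column names are not shown on this page); the function name is illustrative:

```python
def outlier_share(topic_ids):
    """Fraction of chunks whose topic id is -1 (BERTopic's outlier label)."""
    ids = list(topic_ids)
    if not ids:
        return 0.0
    return sum(1 for topic_id in ids if topic_id == -1) / len(ids)


# Toy input, not data from this run: one outlier among four chunks.
print(outlier_share([0, 3, -1, 7]))
```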

All CSV files and HTML visualizations for this run are in the repository under:

outputs/bertopic_sample2_nomic_sensitive_lemmatized/
  topic_review_table.csv
  topic_labels_llm.csv
  topic_assignments.csv
  topic_info.csv
  sample_documents.csv
  visualizations/         ← 4 BERTopic charts (open in browser)
  metadata_visualizations/ ← 5 metadata charts (open in browser)

The 29 sampled documents are completely different from the first sample (zero overlap):

fpn-burton-burton.xml
fpn-hortonlife-horton.xml
fpn-lane-lane.xml
fpn-mason-mason.xml
fpn-veney-veney.xml
neh-barrett-barrett.xml
neh-branham-branham.xml
neh-brinch-brinch.xml
neh-brown55-brown55.xml
neh-brownww-brown.xml
neh-clarkes-clarkes.xml
neh-delaney-delaney.xml
neh-dsmith-dsmith.xml
neh-edwardsc-edwards.xml
neh-hayden-hayden.xml
neh-henderson-henderson.xml
neh-jacksonc-jackson.xml
neh-leehf-leehf.xml
neh-mallory-mallory.xml
neh-mott-mott.xml
neh-mott26-mott26.xml
neh-nell-nell.xml
neh-parker-parker.xml
neh-pickard-pickard.xml
neh-rudd-rudd.xml
neh-slaveryillus-slaveryillus.xml
neh-story-story.xml
neh-webster-webster.xml
neh-wilkerson-wilkerson.xml
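
The zero-overlap claim can be checked by comparing the document lists of the two runs. The sketch below assumes each run's sample_documents.csv has a column holding the file name; the column name "filename" is a guess and should be replaced with whatever the file actually uses. The toy data stands in for the real files.

```python
import csv
import io


def read_filenames(handle, column="filename"):
    """Collect the values of one column from an open CSV file.

    The default column name is an assumption, not taken from the script.
    """
    return {row[column] for row in csv.DictReader(handle)}


def shared_documents(first, second):
    """Return the documents present in both samples, sorted by name."""
    return sorted(set(first) & set(second))


# Toy data in place of two sample_documents.csv files.
run_a = io.StringIO("filename\nexample-a.xml\nexample-b.xml\n")
run_b = io.StringIO("filename\nexample-c.xml\n")

overlap = shared_documents(read_filenames(run_a), read_filenames(run_b))
print(f"{len(overlap)} shared documents: {overlap}")
```

With real files, pass handles opened with open(path, newline="", encoding="utf-8") instead of the StringIO objects.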

Topic Review Table (Top Topics)

The table below shows the 15 largest topics by chunk count. Download the full CSV files for all 60 topics.

Topic  Label                                             Chunks
0      Farm, Bond, Scott, Cotton                         73
1      Longing for Family and Freedom                    71
2      Struggling for Freedom                            61
3      Escape by Sea                                     51
4      Expressions of Sympathy and Gratitude             47
5      Family Relationships and Guidance                 46
6      American Identity and Nationalism                 37
7      The Struggle for Equal Rights and Representation  36
8      Fighting for Freedom and Citizenship              35
9      Confrontations with Slave Catchers                33
10     Ministerial Training and Early Ministry           30
11     Whipping and Plantation Punishment                29
12     Traveling to Work or Deliveries                   27
13     Effective Classroom Management                    26
14     Faith in Salvation and Heaven                     26

Top Topics in This Sample

Some topic labels from this sample — compare against the first sample and consider: which topics appear in both? Which appear only here? Download the full CSV for all 60 topics.

  • Farm, Bond, Scott, Cotton
  • Longing for Family and Freedom
  • Struggling for Freedom
  • Escape by Sea
  • Expressions of Sympathy and Gratitude
  • Family Relationships and Guidance
  • American Identity and Nationalism
  • The Struggle for Equal Rights and Representation
  • Fighting for Freedom and Citizenship
  • Confrontations with Slave Catchers
  • Ministerial Training and Early Ministry
  • Whipping and Plantation Punishment
  • Traveling to Work or Deliveries
  • Effective Classroom Management
  • Faith in Salvation and Heaven

Note that similar labels — such as multiple “Whipping and Plantation Punishment” topics — can appear in the same run. These are distinct clusters the model kept separate because the passages within them differ in vocabulary or context, even though they share a broad theme. This is expected behavior, not an error. Inspecting the top words and representative passages in topic_assignments.csv will show what distinguishes them.
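
Repeated labels can be listed before inspecting the clusters by hand. A minimal sketch, assuming the labels have already been read into a list (for example from topic_labels_llm.csv, whose column names are not shown on this page); the function name and the toy labels are illustrative:

```python
from collections import Counter


def repeated_labels(labels):
    """Return each label that occurs more than once, with its count."""
    counts = Counter(label.strip() for label in labels)
    return {label: n for label, n in counts.items() if n > 1}


# Toy input, not the labels from this run.
example = ["Escape by Sea", "Plantation Work", "Escape by Sea"]
print(repeated_labels(example))
```

This only catches labels that match exactly after trimming whitespace. Labels that differ in wording but describe the same theme still need a manual read.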

Download the Results

  • Download topic review table
  • Download topic labels
  • Download topic info
  • Download topic assignments

Visualizations

Topic Shares by Publication Decade

Topic-by-Decade Heatmap

Document-by-Topic Heatmap

Sample Documents Timeline

Topic Prevalence Over Time (Area Chart)

Topic Word Bars

Topic Hierarchy

Topic Cluster Map