LLM-Augmented Active Learning for Email Topic Classification

Fall 2025 — CSCI 5541 NLP: Class Project

Josh Krueger - krueg709@umn.edu

Team: Actively Learning

View project on GitHub

Abstract

We investigate reducing human labeling effort on an email topic classification task by augmenting pool-based active learning with LLM-generated (synthetic) labels and LLM-guided selection of examples for human annotation. Using K-Means initialization and an SVM student model wrapped with probability calibration, we show in simulation that the LLM-augmented strategies improve label efficiency (see plots below).


System Architecture

A high-level schematic: embeddings + SVM student are combined with LLM synthetic labeling and LLM strategic selection to reduce human labeling needs.



Introduction / Background / Motivation

Goal: Reduce human labels required for accurate email topic models by leveraging LLMs for labeling and selection.

Context: Standard pool-based active learning selects uncertain examples from an embedding space (we use precomputed embeddings in `embeddings_matrix.npy`). However, many examples can be labeled reliably by a modern LLM; used carefully, this can save substantial human effort.
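To make the selection step concrete, here is a minimal least-confidence sketch over the precomputed embeddings. The `student` argument stands in for any classifier exposing `predict_proba`; the function name and batch size are illustrative, not the project's exact code:

```python
import numpy as np

X = np.load("embeddings_matrix.npy")  # one embedding row per email

def least_confident(student, X, pool_idx, n=10):
    """Return the n pool items the student is least confident about."""
    conf = student.predict_proba(X[pool_idx]).max(axis=1)
    return pool_idx[np.argsort(conf)[:n]]
```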

Impact: Lower annotation cost and faster iterations for domain-specific classifiers while maintaining accuracy.


Approach

Pipeline: (1) precompute embeddings, (2) initialize the labeled set from K-Means cluster centroids, (3) train an SVM student wrapped with probability calibration (so it exposes class probabilities for uncertainty estimates), (4) run active learning iterations that either query humans (via uncertainty or LLM-strategic selection) or request batch LLM labels, and (5) retrain with down-weighted synthetic labels, as sketched below.
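A minimal scikit-learn sketch of steps (2), (3), and (5), assuming the seed set is the pool items nearest each K-Means centroid and that synthetic labels are discounted via `sample_weight`; the seed budget, the 0.5 weight, and the placeholder labels are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import pairwise_distances_argmin

X = np.load("embeddings_matrix.npy")

# (2) Seed the labeled set with the pool items nearest each K-Means centroid,
# so the first human labels cover diverse regions of the embedding space.
k = 12  # illustrative seed budget
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
seed_idx = pairwise_distances_argmin(km.cluster_centers_, X)

# (3) Student: a linear SVM wrapped with calibration so it exposes
# predict_proba, which the uncertainty-based query step relies on.
student = CalibratedClassifierCV(LinearSVC(), cv=3)

# (5) Retrain with down-weighted synthetic labels: human labels keep full
# weight, LLM labels get a discount (0.5 here is illustrative).
y = np.arange(len(seed_idx)) % 3               # placeholder topic labels
is_human = np.ones(len(seed_idx), dtype=bool)  # seeds are human-labeled
weights = np.where(is_human, 1.0, 0.5)
student.fit(X[seed_idx], y, sample_weight=weights)
```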

Key components: Core active learning loops with LLM integration, an interactive notebook demonstrating experimental runs, precomputed embeddings, and a dataset with email subjects and bodies.

LLM usage: We use a Gemini Flash model with batched JSON output. One prompt type returns synthetic labels for many items at once; a separate strategic-selection prompt returns which items should be sent to human annotators (see the sketch below).
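A minimal sketch of the batched labeling call, assuming the `google-generativeai` Python SDK; the model name, prompt wording, and JSON schema are illustrative rather than the project's exact prompts:

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative Flash model

def llm_label_batch(emails, topics):
    """Label a batch of (subject, body) pairs; returns {index: topic} or None."""
    prompt = (
        f"Classify each email into exactly one topic from {topics}. "
        'Respond with a JSON list of {"id": <int>, "label": "<topic>"} '
        "objects and nothing else.\n\n"
        + "\n\n".join(f"[{i}] {subj}\n{body}"
                      for i, (subj, body) in enumerate(emails))
    )
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    try:
        items = json.loads(response.text)
        return {item["id"]: item["label"] for item in items}
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output: fall back to human labeling
```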


Results

Evaluation: Simulated experiments compare (1) standard uncertainty-sampling AL, (2) LLM-labeling AL, and (3) LLM-strategic AL (Table 1). We measure accuracy on held-out data as a function of the number of human labels.

| Experiment | Standard AL | LLM-Labeling AL | LLM-Strategic AL |
| --- | --- | --- | --- |
| Metric | Accuracy vs. human labels | Accuracy vs. human labels | Accuracy vs. human labels |
| Notes | Baseline uncertainty sampling | LLM provides synthetic labels (discounted weight) | LLM selects which examples humans should label |

Table 1. Summary of experiment conditions.


Figure: Example accuracy curves (notebook `active_learning.ipynb` reproduces these runs).
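The accuracy-vs-labels curves come from a loop like the following sketch, where `y_oracle` simulates the human annotator; the function name, batch size, and budget are hypothetical stand-ins for the notebook's helpers:

```python
import numpy as np
from sklearn.base import clone

def simulate_run(student, X_pool, y_oracle, X_test, y_test,
                 seed_idx, batch_size=10, budget=200):
    """One simulated AL run: (num human labels, held-out accuracy) pairs."""
    labeled = list(seed_idx)
    curve = []
    while len(labeled) <= budget:
        model = clone(student).fit(X_pool[labeled], y_oracle[labeled])
        curve.append((len(labeled), model.score(X_test, y_test)))
        # Baseline query step: send the least-confident pool items to the
        # simulated human.  The LLM-strategic condition swaps this step for
        # the LLM's picks; LLM-labeling adds weighted synthetic labels.
        pool = np.setdiff1d(np.arange(len(X_pool)), labeled)
        conf = model.predict_proba(X_pool[pool]).max(axis=1)
        labeled += list(pool[np.argsort(conf)[:batch_size]])
    return curve
```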


Conclusion and Future Work

LLM-augmented active learning shows promise for reducing human labeling budgets. Reproducing the synthetic-label experiments requires the dataset, the precomputed embeddings, and a (possibly paid) LLM API key. Future work: improved prompt engineering, better calibration of synthetic-label weights, and an interactive human-in-the-loop labeling UI.