LLM-Augmented Active Learning for Email Topic Classification

Fall 2025 — CSCI 5541 NLP: Class Project

Josh Krueger - krueg709@umn.edu

Team: Actively Learning

View project on GitHub

Abstract

We investigate reducing human labeling effort on an email topic classification task by augmenting pool-based active learning with LLM-generated (synthetic) labels and LLM-guided selection of examples for human annotation. Using K-Means initialization and an SVM student model wrapped with probability calibration, we show in simulation that the LLM-augmented strategies improve label efficiency (see plots below).


System Architecture

A high-level schematic: embeddings + SVM student are combined with LLM synthetic labeling and LLM strategic selection to reduce human labeling needs.



Introduction / Background / Motivation

Goal: Reduce human labels required for accurate email topic models by leveraging LLMs for labeling and selection.

Context: Standard pool-based active learning selects uncertain examples from an embedding space (we use precomputed embeddings in `embeddings_matrix.npy`). However, many examples can be labeled reliably by a modern LLM; used carefully, this can save substantial human effort.
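To make the selection step concrete, here is a minimal least-confidence sketch over the precomputed embeddings. The `student` argument stands in for any classifier exposing `predict_proba`; the function name and batch size are illustrative, not the project's exact code:

```python
import numpy as np

X = np.load("embeddings_matrix.npy")  # one embedding row per email

def least_confident(student, X, pool_idx, n=10):
    """Return the n pool items the student is least confident about."""
    conf = student.predict_proba(X[pool_idx]).max(axis=1)
    return pool_idx[np.argsort(conf)[:n]]
```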

Impact: Lower annotation cost and faster iterations for domain-specific classifiers while maintaining accuracy.


Approach

Pipeline: (1) precompute embeddings, (2) initialize the labeled set from K-Means cluster centroids, (3) train an SVM student wrapped with probability calibration (so it exposes class probabilities for uncertainty estimates), (4) run active learning iterations that either query humans (via uncertainty or LLM-strategic selection) or request batch LLM labels, and (5) retrain with down-weighted synthetic labels, as sketched below.
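A minimal scikit-learn sketch of steps (2), (3), and (5), assuming the seed set is the pool items nearest each K-Means centroid and that synthetic labels are discounted via `sample_weight`; the seed budget, the 0.5 weight, and the placeholder labels are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import pairwise_distances_argmin

X = np.load("embeddings_matrix.npy")

# (2) Seed the labeled set with the pool items nearest each K-Means centroid,
# so the first human labels cover diverse regions of the embedding space.
k = 12  # illustrative seed budget
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
seed_idx = pairwise_distances_argmin(km.cluster_centers_, X)

# (3) Student: a linear SVM wrapped with calibration so it exposes
# predict_proba, which the uncertainty-based query step relies on.
student = CalibratedClassifierCV(LinearSVC(), cv=3)

# (5) Retrain with down-weighted synthetic labels: human labels keep full
# weight, LLM labels get a discount (0.5 here is illustrative).
y = np.arange(len(seed_idx)) % 3               # placeholder topic labels
is_human = np.ones(len(seed_idx), dtype=bool)  # seeds are human-labeled
weights = np.where(is_human, 1.0, 0.5)
student.fit(X[seed_idx], y, sample_weight=weights)
```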

Key components: Core active learning loops with LLM integration, an interactive notebook demonstrating experimental runs, precomputed embeddings, and a dataset with email subjects and bodies.

LLM usage: We use a Gemini Flash model with batched JSON output. One prompt type returns synthetic labels for many items at once; a separate strategic-selection prompt returns which items should be sent to human annotators (see the sketch below).
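A minimal sketch of the batched labeling call, assuming the `google-generativeai` Python SDK; the model name, prompt wording, and JSON schema are illustrative rather than the project's exact prompts:

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative Flash model

def llm_label_batch(emails, topics):
    """Label a batch of (subject, body) pairs; returns {index: topic} or None."""
    prompt = (
        f"Classify each email into exactly one topic from {topics}. "
        'Respond with a JSON list of {"id": <int>, "label": "<topic>"} '
        "objects and nothing else.\n\n"
        + "\n\n".join(f"[{i}] {subj}\n{body}"
                      for i, (subj, body) in enumerate(emails))
    )
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    try:
        items = json.loads(response.text)
        return {item["id"]: item["label"] for item in items}
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output: fall back to human labeling
```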


Results

Evaluation: Simulated experiments compare (1) standard uncertainty-sampling AL, (2) LLM-labeling AL, and (3) LLM-strategic AL (Table 1). We measure accuracy on held-out data as a function of the number of human labels.

| Experiment | Standard AL | LLM-Labeling AL | LLM-Strategic AL |
| --- | --- | --- | --- |
| Metric | Accuracy vs. human labels | Accuracy vs. human labels | Accuracy vs. human labels |
| Notes | Baseline uncertainty sampling | LLM provides synthetic labels (discounted weight) | LLM selects which examples humans should label |

Table 1. Summary of experiment conditions.


Figure: Example accuracy curves (notebook `active_learning.ipynb` reproduces these runs).
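The accuracy-vs-labels curves come from a loop like the following sketch, where `y_oracle` simulates the human annotator; the function name, batch size, and budget are hypothetical stand-ins for the notebook's helpers:

```python
import numpy as np
from sklearn.base import clone

def simulate_run(student, X_pool, y_oracle, X_test, y_test,
                 seed_idx, batch_size=10, budget=200):
    """One simulated AL run: (num human labels, held-out accuracy) pairs."""
    labeled = list(seed_idx)
    curve = []
    while len(labeled) <= budget:
        model = clone(student).fit(X_pool[labeled], y_oracle[labeled])
        curve.append((len(labeled), model.score(X_test, y_test)))
        # Baseline query step: send the least-confident pool items to the
        # simulated human.  The LLM-strategic condition swaps this step for
        # the LLM's picks; LLM-labeling adds weighted synthetic labels.
        pool = np.setdiff1d(np.arange(len(X_pool)), labeled)
        conf = model.predict_proba(X_pool[pool]).max(axis=1)
        labeled += list(pool[np.argsort(conf)[:batch_size]])
    return curve
```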


Conclusion and Future Work

LLM-augmented active learning shows promise for reducing human labeling budgets. Reproducing the synthetic-label experiments requires the dataset, the precomputed embeddings, and a (possibly paid) LLM API key. Future work: improved prompt engineering, better calibration of synthetic-label weights, and an interactive human-in-the-loop labeling UI.