Adversarial TDA

Adversarial perturbations in the context of large language models (LLMs) are subtle changes added to input data (i.e., images or text) that are designed to alter predictions or outputs of machine learning models. We introduce several novel visualizations using topological data anal- ysis (TDA) (leveraging persistent homology) to characterize how adversarial perturbations act on text inputs, specifically, how sandbagging and code-injection attacks alter the geometric structure of attention heads in transformer mod- els. By computing persistent homology metrics from attention maps across different model ar- chitectures (such as BERT, RoBERTa, ELEC- TRA, DistilGPT, etc.), we find that adversarial inputs alter higher-dimensional topological fea- tures (H1 loops and H2 voids) in ways that distinguish them from clean, non-adversarial inputs.