Vitalii Shutov and Arip Asadulaev, ITMO University, Russia
Vision Transformers (ViTs) offer strong performance but incur high computational cost from processing all tokens through their full depth. Standard ViTs lack adaptivity. This work introduces the Adaptive Halting Transformer (AHT-ViT) to improve efficiency by dynamically adjusting processing depth per token. AHT-ViT employs hierarchical "planner" modules that predict token-specific target halting depths and an extremely parameter-efficient "supervisor" mechanism (two shared parameters) that generates per-layer halting scores. A token halts once its cumulative score crosses a threshold. A novel KL-divergence-based loss, L_target-depth, explicitly aligns the executed halting distributions with the planned depths. Evaluation on ImageNet, Places365, and CIFAR-100 using DeiT-S shows that AHT-ViT achieves an improved accuracy-efficiency trade-off over its static baseline and competitive performance against other adaptive methods (DynamicViT, A-ViT) evaluated under the same conditions, while significantly reducing FLOPs. Key hyperparameters were selected via grid search on a validation split.
Vision Transformer, Adaptive Computation, Early Exit, Dynamic Depth, Model Efficiency, Image Classification.
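The cumulative halting rule and the target-depth alignment described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the names beta, gamma, tau, token_feat, and target_logits are illustrative assumptions, and the executed halting distribution is approximated here by a softmax over cumulative scores.

    # Minimal sketch of cumulative halting and a KL target-depth loss (illustrative only).
    import torch
    import torch.nn.functional as F

    B, N, L = 2, 197, 12   # batch, tokens, transformer depth (DeiT-S has 12 blocks)
    tau = 1.0              # halting threshold (assumed value)

    # "Supervisor": two shared scalars mapping a per-token statistic to a halting score.
    beta = torch.nn.Parameter(torch.tensor(5.0))
    gamma = torch.nn.Parameter(torch.tensor(-2.5))

    token_feat = torch.randn(B, N, L)  # stand-in for a per-layer, per-token statistic

    # Per-layer halting scores in (0, 1) and their running sum over depth.
    h = torch.sigmoid(beta * token_feat + gamma)   # (B, N, L)
    cum = torch.cumsum(h, dim=-1)

    # A token is treated as halted at the first layer where its cumulative score crosses tau.
    halted = cum >= tau
    exec_depth = torch.where(halted.any(-1), halted.float().argmax(-1), torch.tensor(L - 1))

    # KL-style target-depth loss: align a proxy for the executed halting distribution
    # with a planner-predicted target distribution over depths.
    exec_dist = F.softmax(cum, dim=-1)             # proxy executed halting distribution
    target_logits = torch.randn(B, N, L)           # stand-in for hierarchical planner outputs
    target_dist = F.softmax(target_logits, dim=-1)
    loss_target_depth = F.kl_div(exec_dist.log(), target_dist, reduction='batchmean')

In this sketch, per-layer halting probabilities come from only two shared parameters (beta, gamma), mirroring the abstract's claim of an extremely parameter-efficient supervisor, while the KL term penalizes mismatch between where tokens actually halt and where the planner intends them to halt.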