This task combines visual and auxiliary linguistic modalities to jointly locate the target object in a video sequence. Currently, multi-modal data scarcity and burdensome modality fusion ...
However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and a sparse latent space. In contrast, anomaly ...