2025

TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

Yan Xia*†, Yunxiang Lu*, Rui Song, Oussema Dhaouadi, João F. Henriques, Daniel Cremers

ICCV 2025

We tackle the problem of localizing traffic surveillance cameras in cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. Moreover, we introduce TrafficLoc, a novel neural network that localizes traffic cameras within a 3D reference map using a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a novel Geometry-guided Attention Loss to address cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning scheme that achieves precise alignment while preserving the distinctiveness of local intra-features within image patch-point group pairs. In addition, we introduce Dense Training Alignment with a soft-argmax operator to take additional features into account when regressing the final position. Extensive experiments show that TrafficLoc improves localization accuracy over state-of-the-art image-to-point cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on the KITTI and NuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras.
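
The soft-argmax at the heart of the Dense Training Alignment idea is easy to illustrate. Below is a minimal PyTorch sketch (the function name and temperature value are our assumptions, not from the paper): it turns a matching-score map into an expected sub-pixel position, so gradients reach every location instead of only the hard maximum.

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(score_map: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Differentiable 2D soft-argmax returning sub-pixel (x, y) coordinates.

    score_map: (B, H, W) matching scores; higher means a better match.
    The softmax spreads gradients over all locations, so dense supervision
    (as in the paper's Dense Training Alignment) can touch every feature.
    """
    b, h, w = score_map.shape
    probs = F.softmax(score_map.view(b, -1) / temperature, dim=-1).view(b, h, w)
    ys = torch.linspace(0, h - 1, h, device=score_map.device)
    xs = torch.linspace(0, w - 1, w, device=score_map.device)
    y = (probs.sum(dim=2) * ys).sum(dim=1)   # expected row index
    x = (probs.sum(dim=1) * xs).sum(dim=1)   # expected column index
    return torch.stack([x, y], dim=-1)       # (B, 2)
```

A lower temperature sharpens the distribution toward the hard argmax; a higher one keeps more locations in play during training.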

3D scenes · image-point cloud feature fusion · autonomous driving

OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion

Shuhao Kang*, Martin Y. Liao*, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers

CoRL 2025

LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components: a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multiscale radial features into discriminative global descriptors. Extensive experiments on the augmented KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher top-1 recall at the 1 m threshold while running 12× faster than state-of-the-art approaches.
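
To make the radial idea concrete, here is a toy PyTorch sketch of pooling per-point features into concentric range bins and concatenating them into a global descriptor. It is a hypothetical stand-in for OPAL's adaptive radial fusion; the function name, bin count, and maximum range are all assumptions.

```python
import torch

def radial_descriptor(points: torch.Tensor, feats: torch.Tensor,
                      num_bins: int = 8, max_range: float = 50.0) -> torch.Tensor:
    """Toy radial pooling: max-pool point features into concentric range
    bins and concatenate the bin descriptors into one global vector.

    points: (N, 2) BEV coordinates of one LiDAR scan (or rasterized OSM tile)
    feats:  (N, C) per-point features
    """
    r = points.norm(dim=-1).clamp(max=max_range - 1e-6)
    bin_idx = (r / max_range * num_bins).long()        # (N,) bin of each point
    c = feats.shape[1]
    desc = feats.new_zeros(num_bins, c)
    for b in range(num_bins):
        mask = bin_idx == b
        if mask.any():
            desc[b] = feats[mask].max(dim=0).values    # per-bin max pooling
    return desc.flatten()                              # (num_bins * C,)
```

Per-bin max pooling keeps the descriptor invariant to point ordering while preserving coarse range structure, which is what makes a radial layout comparable across the LiDAR and OSM modalities.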

LiDAR place recognition · OpenStreetMap · multiscale radial feature fusion

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

Haokun Chen, Jianing Li, Yao Zhang, Jinhe Bi, Yan Xia, Jindong Gu, Volker Tresp

AAAI 2026

Multimodal Large Language Models (MLLMs) achieve impressive performance once trained on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning, which removes target data without resource-consuming retraining. However, while well studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting, effectively isolating the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench, the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
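
As a rough illustration of adversarial-perturbation-based unlearning, consider the sketch below. It is a generic recipe, not the published AUVIC update, and it assumes a Hugging Face-style interface where `model(pixel_values, labels=...)` returns an object with a `.loss` field; the step size and loss weighting are likewise assumptions.

```python
import torch

def adversarial_unlearn_loss(model, target_imgs, target_labels,
                             retain_imgs, retain_labels,
                             eps: float = 2 / 255, lam: float = 1.0):
    """Generic sketch of one unlearning objective (not the AUVIC algorithm).

    1) Craft an FGSM-style perturbation that maximizes the loss on the
       target concept, i.e. the input that most strongly activates it.
    2) Forget: gradient *ascent* on the perturbed target (negative loss).
    3) Retain: standard loss on related but non-target entities.
    """
    x = target_imgs.clone().requires_grad_(True)
    loss = model(x, labels=target_labels).loss
    (grad,) = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).detach()           # perturbed target images

    forget = -model(x_adv, labels=target_labels).loss  # push the concept away
    retain = model(retain_imgs, labels=retain_labels).loss
    return forget + lam * retain                       # minimize with an optimizer
```

The retention term is what keeps forgetting localized: without it, gradient ascent on the target concept tends to degrade similar entities as well.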

multimodal large language models · data privacy · target visual concepts

SparseAlign: A Fully Sparse Framework for Cooperative Object Detection

Yunshuang Yuan, Yan Xia†, Daniel Cremers, Monika Sester

CVPR 2025

This paper proposes SparseAlign, a fully sparse framework for cooperative object detection. By exchanging sparse feature representations and aligning them across agents, the framework enables efficient multi-agent cooperative perception. Our method significantly reduces communication overhead while maintaining detection accuracy, offering a practical solution for cooperative perception in autonomous driving and intelligent transportation systems.
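
The communication saving of a fully sparse design can be illustrated with a toy pack/unpack routine (a sketch of the general idea, not SparseAlign's actual protocol or data format): an agent shares only active BEV cells instead of a dense grid.

```python
import torch

def pack_sparse_features(dense: torch.Tensor, thresh: float = 0.0):
    """Transmit only the coordinates and values of active BEV cells.

    dense: (C, H, W) BEV feature map of one agent.
    Returns (K, 2) cell indices and (K, C) features, with K << H*W
    whenever the scene is sparse.
    """
    mask = dense.abs().amax(dim=0) > thresh            # (H, W) active cells
    coords = mask.nonzero(as_tuple=False)              # (K, 2) indices
    values = dense[:, mask].t().contiguous()           # (K, C) features
    return coords, values

def unpack_sparse_features(coords, values, shape):
    """Receiver side: scatter the shared cells back into a dense grid."""
    c = values.shape[1]
    dense = values.new_zeros(c, *shape)
    dense[:, coords[:, 0], coords[:, 1]] = values.t()
    return dense
```

Because bandwidth scales with the number of active cells rather than the grid size, the payload shrinks with scene sparsity instead of staying constant.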

cooperative object detection · fully sparse framework · autonomous driving

Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang, Mang Ling Ada Fok, Jialu Ma, Yan Xia†, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

CVPR 2025

This paper investigates localizing events in videos with multimodal queries. We propose a novel approach that handles queries spanning multiple modalities, including text, images, and audio, to accurately localize the relevant events in a video. Our method achieves state-of-the-art performance on multiple benchmark datasets, contributing to both video understanding and video retrieval.
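
A minimal similarity-based baseline shows the shape of the task (illustrative only; the paper's model is learned end-to-end, and the fusion-by-averaging here is our simplification): fuse the query modalities into one embedding, score every clip, and return the best-scoring window.

```python
import torch
import torch.nn.functional as F

def localize_event(clip_feats: torch.Tensor, query_feats: torch.Tensor,
                   window: int = 5):
    """Locate the video span that best matches a fused multimodal query.

    clip_feats:  (T, D) features of T video clips
    query_feats: (M, D) embeddings of M query modalities (text, image, ...)
    Returns (start, end) clip indices of the highest-scoring window.
    """
    q = F.normalize(query_feats.mean(dim=0), dim=-1)    # fused query vector
    v = F.normalize(clip_feats, dim=-1)
    scores = v @ q                                      # (T,) cosine similarity
    # Slide a fixed window and pick the span with the highest mean score.
    pooled = scores.unfold(0, window, 1).mean(dim=-1)   # (T - window + 1,)
    start = int(pooled.argmax())
    return start, start + window
```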

multimodal queries · video event localization · video understanding

2024

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia*†, Letian Shi*, Zifeng Ding, João F. Henriques, Daniel Cremers

CVPR 2024

This paper presents Text2Loc, a novel method for 3D point cloud localization from natural language descriptions. We design a hierarchical Transformer architecture that interprets a natural-language description and accurately localizes the described position in large-scale 3D point cloud scenes. Our method performs strongly on multiple indoor and outdoor datasets, opening new research directions for language-based 3D scene understanding.
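
The coarse stage of text-to-point-cloud localization can be sketched as cosine-similarity retrieval over precomputed embeddings (the encoders are assumed here; in Text2Loc they are the hierarchical Transformer): rank map cells against the description and hand the top-k candidates to fine position refinement.

```python
import torch
import torch.nn.functional as F

def retrieve_cells(text_emb: torch.Tensor, cell_embs: torch.Tensor, k: int = 3):
    """Coarse text-to-cell retrieval by cosine similarity.

    text_emb:  (D,) embedding of the natural-language description
    cell_embs: (N, D) embeddings of N point cloud submaps ("cells")
    Returns the indices and scores of the k best-matching cells.
    """
    sims = F.normalize(cell_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    topk = sims.topk(k)
    return topk.indices, topk.values
```

Keeping the coarse stage as pure embedding retrieval means the map side can be indexed offline; only the short text query is encoded at localization time.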

3D point cloud localization · hierarchical Transformer · natural language processing

2023

CASSPR: Cross Attention Single Scan Place Recognition

Yan Xia*†, Mariia Gladkova*, Rui Wang, Qianyun Li, Uwe Stilla, João F. Henriques, Daniel Cremers

ICCV 2023

This paper proposes CASSPR, a cross-attention method for single-scan place recognition. Its cross-attention architecture enables efficient place recognition from a single LiDAR scan, significantly improving recognition accuracy and robustness in complex environments. Our method achieves state-of-the-art performance on multiple benchmark datasets, providing a solid building block for robot navigation and SLAM systems.
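
A minimal cross-attention fusion block conveys the mechanism (a sketch of the general pattern; CASSPR's hierarchical design has more structure, and the branch roles in the comments are our assumption):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse two feature branches with cross attention: queries come from
    one branch, keys/values from the other, so each query token gathers
    context from the entire other branch.
    """
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_feats: torch.Tensor, kv_feats: torch.Tensor):
        # q_feats: (B, Nq, D), e.g. voxel features; kv_feats: (B, Nk, D),
        # e.g. point features from the same single scan.
        fused, _ = self.attn(q_feats, kv_feats, kv_feats)
        return self.norm(q_feats + fused)   # residual connection + layer norm
```

Letting one branch query the other is what allows a single sparse scan to borrow complementary structure it cannot recover from either representation alone.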

place recognition · cross attention · LiDAR · robot navigation

2021

SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud Based Place Recognition

Yan Xia, Yusheng Xu, Shuang Li, Rui Wang, Juan Du, Daniel Cremers, Uwe Stilla

CVPR 2021

This paper proposes SOE-Net, a self-attention and orientation encoding network for point cloud based place recognition. The method captures the geometric structure of point clouds through an orientation encoding technique and uses self-attention to learn long-range dependencies between points, significantly improving point cloud-based place recognition accuracy. SOE-Net performs strongly on multiple benchmark datasets, supporting mobile robot localization and navigation.
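
A toy octant-based grouping conveys the flavor of orientation-aware encoding (a simplified sketch, not SOE-Net's exact operator; the octant scheme and sentinel value are our choices):

```python
import torch

def orientation_encode(points: torch.Tensor, neighbors: torch.Tensor):
    """Group each point's neighbors into the eight octants of its local
    frame and max-pool per octant, so the descriptor keeps directional
    structure instead of collapsing all neighbors into one pool.

    points:    (N, 3) query points
    neighbors: (N, K, 3) K neighbors per point
    Returns an (N, 24) orientation-aware descriptor.
    """
    rel = neighbors - points.unsqueeze(1)                  # (N, K, 3) offsets
    weights = torch.tensor([1, 2, 4], device=rel.device)
    octant = ((rel > 0).long() * weights).sum(-1)          # (N, K) in [0, 8)
    feats = rel.new_zeros(points.shape[0], 8, 3)
    for o in range(8):
        mask = octant == o                                 # (N, K)
        masked = torch.where(mask.unsqueeze(-1), rel, rel.new_full((), -1e9))
        # Empty octants keep the sentinel value in this toy version.
        feats[:, o] = masked.max(dim=1).values             # per-octant max pool
    return feats.flatten(1)
```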

place recognition · self-attention · orientation encoding · point cloud processing