We tackle the problem of localizing traffic surveillance cameras for cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections built in the CARLA simulator. We further introduce TrafficLoc, a novel neural network that localizes traffic cameras within a 3D reference map using a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a Geometry-guided Attention Loss that addresses cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning scheme that achieves precise alignment while preserving the distinctiveness of local intra-modal features within each image patch-point group pair. In addition, we introduce Dense Training Alignment with a soft-argmax operator so that additional features contribute when regressing the final position. Extensive experiments show that TrafficLoc improves localization accuracy over state-of-the-art image-to-point-cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also sets a new state of the art on the KITTI and nuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras.
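As a minimal illustration of the soft-argmax idea mentioned above, the sketch below regresses a differentiable 2D position from a score map. It is not TrafficLoc's actual code; the function name `soft_argmax_2d`, the temperature value, and the map size are assumptions for demonstration only.

```python
# Illustrative sketch (not the authors' implementation): soft-argmax over a 2D
# score map, as one might use for dense training alignment when regressing a
# final position from matching scores.
import torch
import torch.nn.functional as F

def soft_argmax_2d(scores: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """scores: (B, H, W) matching scores; returns (B, 2) expected (x, y) coordinates."""
    b, h, w = scores.shape
    probs = F.softmax(scores.view(b, -1) / temperature, dim=-1).view(b, h, w)
    ys = torch.linspace(0, h - 1, h, device=scores.device)
    xs = torch.linspace(0, w - 1, w, device=scores.device)
    # Expected coordinates under the softmax distribution (fully differentiable).
    exp_y = (probs.sum(dim=2) * ys).sum(dim=1)
    exp_x = (probs.sum(dim=1) * xs).sum(dim=1)
    return torch.stack([exp_x, exp_y], dim=-1)

scores = torch.randn(4, 32, 32)        # e.g. image-patch / point-group similarity scores
print(soft_argmax_2d(scores).shape)    # torch.Size([4, 2])
```

Unlike a hard argmax, the expectation over the softmax distribution lets gradients flow from the regressed position back to all score entries, which is what allows additional features to influence training.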
LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built dense 3D maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain gap between sparse LiDAR scans and structured OSM data through two carefully designed components: a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multiscale radial features into discriminative global descriptors. Extensive experiments on the augmented KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher top-1 recall at a 1 m threshold while running 12x faster at inference than state-of-the-art approaches.
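The sketch below shows one plausible way to fuse per-ring radial features into a single global descriptor with learned weights. It is a hedged approximation of the adaptive radial fusion idea, not OPAL's released module; the class name, feature dimension, and weight head are assumptions.

```python
# Hypothetical sketch of adaptive radial fusion: learn a weight per radial bin
# and pool the weighted bin features into one L2-normalized global descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRadialFusion(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Predict one fusion weight per radial ring from its own feature.
        self.weight_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, radial_feats: torch.Tensor) -> torch.Tensor:
        """radial_feats: (B, R, D) features for R radial bins; returns (B, D) descriptor."""
        w = torch.softmax(self.weight_head(radial_feats), dim=1)   # (B, R, 1) fusion weights
        desc = (w * radial_feats).sum(dim=1)                       # weighted sum over rings
        return F.normalize(desc, dim=-1)                           # unit-norm global descriptor

fusion = AdaptiveRadialFusion()
print(fusion(torch.randn(2, 8, 256)).shape)   # torch.Size([2, 256])
```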
Multimodal Large Language Models (MLLMs) achieve impressive performance when trained on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning, which removes target data without resource-intensive retraining. However, while well studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is removing a target visual concept precisely without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting, effectively isolating the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench, the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
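To make the notion of an adversarial perturbation concrete, the sketch below shows a generic FGSM-style step on an image tensor. This only conveys the flavour of perturbation-based forgetting; it is not AUVIC's algorithm, and the model, loss, and epsilon value are placeholders.

```python
# Generic FGSM-style adversarial perturbation (illustrative only, not AUVIC):
# nudge the input in the direction that increases the model's loss.
import torch

def fgsm_perturb(model, image: torch.Tensor, target: torch.Tensor,
                 loss_fn, epsilon: float = 8 / 255) -> torch.Tensor:
    """Return image + epsilon * sign(grad of loss w.r.t. the image), clamped to [0, 1]."""
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()

# Tiny dummy usage with a placeholder classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
img, lbl = torch.rand(1, 3, 32, 32), torch.tensor([3])
adv = fgsm_perturb(model, img, lbl, torch.nn.functional.cross_entropy)
print(adv.shape)   # torch.Size([1, 3, 32, 32])
```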
This paper proposes SparseAlign, a fully sparse framework for cooperative object detection. By relying on sparse feature representations and alignment mechanisms, the framework achieves efficient multi-agent cooperative perception, significantly reducing communication overhead while maintaining detection accuracy and providing a novel solution for cooperative perception in autonomous driving and intelligent transportation systems.
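The sketch below illustrates, under assumed interfaces, why sparse feature sharing reduces communication: each agent transmits only its top-k most confident features and their coordinates instead of a dense feature map. This is a generic illustration, not SparseAlign's actual design.

```python
# Hypothetical sketch of sparse feature sharing between cooperating agents:
# keep only the k highest-scoring sparse features (plus coordinates) to transmit.
import torch

def select_sparse_features(feats: torch.Tensor, coords: torch.Tensor,
                           scores: torch.Tensor, k: int = 128):
    """feats: (N, D), coords: (N, 3), scores: (N,). Keep the k highest-scoring features."""
    k = min(k, feats.shape[0])
    idx = torch.topk(scores, k).indices
    return feats[idx], coords[idx]

feats, coords, scores = torch.randn(1024, 64), torch.rand(1024, 3), torch.rand(1024)
kept_feats, kept_coords = select_sparse_features(feats, coords, scores, k=128)
print(kept_feats.shape, kept_coords.shape)   # torch.Size([128, 64]) torch.Size([128, 3])
```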
This paper investigates the problem of localizing events in videos using multimodal queries. We propose a novel approach that handles queries from multiple modalities, including text, images, and audio, to accurately localize the relevant events in a video. Our method achieves state-of-the-art performance on multiple benchmark datasets, contributing to both video understanding and retrieval.
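As a minimal sketch of multimodal-query event localization, the code below fuses query embeddings from several modalities by averaging and scores each video clip by cosine similarity. The encoders, fusion rule, and dimensions are assumptions, not the paper's model.

```python
# Illustrative sketch: fuse multimodal query embeddings and pick the video clip
# with the highest cosine similarity as the localized event.
import torch
import torch.nn.functional as F

def localize(clip_feats: torch.Tensor, query_feats: list) -> int:
    """clip_feats: (T, D) per-clip features; query_feats: list of (D,) query embeddings."""
    query = F.normalize(torch.stack(query_feats).mean(dim=0), dim=-1)  # simple mean fusion
    clips = F.normalize(clip_feats, dim=-1)
    sims = clips @ query                 # (T,) cosine similarity per clip
    return int(sims.argmax())            # index of the best-matching clip

clip_feats = torch.randn(100, 512)
text_q, image_q = torch.randn(512), torch.randn(512)
print(localize(clip_feats, [text_q, image_q]))
```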
This paper presents Text2Loc, a novel method for 3D point cloud localization from natural language descriptions. We design a hierarchical Transformer architecture that understands the language description and accurately localizes the described target position in large-scale 3D point cloud scenes. Our method achieves excellent performance on multiple indoor and outdoor datasets, opening new research directions for language-based 3D scene understanding.
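The sketch below shows a coarse text-to-submap retrieval step in the spirit of a hierarchical, coarse-to-fine language-based localization pipeline. It is not Text2Loc's actual architecture; the embedding dimensions and function name are assumptions.

```python
# Hedged sketch of coarse retrieval: match a text embedding against per-cell
# map descriptors and return the top-k candidate cells for a fine stage.
import torch
import torch.nn.functional as F

def retrieve_top_k(text_emb: torch.Tensor, cell_embs: torch.Tensor, k: int = 5):
    """text_emb: (D,) encoded description; cell_embs: (M, D) submap descriptors."""
    sims = F.normalize(cell_embs, dim=-1) @ F.normalize(text_emb, dim=-1)   # (M,)
    return torch.topk(sims, k).indices   # candidate cells for fine localization

cell_embs = torch.randn(2000, 256)       # one descriptor per map cell
text_emb = torch.randn(256)              # encoded language description
print(retrieve_top_k(text_emb, cell_embs))
```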
This paper proposes CASSPR, a cross-attention single-scan place recognition method. CASSPR employs an innovative cross-attention architecture that enables efficient place recognition from a single LiDAR scan, significantly improving recognition accuracy and robustness in complex environments. Our method achieves state-of-the-art performance on multiple benchmark datasets, making it well suited for robot navigation and SLAM systems.
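To illustrate the cross-attention mechanism in general terms, the block below lets query features from one branch attend to key/value features from another. The module structure is an assumption for illustration, not CASSPR's exact design.

```python
# Generic cross-attention block: queries from one feature stream attend to
# keys/values from another, followed by a residual connection and LayerNorm.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        """x_q: (B, Nq, D) query features; x_kv: (B, Nk, D) features from the other branch."""
        attended, _ = self.attn(x_q, x_kv, x_kv)
        return self.norm(x_q + attended)   # residual + normalization

block = CrossAttentionBlock()
out = block(torch.randn(2, 128, 256), torch.randn(2, 512, 256))
print(out.shape)   # torch.Size([2, 128, 256])
```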
This paper proposes SOE-Net, a self-attention and orientation encoding network for point cloud-based place recognition. The method captures the geometric structure of point clouds through an innovative orientation encoding technique and combines it with self-attention to learn long-range dependencies between points, significantly improving point cloud-based place recognition accuracy. SOE-Net achieves excellent performance on multiple benchmark datasets, making it well suited for mobile robot localization and navigation.
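The sketch below shows a point-wise self-attention unit that captures long-range dependencies between per-point features. It mirrors the general idea only and is not SOE-Net's exact orientation encoding or attention design; dimensions and names are assumptions.

```python
# Hedged sketch: multi-head self-attention over per-point features so every
# point can aggregate information from all other points (long-range context).
import torch
import torch.nn as nn

class PointSelfAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        """point_feats: (B, N, D) per-point features; returns attended features."""
        attended, _ = self.attn(point_feats, point_feats, point_feats)
        return point_feats + attended      # residual connection

sa = PointSelfAttention()
print(sa(torch.randn(2, 1024, 128)).shape)   # torch.Size([2, 1024, 128])
```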