PIXOR: Real-time 3D Object Detection from Point CloudsWe address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. The input representation, network architecture, and model optimization are specially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still runs at 10 FPS.
T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From VideosKai Kang, Hongsheng Li, Junjie Yan et al.|IEEE Transactions on Circuits and Systems for Video Technology|2017 The state-of-the-art performance for object detection has been significantly improved over the past two years. Besides the introduction of powerful deep neural networks, such as GoogleNet and VGG, novel object detection frameworks, such as R-CNN and its successors, Fast R-CNN, and Faster R-CNN, play an essential role in improving the state of the art. Despite their effectiveness on still images, those frameworks are not specifically designed for object detection from videos. Temporal and contextual information of videos are not fully investigated and utilized. In this paper, we propose a deep learning framework that incorporates temporal and contextual information from tubelets obtained in videos, which dramatically improves the baseline performance of existing still-image detection frameworks when they are applied to videos. It is called T-CNN, i.e., tubelets with convolutional neueral networks. The proposed framework won newly introduced an object-detection-from-video task with provided data in the ImageNet Large-Scale Visual Recognition Challenge 2015. Code is publicly available at <uri xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">https://github.com/myfavouritekk/T-CNN</uri> .
SBNet: Sparse Blocks Network for Fast InferenceConventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers - this incurs a high computational cost for real-time applications. For many problems such as object detection and semantic segmentation, we are able to obtain a low-cost computation mask, either from a priori problem knowledge, or from a low-resolution segmentation network. We show that such computation masks can be used to reduce computation in the high-resolution main network. Variants of sparse activation CNNs have previously been explored on small-scale tasks and showed no degradation in terms of object classification accuracy, but often measured gains in terms of theoretical FLOPs without realizing a practical speedup when compared to highly optimized dense convolution implementations. In this work, we leverage the sparsity structure of computation masks and propose a novel tiling-based sparse convolution algorithm. We verified the effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we report significant wall-clock speed-ups compared to dense convolution without noticeable loss of accuracy.
Gated Bi-directional CNN for Object DetectionXingyu Zeng, Wanli Ouyang, Bin Yang et al.|Lecture notes in computer science|2016 Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-DrivingIn this paper, we propose an end-to-end self-driving network featuring a sparse attention module that learns to automatically attend to important regions of the input. The attention module specifically targets motion planning, whereas prior literature only applied attention in perception tasks. Learning an attention mask directly targeted for motion planning significantly improves the planner safety by performing more focused computation. Furthermore, visualizing the attention improves interpretability of end-to-end self-driving.