Multiscale Deep Alternative Neural Network for Large-Scale Video Classification
Abstract
With the rapid increase in the amount of multimedia data, video classification has become a demanding and challenging research topic. Compared with image classification, video classification requires mapping a video that contains hundreds of frames to semantic tags, which poses many challenges to the direct use of advanced models originally designed for image-oriented tasks. On the other hand, continuous frames in a video also give us more visual clues that we can leverage to achieve better classification. One of the most important clues is the context in the spatiotemporal domain. In this paper, we introduce the multiscale deep alternative neural network (DANN), a novel architecture combining the strengths of both convolutional neural network and recurrent neural networks to achieve a deep network that can collect rich context hierarchies for video classification. In particular, the DANN is stacked with alternative layers, each of which consists of a volumetric convolutional layer followed by a recurrent layer. The former acts as a local feature learner, whereas the latter is used to collect contexts. Compared with popular deep feed-forward neural networks, the DANN learns local features and their contexts from the very beginning. This setting enables preserving context evolutions, which we show to be essential for improving the accuracy of video classification. To release the full potential of the DANN, we develop a deeper version with stochastic-layer skip-connections and construct a multiscale DANN to incorporate contexts at different scales. We show how to apply the multiscale DANN for video classification with carefully designed configurations in terms of both input-output settings and training-testing methods. The DANN is shown to be robust to not only human-centric videos, but also natural videos. As there are few large-scale natural disaster video datasets, we construct a new large-scale one and make it publicly available. Experiments on four datasets show the effectiveness of our method for both human actions and natural events.