Towards Good Practices for Very Deep Two-Stream ConvNets


Deep convolutional networks have achieved great success in object recognition in still images. For action recognition in videos, however, the improvement from deep convolutional networks is less evident. We argue that two reasons could explain this result. First, current network architectures (e.g. Two-stream ConvNets [12]) are relatively shallow compared with the very deep models used in the image domain (e.g. VGGNet [13], GoogLeNet [15]), so their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, so it is easy to over-fit on the training data. To address these issues, this report presents very deep two-stream ConvNets for action recognition, obtained by adapting recent very deep architectures to the video domain. This extension is not straightforward, because action recognition datasets are quite small. We design several good practices for training very deep two-stream ConvNets, namely (i) pre-training for both the spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. We also extend the Caffe toolbox with a Multi-GPU implementation that has high computational efficiency and low memory consumption. We evaluate very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of 91.4%.
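The four training practices listed in the abstract can be pictured with a small, framework-free Python sketch. It is only an illustration under stated assumptions: the names `StreamConfig`, `random_crop_flip`, and `dropout` are hypothetical, every numeric value is a placeholder rather than a setting taken from the report, and the augmentation shown (random crop plus horizontal flip on a single-channel frame) is one common choice, not necessarily the set of techniques the authors use.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical per-stream training settings (all values are placeholders)."""
    pretrained_init: bool   # (i) start from pre-trained weights
    base_lr: float          # (ii) small learning rate
    augmentations: tuple    # (iii) names of augmentation steps
    dropout_ratio: float    # (iv) high dropout ratio


# Placeholder values chosen for illustration only, not taken from the report.
spatial = StreamConfig(True, 0.001, ("random_crop", "horizontal_flip"), 0.9)
temporal = StreamConfig(True, 0.005, ("random_crop", "horizontal_flip"), 0.9)


def random_crop_flip(frame, crop, rng):
    """Return a crop x crop patch of a 2-D frame (list of rows), mirrored half the time."""
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in frame[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


def dropout(activations, ratio, rng):
    """Inverted dropout: zero each unit with probability `ratio`, rescale the rest."""
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[r * 10 + c for c in range(6)] for r in range(4)]
    print(random_crop_flip(frame, 3, rng))
    print(sum(1 for a in dropout([1.0] * 1000, spatial.dropout_ratio, rng) if a))
```

With a dropout ratio this high, only a small fraction of units stay active on each training pass, which is one way to limit over-fitting when the training set is small.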

Cite this paper

@article{Wang2015TowardsGP, title={Towards Good Practices for Very Deep Two-Stream ConvNets}, author={Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao}, journal={CoRR}, year={2015}, volume={abs/1507.02159} }