Volume 18, Issue 1 (May 2021) | JSDP 2021, 18(1): 119-134




Soltanian M, Ghaemmaghami S. Recognition of Visual Events using Spatio-Temporal Information of the Video Signal. JSDP 2021; 18(1): 119-134.
URL: http://jsdp.rcisp.ac.ir/article-1-928-en.html
Sharif University of Technology
Abstract:
Recognition of visual events, as a video analysis task, has become popular in the machine learning community. While traditional approaches to video event detection have been used for a long time, the recently developed deep learning based methods have revolutionized this area, enabling event recognition systems to achieve detection rates that were not reachable by traditional approaches.
Convolutional neural networks (CNNs) are among the most popular types of deep networks used in both image and video recognition tasks. They consist of several convolutional layers, each followed by an activation layer and possibly a pooling layer, and usually end with one or more fully connected layers. The property exploited in this work is the ability of CNNs to extract mid-level features from video frames. Unlike traditional approaches based on low-level visual features, CNNs make it possible to extract higher-level semantic features from the video frames.
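As a rough illustration of the layer pattern described above (convolutional layers, each followed by an activation and possibly a pooling layer, ending in fully connected layers), the following Python/PyTorch sketch builds a small CNN of this generic shape; the layer sizes and the number of output concepts are placeholder choices, not the network used in the paper.

import torch
import torch.nn as nn

# Generic frame-level CNN: conv -> ReLU -> pool blocks, then fully connected
# layers. All sizes below are illustrative placeholders.
class SmallFrameCNN(nn.Module):
    def __init__(self, num_concepts=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 56 * 56, 512), nn.ReLU(),
            nn.Linear(512, num_concepts),   # concept scores act as mid-level features
        )

    def forward(self, x):
        x = self.features(x)          # convolution + activation + pooling
        x = torch.flatten(x, 1)       # flatten feature maps
        return self.classifier(x)     # fully connected layers

frame = torch.randn(1, 3, 224, 224)       # one 224x224 RGB video frame
concept_vector = SmallFrameCNN()(frame)   # shape: (1, num_concepts)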
The focus of this paper is on recognition of visual events in video using CNNs. In this work, image trained descriptors are used to make video recognition can be done with low computational complexity. A tuned CNN is used as the frame descriptor and its fully connected layers are utilized as concept detectors. So, the featue maps of activation layers following fully connected layers act as feature vectors. These feature vectors (concept vectors) are actually the mid-level features which are a better video representation than the low level features. The obtained mid-level features can partially fill the semantic gap between low level features and high level semantics of video.
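A minimal sketch of this idea, assuming a standard ImageNet-pretrained network from torchvision (VGG-16 here, chosen only for illustration, not the paper's exact setup): each frame is passed through the network and the activation following a fully connected layer is kept as that frame's mid-level (concept) vector.

import torch
from torchvision import models, transforms

# Image-trained CNN used as a frame descriptor: the activation after the
# second fully connected layer (4096-d in VGG-16) is kept as the frame's
# mid-level feature vector.
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc_part = torch.nn.Sequential(*list(cnn.classifier.children())[:5])  # FC -> ReLU -> Dropout -> FC -> ReLU

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_to_concept_vector(pil_frame):
    # Returns one mid-level feature vector for a single video frame.
    x = preprocess(pil_frame).unsqueeze(0)
    with torch.no_grad():
        f = cnn.features(x)                  # convolutional feature maps
        f = torch.flatten(cnn.avgpool(f), 1)
        return fc_part(f).squeeze(0)         # 4096-d concept/feature vector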
The descriptors obtained from the CNNs for each video form a variable-length stack of feature vectors. To organize these descriptors and prepare them for classification, they must be properly encoded. The coded descriptors are then normalized and classified. The normalization may consist of conventional norm-based normalization or the more advanced power-law normalization. The main purpose of normalization is to change the distribution of descriptor values so that they become more uniformly distributed; in this way, very large or very small descriptor values have a more balanced impact on the recognition of events.
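As a small worked example of the normalization step, the sketch below applies power-law (signed root) normalization followed by L2 normalization to an encoded descriptor; the exponent alpha = 0.5 is a common choice and is only an assumption here, not a value reported in the paper.

import numpy as np

def normalize_descriptor(v, alpha=0.5, eps=1e-12):
    # Power-law (signed root) normalization followed by L2 normalization.
    v = np.sign(v) * np.abs(v) ** alpha
    return v / (np.linalg.norm(v) + eps)

encoded = np.random.randn(4096)             # stand-in for a coded video descriptor
normalized = normalize_descriptor(encoded)  # values are flattened toward a more uniform spread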
The main novelty of this paper is that spatial and temporal information in the mid-level features is employed to construct a suitable coding procedure. We use temporal information in the coding of video descriptors; such information is often ignored, which reduces coding efficiency. Hence, a new coding scheme is proposed that improves the trade-off between the computational complexity of the recognition scheme and the accuracy in identifying video events.
It is also shown that the proposed coding takes the form of an optimization problem that can be solved with existing algorithms. The optimization problem is initially non-convex and cannot be solved by existing methods in polynomial time, so it is transformed into a convex form, which makes it a well-defined optimization problem. While there are many methods for handling such convex optimization problems, we use a robust convex optimization library to efficiently solve the problem and obtain the video descriptors.
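Since the abstract does not spell out the exact objective, the following sketch only illustrates the general workflow of casting descriptor aggregation as a convex program and handing it to an off-the-shelf solver (cvxpy here); the ridge-regularized, generalized-max-pooling-style objective is a hypothetical stand-in, not the coding actually proposed in the paper.

import numpy as np
import cvxpy as cp

# X stacks the per-frame feature vectors of one video (n_frames x d); random here.
X = np.random.randn(50, 128)
lam = 1.0                                   # regularization weight (assumed)

phi = cp.Variable(128)                      # the aggregated video descriptor
objective = cp.Minimize(cp.sum_squares(X @ phi - np.ones(50)) + lam * cp.sum_squares(phi))
cp.Problem(objective).solve()               # convex, so solvable efficiently by the library
video_descriptor = phi.value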
To confirm the effectiveness of the proposed descriptor coding method, extensive experiments are conducted on two large public datasets: the Columbia consumer video (CCV) dataset and the ActivityNet dataset. Both CCV and ActivityNet are popular, publicly available video event recognition datasets with standard train/test splits, and both are large enough to serve as reasonable benchmarks for video recognition tasks.
Compared to the best existing practices in visual event detection, the proposed method provides a better model of video and achieves a substantially better mean average precision, mean average recall, and F-score on the test sets of the CCV and ActivityNet datasets. The presented method not only improves accuracy but also reduces the computational cost relative to the state of the art. The experiments clearly confirm the potential of the proposed method to improve the performance of visual recognition systems, especially in supervised video event detection.
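For reference, the evaluation metrics named above can be computed as in the following sketch, which uses scikit-learn on randomly generated placeholder labels and scores; per-class average precision is averaged over event classes to obtain the mean average precision, and a macro-averaged F-score is computed from thresholded scores (the 0.5 threshold is an arbitrary choice for illustration).

import numpy as np
from sklearn.metrics import average_precision_score, f1_score

rng = np.random.default_rng(0)
n_videos, n_classes = 200, 5
y_true = rng.integers(0, 2, size=(n_videos, n_classes))    # placeholder ground-truth labels
y_score = rng.random(size=(n_videos, n_classes))           # placeholder classifier scores

# Mean average precision: average precision per event class, averaged over classes.
mean_ap = np.mean([average_precision_score(y_true[:, c], y_score[:, c])
                   for c in range(n_classes)])
# Macro F-score from thresholded scores.
macro_f = f1_score(y_true, (y_score > 0.5).astype(int), average="macro")
print(f"mAP = {mean_ap:.3f}, macro F = {macro_f:.3f}")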
Full-Text [PDF 1403 kb]
Type of Study: Fundamental | Subject: Paper
Received: 2018/11/12 | Accepted: 2019/02/19 | Published: 2021/05/22 | ePublished: 2021/05/22


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
