Incorporating Side Information by Adaptive Convolution

International Journal of Computer Vision

Abstract

Computer vision tasks often come with side information that can help solve the task. For example, in crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in deep-learning-based counting systems. To incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), in which the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated by a learned “filter manifold” sub-network whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information and extract discriminative features related to the current context (e.g., camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of ACNN in incorporating side information on three tasks: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that ACNN improves performance over a plain CNN with a similar number of parameters, and achieves performance comparable to or better than the state of the art on the crowd counting task. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information. We also perform ablation experiments, mainly on crowd counting, to study the helpfulness of the side information and the effect of the placement of the adaptive convolutional layers, in order to gain insight into ACNNs.
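
The adaptive convolution idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_placeholder_mlp`, `filter_manifold`, `adaptive_conv`), the single-channel setting, the layer sizes, and the ReLU hidden layer are all assumptions made for illustration, and the sub-network weights are fixed placeholder values standing in for learned parameters. The sketch only shows the data flow: side information goes into a small sub-network, the sub-network outputs the filter weights, and those weights are then used in an ordinary convolution (without a bias term).

```python
import math


def make_placeholder_mlp(sizes):
    """Build a small fully connected network with fixed placeholder weights.

    In a trained model these weights would be learned; here they are
    deterministic values so that the example is reproducible.
    """
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[math.sin(1 + o * n_in + i) * scale for i in range(n_in)]
                   for o in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def filter_manifold(layers, side_info):
    """Map a side-information vector to a flat vector of filter weights."""
    h = list(side_info)
    for idx, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            # Hidden layers use ReLU (an assumption); the output layer is linear.
            h = [max(0.0, v) for v in h]
    return h


def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation with a square kernel, no bias."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)]


def adaptive_conv(image, side_info, layers, k):
    """Generate a k x k filter from the side information, then convolve."""
    flat = filter_manifold(layers, side_info)
    kernel = [flat[r * k:(r + 1) * k] for r in range(k)]
    return conv2d_valid(image, kernel)


K = 3
# 2-D side information (e.g., camera angle and height) -> 8 hidden units -> 3x3 filter.
fmn = make_placeholder_mlp([2, 8, K * K])
image = [[float((x + y) % 5) for x in range(6)] for y in range(6)]

out_a = adaptive_conv(image, [0.2, 0.5], fmn, K)  # one scene context
out_b = adaptive_conv(image, [0.9, 0.1], fmn, K)  # a different scene context
```

The same image is filtered with different weights in the two calls because the filter is a function of the side information rather than a fixed parameter. Only the sub-network parameters would be trained, which is one way to read the abstract's description of filter weights lying on a low-dimensional manifold.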

Notes

  1. The perspective value at a pixel location is proportional to the size an object would have if it appeared at that location.

  2. To reduce clutter, here we do not show the bias term for the convolution.

  3. The mean absolute difference (MAD) between the density maps generated using the original perspective maps and our perspective maps is 0.475 on average, and [0.029, 0.818, 0.800, 0.597, 0.131] respectively on the five test scenes.

  4. The MAD between the original density maps and those using single Gaussian kernels is 2.893 on average, and [0.582, 4.491, 1.946, 7.078, 0.368] respectively on the five test scenes (using our perspective map). The larger values on scenes 2 and 4 arise because the ROI boundary cuts through the most crowded regions in those scenes.

  5. CSRNet refers to the first ten convolutional layers of VGG as the front-end; elsewhere, this part is more commonly referred to as the back-end.

  6. On the clean MNIST dataset, the 2-conv and 4-conv CNN architectures achieve 0.81% and 0.69% error, while the current state of the art is \(\sim\)0.23% error (Ciresan et al. 2012).

References

  • Arteta, C., Lempitsky, V., Noble, J. A., & Zisserman, A. (2014). Interactive object counting. In ECCV

  • Burger, H. C., Schuler, C. J., & Harmeling, S. (2012). Image denoising: Can plain neural networks compete with BM3D? In CVPR

  • Chan, A. B., & Vasconcelos, N. (2009). Bayesian Poisson regression for crowd counting. In ICCV

  • Chan, A. B., Liang, Z. S. J., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR

  • Chan, A. B., & Vasconcelos, N. (2012). Counting people with low-level features and Bayesian regression. IEEE Transactions on Image Processing, 21, 2160–2177.

  • Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In CVPR

  • De Brabandere, B., Jia, X., Tuytelaars, T., & Van Gool, L. (2016). Dynamic filter networks. In NIPS

  • Dozat, T. (2015). Incorporating Nesterov momentum into Adam. Technical report, Stanford University. http://cs229.stanford.edu/proj2015/054report.pdf

  • Eigen, D., Krishnan, D., & Fergus, R. (2013). Restoring an image taken through a window covered with dirt or rain. In ICCV

  • Fiaschi, L., Nair, R., Koethe, U., & Hamprecht, F. (2012). Learning to count with regression forest and structured labels. In ICPR

  • Gharbi, M., Chaurasia, G., Paris, S., & Durand, F. (2016). Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG).

  • Ha, D., Dai, A., & Le, Q. V. (2017). HyperNetworks. In ICLR

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.

  • Idrees, H., Saleemi, I., Seibert, C., & Shah, M. (2013). Multi-source multi-scale counting in extremely dense crowd images. In CVPR

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML

  • Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In NIPS

  • Kang, D., & Chan, A. (2018). Crowd counting by adaptively fusing predictions from an image pyramid. In BMVC

  • Kang, D., Dhar, D., & Chan, A. (2017). Incorporating side information by adaptive convolution. In NIPS

  • Kang, D., Ma, Z., & Chan, A. B. (2018). Beyond counting: Comparisons of density maps for crowd analysis tasks–Counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29, 1408–1422.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Klein, B., Wolf, L., & Afek, Y. (2015). A dynamic convolutional layer for short range weather prediction. In CVPR

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS

  • Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In NIPS

  • Li, S., Liu, Z. Q., & Chan, A. B. (2015). Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In IJCV

  • Li, Y., Zhang, X., & Chen, D. (2018). CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR

  • Liu, R., Li, Z., & Jia, J. (2008). Image partial blur detection and classification. In CVPR

  • Ma, Z., Yu, L., & Chan, A. B. (2015). Small instance detection by integer programming on object density maps. In CVPR

  • Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML

  • Niu, Z., Zhou, M., Wang, L., Gao, X., & Hua, G. (2016). Ordinal regression with multiple output CNN for age estimation. In CVPR

  • Onoro-Rubio, D., & López-Sastre, R. J. (2016). Towards perspective-free object counting with deep learning. In ECCV

  • Pech-Pacheco, J. L., Cristóbal, G., Chamorro-Martinez, J., & Fernández-Valdivia, J. (2000). Diatom autofocusing in brightfield microscopy: A comparative study. In ICPR

  • Ren, W., Kang, D., Tang, Y., & Chan, A. (2017). Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In CVPR

  • Rodriguez, M., Laptev, I., Sivic, J., & Audibert, J. Y. (2011). Density-aware person detection and tracking in crowds. In ICCV

  • Rothe, R., Timofte, R., & Van Gool, L. (2015). DEX: Deep expectation of apparent age from a single image. In ICCVW

  • Sam, D. B., Surya, S., & Babu, R. V. (2017). Switching convolutional neural network for crowd counting. In CVPR

  • Shi, J., Xu, L., & Jia, J. (2014). Discriminative blur detection features. In CVPR

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR

  • Sindagi, V. A., & Patel, V. M. (2017). Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV

  • Sun, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In NIPS

  • Xu, L., Ren, J. S., Liu, C., & Jia, J. (2014). Deep convolutional neural network for image deconvolution. In NIPS

  • Zhang, C., Li, H., Wang, X., & Yang, X. (2015). Cross-scene crowd counting via deep convolutional neural networks. In CVPR

  • Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2014). Facial landmark detection by deep multi-task learning. In ECCV

  • Zhang, L., Shi, M., & Chen, Q. (2018). Crowd counting via scale-adaptive convolutional neural network. In WACV

  • Zhang, Y., Zhou, D., Chen, S., Gao, S., & Ma, Y. (2016). Single-image crowd counting via multi-column convolutional neural network. In CVPR

Acknowledgements

The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R] and CityU 11212518). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Corresponding author

Correspondence to Di Kang.

Additional information

Communicated by S. Soatto.

About this article

Cite this article

Kang, D., Dhar, D. & Chan, A.B. Incorporating Side Information by Adaptive Convolution. Int J Comput Vis 128, 2897–2918 (2020). https://doi.org/10.1007/s11263-020-01345-8

  • DOI: https://doi.org/10.1007/s11263-020-01345-8
