Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Locally controllable text-to-image generation has yet to achieve satisfactory results at the level of fine detail, so a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The goal of the method is to manipulate and generate images semantically under text guidance. The proposed method exploits the relationship between text and image to achieve local control over text-to-image generation. Visual–linguistic matching learns similarity weights between the image and the text from their semantic features, establishing a fine-grained correspondence between local image regions and individual words. An instance-level optimization function is introduced into the generation process to precisely control the regions with low similarity weights, combining them with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve both the details specified by the text and the remaining local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method and its more accurate control over the original image.
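
To make the mechanism concrete, the three steps above (fine-grained word-region matching, instance-level modification of weakly matched regions, and a preservation loss on the remaining regions) can be sketched in a few lines. The following is a minimal PyTorch-style illustration of one plausible reading of the abstract, not the authors' implementation; the cosine-similarity matching, the threshold tau, and every function name are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def word_region_similarity(words, regions):
        # words:   (B, T, D) word embeddings from a text encoder
        # regions: (B, R, D) local image-region features from an image encoder
        # returns: (B, T, R) fine-grained word-region similarity weights
        words = F.normalize(words, dim=-1)
        regions = F.normalize(regions, dim=-1)
        return torch.bmm(words, regions.transpose(1, 2))

    def modify_low_similarity_regions(sim, regions, sent_emb, tau=0.3):
        # Instance-level update (assumed form): a region whose best
        # word-level similarity falls below tau is treated as the part
        # the text asks to change and is blended with the sentence
        # embedding so new visual attributes can be generated there.
        relevance, _ = sim.max(dim=1)                     # (B, R)
        modify = (relevance < tau).float().unsqueeze(-1)  # (B, R, 1)
        text = sent_emb.unsqueeze(1).expand_as(regions)   # (B, R, D)
        return modify * text + (1.0 - modify) * regions

    def local_control_loss(sim, fake_regions, real_regions, tau=0.3):
        # Preservation term (assumed form): penalize changes to regions
        # the text does not select, keeping the original local details.
        relevance, _ = sim.max(dim=1)
        keep = (relevance >= tau).float().unsqueeze(-1)
        return F.l1_loss(keep * fake_regions, keep * real_regions)

Under this reading, only low-similarity regions are rewritten with text features, while the L1 preservation term keeps the generator from altering everything else; the actual method presumably learns these weights end-to-end rather than hard-thresholding them.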

Data availability

The datasets analysed during this study are available in the following repositories: https://cocodataset.org/ and http://www.vision.caltech.edu/visipedia/CUB-200-2011.html.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Nos. 62076153 and 62176144), the Major Fundamental Research Project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong, China (No. ts20190924).

Author information

Contributions

ZL contributed to the conception of the study, performed the experiments, analyzed the data, and wrote the manuscript; YS collaborated on the experiments; ZL contributed significantly to the analysis and manuscript preparation; BL and DL performed the data analyses and wrote the manuscript; LL and HZ helped perform the analysis through constructive discussions. All authors reviewed and revised the manuscript for accuracy and intellectual content.

Corresponding author

Correspondence to Li Liu.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Z., Liu, L., Zhang, H. et al. Locally controllable network based on visual–linguistic relation alignment for text-to-image generation. Multimedia Systems 30, 34 (2024). https://doi.org/10.1007/s00530-023-01222-7
