Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Locally controllable text-to-image generation has yet to achieve satisfactory results at the level of fine detail, so a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The goal of the method is to manipulate and generate images semantically under text guidance. The proposed method exploits the relationship between text and image to achieve local control over text-to-image generation. Visual–linguistic matching learns similarity weights between the image and the text from their semantic features, establishing a fine-grained correspondence between local image regions and individual words. An instance-level optimization function is introduced into the generation process to precisely control the regions with low similarity weights, combining them with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve both the details specified by the text and the remaining local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method and its more accurate control over the original image.
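
To make the mechanism concrete, the three steps above (fine-grained word-region matching, instance-level modification of weakly matched regions, and a preservation loss on the remaining regions) can be sketched in a few lines. The following is a minimal PyTorch-style illustration of one plausible reading of the abstract, not the authors' implementation; the cosine-similarity matching, the threshold tau, and every function name are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def word_region_similarity(words, regions):
        # words:   (B, T, D) word embeddings from a text encoder
        # regions: (B, R, D) local image-region features from an image encoder
        # returns: (B, T, R) fine-grained word-region similarity weights
        words = F.normalize(words, dim=-1)
        regions = F.normalize(regions, dim=-1)
        return torch.bmm(words, regions.transpose(1, 2))

    def modify_low_similarity_regions(sim, regions, sent_emb, tau=0.3):
        # Instance-level update (assumed form): a region whose best
        # word-level similarity falls below tau is treated as the part
        # the text asks to change and is blended with the sentence
        # embedding so new visual attributes can be generated there.
        relevance, _ = sim.max(dim=1)                     # (B, R)
        modify = (relevance < tau).float().unsqueeze(-1)  # (B, R, 1)
        text = sent_emb.unsqueeze(1).expand_as(regions)   # (B, R, D)
        return modify * text + (1.0 - modify) * regions

    def local_control_loss(sim, fake_regions, real_regions, tau=0.3):
        # Preservation term (assumed form): penalize changes to regions
        # the text does not select, keeping the original local details.
        relevance, _ = sim.max(dim=1)
        keep = (relevance >= tau).float().unsqueeze(-1)
        return F.l1_loss(keep * fake_regions, keep * real_regions)

Under this reading, only low-similarity regions are rewritten with text features, while the L1 preservation term keeps the generator from altering everything else; the actual method presumably learns these weights end-to-end rather than hard-thresholding them.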

Data availability

The datasets analysed during this study are available in the following repositories: https://cocodataset.org/ and http://www.vision.caltech.edu/visipedia/CUB-200-2011.html.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Nos. 62076153 and 62176144), the Major Fundamental Research Project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong, China (No. ts20190924).

Author information

Contributions

ZL contributed to the conception of the study, performed the experiments, analyzed the data, and wrote the manuscript; YS collaborated on the experiments; ZL contributed significantly to the analysis and manuscript preparation; BL and DL performed the data analyses and wrote the manuscript; LL and HZ helped perform the analysis through constructive discussions. All authors reviewed and revised the manuscript for accuracy and intellectual content.

Corresponding author

Correspondence to Li Liu.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Z., Liu, L., Zhang, H. et al. Locally controllable network based on visual–linguistic relation alignment for text-to-image generation. Multimedia Systems 30, 34 (2024). https://doi.org/10.1007/s00530-023-01222-7
