ABSTRACT
Although customers on fashion e-commerce platforms express their clothing preferences through a combination of images and text, they are limited to retrieval with single-round, fixed inputs. At the same time, large language models (LLMs) have been gaining attention across various fields. ChatGPT is a notable example of an LLM, known for its user-friendly language interface, conversational proficiency, and reasoning abilities. To this end, we propose Fashion-GPT, a system paradigm that integrates ChatGPT with a pool of AI models in the fashion domain to achieve multi-round, multi-modal search. Specifically, the system uses the LLM to understand user queries, selects retrieval models based on their function descriptions, executes each subtask with the selected fashion models, and uses the LLM again to summarize a response from the execution results.
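The four stages described above (query understanding, model selection from function descriptions, task execution, and response summarization) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Expert` class, the two stand-in retrieval functions, and the rule-based `select_expert`, which only mimics the decision that the system delegates to the LLM. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Expert:
    """A retrieval model exposed to the planner through a text description."""
    name: str
    description: str
    run: Callable[[str], List[str]]


def text_search(query: str) -> List[str]:
    # Hypothetical stand-in for a text-to-image retrieval model.
    return [f"item matching '{query}'"]


def composed_search(query: str) -> List[str]:
    # Hypothetical stand-in for image-plus-text (composed) retrieval.
    return [f"item like the reference image, modified by '{query}'"]


EXPERTS = [
    Expert("text_search", "find products from a text description", text_search),
    Expert("composed_search", "modify a reference image using text feedback",
           composed_search),
]


def select_expert(query: str, has_image: bool) -> Expert:
    # A real system would prompt the LLM with the query and every expert
    # description; this toy rule only mimics that decision.
    return EXPERTS[1] if has_image else EXPERTS[0]


def summarize(expert: Expert, results: List[str]) -> str:
    # Placeholder for the LLM call that turns raw results into a reply.
    return f"[{expert.name}] found {len(results)} result(s): " + "; ".join(results)


def handle_turn(query: str, has_image: bool = False) -> str:
    expert = select_expert(query, has_image)  # model selection
    results = expert.run(query)               # task execution
    return summarize(expert, results)         # response generation
```

Calling `handle_turn("red summer dress")` routes to the text-only expert, while `handle_turn("make it longer", has_image=True)` routes to the composed-retrieval expert, which is how a follow-up round could refine an earlier result.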
Because the system's performance is dominated by its AI experts, we also introduce a novel pre-training framework called 3M (short for Multi-view Multi-modal Matching) to strengthen them. Unlike prior studies that rely solely on one-to-one matching of image-text pairs, 3M incorporates multiple texts describing the same image to achieve one-to-many alignment. Maximizing the mutual information between features extracted from these views helps capture high-level factors that influence multiple views, such as the occurrence of specific objects. In addition, fashion data has a useful property: multi-view images of the same product, such as front and side views, are naturally suited to intra-modal self-alignment. 3M therefore also introduces an intra-modal contrastive objective to improve representation learning from the image perspective. To the best of our knowledge, our framework is the first to consider one-to-many mapping for multi-modal representation learning. Experimental evaluations demonstrate that our fashion experts are competitive and achieve state-of-the-art performance, bringing a +3.47% R@10 boost on Fashion-200K and a +1.98% R@10 boost on the Fashion-IQ dress dataset compared to the previous SOTA results.
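As a rough illustration of one-to-many alignment, the sketch below computes a multi-positive contrastive (InfoNCE-style) loss in which each image is pulled toward every text that describes it. The exact 3M objective is not given here, so the formulation, the function names, the `owner` mapping, and the temperature value are all assumptions. The sketch also assumes every image has at least one text.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def one_to_many_loss(
    images: List[Sequence[float]],
    texts: List[Sequence[float]],
    owner: List[int],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Multi-positive contrastive loss; owner[j] is the image that text j describes."""
    total = 0.0
    for i, img in enumerate(images):
        logits = [cosine(img, t) / temperature for t in texts]
        # Log of the softmax denominator, computed stably (log-sum-exp).
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Every text of image i is a positive; all other texts are negatives.
        positives = [j for j, o in enumerate(owner) if o == i]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(images)


# Toy batch: two images, each described by two texts.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]
aligned = one_to_many_loss(images, texts, owner=[0, 0, 1, 1])
mismatched = one_to_many_loss(images, texts, owner=[1, 1, 0, 0])
```

On this toy batch the loss is lower when each text is assigned to the image it actually resembles. Under the same assumptions, an intra-modal variant could reuse the function by passing embeddings of a second image view (for example, side views) in place of `texts`.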
Index Terms
- Fashion-GPT: Integrating LLMs with Fashion Retrieval System