Aesthetic Assessment of Packaging Design Based on Con-Transformer

Unlike the traditional aesthetic assessment of natural images, the aesthetic assessment of packaging design must consider not only artistic beauty but also functional beauty, that is, how strongly the packaging design attracts consumers. In this paper, the authors propose Con-Transformer, a packaging design aesthetic assessment method that combines convolutional operations and self-attention mechanisms for enhanced representation learning, resulting in an effective aesthetic assessment of packaging design images. Specifically, Con-Transformer integrates a convolutional network branch and a transformer network branch to extract local and global representation features of the packaging design images, respectively. The fused representation features are then used for aesthetic assessment. Experimental results show that the proposed method can not only effectively assess the aesthetics of packaging design images, but can also be applied to the aesthetic assessment of natural images.


INTRODUCTION
Aesthetic judgment is an innate human ability. Studying artificial intelligence techniques that enable computers to perceive and discover "beauty" will help computers understand and learn the thinking process of professional photographers and designers, and provide professional aesthetic suggestions to people taking photos or making designs, which is a very challenging task. At present, with the rapid development and wide adoption of cameras, video cameras, smartphones and other photographic devices, more and more people like to capture and record moments of their lives and upload them to the Internet. Especially since the beginning of 2020, due to the impact of COVID-19, face-to-face communication between people has decreased, which has further increased the sharing and dissemination of images, videos and other visual content on the Internet. For example, according to statistics, the total number of images and videos published on Instagram every day exceeds 100 million (Lan 2021). Users on Flickr have shared more than 50 million photos (Smith 2021). In the face of such massive and complex data, how to quickly and automatically screen out images of high aesthetic quality and rank the most visually attractive results at the top of the search results has important practical significance. In addition, for album management software, automatically and quickly selecting good-looking photos and deleting photos with low aesthetic quality can save a lot of time otherwise spent on manual screening. In photo editing software, it is very difficult for ordinary users to learn photography knowledge with limited time and energy (Zhu 2016). They do not know how to improve the aesthetic attributes of an image or which aspects make the image more attractive. In this case, software that can give professional and clear image cropping suggestions becomes more and more important. Therefore, research on intelligent assessment of image aesthetic quality with artificial intelligence at its core not only has great economic benefits, but can also promote the development of artificial intelligence technology that simulates human aesthetic judgment and thinking processes.
Packaging design refers to the deep integration of visual form and product function, highlighting product characteristics and accurately conveying product information through reasonable organization and layout methods and novel graphic expression, so as to form unique packaging (Zhu 2017). Packaging protects the product and should be easy to carry, and it also has a sales function: the visual aesthetics of product packaging attract the attention of consumers like a silent salesperson. Therefore, visual element design occupies an important position in the packaging design process. Packaging design aesthetics take several forms, such as functional beauty and artistic beauty. Artistic beauty is a visual image composed of modeling, color and composition, which is endowed with decorative characteristics. Decorative products can convey to consumers the cultural connotation and emotional experience intended by designers, so that the products have both functional beauty and a pleasing effect. Artistic beauty should not only keep pace with the development of society, economy, culture, science and technology, but also satisfy the psychological and spiritual needs of consumers. Besides, artistic beauty originates from natural beauty. It is a product with cultural connotation, aesthetic value and spiritual value, created by people who combine their own thinking with things that exist in nature. Artistic beauty brings positive changes to people's aesthetic feelings and aesthetic concepts, and plays a guiding, communicative and educational role. Functional beauty mainly refers to the attraction to consumers, including visual impact and findability. Visual impact measures the ability of packaging to "stand out from the crowd" on the shelf, while findability determines how easily consumers can locate the target packaging among many products. In addition, visual element design usually contains text information. Distinctive text is vivid and has strong recognition and aesthetic functions; it gives consumers visual enjoyment and thereby stimulates their desire to buy. Therefore, research on aesthetic assessment algorithms for packaging design based on computer vision technology can help enterprises quickly design product packaging images with both artistic beauty and functional beauty.
Image aesthetic quality assessment can automatically and quickly help people rate and sort massive numbers of images and select high-quality, beautiful ones, greatly reducing the time people spend managing personal albums, editing photos (Wang 2019) and retrieving images (Baraldi 2016). Therefore, image aesthetic quality assessment technology has attracted the attention of a large number of researchers. Traditional image aesthetic quality assessment methods usually rely on manually designed features (that is, low-level visual features (Zhu 2022), high-level aesthetic features and composition aesthetic features designed by experts) to evaluate image aesthetics (Datta 2010, Tong 2005, Nishiyama 2011, Dhar 2011). However, this type of method requires researchers to have professional knowledge of photographic aesthetics such as composition and color, and the representation ability of manually designed features is usually limited. In recent years, with the rise of deep learning, researchers have introduced convolutional neural networks (CNNs) to solve problems in image aesthetic assessment. Thanks to the strong automatic learning ability of CNNs, researchers can automatically extract high-level semantic features from large amounts of image data without professional aesthetic knowledge, which has made CNNs the mainstream approach to image aesthetic assessment (Lu 2015, Dong 2015, Kao 2016, Kong 2016, Jin 2017, Wang 2019). However, the above methods are designed for the aesthetic quality assessment of natural images and do not consider the functional beauty attribute unique to packaging design images. In addition, CNNs are built on the convolution operation and are good at extracting local features of images, but they have difficulty capturing global representations, e.g., long-distance relationships among visual elements, which are often critical for high-level computer vision tasks.
In view of the above problems, we propose a two-branch network architecture, termed Con-Transformer, for the aesthetic assessment of packaging design images. Specifically, we first extend the data labels to include both an artistic beauty label and a functional beauty label. Then, in order to fuse the local and global information of packaging design images, the proposed Con-Transformer develops a CNN branch and a Transformer branch, which follow the designs of ResNet (Kaiming 2016) and ViT (Alexey 2020), respectively. Extensive experiments demonstrate that our proposed method can not only effectively assess packaging design images, but can also be applied to the aesthetic assessment of natural images.
The remainder of this paper is organized as follows: related works are reviewed in Section 2; the architecture of the proposed Con-Transformer model is presented in Section 3; the experiments are reported in Section 4; Section 5 concludes the paper.

RELATED WORKS AND ANALYSIS
With the continuous development of computer vision and digital image processing technology and the continuous progress of computational aesthetics, research on image aesthetic quality evaluation has attracted more and more attention, and plays an increasingly important role in image retrieval and image editing. Existing image aesthetic quality assessment methods can be grouped into traditional aesthetic assessment methods based on manually designed features and the now popular deep learning-based aesthetic assessment methods.
Aesthetic assessment methods based on manually designed features mainly assess image aesthetics through hand-designed low-level visual features, high-level aesthetic features and composition aesthetic features. For example, Datta et al. (Datta 2010), as pioneers, first studied the relationship between computer vision features and image aesthetics. Based on basic aesthetic principles such as color matching and contrast, pictures are divided into two categories, high aesthetic quality and low aesthetic quality, through support vector machines and regression. Tong et al. (Tong 2005) used low-level features to learn a classification model that distinguishes photographs taken by professional photographers from those taken by ordinary users. Nishiyama et al. (Nishiyama 2011) developed a method to evaluate the aesthetic quality of pictures based on color coordination. Dhar et al. (Dhar 2011) selected high-quality images based on image layout, scene and natural lighting conditions. Obrador et al. (Obrador 2010) focused on the impact of image structural features on aesthetic quality; 55 features related to structural information are used, and the classification accuracy is very close to that of the aesthetic benchmark. Wang et al. (Wang1 2015, Wang2 2015) used 41 global and local features such as composition, color and brightness to classify the aesthetic quality of images. Their experiments show that these 41 features achieve good accuracy in assessing and classifying the aesthetic quality of images. However, these manually extracted features have limited representation ability and cannot fully meet the needs of image aesthetic quality assessment.
Deep learning-based aesthetic assessment methods introduce deep neural networks into the task of image aesthetic assessment, and their results are generally better than those of the traditional methods. For instance, by modifying the convolutional neural network, researchers (Kong 2016, Jin 2017, Wang 2019) have adapted it to different image aesthetic assessment problems. Jin et al. (Jin 2017) proposed the deep convolutional neural network RS-CJS. Wang et al. (Wang 2019) proposed an aesthetic image reviewer model, NAIR, based on a CNN and a recurrent neural network (RNN), which can not only predict the aesthetic score but also generate a semantic assessment. Different from the traditional aesthetic classification methods, which classify aesthetics into two classes, good and bad, the NIMA method (Talebi 2018) proposed by Google predicts the probability distribution of human aesthetic ratings of an image through a convolutional neural network. The predicted probability distribution reflects the central tendency of user ratings of an image more accurately, and indicates more precisely what proportion of people find the image good-looking. In addition, some scholars have recently combined the attention mechanism with the aesthetic assessment task to extract the global representation of the image (Zhang 2019). After analyzing the impact of fine-grained image information on the aesthetic assessment task, Zhang et al. (Wu 2017) proposed a gated peripheral-foveal convolutional neural network (GPF-CNN) for predicting the distribution of image aesthetic scores. After that, Zhang et al. (Zhang 2021) observed that aesthetic quality assessment is a highly subjective and complex task. Noting that existing aesthetic assessment algorithms rely heavily on visual features extracted by convolution without considering the relationships between visual elements in image composition, they proposed a multi-modal self-cooperative attention network.

Packaging Design Image Data Acquisition
Since there is no public dataset suitable for this scenario, we collected a total of 1200 packaging design images through web crawlers and other means, including 100 award-winning packaging design images from past years downloaded from professional websites and 1100 ordinary design images. All images are resized to 400 × 200 pixels. Most of the collected packaging design images contain text information (some samples are shown in Figure 1). In order to label the data, participants were recruited at the school, taking into account their age, gender, educational background and design experience. A total of 10 participants were recruited, five women and five men. Among these participants, 8 are ordinary users and 2 are experienced designers, and their ages range from 20 to 35. Each participant scored independently, and a 5-point scale is used to score the artistic beauty and functional beauty of the packaging design images separately (i.e., for artistic beauty: 5 points for very good-looking, 4 points for good-looking, 3 points for average, 2 points for not good-looking, and 1 point for very bad-looking; for functional beauty: 5 points for very attractive, 4 points for attractive, 3 points for average, 2 points for not attractive, and 1 point for very unattractive). In addition, for the award-winning images, the lowest score is assumed to be 3 points and the highest score 5 points (some samples are shown in Table 1).
The data distribution of the collected packaging design images is shown in Figure 2. As can be seen from Figure 2, the numbers of samples in the different grades are uneven. For the artistic beauty score, most of the labeled data fall between grade 3 and grade 3.6. For functional beauty, most of the labeled data fall between grade 3 and grade 3.3. Moreover, there are few high-grade and low-grade samples, and the overall distribution is approximately Gaussian. Finally, 90% of the samples in each grade are randomly selected as the training dataset and the remaining 10% are used as the testing dataset, yielding 1080 training images and 120 testing images.
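The per-grade 90%/10% split can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `stratified_split`, the `grade_of` callback and the toy grades are assumptions introduced here, and the paper does not state how fractional grades are binned before sampling.

```python
import random
from collections import defaultdict

def stratified_split(samples, grade_of, test_fraction=0.1, seed=0):
    """Split samples into (train, test), sampling separately inside each grade."""
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for sample in samples:
        by_grade[grade_of(sample)].append(sample)

    train, test = [], []
    for grade in sorted(by_grade):
        group = by_grade[grade][:]          # copy so the input is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy stand-in for the 1200 images: (image_id, grade) with made-up grades 1..5.
samples = [(i, 1 + i % 5) for i in range(1200)]
train, test = stratified_split(samples, grade_of=lambda s: s[1])
print(len(train), len(test))
```

Sampling inside each grade keeps rare low and high grades represented in both subsets, which matters when the grade distribution is as uneven as described above.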

Aesthetic Assessment of Packaging Design Based on Con-Transformer
The key to aesthetic assessment is to learn more representative features. A traditional convolutional neural network is good at extracting local features of images; however, it has difficulty capturing global representations, e.g., long-distance relationships among visual elements, which are often critical for high-level computer vision tasks. Fortunately, the recently proposed Vision Transformer (ViT) (Alexey 2020) can effectively capture global representations because it makes full use of the attention mechanism. In view of this, we propose a new method called Con-Transformer to integrate convolution and the attention mechanism. The flow chart of the proposed method is shown in Figure 3. In Figure 3, our Con-Transformer architecture includes two branches: the upper branch is the global feature extraction network, which uses ViT as the basic network, and the feature vector

Loss Function
The goal of our proposed method is to predict the distribution of ratings for a given image. In other words, we want to predict the probability that the input image belongs to each score. In the traditional assessment task, the output classes are ordered. For example, for the AVA dataset, the outputs are 1, 2, …, 10, which satisfy 1 < 2 < … < 10. In our task, however, the output includes two parts: the outputs 1, 2, …, 5 for artistic beauty and the outputs 1, 2, …, 5 for functional beauty. To stay consistent with the traditional task, we change the labels of functional beauty from 1, 2, …, 5 to 6, 7, …, 10. The ground truth distribution of human ratings of a given image can then be expressed as an empirical probability mass function $p = [p_{s_1}, \ldots, p_{s_{10}}]$, where $s_i$ denotes the $i$-th score bucket and $p_{s_i}$ represents the probability of a quality score falling in the $i$-th bucket. In the traditional task, these probabilities sum to 1. In our task, we modify this rule so that the first five probabilities sum to 1 and the last five probabilities sum to 1, that is, $\sum_{i=1}^{5} p_{s_i} = 1$ and $\sum_{i=6}^{10} p_{s_i} = 1$, so that artistic beauty forms one group and functional beauty forms another. In other words, we can obtain the probability distributions of artistic beauty and functional beauty at the same time.
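The grouped ground truth described above can be illustrated with a short sketch that turns the participants' scores for one image into a 10-bucket vector, with buckets 1 to 5 for artistic beauty and buckets 6 to 10 for functional beauty. The function name `rating_distribution` and the example ratings are hypothetical and are not taken from the dataset.

```python
def rating_distribution(artistic, functional, n_levels=5):
    """Return [p_s1..p_s5 (artistic), p_s6..p_s10 (functional)] for one image.

    Each argument is a list of integer scores in 1..n_levels, one per rater.
    Each half of the result is normalized to sum to 1 on its own.
    """
    def histogram(scores):
        counts = [0] * n_levels
        for score in scores:
            counts[score - 1] += 1
        return [c / len(scores) for c in counts]

    # Concatenating the two histograms is equivalent to relabeling
    # functional scores 1..5 as buckets 6..10.
    return histogram(artistic) + histogram(functional)

# Hypothetical scores from 10 raters for a single packaging image.
artistic = [2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
functional = [2, 2, 3, 3, 3, 3, 4, 4, 4, 5]
p = rating_distribution(artistic, functional)
print(p[:5], p[5:])
```

With this layout, index 7 of the vector (bucket 8) corresponds to functional grade 3, since 8 − 5 = 3.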
In our collected dataset, each example consists of an image and its ground truth ratings $p$. Our objective is to find the probability mass function $\hat{p}$ that is an accurate estimate of $p$. Next, our training loss function is discussed.
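NIMA has shown that for ordered-class tasks such as image aesthetic assessment, a classification loss can outperform a regression loss. Therefore, drawing on NIMA, we also use an EMD-based loss as our loss function. Specifically, for the image quality rating task, the classes are inherently ordered as $s_1 \le \cdots \le s_{10}$, and based on this ordering we can define the $r$-norm distance between classes as $\|s_i - s_j\|_r$, where $1 \le i, j \le 10$. The EMD (earth mover's distance) is defined as the minimum cost to move the mass of one distribution to another. Given the ground truth and estimated probability mass functions $p$ and $\hat{p}$, the EMD can be formulated, as in NIMA, as:

$$\mathrm{EMD}(p, \hat{p}) = \left( \frac{1}{N} \sum_{k=1}^{N} \left| \mathrm{CDF}_{p}(k) - \mathrm{CDF}_{\hat{p}}(k) \right|^{r} \right)^{1/r}$$

where $\mathrm{CDF}_{p}(k) = \sum_{i=1}^{k} p_{s_i}$ is the cumulative distribution function, $N$ is the number of ordered score buckets, and $r$ is set as 2 in our experiments.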
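As a rough sketch of this loss, the following plain-Python functions compute the normalized EMD between two distributions with r = 2. The paper does not state how the loss is combined over the artistic and functional groups, so applying the EMD to each five-bucket group and averaging the two values is an assumption made here, as are the function names and the example vectors.

```python
def emd(p, p_hat, r=2):
    """Normalized EMD between two discrete distributions over ordered buckets."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for p_k, q_k in zip(p, p_hat):
        cdf_p += p_k                      # running CDF of the ground truth
        cdf_q += q_k                      # running CDF of the prediction
        total += abs(cdf_p - cdf_q) ** r
    return (total / len(p)) ** (1.0 / r)

def grouped_emd(p, p_hat, group_size=5, r=2):
    """Assumed combination: average of the EMD of each five-bucket group."""
    starts = range(0, len(p), group_size)
    losses = [emd(p[s:s + group_size], p_hat[s:s + group_size], r) for s in starts]
    return sum(losses) / len(losses)

# Hypothetical ground truth and prediction (artistic group, then functional group).
p = [0.0, 0.3, 0.3, 0.2, 0.2, 0.0, 0.2, 0.4, 0.3, 0.1]
p_hat = [0.05, 0.29, 0.33, 0.20, 0.13, 0.02, 0.20, 0.44, 0.26, 0.08]
print(grouped_emd(p, p_hat))
```

Because the EMD compares cumulative distributions, predicting grade 2 when the truth is grade 3 costs less than predicting grade 5, which is what makes it suitable for ordered classes.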

EXPERIMENTAL RESULTS AND ANALYSIS
The effectiveness of our proposed method is verified on our collected packaging design dataset and on the AVA dataset. For the AVA dataset, we use 80% of the data for training and the remaining 20% for testing. For our dataset, 90% is used for training and 10% for testing. When using the AVA dataset, the weights of our baselines ResNet-50 and ViT are initialized by pre-training each on the ImageNet dataset, and the last FC layer and MLP layer are randomly initialized. When using our dataset, the best weights trained on AVA are used for initialization. The linear correlation coefficient (LCC), Spearman's rank correlation coefficient (SRCC) and the EMD value are used as our evaluation criteria. In order to output the rating probabilities of artistic beauty and functional beauty at the same time, the neurons in the last fully connected layer are grouped: the first five neurons form one group and the last five form another, and each group is activated by its own SoftMax function.
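The grouped activation of the last layer can be sketched in a few lines. This is an illustration only: the function names and the logits are made up, and the mean-score helper follows the common NIMA practice of summarizing a distribution by its expected grade, which the text above does not spell out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_softmax(logits, group_size=5):
    """Apply softmax separately to each consecutive group of outputs."""
    probs = []
    for start in range(0, len(logits), group_size):
        probs.extend(softmax(logits[start:start + group_size]))
    return probs

def mean_score(group_probs):
    """Expected grade (1..len) of one group's probability distribution."""
    return sum((i + 1) * p for i, p in enumerate(group_probs))

# Hypothetical outputs of the last fully connected layer (10 neurons).
logits = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
probs = grouped_softmax(logits)
artistic, functional = probs[:5], probs[5:]
print(mean_score(artistic), mean_score(functional))
```

Normalizing each group on its own is what lets a single 10-neuron head return two valid probability distributions at once.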

Performance on our Proposed Dataset
First, in order to visualize the difference between the outputs of our network and the ground truth ratings, the comparison results on the ID1 sample are shown in Figure 4. As can be observed from Figure 4, for the artistic beauty of the ID1 sample, grades 2 and 3 are each chosen by 30% of the raters and grades 4 and 5 are each chosen by 20%, while our Con-Transformer network predicts a probability of 29% for grade 2 and 33% for grade 3. For functional beauty, 40% of the raters choose grade 3 (i.e., 3 = 8 − 5), and the network predicts a probability of 44% for grade 3. Therefore, the network's predictions can be considered basically consistent with the human rating distribution.
Then, the comparison results on our proposed dataset are shown in Table 2. From Table 2, we can observe that our proposed method achieves the best results in all cases, which indicates that it is effective. Specifically, the linear correlation between the ground truth and the results of our Con-Transformer is 0.61, which means that our predictions have a strong correlation with the ground truth. In contrast, the linear correlations between the ground truth and the results of NIMA (Inception-v2), ResNet-50 and ViT are 0.59, 0.54 and 0.58, respectively, which means that their results have a moderate correlation with the ground truth. In addition, the comparison between Con-Transformer and its two backbones, ResNet-50 and ViT, demonstrates that our fusion network can integrate the discriminative ability of local features and global features.
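For reference, LCC and SRCC can be computed as in the sketch below, which uses only the standard library. It assumes that both coefficients are computed between per-image mean scores of the ground truth and of the prediction, as is done in NIMA; the paper does not state this explicitly, and the score lists are invented for illustration.

```python
import math

def pearson(x, y):
    """Linear correlation coefficient (LCC) between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented mean scores for five test images.
truth = [3.1, 3.4, 2.8, 3.9, 3.3]
predicted = [3.0, 3.5, 3.0, 3.6, 3.2]
print(pearson(truth, predicted), spearman(truth, predicted))
```

LCC measures how well the predicted scores follow the ground truth linearly, while SRCC only checks whether the images are ranked in the same order.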

Performance on AVA Dataset
We also carried out experiments on AVA, a dataset for the aesthetic assessment of natural images, to verify the effectiveness of the proposed method. Following the settings in NIMA, the experimental results are shown in Table 3, where the Reg+Rank+Att+Cont method was proposed in (Kong 2016) and the A-Lamp CNN method was proposed in (Ma 2017). From Table 3, it can be seen that our Con-Transformer is superior to the NIMA method on almost all metrics. Specifically, in the two-class aesthetic categorization task, our Con-Transformer shows the highest accuracy, while A-Lamp CNN and NIMA (Inception-v2) show comparable accuracies. In the multi-class aesthetic categorization task, Con-Transformer achieves the best results on the LCC (mean), SRCC (mean) and SRCC (std. dev) indicators. This means that our proposed method can not only effectively handle the packaging design aesthetic assessment task, but can also be effectively applied to the natural image aesthetic assessment task. In addition, we also compare our method with the two baselines, ResNet-50 and ViT. The experimental results again show that our proposed Con-Transformer inherits the advantages of both ResNet-50 and ViT.

CONCLUSION
In this paper, aiming at the task of intelligent packaging design aesthetic assessment, we first construct a new dataset containing various packaging design images and their corresponding scores. Then, a new network that fuses local and global features, Con-Transformer, is proposed, which can be used not only for the aesthetic assessment of packaging design images but also for the aesthetic assessment of natural images. The experimental results on both the packaging design aesthetic assessment dataset and the natural image aesthetic assessment dataset show its superiority over some state-of-the-art methods. Future research work includes continuing to expand the packaging design image aesthetic assessment dataset and exploring more intelligent aesthetic assessment methods.

ACKNOWLEDGMENT
This work is supported by the Anhui Social Science Innovation and Development Research Project (2021CX136) and the 2022 Anhui Provincial Scientific Research Preparation Plan Project (2022AH052089).

Figure 1. Some sample examples of the collected packaging design dataset

Figure 2. Data distribution of the packaging design image aesthetic assessment dataset. (a) Artistic beauty score data distribution, (b) functional beauty score data distribution.

Figure 3. Flowchart of the proposed Con-Transformer method