Effectiveness Assessment of Recent Large Vision-Language Models

The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the effectiveness of these models in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.


INTRODUCTION
The emergence of large language models (LLMs) [1], [2] has sparked a revolution in the field of natural language processing, owing to their promising generalization and reasoning capabilities. Motivated by this progress, researchers have pioneered the development of powerful large vision-language models (LVLMs) [3], [4], [5], leveraging the impressive capabilities of LLMs to enhance comprehension of visual semantics. This advance particularly improves model performance in complex vision-language tasks [4], [6], [7], and represents a major step toward artificial general intelligence (AGI). AGI refers to intelligent systems that are capable of solving any task that can be performed by humans or animals. Generally, tasks performed by humans can be divided into general and specialized tasks according to whether special domain knowledge is required. Therefore, the capabilities of LVLMs can be categorized into these two aspects accordingly, and both of them are essential for LVLMs on the path toward AGI.
Recently, numerous studies have assessed and investigated the general and specialized capabilities of LVLMs [6], [7], [11], [12], [13], [14], [15]. Qin et al. [7] conducted empirical studies encompassing various general tasks, such as object detection and counting, to evaluate the visual understanding capabilities of Google Bard. Fu et al. [15] introduced a comprehensive evaluation benchmark to assess the perceptual and cognitive capabilities of recent LVLMs on general tasks (e.g., optical character recognition and object counting). Zhang et al. [11] explored the potential of GPT-4V [5] in visual anomaly detection, while Tang et al. [12] generalized Shikra [9] to challenging camouflaged object detection scenarios without training. However, as these studies primarily focus on evaluating the general capabilities of LVLMs [6], [7], [15] or exploring the effectiveness of a particular LVLM in a specialized domain [11], [12], [13], [14], there is a lack of quantitative analysis regarding the performance of recent LVLMs in a diverse range of specialized tasks, leading to an insufficient understanding of their capabilities.
In this paper, we conduct a comprehensive assessment of several recent open-source LVLMs, spanning a diverse array of challenging specialized and general tasks. Our evaluation platform is illustrated in Fig. 1. To evaluate the ability of LVLMs to perform specialized tasks, we select three recent open-source LVLMs (MiniGPT-v2 [4], LLaVA-1.5 [8], and Shikra [9]) and conduct quantitative assessment on six challenging specialized tasks in three different application scenarios: natural, healthcare, and industrial. For natural scenarios, we select salient object detection (SOD) [16], [17], [18], transparent object detection (TOD) [19], and camouflaged object detection (COD) [20], [21], as these tasks involve targets that are increasingly rare in real life and possess progressively complex characteristics, thereby presenting distinct challenges to LVLMs. In the field of healthcare, the effectiveness of LVLMs is evaluated by skin lesion detection [22] and polyp detection [23], which show prominent and slightly weaker visual features, respectively. Besides, anomaly detection (AD) [24], a vital task in industrial scenarios, is also selected for assessment. In academia, these six tasks come with tailored datasets and cover broad specialized domains, thereby enabling comprehensive evaluation of the specialized capabilities of LVLMs. As illustrated in Fig. 1, given the inherent challenges posed by these tasks in terms of recognizing and localizing target objects, we employ tailored prompts to assess the recognition (§ 2) and localization (§ 3) capabilities of the models. Furthermore, we conduct empirical investigations on a universal dataset (COCO [10]) that is free from domain-specific expertise. We abstain from specifying particular object types ("camouflaged", "transparent", or others) in prompts, aiming to explore the multi-modal understanding capabilities (§ 4) of the above-mentioned models and GPT-4V in general tasks (i.e., object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning).
Our investigations reveal that while these models show strong potential for specialized tasks, they still exhibit suboptimal performance and limited cognitive capabilities. This reveals their inadequate transfer ability in this particular context. Performance issues are further magnified by typical weaknesses of LVLMs such as object hallucination, text-to-image interference, and decreased robustness in complex problems. In addition to the shortcomings revealed in specialized tasks, these models also show significant room for improvement in general tasks, particularly in object counting, spatial reasoning, and absurd question answering.
In summary, the main contributions of this paper are three-fold: (1) We construct an evaluation platform comprising six specialized tasks and five general tasks to assess the effectiveness of LVLMs. (2) On the evaluation platform, we evaluate the specialized capabilities of three recent open-source LVLMs and also the general capabilities of four LVLMs. (3) We analyze their performance and limitations for both specialized and general tasks, and discuss the future development and application of LVLMs.

RECOGNITION VIA LVLMS IN SPECIALIZED TASKS
When LVLMs are applied in these specialized tasks, recognizing the target objects is a crucial step, which reflects the models' global understanding of such tasks and directly influences their effectiveness. Therefore, we first conduct a quantitative evaluation of their recognition capabilities on the aforementioned six specialized tasks. Subsequently, we carry out additional tests to delve into failure cases and gain further insights.

Experimental Setup
Recognition in specialized tasks involves determining the existence of targets and classifying them. The first evaluation of recognition capabilities is to judge object existence, requiring models to answer either "Yes" or "No" to questions like "Are there any <camouflaged objects> in the picture? Please answer Yes or No." as demonstrated in Fig. 1. The placeholder "<...>" in the queries denotes flexible words/phrases that can be substituted in different tasks, such as "polyps" in polyp detection. The evaluation considers two different setups: the full set, which includes both positive and negative samples, and the positive set, which includes only positive samples.
Beyond the first evaluation, we delve deeper into the fine-grained recognition ability of LVLMs by asking them to categorize targets. Our method is to prompt LVLMs to designate the most suitable category for a target object from a pre-defined set of potential categories (w/ vocabulary). In this experiment, questions such as "Which of the following is the most likely category for the camouflaged object in the picture? 'seahorse, mantis, spider ...'" are used. The pre-defined set contains all categories that appear in the dataset. Besides, another evaluation is considered, featuring an open-vocabulary inquiry without a pre-defined set (w/o vocabulary). In this test, a straightforward question like "What is the camouflaged object in the picture?" is used.
The versions of LLaVA-1.5 [8], Shikra [9], and MiniGPT-v2 [4] that are equipped with language models of ∼7 billion parameters are selected for evaluation. All configurations of each model are kept at their defaults during evaluation. Since all tests in this paper are based on the above configurations, we do not repeat them in the following sections.

Metrics
As for the first evaluation, accuracy (A) is employed to measure the performance of LVLMs in judging object existence, while the probability of positive responses (responses indicating "yes") on the full set is also reported for reference. A and the probability of positive responses (Y) can be formulated as follows:

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Y = \frac{TP + FP}{TP + TN + FP + FN},$$

where TP, FP, TN, FN denote true positive, false positive, true negative, and false negative, respectively. For fine-grained recognition, LVLMs typically select categories from a pre-defined set when available, enabling direct matching with labels for assessing correctness. However, in the absence of such a set, the generated categories exhibit significant variation, posing challenges in directly evaluating correctness through class matching. Hence, we utilize accuracy (A*) and semantic similarity (S) [30] to measure the performance in these two settings, respectively. The former quantifies the fraction of responses that contain correct category names, while the latter quantifies the semantic similarity between responses and ground truth labels. Considering that LVLMs may occasionally generate similar categories not included in the pre-defined set, S is also employed to evaluate the performance of the w/ vocabulary setting.
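As a minimal sketch of how these scores could be computed, the snippet below derives A (fraction of correct yes/no answers), Y (fraction of "yes" answers), and A* (fraction of responses containing the correct category name) from raw text responses. The function names and the rule for parsing a response as positive (the word "yes" appearing anywhere in it) are assumptions made for illustration, not the parsing used in the paper.

```python
import re
from typing import Sequence, Tuple


def existence_metrics(responses: Sequence[str], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (A, Y) for yes/no existence answers.

    labels[i] is True when sample i actually contains a target.
    """
    if len(responses) != len(labels) or not responses:
        raise ValueError("responses and labels must be non-empty and aligned")
    # Assumption: a response is positive if it contains the word "yes".
    preds = [re.search(r"\byes\b", r.lower()) is not None for r in responses]
    tp = sum(p and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    fp = sum(p and (not l) for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    yes_rate = (tp + fp) / total
    return accuracy, yes_rate


def category_accuracy(responses: Sequence[str], labels: Sequence[str]) -> float:
    """Return A*: the fraction of responses that contain the correct category name."""
    if len(responses) != len(labels) or not responses:
        raise ValueError("responses and labels must be non-empty and aligned")
    hits = sum(label.lower() in resp.lower() for resp, label in zip(responses, labels))
    return hits / len(labels)
```

On a positive-only set, TN and FP are both zero, so the two values returned by `existence_metrics` coincide.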
COD10K, the only dataset that provides category labels for each target, is utilized to evaluate the fine-grained recognition ability of LVLMs. Since judging target existence in negative samples is certainly challenging for LVLMs, we exclude the interference and use only the positive samples of COD10K to more accurately evaluate the fine-grained recognition ability.

Result Analyses and Discussions
Evaluation results of existence determination on the full set and positive set, and fine-grained recognition are detailed in Tables 1-3. The absence of negative samples leads to TN = 0 and FP = 0, and hence A in Table 2 is equivalent to Y in Table 1. Three observations from these results are as follows.
Over-positive issue. From the results in Table 1 and the proportion of positive samples in each dataset (in § 2.1.3), we can observe that these models consistently yield a greater proportion of positive responses (Y) compared to the proportion of positive samples. Especially on SOC and CP-CHILD-B, these LVLMs generally achieve Y higher than 0.9, while the proportions of positive samples in these datasets are only 50% and 25%. This indicates that the models tend to give positive responses, which is further supported by the results on the positive sets in Table 2, where extremely high scores on A (e.g., 1.000) are achieved (particularly for LLaVA-1.5). The reason behind this phenomenon could be that most of the samples learned by these LVLMs during training are positive image-text pairs, which makes them over-positive and thus inclined to answer "yes" to the questions [31], [32].
Limited performance in determining existence. Although LVLMs achieve notably high accuracy (A) in Table 2, the inclusion of negative samples results in an overall decrease in accuracy. As shown in Table 1, most accuracies drop below 0.7, indicating an inadequate ability of LVLMs to determine the existence of targets, particularly when negative samples are present. Among these models, LLaVA-1.5 shows better recognition capabilities for camouflaged objects, achieving higher accuracy (A) while obtaining Y scores that are close to the proportions of positive samples in COD10K. In contrast, Shikra shows extremely poor results (on Y) on COD10K due to its frequent misclassification of positive samples, indicating its weaker capability in recognizing camouflaged objects.

The results in Table 3 clearly demonstrate that these LVLMs struggle with classifying camouflaged objects. Although LLaVA-1.5 achieves the highest scores, its performance is still unsatisfactory. The unsatisfactory performance could be attributed to various factors. First, these models may face challenges in identifying camouflaged objects that closely resemble the background, as indicated by their unsatisfactory recognition accuracy in Table 1. Second, the category of camouflaged objects may lie beyond the models' domain of knowledge, hindering their capability to match objects with their categories accurately. Additionally, the extended length of the prompt, stemming from the incorporation of the pre-defined set, may impede the model's comprehension. This aligns with the results in Table 3, where MiniGPT-v2 and Shikra demonstrate improved performance (S) when the pre-defined set is excluded (i.e., w/o vocabulary), as opposed to when the vocabulary is provided (i.e., w/ vocabulary).

TABLE 1 Experimental results for three LVLMs regarding the presence of targets on the full sets. We present the probability of positive answers (Y, representing the percentage of "yes"). The highest accuracy (A) score is highlighted in bold.

Uncovering Insights into Failure Cases
Recall that these models encounter challenges in differentiating negative samples. We therefore conduct tests on representative negative samples to gain insight into the potential causes of this phenomenon. LVLMs are prompted to provide additional descriptions or reasoning when determining the existence of targets. The results are illustrated in Fig. 2, from which three potential factors are derived.
Limited cognition towards specific attributes. As illustrated in the first example of Fig. 2, when presented with the question "Is there camouflaged object in the picture? What is it?", MiniGPT-v2 erroneously recognizes the "small black rock" as a camouflaged object, while LLaVA-1.5 misclassifies a "plant" as such. These models classify rocks and plants as camouflaged objects just because of their visual resemblance to the surroundings, indicating their limited cognition about camouflage. This phenomenon also occurs in other specialized tasks, e.g., anomaly detection, implying their limited cognition on special object types.
Object hallucinations. Object hallucination, which refers to describing objects in the response that are not present in the image [31], [33], could impact the recognition capability of LVLMs in specialized tasks. For instance, as demonstrated by the answers to "Is there a camouflaged object in the picture? What is it?" in the second example of Fig. 2, LLaVA-1.5 states that "a person is standing on the sandy beach", while MiniGPT-v2 mentions the presence of a "small bush or tree". These objects may affect the recognition of targets [12], resulting in decreased recognition performance in determining object presence.
Text-to-image interference. The inadequate performance in determining the presence of targets may also be attributed to text-to-image interference, which originates from the textual prompts supplied to the models [33]. As shown in the second example in Fig. 2, when prompted with "Please describe the picture in detail", LLaVA-1.5 provides an accurate description of the image. However, when prompted with "Is there a camouflaged object in the picture? What is it?", the mention of the "camouflaged object" in the prompt may interfere with the answers, resulting in hallucination and misjudgment from LLaVA-1.5.

[Fig. 2: responses of MiniGPT-v2, LLaVA-1.5, and Shikra on two negative samples to the question "Is there camouflaged object in the picture? What is it?" (GT: No camouflaged object.) and to the prompt "Please describe the picture in detail".]

Summary
Sect. 2 evaluates the recognition performance of MiniGPT-v2 [4], LLaVA-1.5 [8], and Shikra [9] in various specialized tasks. Among them, LLaVA-1.5 generally shows better recognition ability in both existence determination and object classification. However, quantitative analyses indicate that while these models exhibit certain cognitive capabilities in various specialized tasks without domain-specific fine-tuning, their recognition performance requires further enhancement. When directly applied to these tasks, they still exhibit limited cognition and understanding of specialized domains. Apart from such limited cognition, other typical weaknesses of LVLMs, as revealed in qualitative investigations, such as object hallucination and text-to-image interference, are likely to result in inferior performance.

LOCALIZATION VIA LVLMS IN SPECIALIZED TASKS
In this section, we assess the localization capabilities of three LVLMs on the six specialized tasks, and further explore their strengths and limitations through additional qualitative tests.

Experimental Setup
Recent LVLMs have demonstrated a remarkable visual grounding capability as they can locate objects with bounding boxes (bboxes) that are specified in language prompts. This capability makes it feasible to apply these models to the specialized tasks described above. To achieve this goal, we employ a two-step methodology consisting of detection followed by segmentation. Specifically, as illustrated in Fig. 1, we initially prompt LVLMs to provide bounding boxes for a particular type of objects (e.g., transparent objects) with a question like "Detect the <transparent objects>.", yielding detection of targets. Subsequently, the predicted bounding boxes are used as further prompts to the segment anything model (SAM) [34] to perform fine segmentation. Given the potential presence of multiple boxes in a picture, we first employ SAM to generate a separate mask for each box and then merge these results using the Boolean OR operation to obtain the final segmentation result. The SAM with the ViT-H backbone [35] is employed as the default in all the experiments. We also conduct segmentation using ground truth bounding boxes, which serve as the upper bound of segmentation performance.
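The mask-merging step described above can be sketched as follows. This is an illustrative sketch only: it assumes the per-box masks have already been produced by SAM and are given as equally sized binary grids, it does not call SAM itself, and the name `merge_masks` is hypothetical. Plain nested lists are used to keep the sketch dependency-free; array types would normally be used in practice.

```python
from typing import List, Sequence

Mask = List[List[bool]]  # binary mask, indexed as mask[row][col]


def merge_masks(masks: Sequence[Mask], height: int, width: int) -> Mask:
    """Combine per-box binary masks into one mask with a pixel-wise Boolean OR."""
    merged = [[False] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                # A pixel is foreground if any per-box mask marks it as foreground.
                merged[r][c] = merged[r][c] or bool(mask[r][c])
    return merged
```

When no box is predicted for an image, the sketch returns an all-background mask; how such cases are handled in the actual evaluation is not stated in the text.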

Metrics
As mentioned previously, we perform detection followed by segmentation to utilize these models for specialized tasks. Therefore, during evaluation, we assess their localization capabilities by evaluating their performance in both detection and segmentation. To evaluate the detection results, three widely used detection metrics (i.e., Precision, Recall, and F1 with an intersection-over-union (IoU) threshold of 0.5 [36]) are adopted. Additionally, three segmentation metrics (mean absolute error (M) [37], S-measure (S_α) [38], and maximum F-measure (F_β) [39]) are employed to assess segmentation performance. It should be noted that since these models solely predict bounding boxes without providing corresponding confidence values, we exclude common metrics such as average precision (AP) [36] in anomaly detection.
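To make the detection metrics concrete, the following sketch computes Precision, Recall, and F1 for one image at an IoU threshold of 0.5. Several details are assumptions, since the text does not specify them: boxes are given as (x1, y1, x2, y2), each predicted box is greedily matched to at most one unmatched ground-truth box, a match requires IoU greater than or equal to the threshold, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(
    preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) using greedy one-to-one matching (an assumption)."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            value = iou(p, g)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1
```

Because no confidence values are available, the predictions are processed in the order given rather than sorted by score.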

Result Analyses and Discussions
The results are reported in Tables 4-6, from which several observations can be derived.
Promising yet insufficient localization capability for specific tasks. The results in Tables 4-6 show that these LVLMs hold promise for addressing specialized tasks without requiring domain-specific fine-tuning, particularly in natural scenarios. While Shikra and MiniGPT-v2 show better localization capability compared to LLaVA-1.5, superior segmentation performance is achieved by Shikra on DUTS (S_α score 0.754) and Trans10K (S_α score 0.683) when only provided with category names. However, their detection and segmentation performance is found to be inadequate, as it is much lower than the upper bound. This indicates their insufficient localization capability in these specialized tasks. Specifically, the low scores in terms of Precision and Recall demonstrate that these models struggle to generate precise bounding boxes (i.e., most predicted boxes are inaccurate) and to identify targets (i.e., most objects are missed in detection). These limitations ultimately restrict the final segmentation performance of LVLMs on specialized tasks.

TABLE 6 Detection and segmentation results of MiniGPT-v2, LLaVA-1.5, and Shikra in industrial scenarios (MVTec AD [24]). Segmentation is obtained with SAM applied to bboxes. The symbols ↑/↓ indicate that a higher/lower score is better, while the highest scores are marked in bold. The upper bound (on ground truth bounding boxes) of detection and segmentation via LVLMs in diverse specialized tasks is marked in gray.
Superior performance in natural scenarios. According to the results presented in Tables 4-6, these models demonstrate superior performance in natural scenarios, especially on DUTS and Trans10K. The underlying reason may be that transparent and salient objects are more prevalent and exhibit common attributes. Conversely, medical and abnormal images are relatively scarce and have complex characteristics, thereby posing greater challenges for LVLMs. Furthermore, we illustrate the detection and segmentation results in Fig. 3. As shown there, these models face challenges in providing accurate bounding boxes, consequently resulting in subpar segmentation performance. These findings underscore their limited localization capabilities in specialized tasks.

Uncovering Insights into Failure Cases
As mentioned in § 3.1, we evaluate the localization capability of LVLMs by solely specifying object types. This setting concurrently evaluates their recognition, reasoning, and localization capabilities by requiring models to accurately perceive each object. Therefore, we seek to gain insight into the underlying reasons for the failures by breaking down the question in § 3.1 into multiple questions. We focus on failure cases of LVLMs and prompt them with multiple questions. In natural scenarios, two questions are prompted to assess the models' ability to accurately localize given objects ("Question 1") and to determine the targets of specific types ("Question 2"). In industrial scenarios, because the detailed categories of anomalies are usually hard to determine, we evaluate the recognition of anomalies by querying the existence ("Question 1") and image description ("Question 2"), and further test their capability to locate anomalous areas by providing corresponding descriptions ("Question 3"). In healthcare (colon polyp detection), we follow the same protocol as in industrial cases. The results are separately presented in Figs. 4-6. Two underlying reasons for the localization failures can then be drawn.

Decreased robustness in complex problems. The results in Fig. 4 reveal that these models are good at locating a given object or inferring the target, especially for salient and transparent objects. However, they make errors when asked to locate the target types directly, as shown in Fig. 3. This failure indicates that they exhibit decreased robustness or are unskilled when faced with more complex and abstract problems. For example, they need to understand a complex concept such as "salient". It is worth noting that LVLMs achieve lower performance on camouflaged objects, which could be attributed to the resemblance of camouflaged objects to their surroundings. Such failures also demonstrate the challenge faced by LVLMs in accurately categorizing these objects, as mentioned in § 2.2.
Limited cognition toward medical images and anomalies. Fig. 5 and Fig. 6 clearly demonstrate the limited cognition of LVLMs on medical images and anomalies. For instance, LLaVA-1.5 and Shikra erroneously categorize the "black and orange color scheme" and "the number 500" as anomalies (as shown in Fig. 5), while MiniGPT-v2 incorrectly recognizes a colon image as "the inside of an orange" (as shown in Fig. 6). Despite their limitations, these LVLMs show superior localization capabilities on polyps when provided with relevant descriptions (as evidenced by the responses to "Question 3" in Fig. 6). Nevertheless, there still remains room for enhancement in localization regarding anomaly detection.

Fig. 6. Responses of three LVLMs (MiniGPT-v2, LLaVA-1.5, and Shikra) regarding recognizing and locating the colon polyp. Predicted bounding boxes and ground truth are marked in blue and green, respectively. Incorrect responses are marked with red underlines and crosses.

Summary
Sect. 3 evaluates the effectiveness of MiniGPT-v2 [4], LLaVA-1.5 [8], and Shikra [9] in localizing targets in diverse specialized tasks. The results reveal that these models hold promise for addressing specialized tasks (particularly in natural scenarios), while Shikra and MiniGPT-v2 show superior localization capability compared to LLaVA-1.5. Nonetheless, despite the successes, the detection and segmentation performance of these models is still inadequate, indicating a weakness in localization capability for specialized tasks. The limited cognition of medical images and anomalies hampers the transfer capability of these LVLMs, whereas decreased robustness when facing complex problems may also be an additional constraint.

TABLE 7 Performance summary of MiniGPT-v2, LLaVA-1.5, and Shikra in SOD, TOD, COD, polyp detection (PD), skin lesion detection (SLD), and AD. Thresholds are established at 60% and 80% of the upper-bound performance to categorize model performance into three intuitive levels: low (L), medium (M), and high (H). The notation "-" denotes inconclusive cases, since the evaluation is performed only on the positive sets, while the models incur the over-positive issue. (Columns: Model; Recognition in Natural, Healthcare, and Industrial scenarios; Localization in Natural, Healthcare, and Industrial scenarios.)

[Figure: responses of MiniGPT-v2, Shikra, LLaVA-1.5, and GPT-4V to Question 1: "How many pandas are there in the picture?" (GT: 2); Question 2: "What color is the airplane in the picture?" (GT: No airplane); Question 3: "What objects in the picture can I sit on?" (GT: Chair); Question 5: "Which of the three women in the picture is closest to the orange juice?" (GT: The woman in colorful stripes or left).]
In summary, Table 7 reports the general performance of the three models on the six tasks, where intuitive thresholds are set to categorize the models' average performance into three levels. The recognition and localization performance of these models on the six tasks remains insufficient, with most cases exhibiting low (L) or medium (M) performance, indicating limited usability in real-world scenarios. Notably, Shikra stands out with a high (H) score on the TOD task, and LLaVA-1.5 is superior to MiniGPT-v2 and Shikra in recognition. However, the opposite appears to be true for localization.
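The three-level categorization used for Table 7 (thresholds at 60% and 80% of the upper-bound performance) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `performance_level` is hypothetical, the metric is assumed to be one where higher is better, and treating scores exactly at a threshold as belonging to the higher level is also an assumption.

```python
def performance_level(score: float, upper_bound: float) -> str:
    """Map a score to 'L', 'M', or 'H' relative to an upper-bound score.

    Assumptions (not stated in the paper): higher scores are better, and a
    score exactly at a threshold is assigned to the higher level.
    """
    if upper_bound <= 0:
        raise ValueError("upper_bound must be positive")
    ratio = score / upper_bound
    if ratio >= 0.8:   # at least 80% of the upper bound
        return "H"
    if ratio >= 0.6:   # at least 60% but below 80%
        return "M"
    return "L"         # below 60%


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    for score in (45, 70, 90):
        print(score, performance_level(score, upper_bound=100))
```

For metrics where lower is better, the ratio would need to be inverted before applying the same thresholds.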

CAPABILITIES OF LVLMS IN GENERAL TASKS
In this section, we conduct empirical investigations to evaluate the performance of MiniGPT-v2 [4], LLaVA-1.5 [8], Shikra [9], and GPT-4V [5] in a diverse range of general tasks. Given that the recognition and localization of general objects are targets learned by many current LVLMs, and their performance on these tasks has been extensively studied [4], [8], [9], we shift our focus to five other widely recognized general tasks: object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. We evaluate these tasks on the COCO [10] dataset and display three representative examples whose outcomes are similar to those of the other tests, as illustrated in Figs. 7-9. Note that, since the COCO dataset provides no ground truth annotations/labels for these general tasks, only empirical investigations are considered for this evaluation.

[Fig. 8: question-answer examples for MiniGPT-v2, Shikra, LLaVA-1.5, and GPT-4V. Question 1: "How many scattered cigarettes are there in the picture?" GT: 3. Question 2: "What color is the airplane in the picture?" GT: No airplane.]

Object Counting
Object counting capability serves as a comprehensive indicator of the perception abilities of LVLMs, since it requires not only the recognition of individual targets but also robust counting. To evaluate this capability, we prompt LVLMs with questions like "How many..." on three images, as shown in Figs. 7-9. The results show that MiniGPT-v2, LLaVA-1.5, and Shikra each answer only one of the three questions correctly (one-third accuracy), whereas GPT-4V fails on all tests. This indicates significant room for improvement in the object counting capability of LVLMs. Moreover, the failure of these models to count challenging objects, including small objects (Fig. 8), underscores the importance of enhancing the visual perception capabilities of the underlying vision models.
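Scoring free-form counting answers requires extracting a number from each response before comparing it with the ground truth. The sketch below shows one simple way to do this; it is an illustrative assumption, not the procedure used in the paper (where the answers are inspected empirically). The helper names `extract_count` and `counting_accuracy` are hypothetical, and only digits and the number words zero to twelve are handled.

```python
import re
from typing import Optional, Sequence

# Number words covered by this sketch (an arbitrary, limited range).
_WORDS = ["zero", "one", "two", "three", "four", "five", "six",
          "seven", "eight", "nine", "ten", "eleven", "twelve"]
_NUMBER_WORDS = {word: value for value, word in enumerate(_WORDS)}


def extract_count(response: str) -> Optional[int]:
    """Return the first number (digits or a number word) in a response, else None."""
    for token in re.findall(r"[a-z]+|\d+", response.lower()):
        if token.isdigit():
            return int(token)
        if token in _NUMBER_WORDS:
            return _NUMBER_WORDS[token]
    return None  # e.g. a refusal that contains no number


def counting_accuracy(responses: Sequence[str], ground_truths: Sequence[int]) -> float:
    """Fraction of responses whose extracted count equals the ground truth."""
    if not ground_truths or len(responses) != len(ground_truths):
        raise ValueError("responses and ground_truths must be non-empty and equal in length")
    correct = sum(extract_count(r) == gt for r, gt in zip(responses, ground_truths))
    return correct / len(ground_truths)


if __name__ == "__main__":
    # Three responses of the kind quoted in Figs. 7-9, with their ground truths.
    responses = [
        "There are two pandas in the picture.",
        "There are two scattered cigarettes in the picture.",
        "There are nine vases in the picture.",
    ]
    print(counting_accuracy(responses, [2, 3, 10]))
```

Taking the first number found is a crude heuristic: a response such as "one of the two pandas" would be misread, so any automated use would need manual spot checks.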

Absurd Question Answering
[Fig. 9: question-answer examples for MiniGPT-v2, Shikra, LLaVA-1.5, and GPT-4V. Question 1: "How many vases are there in the picture?" GT: 10. Question 2: "What color is the airplane in the picture?" GT: No airplane. Question 5: "In the picture, which is closer to the flowers, the red bottle or the light bottle?" GT: Light purple bottle.]

Recent LVLMs seamlessly integrate textual and visual inputs, achieving superior multi-modal understanding capabilities. However, an intriguing question arises: what happens when the text content is irrelevant to the image? To explore this, we pose absurd questions to these models. As illustrated in Figs. 7-9, we ask LVLMs "What color is the airplane in the picture?" on three different images where no airplane is present. The results show that while GPT-4V responds with "no airplane" on all tests, the other three models always give a color for the nonexistent airplane. The incorrect responses indicate that, in such cases, these models cannot effectively utilize visual information and rely heavily on the language input to generate responses. A potential reason for this phenomenon is that the textual input provides prior information to the models, which leads LVLMs to erroneous judgments [33].
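Whether a response rejects the false premise of an absurd question can be approximated with a keyword check. The sketch below is a crude heuristic offered only for illustration: the function name `rejects_premise` and the two regular-expression patterns are assumptions, not part of the paper's evaluation, and they cover only simple phrasings such as "no airplane" or "does not contain an airplane".

```python
import re

# Phrasings treated as a denial that the queried object exists.
# "{obj}" is substituted with the (escaped) object name.
_DENIAL_PATTERNS = [
    r"\bno\s+{obj}\b",
    r"(?:\bnot|n't)\s+(?:contain|show|see|include|have)\w*\s+(?:an?\s+|any\s+)?{obj}\b",
]


def rejects_premise(response: str, obj: str) -> bool:
    """Return True if the response appears to state that `obj` is absent."""
    text = response.lower()
    target = re.escape(obj.lower())
    return any(re.search(p.format(obj=target), text) for p in _DENIAL_PATTERNS)


if __name__ == "__main__":
    # Responses of the kind quoted in Figs. 7-9 for the airplane question.
    for reply in (
        "The answer is white.",
        "The airplane in the picture is red.",
        "There is no airplane in the picture",
        "The picture you've shared does not contain an airplane.",
    ):
        print(rejects_premise(reply, "airplane"), "|", reply)
```

A response that names a color without denying the object is counted as accepting the premise, which corresponds to the hallucinated answers discussed above.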

Affordance Reasoning
Affordance refers to the potential functionalities or interactions that an object can offer, and affordance reasoning reflects a model's cognition of them [7]. We examine the affordance reasoning of LVLMs with questions such as "What objects in the picture can I ...". The outcomes show that these models can accurately identify objects capable of supporting the prescribed actions in most cases. Although MiniGPT-v2 and GPT-4V respond incorrectly in Fig. 8, possibly because the chair is only partially visible, their mention of reasonable objects demonstrates an ability to connect an action with its corresponding object. These results demonstrate their proficient performance in affordance reasoning.

Attribute Recognition
We proceed to validate the object attribute recognition capabilities of the aforementioned models using Question 4, which increases in complexity across the three examples, as illustrated in Figs. 7-9. The results show that MiniGPT-v2 needs more improvement than the other models: it fails to identify all the colors of the flowers in Fig. 7, while the other models perform well in the simple cases (Fig. 7 and Fig. 8). Besides, the failures of LLaVA-1.5 and GPT-4V on the complex case (Fig. 9) indicate that their ability to solve complex problems needs further improvement.

Spatial Relation Reasoning
We evaluate the spatial relation reasoning capability of LVLMs with the last question in Figs. 7-9. In this evaluation, MiniGPT-v2 and Shikra exhibit superior performance, each answering incorrectly only once, while LLaVA-1.5 performs worst, answering incorrectly in all tests. The experimental results show that, compared with MiniGPT-v2 and Shikra, LLaVA-1.5 still has considerable room for improvement in spatial relation reasoning.

Concluding Remarks
In this study, we assess the progress of LVLMs by evaluating their effectiveness in specialized and general tasks. We begin by evaluating the performance of three recent open-source LVLMs, namely MiniGPT-v2, LLaVA-1.5, and Shikra, in six specialized tasks. These tasks include salient/camouflaged/transparent object detection, polyp detection, skin lesion detection, and industrial anomaly detection. Additional empirical investigations are conducted on GPT-4V and the aforementioned models to assess their capabilities in general tasks. The quantitative results indicate that while these models demonstrate promise in specialized tasks, they manifest inadequate transfer capability when directly applied to these tasks (as revealed by Table 7). This limitation stems from their limited understanding of specialized task domains. It is further exacerbated by typical weaknesses of LVLMs, including object hallucination, text-to-image interference, and decreased robustness when confronted with complex problems/concepts. Beyond the lack of transfer capability in specialized tasks, these models also exhibit suboptimal performance in some general tasks, namely object counting, spatial relation reasoning, and absurd question answering. The inadequacies observed in both specialized and general tasks highlight a significant gap that LVLMs have yet to bridge on the path toward AGI. These challenges also highlight the limitations of LVLMs for real-world applications, particularly in critical domains such as healthcare and industry, where errors often have significant negative consequences. The performance and reliability of LVLMs still fall short of the requirements of practical scenarios.

Discussions
Based on the findings presented, we initiate several discussions concerning the application of LVLMs in specialized tasks and their future development. We hope that our discussions will stimulate thought and facilitate further exploration in this area.
Exploring more effective prompts. Although current LVLMs achieve suboptimal performance, they hold great promise in specialized tasks. Hence, exploring effective strategies to enhance their performance is important and would benefit both specialized tasks and LVLMs. In this regard, providing additional information within prompts, a practice known as prompt engineering [40], is a viable strategy to improve their performance, as demonstrated in Fig. 6. This strategy has also been verified by some recent studies, which offer more anomaly definitions in prompts [11] or incorporate additional features of camouflaged targets into the prompts [12].
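Adding task-specific information to a prompt can be sketched as simple string composition. The snippet below is a minimal illustration under stated assumptions: the helper `build_prompt` is hypothetical and is not an API of any of the evaluated models, and the example reproduces an anomaly-detection prompt of the kind quoted in the paper's figures, where a definition of "anomaly" precedes the question.

```python
from typing import Optional


def build_prompt(question: str, extra_info: Optional[str] = None) -> str:
    """Prepend optional task-specific information (e.g. a definition) to a question."""
    question = question.strip()
    if not extra_info:
        return question  # plain prompt without additional information
    extra_info = extra_info.strip()
    if not extra_info.endswith((".", "!", "?")):
        extra_info += "."  # keep the definition a complete sentence
    return f"{extra_info} {question}"


if __name__ == "__main__":
    plain = build_prompt("Are there any anomalies in this image?")
    enriched = build_prompt(
        "Are there any anomalies in this image?",
        "An anomaly in industry refers to a significant deviation from the expected norm",
    )
    print(plain)
    print(enriched)
```

Keeping the question text fixed and varying only the added information makes it easier to attribute any change in model behavior to the extra context.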
Optimizing LVLMs toward specialized tasks. As previously noted, prompt engineering has shown promise in improving the performance of LVLMs. However, its effectiveness is still limited when the targets are difficult to describe clearly, such as in COD and AD. Hence, one future research direction is to optimize LVLMs for specific tasks. This can be achieved by incorporating domain-specific knowledge through techniques such as prompt-tuning or fine-tuning [14], [41], [42], thereby enhancing their performance in specialized tasks.
Mitigating hallucination and other issues. Current LVLMs encounter significant challenges in hallucination [31], [33], [43], [44], which impact their effectiveness in both general and specialized tasks. In future research, overcoming these challenges with advanced techniques, such as the hallucination revisor [43] and chain of visual perception [12], holds promise for enhancing the effectiveness of LVLMs in diverse tasks and facilitating broader application of these models. Besides, it is equally imperative to implement suitable strategies, such as data augmentation that eliminates co-occurrence patterns [45], to address these issues. Apart from hallucination, these models encounter additional challenges, including decreased robustness when confronted with complex problems and reduced effectiveness in numerous general tasks, underscoring the fact that the comprehensive capabilities of current LVLMs remain limited. Future research is anticipated to leverage increasingly challenging datasets/problems while also providing detailed and specific procedures in instruction tuning [9], [46] to enhance the comprehensive capabilities of LVLMs. Besides, adopting advanced techniques such as feedback/reward mechanisms [47], [48] and integrating expert models [49] are also viable avenues to enhance their capabilities.
Incorporating additional visual information. Current LVLMs exhibit a significant limitation in leveraging visual information, as they are restricted to utilizing a single image, typically an RGB image, for each task [50]. It is widely recognized that for certain visual tasks, such as object detection and recognition in complex scenes (e.g., those with heavy background clutter), relying solely on a single modality of visual information poses significant challenges [17], [51]. Therefore, the visual perceptual capabilities of LVLMs will be greatly limited when applied to these tasks. To address this issue, a potential avenue for the future development of LVLMs is to integrate complementary visual information, such as depth [52], [53], [54], [55], [56] and focus cues [51], to augment their perceptual capabilities. The effectiveness of such information has been extensively validated in the domain of computer vision.
Other potential applications of LVLMs. Despite the existing room for improvement, LVLMs have exhibited remarkable proficiency in tasks such as image summarization/description and visual question answering. Their proficiency in these fundamental tasks holds promise for their application in diverse domains. For example, harnessing these capabilities to assist data annotation can significantly reduce annotation cost, which in turn provides more support for training expert models or enhancing model capabilities [57]. Besides, the potential of LVLMs to perform a wide range of video-language tasks, such as video retrieval and video description, has been clearly demonstrated [58]. Inspired by this, LVLMs can be further applied to other video-language tasks, such as video object segmentation [59], [60], [61] and video captioning [62], by first generating object descriptions and then conducting the tasks in a single frame.

Fig. 1. Overall diagram of our evaluation platform. We evaluate the recent LVLMs in both specialized and general tasks using tailored prompts, with and without specifying object types. The specialized tasks include salient object detection (SOD), transparent object detection (TOD), camouflaged object detection (COD), polyp detection, skin lesion detection, as well as industrial anomaly detection (AD). The evaluation is realized by conducting recognition (§ 2) and localization (§ 3) under these tasks, and three recent open-source LVLMs (MiniGPT-v2 [4], LLaVA-1.5 [8], and Shikra [9]) are tested. Besides, empirical investigations are conducted on the COCO [10] dataset to reflect the capabilities of LVLMs in general tasks (§ 4), including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Examples are presented in each figure group, where "<...>" indicates a placeholder that can be replaced with other words/phrases in different tasks.

Question 1: "Is there camouflaged object in the picture? What is it?" GT: No camouflaged object.
Question 2: "Please describe the picture in detail."

Fig. 2. Responses of three LVLMs regarding the perception of camouflaged objects on negative samples. Incorrect responses are underlined in red and marked with crosses.

Fig. 3. Detection and segmentation results of three LVLMs in six specialized tasks. The predicted bounding boxes and ground truth are marked in blue and green, respectively. From left to right in each scenario: detection (top) and segmentation (bottom) results of MiniGPT-v2 [4], LLaVA-1.5 [8], and Shikra [9], as well as segmentation results of the upper bound (top) and the ground truth masks (bottom).

Fig. 4. Responses of three LVLMs regarding locating given objects and recognizing objects of specific types. Predicted bounding boxes and ground truth are marked in blue and green, respectively. From top to bottom: examples of salient object detection, transparent object detection, and camouflaged object detection. Incorrect responses are marked with red underlines and crosses.

Question 1: "An anomaly in industry refers to a significant deviation from the expected norm. Are there any anomalies in this image?"
Question 2: "Please describe the picture in detail."

Fig. 5. Responses of three LVLMs regarding recognizing and locating the anomaly. Predicted bounding boxes and ground truth are marked in blue and green, respectively. The incorrect responses are marked with red underlines and crosses.

Question 3: "Detect flat bumps or tiny mushroom-like stalks in the picture."

Question 4: "What colors are flowers in the picture?" GT: Red and yellow.

I'm sorry, but there seems to be a misunderstanding. The picture you've shared does not contain an airplane. It features …

• Yao Jiang is with Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, and Sichuan University, Chengdu, China.
• Xinyu Yan is with Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, and Tianjin University, Tianjin, China.
• Ge-Peng Ji is with Australian National University, Canberra, Australia.
• Keren Fu is with Sichuan University, Chengdu, China.
• Meijun Sun is with Tianjin University, Tianjin, China.
• Huan Xiong is with Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, and Harbin Institute of Technology, Harbin, China.

TABLE 2
Experimental results for three LVLMs regarding the presence of targets on the positive sets. The highest accuracy score is marked in bold. Given the absence of negative samples in the positive set, resulting in TN = 0 and FP = 0, the metric A in this table is equivalent to Y.
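The equivalence stated in this caption can be verified in one step. The derivation below assumes that A denotes standard accuracy and Y denotes the proportion of "yes" (positive) responses; neither definition appears in the caption itself, so both are assumptions here.

```latex
A = \frac{TP + TN}{TP + TN + FP + FN}
  \;\xrightarrow{\;TN = 0,\ FP = 0\;}\;
  \frac{TP}{TP + FN},
\qquad
Y = \frac{TP + FP}{TP + TN + FP + FN}
  \;\xrightarrow{\;TN = 0,\ FP = 0\;}\;
  \frac{TP}{TP + FN}.
```

Under these assumptions both metrics reduce to the fraction of positive samples answered "yes", so they coincide on the positive sets.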

TABLE 3
Quantitative results of three LVLMs for classifying camouflaged objects.

TABLE 4
Detection and segmentation results of MiniGPT-v2, LLaVA-1.5, and Shikra in natural scenarios. The symbols ↑/↓ indicate that a higher/lower score is better, while the highest scores are marked in bold. The upper bound (on ground truth bounding boxes) of detection and segmentation via LVLMs in diverse specialized tasks is marked in gray.

TABLE 5
Detection and segmentation results of MiniGPT-v2, LLaVA-1.5, and Shikra in healthcare. The symbols ↑/↓ indicate that a higher/lower score is better, while the highest scores are marked in bold. The upper bound (on ground truth bounding boxes) of detection and segmentation via LVLMs in diverse specialized tasks is marked in gray.