Quantifying Interpretability Loss due to Image Compression



Introduction
Video imagery provides a rich source of information for a range of applications including military missions, security, and law enforcement. Because video imagery captures events over time, it can be used to monitor or detect activities through observation by a user or through automated processing. Inherent in these applications is the assumption that the image quality of the video data will be sufficient to perform the required tasks. However, the large volume of data produced by video sensors often requires data reduction through video compression, frame rate decimation, or cropping the field-of-view as methods for reducing data storage and transmission requirements. This paper presents methods for analyzing and quantifying the information loss arising from various video compression techniques. The paper examines three specific issues:


Measurement of image quality: Building on methods employed for still imagery, we present a method for measuring video quality with respect to performance of relevant analysis tasks. We present the findings from a series of perception experiments and user studies which form the basis for a quantitative measure of video quality.


User-based assessments of quality loss: The design, analysis, and findings from a user-based assessment of image compression are presented. The study considers several compression methods and compression rates for both inter- and intra-frame compression.


Objective measures of image compression: The final topic is a study of video compression using objective image metrics. The findings of this analysis are compared to the user evaluation to characterize the relationship between the two and indicate a method for performing future studies using the objective measures of video quality.

Information content and analysis
Video data provides the capability to analyze temporal events, which enables far deeper analysis than is possible with still imagery. At the primitive level, analysis of still imagery depends on the static detection, recognition, and characterization of objects, such as people or vehicles. By adding the temporal dimension, video data reveals information about the movement of objects, including changes in pose and position and changes in the spatial configuration of objects. This additional information can support the recognition of basic activities, associations among objects, and analysis of complex behavior (Fig. 1).
Fig. 1 shows a hierarchy of target recognition information complexity. Each box's color indicates the ability of the developer community to assess performance and provide confidence measures. The first two boxes on the left exploit information in the sensor phenomenology domain. The right two boxes exploit extracted features derived from the sensor data.
To illustrate the concept, consider a security application with a surveillance camera overlooking a bank parking lot. If the bank is robbed, a camera that collects still images might acquire an image depicting the robbers exiting the building and show several cars in the parking lot. The perpetrators have been detected but additional information is limited. A video camera might collect a clip showing these people entering a specific vehicle for their getaway. Now both the perpetrators and the vehicle have been identified because the activity (a getaway) was observed. If the same vehicle is detected on other security cameras throughout the city, analysis of multiple videos could reveal the pattern of movement and suggest the location of the robbers' base of operations. In this way, an association is formed between the event and specific locations, namely the bank and the robbers' hideout. If the same perpetrators were observed over several bank robberies, one could discern their pattern of behavior, i.e. their modus operandi. This information could enable law enforcement to anticipate future events and respond appropriately (Gualdi et al. 2008; Porter et al. 2010).

Image interpretability
A fundamental premise of the preceding example is that the imagery, whether a still image or a video clip, is of sufficient quality that the appropriate analysis can be performed (Le Meur et al. 2010; Seshadrinathan et al. 2010; Xia et al. 2010). Military applications have led to the development of a set of standards for assessing and quantifying this aspect of the imagery. The National Imagery Interpretability Rating Scale (NIIRS) is a quantification of image interpretability that has been widely applied for intelligence, surveillance, and reconnaissance (ISR) missions (Irvine 2003; Leachtenauer 1996; Maver et al. 1995). Each NIIRS level indicates the types of exploitation tasks an image can support based on the expert judgments of experienced analysts. Development of a NIIRS for a specific imaging modality rests on a perception-based approach. Additional research has verified the relationship between NIIRS and performance of target detection tasks (Baily 1972; Driggers et al. 1997; Driggers et al. 1998; Lubin 1995). Accurate methods for predicting NIIRS from the sensor parameters and image acquisition conditions have been developed empirically and substantially increase the utility of NIIRS (Leachtenauer et al. 1997; Leachtenauer and Driggers 2001).
The NIIRS provides a common framework for discussing the interpretability, or information potential, of imagery. NIIRS serves as a standardized indicator of image interpretability within the community. An image quality equation (IQE) offers a method for predicting the NIIRS of an image based on sensor characteristics and the image acquisition conditions (Leachtenauer et al. 1997; Leachtenauer and Driggers 2001). Together, the NIIRS and IQE are useful for:


- Communicating the relative usefulness of the imagery,
- Documenting requirements for imagery,
- Managing the tasking and collection of imagery,
- Assisting in the design and assessment of future imaging systems, and
- Measuring the performance of sensor systems and imagery exploitation devices.
The foundation for the NIIRS is that trained analysts have consistent and repeatable perceptions about the interpretability of imagery. If more challenging tasks can be performed with a given image, then the image is deemed to be of higher interpretability. A set of standard image exploitation tasks or "criteria" defines the levels of the scale. To illustrate, consider Fig. 2. Several standard NIIRS tasks for visible imagery appear at the right. Note that the tasks for levels 5, 6, and 7 can be performed, but the level 8 task cannot.
The grill detailing and/or license plate on the sedan are not evident. Thus, an analyst would assign a NIIRS level of 7 to this image.

Fig. 2. Illustration of NIIRS for a still image
Recent studies have extended the NIIRS concept to motion imagery (video). In exploring avenues for the development of a NIIRS-like metric for motion imagery, a clearer understanding of the factors that affect the perceived quality of motion imagery was needed (Irvine et al. 2006a; Young et al. 2010b). Several studies explored specific aspects of this problem, including target motion, camera motion, frame rate, and the nature of the analysis tasks (Hands 2004; Huynh-Thu et al. 2011; Moorthy et al. 2010). Factors affecting perceived interpretability of motion imagery include the ground sample distance (GSD) of the imagery, motion of the targets, motion of the camera, frame rate (temporal resolution), viewing geometry, and scene complexity. These factors have been explored and characterized in a series of evaluations with experienced imagery analysts:

- Spatial resolution: Evaluations show that for motion imagery the interpretability of a video clip exhibits a linear relationship with the natural log of the ground sample distance (GSD), at least for clips where the GSD is fairly constant over the clip (Cermak et al. 2011; Irvine et al. 2004; Irvine et al. 2005; Irvine et al. 2007b).


- Motion and Complexity: User perception evaluations assessed the effects of target motion, camera motion, and scene complexity on perceived image quality (Irvine et al. 2006b). The evaluations indicated that target motion has a significant positive effect on perceived image quality, whereas camera motion has a barely discernable effect.


- Frame Rate: These evaluations assessed object detection and identification and other image exploitation tasks as a function of frame rate and contrast (Fenimore et al. 2006). The study demonstrated that an analyst's ability to detect and recognize objects of interest degrades at frame rates below 15 frames per second. Furthermore, the effect of reduced frame rate is more pronounced with low contrast targets.


- Task Performance: The evaluations assessed the ability of imagery analysts to perform various image exploitation tasks with motion imagery. The tasks included detection and recognition of objects, as might be done with still imagery, and detection and recognition of activities, which relies on the dynamic nature of motion imagery (Irvine et al. 2006b; Irvine et al. 2007c). Analysts exhibited good consistency in the performance of these tasks. In addition, dynamic exploitation tasks that require detection and recognition of activities are sensitive to the frame rate of the video clip.
Building on these perception studies, a new Video NIIRS was developed (Petitti et al. 2009; Young et al. 2009). The work presented in this paper quantifies video interpretability using a 100-point scale described in Section 3 (Irvine et al. 2007a; Irvine et al. 2007b; Irvine et al. 2007c). The scale development methodologies imply that each scale is a linear transform of the other, although this relationship has not been validated (Irvine et al. 2006a; Irvine et al. 2006b). Other methods for measuring video image quality frequently focus on objective functions of the imagery data, rather than perception of the potential utility of the imagery to support specific types of analysis (Watson et al. 2001; Watson and Kreslake 2001; Winkler 2001; Winkler et al. 2001).

Image compression
A recent study of compression for motion imagery focused on objective performance of target detection and target tracking tasks to quantify the information loss due to compression (Gibson et al. 2006). Gibson et al. (2006) leverage recent work aimed at quantification of the interpretability of motion imagery (Irvine et al. 2007b). Using techniques developed in these earlier studies, this paper presents a user evaluation of the interpretability of motion imagery compressed under three methods at various bitrates. The interpretability of the native, uncompressed imagery establishes the reference for comparison (He and Xiong 2006; Hewage et al. 2009; Yang et al. 2010; Yasakethu et al. 2009).

Data compression
The dataset for the study consisted of the original (uncompressed) motion imagery clips and clips compressed by three compression methods at various compression rates (Abomhara et al. 2010). The three compression methods were:

- JPEG2000 (intraframe)
- H.264 (intraframe and interframe)
- MPEG-2 (intraframe and interframe)
All three were exercised in intraframe mode. Each of the parent clips was compressed to three megabits per second, representing a modest level of compression. In addition, each parent clip was severely compressed to examine the limits of the codecs. Actual bitrates for these severe cases depend on the individual clip and codec. The choice of compression methods and levels supports two goals: comparison across codecs and comparison of the same compression method at varying bitrates. Table 1 shows the combinations represented in the study. We recorded the actual bit rate for each product and use this as a covariate in the analysis.
The study used the Kakadu implementation of JPEG2000, the Vanguard Software Solutions, Inc. implementation of H.264, and Adobe Premiere's MPEG-2 codec. In each case, a 300-frame key frame interval was used for interframe compression unless otherwise noted. Intraframe encoding is comparable to interframe encoding with a key frame interval of 1.
The study used progressive scan motion imagery in an 848 x 480 pixel raster at 30 frames per second (f/s). Since most of the desirable source material was available to us as 720p HD video, a conversion process was employed to generate the lower resolution/lower frame rate imagery. We evaluated the conversion process to assure that the goals of the study could be met. The video clips were converted using Adobe Premiere tools.

Table 1. Codecs and Compression Rates. Note: the severe bitrate represents the limit of the specific codec on a given clip.

Experimental design
The study consists of two parts. Both parts used the set of compression products described above. The first part was an evaluation in which trained imagery analysts viewed the compressed products and the original parent clip to assess the effects of compression on interpretability. The second part of the study implemented a set of computational image metrics and examined their behavior with respect to bitrate and codec. The typical duration of each clip is 10 seconds. Ten video clips were used for this study.

User-based evaluation of compression
To quantify image interpretability, a subjective rating scale was developed by Irvine et al. (2007c), based on consistent ratings by trained imagery analysts. The scale assigns the value 0 to a video clip of no utility and 100 to clips that could support any of the analysis tasks under consideration (Fig. 3). Three additional clips identified in this study formed markers to evenly divide the subjective interpretability space. Thus, reference clips were available at subjective rating levels of 0, 25, 50, 75, and 100.

Fig. 3. NIIRS Development Functional Decomposition
A set of specific image exploitation tasks was reviewed by imagery analysts and rated relative to these marker video clips. In this way, these analysis tasks were calibrated to the subjective rating scale. A subset of these "calibrated" analysis tasks was used to evaluate the compressed video products (Table 2). Note that some of these tasks do not require analysis of temporal activity and could be performed with still imagery. We label these as "static" tasks. A second set of tasks is "dynamic" because these tasks require direct observation or inference about movements of objects.
Image analysts rated their confidence in performing each image exploitation task with respect to each compression product, including the original (uncompressed) clip. We calculated an overall interpretability rating from each analyst for each clip.

Approach
For each parent clip, three criteria (image exploitation tasks) were assigned. The considerations for selecting the criteria were:


- The criteria should "bound" the interpretability of the parent clip, i.e. at least one of the three should be difficult to perform and one should be easy
- The criteria (or at least some of the criteria) should reference objects and activity that are comparable to the content of the clip
- The criteria should have exhibited low rater variance in the previous evaluations

Analysis and findings
The data analysis progresses through several stages: verification and quality control, exploratory analysis to uncover interesting relationships, and statistical modeling to validate findings and formally test hypotheses of interest. The initial analysis examined the data for anomalies or outliers. None were found in this case.
Next, we calculated an overall interpretability rating from each analyst for each clip. The method for calculating these ratings was as follows: each of the three criteria used to rate each clip was calibrated (on a 0-100 scale) in terms of interpretability, where this calibration was derived from an earlier evaluation (Irvine et al. 2007c). Multiplying the interpretability level by the IA's confidence rating produces a score for each criterion. The final interpretability score (Equation 1) was the maximum of the three scores for a given clip.
Interpretability Score(j, k) = max {C_i,j,k x I_i,k : i = 1, 2, 3} / 100   (1)

where C_i,j,k is the confidence rating by the jth IA on the kth clip for the ith criterion and I_i,k is the calibrated interpretability level for that criterion. All subsequent analysis presented below is based on this final interpretability score. The remaining analysis is divided into two sections: interframe compression and intraframe compression. Ultimately, we compared the analyst-derived utility measures, as a NIIRS surrogate, to the automated computational values.
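To make Equation 1 concrete, here is a minimal sketch of the scoring computation, assuming (as the division by 100 suggests) that confidence ratings and calibrated interpretability levels are both on 0-100 scales; the function name and the example numbers are illustrative, not taken from the study.

```python
def interpretability_score(confidences, levels):
    """Equation 1: max over the three criteria of C_i * I_i, divided by 100.

    confidences -- C_i, one analyst's confidence rating per criterion (0-100)
    levels      -- I_i, the calibrated interpretability level per criterion (0-100)
    """
    return max(c * i for c, i in zip(confidences, levels)) / 100.0

# Hypothetical example: criteria calibrated at levels 40, 60, and 85,
# rated with confidences 90, 70, and 20 -> max(3600, 4200, 1700) / 100 = 42.0
score = interpretability_score([90, 70, 20], [40, 60, 85])
```

Note that a high-confidence rating on an easy (low-level) criterion can outscore a low-confidence rating on a hard criterion, which matches the "maximum of the three scores" rule in the text.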
All three codecs yielded products for the evaluation. However, MPEG-2 would not support extreme compression rates. Bitrate was the dominant factor, but pronounced differences among the codecs emerged as well (Fig. 6 and Fig. 7). At modest compression rates, MPEG-2 exhibited a substantial loss in interpretability compared to either H.264 or JPEG-2000. Only JPEG-2000 supported more extreme intraframe compression. A computational model was developed to characterize the significance of the effects of codec, scene, and bitrate on quality. There were systematic differences across the clips, as expected, but the effects of the codecs and bitrates were consistent. When modeled as a covariate, the effects of bitrate dominate. The effect due to codec is modest, but still significant. As expected, there is a significant main effect due to scene, but no scene-by-codec interaction.

Interframe compression
Analysis of the interframe ratings shows a loss in image interpretability for both MPEG-2 and H.264 as a function of bitrate (Fig. 4). The initial compression from the native rate to 3 Mbits per second corresponds to a modest loss in interpretability. This finding is consistent with previous work. At extreme compression levels (below 1 Mbit per second), the interpretability loss is substantial. H.264 generally supported more extreme compression levels, but the interpretability degrades accordingly. Although the exact compression level varies by clip, the pattern is clear for all clips. Statistical analysis shows that bitrate, modeled as a covariate, is the primary factor affecting interpretability. For interframe compression, the differences between H.264 and MPEG-2 are small but statistically significant (Table 3).
The pattern holds across all scenes, as indicated by the lack of a codec-by-scene interaction effect.

Intraframe compression
In the case of intraframe compression, all three codecs yielded products for the evaluation, although MPEG-2 would not support extreme compression rates. The findings in this case are slightly different than for interframe compression. Bitrate remained the dominant factor, but more pronounced differences among the codecs emerged (Figure 4). At modest compression rates, MPEG-2 exhibited a substantial loss in interpretability compared to either H.264 or JPEG-2000. Only JPEG-2000 supported more extreme intraframe compression, and highly compressed renditions were produced from all of the parent clips. As with the interframe comparisons, there were systematic differences across the clips, as expected, but the effects of the codecs and bitrates were consistent. The analysis of covariance confirms these statistical effects (Table 4). When modeled as a covariate, the effects of bitrate dominate. The effect due to codec is modest, but still significant. As expected, there is a significant main effect due to scene, but no scene-by-codec interaction.

Computational measures and performance assessment
In the previous section, analyst assessments of image quality were characterized. This section identifies computational attributes for image quality that can be extracted from video clips. Performance measures evaluate the computational image quality metrics and provide an understanding of how well they compare to codec, bitrate, and scene parameters.

Computational image metrics
We reviewed a variety of image metrics to quantify image quality (Bhat et al. 2010; Chikkerur et al. 2011; Culibrk et al. 2011; Huang 2011; Sohn et al. 2010). Based on a review of the literature and assessment of the properties of these metrics, we selected four measures for this study: two edge-based metrics, the structural similarity image metric (SSIM), and SNR. SSIM and the edge metrics are computed at each pixel location, so the result can be viewed as an image (Fig. 6). SNR metrics deal with overall information content and cannot be visualized as an image. These metrics were computed for the original (uncompressed) clips and for all of the compressed products. We present the computation methods and the results below.

SSIM
The first metric for image quality is the Structural Similarity Image Metric (Wang et al. 2004). SSIM quantifies differences between two images, I_1 and I_2, by taking three variables into consideration: luminance, contrast, and spatial similarity. For grey level images, these variables are measured as the mean, standard deviation, and Pearson's correlation coefficient between the two images, respectively. For our application, the RGB data was converted to grey level using the standard Matlab function. Let u_1 and u_2 denote the local means, s_1 and s_2 the local standard deviations, and s_12 the covariance of the corresponding sub-images. Then

SSIM = (2 u_1 u_2 + C_1)(2 s_12 + C_2) / ((u_1^2 + u_2^2 + C_1)(s_1^2 + s_2^2 + C_2))

where the small constants C_1 and C_2 avoid singularities, e.g., when both means are 0. SSIM is computed locally on each corresponding MxM sub-image of I_1 and I_2. In practice, the sub-image window size is 11x11, implemented as a convolution filter. The SSIM value is the average across the entire image.
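As a sketch of the computation described above, the following computes a mean SSIM with an 11x11 window. It uses a uniform (box) window for simplicity, whereas implementations following Wang et al. (2004) typically use a Gaussian window, so exact values would differ slightly from a reference implementation.

```python
import numpy as np

def ssim(img1, img2, win=11, L=255, K1=0.01, K2=0.03):
    """Mean SSIM between two grey-level images (box-window sketch).

    L is the dynamic range of the pixel values; C1 = (K1*L)^2 and
    C2 = (K2*L)^2 are the stabilizing constants from Wang et al. (2004).
    """
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

    def box_filter(x):
        # Local mean via a separable box filter, valid region only.
        k = np.ones(win) / win
        x = np.apply_along_axis(lambda r: np.convolve(r, k, 'valid'), 1, x)
        return np.apply_along_axis(lambda c: np.convolve(c, k, 'valid'), 0, x)

    mu1, mu2 = box_filter(img1), box_filter(img2)
    s11 = box_filter(img1 * img1) - mu1 * mu1   # local variance of img1
    s22 = box_filter(img2 * img2) - mu2 * mu2   # local variance of img2
    s12 = box_filter(img1 * img2) - mu1 * mu2   # local covariance

    ssim_map = ((2 * mu1 * mu2 + C1) * (2 * s12 + C2)) / \
               ((mu1 ** 2 + mu2 ** 2 + C1) * (s11 + s22 + C2))
    return ssim_map.mean()
```

The intermediate `ssim_map` is exactly the per-pixel "SSIM image" referred to in the text (see Fig. 6g), where darker areas mark more noticeable differences.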

Edge metrics
Two edge metrics were examined. The first is denoted CE, for common edges, and the second is denoted SE, for strength of edges (O'Brien et al. 2007). Heuristically, CE measures the ratio of the number of edges in a compressed image to the number of edges in the original, whereas SE measures the ratio of the strength of the edges in a compressed version to the strength of the edges in the original.
Given two images I_1 and I_2, CE(I_1, I_2) and SE(I_1, I_2) are computed as follows. From the grey level images, edge images are constructed using the Canny edge operator. The edge images are designated E_1 and E_2. Assume that the values in E_1 and E_2 are 1 for an edge pixel and 0 otherwise. Let "*" denote the pixel-wise product. Let G_1 and G_2 denote the gradient images of I_1 and I_2, respectively. G(m,n) was approximated as the maximum absolute value of the set {I(m,n) - I(m+t1, n+t2) | -6 < t1 < 6 and -6 < t2 < 6}, i.e. the maximum difference between the center value and all values in an 11x11 neighborhood around it. With that notation, CE is the sum of the common-edge image E_1 * E_2 divided by the sum of E_1, and SE is the corresponding ratio of the gradient strengths G_2 and G_1 accumulated over the edge pixels, where each sum is taken over all the pixels within a given frame.
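The gradient approximation and the CE/SE ratios can be sketched as follows. Since the original equations did not survive, the `ce_se` formulas are a plausible reconstruction from the prose (edge-count and edge-strength ratios over the common-edge mask), and the wrap-around behavior of `np.roll` at the frame border is a simplification of whatever border handling the study used.

```python
import numpy as np

def neighborhood_gradient(img, r=5):
    """G(m,n): max absolute difference between the center pixel and all
    pixels within Chebyshev radius r (the text's -6 < t < 6 gives r = 5,
    an 11x11 neighborhood)."""
    img = img.astype(np.float64)
    g = np.zeros_like(img)
    for t1 in range(-r, r + 1):
        for t2 in range(-r, r + 1):
            # np.roll wraps at the border; acceptable for a sketch.
            shifted = np.roll(np.roll(img, t1, axis=0), t2, axis=1)
            g = np.maximum(g, np.abs(img - shifted))
    return g

def ce_se(e1, e2, g1, g2):
    """Plausible reconstruction of CE and SE.

    e1, e2 -- binary edge images of the original and compressed frames
    g1, g2 -- gradient images of the original and compressed frames
    """
    common = e1 * e2                           # pixel-wise product of edges
    ce = common.sum() / max(e1.sum(), 1)       # edge-count ratio
    se = (g2 * common).sum() / max((g1 * e1).sum(), 1)  # edge-strength ratio
    return ce, se
```

The paper uses the Canny operator to form E_1 and E_2; any edge detector producing a 0/1 map can be substituted in this sketch.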
An additional set of edge operators was also applied. These operators are called edge strength (ES) metrics. Let Y_1 be the luminance component of an original frame from a clip and let Y_2 be the corresponding frame after compression processing, also in luminance. We apply a Sobel filter, S, to both Y_1 and Y_2, where for a grayscale frame F,

S(F) = sqrt((H conv F)^2 + (V conv F)^2)

with "conv" denoting 2-D convolution. The filters H and V used in the Sobel edge detector are

H = [-1 0 1; -2 0 2; -1 0 1],  V = H^T.

We define two metrics, one for local loss of edge energy (EL) (thus finding blurred edges from Y_1 in Y_2) and the other for the addition of edge energy (thus finding edges added to Y_2 that are weaker in Y_1). Each metric examines the strongest edges in one image (either Y_1 or Y_2) and compares them to the edges at the corresponding pixels in the other (Y_2 or Y_1).
For the grayscale image F, let I(F, f) be the set of image pixels, p, where F(p) is at least as large as f * max(F). That is,

I(F, f) = {p : F(p) >= f * max(F)}.

Using the definition of I(F, f), the two edge metrics compare the mean values of S(Y_1) and S(Y_2): for EL, the means are taken over the set I(S(Y_1), 0.99); for the added-energy metric, the means are taken over the set I(S(Y_2), 0.99).
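A sketch of this edge-strength machinery follows, under the same caveat that the exact EL equation did not survive extraction: `sobel` computes the Sobel gradient magnitude, `strongest` implements I(F, f), and `edge_energy_loss` is a hypothetical ratio form of EL (mean compressed-edge strength over the strongest original edges, relative to the original mean).

```python
import numpy as np

def sobel(frame):
    """Sobel gradient magnitude S(F) = sqrt((H conv F)^2 + (V conv F)^2)."""
    h = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    v = h.T
    f = frame.astype(np.float64)
    pad = np.pad(f, 1, mode='edge')          # replicate border pixels
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    for i in range(3):
        for j in range(3):
            sub = pad[i:i + f.shape[0], j:j + f.shape[1]]
            gx += h[i, j] * sub
            gy += v[i, j] * sub
    return np.hypot(gx, gy)

def strongest(F, f=0.99):
    """I(F, f): boolean mask of pixels where F >= f * max(F)."""
    return F >= f * F.max()

def edge_energy_loss(y1, y2, f=0.99):
    """Hypothetical EL: ratio of mean edge strength of the compressed
    frame to the original, over the strongest original edges."""
    s1, s2 = sobel(y1), sobel(y2)
    mask = strongest(s1, f)
    return s2[mask].mean() / s1[mask].mean()
```

Swapping the roles of Y_1 and Y_2 (masking on `strongest(s2, f)` instead) gives the companion added-edge-energy metric described in the text.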

SNR
Finally, we examined the peak signal-to-noise ratio (PSNR). The PSNR is defined for a pair of m x n luminance images, Y_1 and Y_2. Let MSE be defined by

MSE = (1 / (m n)) * sum_{i,j} [Y_1(i,j) - Y_2(i,j)]^2.

The PSNR is then defined as

PSNR = 10 log_10 (MAX_I^2 / MSE)

where MAX_I is the maximum pixel value of the image. In our case, MAX_I is taken to be 255.
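The PSNR computation is standard and can be sketched directly:

```python
import numpy as np

def psnr(y1, y2, max_i=255.0):
    """PSNR in dB between two m x n luminance images."""
    mse = np.mean((y1.astype(np.float64) - y2.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')   # identical images: PSNR is unbounded
    return 10.0 * np.log10(max_i ** 2 / mse)
```

For 8-bit imagery a uniform error of 16 grey levels, for example, gives MSE = 256 and PSNR = 10 log10(255^2 / 256), about 24 dB.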

Metrics and performance
The image metrics were plotted (Fig. 7). The image metrics are all highly correlated across both bitrate and codec, for both intraframe and interframe compression techniques. For the set of clips with a 300-frame key frame interval, the correlation was greater than 0.9. In each case, lower information content is indicated by a lower position on the Y axis (quality); the X axis is the target bitrate. Due to the high correlation, a single computational metric was chosen for more detailed analysis to quantify the relationship between image quality and bitrate. SSIM was selected because it generates an image that can be used to diagnose unexpected values and because the computation is based upon perceptual differences in spatial similarity and contrast. H.264's asymptotic quality improvement is observed in the rise of the graph over the initial frames (Fig. 7). This corresponds exactly to where the algorithm is increasing the fidelity of the compressed frames to the original frames. Along this initial portion of the clip, the metrics agree with the human perception of increasing image quality.
Fig. 8 plots SSIM versus frame at differing bitrates for the H.264 codec, which is an interframe codec. The saw-tooth nature of the graph is the result of the group of pictures (GOP) sequence. The peak and trough differences arise between the bidirectionally predicted (B) frames, interpolated between key frames, and the predicted (P) encoded frames.
The observations for the metrics listed above for H.264 were also visually evident in the case of MPEG compression. Close inspection of the clips shows the quality to be lower in the case of MPEG than for H.264. The example in Fig. 9 is taken from a clip that was compressed to 2 Mbits/second using both codecs. While discernible in both the original and the H.264 compressed versions, some of the individuals' heads are nearly totally lost in the MPEG version. Each clip was compressed using H.264 with one key frame every 300 frames.

Discussion
These experiments demonstrate the existence of several metrics that are monotonic with bitrate. The metrics showed considerable sensitivity to image quality that matched the authors' observations. Specifically, the MPEG quality was considerably lower than H.264 at the same bitrate. The knee of the quality curves lies between 500k and 1000k bps. In addition, the metrics were sensitive to the encoded structure of the individual frames, as the saw-tooth differences between the B and P frames were readily observable. A qualitative comparison of the objective metrics to the user assessment of interpretability shows strong consistency. Compression of these video products to bitrates below 1,000k bps yields discernable losses in image interpretability. The objective metrics show a similar knee in the curve. These data suggest that one could estimate the loss in interpretability from compression using the objective metrics and derive a prediction of the loss in Video NIIRS.
Development of such a model would require conducting a second user experiment to establish the relationship between the subjective interpretability scale used in this study and the published Video NIIRS. The additional data from such an experiment would also support validation of a model for predicting loss due to compression.

Conclusion
The evaluations and analyses presented in this Chapter characterize the loss in perceived interpretability of motion imagery arising from various compression methods and compression rates. The findings build on previous studies (Irvine et al. 2007a; O'Brien et al. 2007) and are consistent with other evaluations of video compression (Gibson et al. 2006; Young et al. 2010a). Evaluation of image compression for motion imagery illustrates how interpretability-based methods can be applied to the analysis of the image chain. We present both objective image metrics and analysts' assessments of various compressed products. The results show good agreement between the two approaches. Further research, including the user experiment described in the preceding section, would establish the relationship between the subjective interpretability scale and the published Video NIIRS.

Fig. 4. Summary Comparison Across Codec and Bitrate

Fig. 5. Summary Comparison Across Codec and Bitrate

Fig. 6. (a) Original, (b) compressed version, (c, d) original and compressed edge images, (e) overlaid edge images, where red, blue, and magenta indicate edges from the original, the compressed version, and both, respectively, (f) edge intensities, and (g) the SSIM image, where darker areas represent more noticeable differences.

Fig. 7. Target Bitrate (k bps) versus Image Metric: SSIM, EL, ES, and PSNR. Fig. 7 indicates that SSIM, CE, and SE each separate image quality by bitrate.

Fig. 8. Plot of the SSIM evaluated on each frame for 11 different bitrates. Each clip was compressed using H.264 with one key frame every 300 frames.

Table 3. Analysis of Covariance for Interframe Comparisons

Table 4. Analysis of Covariance for Intraframe Comparisons