Exploiting Data-Parallelism on Multicore and SMT Systems for Implementing the Fractal Image Compressing Problem

This paper presents a parallel model of a lossy image compression method based on fractal theory and evaluates it on two dual-core processors: one with and one without simultaneous multithreading (SMT) support. The idea is to observe the speedup on both configurations when changing the application parameters and the number of threads at the operating system level. Our target application is particularly relevant in the Big Data era. Huge amounts of data often need to be sent over low- or medium-bandwidth networks, or saved on devices with limited storage capacity, which motivates efficient image compression. In particular, fractal compression is a CPU-bound coding method known for offering high file reduction rates at the cost of highly time-consuming computation. The structure of the problem allowed us to explore data-parallelism by implementing an embarrassingly parallel version of the algorithm. Despite its simplicity, our model is useful for fully exploiting and evaluating the considered architectures. When comparing performance on both processors, the results showed that the SMT-based one achieved gains of up to 29%. Moreover, they emphasized that a larger number of threads does not always reduce the application time. On average, the results showed a curve in which a strong time reduction is achieved when working with 4 and 8 threads on the pure and SMT dual-core processors, respectively. Beyond these points, the execution time grows slowly as the number of threads increases, owing to both task granularity and thread management.


Introduction
In the era of Big Data, the topic of image compression becomes more and more relevant (Chen et al., 2012; Revathy & Jayamohan, 2012; Sundaresan & Devika, 2012). The main objective is to reduce the irrelevance and redundancy of the image data in order to store or transmit it efficiently. For instance, images obtained by experiments in the fields of astronomy, medicine and geology may occupy several gigabytes of memory, which reinforces the need for proper image compression (Pinto & Gawande, 2012). In this context, a technique called Fractal Image Compression (FIC) appears as one of the most efficient solutions for reducing file size (Jeng et al., 2009; Khan & Akhtar, 2013). An expensive encoding phase characterizes the FIC method, since the search used in the algorithm to find self-similarities is time-consuming. A square image with a side of 1024 pixels may take more than an hour to be compressed on a single-processor system. This explains why the technique is not widespread among traditional operating systems. However, at high compression ratios, fractal compression may offer better quality than JPEG and other discrete cosine transform (DCT)-based algorithms (George & Al-Hilo, 2009). Unlike the coding phase, the decoding phase is fast, which, for instance, enables users to download compressed images or videos from Web servers and view them on their hosts within a reasonable time.
Given the slow encoding phase of the FIC method, several alternatives have been proposed to shorten this process. Most of them try to reduce the coding time by reducing the search for the best-match block in a large domain pool (Fu & Zhu, 2009; Jeng et al., 2009; Mitra et al., 1998; Qin et al., 2009; Revathy & Jayamohan, 2012; Rowshanbin et al., 2006; Sun & Wun, 2009; Vahdati et al., 2010). Other possibilities consist in exploiting the power of parallel architectures such as nCUBE (Jackson & Blom, 1995), SIMD (Single Instruction Multiple Data) processors (Khan & Akhtar, 2013; Wakatani, 2012) and clusters (Righi, 2012; Qureshi & Hussain, 2008). The use of multithreading on recent computing systems is a possibility that has not been deeply explored for solving the FIC problem (Cao & Gu, 2010; Cao & Gu, 2011). The authors of these last initiatives presented an OpenMP solution that was tested on a quad-core processor. Besides multicore, we focus our attention on SMT (Simultaneous Multithreading) (Raasch & Reinhardt, 2003) capability, since both technologies are common in off-the-shelf computers. Some researchers claim that, in the coming years, a processor will contain tens or hundreds of cores, each one with multiple execution threads (Note 1) (Diamond et al., 2011; Rai et al., 2010). This emphasizes the significance of modeling applications for such architectures.
The performance improvement obtained by using multicore and SMT technologies depends on the software algorithms and their implementations. Task granularity, thread synchronization and scheduling, memory allocation, condition variables and mutual exclusion are parameters under user control that must be carefully analyzed in order to extract the power of these technologies. In this context, the present paper describes the FIC technique and its thread-based implementation. The FIC problem allows a program organization without data dependencies among the threads, which is especially useful for observing key performance factors on parallel machines. Therefore, we modeled an embarrassingly parallel application by exploiting data-parallelism in the aforementioned problem. Unlike Cao and Gu (2010; 2011), we obtained the results by varying the input image, the application parameters and the target machine. In particular, we used two dual-core machines, one with and another without SMT capability. In this case, SMT doubles the number of execution threads from 1 per core to 2, increasing processor throughput by multiplexing the execution threads onto a common set of pipeline resources. Our evaluation confirmed gains of up to 29% when enabling SMT. Besides computer architecture information, this paper also discusses the impact of the number of threads and task granularity on the obtained results.
This paper is organized as follows. Section 2 describes the two traditional approaches to image compression. The FIC method is presented in detail in Section 3. Section 4 shows the parallel model proposed for the FIC problem, while Section 5 describes its implementation. The tests and the discussion of the results are presented in Section 6. Section 7 presents some related work. Finally, Section 8 presents the concluding remarks and future work, and emphasizes the main contribution of the work.

Image Compression
A pixel is the minimum unit that defines an image. A digital image is a two-dimensional matrix of pixels whose spatial resolution is I × J, where both I and J ∈ N and each matrix element value identifies a set of discretized attributes (e.g., gray level, color, transparency, and so on). Consequently, the larger the image, the greater the number of pixels and discretized attributes to store, where each pixel is represented by a collection of bits, normally 16, 24 or 32 bits. For instance, 16 Mbytes of memory are required to store a single 2048 × 2048 image with 32 bits/pixel. In addition, some square images obtained by researchers can present dimensions of up to 10^6, which makes clear the importance of the image compression field. We can classify compression methods into two classes: (i) lossless compression; and (ii) lossy compression.
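The storage figure quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `image_bytes` is hypothetical, and it assumes an uncompressed image with no header or padding.

```python
def image_bytes(width, height, bits_per_pixel):
    """Raw size in bytes of an uncompressed image (assumes no header or padding)."""
    return width * height * bits_per_pixel // 8


# The 2048 x 2048 image with 32 bits/pixel mentioned in the text.
size = image_bytes(2048, 2048, 32)
print(size / 2**20)  # size in Mbytes (2**20 bytes each): 16.0
```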

Lossless Compression
Lossless compression is usually employed in situations in which the information needs to be kept intact after decompression. Medical images, technical drawings and texts are examples of the use of the lossless approach (Chen & Chuang, 2010). First, this process transforms an input image f(x) into f'(x). According to Fu and Zhu (2009), this transformation can include differential and predictive mapping, unitary transforms, sub-band decomposition and color space conversion. After that, the data-to-mapping stage converts f'(x) into symbols, using partitioning or run-length coding (RLC).
The lossless symbol coding stage generates a bit-stream by assigning binary codewords to the symbols that were already mapped. Lossless compression is usually achieved by using variable-length codewords. This variable-length codeword assignment is known as variable-length coding (VLC) and also as entropy coding. Figure 1 depicts the process for obtaining a compressed image through the lossless method. This method is used in algorithms that produce BMP, TGA, TIFF and PNG images.
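To make the run-length coding (RLC) step mentioned above concrete, the following minimal sketch encodes a row of gray levels as (value, run length) pairs and decodes it back without loss. The function names and the sample row are illustrative assumptions, not part of the method described in this paper.

```python
from itertools import groupby


def rle_encode(symbols):
    """Collapse each run of equal symbols into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(symbols)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


row = [255, 255, 255, 0, 0, 17]   # hypothetical row of gray levels
encoded = rle_encode(row)         # [(255, 3), (0, 2), (17, 1)]
assert rle_decode(encoded) == row  # lossless: decoding restores the input
```

In a complete lossless coder, the pairs produced here would then go through the variable-length (entropy) coding stage.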

Lossy Compression
Lossy com process, th (Jeng et

Experimental Results and Discussion
We used two input images in our evaluation. The first is the Lenna (Note 2) picture with 256x256 pixels, while the second is a Coliseum photo with 512x512 pixels. Each experiment was run 30 times, and we computed the mean value and the standard deviation. Considering all the tests, the highest standard deviation for the 256x256 image was 2.78% of the average, while the value obtained for the 512x512 input was 1.51%. We started the time counter before launching the first thread and stopped it after all threads had finished. This method excludes the sequential code from the measurements. Table 1 presents the PSNR obtained when varying the range size. The number of threads does not affect this index, since the output image is always the same. The 2x2-sized range achieved the best results, owing to its better entropy when compared to larger ranges. Images with a PSNR greater than 21 dB are visually acceptable to human observers (Türkan et al., 2012). We achieved a compression ratio of 2:1 for both images when employing ranges with dimensions 2x2. However, compression ratios of 234:1 and 250:1 were observed for 32x32-sized ranges on the Lenna and Coliseum images, respectively. Tables 2 and 3 present the evaluation of both input images on a dual-core machine without SMT support.
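For reference, the PSNR index discussed above can be computed as in the sketch below. It uses the standard definition PSNR = 10·log10(MAX² / MSE) and assumes 8-bit gray levels (MAX = 255) and images stored as lists of rows; the text does not state which implementation was used, so the function name and the data layout are assumptions.

```python
import math


def psnr(original, decoded, max_value=255):
    """PSNR in decibels between two equally sized gray-level images.

    Both images are lists of rows; max_value=255 assumes 8-bit pixels.
    """
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for a, b in zip(row_o, row_d):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: the PSNR is unbounded
    return 10 * math.log10(max_value ** 2 / mse)
```

A decoded image that is closer to the original yields a lower mean squared error and therefore a higher PSNR.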
As expected, the best results appear when testing 2 or 4 threads. For example, when testing only one thread with a range dimension equal to 4, the result was 6.57 seconds. This configuration does not take advantage of the parallel machine. However, the execution with 32 threads presented the highest execution time among the multithreaded executions. This behavior is explained by the overhead of mutex, synchronization and thread management primitives. The larger the number of threads, the higher this overhead. This illustrates a common behavior when evaluating threads on dual-core processors, where the application time decreases abruptly with 2 and 4 threads and grows slowly as the number of threads increases. Figure 6 illustrates the speedup (sequential_time ÷ parallel_time) and the parallel efficiency (speedup ÷ processors) for the tests with the 2x2 range.
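The two metrics defined above translate directly into code. The sketch below is illustrative only; the function names are hypothetical, and the sample speedup values (1.97 on 2 cores and 3.05 on 4 execution threads) are the ones quoted in the text of this section.

```python
def speedup(sequential_time, parallel_time):
    """Speedup = sequential_time / parallel_time."""
    return sequential_time / parallel_time


def efficiency(speedup_value, processors):
    """Parallel efficiency = speedup / processors."""
    return speedup_value / processors


# Speedup of 1.97 on the 2 physical cores of the non-SMT machine.
print(efficiency(1.97, 2))
# Speedup of 3.05 on the 4 execution threads of the SMT machine.
print(efficiency(3.05, 4))
```

An efficiency close to 1 means that almost every processor is kept busy with useful work, which is why an efficiency plot exposes oversubscription more clearly than the raw speedup.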
Our application presents a poor speedup when the number of threads is greater than the number of execution cores. This becomes clear in the efficiency graph. Considering that we have only 2 physical cores, the execution with two threads presented the highest efficiency. The executions with 4 up to 32 threads express the dilemma of concurrency, since each pair of threads competes for a single processor.
Figure 7 depicts the speedup evaluation results of the Coliseum image on a dual-core machine. This image presents a larger computation grain than the Lenna one. In other words, the overhead associated with threads is better amortized when testing the Coliseum image, since each thread has more work to compute than with the other image. In this way, the execution with 2 threads reaches a speedup of up to 1.97, which is considered a good result since the ideal speedup for this configuration is 2. Besides this analysis, it is possible to observe two other behaviors in the graph of Figure 7. Firstly, the larger the dimension of the ranges, the lower the observed speedup. For example, the execution with a range of 32x32 presents the lowest computation grain per thread. Secondly, we can observe an execution pattern among the threads. Independent of the number of threads, the speedup curve presents the same shape.
[...] obtained gains of up to 1.98, which are considered a good result for this number of threads. However, we can observe that the use of 2 threads does not take advantage of the entire power of the parallel architecture, since only one execution thread remains allocated per core. The executions with 4 up to 32 threads took advantage of the SMT solution. In particular, we obtained a gain of 3.05 when testing 4 threads and ranges with dimension 16x16, which represents more than 75% of usage considering the execution threads inside the cores. The most relevant observations concern the execution with 2x2 ranges. As we can see in Figure 8, the performance of this configuration does not scale well with 4 or more threads. The computation with this range dimension is more computationally intensive than the others. Furthermore, interactions require more memory, since the subset of ranges belonging to each thread is larger than in other range configurations. Any of the following factors may cause a system bottleneck (Diamond et al., 2011): (i) memory contention; (ii) cache misses; (iii) concurrent access to components in the superscalar pipeline of the SMT core.
Figure 9 illustrates a comparison graph considering both configurations of dual-core processors and the Lenna image. Although the SMT processor operates with 4 execution threads, our evaluation showed that the best results were obtained with 8 user threads. This combination was the best one for increasing the efficiency of core utilization. Although a large number of threads raises the operating system time spent managing and scheduling them, the threads are useful for exploiting the superscalar and preemption facilities found on SMT processors. Naturally, the number of threads must be analyzed together with the thread granularity. In our case, 8 threads and 8x8 ranges compose the set with the best performance.
Finally, Figure 10 depicts the tests in which a range of 32x32 pixels and the Coliseum image were employed. This configuration shows the traditional curve observed when working with threads. We have a perceptible reduction in time when enabling threads, and the time then grows as the number of threads increases. This is explained by the computational work grain. The larger the number of threads, the lower the grain to be calculated by each thread (each thread receives a uniform subset of the ranges). In addition, more threads imply a higher cost for synchronization and mutex primitives.
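The relation between range size, number of threads and grain per thread can be sketched as follows. This is a minimal illustration that assumes square, non-overlapping ranges tiling the whole image and a contiguous block distribution; both the helper names and the distribution scheme are assumptions, since the text only states that each thread receives a uniform subset of ranges.

```python
def count_ranges(image_side, range_side):
    """Number of non-overlapping square ranges tiling a square image."""
    per_axis = image_side // range_side
    return per_axis * per_axis


def split_uniformly(n_items, n_threads):
    """Return (start, end) index bounds giving each thread a near-equal share."""
    base, extra = divmod(n_items, n_threads)
    bounds = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # spread the remainder
        bounds.append((start, start + size))
        start += size
    return bounds


# Coliseum image (512x512) with 32x32 ranges: 256 ranges in total,
# so 32 threads would each receive only 8 ranges.
print(count_ranges(512, 32), split_uniformly(256, 32)[0])
# Lenna image (256x256) with 2x2 ranges: 16384 ranges in total.
print(count_ranges(256, 2))
```

Under these assumptions, the small per-thread share for 32x32 ranges is consistent with the low grain discussed for Figure 10, while 2x2 ranges give every thread a much larger subset to process.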

Related Work
The FIC technique has attracted much attention in recent years because of its manifold advantages: very high compression ratio, high decompression speed, high bit-rate and resolution independence. Many techniques and improvements have been published in this field since 1990. Most of them focus on algorithmic improvements for a smart search, which both reduce the size of the search pool for range-domain matching and yield a significant speedup in execution time (Fu & Zhu, 2009; Jeng et al., 2009; Mitra et al., 1998; Qin et al., 2009; Revathy & Jayamohan, 2012; Rowshanbin et al., 2006; Sun & Wun, 2009; Vahdati et al., 2010). In particular, Revathy and Jayamohan (2012) proposed a dynamic preparation of a domain pool for each range block, instead of working with a set of static domains from the beginning of the execution. Vahdati et al. (2010) presented a chaotic particle swarm optimization (CPSO) based on the characteristics of fractals and partitioned iterated function systems. In addition, Ant Colony (Li et al., 2008), Neural Network (Sun et al., 2001) and Genetic Algorithm (Mitra et al., 2000; Mitra et al., 1998; Wu & Lin, 2010) techniques were proposed to greatly decrease the search space for finding the self-similarities in a given image. Instead of pursuing a reduction in the application time, Selim et al. (2008) focused on obtaining a high compression index while maintaining a peak signal-to-noise ratio (PSNR) larger than 30.
Regarding the exploitation of parallel architectures, to the best of our knowledge the following initiatives address the FIC problem. Jackson and Blom (1995), working on an nCUBE multiprocessor, presented a parallel "host and nodes" solution, in which a single processor was dedicated to distributing the workload to the nodes and gathering the results. Another message-passing solution was proposed by Qureshi and Hussain (2008), who implemented three static master-worker MPI (Message Passing Interface) strategies for enabling load balancing on a Beowulf cluster of workstations. The authors measured both the speedup and the worker idle time of each implementation. Other features used in the context of FIC in multicomputer environments were Web Services (Fang et al., 2011) and process migration (Righi, 2012). In particular, this second work applies process rescheduling in grid environments to deal with architecture heterogeneity and application dynamicity. Some works explore SIMD (Single Instruction Multiple Data) architectures, and more specifically GPUs (Graphics Processing Units) (Wakatani, 2012; Khan & Akhtar, 2013). Kim and Choi (2011) combined both GPU and multithreading in their 2D DCT (discrete cosine transform) solution for the FIC problem. The article focused on OpenCL parallel modeling, and the authors used only an Intel Core 2 Duo for the tests. Cao and Gu (2010; 2011) presented a multithreading-based FIC implementation with the OpenMP library by simply placing pragma directives on loop constructs. Although they described a multicore implementation, the authors only presented tests on a quad-core system. Analyzing the related work considered here, we observed a lack of studies comparing the power of recent multicore and SMT architectures for computing the FIC problem. Hence, this research opportunity was explored in this article.

Conclusion
With the help of recent developments in semiconductor design, modern processors provide a great opportunity to increase the performance of multimedia data processing by exploiting data-parallelism in multicore and SMT systems. Aiming to verify this statement, we employed in this article a parallel model of the so-called Fractal Image Compression (FIC) problem. Over recent decades, FIC has been a field of intensive research, applied not only in image processing but also in database indexing, texture mapping and pattern recognition problems. We designed a fork-join model to explore the full potential of the parallel architecture, where each thread has a copy of the entire D (Domain) set and receives from the main program its own subset of ranges, which represents a subpart of the input image. The threads run without dependencies among themselves and are synchronized once for collecting the compressed image.
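The fork-join organization summarized above can be sketched as follows. This is a deliberately simplified illustration and not the implementation evaluated in this article: blocks are flat tuples of gray levels, domains are assumed to be already downsampled to the range size, and the matching uses a plain sum of squared differences without the contrast and brightness transforms of a complete FIC coder. All names are hypothetical. Note also that CPython threads do not run CPU-bound code in parallel because of the global interpreter lock, so the sketch shows the structure only; measuring real speedups would require native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def block_error(a, b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_subset(ranges, domains):
    """Worker: for each range, return the index of the closest domain."""
    local_domains = list(domains)  # private copy of the D set
    return [
        min(range(len(local_domains)),
            key=lambda i: block_error(r, local_domains[i]))
        for r in ranges
    ]


def parallel_encode(ranges, domains, n_threads):
    """Fork one task per subset of ranges, then join the results in order."""
    chunk = max(1, -(-len(ranges) // n_threads))  # ceiling division
    subsets = [ranges[i:i + chunk] for i in range(0, len(ranges), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:      # fork
        parts = pool.map(encode_subset, subsets, [domains] * len(subsets))
        return [index for part in parts for index in part]       # join


# Hypothetical 2x2 blocks flattened to 4 values each.
domains = [(0, 0, 0, 0), (10, 10, 10, 10), (200, 200, 200, 200)]
ranges = [(1, 1, 1, 1), (190, 190, 190, 190), (12, 9, 10, 11), (0, 0, 0, 0)]
print(parallel_encode(ranges, domains, n_threads=2))  # [0, 2, 1, 0]
```

Because each worker only reads its own subset of ranges and its own copy of the domains, no locking is needed during the search, and the single join at the end corresponds to the one synchronization point described above.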
We confirmed the claim by Garcia and Gao (2013) that applications with data-parallelism, where multiple threads execute the same code on different sets of data, can improve their performance dramatically when taking advantage of SMT and multicore technologies. The results showed gains of up to 68% (with SMT) and 48% (without SMT) when comparing multiple-thread and single-thread scenarios in both configurations of dual-core processors. We can explain this rate by: (i) our modeling strategy; and (ii) the fact that FIC is a CPU-bound problem.
The benefits of exploiting data parallelism were more evident in the SMT configuration. The use of 4 execution threads in the SMT-assisted dual-core processor provided a performance gain of up to 29% when compared to the non-SMT configuration. In particular, we obtained this index with 8 user threads, which occupy each execution thread in a better way. To the best of our knowledge, this article is the first to present a parallel FIC application focused on multicore and SMT systems, together with a detailed evaluation on them. Besides this, we can extend our contribution to operating systems. They can include the parallel FIC implementation proposed here as an option for compressing images, since multicore systems have become the state of the art in the processor architecture field.
Finally, the tests allow us to conclude that the performance of a multithreaded system depends on the computational grain of each thread, the number of processors in the target machine and the mutex/synchronization directives in the code. Future work comprises modeling the FIC problem as a message-passing application to execute over AMPI (Adaptive MPI) (Rodrigues et al., 2010). In this way, we intend to evaluate the problem with threads, with MPI alone, and with a combination of both threads and MPI approaches.

Table 1. Analyzing the obtained PSNR (measured in decibels) for both evaluated images.

Table 2. Evaluating a dual-core processor without SMT support with a 256x256-sized image (Lenna). Time in seconds.

Table 4. Evaluating an SMT dual-core processor with a 256x256-sized image (Lenna). Time in seconds.

Table 5. Evaluating an SMT dual-core processor with a 512x512-sized image (Coliseum). Time in seconds.