Comparison of Swarm Optimization and Memetic Algorithm for Systolic Mapping of Texture Analysis

- Systolic processors offer a hardware design that can accommodate more functions in a small footprint. Hardware utilization efficiency can be enhanced by appropriately assigning tasks to the hardware in space and time through parallel computing platforms. Regular algorithms known for their computational complexity can be mapped to systolic arrays through dependence graphs, which allot hardware to the design data. Manual mapping techniques tend to be tedious and error-prone, which calls for efficient mapping techniques automated through algorithmic procedures. Texture analysis marks the preliminary stage of image analysis and interpretation; automotive systems, robotics, industrial processing and similar automated applications can be simplified through it. This work employs evolutionary algorithms for mapping texture analysis onto systolic architecture. Memetic Algorithms (MA) and Particle Swarm Optimization (PSO) are comparatively studied, and the efficiency of designing a parallel architecture through a systolic array is analyzed through a cost function and processing time.


Introduction
Current trends show an escalating pace of development through advancements in design procedures and available resources. Recent improvements in design and manufacturing demand low power, minimal hardware, high speed and good accuracy. Parallel designs that offer high throughput are an apt choice but can result in memory access conflicts in a hierarchical database. By designing a proper memory fetch strategy, optimized hardware and software designs can be derived and ported to an application-specific instruction-set processor (ASIP), a dedicated Field Programmable Gate Array (FPGA) [1] or an Application Specific Integrated Circuit (ASIC).
Hardware algorithms are hardwired techniques with a fixed set of processors and interconnections. Iterating such algorithms in hardware requires a large area of design units and long computation times. Reusing the available hardware with proper scheduling methods to obtain a compact regular structure is a better solution. Systolic arrays are processor units that can be designed and scheduled to compute a train of sequences through complete utilization of the available hardware. The architecture is primarily developed to increase hardware reliability, cost-efficiency and accuracy through parallelism and pipelining of standard image processing algorithms. Iterative algorithms run similar operations over various inputs and can be parallelized through a systolic implementation. Texture analysis is an iterative image classification method known to improve the perception of an image through mean and variance analysis over a shifting window, and is evaluated with feature-based metrics such as contrast, coarseness, busyness and strength. Mapping a regular texture analysis algorithm to a systolic processor proceeds by allotting each atomic computation to a processor and managing the schedule of successive operations on the available hardware [2].
Emergent complexity is a behavior in which larger entities arise through interactions among smaller entities, and this phenomenon has been the basis of evolution. Existing species evolve and develop necessary characteristics through an expensive process of increasing complexity. Evolutionary algorithms (EA) describe the improvement of a group of individuals through variations applied to the population size and the reproduction mechanism. These variations have led to a new class of meta-heuristic algorithms, grouped under the domain of 'bio-inspired mechanisms', that solve heuristic problems in a randomized manner. Applications of evolutionary methods are vast, and one such use of EA is to find an efficient mapping procedure for systolic arrays. From a huge set of available mapping solutions, the best solution is selected by ensuring constant improvement in quality, with the quality of each solution analyzed through a cost function. Evolutionary algorithms for processor mapping have become popular through their robustness in producing a good result in every run. This work concerns the application of evolutionary algorithms such as the genetic algorithm [3], the memetic algorithm [4] and particle swarm optimization [5] to systolic mapping.
The paper is organized as follows: Section 2 deals with the elucidation of systolic architecture and the mapping procedure for systolic arrays, Section 3 enlists different evolutionary algorithms, Section 4 describes texture analysis, an example application for systolic arrays, and Section 5 analyses the results of applying evolutionary algorithms to map texture analysis onto systolic architecture.

Systolic Architecture
Systolic arrays are a group of processors that work on a controlled flow of data through synchronized input and output. Systolic arrays were first named in [6]. That maiden research dealt with the implementation of processors that are similar in structure and function. The design units are regulated and controlled by a control system in terms of computation and data storage. The performance of a systolically designed system can be measured by the throughput of the peripheral device which feeds the system. A highly concentrated architecture demands high efficiency and places low demand on the supporting units that supply data over local connections [7]. Data is circulated into the systolic array every computation cycle, and the module-based design of systolic arrays handles data fetching, data processing and data forwarding to the next stage of the design. This process improves throughput and eliminates the need for a large number of processors [8].
Systolic arrays are known for their innate pipelined structure and qualify as a good parallel computing system. The array is synchronized with discrete time steps, with sufficient time separation between successive operations. Unlike forced pipelining in a signal processing chain, systolic arrays can be extended in multiple spatial dimensions. Although the design units are used less at the start and end of the feed sequence, they operate fully on the remaining data, resulting in increased throughput.
Scalability, a regular design and reconfigurability are some features for which systolic arrays are known. The systolic implementation of matrix multiplication [9] uses N1 × N2 processors for multiplying two matrices of size N1 × N3 and N3 × N2. The same architecture can multiply matrices of smaller dimensions, which accounts for its scalability. The regular architecture results from the uniformity in the design of the processing units. Reconfigurability can be seen in FPGA implementations, which provide hardware and software flexibility.
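As a concrete sketch of this behaviour, the following Python fragment models an N1 × N2 grid of multiply-accumulate (MAC) cells computing C = A·B, with cell (i, j) performing one MAC per time step. The function name and data movement here are illustrative assumptions, not the design of [9].

```python
# Sketch: an N1 x N2 grid of MAC cells computing C = A (N1 x N3) . B (N3 x N2).
# Cell (i, j) accumulates one output element over N3 time steps.

def systolic_matmul(A, B):
    n1, n3 = len(A), len(A[0])
    n2 = len(B[0])
    # one accumulator per processing element (PE)
    acc = [[0] * n2 for _ in range(n1)]
    # at time step k, PE (i, j) receives A[i][k] from the left and
    # B[k][j] from above and performs a single multiply-accumulate
    for k in range(n3):
        for i in range(n1):
            for j in range(n2):
                acc[i][j] += A[i][k] * B[k][j]
    return acc

C = systolic_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
# C == [[58, 64], [139, 154]]; each of the 2 x 2 PEs performed N3 = 3 MACs
```

Feeding smaller matrices into the same grid simply leaves some cells idle, which is the scalability property noted above.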
Systolic arrays are reported to modularize computations that would otherwise require long processing times and power-hungry hardware structures [10].
The basic structure of the systolic architecture is demonstrated in Figure 1. A systolic array consists of processing elements (PE) which are scheduled to evaluate a part of the inputs, and the results are routed back to the memory database. The memory database holds control of the data movement and is commonly a central processing unit (CPU). In a systolic processing array, the rate of data processing increases with the number of connected array elements. Mapping a computationally intensive task onto resource-restricted hardware such as a systolic system requires a space-time transformation defining the physical location and time of each operation. The space-time transformation is essentially a mapping process which assigns each point in the iteration space to a scheduled processing element operating at a discrete time.
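The space-time transformation just described can be sketched in a few lines of Python: each iteration point q is assigned the processor index pᵀq and the time step sᵀq. The vectors p and s below are illustrative choices, not taken from the paper.

```python
# Sketch of a space-time transformation: iteration point q maps to
# processor p.q and time step s.q.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def space_time_map(points, p, s):
    """Map each iteration point to a (processor, time) pair."""
    return {q: (dot(p, q), dot(s, q)) for q in points}

# iteration space of a 3 x 3 loop nest
points = [(i, j) for i in range(3) for j in range(3)]
p = (1, 0)   # processor space vector: all j-iterations of row i share PE i
s = (1, 1)   # scheduling vector: iteration (i, j) runs at time i + j
mapping = space_time_map(points, p, s)
# e.g. iteration (2, 1) runs on processor 2 at time 3
```

Iterations that share the same value of pᵀq run on the same PE at distinct times, which is exactly the constraint the mapping procedure must guarantee.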

Evolutionary algorithms
Evolutionary algorithms [11] belong to a class of non-traditional techniques which mimic the biological behavior of organisms to obtain a solution. Evolutionary algorithms [12] form a basis of computational intelligence, which interprets environmental variables and acts sensibly through population-based improvement methods. They mimic the character of evolution in various species: groups of ants, swarms of birds, schools of fish, groups of frogs, etc. [13]. Self-improvement and the discarding of unnecessary traits have been the manner of evolution starting from unicellular organisms. As shown in Figure 2, EA methods try to design a copycat mechanism of natural evolution [14].

Fig. 2: Evolutionary Algorithm process flow
An example often cited to appreciate evolution is the reduction of the human tail to the vestigial coccyx, or tailbone, owing to its limited use for the species. Evolutionary algorithms fall into four categories: Genetic Algorithm (GA), Genetic Programming (GP), Evolutionary Programming (EP) and Evolution Strategy (ES) [15]. The scope of this work is limited to the genetic algorithm, the Memetic Algorithm (MA) and Particle Swarm Optimization (PSO).
Evolutionary algorithms are well suited to systolic arrays because of their inherent parallelism [16], [17]. Robust handling of perturbations, since the algorithm starts from random candidate solutions; computational integrity in striving to achieve the required fitness [18]; and the application of domain intelligence rather than a bird's-eye view of a problem are some of the benefits of EA.

Memetic Algorithm
Memetic algorithms (MA) [29] are a group of stochastic search methods that employ evolution together with a local search strategy. MA was named by Moscato following the notion of Dawkins' meme. MAs are well suited to finding approximate solutions of NP-hard problems [30]. The outstanding feature of MA is the exploitation of the entire knowledge of the problem, justifying the alternative name Hybrid Evolutionary Algorithms (HEA). Memetic techniques enhance the global search methods of EA by adopting local search methods within a restricted group of individuals. MA follows the standard pattern of evolving individuals from one population to the next by processing and enhancing the candidates. This ensures faster attainment of the optimum through the synergy of two optimization methods. Any advantageous information available in a local area can be used to guide the search to a better solution. Local search methods iteratively examine a set of points in the neighborhood of the current solution and replace the current solution with a better neighbor if one exists. In hybridized EAs, local search is used to improve the information content of the population in one or more phases of the implementation. MAs can apply local search to the descendants and skip the mutation process [31].
MAs are similar to GA in that they form a group of elements (memes). The procedure involves the selection of a group of memes (candidate solutions) and their evolution towards the optimal solution by crossover and mutation, along with the personal experience of the individual memes. The inheritance of favorable traits by descendants is generally encouraged in all evolutionary methods. The memetic algorithm was the first method to introduce Baldwinian learning of traits by a candidate of the evolutionary process. The inheritance method of MA can be designed in two ways:
• Lamarckian: the local stage result replaces the least fit individual in the population
• Baldwinian: the original individuals of the population are retained even when the results approach the outcome of the local search
The common method of inheritance is the Lamarckian method. The local improvement factor, together with information variation, allows MA to converge faster than GA. Genetic algorithms suffer from longer processing times, as there is a possibility of gene diversion when the genetic operators act upon the population, and the search takes no local improvements into account. The stopping condition for MA can be the total number of iterations before reaching the target, the number of iterations for which the target value has been stable, or a satisfactory target value. The size of the meme pool generally ranges from 40 to 1000 individuals. The number of iterations is usually more than 100, and better results have been observed for larger numbers of iterations [32]. Memetic algorithms have been used in solving optimization problems [33], software testing [34], medical imaging [35], bioinformatics [36], object recognition [37], [38] and scheduling [39], [40].
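The loop just described (selection, crossover, mutation, then a Lamarckian local stage applied to the offspring) can be sketched on a toy problem. Every name, parameter value and the objective below are illustrative assumptions, not the paper's implementation.

```python
# Minimal memetic-algorithm sketch: minimize the sum of squares of an
# integer vector. Lamarckian inheritance: the locally improved child
# itself enters the population.
import random

def fitness(x):
    return sum(v * v for v in x)  # lower is better

def local_search(x):
    """Greedy hill climb: nudge each coordinate toward lower cost."""
    x = list(x)
    for i in range(len(x)):
        for step in (-1, 1):
            trial = list(x)
            trial[i] += step
            if fitness(trial) < fitness(x):
                x = trial
    return x

def memetic(pop_size=20, dims=4, iters=50, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(-10, 10) for _ in range(dims)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fitness)                      # selection: fitter half survive
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dims)
            child = a[:cut] + b[cut:]              # one-point crossover
            if rng.random() < 0.2:                 # mutation
                child[rng.randrange(dims)] += rng.choice((-1, 1))
            children.append(local_search(child))   # Lamarckian local stage
        pop = parents + children
    return min(pop, key=fitness)

best = memetic()
```

Removing the `local_search` call turns this back into a plain GA; the local stage is what gives MA its faster convergence on problems like this.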

Particle Swarm Optimization
As shown in Figure 3, particle swarm optimization [41] is a trajectory-evolving algorithm which imitates a flock of birds or bird-like objects (boids) [42] trying to reach a destination [43]. Binary Particle Swarm Optimization (BPSO) [44] is an improvement on PSO in which the particles assume binary position values. The position Xp and velocity Vp of each particle are given in Equation (1) and Equation (2). The position of a particle is randomly generated, and the initial velocity range is [0.0, 1.0]. Each particle is updated with its position.
The new position of a particle is updated based on three factors: inertia, cognitive experience and social experience. The algorithm advances towards the optimum by maintaining a balance between exploration of new solutions and exploitation of fit particles, as shown in Equations (3) to (6). Inertia captures the effect of the current velocity on the updated position, and the weight 'w' decreases from a value of 0.9 to 0.4 (or as low as 0.2) over subsequent iterations. The change in inertia weight is necessary to reduce the randomness in the particle positions as the iteration number grows.
The cognitive component refers to the experience of each particle: how far it is from its own personal best position. The social behaviour reflects the influence of the overall best particle in the current population, urging the group of particles to move towards the global best position. Parameters such as c1, c2, rand1 and rand2 are hard to choose, and empirical values have been used for them since the inception of PSO. Their influence on the rate of convergence and the quality of the solution is significant. The position of a particle is fixed based on a stability function S, as shown in Equation (7).
A random value is generated in the range [0.0, 1.0]; if the number is greater than S(Vp(i+1)), Xp(i+1) is set to 1, else it is set to 0 [44]. The group of solutions remains the same, but the positions and velocities of the solutions change every iteration. The number of particles in PSO generally ranges from 10 to 100, and the maximum number of iterations varies between 50 and 200, depending on the type of problem [45]. Generally, PSO is used for unconstrained problems where the variables have no limits [46]. PSO is well adapted to non-linear functions in a search space of high dimensionality. The applications of PSO extend to network intrusion detection [47], data analysis [48], combinatorial optimization problems [49], text summarization [50], machine learning methods [51], neural networks [52], medical image processing [53], computer gaming and entertainment [54], power systems [55] and pattern recognition [56].
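The update rules referenced in Equations (1) to (7) follow the standard BPSO form, which can be sketched as below. The parameter values, toy objective and the convention of setting a bit to 1 when the random draw falls below the sigmoid are assumptions taken from the commonly published form of BPSO, not from the paper's equations.

```python
# Minimal BPSO sketch: binary positions, real-valued velocities, and a
# sigmoid "stability function" S used to resample each bit.
import math
import random

def bpso(fitness, dims, n_particles=10, iters=50, seed=3):
    rng = random.Random(seed)
    c1, c2 = 2.0, 2.0                              # cognitive / social weights
    X = [[rng.randint(0, 1) for _ in range(dims)] for _ in range(n_particles)]
    V = [[rng.random() for _ in range(dims)] for _ in range(n_particles)]
    pbest = [list(x) for x in X]                   # personal best positions
    gbest = list(max(X, key=fitness))              # global best position
    for t in range(iters):
        w = 0.9 - (0.9 - 0.4) * t / iters          # inertia weight decreases
        for i in range(n_particles):
            for d in range(dims):
                V[i][d] = (w * V[i][d]                                    # inertia
                           + c1 * rng.random() * (pbest[i][d] - X[i][d])  # cognitive
                           + c2 * rng.random() * (gbest[d] - X[i][d]))    # social
                s = 1.0 / (1.0 + math.exp(-V[i][d]))   # stability function S
                X[i][d] = 1 if rng.random() < s else 0
            if fitness(X[i]) > fitness(pbest[i]):
                pbest[i] = list(X[i])
        gbest = list(max(pbest, key=fitness))
    return gbest

# toy objective (an assumption): maximize the number of set bits
best = bpso(sum, dims=8)
```

Because particles interact only through `gbest`, each particle's work per iteration is constant, which matches the observation later in the paper that PSO's processing time stays roughly flat.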

Systolic Array Architecture for Image Texture Analysis
Systolic arrays offer a compact design for computationally intensive algorithms, which makes them a good choice for texture analysis. Texture analysis is a regular algorithm with similar data flow directions between computational nodes in a signal flow graph. A dependence graph can be considered an all-space, no-time representation of the algorithm, indicating all the prerequisite computations to be completed before the current computation [57]. In contrast, systolic architectures have a space-time representation. Dependence graphs are essential to systolic arrays, as the interdependencies of the data can be analyzed and maximal parallelism achieved [58]. The mapping of an algorithm onto a systolic array through evolutionary algorithms is explained in Fig. 4.
The overall working of the array is based on the processor allocation matrix (p) and the scheduling matrix (s). The processor allocation matrix assigns the data to the available processors. The scheduling matrix determines when a given processor accepts the next data. The projection vector (d) specifies the basic processor architecture and defines the collection of data points to be interpreted by the same processor. These three vectors define the mapping of an algorithm onto hardware. For the mapping to be feasible, the chosen vectors must satisfy space and timing constraints. The processor space vector and the projection vector must be orthogonal, ensuring that only one set of data is processed at a specific processor at any point of time. The timing constraint dictates that no two sets of data should be mapped to the same processor at the same instant of time. Hardware utilization efficiency (HUE) defines the amount of reuse in the defined hardware for a specific projection vector (d) and scheduling vector (s). For an efficient design, HUE should be close to 1.
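These constraints can be checked mechanically. The sketch below uses illustrative vectors and takes HUE in its commonly published form 1/|sᵀd|; the paper does not spell out the formula, so that expression is an assumption here.

```python
# Feasibility checks for a candidate mapping (p, s, d):
#   space constraint:  p.d == 0  (iterations projected together share one PE)
#   timing constraint: s.d != 0  (those iterations get distinct time steps)
#   HUE taken as 1 / |s.d| (common form; an assumption here)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feasible(p, s, d):
    return dot(p, d) == 0 and dot(s, d) != 0

def hue(s, d):
    return 1.0 / abs(dot(s, d))

p, s, d = (1, 0), (1, 1), (0, 1)
ok = feasible(p, s, d)   # p.d = 0 and s.d = 1, so the mapping is feasible
u = hue(s, d)            # 1.0: the hardware is fully utilized
```

A search procedure, whether heuristic or evolutionary, would call checks like these on every candidate vector set before scoring it with the cost function.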
The search for an exact set of vectors obeying the feasibility constraints is a compute-intensive process in which each location in space and time has to be evaluated iteratively.
The quest for the optimum matrices normally involves heuristics [59], [60], general search approaches that steer the search path as well as possible towards immediate goals [61], [62]. Alternative methods such as evolutionary algorithms, or more systematic methods, can be applied to reduce the computation time and resources.

Image Texture Analysis
Texture analysis is the categorization or segmentation of textural features with respect to component type, density and directionality. In many machine learning and image processing pipelines, analysis of texture features is a prerequisite for applications such as texture classification, denoising, image reconstruction and feature extraction. Texture features based on first-order statistics of the gray value distribution are invariant to the orientation and scale of the object. Estimating the parameters for texture segmentation requires the mean and variance to be averaged over a local neighborhood for normalization. The mean is calculated over an n × n window, with 'n' odd for symmetric selection of pixels around the center. In a texture analysis problem, the gray scale intensity values g(x,y) within a window are analyzed; to capture the variations in a region around a point, the texture value Tv(x,y) of that point is computed. The quality of texture analysis applied to an image can be quantified through parameters such as coarseness, contrast, busyness and texture strength [63], [64]. In Equation (8), 'm' denotes the mean of the gray values of the image within the window of size N, and the variables 's' and 't' represent the extent of the window function for calculating texture. The summation proceeds until the window reaches the edge of the image. The result of texture analysis is an edge-detected image which emphasizes the high intensity variations at edges.

Table 1: Texture evaluation parameters
Coarseness: the basic pattern of a texture
Contrast: the capacity to clearly differentiate neighboring pixels
Busyness: rapid intensity changes from one pixel to another
Texture Strength: the ease of defining a texture from its region

The parameters contrast, coarseness, busyness and strength listed in Table 1 are used to evaluate the resultant image after texture analysis.
Texture processing may also separate image data into information that is easier to view and use in different applications, including industrial automation, image retrieval, medical imaging and remote sensing.
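The windowed mean-and-variance computation described above can be sketched as follows. Border handling (skipping edge pixels) and the use of the local variance as the texture value Tv are simplifying assumptions for illustration.

```python
# Sketch: for each interior pixel, compute the mean and variance of gray
# values g(x, y) over an n x n neighborhood (n odd); the variance serves
# as a simple texture value T_v.

def texture_map(img, n=3):
    h, w = len(img), len(img[0])
    r = n // 2
    tv = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):                      # skip border pixels
        for x in range(r, w - r):
            window = [img[y + t][x + s]
                      for t in range(-r, r + 1) for s in range(-r, r + 1)]
            m = sum(window) / (n * n)              # local mean
            tv[y][x] = sum((g - m) ** 2 for g in window) / (n * n)  # variance
    return tv

# a flat region has zero texture value; a region containing an edge does not
flat = [[5] * 5 for _ in range(5)]
edge = [[0, 0, 9, 9, 9] for _ in range(5)]
```

Every pixel repeats the same n × n inner computation, which is exactly the regularity that makes the algorithm a candidate for systolic mapping.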

Selection of Best Architecture
A search technique is used to find the right matrices that minimize the cost. The cost function combines the number of processors and the number of cycles used to complete the algorithm for a given image size. To emphasize efficient hardware usage, higher weightage is given to the number of processors [65]. When the feasibility constraints are satisfied, the edge mapping of the algorithm can be derived.
Edge mapping connects the nodes of the software design in the dependence graph to the computational nodes of the physical design. If an edge 'e' exists in the space representation of the dependence graph, an edge is introduced between the concerned processors with an appropriate delay. Edge mapping concludes the systolic array implementation, and this work ends with the selection of the number of processors using methods alternative to heuristics.
In optimization problems, the optimal solution tends to occur at the boundary points of the space limited by the constraints of the problem [66]. From the processor space vector and scheduling vector, the number of processing elements and the total time required for completion of the algorithm can be derived.

N_PE = PE_max - PE_min + 1    (9)

Equation (9) specifies the number of scheduled processing elements N_PE needed for effective hardware usage to complete the task. PE_max and PE_min denote the maximum and minimum processor indices p·q taken over all points q of the search space S, where p is the processor allocation matrix.
The processors are arranged in the index space with Cartesian coordinates, as shown in Equation (10).
The time taken for complete evaluation, N_cycle, depends on the scheduling matrix s and on the allocation of processors in space, as shown in Equation (11):

N_cycle = max_{q in S} s·q - min_{q in S} s·q + 1    (11)
The cost function is chosen by giving higher weightage to the number of processors to emphasize hardware efficiency, as shown in Equation (12).
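The cost evaluation of Equations (9) to (12) can be sketched as follows; the specific weights w1 > w2 are illustrative assumptions, since the paper's exact weighting is not given.

```python
# Sketch: processor and cycle counts from the spread of p.q and s.q over
# the iteration space, combined with a higher weight on processors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cost(points, p, s, w1=3.0, w2=1.0):
    proc = [dot(p, q) for q in points]
    time = [dot(s, q) for q in points]
    n_pe = max(proc) - min(proc) + 1       # Equation (9)
    n_cycle = max(time) - min(time) + 1    # Equation (11)
    return w1 * n_pe + w2 * n_cycle        # weighted sum, Equation (12)

# 4 x 4 iteration space with a row projection
points = [(i, j) for i in range(4) for j in range(4)]
c = cost(points, p=(1, 0), s=(1, 1))
# N_PE = 4, N_cycle = 7, so cost = 3*4 + 1*7 = 19
```

An evolutionary search would minimize this value over candidate (p, s) pairs that pass the feasibility checks.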

Results and Discussion
This work deals with evaluating texture analysis process through defined parameters and by designing the mapping vectors for systolic array through evolutionary mapping process.
Texture analysis is applied to a gray image obtained from a tricolor image. The parameter N, the window width, is assigned values of 3, 5 and 7. Images of different sizes are taken to evaluate the variation of the parameters with the number of pixels and the window size of texture analysis.
Complexity of the algorithm increases with the number of pixels per window and the image size. Texture analysis is applied to some standard images as shown in Fig. 5. Since texture analysis is a sequential process, the computations are executed in order and consume design time. Table 2 lists the computational time of texture analysis for various images. As the image size increases, the evaluation time of the texture increases. For smaller images, the processing time reduces as the window size increases; for large images with a window size of 5, the algorithm takes more time to complete. It can be deduced that images of higher pixel dimensions require more execution time for a specified window size, that for small images the time reduces with increasing window size, and that the time increases with window size for images larger than 300 x 300 pixels.
For the chosen images and window sizes 3, 5 and 7, texture parameters such as coarseness, contrast, busyness and texture strength are evaluated. The coarseness of an image reduces with an increase in the number of pixels, and depending on the intensity values, the dependence of coarseness on window size becomes unpredictable. Contrast increases with an increase in both the number of pixels and the window size, as shown in Figure 4. Systolic arrays can be implemented with texture analysis through efficient mapping techniques such as heuristics, the memetic algorithm and particle swarm optimization. The mapping procedure is implemented in the MATLAB tool. Texture analysis is a four-dimensional problem, with two dimensions from the Cartesian coordinates of the image under consideration and two dimensions from the window of evaluation. The end result is a set of matrices indicating the processor and time details of the mapped function: the four-dimensional texture analysis problem yields two 1 × 4 processor matrices in the x- and y-directions and one 1 × 4 scheduling matrix. The scope of this work is limited to finding the algorithm that maps the texture analysis problem to a minimum number of processors with time efficiency.
Particle swarm optimization applied to the systolic array mapping uses 50 particles as the set of candidate solutions, and the total number of iterations is set to 100. The inertia weight is varied from an initial value of 0.2 to a final value of 0.4. The running time is the time needed to run and finish the whole operation, influenced by the number of processors and cycles. The mean value measures the degree to which the output varies between runs, indicating the variance of the algorithm's outputs. The initial population, positions and velocities are altered, and the solution improves over the iterations [67]. The findings show that the processing time is roughly constant, since there is no sharing of information between particles, unlike in MA. The optimal result from PSO ended with 30 processors and 58 cycles for evaluation; the resulting cost function value was 37.

The memetic algorithm is implemented with the same set of parameters. A meme is a group of individuals; the number of individuals for MA is 40, with 4 individuals per meme, as shown in Table 4. In the computation process, local improvements are given higher weightage in moving the algorithm forward. In the memetic algorithm too, the number of iterations varies over a wide range. The program executed for 0.536 s before converging to an optimal result of 16 processors (4 x 4 cells) in 17 cycles, which is the minimum number of processors among all the techniques mentioned above.

The systolic mapping technique can be chosen based on the requirements of hardware efficiency and latency. Particle swarm optimization maps the array with less delay but more processors, and results in more cycles for the algorithm to conclude. As shown in Table 5, the above analysis cannot be concluded with only a few algorithms. The results indicate that the memetic algorithm qualifies as the better choice for systolic array implementation of texture analysis, as shown in Table 3.
As future work, algorithms such as ant colony optimization, the shuffled frog leaping algorithm and differential evolution can be evaluated for systolic mapping.

Conclusion
Systolic arrays are a pristine solution for computationally intensive applications, such as real-time image processing, and for accuracy-critical applications, such as medical image classification. Texture analysis was chosen from the large set of image processing applications to demonstrate the computational efficiency offered by systolic arrays. Efficient mapping techniques are essential for a compact design of parallel arrays that reuses the available hardware to the maximum extent. Of the discussed algorithms, the memetic algorithm was found to result in the most compact design due to its Lamarckian inheritance property and local search methodology. As future work, texture analysis results can be used to train a network of processors for image classification and decision making.