
1 Introduction

The Traveling Salesman Problem (TSP) is a well-known NP-hard problem that asks for a permutation tour allowing a salesman to visit each city exactly once and return to his starting city. The Kohonen Self-Organizing Map (SOM) [1,2,3,4,5] and 2-opt local search have previously been shown to work together to produce an intermediate Euclidean TSP solution with a good trade-off between quality and execution time [6]. For further acceleration, parallel strategies for SOM [7,8,9,10,11,12,13] and for 2-opt [14,15,16,17] have been studied separately over the past two decades, for different applications and on various multi-threaded devices (SP2, GPU). However, problems remain in these previous works concerning the types of topology supported, the level of parallelism and the memory occupancy.

Considering the general SOM working on a predefined topological map, its computation time can be divided into three parts: the time required to determine the winner node closest to the present input; the time required to determine the winner node's neighborhood; and the time required to update the weights (map updating) [8]. The first operation can be accelerated by using a massively parallel local spiral search to simultaneously find the closest winner node for each input datum [13, 18]. The second and third operations are influenced by the network architecture of the topological map, which provides access to neighboring nodes during the SOM training process. Topological maps are usually constructed with a unidirectional link network or a buffer grid, as in the SOM applications of [6, 11, 13, 18]. However, training on a unidirectional network can only proceed in one direction. Algorithms using a 2D grid can access arbitrary neighboring nodes within the needed radius centered on a winner node, but the topological map is then limited to regular topology structures, as in [6, 21]. Besides, the method used to access neighboring nodes on different topological maps plays an important role for applications that require the initial topological relationship between nodes to be preserved during the SOM training process.

Previous parallel 2-opt implementations cannot be considered massively parallel methods because of the tour ordering requirement. Examples are tailored data parallelism, in which one thread optimizes one partial tour [14]; function parallelism, in which the optimization of a single edge is executed in parallel [16, 19]; and geometric parallelism, in which one thread searches for optimizations within one of several divided sub-areas [20].

To solve these problems, in this paper we propose a platform that implements the SOM training process on irregular topological maps while preserving the topological relationship between nodes, together with a modified 2-opt method that performs 2-opt in a massively parallel way. We test this platform on GPU, taking advantage of its multi-threaded operation on global memory.

The rest of the paper is organized as follows: Sect. 2 presents related work on parallel SOM and 2-opt; Sect. 3 presents the proposed platform, including the parallel SOM working on topological maps and the modified 2-opt framework with one concrete implementation; the last section presents the experiments and discussion.

2 Related Work

For parallel implementations of SOM, two common parallel techniques, network partitioning and data partitioning, have been used on both CPU- and GPU-based systems and have been discussed and compared in various works [7,8,9,10,11,12,13]. The main difference between these two techniques is whether the node update step (map updating) happens independently in different parts of the network, or simultaneously for all input data after the training step. The network partitioning technique breaks the map into parts, and each part has a thread that updates its neurons separately. With the data decomposition technique, map updating does not occur until every input datum has found its winner node and trained its neighboring nodes [12]. For TSP applications, the map updating step corresponds to the relocation of each neuron in Euclidean space.

For parallel SOM specialized to TSP applications, Wang et al. [21] propose a parallel SOM implementation based on the GPU cellular matrix model proposed by Zhang et al. [18], in which they use a buffer grid as the neural network structure and assign one thread to treat all input data in one cell sequentially. Their SOM training step works on the integral buffer grid in global memory [21]. Different from that work, the network in this paper is stored as a doubly linked network for both SOM and 2-opt instead of intermediately in a cellular matrix; the cellular matrix model is only used for the parallel local spiral search operator; and we assign one thread per city instead of one thread per cell.

For parallel 2-opt algorithms, Johnson et al. [15, 19] discussed parallel schemes such as "geometric partitioning and tour-based partitioning". One geometric partitioning scheme proposed by Karp [20] is based on a recursive subdivision of the overall region containing the cities into rectangles [19]. Verhoeven et al. [14] distinguished parallel 2-opt algorithms between data and function parallelism, and proposed a tour repartitioning scheme that guarantees their algorithm does not halt until it has found a minimum for the complete problem [14]. Van Luong et al. [17] and Rocki and Sudha [16] adopt parallel strategies similar to "function parallelism", meaning that one sequential 2-opt is executed in parallel: Rocki and Sudha [16] distribute the calculation of one edge's exhaustive 2-opt optimization between threads, but only after the first edge's optimization has finished does the second edge begin its parallel 2-opt optimization.

3 Proposed Methods

The outline of the proposed parallel platform for SOM and 2-opt applied to large scale TSP is shown in Algorithm 1. It mainly includes three parts: the initialization step, the parallel SOM and the parallel 2-opt. Both algorithms work on topological maps constructed as a doubly linked network, as shown in Figs. 2(b) and 3. "Doubly linked" means that if node A connects (buffers) node B, then node B necessarily connects (buffers) node A, and every node only buffers its directly connected nodes.
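As an illustration, a minimal C++ sketch of such a doubly linked node might look as follows (the identifiers are ours and not taken from the paper's implementation):

```cpp
#include <vector>

// Minimal sketch of one node of the doubly linked network
// (identifiers are illustrative, not from the paper's implementation).
struct Node {
    float x, y;              // position in the Euclidean plane
    std::vector<int> links;  // indices of directly connected (buffered) nodes
};

// "Doubly linked" invariant: if b appears in nodes[a].links,
// then a also appears in nodes[b].links.
void connect(std::vector<Node>& nodes, int a, int b) {
    nodes[a].links.push_back(b);
    nodes[b].links.push_back(a);
}
```

For the TSP tour of Fig. 2(b) every node keeps exactly two such links, while the maps of Fig. 3 may attach a different number of links to each node.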

Algorithm 1. Outline of the proposed parallel platform for SOM and 2-opt.
Fig. 1. Tour ordering plays an important role for 2-opt optimization.

Fig. 2. Comparison of the operations needed after the same 2-exchange under different TSP tour order representations. (a) The algorithm needs extra temporary memory to invert the tour ordering for each 2-exchange. (b) The algorithm only needs to change the links of the four related cities and can easily go in two opposite directions from the current edge to find possible local optimizations.

Fig. 3. Topological maps represented by a doubly linked network. (a) Doubly linked list. (b, c, d) Rhombus, hexagonal and irregular topologies that respect the needed topological properties. They share the same proposed method for accessing neighboring nodes during one SOM training step, which preserves the topological relationship between nodes.

3.1 Initialization

The initialization step mainly includes the preparation of the SOM maps with initialized neurons, the cellular matrices for local spiral search, and the necessary data transmission from the CPU to the GPU side.

The initial SOM maps can be predefined with the property that topologically close nodes correspond to close pixels in an image, as shown in Fig. 3, or without such a property, for example by simply adding two neighboring nodes to each neuron to form an input TSP solution, as shown in Fig. 2(b). For a TSP instance with N cities (the "trainer"), the artificial doubly linked neural network (the "learner") is initialized with \(2\times {N}\) neurons [5]. The coordinates of these initial neurons are set according to the input cities with a small random difference.
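A possible sketch of this initialization is given below; the magnitude of the random offset is our own assumption, since the text only specifies "a small random difference":

```cpp
#include <random>
#include <vector>

struct Point { float x, y; };

// Create 2*N neurons, two per input city, each placed at the city position
// plus a small random offset. The offset magnitude eps is an illustrative
// choice, not a value given in the paper.
std::vector<Point> initNeurons(const std::vector<Point>& cities, float eps) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> d(-eps, eps);
    std::vector<Point> neurons;
    neurons.reserve(2 * cities.size());
    for (const Point& c : cities) {
        neurons.push_back({c.x + d(rng), c.y + d(rng)});
        neurons.push_back({c.x + d(rng), c.y + d(rng)});
    }
    return neurons;
}
```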

One basic building block in this paper is the local spiral search operator [22] in 2-dimensional Euclidean space. This operator works on a cellular matrix [18, 23] to find the closest neighboring nodes for the proposed methods. The Euclidean area containing all N input points is partitioned into a cellular matrix in which each cell has its own coordinate and buffers the nodes lying in the corresponding partitioned area. For the spiral search operator used by SOM, the algorithm prepares two cellular matrices, one for the input data (the "trainer") and one for the neurons (the "learner").
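For illustration, mapping a point to its cell coordinate on a uniform cellular matrix could be done as follows (the bounding box origin and cell size are assumed to be known; identifiers are ours):

```cpp
#include <cmath>

// Cell coordinate of a point on a uniform cellular matrix. The bounding box
// origin (minX, minY) and the cell edge length are assumed known;
// identifiers are illustrative.
struct CellCoord { int i, j; };

CellCoord cellOf(float x, float y, float minX, float minY, float cellSize) {
    return { static_cast<int>(std::floor((x - minX) / cellSize)),
             static_cast<int>(std::floor((y - minY) / cellSize)) };
}
```

Each cell then buffers the indices of the nodes whose coordinates map to it; one such matrix is built for the cities and one for the neurons.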

After these steps, the input data, the topological maps and the two cellular matrices are copied to the GPU side for parallel computing.

3.2 Parallel Self-organizing Map

Our parallel implementation of SOM contains the following three steps in one epoch (iteration, epo), as shown in Fig. 4.

1. Search winner node: spiral search for each input point's closest neuron;

2. Train neighborhood: iteratively access the directly connected neighboring nodes according to topological distance and apply Kohonen's learning law;

3. Refresh cellular matrix: refresh the position of each neuron on the cellular matrix.

Fig. 4. Kernel functions of the parallel implementation of SOM; we assign one thread to one node.

The search winner node step consists of a massively parallel local spiral search working on the cellular matrix, with one thread per city. For each input city, the algorithm gets the coordinate of the cell where this city lies in the "trainer's" cellular matrix, and tries to find this city's closest neuron in the "learner's" cellular matrix, starting from the same cell coordinate. If this cell of the "learner's" cellular matrix contains no neurons, the algorithm searches the neighboring cells one by one in a spiral manner centered on the starting cell. Every search stops at radius \((lssr + 1)\) or at the maximum search range in the cellular matrix (LSSR), where lssr is the radius at which the operator encounters the first closest neuron. It has been proved that a single spiral search operation on a bounded or uniform data distribution takes on average O(1) time to find the closest point [22].
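A simplified, sequential sketch of this local spiral search is given below; in the GPU version one such search is executed per thread, and the ring-by-ring traversal and identifiers are our own illustration:

```cpp
#include <algorithm>
#include <cstdlib>
#include <limits>
#include <vector>

struct Point { float x, y; };

// Sequential sketch of one local spiral search. cellBuckets[i][j] lists the
// neurons lying in cell (i, j) of the "learner's" cellular matrix; LSSR
// bounds the search radius in cells. Identifiers are illustrative.
int spiralSearch(const Point& q, int ci, int cj,
                 const std::vector<std::vector<std::vector<int>>>& cellBuckets,
                 const std::vector<Point>& neurons, int LSSR) {
    int best = -1;
    float bestD = std::numeric_limits<float>::max();
    int stop = LSSR;                      // shrinks to lssr + 1 once a neuron is met
    const int rows = static_cast<int>(cellBuckets.size());
    for (int r = 0; r <= stop; ++r) {
        // visit the ring of cells at Chebyshev distance r around (ci, cj)
        for (int i = ci - r; i <= ci + r; ++i)
            for (int j = cj - r; j <= cj + r; ++j) {
                if (std::max(std::abs(i - ci), std::abs(j - cj)) != r) continue;
                if (i < 0 || j < 0 || i >= rows ||
                    j >= static_cast<int>(cellBuckets[i].size())) continue;
                for (int n : cellBuckets[i][j]) {
                    float dx = neurons[n].x - q.x, dy = neurons[n].y - q.y;
                    float d = dx * dx + dy * dy;
                    if (d < bestD) { bestD = d; best = n; }
                }
            }
        if (best != -1 && stop == LSSR) stop = std::min(LSSR, r + 1);
    }
    return best;   // index of the closest neuron found, or -1 if none within LSSR
}
```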

The train neighborhood step in one epoch follows the Kohonen learning law given in Eq. 1. Considering the current iteration epo, a winner node \(p^*\) has been found in the previous step for one input city p; the Kohonen learning law is then applied to \(p^*\) and to the neurons within a finite neighborhood of \(p^*\) of topological radius r. The topological distance from the winner node, \(d_G\), and the learning rate \(\alpha (epo)\) determine the learning force applied to each neuron; p is the Euclidean position of the input city, \(w_k(epo)\) is the Euclidean position of a neuron, and k indexes the different neighborhood neurons. The proposed method applies this learning law on the doubly linked network instead of on a 2D grid, as shown in Figs. 3 and 5(b). After the training step of one iteration, the algorithm updates the SOM parameters for the next iteration by decreasing the learning rate \(\alpha (epo)\) and the training radius r.

$$\begin{aligned} w_{k}(epo+1)= w_{k}(epo) + \alpha (epo)\times exp(-d_{G}(p^*, k)^2 / r^{2} )\times (p-w_{k}(epo)) \end{aligned}$$
(1)
Fig. 5. SOM training procedure on two different network architectures. A black node is one input point (a pixel or one city); the red node is its winner node. (Color figure online)

We should emphasize that the train neighborhood operation accesses all needed neighboring neurons iteratively, one "circle" after another, according to their topological distance from the winner node. As shown in Fig. 5(b), only after all green neurons have been trained does the training procedure begin to train the blue neurons. This iterative operation ensures that neurons with a larger topological distance \(d_G\) do not move closer to the winner node than neurons with a smaller \(d_G\) during one training process, which preserves the predefined topological relationship between nodes.
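A minimal sketch of this circle-by-circle training on the doubly linked network is shown below, using a breadth-first traversal from the winner node to obtain the topological distances; identifiers and the exact radius test are our own assumptions:

```cpp
#include <cmath>
#include <queue>
#include <vector>

struct Node { float x, y; std::vector<int> links; };

// Simplified sketch of one train-neighborhood step: starting from the winner
// node, a breadth-first traversal of the doubly linked network visits the
// neighbors circle by circle (by topological distance d_G) up to radius rMax
// and applies Eq. (1) to each visited neuron. Identifiers are illustrative.
void trainNeighborhood(std::vector<Node>& net, int winner,
                       float px, float py, float alpha, float rMax) {
    std::vector<int> dG(net.size(), -1);   // topological distance from the winner
    std::queue<int> q;
    dG[winner] = 0;
    q.push(winner);
    while (!q.empty()) {
        int k = q.front(); q.pop();
        float d = static_cast<float>(dG[k]);
        float force = alpha * std::exp(-(d * d) / (rMax * rMax));   // Eq. (1)
        net[k].x += force * (px - net[k].x);
        net[k].y += force * (py - net[k].y);
        if (dG[k] < static_cast<int>(rMax))        // expand the next "circle"
            for (int nb : net[k].links)
                if (dG[nb] == -1) { dG[nb] = dG[k] + 1; q.push(nb); }
    }
}
```

Because the queue processes nodes in non-decreasing \(d_G\), each circle is fully trained before the next one starts, matching the behavior described above.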

Apply SOM to TSP. When applying SOM specifically to TSP applications, every input city should have a chance to be a "trainer" of the same network, and this requirement is easily satisfied by assigning one thread per city. Once SOM has stopped, each city is projected to its closest non-occupied neuron to generate an initial TSP solution for further optimization. One TSP solution produced by SOM is shown in Fig. 6.

Fig. 6. A result of the proposed SOM implementation for lu980.tsp from TSPLIB.

3.3 Massively Parallel 2-opt with Data Decomposition

To optimize the TSP solution produced by SOM, various 2-opt strategies can be used, but the inherent properties of 2-opt make a massively parallel implementation more complex because of the tour ordering requirement. The reasons are as follows: first, one 2-opt move needs to consider the tour ordering to avoid cutting the tour, as shown in Fig. 1; second, several correct 2-opt moves found simultaneously in the same tour orientation may also cut the tour, as shown in Fig. 7(b, c). However, the non-interacting 2-exchanges shown in Fig. 7(a) can be executed in parallel without cutting the tour.
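As a reference for the representation of Fig. 2(b), executing one 2-exchange on the doubly linked tour only requires relinking the four involved cities; a minimal sketch with our own identifiers could be:

```cpp
#include <vector>

// Executing one 2-exchange on the undirected doubly linked tour of Fig. 2(b):
// edges (a, b) and (c, d) are replaced by (a, c) and (b, d). Only the links
// of the four involved cities change; no segment reversal is needed.
struct City { float x, y; int link[2]; };   // each city keeps its two tour neighbors

void replaceLink(City& city, int oldNb, int newNb) {
    if (city.link[0] == oldNb) city.link[0] = newNb;
    else                       city.link[1] = newNb;
}

void apply2Exchange(std::vector<City>& tour, int a, int b, int c, int d) {
    replaceLink(tour[a], b, c);   // edge (a, b) becomes (a, c)
    replaceLink(tour[c], d, a);
    replaceLink(tour[b], a, d);   // edge (c, d) becomes (b, d)
    replaceLink(tour[d], c, b);
}
```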

To obtain a massively parallel 2-opt optimization while respecting these properties of 2-opt, one simple choice is to adopt a strategy of massively parallel 2-opt evaluation with sequential execution, abbreviated as "parallel evaluation but sequential execution". Its principle is straightforward: the algorithm begins by simultaneously searching for each edge's 2-opt optimization in the same tour orientation according to a certain neighboring edge searching rule, but only the non-interacting 2-exchanges shown in Fig. 7(a) are sequentially detected and executed on the CPU side in one iteration.

Fig. 7. Massively parallel 2-opt framework: parallel evaluation but sequential execution. (a) Case of multiple 2-exchanges that do not influence each other. (b, c) Cases of multiple 2-exchanges interacting with each other: multiple 2-exchanges share the same edge in (b); executing the two 2-opt moves in (c) would cut the original integral tour.

Fig. 8. Massively parallel 2-opt framework with a certain edge searching rule.

The overall outline of this massively parallel 2-opt framework is shown in Fig. 8. The algorithm begins with a TSP solution represented by a doubly linked list in which each city \(P_m (m = 0,1,2 ... N-1)\) has exactly two links that compose the TSP tour ring. In one iteration, the first step follows the current tour solution and assigns each city a unique increasing tour order. We use this tour order to detect non-interacting 2-opt moves and to indicate the current tour orientation in later operations. After this "refresh tour order" step, the algorithm copies the necessary data from the CPU to the GPU side and launches kernel functions that check each edge's 2-opt optimization according to a certain neighboring edge searching rule and store each edge's 2-opt information at the edge's starting city according to the current tour orientation. Finally, the algorithm copies each edge's optimization information from the GPU back to the CPU side to sequentially detect and execute the non-interacting 2-opt exchanges.
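A hedged sketch of the sequential detection step is given below. For illustration we reduce each candidate to the tour orders of its two edge-starting cities plus its gain, and we greedily keep only moves whose tour-order intervals are pairwise disjoint; disjoint intervals cannot interact, so the selected exchanges can all be executed in the same iteration. The paper's exact detection rule may differ:

```cpp
#include <algorithm>
#include <vector>

// Sequential "detect and execute" step on the CPU side (illustrative).
// Each candidate move found by a GPU thread is summarized by the tour orders
// of the starting cities of its two removed edges and its gain.
struct Move { int ordFrom, ordTo; float gain; };   // with ordFrom < ordTo

std::vector<Move> selectNonInteracting(std::vector<Move> candidates) {
    // Prefer the most improving moves first.
    std::sort(candidates.begin(), candidates.end(),
              [](const Move& a, const Move& b) { return a.gain > b.gain; });
    std::vector<Move> accepted;
    for (const Move& m : candidates) {
        bool independent = true;
        for (const Move& a : accepted)
            if (!(m.ordTo < a.ordFrom || a.ordTo < m.ordFrom)) {  // intervals overlap
                independent = false;
                break;
            }
        if (independent) accepted.push_back(m);   // later executed by relinking
    }
    return accepted;
}
```

Each accepted move can then be applied with a relinking routine such as apply2Exchange sketched earlier.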

Under this parallel scheme, various concrete implementations can be applied. For example, the algorithm can simultaneously search for each edge's 2-opt optimization along the integral tour or within a local search range. Here, we apply a concrete edge searching rule that searches each edge's 2-opt optimization among its neighboring edges using the local spiral search operator, namely 2-opt LSS.
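For completeness, the gain test applied when checking edge (a, b) against a candidate neighboring edge (c, d) found by the spiral search could look like this (a standard 2-opt gain computation, with our own identifiers):

```cpp
#include <cmath>

struct Pt { float x, y; };

// Standard 2-opt gain: replacing edges (a, b) and (c, d) by (a, c) and (b, d)
// shortens the tour when the returned value is positive.
float dist(const Pt& p, const Pt& q) {
    float dx = p.x - q.x, dy = p.y - q.y;
    return std::sqrt(dx * dx + dy * dy);
}

float twoOptGain(const Pt& a, const Pt& b, const Pt& c, const Pt& d) {
    return dist(a, b) + dist(c, d) - dist(a, c) - dist(b, d);
}
```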

4 Experiments and Analysis

We implement this combined method on GPU, taking advantage of the GPU's parallel read/write operations on global memory with atomic control [24]. For the neighborhood 2-opt optimization using local spiral search (2-opt LSS), its optimization capability compared with the traditional sequential exhaustive 2-opt along the integral tour (2-opt EAT) is of interest. We test these two strategies on the same TSP solutions produced by SOM. Both optimization methods adopt a "first optimized, first accepted" evaluation strategy. As neither method can further optimize the TSP tour after a fixed number of iterations, we set a stop criterion to judge whether the tour has been optimized in the current iteration.

The parameter setting of Eq. 1 influences the result quality and running time of SOM. We apply the same parameters, shown in Table 1, for the different TSP instances. \(\alpha _{ini}\) and \(\alpha _{final}\) are the learning rates at the first and final epochs; \( r_{ini}\) and \(r_{final}\) are the neighborhood topological radii from the winner node at the first and final epochs; epo sets the number of iterations; LSSR sets the search range on the cellular matrix for the spiral search operator.

Table 1. Parameter setting for SOM.

Average results over ten tests for each TSP instance are shown in Tables 2 and 3. In each table, 2-opt works on the same SOM results. For each method listed in the two tables, "t(s)" is the average time taken by one test, including the time needed for generating random data, refreshing the TSP tour order and copying data from GPU to CPU; "%PDM" is the percentage deviation of the mean solution from the optimum solution; "%PDB" is the percentage deviation of the best solution from the optimum solution; "epo" is the average number of iterations in one test.

Table 2. Test results of sequential implementations.
Table 3. Test results of parallel implementations.
Fig. 9. Visual results of one test in Table 3. Columns from left to right for each TSP instance: (a) results of the parallel SOM; (b) results of the sequential exhaustive 2-opt along the tour; (c) results of the massively parallel 2-opt with local spiral search.

Comparing the running times in the two tables, the acceleration obtained by the proposed parallel implementations is obvious. We expect the acceleration factor to be even larger for bigger TSP instances, as the proposed parallel platform for both SOM and 2-opt requires O(N) memory and O(N) GPU threads.

Comparing solution quality, we should mention that the proposed model follows a procedure similar to the original standard SOM, and we did not try to tune the parameters for the best SOM performance. The maximum number of iterations, running time and result quality of the two 2-opt methods are, however, influenced by the initial TSP results of SOM.

Visual results of one test in Table 3 are shown in Fig. 9. Our experiments run on a laptop with an Intel(R) Core(TM) i7-4710HQ CPU at 2.5 GHz and a GeForce GTX 850M GPU card.

5 Conclusion

In this paper, we propose an alternative parallel computing platform for both SOM and 2-opt. We test the proposed methods with GPU parallel computation and report the obtained acceleration factors. We believe the acceleration factor will increase for very large scale instances as the capacity of parallel devices grows. Future work will concentrate on tests with very large TSP instances and comparisons with more combinatorial optimization methods.