Solving the inverse heat conduction problem using NVLink-capable Power architecture

Abstract

The accurate knowledge of Heat Transfer Coefficients is essential for the design of precise heat transfer operations. The determination of these values requires Inverse Heat Transfer Calculations, which are usually based on heuristic optimisation techniques, like Genetic Algorithms or Particle Swarm Optimisation. The main bottleneck of these heuristics is the high computational demand of the cost function calculation, which is usually based on heat transfer simulations producing the thermal history of the workpiece at given locations. This Direct Heat Transfer Calculation is a well parallelisable process, making it feasible to implement an efficient GPU kernel for this purpose. This paper presents a novel step forward: based on the special requirements of the heuristics solving the inverse problem (executing hundreds of simulations in a parallel fashion at the end of each iteration), it is possible to gain a higher level of parallelism using multiple graphics accelerators. The results show that this implementation (running on 4 GPUs) is about 120 times faster than a traditional CPU implementation using 20 cores. The latest developments in GPU-based high-performance computing were also analysed, such as the new NVLink connection between the host and the devices, which aims to eliminate the long-standing data transfer handicap of GPU programming.


It is a fundamental observation of modern materials science that material properties are influenced by the microstructure; therefore, they can be altered to improve the mechanical attributes [1]. One of the most widely used methods for this purpose is heat treatment, which usually consists of two consecutive steps: heating the work object up to a given high temperature and cooling it down in a precisely controlled environment. To achieve the best results, it is necessary to know the attributes of the given material and of the environment, especially the Heat Transfer Coefficient (HTC), which quantifies the amount of heat exchanged between the object and the surrounding cooling medium.

The Inverse Heat Conduction Problem (IHCP, i.e. the determination of the HTC) is a typical ill-posed problem [2,3,4]. Without any known analytical solution, most methods are based on the comparison of temperature signals recorded during real heat treatment processes and temperature signals estimated by simulations. The aim of these methods is to find the HTC function giving the minimal deviation between the measured and predicted temperature data.

It is usual to use heuristic algorithms, like Genetic Algorithms (GAs) [5], [...] (based on finite-element or finite-difference techniques), it is feasible to simulate the cooling process and to record the thermal history for each chromosome. The difference between this generated thermal history and the measured one produces the cost value for the individual. The purpose of the IHCP process is to find the best gene values resulting in minimal cost.

The bottleneck of this process is the high computational demand. The runtime of one cooling process simulation is about 200 milliseconds using one traditional CPU core, and it is necessary to run these simulations for each chromosome [...]

The main difference between these studies and this research is that this paper focuses on the two-dimensional IHCP. Heat transfer simulation is a major part of the IHCP solving process; moreover, it is necessary to run thousands of simulations. Accordingly, it is feasible to use a higher level of parallelism by using multi-GPU architectures (the presented papers usually deal with only one device). It is possible to install multiple graphics cards into a standard PC motherboard, and the CUDA programming environment can handle all of them. Using multiple GPUs can double/triple/quadruple the processing power, but it is necessary to adapt the algorithms to this higher level of parallelism. [...]

Figure 1: Two-dimensional axis-symmetrical heat conduction model.

[...] The assumption was that it is worth implementing an IHCP solver system based on this architecture.
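To make the intended work distribution concrete, the following minimal sketch shows how the cost-function evaluations of one generation could be split across all installed devices using the CUDA runtime API. The kernel simulateCoolingKernel, its one-HTC-parameter-per-individual interface and the fixed block size are hypothetical placeholders for illustration only, not the implementation described in this paper.

```cpp
// Sketch of one GA generation evaluated on all installed GPUs. The kernel body,
// its interface and the one-parameter-per-individual layout are hypothetical
// placeholders for illustration, not the implementation described in the paper.
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

__global__ void simulateCoolingKernel(const float* htc, float* cost, int n)
{
    int i = blockIdx.x;            // one thread block per individual
    if (i < n && threadIdx.x == 0)
        cost[i] = htc[i];          // placeholder for the real cooling simulation
}

void evaluateGenerationMultiGpu(const std::vector<float>& htc, std::vector<float>& cost)
{
    int devices = 0;
    cudaGetDeviceCount(&devices);
    if (devices == 0) return;

    const int population = static_cast<int>(cost.size());
    const int chunk = (population + devices - 1) / devices;
    std::vector<float*> dHtc(devices, nullptr), dCost(devices, nullptr);

    for (int dev = 0; dev < devices; ++dev) {
        const int first = dev * chunk;
        const int count = std::min(chunk, population - first);
        if (count <= 0) continue;

        cudaSetDevice(dev);
        cudaMalloc(&dHtc[dev],  count * sizeof(float));
        cudaMalloc(&dCost[dev], count * sizeof(float));
        cudaMemcpyAsync(dHtc[dev], htc.data() + first,
                        count * sizeof(float), cudaMemcpyHostToDevice);

        // Each device simulates its own slice of the population concurrently.
        simulateCoolingKernel<<<count, 32>>>(dHtc[dev], dCost[dev], count);

        cudaMemcpyAsync(cost.data() + first, dCost[dev],
                        count * sizeof(float), cudaMemcpyDeviceToHost);
    }

    // The GA may proceed to selection/crossover only after every device has finished.
    for (int dev = 0; dev < devices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(dHtc[dev]);
        cudaFree(dCost[dev]);
    }
}
```

The essential point of this layout is that each device receives its own slice of the population and launches its simulations independently, so the devices work concurrently and the GA continues only after all of them have finished.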

Based on these advancements, a novel numerical approach and a massively parallel implementation for estimating the theoretical thermal history are outlined.

The rest of the paper is structured as follows: the next section presents the novel parallel DHCP and IHCP solver methods; Section 3 presents the raw results of [...]

Based on this simplified model, the mathematical formulation of the non-linear transient heat conduction can be described as Eq. 1, with the following initial and boundary conditions (Eq. 2-5), where:

• r, z - local coordinates;

• R - radius of the workpiece;

• T0 - initial temperature of the workpiece;

• Tq - temperature of the cooling medium;

• T(r, z, t) - temperature of the workpiece at a given location/time;

• k(T) - thermal conductivity (varying with temperature);

[...]

The C++ standard std::chrono::high_resolution_clock was used to measure the execution time. To decrease uncertainty, 20 independent tests were run for all parameter sets, removing the lowest and highest runtimes (5-5%) and [...] Table 1 and Fig. 4 show the GPU results for TE1, and Table 2 [...] Table 3 and Fig. 6 show the CPU results for TE2.
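As an illustration of this measurement procedure, the following sketch times a benchmarked routine 20 times with std::chrono::high_resolution_clock, discards the lowest and highest 5% of the samples and averages the rest; averaging the trimmed samples is an assumption made for this sketch.

```cpp
// Sketch of the runtime measurement: 20 repetitions, the lowest and highest 5%
// of the samples (one each for 20 runs) discarded, the remainder averaged.
// Averaging the trimmed samples is an assumption made for this sketch.
#include <algorithm>
#include <chrono>
#include <numeric>
#include <vector>

template <typename Benchmark>
double trimmedMeanRuntimeMs(Benchmark&& run, int repetitions = 20, double trim = 0.05)
{
    std::vector<double> samplesMs;
    samplesMs.reserve(repetitions);

    for (int i = 0; i < repetitions; ++i) {
        const auto start = std::chrono::high_resolution_clock::now();
        run();   // one benchmarked evaluation; GPU code must synchronise before returning
        const auto stop = std::chrono::high_resolution_clock::now();
        samplesMs.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }

    std::sort(samplesMs.begin(), samplesMs.end());
    const long cut = static_cast<long>(repetitions * trim);   // 1 sample from each end for 20 runs
    return std::accumulate(samplesMs.begin() + cut, samplesMs.end() - cut, 0.0) /
           static_cast<double>(repetitions - 2 * cut);
}
```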

CPU performance analysis is not the focus of this paper; these benchmarks were run only for the CPU-GPU comparison. As visible, increasing the number of cores effectively increases the performance (each core is responsible for one heat transfer simulation). On the other hand, as the population size was increased, the runtime increased almost linearly.

In the case of the CPU, the implementation is much simpler.

There is no need to transfer input data from the host to the device and the output data from the device to the host, and kernel launch overhead is also missing. Therefore, the expectation was that the CPU would be faster in the case of small population sizes (where the GPUs cannot take advantage of the high number of multiprocessors), but as is visible, this was not true.
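A minimal sketch of such a CPU evaluation loop is given below, assuming an OpenMP parallel for in which each core processes one chromosome at a time; the function simulateCoolingCpu is a hypothetical placeholder for the serial cooling simulation, and the actual threading approach of the reference implementation may differ.

```cpp
// Minimal sketch of a CPU-side cost-function evaluation (assumed OpenMP model).
// simulateCoolingCpu is a hypothetical placeholder, not the paper's actual code.
#include <vector>

double simulateCoolingCpu(const std::vector<double>& htcParams)
{
    // Placeholder body: the real function would run the full cooling simulation
    // and return the deviation from the measured thermal history.
    double s = 0.0;
    for (double p : htcParams) s += p;
    return s;
}

void evaluateGenerationCpu(const std::vector<std::vector<double>>& population,
                           std::vector<double>& cost)
{
    // No host-device transfers and no kernel launch overhead: each core simply
    // evaluates one chromosome (one complete cooling simulation) at a time.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < static_cast<int>(population.size()); ++i)
        cost[i] = simulateCoolingCpu(population[i]);
}
```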

Comparing these results to the GPU runtimes, it is evident that it is not [...] Interestingly, the usage of three or four devices does not have any beneficial effect.

The reason for this is that the P100 node used has a special architecture, in that [...]