Multiprocessing implementation for MCNP using Python

Monte Carlo N-Particle (MCNP) is a widely-used code in nuclear engineering, but it needs high computation times due to the tracking of every single particle and interaction event. This causes shielding optimization using trial-and-error to take long times to complete. It can be solved by using multiprocessing, but this requires the MCNP source code which can be difficult to obtain. Therefore, this paper aims to suggest a solution on how Python can be used to run multiple instances of the code in shielding optimization. Two hardware setups were tested: one with a dual-core CPU, and one with a six-core CPU. Each of them would run several repeated simulations with and without multiprocessing enabled. Comparisons were made between each case of the same setup to observe the improvements in completion time. A tutorial of the Python algorithm is also provided in the methodology.


Introduction
Monte Carlo N-Particle (MCNP) is a nuclear simulation code developed and maintained by Los Alamos National Laboratory [1]. It is one of the nuclear modeling software that utilizes the Monte Carlo method. Other examples include Geant [2], Fluka [3], OpenMC [4], and SuperMC [5]. These codes provide very accurate results for complex situations and geometries. However, they are known to be computationally expensive, needing more calculation and time as the complexity increases. This is because the software needs to track every particle from its birth to its death or escape from the simulation space. Several works had been done to improve the runtimes of Monte Carlo simulation for radiation transport. Rydén et al. (2018) [6] utilized four graphics processing units (GPUs) to run an MC code, simulating 3.2 billion photons/s for single-photon emission computed tomography (SPECT) reconstruction. Liang et al. (2020) [7] argued that there are some limitations to using GPUs in Monte Carlo simulations, thus they set out to test the implementation of inverse transform sampling technique on CPU, hoping that it could be used to solve the GPU limitations. Nevertheless, multiprocessing is also available natively for some versions of MCNP. However, they may not be easily obtained compared to the serial version. There is a trend of using MCNP in metaheuristic algorithm optimization which demands the improvement in MCNP runtimes. This is because in these algorithms, the MCNP will be executed multiple times until the design variables were optimized.
One of the earliest studies was by Hu et al. (2008) [8] where they coupled the MCNP code to a genetic algorithm (GA) metaheuristic to optimize the design of a composite shield in a mixed radiation situation . The result was five compositions that showed comparable attenuation with lead oxide. Cai et al. (2018) [9] used the same method to optimize a multilayer shield for uranium-235 radiation. The result was six arrangements of material that proved the reliability and consistency of the metaheuristic in shield optimization. Chen et al. (2019) [10] had also implemented the same strategy in designing the shielding of a nuclear reactor in a passenger-cargo ship from a set of 15 materials. The same application was  [11] but it was for bismuth-borate glasses. They were found to be comparable to limonite concrete and barite concrete for mixed radiation conditions. In 2020, two studies were conducted: one had developed six shielding designs from polyethylene, lead, and boron [12], while the other produced a 16-layer shield for a mobile nuclear reactor system from epoxy resin, lead, boron carbide, and graphene [13].
However, most studies do not reveal the method of how the repetition or multiple instances of MCNP was done. Therefore, this article aims to demonstrate a way of coupling the MCNP code with a Pythonbased algorithm and to investigate the improvement that can be expected when utilizing Python multiprocessing. Two different computer setups are used to observe the increase in performance if the physical CPU cores are increased. A tutorial for the Python algorithm can be found in Section 2 and Appendices.

Hardware setups
Two Windows 10 hardware machines are used to run the Python algorithm. The first one is a laptop from 2017 which has an Intel Core i5-6200U. It is a dual-core, four-thread processor running at 2.4 GHz and it can boost up to 2.7 GHz. The second is a desktop PC that was built in 2018, equipped with the six-core, twelve-thread AMD Ryzen 5 2600X processor. Running at a base clock speed of 3.6 GHz, it has the capability of running at 4.2 GHz on all cores, provided that adequate cooling is maintained. This desktop is run using two cases: (1) all cores capped at 2.7 GHz, and (2) all cores capped at 3.9 GHz. The former is implemented to estimate the effect in the laptop setup if its logical cores were to increase beyond 4. Figure 1 shows the model of MCNP simulation that is run by the algorithm. Its code can be found in Appendix A. The model is a simple radiation attenuation experiment setup with a single monoenergetic source, a slab of shielding, and a detector volume. The source emits neutrons of 1-MeV in the direction of the detector (source direction biasing). The shield is made up of polyethylene and its dimension is set to 3 cm (W) × 3 cm (H) × 3 cm (L). The detector volume is a boron trifluoride-filled cylinder of 2 cm in diameter and 5 cm in length. F4 tally for neutrons is specified in the volume, while secondary photons are neglected. All 20 runs are performed at 2×10 6 nps and they are made to be independent of one another by having distinct seed numbers. A good summary of all the functions in MCNP can be found from the work of Shultis and Faw (2011) [14].

Python multiprocessing for MCNP
The algorithm is developed using Anaconda, a distribution package of Python for scientific computing. It is user-friendly and the results of the code can be seen automatically in the same interface as the code. This means that the code can be written, run, and debugged conveniently. The algorithm is written in a Jupyter notebook (.ipynb) which is opened through the Anaconda Prompt. If done otherwise, it cannot access the MCNP cross-section file "xsdir". Figure 2 shows the flowchart of the algorithm, while the code can be found in Appendix B. Firstly, the multiprocessing function is called. For clarification, this function tells each logical core (thread) to run a single MCNP input at a time ( Figure 3) rather than all of them run a single input. Therefore, the maximum number for the "Pool" depends on the number of logical cores. For example, the highest number for the desktop setup is 12 since it has 12 threads. After that, the "os" function is imported to enable the algorithm to run MCNP code from the command prompt. The mcnp.exe and the .ipynb file also need to be in the same folder.
Then, the MCNP input files are created using the algorithm so that each of them has a distinct seed number. This is done by defining a new function named "mcnp_in". The code for this function can be found in Appendix C. After the MCNP input lines have been added, the input is set to be run by using the command "os.system()". After the run is completed, the output file is read to extract the desired output such as tallies and relative errors. At the end of the function, all dump files ("runtpe", "runtpf", etc.) are deleted so that the next MCNP runs will not be exhausted of letters for their dump file name.
The multiple instances of MCNP runs are executed using the command "pool.starmap(mcnp_in, zip(input))". For example, if the Pool is set to 12 and there are 20 MCNP runs, the algorithm will write IOP Publishing doi:10.1088/1757-899X/1231/1/012003 4 12 MCNP input files and run them simultaneously. It does not wait until all 12 runs are completed to start on the remaining 8 runs. Instead, it creates and runs new ones as the current 12 runs finish. The time taken for all runs to be completed is recorded using the "time" function. Table 1 shows the results of the experiment for logical cores, while Figure 4 plots the completion time against the number of logical cores. Note that when there is only one logical core, it means that the MNCP runs are executed in series using one logical core. Based on Table 1, there are no significant changes to the average tally and the average relative error when the number of logical cores increases. The average relative errors for all cases are also constant at 0.0038 which is less than 0.05. Although this does not guarantee the acceptability of the tally, they are reliable based on the tally interpretation [12].  ). When the logical cores are increased to 4, the desktop shows a 70% decrease in time, while the laptop shows a 54% decrease. At 6 and 8 logical cores, the time taken for the desktop decreases by 80% and 83%, respectively. The remaining 10 and 12 logical cores only decrease the MCNP runtime at very similar rates of 86.6% and 86.7%, respectively. These results suggest that the improvement per increase in logical cores is not linear and there seems to be a diminishing return after a certain number of logical cores. The reason might be because at 1 to 6 logical cores, the MCNP runs are each assigned to one thread of the 6 physical cores. Further increase in the logical cores means that the physical cores are now being shared by two threads, reducing the performance gain. Other processes that are running in the background might also cause a decrease in improvement. Nevertheless, it is clear that by having more logical cores, the MCNP runs can be finished in significantly less runtimes. Table 2 shows the results when the CPU clock speed increases. Similar to Table 1, the average tally and relative error show no significant changes with the clock speed. Figure 5 shows the details of improvement per GHz increase on the desktop setup. It is observed that increasing the clock speed reduces the MCNP completion times. The trends suggest that their relationship is linear. However, the gradient decreases with the number of logical cores. For 1 logical core, the time taken decreases by 121.35 seconds per GHz. However, increasing the logical core by 1, the gradient decreases to 66.95 seconds per GHz (45% less than the previous case). At the maximum number of logical cores, the completion time only decreases by 17.73 seconds per GHz. This trend shows the same diminishing returns as seen in Figure 4. There might be other factors that caused these results such as the way that instructions per core (IPC) work, a limited amount of cache memory, memory latency, and other hardware or software limitations/bottlenecks.

Conclusion
A Python algorithm is developed to investigate the improvement in MCNP runtimes that can be expected when multiprocessing is enabled. 20 MCNP simulations are run using two hardware setups: a laptop with a dual-core, 4-thread processor and a desktop with a 6-core, 12-thread processor. The tutorial for writing the algorithm is also provided and all relevant codes can be found in the appendices.
By enabling multiprocessing or running MCNP in parallel, the runtimes can be decreased by at least 37.5%. Both setups show a similar trend when the number of logical cores is increased. However, the improvement per increase in logical cores is not linear and diminishing returns are observed when the logical cores exceed the physical core count. Increasing the clock speed also decreases the time taken, but the gradient reduces as the logical cores increase. Therefore, it is recommended to prioritize the number of logical cores over the CPU clock speed when running multiple instances of MCNP for metaheuristic optimizations.
Future works can be directed towards expanding the algorithm to check for the acceptability of the tally results automatically. Photons can be added into the MCNP input so that mixed radiation can be simulated which is the usual case for metaheuristic optimizations. An investigation can be done on running the Python-MCNP algorithm using discrete graphics processing units (GPU) especially Nvidia GPUs. They typically have several hundred to thousands of cores. It is interesting to observe the improvements that these GPUs can provide to parallel MCNP simulations.  os.system("del %sn" %(x)) os.system("del %so" %(x)) os.system("del runtpe,runtpf,runtpg,runtph,runtpi,runtpj,runtpk,runtpl") return x,d