Simulation of a turbulent flow subjected to favorable and adverse pressure gradients

  • Original Article
  • Published in Theoretical and Computational Fluid Dynamics

Abstract

This paper reports the results from a direct numerical simulation of an initially turbulent boundary layer passing over a wall-mounted “speed bump” geometry. The speed bump, represented in the form of a Gaussian distribution profile, generates a favorable pressure gradient region over the upstream half of the geometry, followed by an adverse pressure gradient over the downstream half. The boundary layer approaching the bump undergoes strong acceleration in the favorable pressure gradient region before experiencing incipient or very weak separation within the adverse pressure gradient region. These types of flows have proved to be particularly challenging to predict using lower-fidelity simulation tools based on various turbulence modeling approaches and warrant the use of the highest fidelity simulation techniques. The present direct numerical simulation is performed using a flow solver developed exclusively for graphics processing units. Simulation results are utilized to examine the key phenomena present in the flowfield, such as relaminarization/stabilization in the strong acceleration region succeeded by retransition to turbulence near the onset of adverse pressure gradient, incipient/weak separation and development of internal layers, where the sense of streamwise pressure gradient changes at the foot, apex and tail of the bump.


Notes

  1. The GPU cluster contains 72 V100s total. Each V100 has a peak performance of 7.8 Tera-FLOPs per second in double precision.

  2. These RANS calculations were performed in separate independent studies by our colleagues.

  3. The data from the RANS calculation with the low-Re S–A model was unavailable prior to the start of the present DNS.

  4. Credit is due Dr. Philippe Spalart for suggesting this method.

  5. The same scale is used for \(\delta \), \(\delta ^{*}\) and \(\theta \) because of the limit of 5 vertical axes imposed by the plotting software.

  6. Note that nondimensionalization of \(\varDelta _p\) would bring \(\mathrm{Re}_L\) into the denominator. An increase in the Reynolds number would lower \(u_\tau \), but the product of the larger \(\mathrm{Re}_L\) and the smaller \(u^3_\tau \) in the denominator of \(\varDelta _p\) would generally be greater than the corresponding product at the lower Reynolds number, and can be estimated more precisely using the available skin friction correlations. Hence, an increase in the Reynolds number would require a stronger favorable pressure gradient, relative to the lower Reynolds number case, for \(\varDelta _p\) to reach a certain threshold.

  7. See https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/ for a comparison between NVLink versus PCI-Express performance.

References

  1. Ashcroft, G., Zhang, X.: Optimized prefactored compact schemes. J. Comput. Phys. 190(2), 459–477 (2003). https://doi.org/10.1016/S0021-9991(03)00293-6

  2. Aubard, G., Stefanin Volpiani, P., Gloerfelt, X., Robinet, J.C.: Comparison of subgrid-scale viscosity models and selective filtering strategy for large-eddy simulations. Flow Turbul. Combust. 91(3), 497–518 (2013). https://doi.org/10.1007/s10494-013-9485-5

  3. Badri Narayanan, M.A., Ramjee, V.: On the criteria for reverse transition in a two-dimensional boundary layer flow. J. Fluid Mech. 35(2), 225–241 (1969). https://doi.org/10.1017/S002211206900108X

  4. Berland, J., Bogey, C., Marsden, O., Bailly, C.: High-order, low dispersive and low dissipative explicit schemes for multiple-scale and boundary problems. J. Comput. Phys. 224(2), 637–662 (2007). https://doi.org/10.1016/j.jcp.2006.10.017

  5. Bogey, C., Bailly, C.: A family of low dispersive and low dissipative explicit schemes for flow and noise computations. J. Comput. Phys. 194(1), 194–214 (2004). https://doi.org/10.1016/j.jcp.2003.09.003

  6. Bogey, C., Bailly, C.: A shock-capturing methodology based on adaptative spatial filtering for high-order non-linear computations. J. Comput. Phys. 228(5), 1447–1465 (2009). https://doi.org/10.1016/j.jcp.2008.10.042

  7. Gaitonde, D.V., Visbal, M.R.: Padé-type higher-order boundary filters for the Navier–Stokes equations. AIAA J. 38(11), 2103–2112 (2000). https://doi.org/10.2514/2.872

  8. Gottlieb, S., Shu, C.W., Tadmor, E.: Strong stability-preserving high-order time discretization methods. SIAM Rev. 43(1), 89–112 (2001). https://doi.org/10.1137/S003614450036757X

  9. Lund, T.S., Wu, X., Squires, K.D.: Generation of turbulent inflow data for spatially-developing boundary layer simulations. J. Comput. Phys. 140(2), 233–258 (1998). https://doi.org/10.1006/jcph.1998.5882

  10. Morgan, B., Larsson, J., Kawai, S., Lele, S.K.: Improving low-frequency characteristics of recycling/rescaling inflow turbulence generation. AIAA J. 49(3), 582–597 (2011). https://doi.org/10.2514/1.J050705

  11. Moser, R.D., Kim, J., Mansour, N.N.: Direct numerical simulation of turbulent channel flow up to \({\text{ Re }}_\tau = 590\). Phys. Fluids 11(4), 943–945 (1999). https://doi.org/10.1063/1.869966

  12. Muck, K.C., Hoffmann, P.H., Bradshaw, P.: The effect of convex surface curvature on turbulent boundary layers. J. Fluid Mech. 161, 347–369 (1985). https://doi.org/10.1017/S002211208500297X

  13. Narasimha, R., Sreenivasan, K.R.: Relaminarization in highly accelerated turbulent boundary layers. J. Fluid Mech. 61(3), 417–447 (1973). https://doi.org/10.1017/S0022112073000790

  14. Patel, V.C., Head, M.R.: Reversion of turbulent to laminar flow. J. Fluid Mech. 34(2), 371–392 (1968). https://doi.org/10.1017/S0022112068001953

  15. Schlatter, P., Örlü, R.: Assessment of direct numerical simulation data of turbulent boundary layers. J. Fluid Mech. 659, 116–126 (2010). https://doi.org/10.1017/S0022112010003113

  16. Sciacovelli, L., Cinnella, P., Gloerfelt, X.: Direct numerical simulations of supersonic turbulent channel flows of dense gases. J. Fluid Mech. 821, 153–199 (2017). https://doi.org/10.1017/jfm.2017.237

  17. Slotnick, J.P.: Integrated CFD validation experiments for prediction of turbulent separated flows for subsonic transport aircraft. In: NATO Science and Technology Organization, Meeting Proceedings RDP, STO-MP-AVT-307 (2019). https://doi.org/10.14339/STO-MP-AVT-307

  18. So, R.M.C., Mellor, G.L.: Experiment on turbulent boundary layers on a concave wall. Aeronaut. Q. 26(1), 25–40 (1975). https://doi.org/10.1017/S0001925900007174

  19. Spalart, P.R.: Numerical study of sink-flow boundary layers. J. Fluid Mech. 172, 307–328 (1986). https://doi.org/10.1017/S0022112086001751

  20. Spalart, P.R., Belyaev, K.V., Garbaruk, A.V., Shur, M.L., Strelets, M.K., Travin, A.K.: Large-eddy and direct numerical simulations of the Bachalo–Johnson flow with shock-induced separation. Flow Turbul. Combust. 99(3–4), 865–885 (2017). https://doi.org/10.1007/s10494-017-9832-z

  21. Spalart, P.R., Watmuff, J.H.: Experimental and numerical study of a turbulent boundary layer with pressure gradients. J. Fluid Mech. 249, 337–371 (1993). https://doi.org/10.1017/S002211209300120X

  22. Uzun, A., Hussaini, M.Y.: Some issues in large-eddy simulations for chevron nozzle jet flows. J. Propuls. Power 28(2), 246–258 (2012). https://doi.org/10.2514/1.B34274

  23. Uzun, A., Malik, M.R.: Large-eddy simulation of flow over a wall-mounted hump with separation and reattachment. AIAA J. 56(2), 715–730 (2018). https://doi.org/10.2514/1.J056397

  24. Uzun, A., Malik, M.R.: Wall-resolved large-eddy simulations of transonic shock-induced flow separation. AIAA J. 57(5), 1955–1972 (2019). https://doi.org/10.2514/1.J057850

  25. Verstappen, R.W.C.P., Veldman, A.E.P.: Direct numerical simulation of turbulence at lower costs. J. Eng. Math. 32(2–3), 143–159 (1997). https://doi.org/10.1023/A:1004255329158

  26. Visbal, M.R., Gaitonde, D.V.: Very high-order spatially implicit schemes for computational acoustics on curvilinear meshes. J. Comput. Acoust. 9(4), 1259–1286 (2001). https://doi.org/10.1016/S0218-396X(01)00054-1

  27. Vreman, A.W., Kuerten, J.G.M.: Statistics of spatial derivatives of velocity and pressure in turbulent channel flow. Phys. Fluids 26(8), 085103-1/29 (2014). https://doi.org/10.1063/1.4891624

  28. Warnack, D., Fernholz, H.H.: The effects of a favourable pressure gradient and of the Reynolds number on an incompressible axisymmetric turbulent boundary layer. Part 2. The boundary layer with relaminarization. J. Fluid Mech. 359, 357–381 (1998). https://doi.org/10.1017/S0022112097008501

  29. Williams, O., Samuell, M., Sarwas, S., Robbins, M., Ferrante, A.: Experimental study of a CFD validation test case for turbulent separated flows. In: AIAA Paper 2020-0092, AIAA Scitech 2020 Forum, Orlando (2020). https://doi.org/10.2514/6.2020-0092

Acknowledgements

This work was sponsored by the NASA Transformational Tools and Technologies Project of the Transformative Aeronautics Concepts Program under the Aeronautics Research Mission Directorate. The calculations were made possible by the computing resources provided by the NASA High-End Computing Program through the NASA Advanced Supercomputing Division at Ames Research Center. This work also used resources of the Oak Ridge Leadership Computing Facility (OLCF) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The access to computational resources was provided under the OLCF Director’s Discretion Projects Program. The authors acknowledge valuable discussions with Dr. Philippe Spalart. The RANS solutions were provided by Drs. Michael Strelets and Prahladh Iyer.

Funding

Funding was provided by Langley Research Center.

Author information

Correspondence to Ali Uzun.

Additional information

Communicated by Sergio Pirozzoli.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Presented as AIAA Paper 2020-3061 at the AIAA Aviation 2020 Virtual Forum, 15–19 June 2020.

Appendix

1.1 GPU code scaling performance

The flow solver implements strategies that overlap communication with computation, which are crucial for achieving good performance on GPUs. This overlap is achieved as follows. The GPUs first compute the near-boundary information needed by their neighbors and copy the data to their host central processing units (CPUs) for MPI communication. The GPUs then compute the interior points while the host CPUs handle the MPI communication in parallel. The host CPUs run several OpenMP threads (one dedicated thread per face of a grid block) to perform the MPI data exchange. Assuming a large enough GPU workload, by the time the GPUs have finished computing interior points, the MPI communication among the host CPUs has been completed and the host CPUs have copied the exchanged MPI data back to their corresponding GPUs. The GPUs can then update the points near their block interface using the data received from their neighbors and can proceed further. The data copies between the GPU and the host CPU take place asynchronously, meaning that the GPU can do the computing (as long as there is sufficient work) and the data exchange with the host CPU simultaneously. Moreover, the GPU can run several independent tasks in parallel, as long as the resources needed by the computational kernels are available. The GPU operations are synchronized at critical points.
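
To make the sequence above concrete, the following sketch outlines one way such an overlap can be organized with CUDA streams, OpenMP threads and MPI. It is an illustration under stated assumptions, not the authors' actual implementation: the kernel names, launch configurations and buffer layout are placeholders, and MPI is assumed to be initialized with MPI_THREAD_MULTIPLE so that the per-face host threads may call it concurrently.

// Structural sketch (not the authors' implementation) of one solver step with
// host-assisted MPI communication overlapped with GPU computation. Kernel names,
// launch configurations and the buffer layout are illustrative placeholders.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

constexpr int NFACES = 6;  // one MPI neighbor per face of a structured grid block

// Placeholder declarations standing in for the solver's actual kernels.
__global__ void compute_near_boundary(const double* q, double* halo_out, int face);
__global__ void compute_interior(const double* q, double* rhs);
__global__ void update_interface(double* rhs, const double* halo_in, int face);

void evaluate_rhs(const double* d_q, double* d_rhs,
                  double* d_halo_out[NFACES], double* d_halo_in[NFACES],
                  double* h_send[NFACES], double* h_recv[NFACES],   // pinned host buffers
                  const int halo_count[NFACES], const int neighbor[NFACES],
                  cudaStream_t halo_stream, cudaStream_t bulk_stream, MPI_Comm comm)
{
  // 1. Compute the near-boundary data needed by the neighbors and start
  //    asynchronous device-to-host copies on a dedicated stream.
  for (int f = 0; f < NFACES; ++f) {
    compute_near_boundary<<<128, 256, 0, halo_stream>>>(d_q, d_halo_out[f], f);
    cudaMemcpyAsync(h_send[f], d_halo_out[f], halo_count[f] * sizeof(double),
                    cudaMemcpyDeviceToHost, halo_stream);
  }

  // 2. Launch the much larger interior computation on a separate stream so that
  //    it overlaps with the halo copies and with the MPI traffic below.
  compute_interior<<<8192, 256, 0, bulk_stream>>>(d_q, d_rhs);

  // 3. Host side: one OpenMP thread per face exchanges its halo with the
  //    neighboring rank and copies the received data back to the GPU, while the
  //    GPU keeps working on the interior points.
  cudaStreamSynchronize(halo_stream);   // near-boundary data has reached the host
  #pragma omp parallel for num_threads(NFACES)
  for (int f = 0; f < NFACES; ++f) {
    MPI_Sendrecv(h_send[f], halo_count[f], MPI_DOUBLE, neighbor[f], f,
                 h_recv[f], halo_count[f], MPI_DOUBLE, neighbor[f], f,
                 comm, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_halo_in[f], h_recv[f], halo_count[f] * sizeof(double),
                    cudaMemcpyHostToDevice, halo_stream);
  }

  // 4. Once both streams have drained, update the near-interface points with
  //    the received neighbor data; this is one of the critical synchronization
  //    points mentioned above.
  cudaStreamSynchronize(bulk_stream);
  cudaStreamSynchronize(halo_stream);
  for (int f = 0; f < NFACES; ++f)
    update_interface<<<128, 256, 0, bulk_stream>>>(d_rhs, d_halo_in[f], f);
  cudaStreamSynchronize(bulk_stream);
}

The essential point of this pattern is that the large interior kernel hides both the device-host copies and the network exchange; if the block per GPU is too small, step 2 finishes before step 3 and the GPU sits idle, which is the inefficiency discussed next.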

During the course of the development work, it was realized that the chosen explicit algorithms run extremely fast on the GPU since they consist largely of independent multiply–add operations, at which the GPU excels. However, the host-assisted MPI communication among the GPUs was found to be a major bottleneck, as already known from others’ similar experience in GPU code development. We observed that, unless a large workload is assigned to the GPU and computation is overlapped with communication, MPI communication over the network cannot keep up with GPU computation. When the workload per GPU is too small, the GPU finishes its work too quickly and then sits idle while waiting for the communication to catch up; this idle time is simply a lost opportunity for useful work. Hence, in order to run the GPU at full efficiency and make the best use of limited computational resources, the present strategy is to assign the maximum possible workload to the GPU and overlap computation with communication as much as possible. To achieve the maximum workload, we adjust the grid block size per GPU so as to nearly or fully fill the available GPU global memory, which is 16 GB per GPU on the Summit system. This corresponds to about 24–25 million points per GPU on Summit. The GPUs at the NASA Advanced Supercomputing Division (NAS) facility have 32 GB of global memory, which allows nearly 50 million points per GPU on that system.
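
As a rough consistency check (our own estimate from the figures just quoted, not a number reported by the solver), the stated workload implies a storage footprint per grid point of about

\[
\frac{16\ \mathrm{GB}}{24.5\times 10^{6}\ \text{points}} \approx 650\text{--}700\ \text{bytes per point} \approx 80\text{--}90\ \text{double-precision values per point},
\]

which is of the order one would expect for a high-order, multi-stage solver that must store the conserved variables together with Runge–Kutta stage data, grid metrics and work arrays.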

In this work, the host-assisted MPI communication approach is chosen over the so-called “GPU-aware MPI” option, which accomplishes direct communication among the GPUs without the complication of sending the data through the host CPUs. This choice was made because the performance of the GPU-aware MPI implementation available at the time of code development was found to be rather poor and unacceptable. Although the implementation of host-assisted MPI communication is more involved, assigning a large workload to the GPUs and having the host CPUs handle the MPI communication overlaps computation with communication effectively and hides the communication cost, as noted above. It is therefore well worth the additional complication since it improves the overall code performance.

Because communication is slow relative to computation, we do not expect good strong scaling performance on current systems. Communication simply cannot keep up with computation if the GPUs are assigned too little work. This issue can only be resolved by a much faster communication network, which does not presently exist. Hence, the only way to make the best use of limited resources at present is to run the code in its maximum-efficiency mode. Although this mode may not minimize the wall-clock run time, it does minimize the total number of node-hours needed for a given problem size, which is the more relevant performance metric when available resources are limited.

A weak scaling study has been performed on the Summit system, in which the workload per GPU is kept fixed while the total number of GPUs is increased. Each node on Summit contains 6 NVIDIA Tesla V100 GPUs, with 16 GB of global memory per GPU. Our test case for the scaling study involves the flow over a flat plate. For the weak scaling runs, the total number of GPUs is varied from 150 (25 nodes) up to 6000 (1000 nodes); 1000 nodes make up \(21.7\%\) of the entire system. For each case, we assign a grid block of \(256 \times 320 \times 288\) points to each GPU. Test runs for each node count were performed to measure the total time taken to run 1000 time steps and compute the corresponding average time per step. The third-order, three-stage explicit Runge–Kutta scheme is used for time advancement. Figure 31 shows how the average time taken per step, in milliseconds, varies as the number of Summit nodes is increased; note that the horizontal axis is plotted on a logarithmic scale. Depending upon how the nodes assigned to a particular job are distributed within the system, the network route among those nodes, and the network traffic due to other jobs running on the system, code execution speed may vary from one run to another. We therefore see a slight variation in the average time per step as the number of nodes changes. The average time per step is nearly constant and remains between 149 and 150 ms. This is the payoff for a strategy that assigns a large workload to the GPU and overlaps GPU computation with MPI communication as much as possible. The mean value of the time taken per step (i.e., the value averaged over all node-count runs) is about 149.5 ms for the given grid block size per GPU, which corresponds to roughly one nanosecond per grid point per Summit node per time step.
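
The quoted per-point cost follows directly from the numbers above:

\[
\frac{149.5\ \text{ms/step}}{(256\times 320\times 288)\ \text{points/GPU}\times 6\ \text{GPUs/node}}
= \frac{0.1495\ \text{s}}{1.416\times 10^{8}\ \text{points}}
\approx 1.06\ \text{ns per grid point per node per time step}.
\]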

Fig. 31: Weak scaling performance on the Summit system

The overall performance of the GPU code was measured to be \(10\%\) of the peak theoretical double-precision performance of the V100 GPU, which is rated at 7.8 Tera-floating-point-operations (TFLOPs) per second. The most compute-intensive individual kernels (or subroutines) were found to achieve as high as \(17\%\) of the peak. To determine the performance limiting factor, the code was profiled using the available performance analysis tools. For a grid block of about 24 million points per GPU, it was found that, at each time step, the code moves about 71 GB worth of total data from the global memory to the registers of the processing units, and about another 39 GB worth of total computed data from the registers back to the global memory. Thus, there is a data movement of about 110 GB at each time step, between the global memory and the registers. Now, in order to illustrate what limits the code performance, suppose the code only moves this much data between the main memory and registers, but does not do any computing at all. How long would this data movement alone take? To answer this question, we first note that the V100 GPU architecture provides 900 GB/s of peak memory bandwidth. For simplicity, let us assume that the data movement takes place at the peak bandwidth. That would mean that at least \(110/900 \approx 0.122\) s would be needed to move that amount of data. Our profiling measurements show that the most compute-intensive kernels generally achieve about 80 to \(90\%\) of the peak memory bandwidth, so the actual data movement takes a bit longer than this estimate. With an overall average rate of \(85\%\) of the peak bandwidth, the data movement would take about 0.144 s. We also know that for the given grid block size per GPU, the code takes about 0.15 s to perform all operations and advance the simulation for one time step. The profiling measurements show that the actual compute time is around \(19\%\) of the total elapsed time. Even though the actual computations overlap with the memory operations, the data movement between the memory and registers still constitutes a significant chunk of the time taken per computational time step according to the above timings. These observations lead us to the conclusion that our code performance is bound by the available memory bandwidth. In other words, the memory bandwidth is not sufficient to transfer data into and out of the registers at the rate demanded by the processing units. Thus, the well-known adage of the computer science world, “The FLOPs are free, you are paying for the memory bandwidth!” is still very much valid in our case.
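
The memory-bandwidth argument above can be summarized as

\[
t_{\text{peak}} = \frac{110\ \text{GB}}{900\ \text{GB/s}} \approx 0.122\ \text{s}, \qquad
t_{85\%} = \frac{110\ \text{GB}}{0.85\times 900\ \text{GB/s}} \approx 0.144\ \text{s},
\]

to be compared with the measured total of about 0.15 s per time step; data movement alone therefore accounts for nearly all of the elapsed time, even before any arithmetic is performed.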

As noted above, with about 80 to \(90\%\) of the peak memory bandwidth achieved by the most compute-intensive kernels, our memory-bound GPU code is not far off from its maximum possible performance; hence, reaching a much greater percentage of the peak theoretical FLOPs per second performance of the V100 architecture is not feasible. We anticipate that potential memory bandwidth improvements in future-generation GPU architectures should enable our code to achieve higher FLOP counts per second on those systems.

Fig. 32: Instantaneous snapshots of turbulent channel flow

Fig. 33: Turbulent channel flow mean streamwise velocity and Reynolds stress component profiles in wall units

1.2 Performance comparison to CPU flow solver

We now provide the performance comparison between this new GPU code and our previous CPU code, which was most recently used in the simulation of flow separation problems [23, 24]. The speedup factors depend on how the comparisons are made. For example, the “node-to-node” speedup factor, which is derived from the performance comparison based on one Summit node with 6 GPUs versus one dual-socket Intel Skylake CPU node with 40 cores, comes out to about \(75\times \). This comparison is for the version of the CPU code based on the third-order, three-stage explicit Runge–Kutta scheme, the same as that employed in the GPU code. Note that the CPU code uses high-order compact finite-difference [1] and filtering schemes [7, 26], whose implementations are well-optimized for the CPU. We estimate that implementing the explicit finite-difference and filtering schemes employed in the GPU code into the CPU solver would result in a performance improvement of only about \(25\%\). Even in that case, the node-to-node speedup factor would still be about \(60\times \). Basing the comparison on the number of GPUs versus the number of CPU cores would lead to an equivalent speedup factor of \(60 \times 40/6 = 400\times \), meaning that one GPU is worth 400 CPU cores. We also note that one dual-socket Intel Skylake CPU, which contains 40 cores total, is comparable in price and power consumption to a V100 GPU. If we were to base the comparison on one V100 GPU versus one dual-socket Intel Skylake CPU, the corresponding speedup factor would be \(400/40=10\times \).
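
For reference, the chain of speedup factors quoted above is

\[
\frac{75}{1.25} = 60\ \ (\text{node vs. node, with the estimated CPU-code improvement}), \qquad
60 \times \frac{40\ \text{cores}}{6\ \text{GPUs}} = 400\ \ (\text{per GPU vs. per CPU core}), \qquad
\frac{400}{40} = 10\ \ (\text{one V100 vs. one 40-core dual-socket CPU}).
\]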

The CPU code also has a second-order implicit time integration version, which is about 2.5 times more costly per time step than the explicit scheme. The explicit time integration scheme is normally run at a CFL number of around 0.8, whereas, because of time accuracy concerns, the second-order implicit scheme should not normally be run at a CFL number greater than about 5. The implicit scheme therefore allows an increase in the time step by a factor of about 6.25. After accounting for its higher cost per time step, the implicit scheme provides a net speedup of about \(6.25/2.5 = 2.5\) over the explicit scheme. Thus, comparing the GPU code to the implicit version of the CPU code as is, we obtain a node-to-node speedup factor of \(75/2.5 = 30\times \). The corresponding comparison based on one V100 GPU versus one dual-socket Intel Skylake CPU would yield a speedup factor of \(30/6=5\times \).
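
In compact form, the comparison against the implicit CPU code reads

\[
\frac{\mathrm{CFL}_{\text{implicit}}}{\mathrm{CFL}_{\text{explicit}}} = \frac{5}{0.8} = 6.25, \qquad
\frac{6.25}{2.5} = 2.5\ \ (\text{net gain of the implicit scheme}), \qquad
\frac{75}{2.5} = 30\ \ (\text{node vs. node}), \qquad
\frac{30}{6} = 5\ \ (\text{one V100 vs. one dual-socket CPU}).
\]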

As noted earlier, it is possible to further accelerate the GPU code by switching to a second-order, single-stage explicit scheme developed by Verstappen and Veldman [25]. The second-order scheme runs at half the time step of the third-order Runge–Kutta scheme (and hence requires twice as many steps), but each step needs only one right-hand-side evaluation instead of the three required by the Runge–Kutta scheme. It therefore provides a net acceleration factor of \(3/(2\times 1)=1.5\times \) over the Runge–Kutta scheme. With the second-order explicit time advancement scheme implemented in the GPU code, the comparison to the implicit version of the CPU code (which is also second-order accurate in time), based on one V100 GPU versus one dual-socket Intel Skylake CPU, would give a speedup factor of \(1.5 \times 5 = 7.5\times \).

We should note here that a significant effort was also put into optimizing the implicit time integration version of the CPU code discussed above. The explicit time integration version of the CPU code is essentially a conversion of that optimized implicit version. Hence, the above performance comparisons are between the fastest-running versions of the GPU and CPU codes and are as fair as currently possible.

We also note that the connection speed between the GPU and the host CPU has some effect on the overall code performance. Because of the slower GPU-to-CPU connection on the NVIDIA Tesla V100 GPU nodes installed at NAS (PCI-Express, whereas Summit nodes have the faster NVLink; see Note 7), the GPU flow solver runs about \(5\%\) slower on the NAS GPU nodes relative to Summit. The block interface data to be exchanged among the GPUs during the computation are first copied to their respective host CPUs, which then perform the actual data exchange via MPI communication and copy the exchanged data back to the GPUs. Although the code overlaps communication with computation as much as possible, the GPU-to-CPU connection speed and specific MPI implementation details still play a role in the overall performance.

1.3 GPU code validation

For code validation, a turbulent channel flow problem was considered. The Reynolds number of the fully developed turbulent channel flow is \(\mathrm{Re}_{\tau } = \rho _\text {bulk} u_{\tau } h / \mu _\text {wall} = 590\), where \(\rho _\text {bulk}\) is the bulk density, \(u_{\tau }\) is the wall friction velocity, h is the channel half-height and \(\mu _\text {wall}\) is the viscosity at the wall. The domain size is \(2\pi h\) in the streamwise direction, x, 2h in the wall-normal direction, y, and \(\pi h\) in the spanwise direction, z. The flow is periodic in both the streamwise and spanwise directions and is bounded by solid walls at \(y=0\) and \(y=2h\). Because of the imposed streamwise periodicity, a source term is added to the streamwise momentum and energy equations to drive the flow at a constant mass flow rate. The Mach number based on the bulk velocity and the speed of sound at the wall is set to 0.2 in order to facilitate a comparison with incompressible DNS results available in the literature.
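
The exact form of this source term is not spelled out here; one common realization (given only as an illustration, not necessarily the form used in the present solver) adds a spatially uniform, time-dependent forcing \(f(t)\) to the streamwise momentum equation and the corresponding work term \(f(t)\,u\) to the energy equation,

\[
\frac{\partial (\rho u)}{\partial t} = \cdots + f(t), \qquad
\frac{\partial (\rho E)}{\partial t} = \cdots + f(t)\,u,
\]

with \(f(t)\) adjusted every time step so that the bulk mass flux remains at its target value.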

The grid used in the DNS contains \(768 \times 512 \times 768\) points along the x, y and z directions, respectively. The total number of points is about 302 million. The grid resolution in wall units is \(\varDelta x^+ \approx 4.8\) in the streamwise direction and \(\varDelta z^+ \approx 2.4\) in the spanwise direction. In the vertical direction, \(\varDelta y^+ \approx 0.23\) at the wall and \(\varDelta y^+ \approx 5.4\) at the channel centerline. For the computation, 18 NVIDIA Tesla V100 GPUs are used on the Summit system, with each GPU solving a grid block of \(256^3\) points. The second-order time integration scheme [25] is used with a CFL number of 0.4. Explicit filtering is applied once every 5 time steps with a filtering parameter of \(\sigma = 0.09\). Note that the finer overall grid resolution used here, relative to that in the speed bump case, together with the simpler configuration, keeps the simulation stable with less frequent filtering. To ensure full convergence of the time-averaged results, the flow statistics are averaged over \(1009 h/u_\text {bulk}\).
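
The streamwise and spanwise resolutions quoted above follow (approximately, since wall units in this weakly compressible flow are based on wall properties) from the domain size, the number of points and \(\mathrm{Re}_{\tau }\), via \(\varDelta x^{+} \approx (\varDelta x / h)\,\mathrm{Re}_{\tau }\):

\[
\varDelta x^{+} \approx \frac{2\pi h/768}{h}\,\mathrm{Re}_{\tau } = \frac{2\pi \times 590}{768} \approx 4.8, \qquad
\varDelta z^{+} \approx \frac{\pi h/768}{h}\,\mathrm{Re}_{\tau } = \frac{\pi \times 590}{768} \approx 2.4 .
\]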

Figure 32 depicts instantaneous snapshots of the turbulent channel flow. As seen here, the flow structures appear very smooth, providing evidence that the minimal filtering applied in the present case yields a solution free of numerical wiggles. Figure 33 compares our mean streamwise velocity and Reynolds stress component profiles with the data from Moser et al. [11] and from Vreman and Kuerten [27]. Both of these groups used spectral methods to solve the incompressible flow equations for turbulent channel flow at the same \(\mathrm{Re}_{\tau }\). As seen in the comparisons, our results show excellent agreement with theirs.

Cite this article

Uzun, A., Malik, M.R. Simulation of a turbulent flow subjected to favorable and adverse pressure gradients. Theor. Comput. Fluid Dyn. 35, 293–329 (2021). https://doi.org/10.1007/s00162-020-00558-4

