Long version of the beginning of the interview with Juan Escobar (LAERO, CNRS)

Juan, together with Philippe you’ve ported Meso-NH 5-5-1 to GPUs. Could you describe what this involves, and the applications it opens up?

Because of their architecture, computing power and memory bandwidth, graphics processing units (GPUs) are in fact the worthy successors to the old parallel and vector supercomputers of the CRAY, NEC or FUJITSU families, which were used in the 2000s to run Meso-NH at national computing centers such as IDRIS or Météo-France. Inside one of these boards, there are several dozen independent “processors” that can be used in parallel, and each of these processors can itself execute several thousand identical “vector” instructions, with memory access an order of magnitude faster than on CPUs (our PC processors). The difference is that, instead of costing several million euros, they cost only a few thousand, and what’s more, they consume only a few hundred watts, instead of thousands or even millions of watts, to perform the same calculations as those supercomputers.

That’s why, back in the 2010s, when PGI (The Portland Group, Inc.) developed the first Fortran compilers that allowed these GPUs to be used for computing (a technology called GPGPU, for General-Purpose computing on Graphics Processing Units), I started trying to port Meso-NH to GPUs. At the time, PGI had pioneered its own directive-based programming language, ACC, which involves adding special comments to the code to tell the compiler which parts to run on GPUs. Over the years, the language has been standardized to become the OpenACC programming standard, supported by several manufacturers, including NVIDIA and AMD (GPU chip makers) and HPE/CRAY (supercomputer builder).

At present, the most time-consuming parts of the program, which account for around 90% of computing time, have been ported to GPUs (advection, turbulence, single-moment microphysics and the pressure solver). This version, MNH-V55-OpenACC, can therefore already be used for scientific simulations, even though the code is only partially ported. This is what was done during the Grand Défi Adastra GPU (the Adastra GPU Grand Challenge) at the CINES computing center in Montpellier, where various extreme events (Storm Alex, a derecho over Corsica, a storm over Amazonia) were simulated using up to 128 compute nodes, or 1024 GPUs: around one third of Adastra’s GPU partition.

What’s the advantage of running simulations on GPU-equipped computers rather than CPU-only ones?

First and foremost, the computational power available on GPU-equipped nodes is greater than on CPU-only nodes. On Adastra, the GPU partition comprises just 338 nodes, but is credited with a theoretical peak of 61 Petaflops (61 million billion computing operations per second), whereas the 556-node partition equipped with the latest CPUs delivers just 13 Petaflops. The second advantage is that GPUs are more energy-efficient than CPUs at delivering equivalent computing power. For the Adastra GPU Grand Challenge, performance tests showed that, on 128 nodes (either GPU or CPU), the code runs around 5 times faster on GPUs than on CPUs, while consuming about half the energy.
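To make these figures easier to compare, here is a minimal Python sketch that turns them into per-node numbers. It uses only the rounded values quoted above. The variable and function names are purely illustrative, and the last step assumes that energy is simply mean power multiplied by run time.

```python
# Figures quoted in the interview (theoretical peaks, rounded).
GPU_NODES, GPU_PEAK_PFLOPS = 338, 61.0
CPU_NODES, CPU_PEAK_PFLOPS = 556, 13.0


def peak_per_node_tflops(peak_pflops: float, nodes: int) -> float:
    """Average theoretical peak of one node, in Teraflops."""
    return peak_pflops * 1000.0 / nodes


gpu_node = peak_per_node_tflops(GPU_PEAK_PFLOPS, GPU_NODES)
cpu_node = peak_per_node_tflops(CPU_PEAK_PFLOPS, CPU_NODES)
peak_ratio = gpu_node / cpu_node

# Approximate ratios reported for the 128-node performance tests.
SPEEDUP = 5.0        # the GPU run is about 5 times faster
ENERGY_RATIO = 0.5   # the GPU run uses about half the energy

# Assumption: energy = mean power x run time, so the mean power ratio
# (GPU over CPU) is the energy ratio divided by the time ratio (1 / SPEEDUP).
power_ratio = ENERGY_RATIO * SPEEDUP

print(f"GPU node: {gpu_node:.0f} Tflops, CPU node: {cpu_node:.0f} Tflops")
print(f"Per-node peak ratio (GPU/CPU): {peak_ratio:.1f}")
print(f"Mean power ratio during the run (GPU/CPU): {power_ratio:.1f}")
```

With these rounded inputs, a GPU node comes out at roughly 180 Tflops of theoretical peak against about 23 Tflops for a CPU node. Under the stated assumption, the GPU nodes draw about 2.5 times more power during the run, but for one fifth of the time.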

It should also be noted that the gap between the computing power delivered by GPUs and that delivered by CPUs is set to widen. For example, on the first European supercomputer to reach 1 Exaflops, currently being installed at the Jülich computing center in Germany, the GPU partition will deliver 1 Exaflops, or 1,000 Petaflops, while the CPU partition will deliver only 5 Petaflops, or about 0.5% of total computing power!

Are there any situations that lend themselves particularly well to using Meso-NH 5-5-1 on GPUs?

As I’ve already mentioned, GPUs are essentially mini-supercomputers, parallel and vector-based. To take full advantage of their performance, they need to be fed with lots of data and calculations that can be processed independently in parallel. This is typically the case for so-called Giga-LES simulations, where grid sizes are of the order of a billion points (Giga- for billion). For the Adastra GPU Grand Challenge, the grids were up to 4096x4096x128 points (at 100 m resolution), i.e. about 2.15 billion points.
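The size of such a grid can be checked with a few lines of Python. This is only a back-of-the-envelope sketch. It assumes that the largest grid ran on the largest GPU count mentioned earlier (1024 GPUs) and that each value is stored in double precision (8 bytes). Neither detail is stated in the interview.

```python
def grid_points(nx: int, ny: int, nz: int) -> int:
    """Total number of points in an nx x ny x nz grid."""
    return nx * ny * nz


NX, NY, NZ = 4096, 4096, 128   # largest Grand Challenge grid quoted above
N_GPUS = 1024                  # assumption: largest grid on the largest GPU count
BYTES_PER_VALUE = 8            # assumption: double-precision storage

total_points = grid_points(NX, NY, NZ)
points_per_gpu = total_points // N_GPUS
field_size_gib = total_points * BYTES_PER_VALUE / 2**30  # one 3D field, in GiB

print(f"Total grid points: {total_points:,}")
print(f"Points per GPU:    {points_per_gpu:,}")
print(f"One 3D field:      {field_size_gib:.0f} GiB")
```

The total is exactly 2^31 = 2,147,483,648 points. Under these assumptions, each GPU handles about 2 million points, and a single three-dimensional field occupies 16 GiB across the whole machine.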

Of course, you can also use smaller configurations and take advantage of the increased computing power of GPUs to run many simulations at once. For example, on Adastra GPU, we created a database of simulations for Corsica, with around 350 simulated days. For this first test, at 1 km resolution, the grid is only 256x256x70 points, which we ran on a single GPU node with 16 MPI tasks, i.e. sub-domains of 64x64x70 points per GPU (the minimum sub-domain size that maintains good parallelization efficiency on GPUs). As only one node per simulated day is required, we were able to run hundreds of jobs simultaneously on different GPU nodes, and complete all 350 days of simulation in 3 ‘human’ days.
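The sub-domain size quoted above follows from splitting the horizontal grid evenly between the MPI tasks. The sketch below reproduces that arithmetic. It assumes a purely horizontal 4x4 decomposition, which is what the 64x64x70 figure implies. The function name is illustrative and is not part of Meso-NH.

```python
def subdomain_shape(nx: int, ny: int, nz: int, px: int, py: int) -> tuple:
    """Shape of one sub-domain when the horizontal grid is split into px x py tiles.

    The vertical dimension is not split (assumption: horizontal decomposition only).
    """
    if nx % px or ny % py:
        raise ValueError("grid does not divide evenly between the tasks")
    return nx // px, ny // py, nz


NX, NY, NZ = 256, 256, 70   # Corsica grid at 1 km resolution
PX, PY = 4, 4               # assumption: 16 MPI tasks arranged as a 4 x 4 grid

shape = subdomain_shape(NX, NY, NZ, PX, PY)
points_per_task = shape[0] * shape[1] * shape[2]

# Throughput of the campaign: simulated days per wall-clock ('human') day.
SIMULATED_DAYS, WALL_CLOCK_DAYS = 350, 3
days_per_wall_clock_day = SIMULATED_DAYS / WALL_CLOCK_DAYS

print(f"Sub-domain per task: {shape[0]}x{shape[1]}x{shape[2]}")
print(f"Points per task: {points_per_task:,}")
print(f"Simulated days per wall-clock day: {days_per_wall_clock_day:.0f}")
```

Each task therefore handles 286,720 points, and the campaign as a whole advanced by a little under 120 simulated days per wall-clock day.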