Tutorial on Parallel Simulation

We can directly call

mpirun -np 8 <path-of-fcpu> <arguments>

to invoke an MPI simulation on multiple processes. However, due to memory limitations, this approach can only handle fairly small problem sizes. This tutorial explains how to run very large simulations.


The Direct Approach

This is the default approach. The code does the following jobs in sequence when running with multiple processes.

  1. The master process (rank=0) loads the mesh from files: 3d.nodes and 3d.lcon are the default file names for the coordinates and the connectivity table. In the meantime, the slave processes (rank>0) do nothing. They may be busy waiting (depending on the MPI build, but this is very likely), so their CPU usage can still be high even though no work is done.
  2. Once the master process has loaded the whole unstructured grid (USG in the code), it divides the big grid into smaller pieces and sends each process its share. This is actually a very nontrivial part and is handled completely by the petsc-dev library. The master process also gets its own sub-domain grid. When the distribution is done, the memory for the whole grid is released. (A minimal example invocation is sketched right after this list.)
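
For reference, here is a minimal sketch of a direct-approach run; the path and solver parameters are illustrative (they reuse options shown later in this tutorial), and rank 0 reads 3d.nodes and 3d.lcon from the working directory.

#!/bin/bash
# Minimal sketch of a direct-approach run (path and parameters are illustrative).
# Rank 0 reads 3d.nodes and 3d.lcon from the current directory and distributes the grid.
CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
mpiexec -np 8 $CODE -Re 1 -Fr 0.02 -dt 0.5 -NSteps 1000 -zeroVelBC 1 >out8 &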

This approach works for small problems, but it runs into trouble for large ones.

  • For large problems, the memory required for the master process to load the whole grid can be very large, and the memory of a single node of a distributed memory cluster is usually not sufficient to handle it. This memory issue is very serious, mainly because we are using Taylor-Hood tetrahedral elements, where edges carry degrees of freedom in addition to the vertices. If we were only dealing with first-order tetrahedral elements (such as the pressure elements used in this code), the limit on the number of elements would be quite high: in 1GB of RAM we may hold O(10) million linear elements, whose connectivity and inverse-connectivity tables are fairly easy to construct and store in memory. With Taylor-Hood elements, however, the total number of elements is limited to about O(1) million, since a lot of storage is used to hold the topological information of the grid. petsc-dev uses an advanced data structure called sieve to represent this topological information, where cells, faces, edges and vertices are all graph points connected by arrows in both directions. Storing this dynamic data structure requires a lot of memory, hence the memory problem (see the back-of-envelope sketch after this list).
  • The sieve data structure not only costs a lot of memory, but is also slow to construct, considering that we are dealing with very large meshes and only one process (rank=0) does the work.
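
As a rough illustration of the numbers above, the per-element footprints below are assumed (not measured) and are chosen only to reproduce the O(10) million vs O(1) million estimates:

# Back-of-envelope only; the per-element sizes are assumptions, not measurements.
# ~100 bytes per linear tet (connectivity + inverse connectivity):
echo $(( 1024*1024*1024 / 100 ))    # ~1e7 linear elements per GB
# ~1 KB per Taylor-Hood tet (vertex + edge DOFs plus sieve arrows):
echo $(( 1024*1024*1024 / 1024 ))   # ~1e6 Taylor-Hood elements per GB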

A typical error message seen when memory becomes an issue is

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

or a message from the PBS scheduler complaining that the pvmem limit has been exceeded.
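
If the job is killed by the scheduler rather than by a segfault, raising the per-process memory request can help for moderate problem sizes. The directives below follow PBS/Torque conventions; the resource values are placeholders for your cluster.

# Illustrative PBS directives only; actual limits depend on your cluster.
#PBS -l nodes=1:ppn=8
#PBS -l pvmem=4gb       # per-process virtual memory limit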

Mesh Dump-and-Load Approach

Mainly because of this memory limitation, we have worked out the following approach, which requires access to a large shared memory machine. It can be described as follows.

1. On the shared memory machine, run the code as in this script

#!/bin/bash
CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
nohup mpiexec  -n 2 -l  $CODE -Re 1 -Fr 0.02 -dt 0.5 -NSteps 1000 \
   -zeroVelBC 1 -updateOrder 3 -minDist 0.1 \
   -outputSieveOnly 1 -loadSieve 0  \
   -identicalVelStructure 1 -dumpStep 1000 -subSteps 1  >out2 &

The important option is -outputSieveOnly 1, which tells the code to follow exactly the same direct approach but stop immediately after the meshes are distributed and all relevant sub-domain grid information has been dumped to files. No Matrix/Vector is created, nor does the -NSteps argument matter, since those parts are not run.

You can also use -outputSieve 1 instead of -outputSieveOnly 1, which is simply the same as the direct approach except that it dumps the meshes after they are distributed. The code then continues as a normal simulation and the CFD work is done.
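
If you want the dump and the simulation in one go, the only change from the script above is this flag; the sketch below reuses the same (illustrative) parameters.

# Same run as above, but continue the normal CFD simulation after dumping the meshes.
CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
nohup mpiexec -n 2 -l $CODE -Re 1 -Fr 0.02 -dt 0.5 -NSteps 1000 \
   -zeroVelBC 1 -updateOrder 3 -minDist 0.1 \
   -outputSieve 1 -loadSieve 0 \
   -identicalVelStructure 1 -dumpStep 1000 -subSteps 1 >out2 &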

2. Copy the dumped mesh files from the shared memory machine to the distributed memory cluster.

Relevant files are

3d_coordinatesNP2Rank0.txt 3d_IsieveNP2Rank0.txt 3d_renumberingNP2Rank0.txt
3d_coordinatesNP2Rank1.txt 3d_IsieveNP2Rank1.txt 3d_renumberingNP2Rank1.txt

Basically there are three files generated for each process (Rank): coordinates, Isieve and renumbering.
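
A plain scp is usually enough; the host name and destination path below are placeholders.

# Copy all dumped sub-domain files to the cluster (host and path are placeholders).
scp 3d_*NP2Rank*.txt user@cluster:/path/to/run/directory/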

3. Run the production jobs on the cluster.

The following script is often used

CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
mpiexec -np 2 $CODE -Re 1.0 -Fr 0.0098 -dt 0.025 -NSteps 10000 \
   -zeroVelBC 1 -updateOrder 3 -minDist 0.2 \
   -outputSieve 0 -loadSieve 1 \
   -identicalVelStructure 1 -dumpStep 1000 -subSteps 1 &>out2

The important option is -loadSieve 1, which tells the code to load the dumped distributed meshes directly. In this case, the 3d.lcon and 3d.nodes files are not used at all. Note that the number of processes (the -np option) must be consistent with the mesh-dumping run.
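
A quick way to double-check this consistency (illustrative only) is to count how many ranks were dumped; the NP tag in the file names records the process count used for the dump.

# The -np value of the production run must equal the number of dumped ranks.
ls 3d_coordinatesNP2Rank*.txt | wc -l    # expect 2 for the NP=2 dump above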

The major advantage of this approach is that no huge allocation for the whole domain is required on any process. Also, since we know a priori exactly how many elements and nodes are used, efficient contiguous memory allocation is used instead of costly dynamic allocation, which saves a lot of memory for large problems. In addition, loading from files is very fast, since the algorithm that computes the mesh distribution (and thus the renumbering and the ghost points between processes) has already been run. This feature becomes extremely useful when the same mesh is used in many runs, which is often the case in a parameter study.

A Special Treatment for Cylinders

The Mesh Dump-and-Load approach requires access to a large shared memory machine with a lot of RAM (usually 64GB or more). Even if we have this kind of facility (such as the one in the AICT cluster at the UofA), the physical RAM of these systems still limits our problem size.

In our research we most frequently deal with cylinders (circular or rectangular), and when many particles flow inside them the mesh is simply repetitive along the cylinder axis, i.e., there is no need for local refinement at specific axial coordinates. The mesh then becomes a mixture of structured and unstructured meshes, where the unstructured grid within each sub-domain is simply shifted along the axis.
A special algorithm has been developed to take advantage of this special but useful case.

1. Dump the mesh for NP=2

This is the same as the first step of the Mesh Dump-and-Load approach, except that -np 2 is required (not any other number of processes).
The 3d.nodes and 3d.lcon mesh input files describe only a two-subdomain grid and can therefore be significantly smaller than the real domain we plan to run. This means we do not need a shared memory machine; a typical cluster node should be sufficient.
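
A minimal sketch of this NP=2 dump, reusing options from the first script (parameters are illustrative):

# Dump the two-subdomain mesh; -n 2 is mandatory here.
CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
nohup mpiexec -n 2 $CODE -Re 1 -Fr 0.02 -dt 0.5 -NSteps 1000 \
   -outputSieveOnly 1 -loadSieve 0 >out2 &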

2. Copy the files over

Same as above if it is run on a different machine.

3. Load the dumped meshes and shift

A sample script is shown below

CODE=$HOME/research/FEM-CFD-Petsc-USG-I/fcpu
mpiexec -comm pmi -np 10   $CODE \
 -Re 1 -Fr 0.0098 -dt 0.1 -NSteps 6000 \
 -zeroVelBC 1 -updateOrder 3 -minDist 0.2 -outputGmsh 0 \
 -outputSieve 0 -loadNP2Sieve 1  -dumpIndividualParticle 0 \
 -identicalVelStructure 1 -dumpStep 500 -subSteps 1  &> out10

Note that we now have the -loadNP2Sieve 1 option, which does the following:
  • rank=0: load the Rank0 files
  • rank>0: load the Rank1 files, for all slave processes

A built-in converter will convert these dumped meshes into what is really needed for a given process.
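
Since every rank other than 0 reads the Rank1 files, only the six NP=2 dump files are needed regardless of the actual -np. A small sanity check (illustrative) before launching:

# All processes share the same six NP=2 dump files.
for f in 3d_{coordinates,Isieve,renumbering}NP2Rank{0,1}.txt; do
    [ -f "$f" ] || echo "missing $f"
done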

Note that it is necessary to modify the 3d.aux file to make sure it describes the big domain instead of the smaller two-chunk one. You are always required to supply the correct 3d.particles file as input for the particle initial conditions.

With this approach we can run arbitrarily long cylinders given sufficient cluster nodes, which becomes very attractive considering the Blue Gene system.
