Benchmark Report of PETSc on Blue Gene Using KSPSolve

1. Build PETSc on Blue Gene

First get the petsc-dev code. petsc-3.0.0-p0 was tried, but it has a bug that will be fixed in the next patch release, so petsc-dev is used instead.

Set the following environment variables in ~/.bashrc:

export PETSC_ARCH=linux-gnu-c-opt
export PETSC_DIR=/gpfs/bglscratch/pi/sjin/src/petsc-dev-mpixlc

The build process is

./configure --CC=mpixlc --FC=mpixlf77 --with-batch  --with-blas-lapack-dir=../petsc-externalpackages/fblaslapack-3.1.1/  --with-debugging=no
run1.sh conftest
./reconfigure.py
make

Note that this build is C only; no C++ is used. Also note the use of --with-batch, which is required because Blue Gene is a cross-compiling environment: configure produces a conftest executable that must be run on the compute nodes (the run1.sh conftest step above), after which ./reconfigure.py completes the configuration. For more information, please refer to cross compile for Blue Gene/L (hdf5 as an example).

2. Run the Test Code

The test we use is $PETSC_DIR/src/ksp/ksp/examples/tutorials/ex2. To build it,

cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
make ex2

A binary named ex2 is now generated.
To check that it runs correctly,

sjin@fen1 tutorials $ run1.sh ex2

This is the output I got:

Norm of error 0.000156044 iterations 6

Here is the help from the code:

/* Program usage:  mpiexec -n <procs> ex2 [-help] [all PETSc options] */
 
static char help[] = "Solves a linear system in parallel with KSP.\n\
Input parameters include:\n\
  -random_exact_sol : use a random exact solution vector\n\
  -view_exact_sol   : write exact solution vector to stdout\n\
  -m <mesh_x>       : number of mesh points in x-direction\n\
  -n <mesh_n>       : number of mesh points in y-direction\n\n";

The default values of m and n are

m = 8, n = 7
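For context, the core of ex2 is a textbook KSP sequence: assemble the 5-point Laplacian for an m x n grid, build the right-hand side from a known exact solution, call KSPSolve, and print the error norm and iteration count shown above. Below is a condensed sketch (not the verbatim source) written against the petsc-3.0-era API this report uses; several calls, such as KSPSetOperators and the Destroy routines, have different signatures in later PETSc releases.

/* Condensed sketch of the solver sequence in ex2 (not the verbatim source).
   Assembles the 5-point Laplacian on an m x n grid, builds b from a known
   exact solution, solves A x = b with KSP, and reports the error norm and
   iteration count.  petsc-3.0-era API; some signatures differ in later
   PETSc releases. */
#include "petscksp.h"

int main(int argc,char **argv)
{
  Mat            A;                /* linear system matrix */
  Vec            x,b,u;            /* approx solution, RHS, exact solution */
  KSP            ksp;              /* Krylov solver context */
  PetscInt       m = 8,n = 7;      /* grid size; defaults as in ex2 */
  PetscInt       i,j,Ii,J,Istart,Iend,its;
  PetscScalar    v,one = 1.0;
  PetscReal      norm;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,(char*)0,(char*)0);CHKERRQ(ierr);
  /* -m and -n on the command line override the defaults */
  ierr = PetscOptionsGetInt(PETSC_NULL,"-m",&m,PETSC_NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);CHKERRQ(ierr);

  /* Distributed AIJ matrix, preallocated for the 5-point stencil */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
  for (Ii=Istart; Ii<Iend; Ii++) {   /* each rank fills only its own rows */
    i = Ii/n; j = Ii - i*n; v = -1.0;
    if (i>0)   {J = Ii - n; MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);}
    if (i<m-1) {J = Ii + n; MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);}
    if (j>0)   {J = Ii - 1; MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);}
    if (j<n-1) {J = Ii + 1; MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);}
    v = 4.0; MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Exact solution u = 1, right-hand side b = A*u, solution vector x */
  ierr = VecCreate(PETSC_COMM_WORLD,&u);CHKERRQ(ierr);
  ierr = VecSetSizes(u,PETSC_DECIDE,m*n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(u);CHKERRQ(ierr);
  ierr = VecDuplicate(u,&b);CHKERRQ(ierr);
  ierr = VecDuplicate(u,&x);CHKERRQ(ierr);
  ierr = VecSet(u,one);CHKERRQ(ierr);
  ierr = MatMult(A,u,b);CHKERRQ(ierr);

  /* KSP with default method/preconditioner unless overridden by options */
  ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);

  /* Error norm and iteration count, as printed by ex2 */
  ierr = VecAXPY(x,-1.0,u);CHKERRQ(ierr);
  ierr = VecNorm(x,NORM_2,&norm);CHKERRQ(ierr);
  ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"Norm of error %g iterations %d\n",
                     (double)norm,(int)its);CHKERRQ(ierr);

  ierr = KSPDestroy(ksp);CHKERRQ(ierr);
  ierr = VecDestroy(u);CHKERRQ(ierr);
  ierr = VecDestroy(x);CHKERRQ(ierr);
  ierr = VecDestroy(b);CHKERRQ(ierr);
  ierr = MatDestroy(A);CHKERRQ(ierr);
  ierr = PetscFinalize();CHKERRQ(ierr);
  return 0;
}

Because the solver is set up through KSPSetFromOptions, the method can be changed at run time, e.g. with -ksp_type cg -pc_type jacobi, without recompiling.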

The command to run the test is

mpirun -partition R001-N3-32 -n 2 -mode CO -cwd $PWD -exe ex2 -args "-m 1500 -n 1500 -log_summary log2.txt "

Note that we might need to change the -partition, -m, -n and -log_summary arguments.
Note that "-memory_info" does not work here.

The speed-up chart for the 1500x1500 case is shown below.

[Figure: KSPSolve speed-up vs. number of processes for the 1500x1500 case on Blue Gene/L]

We see nearly perfect scaling up to 512 processes, which is the current limit on the facility because of hardware issues on the first half (partition R000).
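For reference, the speed-up plotted here is taken to be the usual KSPSolve-time ratio (my assumption; the report does not define it), with the times read from the -log_summary output:

S(p) = T(1) / T(p),        E(p) = S(p) / p

If no serial timing is available, the baseline can be the smallest run p0 instead, with S(p) = p0 * T(p0) / T(p). "Nearly perfect scaling" then means E(p) stays close to 1 all the way up to 512 processes.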

3. Comparison with a Linux Cluster

Here we run the same test on the Linux cluster at the University of Calgary, terminus.ucalgary.ca.
The following script is used to run ex2 serially:

[seki@h3 bechmarkEx2]$ cat serial.sl
#!/bin/bash
#SBATCH -n 1
#SBATCH -N 1
    echo "Starting run at: `date`"

    CODE=/home/users/seki/software/src/petsc-dev/src/ksp/ksp/examples/tutorials/bechmarkEx2/ex2

    mpirun -srun -n 1 $CODE -m 1500 -n 1500 -log_summary log1.txt >out1

    echo "Job finished at: `date`"
[seki@h3 bechmarkEx2]$

Use

sbatch serial.sl

to submit it.
top shows the memory usage as

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  822 seki      25   0 1048m 992m 4036 R  111 25.2   0:44.17 ex2
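The roughly 1 GB resident size is consistent with a back-of-envelope estimate, assuming ex2 is left at PETSc's defaults (GMRES with a restart of 30 and ILU(0) preconditioning in serial); the choice of solver is my assumption, not stated in the report:

N unknowns        = 1500 * 1500 = 2.25e6
one vector        ~ 8 * N bytes              ~ 0.018 GB
GMRES(30) basis   ~ 30+ work vectors         ~ 0.6   GB
AIJ matrix        ~ 5 * N nonzeros (5-point) ~ 0.13  GB
ILU(0) factor     ~ same sparsity as A       ~ 0.13  GB
total                                        ~ 0.9   GB

which is in line with the 992m RES reported by top.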

The following script is used to run parallel jobs in one shot

[seki@h2 bechmarkEx2]$ cat submit.sl
#!/bin/bash
#SBATCH -n 128
#SBATCH -N 32
#SBATCH --ntasks-per-node=4
    echo "Starting run at: `date`"

    CODE=/home/users/seki/software/src/petsc-dev/src/ksp/ksp/examples/tutorials/bechmarkEx2/ex2

    mpirun -srun -n 128 $CODE -m 1500 -n 1500 -log_summary log128.txt >out128
    mpirun -srun -n 64 $CODE -m 1500 -n 1500 -log_summary log64.txt >out64
    mpirun -srun -n 32 $CODE -m 1500 -n 1500 -log_summary log32.txt >out32
    mpirun -srun -n 16 $CODE -m 1500 -n 1500 -log_summary log16.txt >out16
    mpirun -srun -n 8 $CODE -m 1500 -n 1500 -log_summary log8.txt >out8
    mpirun -srun -n 4 $CODE -m 1500 -n 1500 -log_summary log4.txt >out4
    mpirun -srun -n 2 $CODE -m 1500 -n 1500 -log_summary log2.txt >out2

    echo "Job finished at: `date`"
[seki@h2 bechmarkEx2]$
[Figure: KSPSolve speed-up vs. number of processes for the 1500x1500 case on terminus]
It shows fairly good scaling, but not as good as the Blue Gene/L.

A more interesting comparison is the time spent in the linear solver itself on the two machines.

[Figure: KSPSolve time vs. number of processes, 1500x1500 case, Blue Gene/L vs. terminus]

The Blue Gene/L is significantly slower per processor because of its low clock speed, but it scales better. For small to moderate numbers of processors it is outperformed by the Terminus Linux cluster; at very large numbers of processors it should come out ahead.
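One way to make the crossover argument precise (my formalization, not a measurement): write each machine's solve time as T(p) ~ T(1) / (p * E(p)), where E(p) is the parallel efficiency. The Blue Gene/L then becomes faster exactly when

E_cluster(p) / E_BG(p)  <  T_cluster(1) / T_BG(1)

Because the Blue Gene/L serial time T_BG(1) is larger, the right-hand side is less than 1, so the crossover is reached only once the cluster's efficiency has fallen well below the Blue Gene/L's nearly flat efficiency.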

4. Conclusions

  1. The Blue Gene has a very fast interconnect, so the scaling is almost perfect.
  2. The Blue Gene cores are slow (700 MHz) compared with PC cores (2.4 GHz), so it is definitely not the place for sequential jobs.
  3. Using the Blue Gene only makes sense when hundreds of compute nodes are used (say, at least 256).
  4. Given that typical access to the WestGrid cluster is limited to 32 (at most 64) nodes, while at least 512 (probably 1024) Blue Gene nodes are almost exclusively available to us, it is definitely worthwhile to port our codes to the Blue Gene system.