qbox core dumps in MPI on BG/Q

naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

I submitted the original test case for efix 9. In that test case, communicators were being created but never destroyed. When the communicators were exhausted, there was a hang but no warning message. The efix adds that warning message.

Please let me know if the Intrepid data agrees. I can also find out from the MPICH team whether the communicator limit was different on BG/P vs. BG/Q.
fgygi
Site Admin
Posts: 164
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

I ran a few tests on Vesta using Qbox 1.56.2 on a 512-water problem.
When using a 1024-node partition in c16 mode (16k tasks), the problem of running out of communicators only occurs if nrowmax <= 64. When using nrowmax >= 128, the program runs normally.

With 16k tasks, the low value of nrowmax (default=32) results in a large number of process columns, and a large number of creations and deletions of communicators in the SlaterDet constructor (see my comment in previous post on this topic). This apparently hits the limit of communicators. I will continue to investigate whether communicators are appropriately deleted in Context.C.
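
To illustrate how nrowmax changes the shape of the process grid, here is a minimal sketch. It assumes the grid uses the largest number of process rows that does not exceed nrowmax and divides the task count evenly; the function name process_grid is hypothetical and the rule is an assumption, not copied from the Qbox source.

```python
def process_grid(ntasks, nrowmax):
    """Return (nprow, npcol) under the ASSUMED rule: the largest row count
    <= nrowmax that divides ntasks evenly (hypothetical helper)."""
    nprow = min(nrowmax, ntasks)
    while ntasks % nprow != 0:
        nprow -= 1
    return nprow, ntasks // nprow


ntasks = 1024 * 16  # 1024 nodes in c16 mode, as in the Vesta runs
for nrowmax in (32, 64, 128, 256):
    nprow, npcol = process_grid(ntasks, nrowmax)
    print(f"nrowmax={nrowmax:4d} -> {nprow} rows x {npcol} columns")
```

Under this assumption, nrowmax=32 gives 512 process columns with 16k tasks, while nrowmax=256 gives only 64.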

This should not be a problem in actual applications, since small values of nrowmax lead to poor performance in most ScaLAPACK functions. I would recommend nrowmax=256 for this problem.

Another solution is to use multiple threads (mode=c8, OMP_NUM_THREADS=2, or mode=c4, OMP_NUM_THREADS=4). It even runs faster :)
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

The problem also occurs on Intrepid; see
/intrepid-fs0/users/naromero/persistent/qbox/H2O-2048/gs_pbe_4096_vn_ZYXT

I wonder if this issue is related to these other bugs:
http://fpmd.ucdavis.edu/bugs/show_bug.cgi?id=9
http://fpmd.ucdavis.edu/bugs/show_bug.cgi?id=12
http://fpmd.ucdavis.edu/bugs/show_bug.cgi?id=14
http://fpmd.ucdavis.edu/bugs/show_bug.cgi?id=19

Let me know if there is any other way that I can assist.
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Hi Francois,

Just checking in to see if you had any luck tracking this issue down. It seems that a judicious choice of NROWMAX can circumvent the problem, but this may be problematic on BG/Q since the communicators consume more memory than in a plain vanilla MPI implementation.
fgygi
Site Admin
Posts: 164
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

A few runs using gdb on small problems show that (in all cases tested) Qbox releases every MPI communicator it created by appropriately calling MPI_Comm_free(). It seems though that the MPI implementation I used (mpich 1.2.7p1) does not fully recycle communicator handles, but cycles through a few values (although it apparently releases the underlying resources). This can be seen by printing the value of the handle during a repeated cycle of MPI_Comm_create/MPI_Comm_free calls. I suppose that depending on the implementation, this behavior may vary. If there is a maximum value of the handle, this could lead to the problem we see on BG/Q. I am not sure what else can be done to release the resources of the communicator than calling MPI_Comm_free. I will try to run the same test on BG/P and BG/Q.

Note also that tracking the allocation of MPI communicators with mpiP is not reliable, since it only records the number of calls to MPI_Comm_create and MPI_Comm_free. However, MPI_Comm_create is often called in a situation where it returns MPI_COMM_NULL (for example, when the task calling it is not part of the MPI_Group defining the new communicator). These calls must not be matched by corresponding calls to MPI_Comm_free, since no MPI_Comm was allocated. Therefore, comparing the two call counts cannot tell us whether communicators are leaking.
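
The counting problem can be shown with a small model that does not use MPI at all. This is only a sketch: count_comm_calls is a hypothetical helper, a null handle is represented by None, and each rank is assumed to free its communicator correctly.

```python
def count_comm_calls(nranks, group):
    """Model one collective communicator creation over nranks ranks.

    Every rank makes the create call, but only ranks in `group` receive a
    real communicator. The others receive a null handle (None here) and
    must not call free on it.
    """
    create_calls = 0
    free_calls = 0
    live = 0  # communicators allocated and not yet freed
    for rank in range(nranks):
        create_calls += 1
        comm = ("comm", rank) if rank in group else None
        if comm is not None:
            live += 1
            # The matching free, issued only for a real communicator.
            free_calls += 1
            live -= 1
    return create_calls, free_calls, live


create_calls, free_calls, live = count_comm_calls(nranks=8, group={0, 1, 2, 3})
print(create_calls, free_calls, live)
```

In this model the create and free counts differ by four even though no communicator is left allocated, which is why the raw call counts alone are inconclusive.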
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

Thanks for the analysis. I will pass this information along to the MPICH developers. The version of MPICH used on Blue Gene/Q is MPICH2 1.5. If you can reproduce this problem either on top of ScaLAPACK or, better yet, in pure MPI, it would make it easier for the MPICH developers to create a bug fix.

I tried creating a pure MPI reduced test case, but did not succeed, most likely because it did not exercise MPI in the right way.
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

I briefly mentioned this issue to an MPICH developer and they basically said that the version of MPICH that you mention is too old to draw any conclusion. I have been quite busy lately due to a workshop and other things, but I will get Qbox running on my Ubuntu Linux desktop and try to dig deeper.

Bests,
Nichols A. Romero
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

I had a meeting with an MPICH developer, and they basically said there are two scenarios that cause the "too many communicators" error:
1. The application is not calling MPI_Comm_free when it should (that is, on every communicator that is not MPI_COMM_NULL)
2. Context exhaustion
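
Both scenarios can be pictured with a toy model of a fixed-size pool of context IDs. This is a sketch under stated assumptions: the ContextPool class, the creations_before_failure helper, and the limit of 16 are all hypothetical, and real limits are implementation specific.

```python
class ContextPool:
    """Toy model of a fixed-size pool of communicator context IDs."""

    def __init__(self, limit):
        self.free_ids = list(range(limit))
        self.in_use = set()

    def create(self):
        if not self.free_ids:
            raise RuntimeError("too many communicators")
        cid = self.free_ids.pop()
        self.in_use.add(cid)
        return cid

    def free(self, cid):
        self.in_use.remove(cid)
        self.free_ids.append(cid)


def creations_before_failure(limit, attempts, free_each):
    """Return how many of `attempts` creations succeed."""
    pool = ContextPool(limit)
    for n in range(attempts):
        try:
            cid = pool.create()
        except RuntimeError:
            return n
        if free_each:
            pool.free(cid)
    return attempts


LIMIT = 16  # hypothetical limit, chosen only for illustration
print(creations_before_failure(LIMIT, 100, free_each=True))
print(creations_before_failure(LIMIT, 100, free_each=False))
```

With a free after every creation the pool never runs out. Without it, creation fails once the limit is reached, and the pool cannot tell whether the communicators were leaked (scenario 1) or are simply all alive at the same time (scenario 2).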

Looking over the mpiP data, I noticed that there were 6 calls to MPI_Comm_split and 6 calls to MPI_Comm_free. Does that sound like the right number of communicators? Looking at the blacs_gridmap call, if MPI_Comm_create returns MPI_COMM_NULL, then blacs_gridmap returns immediately. Otherwise, it calls MPI_Comm_dup and makes two calls to MPI_Comm_split.

So right now, it looks like the issue is not in Qbox, but in MPICH. We will need to run a bunch more tests to debug it, but I will keep you posted.
fgygi
Site Admin
Posts: 164
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

In a possibly related issue, it was found that there is a bug in mvapich2 and that it causes BLACS to fail in one of its test programs.
See this related topic.
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Thanks for the info. It is possibly related, but unfortunately they don't post more details. I will attempt to find out more. In the meantime, an MPICH developer is working on a tool that will help us debug this further. It is similar to mpiP, except that it will be able to distinguish between calls that return MPI_COMM_NULL and those that return real communicators.

Thanks,