qbox core dumps in MPI on BG/Q

naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

qbox core dumps in MPI on BG/Q

Post by naromero »

Hi,

I tried a rather simple test on 1024 nodes with 16 MPI tasks per node on BG/Q and got this error:
MPIR_Get_contextid_sparse_group(1071): Too many communicators

I have seen this error in other applications, and it is almost always due to incorrect use of MPI in the application. Examining the stack trace, the problem goes back to this piece of code:

SlaterDet::SlaterDet(const Context& ctxt, D3vector kpoint) : ctxt_(ctxt),
  c_(ctxt)
{
  //cout << ctxt.mype() << ": SlaterDet::SlaterDet: ctxt.mycol="
  //     << ctxt.mycol() << " basis_->context(): "
  //     << basis_->context();
  my_col_ctxt_ = 0;
  for ( int icol = 0; icol < ctxt_.npcol(); icol++ )
  {
    Context* col_ctxt = new Context(ctxt_,ctxt_.nprow(),1,0,icol);
    ctxt_.barrier();
    if ( icol == ctxt_.mycol() )
      my_col_ctxt_ = col_ctxt;
    else
      delete col_ctxt;
  }
  //cout << ctxt_.mype() << ": SlaterDet::SlaterDet: my_col_ctxt: "
  //     << *my_col_ctxt_;
  basis_ = new Basis(*my_col_ctxt_,kpoint);
}

This code is creating one communicator per column of the 2D ScaLAPACK grid on every MPI task. That means the number of communicator creations scales as O(Nproc), which is very problematic. Am I not understanding this piece of code correctly?
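For reference, here is a rough MPI-only sketch of the pattern as I read it (my own reduction for discussion, not Qbox code; the 32-row grid and the column-major task placement are just assumptions for illustration). Every task participates in the creation of every column communicator, since MPI_Comm_create is collective over the parent communicator, and then keeps only its own column:

// Rough MPI-only reduction of the SlaterDet loop, for discussion only.
// Assumptions (not from Qbox): nprow = 32, column-major task placement.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int ntasks, mype;
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  const int nprow = 32;              // assumed number of process rows
  const int npcol = ntasks / nprow;  // assumes ntasks is a multiple of nprow
  const int mycol = mype / nprow;    // assumed column-major placement

  MPI_Group world_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);

  MPI_Comm my_col_comm = MPI_COMM_NULL;
  for ( int icol = 0; icol < npcol; icol++ )
  {
    // group of the nprow tasks that form column icol
    std::vector<int> ranks(nprow);
    for ( int i = 0; i < nprow; i++ )
      ranks[i] = icol * nprow + i;
    MPI_Group col_group;
    MPI_Group_incl(world_group, nprow, ranks.data(), &col_group);

    // collective over MPI_COMM_WORLD, like the Context constructor;
    // tasks outside column icol get MPI_COMM_NULL back
    MPI_Comm col_comm;
    MPI_Comm_create(MPI_COMM_WORLD, col_group, &col_comm);
    MPI_Group_free(&col_group);

    if ( icol == mycol )
      my_col_comm = col_comm;        // keep my own column communicator
    else if ( col_comm != MPI_COMM_NULL )
      MPI_Comm_free(&col_comm);      // free anything else right away
  }

  if ( my_col_comm != MPI_COMM_NULL )
    MPI_Comm_free(&my_col_comm);
  MPI_Group_free(&world_group);
  MPI_Finalize();
  return 0;
}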
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Some more relevant information: I am attempting to see whether a PBE0 calculation is possible on 2048 water molecules. It seems that the MPI communicator code path that gets called is the one for k-points, and maybe there is something stupid in my input file. Here it is:

[naromero@vestalac1 gs_pbe0_1024_c16_ETABCD]$ cat gs_pbe0.i
# ground state of the water molecule
H2O-2048.sys
set xc PBE0
set btHF 0.05
set blHF 3 3 3
set wf_dyn PSDA
set ecut 40
set ecutprec 10
randomize_wf
run 0 200
save h2o.xml
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

I can confirm that this error occurs with plain vanilla PBE.
fgygi
Site Admin
Posts: 167
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

We have successfully run Qbox 1.56.2 on 16k MPI tasks on BG/P (Intrepid). These runs included 512 water molecules and used PBE0. Other runs on large systems (Al2O3) have used up to 32k MPI tasks without problems.
Could there be a parameter defining the maximum number of communicators on BG/Q?

What happens in the SlaterDet constructor is the following: it creates column contexts from the global context. This is complicated a bit by the fact that Context creation must be a global operation involving all tasks, because of BLACS and MPI constraints. Each task only needs to keep the column context that it belongs to; however, all column contexts must be created, and the ones not needed are then deleted. This relies on the ability of MPI to actually free communicators when MPI_Comm_free() is used. If MPI is not able to recycle communicators, then it may run out. It seems to work on BG/P with 16k tasks.
Attachments
hfb7.output.gz
512-water molecule test using 16k MPI tasks on Intrepid, PBE0 functional, 2,2,2 bisection
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Sorry about posting into the wrong forum.

Yes, it is a very large system, but I was able to run it on fewer than 1024 nodes. The error I get is that I run out of MPI communicators at 1024 nodes with 16 MPI tasks per node.

The debugger shows that the error occurs inside a loop which creates communicators. MPI has a limit on the order of 1000 communicators (I will have to double-check that number). But I had missed that you are deleting the communicators you don't need. So I suspect that this is indeed a bug in the MPI implementation on BG/Q, and that it is not actually freeing the communicators. I will try to make a reduced test case to make sure, though.

Let me try to understand what you are trying to do. You are creating a 2D processor grid in BLACS. You need one context for everything (a duplicate of MPI_COMM_WORLD), another one for rows, and another for columns. Each MPI task will belong to at least three different communicators, because each MPI task belongs to the whole grid, a row, and a column. If you had k-points and spin, there would also be a fourth communicator. Is this correct?

I think there is another way to accomplish what you are trying to do without this loop. For example, it is not well documented, but BLACS can use a communicator other than MPI_COMM_WORLD. Here is how I do it in GPAW:
https://trac.fysik.dtu.dk/projects/gpaw ... acs.c#L493
but you would have to create the aforementioned communicators in MPI and then pass them to BLACS. It's a bit of a pain once you start including all the other degrees of freedom.

I *think* that the communicators that you need are already internally created by BLACS but are not accessible through the BLACS API.
See,
https://icl.cs.utk.edu/svn/scalapack-de ... acs_map_.c

Search for these lines
/*
* Form MPI communicators for scope = 'row'
*/
MPI_Comm_split(comm, myrow, mycol, &ctxt->rscp.comm);
/*
* Form MPI communicators for scope = 'Column'
*/
MPI_Comm_split(comm, mycol, myrow, &ctxt->cscp.comm);
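
In other words, the whole set of row and column communicators can be built with just two collective MPI_Comm_split calls instead of a loop over columns. A rough sketch of a hypothetical helper (assuming, purely for illustration, column-major placement of tasks on an nprow-row grid):

// Sketch only: one MPI_Comm_split per scope replaces the per-column loop.
#include <mpi.h>

void make_grid_comms(MPI_Comm comm, int nprow,
                     MPI_Comm* row_comm, MPI_Comm* col_comm)
{
  int mype;
  MPI_Comm_rank(comm, &mype);

  // assumed column-major placement: task = icol*nprow + irow
  const int myrow = mype % nprow;
  const int mycol = mype / nprow;

  // tasks with the same myrow share a "row" communicator (ordered by mycol),
  // tasks with the same mycol share a "column" communicator (ordered by myrow),
  // just like the blacs_map_.c lines quoted above
  MPI_Comm_split(comm, myrow, mycol, row_comm);
  MPI_Comm_split(comm, mycol, myrow, col_comm);
}

Each task then holds exactly one row communicator and one column communicator, so the number of live communicators per task does not grow with the grid size.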

There might be other ways to accomplish what you are trying to do with the loop. But fundamentally, I think there is an MPI issue. I will confirm this.
fgygi
Site Admin
Posts: 167
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

Two comments:
1) I see from your input file that you did not set the nrowmax variable (but maybe it is set in your H2O-2048.sys file). If nrowmax is not set, the default value is 32. With 16k tasks, this will create a 32x512 Context. There is nothing intrinsically wrong with this shape (except that it may lead to low performance in some parts of Qbox), but the 512 columns of the Context will trigger a large number of creations and deletions of column contexts in the SlaterDet constructor. This may push against the limit on the maximum number of communicators. If that is the case, you may want to try nrowmax=256 (see the rough numbers sketched after these comments).

2) Bear in mind that ecut=40 Ry for water is too low to qualify as a "water simulation". I would recommend using ecut = 70 Ry in order to be able to claim that "a 2048 water simulation can be run". (note that the unit of energy in the "set ecut" command is Ry).
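
To put some rough numbers on point 1) (assuming, for simplicity, that the grid shape is just nprow = nrowmax with npcol = ntasks/nprow, which matches the 32x512 example above):

// Rough illustration only (assumed formula: nprow = nrowmax, npcol = ntasks/nprow).
#include <cstdio>

int main()
{
  const int ntasks = 16384;                 // 1024 nodes x 16 MPI tasks per node
  const int nrowmax_values[] = { 32, 256 };
  for ( int i = 0; i < 2; i++ )
  {
    const int nprow = nrowmax_values[i];
    const int npcol = ntasks / nprow;
    std::printf("nrowmax=%d -> %dx%d Context, %d column contexts per SlaterDet\n",
                nrowmax_values[i], nprow, npcol, npcol);
  }
  return 0;
}
// nrowmax=32  -> 32x512 Context, 512 column contexts per SlaterDet
// nrowmax=256 -> 256x64 Context,  64 column contexts per SlaterDet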
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

I was unable to reproduce this problem in a reduced MPI test case.

for ( int icol = 0; icol < ctxt_.npcol(); icol++ )
{
  Context* col_ctxt = new Context(ctxt_,ctxt_.nprow(),1,0,icol);
  ctxt_.barrier();
  if ( icol == ctxt_.mycol() )
    my_col_ctxt_ = col_ctxt;
  else
    delete col_ctxt;
}

I am not particularly fluent in C++, but this line
  Context* col_ctxt = new Context(ctxt_,ctxt_.nprow(),1,0,icol);
seems to call Cblacs_gridmap;

however, this line
  delete col_ctxt;

only seems to delete the col_ctxt structure. Unless this calls Cblacs_gridexit, or MPI_Comm_free and MPI_Group_free, you are not freeing the communicator. Are you sure it is getting called?

It would help me if you could create a reduced test case that calls the SlaterDet API, completely independent of a real water simulation.

P.S. It was just an unrealistic benchmark; I wasn't really trying to simulate water.
fgygi
Site Admin
Posts: 167
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

The logic of Context creation and destruction is a bit more complex than it appears at first sight.
Since col_ctxt is a pointer to a Context object, the line "delete col_ctxt" invokes the destructor of the Context object (Context::~Context), which in turn causes the ContextImpl class to remove that context from the list of known contexts (reference counting). If there are no remaining references to the corresponding ContextRep object, the destructor of the ContextRep object is called. This last call (line 511 in Context.C) invokes Cblacs_gridexit() and MPI_Comm_free().

Why these two additional levels of indirection (the classes ContextImpl and ContextRep)? First, the use of ContextImpl is an example of the "pimpl idiom" or "opaque pointer" approach, which hides the implementation of the object from other classes and thus allows implementation details to be changed at any time without affecting the classes that use it. The ContextRep class is necessary to implement reference counting, which is a way to keep track of all contexts and avoid the unnecessary proliferation of Context objects that would occur when objects containing a Context are created and/or copied. For an explanation of reference counting, see e.g. Item 29 in Scott Meyers' book "More Effective C++".
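
Schematically, the ownership chain looks something like the following sketch (simplified, with made-up names; the actual Context/ContextImpl/ContextRep code in Qbox differs in its details):

// Schematic sketch only -- not the actual Qbox classes.
// A ContextRep-like object owns the communicator and is shared by reference
// counting; the communicator is freed exactly once, when the last reference
// goes away.
#include <mpi.h>

class ContextRepSketch
{
  MPI_Comm comm_;
  int refcount_;
public:
  explicit ContextRepSketch(MPI_Comm comm) : comm_(comm), refcount_(0) {}
  void retain()  { ++refcount_; }
  void release()
  {
    if ( --refcount_ == 0 )
      delete this;                      // last reference gone
  }
  ~ContextRepSketch()
  {
    // in Qbox, this is the point where Cblacs_gridexit() and MPI_Comm_free()
    // are called
    if ( comm_ != MPI_COMM_NULL )
      MPI_Comm_free(&comm_);
  }
};

class ContextSketch
{
  ContextRepSketch* rep_;
public:
  explicit ContextSketch(ContextRepSketch* rep) : rep_(rep) { rep_->retain(); }
  ContextSketch(const ContextSketch& c) : rep_(c.rep_) { rep_->retain(); }
  ContextSketch& operator=(const ContextSketch&) = delete;  // omitted in sketch
  ~ContextSketch() { rep_->release(); }  // "delete col_ctxt" ends up here
};

So "delete col_ctxt" does free the underlying communicator, but only once the last Context referring to that ContextRep has gone away.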
naromero
Posts: 32
Joined: Sun Jun 22, 2008 2:56 pm

Re: qbox core dumps in MPI on BG/Q

Post by naromero »

Francois,

I think there might be a bug here. I profiled Qbox with mpiP (http://mpip.sourceforge.net/#Using).

Here is the data:
/gpfs/vesta_scratch/projects/catalyst/naromero/qbox/H2O-2048/gs_pbe_256_c16_ETABCD_mpiP/qb.mpiP.x.4096.1.1.mpiP

The file is huge and somewhat difficult to parse if you don't know what you are looking for. I did this for 256 nodes in c16 mode (where I am still OK); I seem to run out of communicators at 512 nodes in c16 mode. (c16 mode = 16 MPI tasks per node.)

Start by grepping for Comm_create; there are four places (call sites) from which Qbox calls it:

Cblacs_gridmap --> call site 62
and three C++ constructors that need to be demangled:
0) ContextRep::ContextRep() --> call site 30
1) ContextRep::ContextRep(int, int) --> call site 67
2) ContextRep::ContextRep(ContextRep const&, int, int, int, int) --> call site 72

You can find the call sites for Comm_free in a similar manner.

Here is a quick way to look at the data (the second column is the call site; the fourth column, after the asterisk, is the call count summed over all tasks at that call site). Remember that there are a total of 4096 MPI tasks in this particular job (256 nodes x 16 MPI tasks per node).

[naromero@vestalac1 gs_pbe_256_c16_ETABCD_mpiP]$ grep Comm_free qb.mpiP.x.4096.1.1.mpiP | grep "*"
Comm_free 1 * 24576 0.934 0.807 0.685 0.00 0.01
Comm_free 10 * 24576 0.775 0.369 0.173 0.00 0.00
Comm_free 18 * 24576 4.46 1.34 0.677 0.00 0.01
Comm_free 33 * 24576 0.901 0.699 0.184 0.00 0.01
Comm_free 84 * 24576 0.985 0.821 0.686 0.00 0.01
[naromero@vestalac1 gs_pbe_256_c16_ETABCD_mpiP]$ grep Comm_create qb.mpiP.x.4096.1.1.mpiP | grep "*"
Comm_create 30 * 4096 6.35 6.31 6.27 0.00 0.01
Comm_create 62 * 2105344 4.3e+03 7.66 0.034 2.10 5.04
Comm_create 67 * 4096 15.6 15.6 15.4 0.01 0.02
Comm_create 72 * 2097152 11.2 9.45 2.47 2.58 6.19

So perhaps something is not right with a C++ destructor somewhere, and MPI_Comm_free is not really being called on the communicators: judging from the counts above, Comm_create is called far more often than Comm_free. This is why you are running out of communicators.

BTW, you are not the first person to run into this "out of communicators" issue. Codes that worked on BG/P have now seen it on BG/Q. I believe what has changed is MPICH, which now imposes a stricter limit on the maximum number of communicators.

Let me know what you think. I would really like to see Qbox function properly on BG/Q.
fgygi
Site Admin
Posts: 167
Joined: Tue Jun 17, 2008 7:03 pm

Re: qbox core dumps in MPI on BG/Q

Post by fgygi »

Nichols,
Thanks. I will have a look at the mpiP file. I'd like to run the same mpiP test on Intrepid for comparison.

I also noted the following in the latest message from Adam Scovel about the Feb 18 maintenance.
Subject line: [Vesta-notify] Vesta Maintenance 20130218 - Efix installation notes
Apparently one of the efixes (efix 09) has to do with MPI context exhaustion. Also, the efix requires relinking applications.
Not sure if this is related.
All,

During maintenance on Feb. 18th, 2013, a number of newly-released efixes were installed on Vesta. I've attached the relevant portions of IBM's included documentation. Of note is the documentation on efix 09, which will require a relink of your applications.

-Adam

20130218.readme

# efix 09
###############################################################################

<--FG:---some lines deleted here------>

- Included an MPICH2 fix for context id exhaustion:
https://trac.mpich.org/projects/mpich/ticket/1768

Steps to take after installing the RPMs for this fix
----------------------------------------------------

Relink applications with the messaging libraries (MPICH2 and PAMI).