qbox core dumps in MPI on BG/Q
Posted: Tue Feb 19, 2013 10:17 pm
Hi,
I tried a rather simple test on 1024 nodes with 16 MPI tasks per node on BG/Q, and I get this error:
MPIR_Get_contextid_sparse_group(1071): Too many communicators
I have seen this error in other applications, and it is almost always caused by incorrect use of MPI in the application. Examining the stack trace, the problem goes back to this piece of code:
SlaterDet::SlaterDet(const Context& ctxt, D3vector kpoint) : ctxt_(ctxt),
  c_(ctxt)
{
  //cout << ctxt.mype() << ": SlaterDet::SlaterDet: ctxt.mycol="
  //     << ctxt.mycol() << " basis_->context(): "
  //     << basis_->context();
  my_col_ctxt_ = 0;
  for ( int icol = 0; icol < ctxt_.npcol(); icol++ )
  {
    Context* col_ctxt = new Context(ctxt_,ctxt_.nprow(),1,0,icol);
    ctxt_.barrier();
    if ( icol == ctxt_.mycol() )
      my_col_ctxt_ = col_ctxt;
    else
      delete col_ctxt;
  }
  //cout << ctxt_.mype() << ": SlaterDet::SlaterDet: my_col_ctxt: "
  //     << *my_col_ctxt_;
  basis_ = new Basis(*my_col_ctxt_,kpoint);
}
This code creates, on each MPI task, one communicator for every column of the 2D ScaLAPACK grid. That means the number of communicators scales as O(Nproc), which is very problematic. Am I not understanding this piece of code correctly?
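To illustrate the concern, here is a minimal sketch in plain MPI (not the Qbox Context class; the comm handle, grid ordering, and function name below are my assumptions) of how a single MPI_Comm_split could give each task only its own column communicator, instead of every task participating in npcol communicator creations:

#include <mpi.h>

// Hypothetical sketch, not Qbox code: build the per-column communicator
// with one collective call. Assumes 'comm' is the full communicator behind
// ctxt_ and that the nprow x npcol process grid is laid out column-major,
// as in ScaLAPACK.
MPI_Comm column_comm(MPI_Comm comm, int nprow)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  int mycol = rank / nprow;   // color: the column this task belongs to
  int myrow = rank % nprow;   // key: ordering of tasks within the column
  MPI_Comm col_comm;
  MPI_Comm_split(comm, mycol, myrow, &col_comm);
  return col_comm;            // exactly one new communicator per task
}

Whether this matches the Context semantics Qbox relies on is of course for the developers to say; I mainly want to confirm that the O(Nproc) communicator count is intended.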