Francois,
Is the orthogonalization step the least scalable part of Qbox?
My understanding is that for most (all?) calculations, bands and plane waves (PWs) are parallelized simultaneously.
While the H*Psi products do not require communication between bands (and obviously PWs), this is not the case for
the orthogonalization step. Looking at the source code, this appears to be accomplished by a Cholesky decomposition.
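For reference, here is a minimal sketch of what I mean by Cholesky orthogonalization. This is not the Qbox source, only a serial, pure-Python illustration under simplifying assumptions: the bands are real-valued and stored as rows, and the function names (overlap, cholesky, orthonormalize) are made up for this example. It builds the overlap matrix S = Psi Psi^T, factors it as S = L L^T, and solves L Phi = Psi so that the new bands Phi have identity overlap.

```python
import math

def overlap(psi):
    """S[m][n] = <psi_m | psi_n> for real-valued bands stored as rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in psi] for u in psi]

def cholesky(s):
    """Lower-triangular L with S = L L^T (S symmetric positive definite)."""
    n = len(s)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(acc) if i == j else acc / l[j][j]
    return l

def orthonormalize(psi):
    """Solve L Phi = Psi by forward substitution; Phi then has identity overlap."""
    l = cholesky(overlap(psi))
    phi = []
    for i, row in enumerate(psi):
        new = list(row)
        # Subtract the components along the already orthonormalized bands.
        for k in range(i):
            new = [x - l[i][k] * y for x, y in zip(new, phi[k])]
        phi.append([x / l[i][i] for x in new])
    return phi

# Three linearly independent, non-orthogonal "bands" on three "plane waves".
bands = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ortho = orthonormalize(bands)
```

In the sketch every step that touches S or L couples all bands to each other, which is the communication I am asking about.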
Say you have N = 4000 bands; the overlap matrix (S_mn) is then a 4000-by-4000 matrix. The construction of this matrix is fairly parallelizable, even though it requires communication between bands. However, my guess is that the subsequent Cholesky decomposition could use at most
nprocs = (m/mb)*(n/nb)
where
m = number of rows = 4000
n = number of columns = 4000
mb = ScaLAPACK row block size ~ 32 to 64
nb = ScaLAPACK column block size ~ 32 to 64
Or is there a more efficient way to do this?
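To put numbers on the estimate above, here is a small sketch that evaluates nprocs for the two ends of the block-size range. The function name max_useful_procs is made up for this example, and rounding up to count partial edge blocks is my assumption; the plain ratio (m/mb)*(n/nb) gives a slightly smaller, non-integer value when the block size does not divide 4000.

```python
import math

def max_useful_procs(m, n, mb, nb):
    """Number of mb-by-nb blocks in an m-by-n matrix, counting partial edge blocks."""
    return math.ceil(m / mb) * math.ceil(n / nb)

for block in (32, 64):
    print(block, max_useful_procs(4000, 4000, block, block))
# prints:
# 32 15625
# 64 3969
```

So with one block per process, the decomposition could keep somewhere between roughly 4,000 and 16,000 processes busy for N = 4000, depending on the block size.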
Best,
Nick Romero
Orthogonalization