Orthogonalization

Post by **naromero** » Tue Jan 13, 2009 4:29 am

Francois,

Is the orthogonalization step the least scalable part of Qbox?

My understanding is that most (all?) calculations are parallelized over bands and plane waves (PWs) simultaneously.
While the H*Psi products do not require communication between bands (and obviously PWs), the orthogonalization step does.
Looking at the source code, the orthogonalization appears to be done via a Cholesky decomposition.
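
To make the operation concrete, here is a minimal serial sketch of Cholesky-based orthogonalization in plain Python. It is only an illustration of the general technique, not Qbox's actual implementation: the function names are hypothetical, the bands are assumed to be real-valued and linearly independent, and nothing here is distributed.

```python
import math

def overlap(psi):
    """Overlap matrix S[m][n] = <psi_m | psi_n>; each row of psi is one band."""
    return [[sum(a * b for a, b in zip(pm, pn)) for pn in psi] for pm in psi]

def cholesky(s):
    """Lower-triangular L with S = L L^T (S symmetric positive definite)."""
    n = len(s)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(acc) if i == j else acc / l[j][j]
    return l

def orthonormalize(psi):
    """Solve L phi = psi by forward substitution; the rows of phi are orthonormal."""
    l = cholesky(overlap(psi))
    phi = []
    for i, row in enumerate(psi):
        new = list(row)
        # Subtract the components along the already-orthonormalized bands.
        for k in range(i):
            new = [x - l[i][k] * y for x, y in zip(new, phi[k])]
        phi.append([x / l[i][i] for x in new])
    return phi

# Three non-orthogonal bands on a three-point "basis" (made-up numbers).
psi = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
phi = orthonormalize(psi)
```

Every entry of S mixes two different bands, and the factorization then operates on the full N-by-N matrix, which is why this step needs communication between bands once they are spread over processes.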

Say you have N = 4000 bands; the overlap matrix (S_mn) is then a 4000-by-4000 matrix. The construction of this matrix parallelizes fairly well even though it requires communication between bands. However, my guess is that the subsequent Cholesky decomposition could use at best

nprocs = (m/mb)*(n/nb)

where
m = number of rows = 4000
n = number of columns = 4000
mb = ScaLAPACK row block size ~ 32 - 64
nb = ScaLAPACK column block size ~ 32 - 64
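
As a quick sanity check of that estimate, a small Python sketch can count the blocks for the two ends of the block-size range. The function name is hypothetical, and rounding partial blocks up to whole blocks is my assumption; the formula above divides exactly.

```python
import math

def max_cholesky_procs(m, n, mb, nb):
    """Number of mb-by-nb blocks in an m-by-n matrix (partial blocks rounded up)."""
    return math.ceil(m / mb) * math.ceil(n / nb)

for block in (32, 64):
    print(block, max_cholesky_procs(4000, 4000, block, block))
```

For N = 4000 this gives 125 x 125 = 15625 blocks with 32-by-32 blocks and 63 x 63 = 3969 blocks with 64-by-64 blocks, so under this estimate the block size alone changes the usable process count by roughly a factor of four.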

Or is there a more efficient way to do this?

Bests,
Nick Romero