correct final typos

Commit 4749c132 authored 6 years ago by Paul Baumeister
--- a/source/KKRnano/doc/DomainDecomposition/KKRnanoDomainDecomposition.tex
+++ b/source/KKRnano/doc/DomainDecomposition/KKRnanoDomainDecomposition.tex
@@ -52,7 +52,7 @@

 \maketitle

-\section{Introduction} 
+\section{Introduction} \label{sec:introduction}

 KKRnano is a linear-scaling Density Functional Theory (DFT) application
 optimized for the treatment of millions of atoms.
@@ -65,10 +65,10 @@ With this and a truncation of atomic interactions beyond a certain distance
 the computational complexity can be reduced to scale linearly 
 with the number of atoms in the system.\cite{zeller_towards_2008,kkrnano:thiess_massively_2012}

-\subsection{Short-ranged operators}
+\subsection{Short-ranged operators} \label{sec:short-ranged_operators}

 In the linear-scaling mode, a truncated view of the operator $\hat A$ 
-(also called Hamiltonian due to its physical analogy)
+% (also called Hamiltonian due to its physical analogy)
 is inverted for each source atom in the system.
 $\hat A$ is a sparse operator with square-shaped blocks describing
 the inter-atomic scattering.
@@ -95,17 +95,17 @@ As the truncation zone is sphere-shaped around the source atom $k$,
 the set of block-rows of $\hat A$ which are relevant
 for solving for the column $k$ of $\hat G$ is called its \emph{view} of $\hat A$.

-\subsection{Parallelization}
+\subsection{Parallelization} \label{sec:parallelization}

 The parallelization concept is to distribute the columns to different \MPIrank{}s.
-In early versions, the was constrained to one atom per \MPIrank{} targetting supercomputer
-architectures with a huge amount of small small compute nodes. 
+In early versions, the distribution was constrained to one column per \MPIrank{}, targeting supercomputer
+architectures with a huge number of small compute nodes. 
 In the last decade, HPC machines started to feature fewer nodes again 
 where the performance of each node is enhanced drastically by accelerators like GPUs.

 After intense code restructuring towards fewer \MPIrank{}s and thus several atoms per \MPIrank{}, %% Rabel, Baumeister
 recent versions of KKRnano are able to share those block-rows of $\hat A$
-that are included in the views of all columns that are treated by the same \MPIrank{}.
+that are included in the views of all columns treated by the same \MPIrank{}.
 At first glance, the effect of sharing rows of $\hat A$ is an increased complexity
 in bookkeeping. However, it comes with two striking upsides:
 \begin{itemize}
@@ -124,7 +124,7 @@ than to a matrix-vector multiplication. Hence, we can exploit data re-use to eff
 save memory bandwidth. 
 The overall arithmetic intensity (AI) is now higher than the base AI of the block-times-block multiplication.

-\subsection{Load balancing}
+\subsection{Load balancing} \label{sec:load_balancing}

 In order to make the parallelization efficient,
 we want to keep the workload on all \MPIrank{}s balanced.
@@ -134,11 +134,13 @@ Each \MPIrank{} has to treat at most $\lceil N_a/N_r \rceil$ atoms locally.
 It follows that $\lceil N_a/N_r \rceil N_r - N_a$ more atoms could be
 hosted and the highest \MPIrank{}s are potentially idle.
 A better concept would be to distribute the atoms according to
-$$ N_a = N_{r>} \lceil N_a/N_r \rceil + N_{r<} \lfloor N_a/N_r \rfloor $$
+\begin{equation}
+ N_a = N_{r>} \lceil N_a/N_r \rceil + N_{r<} \lfloor N_a/N_r \rfloor
+\end{equation}
 where $N_r = N_{r>} + N_{r<}$ and, unless $N_r$ divides $N_a$,
 the $\lceil \cdot \rceil$ term is exactly one atom larger than its $\lfloor \cdot \rfloor$ counterpart.
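
The balanced distribution can be illustrated with a few lines of Python. This is only a sketch: the function names `balanced_counts` and `atoms_per_rank` are hypothetical and not part of KKRnano, and the choice to give the larger share to the lowest ranks is an assumption made for illustration.

```python
def balanced_counts(n_atoms, n_ranks):
    """Split n_atoms over n_ranks so that rank sizes differ by at most one.

    Returns (n_hi, hi, n_lo, lo) with n_atoms == n_hi * hi + n_lo * lo
    and n_ranks == n_hi + n_lo, where hi = ceil(n_atoms / n_ranks)
    and lo = floor(n_atoms / n_ranks).
    """
    if n_ranks < 1:
        raise ValueError("need at least one rank")
    lo, remainder = divmod(n_atoms, n_ranks)
    hi = lo + (1 if remainder else 0)
    n_hi = remainder            # ranks that host ceil(N_a / N_r) atoms
    n_lo = n_ranks - remainder  # ranks that host floor(N_a / N_r) atoms
    return n_hi, hi, n_lo, lo


def atoms_per_rank(n_atoms, n_ranks):
    """List of atom counts per rank (assumption: larger shares come first)."""
    n_hi, hi, n_lo, lo = balanced_counts(n_atoms, n_ranks)
    return [hi] * n_hi + [lo] * n_lo
```

For example, `balanced_counts(13500, 64)` assigns 211 atoms to 60 ranks and 210 atoms to the remaining 4 ranks, so no rank idles.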

-\subsection{Atom order matters}
+\subsection{Atom order matters} \label{sec:atom_order_matters}

 In order to make use of both arguments from above,
 i.e.~lower memory requirements and higher floating-point efficiency,
@@ -163,19 +165,27 @@ reads the coordinates and
 optimizes their order for a given $N_r$
 and finally outputs them in the optimized order.

-\subsection{Optimization model}
+\subsection{Optimization model} \label{sec:optimization_model}

 The true number of shared matrix block-rows can be found by constructing the binary table $Z_{aa'}$,
 which states whether atom $a'$ is inside the truncation zone of atom $a$,
 for all atoms and comparing their overlap, $\hat W = \hat Z^T \hat Z$.
 In each \MPIrank{} $r$ the total overlap needs to be maximized.
 We define $X_{ra}$ to be unity if atom $a$ is treated in \MPIrank{} $r$ and zero otherwise.
-The model function to be maximized then is 
-$$ \sum_{raa'} X_{ra} W_{aa'} X_{ra'} $$
+The model function to be maximized then is
+\begin{equation}
+ M = \sum_{raa'} X_{ra} W_{aa'} X_{ra'} \label{eqn:quadratic_term}
+\end{equation}
 under the constraints
-$$ \sum_{r} X_{ra} = 1 $$
+\begin{equation}
+ \sum_{r} X_{ra} = 1  \label{eqn:constraint_atom}
+\end{equation}
 and
-$$ \sum_{a} X_{ra} = n_{a}(r) $$
+\begin{equation}
+ \sum_{a} X_{ra} = n_{a}(r)  \label{eqn:constraint_rank}
+\end{equation}
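
The quantities $Z_{aa'}$, $W_{aa'}$ and the model function can be made concrete with a small pure-Python sketch. All function names are hypothetical. The assignment $X_{ra}$ is encoded as a list `rank_of_atom`, which satisfies the constraint that every atom belongs to exactly one rank by construction; the per-rank atom counts are not enforced here. The brute-force loops scale quadratically or worse and are only meant for tiny examples.

```python
def truncation_table(positions, r_trc):
    """Z[a][b] = 1 if atom b lies inside the truncation sphere of atom a."""
    r2 = r_trc * r_trc
    n = len(positions)
    return [
        [
            1 if sum((p - q) ** 2 for p, q in zip(positions[a], positions[b])) <= r2 else 0
            for b in range(n)
        ]
        for a in range(n)
    ]


def overlap_matrix(Z):
    """W = Z^T Z, i.e. W[a][b] counts the atoms seen by both a and b."""
    n = len(Z)
    return [
        [sum(Z[k][a] * Z[k][b] for k in range(n)) for b in range(n)]
        for a in range(n)
    ]


def model_function(W, rank_of_atom):
    """M = sum over r, a, a' of X_ra W_aa' X_ra' with X_ra = (rank_of_atom[a] == r)."""
    n = len(W)
    return sum(
        W[a][b]
        for a in range(n)
        for b in range(n)
        if rank_of_atom[a] == rank_of_atom[b]
    )


# Toy geometry (assumption): two well separated pairs of atoms on a line.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
W = overlap_matrix(truncation_table(positions, r_trc=1.5))
compact = model_function(W, [0, 0, 1, 1])    # neighbors share a rank
scattered = model_function(W, [0, 1, 0, 1])  # neighbors are split up
```

In this toy geometry the compact assignment yields the larger value of $M$, which is the behavior the optimization is meant to reward.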
+
+\subsection{Cheaper operators} \label{sec:cheaper_operators}

 \begin{figure}[h!]
 \begin{center}
@@ -187,7 +197,7 @@ $$ \sum_{a} X_{ra} = n_{a}(r) $$

 Experiments with the geometry of amorphous structures of 
 $\mathrm{Cu}_{8640}\mathrm{Zr}_{4860}$ (cf.~fig.~\ref{fig:Cu8640Zr4860_jmol})
-show that it is fairly compute intensive to construct $\hat W$.
+have shown that it is fairly compute-intensive to construct $\hat W$ from $\hat Z$.
 However, we can observe that the deviation between matrix elements of $\hat W$ and a simple
 sphere volume overlap model is small, cf.~fig.~\ref{fig:truncation_sphere_overlap_vs_distance}.

@@ -195,55 +205,61 @@ sphere volumen overlap model is small, c.f.~fig.\ref{fig:truncation_sphere_overl
 \begin{center}
  \includegraphics[width=0.6\textwidth]{truncation_sphere_overlap_vs_distance_new}
  \caption{Model vs data:
-  For atom pair distances $d_{aa'}$ larger than $2R\um{trc} = 31.9\,\AA$ the overlap vanishes.
-  The count of how many atoms are in both truncation zones coincides well with the sphere volume model.
+  For atom pair distances $d_{aa'}$ larger than $2r\um{trc} = 31.9\,$\AA{} the overlap vanishes.
+  The count of how many atoms are in both truncation zones coincides well with the sphere volume model (grey dashed line).
  The three colors indicate different excerpts from the data sets.
+  Counts have been rescaled with the average number of $1072.33$ atoms per truncation zone.
  }
 \label{fig:truncation_sphere_overlap_vs_distance}
 \end{center}
 \end{figure}

-\subsection{Sphere volume overlap model}
+\subsection{Sphere volume overlap model} \label{sec:sphere_volume_overlap_model}

-Assume two spheres of equal radius $R\um{trc}$ whose centers are separated by $d \geq 0$.
-For $d > 2 R\um{trc}$ we encounter a vanishing overlap.
-For $0 \leq d \leq 2 R\um{trc}$ we can find the model for the overlap $M$ from the lense-shaped body of rotation that
+Assume two spheres of equal radius $r\um{trc}$ whose centers are separated by $d \geq 0$.
+For $d > 2 r\um{trc}$ we encounter a vanishing overlap.
+For $0 \leq d \leq 2 r\um{trc}$ we can find the model for the overlap $M\um{svo}$ from the lens-shaped body of rotation that
 originates from the intersection of two spheres.
-The sphere surface is described by $\sqrt{R\um{trc}^2 - x^2}$.
+The sphere surface is described by the profile $\sqrt{r\um{trc}^2 - x^2}$, where $x$ is measured from the sphere center along the axis connecting both centers.
 Now,
 \begin{align*}
-  M &= 2 \pi \int\limits_{\frac 12 d}^{R\um{trc}} \mathrm d x \left( \sqrt{ R\um{trc}^2 - x^2 } \right)^2 \\
-    &= 2 \pi \left( \frac 23 R\um{trc}^3 - \frac d2 R\um{trc}^2 + \frac{d^3}{24} \right) \\
-    &= \frac{4\pi}{3} \left( R\um{trc}^3 - \frac{3d}{4} R\um{trc}^2 + \frac{d^3}{16} \right) \\
-    &= \frac{4\pi}{3} \left( \frac d2 - R\um{trc} \right)^2 \left(\frac d4 + R\um{trc} \right) \\
+ \frac{4\pi}{3} r\um{trc}^3 \  M\um{svo}	&= 2 \pi \int\limits_{\frac 12 d}^{r\um{trc}} \mathrm d x \left( \sqrt{ r\um{trc}^2 - x^2 } \right)^2 \\
+			&= 2 \pi \left( \frac 23 r\um{trc}^3 - \frac d2 r\um{trc}^2 + \frac{d^3}{24} \right) \\
+			&= \frac{4\pi}{3} \left( r\um{trc}^3 - \frac{3d}{4} r\um{trc}^2 + \frac{d^3}{16} \right)  \\
+			&= \frac{4\pi}{3} \left( \frac d2 - r\um{trc} \right)^2 \left(\frac d4 + r\um{trc} \right) % \label{eqn:sphere_volume_overlap_model}
 \end{align*}
-Obviously, the function assumes the volumen of one sphere $\frac{4\pi}{3} R\um{trc}^3$ for $d=0$
-and connects to the vanishing overlap at $d=2R\um{trc}$ with zero value and derivative 
-due to the second power at the linear term $(d/2 - R\um{trc})$.
+Obviously, the term above assumes the volume of one sphere $\frac{4\pi}{3} r\um{trc}^3$ for $d=0$
+and connects to the vanishing overlap at $d=2r\um{trc}$ with zero value and derivative 
+due to the second power at the linear term $(d/2 - r\um{trc})$.
+For simplicity, we can normalize the term to the sphere volume, i.e.
+\begin{equation}
+ M\um{svo} = \left( \frac{d}{2 r\um{trc}} - 1 \right)^2 \left(\frac{d}{4 r\um{trc}} + 1 \right) \label{eqn:sphere_volume_overlap_model}
+\end{equation}
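
A minimal sketch of the sphere volume overlap model in Python follows. The function names are hypothetical. The overlap is normalized to the volume of one sphere, so that it equals one at $d = 0$ and vanishes for $d \geq 2 r\um{trc}$; multiplying by the sphere volume and an atom density gives an estimate for the number of atoms in both truncation zones. The lattice constant, cell size and atom count are taken from the plotting notes in this commit.

```python
import math


def sphere_overlap_fraction(d, r_trc):
    """Overlap volume of two spheres of radius r_trc at center distance d,
    divided by the volume of a single sphere."""
    if d < 0.0:
        raise ValueError("distance must be non-negative")
    s = d / r_trc
    if s >= 2.0:
        return 0.0  # spheres no longer intersect
    return (0.5 * s - 1.0) ** 2 * (0.25 * s + 1.0)


def expected_shared_atoms(d, r_trc, atom_density):
    """Estimated number of atoms lying inside both truncation spheres."""
    sphere_volume = 4.0 * math.pi / 3.0 * r_trc ** 3
    return atom_density * sphere_volume * sphere_overlap_fraction(d, r_trc)


# Numbers from the plotting notes: 13500 atoms in a cubic cell of edge 15 * alat.
alat = 3.987116496
r_trc = 4 * alat
atom_density = 13500 / (alat * 15) ** 3
```

With these numbers, `expected_shared_atoms(0.0, r_trc, atom_density)` reproduces the average of about $1072.33$ atoms per truncation zone quoted in the figure caption.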

 \begin{figure}[h!]
 \begin{center}
  \includegraphics[width=0.6\textwidth]{sphere_volume_overlap_model}
  \caption{Model formula for the overlap of two spheres of unit radius. 
-  For a distance $d$ between their centers, the overlap is given by $(d/2 - 1)^2 (d/4 + 1)$.}
+  For a distance $d$ between their centers, the overlap is given by $(d/2 - 1)^2 (d/4 + 1)$ for $d \leq 2$
+  and zero for $d > 2$.}
 \label{fig:sphere_volume_overlap_model}
 \end{center}
 \end{figure}

-\section{Task}
+\section{Tasks} \label{sec:tasks}

-\subsection{Preparations}
+\subsection{Preparations} \label{sec:tasks_preparations}

-In the sources, the generation of the boolean table $Z_{aa'}$
+In the present sources, the generation of the boolean table $Z_{aa'}$
 has been implemented. For this, first the pair distance table $d_{aa'}$ is constructed.
 Here, an $N_a^2$-scaling ansatz has been taken for simplicity.
 This becomes a waste of computing time as we are only interested in atom pair distances
-which are smaller than $R\um{trc}$.
-If we directly determine $W_{aa'}$ via the sphere volume overlap formula, 
+which are smaller than $r\um{trc}$.
+If we directly determine $W_{aa'}$ via the sphere volume overlap formula (see Section \ref{sec:sphere_volume_overlap_model}), 
 we can omit the computation of $\hat Z$.
-However, all atom pair distances must be smaller than $2R\um{trc}$.
+However, we then need all atom pairs whose distance is smaller than $2r\um{trc}$.

-The linear scaling ansatz is to construct boxes with edge lengths of at least $2R\um{trc}$.
+The linear scaling ansatz is to construct boxes with edge lengths of at least $2r\um{trc}$.
 For each atom, we add the atom to a member list of the box it lies in.
 Now we only need to evaluate atom pairs whose partners lie in the same box or 
 in one of the $26$ direct neighbor boxes.
@@ -251,9 +267,9 @@ in one of the $26$ direct neighbor boxes.
 Even better than boxes would be to use dodecahedra, 
 i.e.~an fcc geometry as there would only be $12$ neighbor dodecahedra.
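
The box-based ansatz described above can be sketched as follows. This is not the KKRnano implementation: the function name `pairs_within` is hypothetical, cubic boxes with an edge length equal to the cutoff are used, and periodic boundary conditions are ignored for brevity. For the overlap model the cutoff would be $2r\um{trc}$.

```python
import math
from collections import defaultdict
from itertools import product


def pairs_within(positions, cutoff):
    """Return {(a, b): distance} for all atom pairs a < b closer than cutoff.

    Atoms are sorted into cubic boxes of edge length cutoff, so partners
    can only sit in the same box or in one of the 26 direct neighbor boxes.
    """
    boxes = defaultdict(list)
    for a, pos in enumerate(positions):
        key = tuple(math.floor(c / cutoff) for c in pos)
        boxes[key].append(a)

    pairs = {}
    for key, members in boxes.items():
        for shift in product((-1, 0, 1), repeat=3):  # own box plus 26 neighbors
            neighbor = tuple(k + s for k, s in zip(key, shift))
            # .get avoids creating empty boxes while iterating over the dict
            for b in boxes.get(neighbor, ()):
                for a in members:
                    if a < b:  # visit every pair exactly once
                        d = math.dist(positions[a], positions[b])
                        if d < cutoff:
                            pairs[(a, b)] = d
    return pairs
```

For a roughly homogeneous atom density, the number of atoms per box is bounded, so the work grows linearly with the number of atoms instead of quadratically.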

-\subsection{Optimization}
+\subsection{Optimization} \label{sec:tasks_optimization}

-We have to maximize the aforementioned overlap on each \MPIrank{}
+We have to maximize the aforementioned overlap $M$ on each \MPIrank{}
 under the additional constraint of balancing the work among all \MPIrank{}s.
 Fortunately, for systems that are filled relatively homogeneously with atoms,
 this leads to compact and convex formations of atoms belonging to the same \MPIrank{}.
@@ -261,11 +277,11 @@ Hence, this approach also minimizes the required MPI communication volumes
 (and number of communication partners)
 which are necessary during the setup of the block-rows of $\hat A$.

-\section{Conclusion}
+\section{Summary} \label{sec:summary}

 We introduce the problem of assigning atoms to \MPIrank{}s
 such that in a KKRnano calculation, 
-the best floating-point efficiency of the tfQMR solver is maximized, and
+the floating-point efficiency of the tfQMR solver is maximized, and
 memory consumption and MPI communication are minimized.

 % -- optional

--- a/source/KKRnano/doc/DomainDecomposition/truncation_sphere_overlap_vs_distance.txt
+++ b/source/KKRnano/doc/DomainDecomposition/truncation_sphere_overlap_vs_distance.txt
+from numpy import pi
+alat = 3.987116496
+cell_volume = (alat * 15)**3
+atom_density = 13500/cell_volume
+Rtrc = 4*alat
+sphere_volume = 4*pi/3*Rtrc**3
+sphere_volume
+atom_density
+normalize = (sphere_volume)**(-1)
+## Formula = 4*pi/3.*((Rtrc - .75*distance)*Rtrc**2 + distance**3/16.) * atom_density
+## exactly zero for distance==2*Rtrc
+## value at distance==0 is 1072.330292425316
+## y=3.14159*((4./3.*15.948465984 - x)*15.948465984^2 + x^3/12.) * .0631078255
--- a/source/KKRnano/doc/DomainDecomposition/truncation_sphere_overlap_vs_distance_new.agr
+++ b/source/KKRnano/doc/DomainDecomposition/truncation_sphere_overlap_vs_distance_new.agr