The first method is described by Morgan et al. [7]. Here the matrix products are calculated in parallel, while the partial sums are sent around the processor ring. At the end, each processor has found its .
The advantage of this parallelization is that the weight matrices are stored and modified only once, distributed across all processors. In return, this parallelization has to calculate the partial sums of the error between each of the p-1 communication steps.
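
The circulation of partial sums can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the implementation from [7]: the function name `ring_matvec`, the block partitioning of the weight matrix, and the direction of the ring are all hypothetical choices made for illustration. Each simulated processor q holds one slice of the input vector and the matching column block of the weight matrix, so no weight is stored twice. A partial sum for each output block travels once around the ring, picks up one local contribution per processor, and arrives at its owner after p-1 communication steps.

```python
import numpy as np

def ring_matvec(W, x, p):
    """Simulate y = W @ x on a ring of p processors (illustrative sketch).

    Processor q owns the columns cols[q] of W and the slice x[cols[q]].
    Returns the assembled result and the number of communication steps.
    """
    n_out, n_in = W.shape
    rows = np.array_split(np.arange(n_out), p)  # output block owned by each processor
    cols = np.array_split(np.arange(n_in), p)   # input block stored on each processor

    # At step 0, processor q holds the (empty) partial sum for output block q-1.
    acc = [np.zeros(len(rows[(q - 1) % p])) for q in range(p)]
    comm_steps = 0

    for s in range(p):
        # Local work: all processors add their contribution "in parallel".
        for q in range(p):
            b = (q - 1 - s) % p  # output block whose partial sum is at processor q
            acc[q] = acc[q] + W[np.ix_(rows[b], cols[q])] @ x[cols[q]]
        # Communication: pass every partial sum to the next processor in the ring.
        if s < p - 1:
            acc = [acc[(q - 1) % p] for q in range(p)]
            comm_steps += 1

    # After p-1 shifts, processor q holds the finished output block q.
    return np.concatenate(acc), comm_steps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 8))
    x = rng.standard_normal(8)
    y, steps = ring_matvec(W, x, p=4)
    print(np.allclose(y, W @ x), steps)
```

The sketch covers only the forward product. The partial sums of the error would circulate in the same way, with one local computation between each pair of communication steps.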