研究メモ: mpi_alltoallを使った行列の転置

簡単な例

2次元配列を考える。どうすればいいのか考えるために簡単な例から調べる。

3processorでA(6,6)を一次元目で切る。

A＝
11 12 |13 14 |15 16
21 22 |23 34 |25 27
31 32 |33 34 |35 36
41 42 |43 44 |45 46
51 52 |53 54 |55 56
61 62 |63 64 |65 66
#1 #2 #3 processor

mpi_alltoallでこの行列を渡すと4つが一つの単位となりprocessor間で渡される。Aと同じ位置にalltoall後の配列を書くとこうなっている。

11 12 | 31 32 | 51 52
21 22 | 41 42 | 61 62
13 14 | 33 34 | 53 54
23 24 | 42 44 | 63 64
15 16 | 53 54 | 55 56
25 26 | 63 64 | 65 66
#1 #2 #3 processor

processor内部での転置

processor #1は

11 12 21 22 13 14 23 24 15 16 25 26

という配列を持っている。

a= 11 12
b = 21 22
c = 13 14
d = ...

と定義して a-fまでの６つの配列を2x3だと思ってみると

a b
c d
e f
(memory imageが a b c d e f であるという意味）

である。この転置を取り

a c e
b d f
（memory imageが a c e b d f であるという意味）

と並び替える。a,b,cなどを数値に戻すと6x2の行列に対してメモリーimageが

11 12 13 14 15 16
21 22 23 24 25 26

となる。

他のprocessでも同じことを行い全体としては

11 12 13 14 15 16 proccess #1
21 22 23 24 25 26
------------------
31 32 33 34 36 36 process #2
41 42 43 44 45 46
------------------
51 52 53 54 55 56 process #3
61 62 63 64 65 66

とな並べられており、rowでの分割とcolumnでの分割を無事入れ替えることができた。(分割表示も左右から上下に直している。）

一般的な行列サイズと、processor数の場合

global配列がA(n2,n3)

n2
---------------------------
| 　　　　　　　　　　　|
| 　　　　　　　　　　　| n3
| 　　　　　　　　　　　|
| 　　　　　　　　　　　|
| 　　　　　　　　　　　|
| 　　　　　　　　　　　|
---------------------------

に対してlocal memoryを
n2'
-------
| n3'
| |
--------
| |
| |
-------
| |
| |
-------
と切る。

transpose側の配列は

-n2'----n2'----n2'--
| | | | n3'
| | | |
--------------------
と切る。以下、n3'をn3dと置く。

comminucationを同じblockサイズ(=n2'*n3')で行う必要から、配列にmarginがあり、n2'*mpi__size >= n2, n3'>=n3である。このmargin分は以下のように計算できる。

n2d=n2/mpi__size

if (mod(n2,mpi__size)/=0) n2d=n2d+1

i2s=mpi__rank*n2d+1

i2e=(mpi__rank+1)*n2d

if (i2e>n2) i2e=n2

n3d=n3/mpi__size

if (mod(n3,mpi__size)/=0) n3d=n3d+1

i3s=mpi__rank*n3d+1

i3e=(mpi__rank+1)*n3d

if (i3e>n3) i3e=n3

n2f=n2d*mpi__size

n3f=n3d*mpi__size

i2f=i2s+n2d-1

n2d=n2/mpi__size
if (mod(n2,mpi__size)/=0) n2d=n2d+1
i2s=mpi__rank*n2d+1
i2e=(mpi__rank+1)*n2d
if (i2e>n2) i2e=n2

n3d=n3/mpi__size
if (mod(n3,mpi__size)/=0) n3d=n3d+1
i3s=mpi__rank*n3d+1
i3e=(mpi__rank+1)*n3d
if (i3e>n3) i3e=n3

n2f=n2d*mpi__size
n3f=n3d*mpi__size

i2f=i2s+n2d-1

allocate(a(i2s:i2f,n3f)); a=0 ! すべてのprocessに同じサイズが割り当てられる。
allocate(b(n2f,n3d)); b=0

結果を確認しやすいようにAの値はこうしておく。
do i3=1,n3
do i2=i2s,i2e
a(i2,i3)=i2+i3*10
enddo
enddo

bはmpi_alltoallでtransposeした配列になる。bも配列にmarginがある。alltoallでまずはbへ移す。

i1=int(size(a),kind=4)/mpi__size

i2=int(size(b),kind=4)/mpi__size

call mpi_alltoall(&

a,i1,MPI_INTEGER,&

b,i2,MPI_INTEGER,&

MPI_COMM_WORLD, mpi__err)

ここにi1,i2は各processorがcommunicateするblockのサイズであり配列の全体のサイズではないことに注意。例えば上の例の場合4つ=2x2となる。

ここでの配列bはまだtransposeとしては順序が合っていないのでそれをblockサイズでprocessor内でtransposeする。行列を切り直したいのでsubroutineに渡してしまう。

call transpose_submatrix(mpi__size,n2d,n3f,b)

subroutine transpose_submatrix(mpi__size,n2d,n3,b)

implicit none

integer,intent(in):: mpi__size

integer,intent(in)::n2d,n3

integer,intent(inout):: b(n2d,n3/mpi__size,mpi__size)

integer:: c(n2d,mpi__size,n3/mpi__size) ! local array

integer::i,j

do j=1,mpi__size

do i=1,n3/mpi__size

c(:,j,i)=b(:,i,j)

enddo

b=0; call iaXPY(n2d*n3,1,c,1,b,1) ! iaxpyはblas1のsaxpyをintegerに修正したもの。メモリー上でb = c 相当のことを行えれば良い。

end subroutine transpose_submatrix

例１

最初にかなり変な例としてA(8,5)行列を4processで分割してtransposeをかけてみる。

　local A(n2d,n3f)

rank ここからaの中身

n2d

n3f

a 0 2 8 11 12 21 22 31 32 41 42 51 52 0 0 0 0 0 0

a 1 2 8 13 14 23 24 33 34 43 44 53 54 0 0 0 0 0 0

a 2 2 8 15 16 25 26 35 36 45 46 55 56 0 0 0 0 0 0

a 3 2 8 17 18 27 28 37 38 47 48 57 58 0 0 0 0 0 0

(

character(10):: form1='(a,100i4)'

write(6,form1)'size_of_a=',size(a,dim=1),size(a,dim=2),size(a)

として書いている。

）

とAの中身がなっているとする。これは行列の形で書くと

11 12 | 13 14 | 15 16 | 17 18 |

21 22 | 23 24 | 25 26 | 27 28 |

31 32 | 33 34 | 35 36 | 37 38 |

41 42 | 43 44 | 45 46 | 47 48 |

51 52 | 53 54 | 55 56 | 57 58 |

0 0 | 0 0 | 0 0 | 0 0 |

である。programで転置した結果は

local B(n2f,n3d)

rank ここからbの中身

n2f

n3d

b 0 8 2 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28

b 1 8 2 31 32 33 34 35 36 37 38 41 42 43 44 45 46 47 48

b 2 8 2 51 52 53 54 55 56 57 58 0 0 0 0 0 0 0 0

b 3 8 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

となる。ちゃんと行列でかいてあげると

11 12 13 14 15 16 17 18

21 22 23 24 25 26 27 28

--------------------------

31 32 33 34 35 36 37 38

41 42 43 44 45 46 47 48

-------------------------

51 52 53 54 55 56 57 58

0 0 0 0 0 0 0 0

-------------------------

0 0 0 0 0 0 0 0

-------------------------

と転置されていることが確認される。

例２

もうすこしまともな例をだそう。A(8,5)行列を3processで分割する。

rank A

n2d

n3f

a 0 3 6 11 12 13 21 22 23 31 32 33 41 42 43 51 52 53 0 0 0

a 1 3 6 14 15 16 24 25 26 34 35 36 44 45 46 54 55 56 0 0 0

a 2 3 6 17 18 0 27 28 0 37 38 0 47 48 0 57 58 0 0 0 0

A＝

11 12 13 |14 15 16 |17 18 0

21 22 23 |24 25 26 |27 28 0

31 32 33 |34 35 36 |37 38 0

41 42 43 |44 45 46 |47 48 0

51 52 53 |54 55 56 |57 58 0

0 0 0 |0 0 0 | 0 0 0

rank B

n2f

n3d

b 0 9 2 11 12 13 14 15 16 17 18 0 21 22 23 24 25 26 27 28 0

b 2 9 2 51 52 53 54 55 56 57 58 0 0 0 0 0 0 0 0 0 0

b 1 9 2 31 32 33 34 35 36 37 38 0 41 42 43 44 45 46 47 48 0

B＝

11 12 13 14 15 16 17 18 0

21 22 23 24 25 26 27 28 0

--------------------------

31 32 33 34 35 36 37 38 0

41 42 43 44 45 46 47 48 0

-------------------------

51 52 53 54 55 56 57 58 0

0 0 0 0 0 0 0 0 0

例３

A(8,3)を3processで切る。

rank A

n2d

n3f

a 0 3 3 11 12 13 21 22 23 31 32 33

a 1 3 3 14 15 16 24 25 26 34 35 36

a 2 3 3 17 18 0 27 28 0 37 38 0

行列の形では

11 12 13 | 14 15 16 | 17 18 0

21 22 23 | 24 25 26 | 27 28 0

31 32 33 | 34 35 36 | 37 38 0

転置行列は

rank B

n2f

n3d

b 0 9 1 11 12 13 14 15 16 17 18 0

b 1 9 1 21 22 23 24 25 26 27 28 0

b 2 9 1 31 32 33 34 35 36 37 38 0

行列の形では

11 12 13 14 15 16 17 18 0

-------------------------

21 22 23 24 25 26 27 28 0

-------------------------

31 32 33 34 35 36 37 38 0

となる。

2013年8月9日金曜日

mpi_alltoallを使った行列の転置

簡単な例

processor内部での転置

一般的な行列サイズと、processor数の場合

例１

例２

例３