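I am trying to run the following code with different values of n on a Xeon Phi KNC (61 cores, 4 threads per core) and on a dual-socket Xeon E5-2660 v2 machine.
The timings I get are shown in the tables below. I am trying to understand why the MIC performs worse than the Xeon. What am I doing wrong here, and how can I fix it (if that is possible)?
Thanks!
Code: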
    program prog
        implicit none
        integer, allocatable :: arr1(:), arr2(:)
        integer :: i, n, time_start, time_end

        n = 481
        do while (n .le. 481000000)
            allocate(arr1(n), arr2(n))
            ! Host-side timer, so the measured interval spans the whole offload region
            call system_clock(time_start)
            !dir$ offload begin target(mic)
            !$omp SIMD
            do i = 1, n
                arr1(i) = arr1(i) + arr2(i)
            end do
            !dir$ end offload
            call system_clock(time_end)
            write (*,*) "n=", n, " time=", time_end - time_start
            deallocate(arr1, arr2)
            n = n*10
        end do
    end program
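The times reported below are raw system_clock counts, so their unit depends on the count rate of the compiler and integer kind in use. Here is a minimal sketch, assuming a Fortran 2003 compiler, of how the same loop could be timed in seconds by also querying count_rate. The program name timing_sketch, the fixed n, and the array initialization are illustrative and not part of the code above, and the offload directives are left out so that it runs on any host:

```fortran
program timing_sketch
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: n = 4810000      ! illustrative size, one of the values used above
    integer, allocatable :: arr1(:), arr2(:)
    integer(int64) :: t0, t1, rate
    integer :: i
    real :: seconds

    allocate(arr1(n), arr2(n))
    ! Initialize so the timed loop reads defined values (assumption: not in the original code)
    arr1 = 1
    arr2 = 2

    ! count_rate reports how many counts system_clock advances per second
    call system_clock(count_rate=rate)
    call system_clock(t0)
    do i = 1, n
        arr1(i) = arr1(i) + arr2(i)
    end do
    call system_clock(t1)

    seconds = real(t1 - t0) / real(rate)
    write (*,*) "n=", n, " counts=", t1 - t0, " seconds=", seconds
    ! Print one element so the compiler cannot discard the loop
    write (*,*) "arr1(n)=", arr1(n)
    deallocate(arr1, arr2)
end program timing_sketch
```

With times in seconds, the offload, native, and Xeon runs can be compared directly even if they were built with different compilers or options.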
Xeon Phi results (offload run; times are raw system_clock counts):
n= 481 time= 8881
n= 4810 time= 75
n= 48100 time= 53
n= 481000 time= 261
n= 4810000 time= 1991
n= 48100000 time= 18912
n= 481000000 time= 188203
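Setup:
    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -o out_122
    #SBATCH --exclusive
    export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter
    export MIC_OMP_NUM_THREADS=122
    ./prog.exe
The job is submitted with:
    sbatch -p xphi -N 1 --exclusive run_par.sh
The script above is run_par.sh, which holds all the settings, and xphi is the name of the device (passed to sbatch as the partition with -p).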
It is also worth mentioning that running natively (adding !dir$ offload begin target(mic) before !$omp SIMD) gives better results:
n= 481 time= 0
n= 4810 time= 0
n= 48100 time= 6
n= 481000 time= 55
n= 4810000 time= 455
n= 48100000 time= 4342
n= 481000000 time= 43322
For the native run, the setup is:
    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -o out_244_native
    #SBATCH --exclusive
    export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH
    micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"
Xeon results:
n= 481 time= 0
n= 4810 time= 0
n= 48100 time= 2
n= 481000 time= 19
n= 4810000 time= 93
n= 48100000 time= 706
n= 481000000 time= 7006
This is the output of the lscpu command on my Xeon machine:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping: 4
CPU MHz: 1203.382
BogoMIPS: 4405.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
My MIC specification (the tail of /proc/cpuinfo):
processor : 239
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1052.630
cache size : 512 KB
physical id : 0
siblings : 240
core id : 59
cpu cores : 60
apicid : 239
initial apicid : 239
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips : 2112.44
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: