Note that the NVIDIA profiler tool is broken...

t$ ./vaddd
-------------------------------------------------------------
           system name = Linux
             node name = titan
               release = 6.4.0-genunix
               version = #1 SMP Mon Jun 26 15:53:25 GMT 2023
               machine = x86_64
             page size = 4096
          avail memory = 540904357888
                       = 528226912 kB
                       = 515846 MB
-------------------------------------------------------------
INFO : number of host CPUs: 40
     : number of CUDA devices: 2
     : we need 13027889070 bytes of memory on a GPU
     : 0: Quadro GP100 totalGlobalMem = 17069309952
     : 0: Quadro GP100 is to be selected
     : 1: Quadro K6000 totalGlobalMem = 12791250944
     : min memory unit is 1: Quadro K6000 with 12791250944 bytes
     : max memory unit is 0: Quadro GP100 with 17069309952 bytes
     : selected Quadro GP100
     : Quadro K6000 selected
     : Vector addition of 516979725 double FP64 elements
     : random data loaded 8365043093 nsecs 8.365043 secs
     : Wallclock cudaMalloc(A) 4685645 nsecs 0.004685645 secs
     : Wallclock cudaMalloc(B) 4712680 nsecs 0.00471268 secs
     : Wallclock cudaMalloc(C) 4414429 nsecs 0.004414429 secs
     : Copy of vector A from host to device done.
     : Wallclock cudaMemcpy() 455459656 nsecs 0.4554596 secs
     : Copy of vector B from host to device done.
     : Wallclock cudaMemcpy() 463675894 nsecs 0.4636759 secs
     : CUDA kernel launch with 504864 blocks of 1024 threads
     : vectorAdd done.
     : Wallclock kernel launch 50036587 nsecs 0.05003659 secs
     : cudaEventElapsedTime claims 0.04999689 secs
     : Copy result vector C from device to host done.
     : cudaMemcpy() 1223154597 nsecs 1.223155 secs
     : A + B correct within error 2^(-48) epsilon
     : result check done 1513243663 nsecs 1.513244 secs
DONE : total time 16178688201 nsecs 16.17869 secs
t$
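
With the profiler unavailable, the figures printed in the log can at least be cross-checked by hand. The following is a minimal sketch, not part of vaddd itself: it recomputes the block count, compares the logged memory requirement against each card, and derives rough transfer rates. All variable names are made up for illustration, and the kernel rate assumes the kernel reads A and B once and writes C once, which the log does not confirm.

```python
# Sanity checks on the numbers printed by ./vaddd (values copied from the log).
N = 516_979_725            # FP64 elements per vector
BYTES_PER_ELEM = 8         # size of a double
THREADS_PER_BLOCK = 1024

vec_bytes = N * BYTES_PER_ELEM

# Ceiling division: enough blocks so that every element gets a thread.
blocks = (N + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK

# Logged requirement versus the raw size of the three vectors A, B and C.
need_logged = 13_027_889_070
ratio = need_logged / (3 * vec_bytes)

# Which card can hold the logged requirement (totalGlobalMem from the log)?
devices = {"Quadro GP100": 17_069_309_952, "Quadro K6000": 12_791_250_944}
fits = {name: need_logged <= mem for name, mem in devices.items()}

def gb_per_s(nbytes, nsecs):
    """Bytes per nanosecond is numerically the same as GB/s (10^9 bytes)."""
    return nbytes / nsecs

h2d_a = gb_per_s(vec_bytes, 455_459_656)      # copy of A, host to device
h2d_b = gb_per_s(vec_bytes, 463_675_894)      # copy of B, host to device
d2h_c = gb_per_s(vec_bytes, 1_223_154_597)    # copy of C, device to host
# Assumption: the kernel moves 3 vectors' worth of data (2 reads + 1 write).
kernel = gb_per_s(3 * vec_bytes, 50_036_587)

print(f"blocks needed       : {blocks}")
print(f"need / (3 vectors)  : {ratio:.4f}")
print(f"fits on device      : {fits}")
print(f"H2D A, B            : {h2d_a:.2f}, {h2d_b:.2f} GB/s")
print(f"D2H C               : {d2h_c:.2f} GB/s")
print(f"kernel (effective)  : {kernel:.1f} GB/s")
```

The block count comes out at 504864, matching the launch line. The logged requirement is about 1.05 times the size of the three vectors, and by these figures it fits on the GP100 but not on the K6000. The copies work out to roughly 9 GB/s host to device and about 3.4 GB/s device to host, and the kernel to roughly 248 GB/s under the stated assumption.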