Posts in the 'Programming/openCL & CUDA' category (page 3)

80 posts filed under 'Programming/openCL & CUDA'

2013.02.17 cuda deviceQuery on GTX650
2013.02.16 cuda 5.0
2012.09.20 cuda shared memory coalescing rule?
2012.07.11 Where do cudaMalloc addresses start?
2012.06.07 cudaMemcpy()
2012.06.06 A malloc larger than device memory is dangerous in cuda!
2012.06.05 nvcc likes int main()
2012.06.05 Access latency in clock cycles for each cuda memory type
2012.06.03 Compiling vectorAdd by hand on ubuntu
2012.06.02 CUDA deviceQuery - ION 330

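cuda shared memory coalescing rule?

While reading a CUDA book I kept getting stuck on the coalescing rule..
Thinking it over, I had forgotten that "shared memory" is shared only among a fixed group of threads (one thread block)...
Put differently, when copying into shared memory each thread copies only a single 4-byte (int) value,
the data is transferred from device memory in whole blocks at a higher level,
and the thread block takes care of distributing it.

For that reason, the shared memory examples
do not copy a whole range with a for loop; each thread copies just one element of the array declared with __shared__ -_-

In other words, if I accept the performance loss
and do not use shared memory, is there no reason to rack my brain over the coalescing rule?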

---
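2012.9.22
On a second look, this is not about shared memory;
it concerns every operation that loads from off-chip device memory.
Because of how cuda works, reads from off-chip device memory (global/local, slow) are transferred in blocks,
so even when simply reading from device memory and writing back to device memory,
it is better to follow the coalescing rule
in order to minimize the slowdown on the read side.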
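
The block-wise transfer described above can be illustrated with a rough host-side model. This is only a sketch under stated assumptions: a warp of 32 threads, 4-byte elements, and off-chip memory served in aligned 128-byte segments (a simplification; the actual rules depend on the compute capability). The name segments_touched is hypothetical and not part of any CUDA API.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not a CUDA API): counts how many aligned memory
// segments one warp touches when thread i reads element i * stride.
// Assumptions: 32 threads per warp, 128-byte segments, aligned base address.
std::size_t segments_touched(std::size_t stride,
                             std::size_t elem_bytes = 4,
                             std::size_t segment_bytes = 128,
                             std::size_t warp_size = 32) {
    std::set<std::size_t> segments;
    for (std::size_t thread = 0; thread < warp_size; ++thread) {
        // Byte offset of the element this thread reads.
        std::size_t byte_offset = thread * stride * elem_bytes;
        // Record the segment that contains this offset.
        segments.insert(byte_offset / segment_bytes);
    }
    return segments.size();
}
```

In this model a stride of 1 keeps all 32 reads inside one segment, while a stride of 32 puts every thread in its own segment, so the same amount of useful data needs up to 32 transfers.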

Posted by 구차니

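Where do cudaMalloc addresses start?

Just from experimenting,

allocations start at 0x0021 0000.
Copying from 0x0000 0000 with cudaMemcpy seems to return all zeros..
Maybe that is because the address is invalid?

Video memory does seem to be divided into the frame buffer, texture memory and so on..
Surely it is not the case that only the part allocated as texture memory can be reached through cudaMalloc / cudaMemcpy?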


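cudaMemcpy()

As with memcpy(), the addresses are passed in dst, src order.
However, because graphics card memory (device memory) and main memory (host memory) have to be told apart,
the direction and kind of the copy must be specified as well.

The usual cuda programming sequence is
host -> device
cuda computation
device -> host

so copying once in each direction, as below, should do?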

// size is the number of bytes to copy (the count argument)
cudaMemcpy(dev_mem, host_mem, size, cudaMemcpyHostToDevice);
kernel_name<<< ... >>>(...);
cudaMemcpy(host_mem, dev_mem, size, cudaMemcpyDeviceToHost);

5.8.2.18 cudaError_t cudaMemcpy (void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. The memory areas may not overlap. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

Parameters:

dst - Destination memory address

src - Source memory address

count - Size in bytes to copy

kind - Type of transfer

Returns:

cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection

5.28.3.9 enum cudaMemcpyKind

CUDA memory copy types

Enumerator:

cudaMemcpyHostToHost Host -> Host

cudaMemcpyHostToDevice Host -> Device

cudaMemcpyDeviceToHost Device -> Host

cudaMemcpyDeviceToDevice Device -> Device

cudaMemcpyDefault Default based unified virtual address space

---
Added 2012.07.11
On a second look, the form is cudaMemcpy(dst, src, count, direction);
Put differently, cudaMemcpy(To, From, count, dir_FromTo);


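A malloc larger than device memory is dangerous in cuda!

Following an example file, I allocated no fewer than three int arrays sized for 512 * 65535 threads!
4 bytes * 512 * 65535 = roughly 128MB?

Anyway, allocating three of them comes to about 384MB ..
But when I ran that on a system where the ION has only 256MB assigned,
it looked as if it ran normally, yet the results were all garbage (every result came out as 0 -_-)
And after recklessly (!) running it about 7 times, even X Window broke down and I had to restart it over ssh OTL

Conclusion: work out the memory usage properly before you malloc!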
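
The arithmetic above can be checked before any allocation is attempted. A minimal sketch, assuming the 266010624 bytes of global memory that deviceQuery reports for this ION in the last post on this page; required_bytes and fits_in_device are hypothetical helper names, and a real program would ask the CUDA runtime for the available amount instead of using a constant.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of the CUDA runtime.

// Bytes needed for `arrays` arrays of `elements` items of `elem_bytes` each.
std::uint64_t required_bytes(std::uint64_t elements,
                             std::uint64_t elem_bytes,
                             std::uint64_t arrays) {
    return elements * elem_bytes * arrays;
}

// True when the request fits into the given amount of device memory.
bool fits_in_device(std::uint64_t needed, std::uint64_t device_bytes) {
    return needed <= device_bytes;
}
```

Three int arrays of 512 * 65535 elements need 402647040 bytes (about 384MB) against 266010624 bytes on the device, so the request cannot fit. Checking the value returned by each cudaMalloc call would also expose the failure instead of letting the kernel run on pointers that were never allocated.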


nvcc likes int main()

Out of boredom(?) I tried void main() and got this kind of error -_-

vectorAdd.cu(11): warning: return type of function "main" must be "int"

vectorAdd.cu:11:11: error: ‘::main’ must return ‘int’

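It feels like compilation goes fine and then the build gives up at linking -_-
(judging by the two message formats, though, the warning comes from nvcc's own front end and the error from the host g++ compiler, not from the linker)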


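Access latency in clock cycles for each cuda memory type

Looking through Nvidia's official cuda documentation, the access latency of each memory type is described in detail.

Off-chip memory (not on the GPU chip itself) such as local memory and device memory takes 400~800 clocks to access,
while on-chip memory (on the GPU chip) such as shared memory and registers can be accessed in about 10~20 clocks.
constant memory, from 2.0 onward, costs 400~800 clocks on the first access and can be read in 10~20 clocks once it is cached.

Put differently, for data that is read very frequently (in terms of clock counts, more than 40 times?)
always loading it into shared memory and reading it quickly from there looks better for the overall execution speed.
(If it has to be read and then stored back, that may grow to more than 80 times?)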
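
The trade-off above can be put into numbers with a very rough cost model. This is a sketch under loud assumptions: every off-chip access costs one fixed latency, every on-chip access costs one fixed smaller latency, and both latency hiding by other warps and caching are ignored. The function names are hypothetical, and the 600 and 15 cycle figures used in the note below are simply midpoints of the ranges quoted above.

```cpp
#include <cstdint>

// Hypothetical cost model for illustration; ignores latency hiding and caches.

// Total cycles when every one of `reads` accesses goes to off-chip memory.
std::uint64_t direct_cycles(std::uint64_t reads,
                            std::uint64_t off_chip_latency) {
    return reads * off_chip_latency;
}

// Total cycles when the value is loaded from off-chip memory once, written
// to shared memory once, and every read is then served from shared memory.
std::uint64_t staged_cycles(std::uint64_t reads,
                            std::uint64_t off_chip_latency,
                            std::uint64_t on_chip_latency) {
    return off_chip_latency + on_chip_latency + reads * on_chip_latency;
}
```

With 600 and 15 cycles, a single read is cheaper directly (600 against 630 cycles), while 40 reads cost 24000 cycles directly against 1215 cycles when staged. In this simplified model staging already pays off from the second read.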

5.2 Maximize Utilization
5.2.3 Multiprocessor Level

If all input operands are registers, latency is caused by register dependencies, i.e. some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In the case of a back-to-back register dependency (i.e. some input operand is written by the previous instruction), the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions for different warps during that time. Execution time varies depending on the instruction, but it is typically about 22 clock cycles for devices of compute capability 1.x and 2.x and about 11 clock cycles for devices of compute capability 3.0, which translates to 6 warps for devices of compute capability 1.x and 22 warps for devices of compute capability 2.x and higher (still assuming that warps execute instructions with maximum throughput, otherwise fewer warps are needed). For devices of compute capability 2.1 and higher, this is also assuming enough instruction-level parallelism so that schedulers are always able to issue pairs of instructions for each warp.

If some input operand resides in off-chip memory, the latency is much higher: 400 to 800 clock cycles. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e. arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low (this ratio is commonly called the arithmetic intensity of the program). If this ratio is 15, for example, then to hide latencies of about 600 clock cycles, about 10 warps are required for devices of compute capability 1.x and about 40 for devices of compute capability 2.x and higher (with the same assumptions as in the previous paragraph).

5.3 Maximize Memory Throughput

The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth.

That means minimizing data transfers between the host and the device, as detailed in Section 5.3.1, since these have much lower bandwidth than data transfers between global memory and the device.

That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared memory and caches (i.e. L1/L2 caches available on devices of compute capability 2.x and higher, texture cache and constant cache available on all devices).

[Source: CUDA_C_Programming_Guide.pdf]


Compiling vectorAdd by hand on ubuntu

You can simply run make on all the examples in the SDK, but building just one of them by hand turns out to be rather tedious -_-

/usr/local/cuda/bin/nvcc << the cuda compiler
~/NVIDIA_GPU_Computing_SDK/C/common/inc << include path
~/NVIDIA_GPU_Computing_SDK/shared/inc << include path

Compiling with plain nvcc fails twice, as shown below; searching for the files directly turned up the two include paths above.

vectorAdd.cu:24:82: fatal error: sdkHelper.h: No such file or directory

vectorAdd.cu:25:23: fatal error: shrQATest.h: No such file or directory

nvcc itself gets simple once its directory is added to the PATH environment variable,
but that INCLUDE_PATH looks like it will come back to bite me..

$ /usr/local/cuda/bin/nvcc vectorAdd.cu \
-I/home/minimonk/NVIDIA_GPU_Computing_SDK/C/common/inc \
-I/home/minimonk/NVIDIA_GPU_Computing_SDK/shared/inc

The header files are in the two locations below.

~/NVIDIA_GPU_Computing_SDK/C/common/inc$ ll
total 572
drwxrwxr-x 4 minimonk minimonk 4096 Jun 2 19:00 ./
drwxrwxr-x 6 minimonk minimonk 4096 Jun 2 20:25 ../
drwxrwxr-x 2 minimonk minimonk 4096 Jun 2 19:00 GL/
-rw-rw-r-- 1 minimonk minimonk 14391 Jun 2 19:00 bank_checker.h
-rw-rw-r-- 1 minimonk minimonk 15841 Jun 2 19:00 cmd_arg_reader.h
-rw-rw-r-- 1 minimonk minimonk 3999 Jun 2 19:00 cudaGLHelper.h
-rw-rw-r-- 1 minimonk minimonk 6442 Jun 2 19:00 cudaHelper.h
-rw-rw-r-- 1 minimonk minimonk 20787 Jun 2 19:00 cuda_drvapi_dynlink.c
-rw-rw-r-- 1 minimonk minimonk 46935 Jun 2 19:00 cutil.h
-rw-rw-r-- 1 minimonk minimonk 3183 Jun 2 19:00 cutil_gl_error.h
-rw-rw-r-- 1 minimonk minimonk 3492 Jun 2 19:00 cutil_gl_inline.h
-rw-rw-r-- 1 minimonk minimonk 1116 Jun 2 19:00 cutil_inline.h
-rw-rw-r-- 1 minimonk minimonk 953 Jun 2 19:00 cutil_inline_bankchecker.h
-rw-rw-r-- 1 minimonk minimonk 13007 Jun 2 19:00 cutil_inline_drvapi.h
-rw-rw-r-- 1 minimonk minimonk 17779 Jun 2 19:00 cutil_inline_runtime.h
-rw-rw-r-- 1 minimonk minimonk 36730 Jun 2 19:00 cutil_math.h
-rw-rw-r-- 1 minimonk minimonk 10864 Jun 2 19:00 drvapi_error_string.h
drwxrwxr-x 2 minimonk minimonk 4096 Jun 2 19:00 dynlink/
-rw-rw-r-- 1 minimonk minimonk 15321 Jun 2 19:00 dynlink_d3d10.h
-rw-rw-r-- 1 minimonk minimonk 6202 Jun 2 19:00 dynlink_d3d11.h
-rw-rw-r-- 1 minimonk minimonk 1809 Jun 2 19:00 error_checker.h
-rw-rw-r-- 1 minimonk minimonk 5035 Jun 2 19:00 exception.h
-rw-rw-r-- 1 minimonk minimonk 26819 Jun 2 19:00 helper_cuda.h
-rw-rw-r-- 1 minimonk minimonk 5973 Jun 2 19:00 helper_cuda_drvapi.h
-rw-rw-r-- 1 minimonk minimonk 4151 Jun 2 19:00 helper_cuda_gl.h
-rw-rw-r-- 1 minimonk minimonk 1132 Jun 2 19:00 helper_functions.h
-rw-rw-r-- 1 minimonk minimonk 22291 Jun 2 19:00 helper_image.h
-rw-rw-r-- 1 minimonk minimonk 10491 Jun 2 19:00 helper_string.h
-rw-rw-r-- 1 minimonk minimonk 15757 Jun 2 19:00 helper_timer.h
-rw-rw-r-- 1 minimonk minimonk 1323 Jun 2 19:00 multithreading.h
-rw-rw-r-- 1 minimonk minimonk 7228 Jun 2 19:00 nvGLWidgets.h
-rw-rw-r-- 1 minimonk minimonk 4646 Jun 2 19:00 nvGlutWidgets.h
-rw-rw-r-- 1 minimonk minimonk 2967 Jun 2 19:00 nvMath.h
-rw-rw-r-- 1 minimonk minimonk 10850 Jun 2 19:00 nvMatrix.h
-rw-rw-r-- 1 minimonk minimonk 12347 Jun 2 19:00 nvQuaternion.h
-rw-rw-r-- 1 minimonk minimonk 6415 Jun 2 19:00 nvShaderUtils.h
-rw-rw-r-- 1 minimonk minimonk 20642 Jun 2 19:00 nvVector.h
-rw-rw-r-- 1 minimonk minimonk 15917 Jun 2 19:00 nvWidgets.h
-rw-rw-r-- 1 minimonk minimonk 5309 Jun 2 19:00 param.h
-rw-rw-r-- 1 minimonk minimonk 2321 Jun 2 19:00 paramgl.h
-rw-rw-r-- 1 minimonk minimonk 1024 Jun 2 19:00 rendercheck_d3d10.h
-rw-rw-r-- 1 minimonk minimonk 1027 Jun 2 19:00 rendercheck_d3d11.h
-rw-rw-r-- 1 minimonk minimonk 998 Jun 2 19:00 rendercheck_d3d9.h
-rw-rw-r-- 1 minimonk minimonk 7662 Jun 2 19:00 rendercheck_gl.h
-rw-rw-r-- 1 minimonk minimonk 26682 Jun 2 19:00 sdkHelper.h
-rw-rw-r-- 1 minimonk minimonk 1116 Jun 2 19:00 stopwatch.h
-rw-rw-r-- 1 minimonk minimonk 1897 Jun 2 19:00 stopwatch_base.h
-rw-rw-r-- 1 minimonk minimonk 2804 Jun 2 19:00 stopwatch_base.inl
-rw-rw-r-- 1 minimonk minimonk 11608 Jun 2 19:00 stopwatch_functions.h
-rw-rw-r-- 1 minimonk minimonk 4647 Jun 2 19:00 stopwatch_linux.h
-rw-rw-r-- 1 minimonk minimonk 9342 Jun 2 19:00 string_helper.h

~/NVIDIA_GPU_Computing_SDK/shared/inc$ ll
total 192
drwxrwxr-x 4 minimonk minimonk 4096 Jun 2 19:00 ./
drwxrwxr-x 6 minimonk minimonk 4096 Jun 2 20:25 ../
drwxrwxr-x 2 minimonk minimonk 4096 Jun 2 19:00 GL/
-rw-rw-r-- 1 minimonk minimonk 15439 Jun 2 19:00 cmd_arg_reader.h
drwxrwxr-x 2 minimonk minimonk 4096 Jun 2 19:00 dynlink/
-rw-rw-r-- 1 minimonk minimonk 5035 Jun 2 19:00 exception.h
-rw-rw-r-- 1 minimonk minimonk 1323 Jun 2 19:00 multithreading.h
-rw-rw-r-- 1 minimonk minimonk 7228 Jun 2 19:00 nvGLWidgets.h
-rw-rw-r-- 1 minimonk minimonk 4646 Jun 2 19:00 nvGlutWidgets.h
-rw-rw-r-- 1 minimonk minimonk 2966 Jun 2 19:00 nvMath.h
-rw-rw-r-- 1 minimonk minimonk 10850 Jun 2 19:00 nvMatrix.h
-rw-rw-r-- 1 minimonk minimonk 12347 Jun 2 19:00 nvQuaternion.h
-rw-rw-r-- 1 minimonk minimonk 6108 Jun 2 19:00 nvShaderUtils.h
-rw-rw-r-- 1 minimonk minimonk 20642 Jun 2 19:00 nvVector.h
-rw-rw-r-- 1 minimonk minimonk 15917 Jun 2 19:00 nvWidgets.h
-rw-rw-r-- 1 minimonk minimonk 8092 Jun 2 19:00 rendercheckGL.h
-rw-rw-r-- 1 minimonk minimonk 6801 Jun 2 19:00 shrQATest.h
-rw-rw-r-- 1 minimonk minimonk 33202 Jun 2 19:00 shrUtils.h

The .so files are in the path below.

/usr/local/cuda/lib$ ll
total 394240
drwxr-xr-x 2 root root 4096 Jun 2 18:47 ./
drwxr-xr-x 12 root root 4096 Jun 2 18:47 ../
lrwxrwxrwx 1 root root 14 Jun 2 18:46 libcublas.so -> libcublas.so.4*
lrwxrwxrwx 1 root root 18 Jun 2 18:47 libcublas.so.4 -> libcublas.so.4.2.9*
-rwxr-xr-x 1 root root 105568932 Jun 2 18:46 libcublas.so.4.2.9*
lrwxrwxrwx 1 root root 14 Jun 2 18:46 libcudart.so -> libcudart.so.4*
lrwxrwxrwx 1 root root 18 Jun 2 18:47 libcudart.so.4 -> libcudart.so.4.2.9*
-rwxr-xr-x 1 root root 427344 Jun 2 18:46 libcudart.so.4.2.9*
lrwxrwxrwx 1 root root 13 Jun 2 18:47 libcufft.so -> libcufft.so.4*
lrwxrwxrwx 1 root root 17 Jun 2 18:46 libcufft.so.4 -> libcufft.so.4.2.9*
-rwxr-xr-x 1 root root 29333272 Jun 2 18:47 libcufft.so.4.2.9*
lrwxrwxrwx 1 root root 13 Jun 2 18:46 libcuinj.so -> libcuinj.so.4*
lrwxrwxrwx 1 root root 17 Jun 2 18:46 libcuinj.so.4 -> libcuinj.so.4.2.9*
-rwxr-xr-x 1 root root 157120 Jun 2 18:46 libcuinj.so.4.2.9*
lrwxrwxrwx 1 root root 14 Jun 2 18:46 libcurand.so -> libcurand.so.4*
lrwxrwxrwx 1 root root 18 Jun 2 18:46 libcurand.so.4 -> libcurand.so.4.2.9*
-rwxr-xr-x 1 root root 27434820 Jun 2 18:47 libcurand.so.4.2.9*
lrwxrwxrwx 1 root root 16 Jun 2 18:46 libcusparse.so -> libcusparse.so.4*
lrwxrwxrwx 1 root root 20 Jun 2 18:46 libcusparse.so.4 -> libcusparse.so.4.2.9*
-rwxr-xr-x 1 root root 188374460 Jun 2 18:47 libcusparse.so.4.2.9*
lrwxrwxrwx 1 root root 11 Jun 2 18:46 libnpp.so -> libnpp.so.4*
lrwxrwxrwx 1 root root 15 Jun 2 18:46 libnpp.so.4 -> libnpp.so.4.2.9*
-rwxr-xr-x 1 root root 52381048 Jun 2 18:46 libnpp.so.4.2.9*


CUDA deviceQuery - ION 330

This is the deviceQuery output for the ION built into a Ripple Look ion330.
It has 2 MPs, for a total of 16 CUDA cores.

~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./deviceQuery

[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "ION"

CUDA Driver Version / Runtime Version 4.2 / 4.2

CUDA Capability Major/Minor version number: 1.1

Total amount of global memory: 254 MBytes (266010624 bytes)

( 2) Multiprocessors x ( 8) CUDA Cores/MP: 16 CUDA Cores

GPU Clock rate: 1100 MHz (1.10 GHz)

Memory Clock rate: 800 Mhz

Memory Bus Width: 64-bit

Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)

Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per multiprocessor: 768

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Concurrent copy and execution: No with 0 copy engine(s)

Run time limit on kernels: No

Integrated GPU sharing Host Memory: Yes

Support host page-locked memory mapping: Yes

Concurrent kernel execution: No

Alignment requirement for Surfaces: Yes

Device has ECC support enabled: No

Device is using TCC driver mode: No

Device supports Unified Addressing (UVA): No

Device PCI Bus ID / PCI location ID: 3 / 0

Compute Mode:

< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = ION

[deviceQuery] test results...

PASSED

> exiting in 3 seconds:

3...2...1...done!

By the way.. as an integrated GPU it shares main memory, so why do the bandwidth figures differ this much?

~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest

[bandwidthTest] starting...

./bandwidthTest Starting...

Running on...

Device 0: ION

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 887.0

Device to Host Bandwidth, 1 Device(s), Paged memory

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 735.9

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 5345.2

[bandwidthTest] test results...

PASSED

> exiting in 3 seconds: 3...2...1...done!
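
For comparison with the measured figures above, the theoretical peak of the memory bus can be estimated from the deviceQuery values (800 MHz memory clock, 64-bit bus). A minimal sketch; peak_bandwidth_mb_s is a hypothetical helper, and the number of transfers per clock is an assumption, because deviceQuery does not say whether the reported clock already accounts for double data rate.

```cpp
#include <cstdint>

// Hypothetical helper: peak bus bandwidth in MB/s (1 MB = 10^6 bytes here).
// transfers_per_clock is an assumption: 1 for single data rate, 2 for DDR.
std::uint64_t peak_bandwidth_mb_s(std::uint64_t clock_mhz,
                                  std::uint64_t bus_width_bits,
                                  std::uint64_t transfers_per_clock) {
    // MHz * bytes per transfer * transfers per clock gives MB/s directly.
    return clock_mhz * (bus_width_bits / 8) * transfers_per_clock;
}
```

For 800 MHz and 64 bits this gives 6400 MB/s at one transfer per clock or 12800 MB/s at two, so the 5345.2 MB/s device-to-device result is roughly in the range of the bus limit, while the paged host-to-device and device-to-host copies at 887.0 and 735.9 MB/s stay far below it.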

Compared with the 8800GT, the nbody compute speed and fps definitely feel much lower
(the 8800GT reaches about 150fps and 50GFLOP/s)

2010/11/02 - [Programming/openCL / CUDA] - CUDA example run results + SLI

+ How to check the nvidia driver version on Linux

$ cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86 Kernel Module 295.40 Thu Apr 5 21:28:09 PDT 2012

GCC version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

[Link: http://www.nvnews.net/vbulletin/showthread.php?t=127289]


구차니의 잡동사니 모음
