CUDA lately
I can now run CUDA across four devices using two GeForce GTX 295 cards (each card carries two GPUs).
On 32-bit Windows XP it worked correctly without any fuss.
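As a quick way to confirm that all four GPUs are visible, something like the following minimal sketch should do. It uses only `cudaGetDeviceCount` and `cudaGetDeviceProperties` from the CUDA runtime; the output format is my own choice, and the device count of 4 is what I would expect on the two-card setup above, not something the program assumes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime how many CUDA-capable devices it can see.
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices found: %d\n", count);

    // Print the name, compute capability and memory size of each device.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: %s\n", dev,
                         cudaGetErrorString(err));
            continue;
        }
        std::printf("  device %d: %s, compute capability %d.%d, %llu MB\n",
                    dev, prop.name, prop.major, prop.minor,
                    (unsigned long long)(prop.totalGlobalMem >> 20));
    }
    return 0;
}
```

Build it with `nvcc` and run it; with two GTX 295 cards installed, each GPU should show up as its own device.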
The latest CUDA Toolkit is 2.2, and it has to be downloaded from NVIDIA's overseas (English-language) site.
http://www.nvidia.com/object/cuda_get.html
Below is the overview of CUDA 2.2 that was posted on the forum.
The CUDA Toolkit and SDK v2.2 is now released and available to all developers.
- Officially adds support for Windows 7, Server 2003, Server 2008, Ubuntu 8.10, RHEL 5.3, and Fedora 10
- Includes cuda-gdb (hardware GPU debugger) for RHEL5 32 and 64-bit (officially supported and tested), but it may work on more platforms than just those
- Exclusive device mode in Linux: set some GPUs as exclusive-compute (can only own a single CUDA context) and some as non-compute (no CUDA contexts allowed) for easier management of clusters/MPI applications. See the manpage for nvidia-smi for how to set this and cudaSetValidDevices in the reference manual on how to best use this from CUDART.
- Zero-copy support: transparently read from certain system memory from a kernel on GT200 or MCP79 systems. See this post for more information on how it works.
- Asynchronous memcpy support on Vista/Server 2008/Win7
- Texture from pitchlinear memory: use this to avoid an additional memcpy at times in some scenarios.
- >4GB of pinned memory in a single allocation on most OSes
- maximum pinned memory per allocation increased in Vista to ~1.5GB
- pinned memory can be shared between contexts
- Multi-device OpenGL interop performance between a Quadro display card and a separate compute card is dramatically improved.
- Visual Profiler works on Vista
- Visual Profiler supports additional counters for GT200 to measure number of memory transactions of a given size, instruction throughput, etc.
- Blocking sync support for all platforms: allows the host thread to sleep and be awoken by driver when the GPU operation the host thread is waiting on is completed.
- Quite a few additional math functions added due to forum requests (feel free to keep posting requests, we do pay attention)
- __threadfence(): ensure that a thread's pending memory writes are visible to all threads before continuing. It is explicitly not a global sync, unlike how it appears to some.
- Lots of bugfixes, of course; most importantly, killing a CUDA app should behave much, much better than it ever has before, especially when you're on a dedicated compute card
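To make the zero-copy item in the list above a little more concrete, here is a minimal sketch of how mapped pinned memory is typically used from the runtime API. The kernel `scale`, the `CHECK` macro, and the buffer size are my own illustrative choices, not part of the announcement. The sketch calls `cudaDeviceSynchronize`, which is the current name; in the 2.2 era the equivalent call was `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and bail out of main() on failure.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Illustrative kernel: multiplies each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    // Zero-copy needs a device that can map host memory.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        std::printf("device 0 cannot map host memory\n");
        return 0;
    }

    // Enable mapping of pinned host allocations into the device address space.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    const int n = 1024;
    float *h = nullptr;
    // Pinned host buffer that the GPU can access directly.
    CHECK(cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Device-side pointer that aliases the same host buffer.
    float *d = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d, h, 0));

    // The kernel reads and writes system memory; no cudaMemcpy is involved.
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    std::printf("h[0] = %f\n", h[0]);
    CHECK(cudaFreeHost(h));
    return 0;
}
```

The point of the pattern is that the host buffer is written by the CPU, processed by the kernel, and read back by the CPU without an explicit copy in either direction. Whether that is faster than a regular `cudaMemcpy` depends on the access pattern and the hardware, so it is worth measuring.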