Recent ramblings

I got CUDA running on four devices using two GeForce GTX 295 cards (each GTX 295 is a dual-GPU board, so two cards show up as four CUDA devices).
On 32-bit Windows XP, at least, it ran correctly with no fuss at all.
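
As a quick sanity check, a minimal sketch along these lines (plain CUDA runtime API, nothing assumed beyond the standard headers) should report four devices on a box with two GTX 295 boards:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      int count = 0;
      cudaGetDeviceCount(&count);            /* two GTX 295 boards -> expect 4 */
      printf("CUDA devices: %d\n", count);

      for (int i = 0; i < count; ++i) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, i);
          printf("  device %d: %s (%d multiprocessors)\n",
                 i, prop.name, prop.multiProcessorCount);
      }
      return 0;
  }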

The latest CUDA Toolkit is 2.2, and you need to download it from the (overseas) NVIDIA site:
http://www.nvidia.com/object/cuda_get.html

Below is the CUDA 2.2 overview that was posted on the forums.

The CUDA Toolkit and SDK v2.2 is now released and available to all developers.

  • Officially adds support for Windows 7, Server 2003, Server 2008, Ubuntu 8.10, RHEL 5.3, and Fedora 10
  • Includes cuda-gdb (hardware GPU debugger) for RHEL5 32 and 64-bit (officially supported and tested), but it may work on more platforms than just those
  • Exclusive device mode in Linux: set some GPUs as exclusive-compute (can only own a single CUDA context) and some as non-compute (no CUDA contexts allowed) for easier management of clusters/MPI applications. See the manpage for nvidia-smi for how to set this and cudaSetValidDevices in the reference manual on how to best use this from CUDART (a sketch follows after this list).
  • Zero-copy support: transparently read certain system memory directly from a kernel on GT200 or MCP79 systems (sketch after this list). See this post for more information on how it works.
  • Asynchronous memcpy support on Vista/Server 2008/Win7 (see the stream sketch after this list)
  • Texture from pitch-linear memory: use this to avoid an additional memcpy in some scenarios.
  • >4GB of pinned memory in a single allocation on most OSes
  • maximum pinned memory per allocation increased in Vista to ~1.5GB
  • pinned memory can be shared between contexts
  • Multi-device OpenGL interop performance between a Quadro display card and a separate compute card is dramatically improved.
  • Visual Profiler works on Vista
  • Visual Profiler supports additional counters for GT200 to measure number of memory transactions of a given size, instruction throughput, etc.
  • Blocking sync support for all platforms: allows the host thread to sleep and be awoken by the driver when the GPU operation the host thread is waiting on has completed (sketch after this list).
  • Quite a few additional math functions added due to forum requests (feel free to keep posting requests, we do pay attention)
  • __threadfence(): ensure that a thread's pending memory writes are visible to all threads before continuing. It is explicitly not a global sync, despite how it may look to some (sketch after this list).
  • Lots of bugfixes, of course; most importantly, killing a CUDA app should behave much, much better than it ever has before, especially when you're on a dedicated compute card
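
For the exclusive device mode item above, the CUDART side is cudaSetValidDevices; the device order below is made up for illustration, and the nvidia-smi step itself is whatever its manpage describes. A sketch:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      /* Hypothetical preference order. On a node where some GPUs are set to
         exclusive-compute (or no-compute) via nvidia-smi, the runtime binds to
         the first device in this list on which it can still create a context. */
      int preferred[] = { 0, 1, 2, 3 };
      cudaSetValidDevices(preferred, 4);

      /* The first call that needs a context (e.g. an allocation) picks the device. */
      float *d = NULL;
      cudaMalloc((void **)&d, 1024 * sizeof(float));

      int dev = -1;
      cudaGetDevice(&dev);
      printf("bound to device %d\n", dev);

      cudaFree(d);
      return 0;
  }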
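For the zero-copy item, the usual recipe (a sketch, assuming a GT200 or MCP79 part whose canMapHostMemory property is set) is mapped pinned memory plus a device pointer obtained for it, instead of an explicit cudaMemcpy:

  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void scale(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * in[i];      /* reads mapped host memory directly */
  }

  int main(void)
  {
      const int n = 1 << 20;

      /* Must be set before the context is created, i.e. before any other CUDA call. */
      cudaSetDeviceFlags(cudaDeviceMapHost);

      float *h_in = NULL, *h_out = NULL;
      cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
      cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
      for (int i = 0; i < n; ++i) h_in[i] = (float)i;

      float *d_in = NULL, *d_out = NULL;
      cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
      cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

      scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
      cudaThreadSynchronize();               /* results land directly in h_out */

      printf("h_out[10] = %f\n", h_out[10]);
      cudaFreeHost(h_in);
      cudaFreeHost(h_out);
      return 0;
  }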
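For the asynchronous memcpy item, the copy has to come from page-locked memory and be issued into a stream so it can overlap with kernel work; a minimal sketch:

  #include <cuda_runtime.h>

  __global__ void doubleInPlace(float *d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= 2.0f;
  }

  int main(void)
  {
      const int n = 1 << 20;
      float *h = NULL, *d = NULL;
      cudaStream_t stream;

      cudaStreamCreate(&stream);
      cudaMallocHost((void **)&h, n * sizeof(float));   /* pinned allocation */
      cudaMalloc((void **)&d, n * sizeof(float));

      /* Copy and kernel are queued in the same stream; the host is not blocked. */
      cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
      doubleInPlace<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

      cudaStreamSynchronize(stream);                    /* wait for both to finish */

      cudaFree(d);
      cudaFreeHost(h);
      cudaStreamDestroy(stream);
      return 0;
  }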
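For the blocking sync item, the flag (cudaDeviceBlockingSync is the CUDA 2.2 spelling, if I have it right) must be set before the context exists; after that, host-side waits sleep instead of spinning. A sketch:

  #include <cuda_runtime.h>

  __global__ void longRunningKernel(void)
  {
      /* stand-in for real work */
  }

  int main(void)
  {
      /* Must precede any call that creates the context. */
      cudaSetDeviceFlags(cudaDeviceBlockingSync);

      longRunningKernel<<<1, 1>>>();
      cudaThreadSynchronize();   /* host thread sleeps until the kernel finishes */
      return 0;
  }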
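And for __threadfence(): the canonical pattern is to publish a block's partial result, fence, and only then signal completion through an atomic counter, so that the last block to signal can safely read every partial result. It orders memory writes; it is not a barrier across blocks. A sketch (one thread per block, purely to stay short):

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  __device__ unsigned int blocksDone = 0;

  __global__ void sumAll(const float *in, float *partial, float *total, int n)
  {
      /* Each block sums a strided slice of the input. */
      float s = 0.0f;
      for (int i = blockIdx.x; i < n; i += gridDim.x)
          s += in[i];
      partial[blockIdx.x] = s;

      __threadfence();           /* make partial[blockIdx.x] visible to all blocks */

      unsigned int done = atomicInc(&blocksDone, gridDim.x);
      if (done == gridDim.x - 1) {
          /* This block signalled last, so every partial result is published. */
          float t = 0.0f;
          for (int b = 0; b < gridDim.x; ++b)
              t += partial[b];
          *total = t;
      }
  }

  int main(void)
  {
      const int n = 1 << 20, blocks = 64;
      float *h = (float *)malloc(n * sizeof(float));
      for (int i = 0; i < n; ++i) h[i] = 1.0f;

      float *d_in, *d_partial, *d_total;
      cudaMalloc((void **)&d_in, n * sizeof(float));
      cudaMalloc((void **)&d_partial, blocks * sizeof(float));
      cudaMalloc((void **)&d_total, sizeof(float));
      cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

      sumAll<<<blocks, 1>>>(d_in, d_partial, d_total, n);

      float total = 0.0f;
      cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
      printf("total = %f (expected %d)\n", total, n);

      cudaFree(d_in); cudaFree(d_partial); cudaFree(d_total);
      free(h);
      return 0;
  }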