Recent ramblings

I got CUDA running on four devices using two GeForce GTX 295 cards (each GTX 295 is a dual-GPU board, so two cards show up as four CUDA devices).
On 32-bit Windows XP, at least, it ran correctly with no fuss at all.
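
As a quick sanity check, a minimal sketch along these lines (plain CUDA runtime API, nothing assumed beyond the standard headers) should report four devices on a box with two GTX 295 boards:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      int count = 0;
      cudaGetDeviceCount(&count);            /* two GTX 295 boards -> expect 4 */
      printf("CUDA devices: %d\n", count);

      for (int i = 0; i < count; ++i) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, i);
          printf("  device %d: %s (%d multiprocessors)\n",
                 i, prop.name, prop.multiProcessorCount);
      }
      return 0;
  }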

The latest CUDA Toolkit is 2.2, and you need to download it from the (overseas) NVIDIA site:
http://www.nvidia.com/object/cuda_get.html

Below is the CUDA 2.2 overview that was posted on the forums.

The CUDA Toolkit and SDK v2.2 is now released and available to all developers.

  • Officially adds support for Windows 7, Server 2003, Server 2008, Ubuntu 8.10, RHEL 5.3, and Fedora 10
  • Includes cuda-gdb (hardware GPU debugger) for RHEL5 32 and 64-bit (officially supported and tested), but it may work on more platforms than just those
  • Exclusive device mode in Linux: set some GPUs as exclusive-compute (can only own a single CUDA context) and some as non-compute (no CUDA contexts allowed) for easier management of clusters/MPI applications. See the manpage for nvidia-smi for how to set this and cudaSetValidDevices in the reference manual on how to best use this from CUDART (a sketch follows after this list).
  • Zero-copy support: transparently read certain system memory directly from a kernel on GT200 or MCP79 systems (sketch after this list). See this post for more information on how it works.
  • Asynchronous memcpy support on Vista/Server 2008/Win7 (see the stream sketch after this list)
  • Texture from pitch-linear memory: use this to avoid an additional memcpy in some scenarios.
  • >4GB of pinned memory in a single allocation on most OSes
  • maximum pinned memory per allocation increased in Vista to ~1.5GB
  • pinned memory can be shared between contexts
  • Multi-device OpenGL interop performance between a Quadro display card and a separate compute card is dramatically improved.
  • Visual Profiler works on Vista
  • Visual Profiler supports additional counters for GT200 to measure number of memory transactions of a given size, instruction throughput, etc.
  • Blocking sync support for all platforms: allows the host thread to sleep and be awoken by the driver when the GPU operation the host thread is waiting on has completed (sketch after this list).
  • Quite a few additional math functions added due to forum requests (feel free to keep posting requests, we do pay attention)
  • __threadfence(): ensure that a thread's pending memory writes are visible to all threads before continuing. It is explicitly not a global sync, despite how it may look to some (sketch after this list).
  • Lots of bugfixes, of course; most importantly, killing a CUDA app should behave much, much better than it ever has before, especially when you're on a dedicated compute card
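
For the exclusive device mode item above, the CUDART side is cudaSetValidDevices; the device order below is made up for illustration, and the nvidia-smi step itself is whatever its manpage describes. A sketch:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      /* Hypothetical preference order. On a node where some GPUs are set to
         exclusive-compute (or no-compute) via nvidia-smi, the runtime binds to
         the first device in this list on which it can still create a context. */
      int preferred[] = { 0, 1, 2, 3 };
      cudaSetValidDevices(preferred, 4);

      /* The first call that needs a context (e.g. an allocation) picks the device. */
      float *d = NULL;
      cudaMalloc((void **)&d, 1024 * sizeof(float));

      int dev = -1;
      cudaGetDevice(&dev);
      printf("bound to device %d\n", dev);

      cudaFree(d);
      return 0;
  }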
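For the zero-copy item, the usual recipe (a sketch, assuming a GT200 or MCP79 part whose canMapHostMemory property is set) is mapped pinned memory plus a device pointer obtained for it, instead of an explicit cudaMemcpy:

  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void scale(const float *in, float *out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * in[i];      /* reads mapped host memory directly */
  }

  int main(void)
  {
      const int n = 1 << 20;

      /* Must be set before the context is created, i.e. before any other CUDA call. */
      cudaSetDeviceFlags(cudaDeviceMapHost);

      float *h_in = NULL, *h_out = NULL;
      cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
      cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
      for (int i = 0; i < n; ++i) h_in[i] = (float)i;

      float *d_in = NULL, *d_out = NULL;
      cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
      cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

      scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
      cudaThreadSynchronize();               /* results land directly in h_out */

      printf("h_out[10] = %f\n", h_out[10]);
      cudaFreeHost(h_in);
      cudaFreeHost(h_out);
      return 0;
  }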
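For the asynchronous memcpy item, the copy has to come from page-locked memory and be issued into a stream so it can overlap with kernel work; a minimal sketch:

  #include <cuda_runtime.h>

  __global__ void doubleInPlace(float *d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= 2.0f;
  }

  int main(void)
  {
      const int n = 1 << 20;
      float *h = NULL, *d = NULL;
      cudaStream_t stream;

      cudaStreamCreate(&stream);
      cudaMallocHost((void **)&h, n * sizeof(float));   /* pinned allocation */
      cudaMalloc((void **)&d, n * sizeof(float));

      /* Copy and kernel are queued in the same stream; the host is not blocked. */
      cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
      doubleInPlace<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

      cudaStreamSynchronize(stream);                    /* wait for both to finish */

      cudaFree(d);
      cudaFreeHost(h);
      cudaStreamDestroy(stream);
      return 0;
  }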
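For the blocking sync item, the flag (cudaDeviceBlockingSync is the CUDA 2.2 spelling, if I have it right) must be set before the context exists; after that, host-side waits sleep instead of spinning. A sketch:

  #include <cuda_runtime.h>

  __global__ void longRunningKernel(void)
  {
      /* stand-in for real work */
  }

  int main(void)
  {
      /* Must precede any call that creates the context. */
      cudaSetDeviceFlags(cudaDeviceBlockingSync);

      longRunningKernel<<<1, 1>>>();
      cudaThreadSynchronize();   /* host thread sleeps until the kernel finishes */
      return 0;
  }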
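And for __threadfence(): the canonical pattern is to publish a block's partial result, fence, and only then signal completion through an atomic counter, so that the last block to signal can safely read every partial result. It orders memory writes; it is not a barrier across blocks. A sketch (one thread per block, purely to stay short):

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  __device__ unsigned int blocksDone = 0;

  __global__ void sumAll(const float *in, float *partial, float *total, int n)
  {
      /* Each block sums a strided slice of the input. */
      float s = 0.0f;
      for (int i = blockIdx.x; i < n; i += gridDim.x)
          s += in[i];
      partial[blockIdx.x] = s;

      __threadfence();           /* make partial[blockIdx.x] visible to all blocks */

      unsigned int done = atomicInc(&blocksDone, gridDim.x);
      if (done == gridDim.x - 1) {
          /* This block signalled last, so every partial result is published. */
          float t = 0.0f;
          for (int b = 0; b < gridDim.x; ++b)
              t += partial[b];
          *total = t;
      }
  }

  int main(void)
  {
      const int n = 1 << 20, blocks = 64;
      float *h = (float *)malloc(n * sizeof(float));
      for (int i = 0; i < n; ++i) h[i] = 1.0f;

      float *d_in, *d_partial, *d_total;
      cudaMalloc((void **)&d_in, n * sizeof(float));
      cudaMalloc((void **)&d_partial, blocks * sizeof(float));
      cudaMalloc((void **)&d_total, sizeof(float));
      cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

      sumAll<<<blocks, 1>>>(d_in, d_partial, d_total, n);

      float total = 0.0f;
      cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
      printf("total = %f (expected %d)\n", total, n);

      cudaFree(d_in); cudaFree(d_partial); cudaFree(d_total);
      free(h);
      return 0;
  }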