CUDA Task Launcher for CPU and GPU

Date:

Slow interconnects are a fundamental bottleneck in CPU-GPU systems. If the cost of offloading a kernel to the GPU exceeds the computation time itself, migrating the job to the GPU makes no sense. We studied the potential of running unmodified CUDA source code on CPU SIMD units when they are available alongside GPUs. This functionality was implemented in the system software layer as a CUDA driver shim, so it was transparent to the application developer. On data center applications, we demonstrated 1.5X performance improvements over pure-CPU SIMD execution and 1.3X improvements over pure-GPU execution. This work was completed during an internship at Nvidia Research.
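The trade-off between offload cost and computation time can be illustrated with a simple cost model. The sketch below is not the shim's actual policy, which is not described here. It assumes a hypothetical launcher that has per-kernel time estimates and a basic interconnect model (fixed launch latency plus bytes divided by bandwidth). All names (KernelEstimate, Interconnect, choose_target) and all parameters are illustrative assumptions.

```cpp
#include <cstddef>

// Hypothetical cost model: every type, field, and function here is an
// illustrative assumption, not part of CUDA or of the project's shim.
enum class Target { CpuSimd, Gpu };

struct KernelEstimate {
    std::size_t bytes_to_transfer;  // input + output bytes crossing the interconnect
    double cpu_simd_seconds;        // estimated run time on the CPU SIMD units
    double gpu_compute_seconds;     // estimated kernel run time on the GPU alone
};

struct Interconnect {
    double bandwidth_bytes_per_s;   // sustained host<->device bandwidth
    double launch_latency_s;        // fixed per-launch overhead
};

// Total GPU cost = launch overhead + data transfer + kernel execution.
double gpu_total_seconds(const KernelEstimate& k, const Interconnect& link) {
    const double transfer =
        static_cast<double>(k.bytes_to_transfer) / link.bandwidth_bytes_per_s;
    return link.launch_latency_s + transfer + k.gpu_compute_seconds;
}

// Pick the GPU only when it wins after offload costs are included.
Target choose_target(const KernelEstimate& k, const Interconnect& link) {
    return gpu_total_seconds(k, link) < k.cpu_simd_seconds ? Target::Gpu
                                                           : Target::CpuSimd;
}
```

Under this model, a short kernel that moves a lot of data stays on the CPU even if the GPU computes it faster, because the transfer term dominates. A long-running kernel amortizes the same transfer and goes to the GPU.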