I shipped a dozen products with them (mostly video games), so there's nothing "wrong" that would make them unusable.
But programming them and setting up the graphics pipe (all the passes, structured buffers, compiling, binding, weird errors, and synchronization) is a huge PITA compared to the convenience of CUDA.
Compilers are way less mature, especially on some platforms cough.
Some GPU capabilities are not exposed.
No real composability or libraries.
No proper debugging.
These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.
I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. It’s possible to implement very complicated things without compiling a single line of C++; the software is mostly in C#.
> setting up the graphics pipe
I picked D3D11, as opposed to D3D12 or Vulkan. D3D11 is significantly higher level, and much easier to use.
> compiling, binding
The compiler runs at design time: I ship the shaders precompiled, with compilation integrated into the IDE. I solved the bindings with a simple code-generation tool which parses HLSL and generates C#.
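The code-generation idea can be sketched in a few lines. This toy version is not the actual Cgml tool — the HLSL pattern it matches and the C# it emits are assumptions for illustration — but it shows the shape: scan HLSL for resource declarations and emit C# constants for the register slots.

```cpp
#include <regex>
#include <string>
#include <vector>

// Toy sketch: find declarations like
//   "RWStructuredBuffer<float> result : register( u1 );"
// and emit a C# slot constant for each one. The real tool presumably does
// proper parsing; a regex is enough to illustrate the workflow.
std::vector<std::string> generateBindings(const std::string& hlsl) {
    std::regex decl(R"((\w+)<[^>]+>\s+(\w+)\s*:\s*register\(\s*([tu])(\d+)\s*\))");
    std::vector<std::string> lines;
    for (auto it = std::sregex_iterator(hlsl.begin(), hlsl.end(), decl);
         it != std::sregex_iterator(); ++it) {
        const std::smatch& m = *it;
        // m[2] = variable name, m[4] = register slot number
        lines.push_back("public const int " + m[2].str() + "Slot = " + m[4].str() + ";");
    }
    return lines;
}
```

Running this over a shader's source at build time gives the C# side compile-checked names for every binding, which is the point: no slot numbers duplicated by hand.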
I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.
When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number, often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.
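A back-of-envelope calculation shows why batch-1 generation is bandwidth-bound: each generated token has to read every weight once, so token latency is roughly model bytes divided by memory bandwidth. The numbers used below (7B parameters, fp16, 500 GB/s) are illustrative assumptions, not measurements of any particular GPU or model.

```cpp
// Back-of-envelope roofline for batch-size-1 token generation: every weight
// must be read once per token, so latency is bounded by bytes moved divided
// by memory bandwidth. All inputs are illustrative assumptions.
double tokenLatencyMs(double paramsBillion, double bytesPerParam, double bandwidthGBs) {
    double gigabytesPerToken = paramsBillion * bytesPerParam; // GB read per token
    return gigabytesPerToken / bandwidthGBs * 1000.0;         // milliseconds
}
```

With those assumed numbers you get roughly 28 ms per token (~35 tokens/s) with the ALUs mostly idle, which is why faster matrix units don't move the needle in this regime.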
> there's no way D3D11 can compete with either CUDA
Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (i.e. most ML inference), the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.
Most ML inference is bottlenecked on memory bandwidth, not compute. This certainly applies to both Whisper and Mistral models.
> it certainly doesn't spell the end of CUDA
No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.
AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.
Moreover, it can help with performance just by removing abstraction layers. I think the reason the compute-shader-based Whisper outperformed the CUDA-based version on the same GPU is that these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program the GPU, as opposed to calling libraries with tons of abstraction layers inside them. This saves the memory bandwidth spent storing and then reloading temporary tensors.
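To make the "temporary tensors" point concrete, here is a sketch that only counts bytes of memory traffic for an elementwise step like y = relu(x * s + b): a library-style unfused pipeline writes an intermediate tensor and reads it back, while a single hand-written shader touches each element once in and once out. This is pure byte counting on the host — no GPU, no benchmark — just the arithmetic behind the claim.

```cpp
#include <cstddef>

// Traffic for y = relu(x * s + b) over n floats.
// Unfused (library-style): kernel 1 writes tmp = x * s + b,
// kernel 2 reads tmp back and writes y.
std::size_t unfusedTraffic(std::size_t n) {
    std::size_t pass = n * sizeof(float);
    return pass /*read x*/ + pass /*write tmp*/ + pass /*read tmp*/ + pass /*write y*/;
}

// Fused (one hand-written shader): each element is read once and
// written once; no temporary tensor ever hits memory.
std::size_t fusedTraffic(std::size_t n) {
    return n * sizeof(float) /*read x*/ + n * sizeof(float) /*write y*/;
}
```

For a bandwidth-bound workload, halving the bytes moved is roughly halving the runtime of that step, which is the whole advantage being described.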
This. It's crazy how primitive the GPU development process still is in the year 2023. Yeah it's gotten better, but there's still a massive gap with traditional development.
It's kinda like building with Legos vs building actual skyscrapers. The gap between compute shaders and CUDA is massive. At least it feels massive, because CUDA has some key features that compute shaders lack, which make it so much easier to build complex, powerful and fast applications.
One of the features that would get compute shaders far ahead of where they are now is pointers and pointer casting: just let me have a byte buffer and easily cast the bytes to whatever I want. Another would be function pointers. These two are pretty much the main reasons I had to stop doing a project in OpenGL/Vulkan and start using CUDA. There are many more, however, that make life easier, like cooperative groups with device-wide sync, being able to allocate a single buffer with all the GPU memory, recursion, etc.
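For what it's worth, the byte-buffer-plus-casts pattern being asked for looks like this in plain host-side C++. Think of the arena as one big device allocation (e.g. a single large cudaMalloc); the offsets and field names are made up for illustration, and strict-aliasing pedantry is ignored for brevity. CUDA device code permits exactly these casts, while compute shaders force separate, pre-declared typed buffers.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One big byte allocation carved into typed views at offsets chosen at
// runtime, via ordinary pointer casts.
struct Views {
    float* weights;
    std::uint32_t* indices;
};

Views carve(std::uint8_t* arena) {
    Views v;
    v.weights = reinterpret_cast<float*>(arena);                // offset 0
    v.indices = reinterpret_cast<std::uint32_t*>(arena + 4096); // offset 4096
    return v;
}

// Demo: write through both views, then read the float's raw bits back
// out of the same bytes without copying the buffer anywhere.
std::uint32_t demo() {
    std::vector<std::uint8_t> arena(1 << 20); // "one big allocation"
    Views v = carve(arena.data());
    v.weights[0] = 1.5f;
    v.indices[0] = 42u;
    std::uint32_t bits;
    std::memcpy(&bits, v.weights, sizeof bits); // 1.5f is 0x3FC00000
    return bits;
}
```

In a shader language you'd instead declare each view as its own binding up front, which is exactly the inflexibility the comment is complaining about.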
Khronos should start supporting C++20 for shaders (basically what CUDA is) and stop the GLSL or SPIR-V nonsense.
You might argue for forking off from GLSL and SPIR-V for complex compute workloads, but lightweight, fast compilers for a simple language like GLSL do solve real issues for graphics. Some graphics use cases can't get around shipping a shader compiler to the user: the number of possible shader configurations is often either insanely large or just impossible to enumerate, so on-the-fly compilation is really the only thing you can do.
Ironically, most people use HLSL with Vulkan, because Khronos doesn't have a budget nor the people to improve GLSL.
So yet another thing where Khronos APIs are dependent on DirectX evolution.
It used to be that AMD and NVIDIA would first implement new stuff on DirectX in collaboration with Microsoft, have them as extensions in OpenGL, and eventually as standard features.
For GPGPU tasks, they lack a lot of useful features that CUDA has like the ability to allocate memory and launch kernels from the GPU. They also generally require you to write your GPU and CPU portions of an algorithm in different languages, while CUDA allows you to intermix your code and share data structures and simple functions between the two.
There are existing efforts to compile SYCL to Vulkan compute shaders. Plenty of "weird quirks" involved since they're based on different underlying varieties of SPIR-V ("kernels" vs. "shaders") and seem to have evolved independently in other ways (Vulkan does not have the amount of support for numerical computation that OpenCL/SYCL has) - but nothing too terrible or anything that couldn't be addressed by future Vulkan extensions.
Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.
Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.
Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.
Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?
It's been awesome seeing folks like Keras 3.0 kicking out broad intercompatibility across JAX, TF, and PyTorch, powered by flexible execution engines. Looking forward to seeing more Vulkan-based runs getting socialized, benchmarked & compared. https://news.ycombinator.com/item?id=38446353
The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.
How feasible would it be to target Vulkan 1.3 or such from standard SYCL (as first seen in Sylkan, for earlier Vulkan Compute)? Is it still lacking the numerical properties for some math functions that OpenCL and SYCL seem to expect?
That's a really good question. I don't know enough about SYCL to be able to tell you the answer, but I've heard rumblings that it may be the thing to watch. I think there may be some other limitations, for example SYCL 2020 depends on unified shared memory, and that is definitely not something you can depend on in compute shader land (in some cases you can get some of it, for example with resizable BAR, but it depends).
In researching this answer, I came across a really interesting thread[1] on diagnosing performance problems with USM in SYCL (running on AMD HIP in this case). It's a good tour of why this is hard, and why for the vast majority of users it's far better to just use CUDA and not have to deal with any of this bullshit - things pretty much just work.
When targeting compute shaders, you pretty much have to manage buffers manually, and also do copying between host and device memory explicitly (when needed - on hardware such as Apple Silicon, you prefer to not copy). I personally don't have a problem with this, as I like things being explicit, but it is definitely one of the ergonomic advantages of modern CUDA, and one of the reasons why fully automated conversion to other runtimes is not going to work well.
Unified shared memory is an Intel-specific extension of OpenCL.
SYCL builds on top of OpenCL, so you need to know the history of OpenCL. OpenCL 2.0 introduced shared virtual memory, which is basically the most insane way of doing it: even with coarse-grained shared virtual memory, memory pages can transparently migrate from host to device on access. This is difficult to implement in hardware. The only good implementations were on iGPUs, simply because the memory is already shared. No vendor, not even AMD, could implement this demanding feature. You would need full cache coherence from the processor to the GPU, something that is only possible with something like CXL, and that isn't ready even to this day.
So OpenCL 2.x was basically dead. It has unimplementable mandatory features so nobody wrote software for OpenCL 2.x.
Khronos then decided to make OpenCL 3.0, which gets rid of all these difficult to implement features so vendors can finally move on.
So Intel, while building their Arc GPUs, decided to create a variant of shared virtual memory that is actually implementable, called unified shared memory.
The idea is the following: all USM buffers are accessible by both CPU and GPU, but the location is defined by the developer. Host memory stays on the host, and the GPU must access it over PCIe. Device memory stays on the GPU, and the host must access it over PCIe. These two types already cover the vast majority of use cases and can be implemented by anyone. Then finally, there is "shared" memory, which can migrate between CPU and GPU in a coarse-grained manner. This isn't page level; the entire buffer gets moved, as far as I am aware. This allows you to do CPU work, then GPU work, then CPU work. What doesn't exist is a fully cache-coherent form of shared memory.
https://enccs.github.io/sycl-workshop/unified-shared-memory/ seems to suggest that USM is still a hardware-specific feature in SYCL 2020, so compatibility with hardware that requires a buffer copying approach is still maintained. Is this incorrect?
"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "
That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.
What's needed are regular, simple, C-like pointers.
Compute shaders are not capable of using modern GPU features like tensor cores, or many of the other features needed to feed tensor cores data fast enough (e.g. TMA/cp.async.shared).