Working Around Driver Bugs: Fixing SPIR-V Assembly With Regex
Recently, I worked around a Vulkan driver bug on my Quest 3 by writing code that throws regular expressions at SPIR-V assembly until it's in the right form to make the driver happy. Some of you might not have enjoyed reading that sentence. Those who just dislike the unfamiliar tech jargon are the lucky ones. The rest are already feeling my pain, and this first paragraph isn't even over yet. It's (horror) story time - but you might learn a little about SPIR-V along the way if you stick around. I sure did.
This post isn't about building content generally, but it is about a problem which I ultimately fixed (well, worked around) with code inside the build pipeline. Things like this do tend to accumulate in and around asset compilers, so my deep dive into this one is included in the series on building content.
The problem
This started off pretty simple. After getting a (very) basic Android build set up and tested using my phone, I decided it was time to push my little VR side project to my HMD (head-mounted display) and start debugging it in its intended environment. The app would start, but then it would promptly crash before rendering a single frame.
The investigation
The crash turned out to be caused by a failure in vkCreateGraphicsPipelines. The result code was less than helpful: VK_ERROR_UNKNOWN. I had one additional hint - right before returning ERROR_UNKNOWN, the driver printed a warning to the Android log: "Failed to link shaders." And, of course, my logs showed me which pipeline was failing to compile. With that to go on, I opened the offending shaders and started changing/deleting bits to see if I could narrow down what triggered the fault. Focusing my efforts on the shaders' interfaces (because that's what a link error suggests), I quickly tracked down the culprit:
out float[1] ClipDistance : SV_ClipDistance;
It didn't matter what I wrote to that output or where and how I declared it: any use of SV_ClipDistance and the pipeline wouldn't link on my HMD. Now, the Quest 3 does support the shaderClipDistance feature, and my request for one clip plane is well within the bounds of the GPU's maxClipDistances limit (which is 8). So why didn't its pipeline compiler like SV_ClipDistance?
A quick aside: this sort of "just try things to get clues about the issue" approach to troubleshooting is, often enough, all one has to fall back on - particularly when interfacing with closed systems. And this approach can even turn into a frustrating hunt for specific needles in a (sometimes moving) haystack. Understanding what you're up against helps immeasurably. In this case, having a general idea of what must be going on in the driver told me immediately which end of the haystack to begin on. In others, it's allowed me to binary-search my haystack - and \(O(\log_2(\text{haystack}))\) is waaaay faster than plain old \(O(\text{haystack})\).
Checking my API usage
As it turned out, I had a bug in my application code outside the shader. Specifically, a review of all things ClipDistance in the Vulkan documentation showed that I had checked for but forgotten to enable the shaderClipDistance device feature. No other driver that I test on had complained about this, but maybe this one was just more strict? Well, it turned out that no, this wasn't what the driver was upset about.
With a careful review of the C++ code complete, I had to go back to the shader and pick at it a lot more thoroughly.
Looking at the SPIR-V
The Slang source was fine, and even very trivial substitute shaders failed to work, so I started examining the actual SPIR-V shader bytecode. It's easy enough to look at shader disassembly in RenderDoc on PC, and my PC build runs exactly the same shader code, so that's where I went next. I am using Slang, which is a newer shading language, and I thought that maybe I'd hit a subtle bug in its compiler.
Trimming away everything except the references to ClipDistance (and the things that those references refer to), this is what Slang emits for an SV_ClipDistance binding:
; SPIR-V
; Version: 1.3
; Generator: NVIDIA Slang Compiler; 0
; Bound: 862
; Schema: 0
OpCapability MultiView
OpCapability Shader
%78 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint Vertex %vert "vert" %gl_Position %entryPointParam_vert_ViewPos %entryPointParam_vert_Normal %in_Position %in_Normal %gl_ClipDistance %18
OpEntryPoint Fragment %frag "frag" %entryPointParam_frag
OpExecutionMode %frag OriginUpperLeft
OpSource Slang 1
; snip - lots of OpName
; snip - lots of Op[Member]Decorate
OpDecorate %gl_ClipDistance BuiltIn ClipDistance
; snip
%float = OpTypeFloat 32
%int = OpTypeInt 32 1
%int_1 = OpConstant %int 1
%_arr_float_int_1 = OpTypeArray %float %int_1
; snip
%int_0 = OpConstant %int 0
; snip
%_ptr_Output__arr_float_int_1 = OpTypePointer Output %_arr_float_int_1
; snip
%gl_ClipDistance = OpVariable %_ptr_Output__arr_float_int_1 Output
; snip
%vert = OpFunction %void None %3
; snip
%851 = OpCompositeConstruct %_arr_float_int_1 %801
OpStore %gl_ClipDistance %851
OpReturn
OpFunctionEnd
%frag = OpFunction %void None %3
; snip
OpReturn
OpFunctionEnd
So, what does that all mean?
- First, there's a preamble that declares what compiler made the code, what version of SPIR-V it produced, and what optional features the code requires (those are the OpCapability lines).
- Then there's a list of entry points (shaders), each of which lists its input and output bindings, which strongly implies that only the vertex shader ("vert") touches anything that looks like a ClipDistance.
- Then there's a bit more metadata.
- Then there's a bunch of OpName debug info, which I snipped.
The first interesting line is this one:
OpDecorate %gl_ClipDistance BuiltIn ClipDistance
That is an instruction to the driver to link the %gl_ClipDistance symbol to wherever it is the GPU wants shaders to put ClipDistance values. And it's the only reference to BuiltIn ClipDistance, so the %gl_ClipDistance is the main thing to focus on. So, what is a %gl_ClipDistance and where is it used?
%gl_ClipDistance = OpVariable %_ptr_Output__arr_float_int_1 Output
; snip
%851 = OpCompositeConstruct %_arr_float_int_1 %801
OpStore %gl_ClipDistance %851
Okay, as far as SPIR-V is concerned, it's just some variable. Probably the only reason it renders as %gl_ClipDistance is that the disassembler saw the OpDecorate on it and chose a friendlier name than, say, %851.
And finally, there were no traces of %gl_ClipDistance anywhere in the fragment shader at all, which is what I expected.
Examining %gl_ClipDistance's type
I'll divert briefly to discuss %gl_ClipDistance's type (%_ptr_Output__arr_float_int_1) and how it's declared in SPIR-V, because it's going to turn out to be relevant. Starting with %_ptr_Output__arr_float_int_1 itself and then tracing backwards through what it references:
%_ptr_Output__arr_float_int_1 = OpTypePointer Output %_arr_float_int_1
%_arr_float_int_1 = OpTypeArray %float %int_1
%float = OpTypeFloat 32
%int_1 = OpConstant %int 1
%int = OpTypeInt 32 1
Reading that, you can see that:
- %_ptr_Output__arr_float_int_1 is an output pointer to a %_arr_float_int_1.
- %_arr_float_int_1 is, in turn, an array of %int_1 %float values.
- %int_1 is a constant of type %int whose value is 1.
- %int is a 32-bit signed (that's what the 1 at the end of the OpTypeInt means) integer.
- %float is a 32-bit float.
It's kind of interesting how SPIR-V builds up even very primitive types like int32 as a series of declarations rather than having them built in. I suppose this is to leave room to expose non-standard-precision types on GPUs that support them without having to pile on lots and lots of extensions.
It's also interesting to note that array types do not automatically "decay" to pointers as they do in C and C++. When SPIR-V says something is an array of N values, that refers to the N values, not the starting address of some contiguous range in memory.
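The type chain traced above can also be followed mechanically. What follows is only a minimal sketch, not part of my build tools: the TypeChain class and its Describe and Resolve methods are hypothetical names, and it assumes the disassembly is available as one instruction per line in the form shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: resolves a SPIR-V type id into a readable description
// by following the declarations found in the disassembly text.
public static class TypeChain
{
    // Matches "%id = OpSomething arg arg ..." declarations.
    private static readonly Regex Declaration =
        new(@"^\s*(?<id>%\w+)\s*=\s*(?<op>Op\w+)\s*(?<args>.*?)\s*$", RegexOptions.CultureInvariant);

    public static string Describe(string id, IEnumerable<string> disassembly)
    {
        // Index every declaration by the id it defines.
        var decls = new Dictionary<string, (string Op, string[] Args)>();
        foreach (var line in disassembly)
        {
            var m = Declaration.Match(line);
            if (m.Success)
            {
                decls[m.Groups["id"].Value] = (
                    m.Groups["op"].Value,
                    m.Groups["args"].Value.Split(' ', StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return Resolve(id, decls);
    }

    private static string Resolve(string id, Dictionary<string, (string Op, string[] Args)> decls)
    {
        // Unknown ids are returned unchanged.
        if (!decls.TryGetValue(id, out var d))
            return id;

        return d.Op switch
        {
            // OpTypePointer <storage class> <pointee type>
            "OpTypePointer" => $"{d.Args[0]} pointer to {Resolve(d.Args[1], decls)}",
            // OpTypeArray <element type> <length constant>
            "OpTypeArray" => $"array of {Resolve(d.Args[1], decls)} x {Resolve(d.Args[0], decls)}",
            // OpTypeFloat <width>
            "OpTypeFloat" => $"float{d.Args[0]}",
            // OpTypeInt <width> <signedness>
            "OpTypeInt" => (d.Args[1] == "1" ? "int" : "uint") + d.Args[0],
            // OpConstant <type> <value>: only the value matters here
            "OpConstant" => d.Args[1],
            _ => id,
        };
    }
}
```

Fed the five declarations above, TypeChain.Describe("%_ptr_Output__arr_float_int_1", lines) yields "Output pointer to array of 1 x float32", which matches the reading given in the list.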
So, what's the problem?
Well, looking up BuiltIn ClipDistance I see mention of a ClipDistance capability, and I don't see an OpCapability for that. None of my other devices care that the compiler missed it (just like they didn't care that I'd forgotten to enable shaderClipDistance), but maybe that's the issue.
I hoped so, because everything else seemed pretty reasonable. I'm not saying it's correct, because I haven't spent enough time with the spec to really know what is and isn't legal for a BuiltIn ClipDistance binding. But the code didn't look wrong to me, either.
Surely, working around this little problem will be easy...
Now, the question is how to work around this issue. I don't want to commit the time necessary to forking the Slang compiler and making and maintaining my own patches. That would be excessive unless I'm absolutely forced into doing so.
Fortunately, the same SPIR-V disassembler that RenderDoc is using is available as a library. And I already had it wrapped and available as a nice utility API in my tools pipeline (a leftover from earlier experimentation and debugging). So it should be a simple matter of taking Slang's output, disassembling it, editing the text, and then reassembling the resulting shader module. A little annoying, but easy enough to do, and something I could plug into the build system so it just runs automatically.
To get there, I wrote two regular expressions. One to find OpDecorate ... BuiltIn ... instructions and another to find OpCapability.
[GeneratedRegex(@"^\s*OpCapability\s+(?<name>\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpCapabilityMatcher();
[GeneratedRegex(@"^\s*OpDecorate\s+(?<name>%\w+)\s+BuiltIn\s+(?<decorator>\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpDecorateBuiltInMatcher();
The logic was simple. If OpDecorateBuiltInMatcher found a hit for ClipDistance but OpCapabilityMatcher didn't, then inject an OpCapability ClipDistance into the text and reassemble the shader. Took almost no time to code it up...
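For the curious, that check-and-inject step can be sketched roughly as follows. This is an illustration rather than my actual build code: CapabilityPatch and EnsureClipDistanceCapability are hypothetical names, the patterns are written as plain Regex instances so the sketch compiles without the source generator, and it assumes the new line belongs directly after the last existing OpCapability (capabilities sit at the very top of a module).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of the "inject the missing capability" experiment.
public static class CapabilityPatch
{
    private static readonly Regex OpCapability =
        new(@"^\s*OpCapability\s+(?<name>\w+)\s*$", RegexOptions.CultureInvariant);

    private static readonly Regex OpDecorateBuiltIn =
        new(@"^\s*OpDecorate\s+(?<name>%\w+)\s+BuiltIn\s+(?<decorator>\w+)\s*$", RegexOptions.CultureInvariant);

    // Takes disassembly as one instruction per line and returns a patched copy.
    public static List<string> EnsureClipDistanceCapability(IEnumerable<string> disassembly)
    {
        var lines = new List<string>(disassembly);
        var usesClipDistance = false;
        var hasCapability = false;
        var lastCapabilityIndex = -1;

        for (var i = 0; i < lines.Count; i++)
        {
            var cap = OpCapability.Match(lines[i]);
            if (cap.Success)
            {
                // Remember where the capability list ends and what it contains.
                lastCapabilityIndex = i;
                hasCapability |= cap.Groups["name"].Value == "ClipDistance";
                continue;
            }

            var dec = OpDecorateBuiltIn.Match(lines[i]);
            usesClipDistance |= dec.Success && dec.Groups["decorator"].Value == "ClipDistance";
        }

        // Only touch modules that use the built-in without declaring the capability.
        if (usesClipDistance && !hasCapability && lastCapabilityIndex >= 0)
            lines.Insert(lastCapabilityIndex + 1, "OpCapability ClipDistance");

        return lines;
    }
}
```

Running it a second time on its own output is a no-op, since the capability is then already present.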
But it didn't help. The missing OpCapability ClipDistance was not, in fact, the problem. The Quest 3's driver happily ignores its absence just like every other driver seems to. There is something else that it objects to, so the search went on.
Trying different things
The first thing to do at this point was confirm that ClipDistance actually works on the Quest 3. It's such a basic feature and important for some very powerful and common rendering techniques, so I couldn't imagine that being broken, but I had to check regardless. To do that, I dusted off my old GLSL-based pre-Slang graphics pipeline compilation code.
Step one: Go back to my RenderDoc capture and ask it to decompile my shader module into (very ugly) GLSL.
Step two: Paste that into a file and put it in my build in place of (well, next to) the Slang source. Point the pipeline definition at the GLSL source.
Step three: Fix the bitrot in the GLSL-based content builder. The main problem here was that Slang supports user-defined attributes in its source, its reflection API makes them visible to users of the Slang libraries, and I had built my material-pipeline binding logic on top of these attributes. GLSL and SPIRV-Reflect, obviously, have no equivalent feature. The short version here is that I patched up the GLSL reflection code to synthesize the information the new Slang reflection code produces directly, mostly based on some silly ad-hoc naming conventions.
And with that out of the way and my material compilers happy, I compiled the GLSL source I'd copied out of RenderDoc's decompiler, and it worked.
At this point I entertained a few options:
- Switch this one material over to GLSL.
- Get Slang to transpile to GLSL (that is a thing that it can do) and then compile that.
- Put in a compatibility mode that decompiles the shader all the way to GLSL before recompiling it.
I didn't like any of these options, so I decided to dig a bit deeper. But it was good to know that I'd have approaches to fall back on if the further investigation proved fruitless.
Figuring out what was wrong
So now that I knew that ClipDistance does work on the device and that the problem was specifically with the Slang version of the bytecode, I had to work out exactly what the relevant difference was between the two compilers' output. I've already gone over Slang's output, so now it was time to do the same with glslang's:
OpCapability Shader
OpCapability ClipDistance
OpCapability MultiView
; snip
OpEntryPoint Vertex %2 "vert" %3 %4 %5 %6 %7 %8
; snip
OpDecorate %_struct_15 Block
OpMemberDecorate %_struct_15 0 BuiltIn Position
OpMemberDecorate %_struct_15 1 BuiltIn PointSize
OpMemberDecorate %_struct_15 2 BuiltIn ClipDistance
OpMemberDecorate %_struct_15 3 BuiltIn CullDistance
; snip
%int_2 = OpConstant %int 2
; snip
%uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
%_struct_15 = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1
%_ptr_Output__struct_15 = OpTypePointer Output %_struct_15
%5 = OpVariable %_ptr_Output__struct_15 Output
; snip
%_ptr_Output__arr_float_uint_1 = OpTypePointer Output %_arr_float_uint_1
; snip
%136 = OpCompositeConstruct %_arr_float_uint_1 %135
%137 = OpAccessChain %_ptr_Output__arr_float_uint_1 %5 %int_2
OpStore %137 %136
; snip
Well this is interesting.
A few things jumped out at me immediately as I scanned the assembly:
- The compiler did include an OpCapability ClipDistance.
- The compiler put in an unused reference to BuiltIn CullDistance, but no OpCapability CullDistance.
- The BuiltIn ClipDistance appears in an OpMemberDecorate, not an OpDecorate...
- ... because the ClipDistance isn't in a plain variable, it's in a struct, %_struct_15.
- %_struct_15 contains some other stuff as well, like the output vertex position and another unused field for BuiltIn PointSize.
- The final ClipDistance value is still written through an OpStore which writes the value through a pointer in the same way Slang's code does, but that pointer is just formed a little differently.
Briefly, for those who don't follow where I got all that from, here's how OpTypeStruct works:
%_struct_15 = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1
%5 = OpVariable %_ptr_Output__struct_15 Output
The first line means that %_struct_15 is a struct type which has four fields which are, respectively, of the following types: float4, float, float[1], and float[1]. The second line declares %5 (which is in the vert shader's list of inputs and outputs) as a pointer to a %_struct_15.
This pairs with the OpMemberDecorates:
OpMemberDecorate %_struct_15 2 BuiltIn ClipDistance
That reads "%_struct_15's third field (two because they're zero-indexed) binds to BuiltIn ClipDistance".
And finally there's OpAccessChain:
%137 = OpAccessChain %_ptr_Output__arr_float_uint_1 %5 %int_2
That means "compute a pointer of type %_ptr_Output__arr_float_uint_1 by taking the pointer in %5 and adding the offset to the third (again, zero-based) field of %5's pointed-to struct".
That's a bunch of extra ceremony and indirection which the driver (I certainly hope) optimizes away at pipeline creation time. But the question is why it works and the much simpler Slang output above doesn't. Well - spoiler alert - I have no idea. My best guess is that whoever built the Quest's Vulkan driver is only testing against glslang's output, and so its shader compiler looks specifically for this pattern and can't handle any other. (That's really not atypical as far as GPU drivers go, alas.)
Narrowing it down
So that's quite the list of differences. It'd be good if I could find which ones, exactly, are the relevant ones. To do that, I needed a way to experiment. Fortunately, I already had all of the pieces that I needed:
- I was already set up to assemble SPIR-V from text as part of my quick hack to patch in the OpCapability ClipDistance.
- I had just fixed the reflection logic for GLSL shaders. And since that runs off of the GLSL compiler's SPIR-V output rather than the GLSL source, it could just as easily accept SPIR-V from other sources.
Putting those together, I added another shader format to the build tool's pipeline compiler:
ShaderModuleSpirvBytecode BuildSpirvAsmModule()
{
// pseudo(ish)code
// use the assembler routine from earlier
var bytecode = Assemble(spirvAsmSource);
var ret = new ShaderModuleSpirvBytecode(bytecode);
// run SPIR-V module reflection, shared with GLSL
var refls = ShaderInterface.ReflectModule(bytecode);
// snipped: repacking the data in refls into the format downstream code expects
return ret;
}
var module = Args.InputFile.Extension switch
{
".slang" => BuildSlangModule(),
".spvasm" => BuildSpirvAsmModule(), // new!
".vkmod" => BuildGlslModule(),
_ => throw new ArgumentException("Unrecognized file extension. Unable to infer the shader language."),
};
With that in place, I took the disassembly from the broken Slang shaders, pasted it into a .spvasm file (as before, next to the Slang source), and pointed the pipeline definition at it. I then started making little changes to it by hand, moving it in the direction of the GLSL compiler's output, until I got something that worked.
What worked
If you're curious what parts of the diffs the Quest's VK driver likes so much, here's the minimal(ish) set of changes that got the pipeline to work on my HMD. (This isn't in source order, it's been rearranged for exposition.)
First, I wrapped the ClipDistance variable in a struct:
%_ClipDistance_wrapper_struct = OpTypeStruct %_arr_float_int_1
OpDecorate %_ClipDistance_wrapper_struct Block
OpMemberDecorate %_ClipDistance_wrapper_struct 0 BuiltIn ClipDistance
%_ptr_ClipDistance_wrapper_struct = OpTypePointer Output %_ClipDistance_wrapper_struct
I didn't need anything else in the struct besides the float[1] array for the output. Whatever the GLSL compiler was doing with PointSize and CullDistance fields wasn't relevant here.
Then I deleted the old %gl_ClipDistance (and its BuiltIn ClipDistance decoration) and replaced it with this:
%gl_ClipDistance = OpVariable %_ptr_ClipDistance_wrapper_struct Output
The last thing that needed updating is how that variable is written (and I've already covered how to read this, only the field index has changed):
%_p_clipDist = OpAccessChain %_ptr_Output__arr_float_int_1 %gl_ClipDistance %int_0
OpStore %_p_clipDist %851
And, finally, I threw in an OpCapability ClipDistance, just to keep things tidy.
Automating the solution
So, that's nice. I can disassemble Slang's output and patch it by hand and... yeah there's no way I'm making that my workflow. Absolutely not. Emphatically: hell no.
The actual list of patches that the shader required is pretty short, so I decided to see how far I could push the simple regex-based approach from the earlier OpCapability experiment. But I'd need more regexes for this:
[GeneratedRegex(@"^\s*OpCapability\s+(?<name>\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpCapabilityMatcher();
// OpDecorateBuiltInMatcher got expanded to also find OpMemberDecorate
[GeneratedRegex(@"^\s*Op(?<member>Member)?Decorate\s+(?<name>%\w+)\s+(?:(?<membernum>\d+)\s+)?BuiltIn\s+(?<decorator>\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpMaybeMemberDecorateBuiltInMatcher();
[GeneratedRegex(@"^\s*(?<name>%\w+)\s+=\s+OpVariable\s+(?<type>%\w+)(?:\s+(?<decorator>\w+))\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpVariableMatcher();
[GeneratedRegex(@"^\s*(?<name>%\w+)\s+=\s+OpTypePointer\s+(?<modifier>\w+)\s+(?<base>%\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpTypePointerMatcher();
[GeneratedRegex(@"^\s*OpStore\s+(?<dst>%\w+)\s+(?<src>%\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpStoreMatcher();
[GeneratedRegex(@"^\s*(?<name>%\w+)\s+=\s+OpTypeInt\s+(?<width>\d+)\s+(?<signed>[01])\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpTypeIntMatcher();
[GeneratedRegex(@"^\s*(?<name>%\w+)\s+=\s+OpConstant\s+(?<type>%\w+)\s+(?<value>\w+)\s*$", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
private static partial Regex OpConstantMatcher();
And also some helper functions like the following to help drive them:
private static MatcherFunc IsOpMaybeMemberDecorateBuiltIn(bool? isMember = null, string? name = null, int? memberNum = null, string? decorator = null)
=> (string line, out Match match) =>
{
match = OpMaybeMemberDecorateBuiltInMatcher().Match(line);
return match.Success &&
(isMember == null || match.Groups["member"].Success == isMember) &&
(name == null || match.Groups["name"].ValueSpan.SequenceEqual(name)) &&
(memberNum == null || int.Parse(match.Groups["membernum"].ValueSpan, CultureInfo.InvariantCulture) == memberNum) &&
(decorator == null || match.Groups["decorator"].ValueSpan.SequenceEqual(decorator));
};
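The helper above returns a MatcherFunc, which I haven't shown. A plausible shape for it, inferred from the lambda's signature, is a delegate with an out parameter. The sketch below is an assumption-laden illustration rather than my real code: the delegate definition, the LineSearch class, and the FindFirst and IsOpStore helpers are all hypothetical names.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical definition: a predicate over one line of disassembly that also
// hands back the regex match so callers can read the captured groups.
public delegate bool MatcherFunc(string line, out Match match);

public static class LineSearch
{
    private static readonly Regex OpStore =
        new(@"^\s*OpStore\s+(?<dst>%\w+)\s+(?<src>%\w+)\s*$", RegexOptions.CultureInvariant);

    // Builds a matcher for OpStore, optionally restricted to one destination id.
    public static MatcherFunc IsOpStore(string? dst = null)
        => (string line, out Match match) =>
        {
            match = OpStore.Match(line);
            return match.Success && (dst == null || match.Groups["dst"].Value == dst);
        };

    // Returns the index of the first line the matcher accepts, or -1 if none does.
    public static int FindFirst(this IReadOnlyList<string> lines, MatcherFunc matcher, out Match match)
    {
        for (var i = 0; i < lines.Count; i++)
        {
            if (matcher(lines[i], out match))
                return i;
        }

        match = Match.Empty;
        return -1;
    }
}
```

With that in place, lines.FindFirst(LineSearch.IsOpStore("%gl_ClipDistance"), out var store) locates the store instruction, and store.Groups["src"].Value gives the id of the value being written.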
The overall logic is fairly simple:
- Look for an Op[Member]Decorate <something> BuiltIn ClipDistance.
- Look for an OpCapability ClipDistance. If it's missing, inject it.
- If the BuiltIn ClipDistance decoration was found on a struct member, then return. Otherwise, <something> is the %varName going forward.
- Find the %varName = OpVariable <ptrType> Output declaration.
- Find the %ptrType = OpTypePointer Output <baseType> declaration.
- Find the OpStore %varName <src> instruction.
If any of the declarations can't be found, then this isn't the bad pattern from Slang and nothing further is done. Otherwise, the regex captures contain everything I need to replace the "bad" lines with "good" ones:
var i0Name = lines.FindIntConstant(value: 0);
var storeSrc = matOpStore.Groups["src"].Value;
var wrapperTypeName = $"{varName}_patchWrapperStruct";
var wrapperTypePtrName = $"{varName}_patchWrapperStruct_ptr";
lines.Replace(
(clipDistanceDec, [
$"OpDecorate {wrapperTypeName} Block",
$"OpMemberDecorate {wrapperTypeName} 0 BuiltIn ClipDistance",
]),
(varDecl, [
$"{wrapperTypeName} = OpTypeStruct {baseType}",
$"{wrapperTypePtrName} = OpTypePointer Output {wrapperTypeName}",
$"{varName} = OpVariable {wrapperTypePtrName} Output",
]),
(opStore, [
$"{varName}_addr = OpAccessChain {ptrType} {varName} {i0Name}",
$"OpStore {varName}_addr {storeSrc}",
])
);
Is it horrifying? Yes.
But does it work? Yes.
It's integrated into my regular content builds, and a warning prints every time the code triggers. So I don't have to fall back to writing some subset of my shaders in GLSL or (ugh) raw SPIR-V assembly. The problem-matching criteria in my code should be tight enough to prevent it messing up as the Slang compiler evolves, but if something breaks it's usually pretty clear/easy to catch (at least at the scale of a one-man hobby project). It's not great, but it's good enough. (This isn't even the worst hack I've ever had to make to deal with GPU drivers being GPU drivers.)
If I'm lucky, someone will do something about my bug report before I ever have to touch this code again, and it can one day simply be deleted. For now, do I have other things I'd rather work on before I make this any prettier/more robust than it has to be? Yes. Otherwise, if the need arises for even more of these patches, I may have to learn to properly parse SPIR-V without first disassembling it to human-readable text and start manipulating it in that way. (May that day never come. Or, at least not for this reason.)