I strongly encourage anyone who’s interested to check out the code. There’s not much of it. What there is, is simple. The following is a brief overview of what I did.
The Problem With the C Version
The main problem with the hqNx family of scaling algorithms is the ~5000 lines’ worth of switch-cases and ifs at their core (which would be even more with all the macros expanded out and inline functions inlined). That’s obviously not something that’s going to fit nicely in a pixel shader. Even if it did (somehow) fit below the instruction limit, the branching would promptly bring the GPU to its knees (adjacent pixels are pretty unlikely to take the same path).
An Interesting Observation
The thing about the giant switch-case is that it does little more than select from a palette of blending functions used to interpolate the final color values. More interesting is that each blending function has fixed weights and operates solely on the $3\times3$ neighborhood of pixels surrounding the source pixel. It is, essentially, selecting a filtering kernel and applying it.
The first part of the selection compares the source pixel to its eight neighbors and builds a bitmask where a bit is set if the corresponding neighbor is different enough from the center. This gives 256 possible values (which is what the switch-case switches on). The second part of the selection is unique to a few of the main cases, where the pixels above, below, to the left, and to the right of the center are compared with each other. These comparisons always take place between the upper and left, upper and right, lower and left, and lower and right pixels, giving us four more bits of selection info.
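In C, the pattern construction can be sketched roughly like this. This is a minimal illustration: the `diff` predicate here is a stand-in grayscale comparison (the real algorithm compares in YUV space), and the exact bit ordering is my own choice, not necessarily the original’s:

```c
#include <stdlib.h>

/* Stand-in predicate for "different enough". The real algorithm
   compares the two pixels in YUV space against per-channel
   thresholds; here we just compare grayscale values. */
static int diff(int a, int b) { return abs(a - b) > 48; }

/* Build the 8-bit mask over the 3x3 neighborhood n[0..8]
   (row-major, n[4] is the center), plus the 4 secondary
   comparison bits between the up/down and left/right neighbors. */
static int build_pattern(const int n[9])
{
    int mask = 0, bit = 1, i;
    for (i = 0; i < 9; i++) {
        if (i == 4) continue;              /* skip the center */
        if (diff(n[4], n[i])) mask |= bit;
        bit <<= 1;
    }
    /* n[1] = up, n[3] = left, n[5] = right, n[7] = down */
    int sec = (diff(n[1], n[3]) ? 1 : 0)   /* upper vs. left  */
            | (diff(n[1], n[5]) ? 2 : 0)   /* upper vs. right */
            | (diff(n[7], n[3]) ? 4 : 0)   /* lower vs. left  */
            | (diff(n[7], n[5]) ? 8 : 0);  /* lower vs. right */
    return (sec << 8) | mask;              /* 12 bits total */
}
```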
A True Lookup Table
A GPU may not be good with a 5000-line switch-case and hordes of nested ifs, but it will blaze if you give it a simple lookup table, usually in the form of a texture. $8+4$ bits of selection yields 4096 values in one dimension, which is cutting it close to the maximum texture size on many GPUs (and is well beyond it on many others). hqNx uses a 3D texture instead, yielding a $256\times16\times2N^2$ R8G8B8A8 texture (the extra $2$ in the last term will be explained shortly).
The lookup table data is built by running a slightly modified version of the original switch-case which, rather than blending between values in the $3\times3$ square around the pixel, simply writes the weights it would have blended with into a $3\times3$ buffer. The buffer is then normalized and the eight outer weights are packed into a pair of R8G8B8A8 color values (thus the extra $2$ in the table’s last dimension). The weights always sum to one, so the pixel shader trivially computes the missing center weight.
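The packing step might be sketched as follows. The channel layout here is purely illustrative (the real table’s ordering may differ), and the helper names are my own:

```c
#include <stdint.h>

/* Pack the eight outer weights of a normalized 3x3 kernel
   (row-major, w[4] is the center, all nine sum to 1.0) into
   two RGBA8 texels. Layout is illustrative only. */
static void pack_weights(const float w[9], uint8_t texel[2][4])
{
    static const int outer[8] = {0, 1, 2, 3, 5, 6, 7, 8};
    for (int i = 0; i < 8; i++)
        texel[i / 4][i % 4] = (uint8_t)(w[outer[i]] * 255.0f + 0.5f);
}

/* What the pixel shader does at runtime: the center weight is
   whatever remains after summing the eight packed weights. */
static float center_weight(uint8_t texel[2][4])
{
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += texel[i / 4][i % 4] / 255.0f;
    return 1.0f - sum;
}
```

Note that quantizing to 8 bits means the recovered center weight is only approximate, which is fine for blending purposes.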
You can find the table-building code in:
Splitting up the Loop
The original code runs one giant loop over the source image, with each iteration analyzing the $3\times3$ neighborhood of the source pixel and then interpolating and writing $N^2$ pixels to the destination image. GPUs, however, work the other way, with the pixel shader executing independently for each destination pixel. While it is fairly simple to compute the source pixel for any given destination pixel, we soon find we’re wasting lots of time reanalyzing the same $3\times3$ block once for each of the $N\times N$ pixels we’re outputting (note that this may actually be faster on some systems - I haven’t had a chance to test yet).
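The destination-to-source mapping itself is just integer division and remainder; a quick sketch (function name is my own):

```c
/* For a scale factor N, each source pixel (sx, sy) becomes an
   N x N quad of destination pixels. Inverting that mapping
   gives the source pixel and the position inside its quad. */
static void dst_to_src(int dx, int dy, int N,
                       int *sx, int *sy,   /* source pixel        */
                       int *qx, int *qy)   /* position in the quad */
{
    *sx = dx / N;  *qx = dx % N;
    *sy = dy / N;  *qy = dy % N;
}
```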
We get around this by breaking the algorithm into two passes. The first pass creates a _pattern_ image of the source by analyzing each $3\times3$ block and writing two channels containing the 8-bit top-level switch value and the 4-bit secondary switch value. A simple R8G8B8A8 color texture is sufficient for this (R8G8 would be preferable, but it tends to not be useful as a render target). This pixel shader is a bit on the long side, and it does not fit into the ps_2_0 target, so we use ps_2_a. One could easily break it into two smaller passes and blend the results together, at the expense of doubling the reads from the source texture and some blending overhead. The comparison function could also be simplified (the existing one is based on the original algorithm, which works in the YUV color space, and that conversion isn’t free).
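For reference, the YUV-space comparison might look something like the following. The conversion uses the standard BT.601 formulas, and the thresholds (48 for Y, 7 for U, 6 for V) match the ones I’ve seen in common hqNx sources, but treat both as illustrative rather than definitive:

```c
#include <math.h>

/* Compare two RGB pixels in YUV space, roughly as the original
   algorithm does. BT.601 conversion; thresholds are illustrative. */
static int yuv_diff(int r1, int g1, int b1, int r2, int g2, int b2)
{
    float y1 = 0.299f * r1 + 0.587f * g1 + 0.114f * b1;
    float y2 = 0.299f * r2 + 0.587f * g2 + 0.114f * b2;
    float u1 = 0.492f * (b1 - y1), u2 = 0.492f * (b2 - y2);
    float v1 = 0.877f * (r1 - y1), v2 = 0.877f * (r2 - y2);
    return fabsf(y1 - y2) > 48.0f ||
           fabsf(u1 - u2) > 7.0f  ||
           fabsf(v1 - v2) > 6.0f;
}
```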
The second pass computes which source pixel it is blowing up, fetches its pattern values, finds its position in the $N\times N$ quad that pixel enlarges to, and uses those values to look up the nine blending weights it needs from the precomputed lookup table. It then fetches the $3\times3$ source pixel block and blends the pixels together.
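Emulated on the CPU, the final blend might look like this (grayscale values for brevity, since the real shader does the same thing per color channel; the texel layout here is my own illustration):

```c
#include <stdint.h>

/* Second-pass blend: given the two RGBA8 weight texels fetched
   from the lookup table and the 3x3 block of source values
   (row-major, n[4] is the center), compute the output value. */
static float blend(uint8_t texel[2][4], const float n[9])
{
    static const int outer[8] = {0, 1, 2, 3, 5, 6, 7, 8};
    float out = 0.0f, sum = 0.0f;
    for (int i = 0; i < 8; i++) {
        float w = texel[i / 4][i % 4] / 255.0f;
        out += w * n[outer[i]];
        sum += w;
    }
    /* The weights sum to one, so the center weight is derived. */
    return out + (1.0f - sum) * n[4];
}
```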
See hqNx.fx for the implementation.