New swscale design to change everything (tm)
============================================

SwsGraph
--------

The entry point to the new architecture, SwsGraph is what coordinates multiple
"passes". These can include cascaded scaling passes, error diffusion dithering,
and so on. Or we could have separate passes for the vertical and horizontal
scaling. In between each SwsPass lies a fully allocated image buffer. Graph
passes may have different levels of threading, e.g. we can have a single
threaded error diffusion pass following a multi-threaded scaling pass.

SwsGraph is internally recreated whenever the image format, dimensions or
settings change in any way. sws_scale_frame() is itself just a lightweight
wrapper that runs ff_sws_graph_create() whenever the format changes, splits
interlaced images into separate fields, and calls ff_sws_graph_run() on each.
From the point of view of SwsGraph itself, all inputs are progressive.

SwsOp / SwsOpList
-----------------

This is the newly introduced abstraction layer between the high-level format
handling logic and the low-level backing implementation. Each SwsOp is designed
to be as small and atomic as possible, with the possible exception of the
read / write operations due to their numerous variants.

The basic idea is to split logic between three major components:

1. The high-level format "business logic", which generates, in a very naive
   way, a sequence of operations guaranteed to get you from point A to
   point B. This logic is written with correctness in mind only, ignoring any
   performance concerns or low-level implementation decisions. Semantically,
   everything is always decoded from the input format to normalized
   (real valued) RGB, and then encoded back to the output format.

   This code lives in libswscale/format.c

2. The optimizer. This is where the "magic" happens, so to speak. The
   optimizer's job is to take the abstract sequence of operations produced by
   the high-level format analysis code and incrementally optimize it. Each
   optimization step is designed to be minute and provably lossless, or
   otherwise guarded behind the BITEXACT flag. This ensures that the resulting
   output is always identical, no matter how many layers of optimization we
   add.

   This code lives in libswscale/ops.c

3. The compiler. Once we have a sequence of operations as output by the
   optimizer, we "compile" this down to a callable function. This is then
   applied by the dispatch wrapper by striping it over the input image.

   See libswscale/ops_backend.c for the reference backend, or
   libswscale/x86/ops.c for a more complex SIMD example.

This overall approach has a considerable number of benefits:

1. It allows us to verify correctness of the logic and spot semantic errors at
   a very high level, by simply looking at the sequence of operations
   (available by default at debug / verbose log level), without having to dig
   through the multiple levels of complicated, interwoven format handling code
   that is legacy swscale.

2. Because most of the brains live inside the powerful optimizer, we get fast
   paths "for free" for any suitable format conversion, rather than having to
   enumerate them one by one. SIMD code itself can be written in a very
   general way and does not need to be tied to specific pixel formats -
   subsequent low-level implementations can be strung together without much
   overhead.

3. We can in the future, with relative ease, compile these operations down to
   SPIR-V (or even LLVM IR) and generate efficient GPU or target-machine
   specific implementations.
   This also opens the window for adding hardware frame support to libswscale,
   and even transparently using GPU acceleration for CPU frames.

4. Platform-specific SIMD can be reduced down to a comparatively small set of
   optimized routines, while still providing 100% coverage for all possible
   pixel formats and operations. (As of writing, the x86 example backend has
   about 60 unique implementations, of which 20 are trivial swizzles, 10 are
   read/write ops, 10 are pixel type conversions and the remaining 20 are the
   various logic/arithmetic ops.)

5. Backends hide behind a layer of abstraction that offers them a considerable
   degree of flexibility in how they want to implement their operations. For
   example, the x86 backend has a dedicated function for compiling compatible
   operations down to a single in-place pshufb instruction. Platform-specific
   low-level data is self-contained within its own setup() function and
   private data structure, eliminating all reads into SwsContext and the
   possibility of conflicts between platforms.

6. We can compute an exact reference result for each operation with fixed
   precision (ff_sws_op_apply_q), and use that to e.g. measure the amount of
   error introduced by dithering, or even catch bugs in the reference C
   implementation. (In theory - currently checkasm just compares against C.)

Examples of SwsOp in action
---------------------------

For illustration, here is the sequence of operations currently generated by my
prototype for a conversion from RGB24 to YUV444P:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ _ _ 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+alpha [[1/255 0 0 0 0]
                                                   [0 1/255 0 0 0]
                                                   [0 0 1/255 0 0]
                                                   [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[0.299000 0.587000 0.114000 0 0]
                                               [-0.168736 -0.331264 1/2 0 0]
                                               [1/2 -0.418688 -57/701 0 0]
                                               [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+off3 [[219 0 0 0 16]
                                                  [0 224 0 0 128]
                                                  [0 0 224 0 128]
                                                  [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_DITHER  : 16x16 matrix
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {255 255 255 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u8
  [ u8 .... -> ....] SWS_OP_LSHIFT  : << 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_WRITE   : 3 elem(s) planar >> 0

This is optimized into the following sequence:

Optimized operation list:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR  : matrix3+off3 [[0.256788 0.504129 0.097906 0 16]
                                                    [-0.148223 -0.290993 112/255 0 128]
                                                    [112/255 -0.367788 -0.071427 0 128]
                                                    [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER  : 16x16 matrix
  [f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
  [ u8 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) planar >> 0

  (X = unused, + = exact, 0 = zero)

The extra metadata on the left of the operation list is just a dump of the
internal state used by the optimizer during optimization. It keeps track of
knowledge about the pixel values, such as their value range, whether or not
they're exact integers, and so on. In this example, you can see that the input
values are exact (except for the alpha channel, which is undefined), until the
first SWS_OP_LINEAR multiplies them by a non-integer constant. They regain
their exact integer status only after the (truncating) conversion to U8 in the
output step.
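To make the fusion of the three SWS_OP_LINEAR stages above concrete, here is a
standalone C sketch (not libswscale code; the Affine type and compose() helper
are purely illustrative) that composes the 4x5 matrices from the unoptimized
list into a single one. Its first three rows reproduce the matrix3+off3
coefficients of the optimized list (up to rounding); the alpha row comes out
differently only because the real optimizer additionally knows that the alpha
channel is unused and drops its adjustment.

    #include <stdio.h>

    /* A 4x5 affine transform as printed in the dumps above: 4 rows of
     * [4 coefficients + 1 offset] acting on an (R, G, B, A) pixel. */
    typedef struct { double m[4][5]; } Affine;

    /* out(x) = b(a(x)), i.e. "a" is applied first, then "b" */
    static Affine compose(const Affine *b, const Affine *a)
    {
        Affine out = {0};
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    out.m[i][j] += b->m[i][k] * a->m[k][j];
            out.m[i][4] = b->m[i][4];
            for (int k = 0; k < 4; k++)
                out.m[i][4] += b->m[i][k] * a->m[k][4];
        }
        return out;
    }

    int main(void)
    {
        const Affine unscale = {{              /* diag3+alpha: x / 255 */
            {1/255., 0, 0, 0, 0}, {0, 1/255., 0, 0, 0},
            {0, 0, 1/255., 0, 0}, {0, 0, 0, 1, 1},
        }};
        const Affine rgb2yuv = {{              /* matrix3: RGB -> YCbCr */
            { 0.299000,  0.587000,  0.114000, 0, 0},
            {-0.168736, -0.331264,  0.500000, 0, 0},
            { 0.500000, -0.418688, -0.081312, 0, 0},
            { 0,         0,         0,        1, 0},
        }};
        const Affine range = {{                /* diag3+off3: limited range */
            {219, 0, 0, 0,  16}, {0, 224, 0, 0, 128},
            {0, 0, 224, 0, 128}, {0,   0, 0, 1,   0},
        }};

        Affine tmp   = compose(&rgb2yuv, &unscale);
        Affine fused = compose(&range,   &tmp);
        for (int i = 0; i < 4; i++)
            printf("[%9.6f %9.6f %9.6f %9.6f %9.6f]\n", fused.m[i][0],
                   fused.m[i][1], fused.m[i][2], fused.m[i][3], fused.m[i][4]);
        return 0;
    }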
Example of more aggressive optimization
---------------------------------------

Conversion pass for gray -> rgb48:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ 0 0 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : luma+alpha [[1/255 0 0 0 0]
                                                  [0 1 0 0 0]
                                                  [0 0 1 0 0]
                                                  [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[1 0 701/500 0 0]
                                               [1 -0.344136 -0.714136 0 0]
                                               [1 443/250 0 0 0]
                                               [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3 [[65535 0 0 0 0]
                                             [0 65535 0 0 0]
                                             [0 0 65535 0 0]
                                             [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {65535 65535 65535 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u16
  [u16 .... -> ....] SWS_OP_LSHIFT  : << 0
  [u16 .... -> ....] SWS_OP_SWIZZLE : 0123
  [u16 .... -> ....] SWS_OP_WRITE   : 3 elem(s) packed >> 0

Optimized operation list:
  [ u8 XXXX -> +XXX] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .XXX -> +XXX] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 .XXX -> +++X] SWS_OP_SWIZZLE : 0003
  [u16 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) packed >> 0

  (X = unused, + = exact, 0 = zero)

Here, the optimizer has managed to eliminate all of the unnecessary linear
operations on previously zeroed values, turn the resulting column matrix into
a swizzle operation, avoid the unnecessary dither (and round trip via float)
because the pixel values are guaranteed to be bit exact, and finally, turn the
multiplication by 65535 / 255 = 257 into a simple integer expand operation.

As a final bonus, the x86 backend further optimizes this into a 12-byte
shuffle:

  pshufb = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1}

  time=208 us, ref=4212 us, speedup=20.236x faster (single thread)
  time=57 us, ref=472 us, speedup=8.160x faster (multi thread)

Compiler and underlying implementation layer (SwsOpChain)
---------------------------------------------------------

While the backend API is flexible enough to permit more exotic implementations
(e.g. using JIT code generation), we establish a common set of helpers for use
in "traditional" SIMD implementations. The basic idea is to have one "kernel"
(or implementation) per operation, and then just chain a list of these kernels
together as separate function calls. For best performance, we want to keep
data in vector registers in between function calls, using a custom calling
convention, thus avoiding any unnecessary memory accesses. Additionally, we
want the per-kernel overhead to be as low as possible, with each kernel
ideally just jumping directly into the next kernel.

As a result, we arrive at a design where we first divide the image into small
chunks, or "blocks", and then dispatch the "chain" of kernels on each chunk in
sequence. Each kernel processes a fixed number of pixels, with the overall
entry point taking care of looping. Remaining pixels (the "tail") are handled
generically by the backend-invariant dispatch code (located in ops.c), using a
partial memcpy into a suitably sized temporary buffer.
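As a rough illustration of this block/tail dispatch (hypothetical names and
signatures only; the real code in ops.c also deals with multiple planes,
strides and per-operation pixel counts), the idea boils down to something
like:

    #include <stdint.h>
    #include <string.h>

    #define SWS_BLOCK_SIZE 32           /* illustrative value only */

    /* Stand-in for a compiled kernel chain operating on one block of pixels */
    typedef void (*op_chain_fn)(const uint8_t *src, uint8_t *dst);

    static void dispatch_row(op_chain_fn chain, const uint8_t *src,
                             uint8_t *dst, int width, int src_bpp, int dst_bpp)
    {
        int blocks = width / SWS_BLOCK_SIZE;
        int tail   = width % SWS_BLOCK_SIZE;

        /* Main loop: the chain always consumes exactly SWS_BLOCK_SIZE pixels */
        for (int i = 0; i < blocks; i++) {
            chain(src, dst);
            src += SWS_BLOCK_SIZE * src_bpp;
            dst += SWS_BLOCK_SIZE * dst_bpp;
        }

        /* Tail: copy the leftover pixels into a block-sized scratch buffer,
         * run the same chain once more, and copy back only the valid part */
        if (tail) {
            uint8_t tmp_src[SWS_BLOCK_SIZE * 16];  /* 16 = assumed max bpp */
            uint8_t tmp_dst[SWS_BLOCK_SIZE * 16];
            memcpy(tmp_src, src, (size_t) tail * src_bpp);
            chain(tmp_src, tmp_dst);
            memcpy(dst, tmp_dst, (size_t) tail * dst_bpp);
        }
    }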
To minimize the per-kernel function call overhead, we use a "continuation
passing style" for chaining kernels. Each operation computes its result and
then directly calls the next operation in the sequence, with the appropriate
internal function signature. The C reference backend reads data onto the stack
and then passes the array pointers to the next continuation as regular
function arguments:

    void process(GlobalContext *ctx, OpContext *op,
                 block_t x, block_t y, block_t z, block_t w)
    {
        for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
            // do something with x[i], y[i], z[i], w[i]
        }

        op->next(ctx, &op[1], x, y, z, w);
    }

Type conversions push the new data onto the stack as well:

    void convert8to16(GlobalContext *ctx, OpContext *op,
                      block_t x, block_t y, block_t z, block_t w)
    {
        /* Pseudo-code */
        u16block_t x16 = (u16block_t) x;
        u16block_t y16 = (u16block_t) y;
        u16block_t z16 = (u16block_t) z;
        u16block_t w16 = (u16block_t) w;

        op->next(ctx, &op[1], x16, y16, z16, w16);
    }

By contrast, the x86 backend always keeps the X/Y/Z/W values pinned in
specific vector registers (ymm0-ymm3 for the first half of the block, and
ymm4-ymm7 for the second half). Each kernel additionally has access to a
32-byte per-op context storing the pointer to the next kernel plus 16 bytes of
arbitrary private data. This is used during construction of the function chain
to place things like small constants. In assembly, the per-kernel overhead
looks like this:

    load $tmp, $arg1
    ...
    add  $arg1, 32
    jump $tmp

This design gives vastly better performance than the alternative of returning
out to a central loop or "trampoline". This is partly because the order of
kernels within a chain is always the same, so the branch predictor can easily
remember the target address of each "jump" instruction. The only way to
realistically improve on this design would be to directly stitch the kernel
bodies together using runtime code generation.
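For illustration, the 32-byte per-op context and the resulting chain layout
can be pictured in C roughly as follows. This is only a sketch based on the
description above: every field name other than next is made up, and the actual
layout is private to each backend.

    #include <stdint.h>

    typedef struct GlobalContext GlobalContext;  /* opaque, as above */
    typedef struct OpContext     OpContext;

    /* Kernel entry point; the pixel data itself travels in registers (x86)
     * or as extra arguments (C reference), omitted here for brevity. */
    typedef void (*kernel_fn)(GlobalContext *ctx, OpContext *op /* , ... */);

    struct OpContext {
        kernel_fn next;       /* entry point of the next kernel in the chain */
        union {
            uint8_t  u8[16];  /* e.g. a pshufb mask */
            uint32_t u32[4];
            float    f32[4];  /* e.g. small scaling constants */
        } priv;               /* 16 bytes of arbitrary per-op private data */
        uint8_t pad[8];       /* pad to the 32-byte stride assumed by the
                                 "add $arg1, 32" in the asm glue above */
    };

    /* A compiled chain is then simply an array of these contexts; each kernel
     * loads op->next, does its work, advances to &op[1] and jumps onwards. */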
Future considerations and limitations
-------------------------------------

My current prototype has a number of severe limitations and opportunities for
improvement:

1. It does not handle scaling at all. I am not yet entirely sure how I want to
   handle scaling; this includes handling of subsampled content. I have a
   number of vague ideas in my head, but nothing where I can say with
   certainty that it will work out well. It's possible that we won't come up
   with a perfect solution here, and will need to decide on which set of
   compromises we are comfortable accepting. For example: do we need the
   ability to scale YUV -> YUV by handling luma and chroma independently?
   When downscaling 100x100 4:2:0 to 50x50 4:4:4, should we support the
   option of reusing the chroma plane directly (even though this would
   introduce a subpixel shift for typical chroma siting)?

   Looking towards zimg, I am also thinking that we probably want to do
   scaling on floating point values, since this is best for both performance
   and accuracy, especially given that we need to go up to 32-bit
   intermediates during scaling anyway.

   So far, the most promising approach seems to be to handle subsampled
   input/output as a dedicated read/write operation type, perhaps even with a
   fixed/static subsampling kernel. To avoid compromising on performance when
   chroma resampling is not necessary, the optimizer could then relax the
   pipeline to use non-interpolating read/writes when all intermediate
   operations are component-independent.

2. Since each operation is conceptually defined on 4-component pixels, we end
   up defining a lot of variants of each implementation for each possible
   *subset* of components. For example, we have four different
   implementations for SWS_OP_SCALE in my current templates:

   - op_scale_1000
   - op_scale_1001
   - op_scale_1110
   - op_scale_1111

   This reflects the four different arrangements of pixel components that are
   typically present (or absent). While best for performance, it does turn
   into a bit of a chore when implementing these kernels. The only real
   alternatives would be either to branch inside the kernel (bad), or to use
   separate kernels for each individual component and chain them all
   together. I have not yet tested whether the latter approach would be
   faster after the latest round of refactors to the kernel glue code.

3. I do not yet have any support for LUTs. But when I add them, something we
   could do is have the optimizer automatically "promote" a sequence of
   operations to a LUT. For example, any sequence that looks like:

   1. [u8] SWS_OP_CONVERT -> X
   2. [X]  ... // only per-component operations
   3. [X]  SWS_OP_CONVERT -> Y
   4. [Y]  SWS_OP_WRITE

   could be replaced by a LUT with 256 entries (see the sketch at the end of
   this section). This is especially important for anything involving packed
   8-bit input (e.g. rgb8, rgb4_byte). We also definitely want to hook this up
   to the existing CMS code for transformations between different primaries.

4. Because we rely on AVRational math to generate the coefficients for
   operations, we need to be able to represent all pixel values as an
   AVRational. However, this presents a challenge for 32-bit formats (e.g.
   GRAY32, RGBA128), because their maximum pixel value exceeds INT_MAX, the
   largest value representable by an AVRational.

   It's possible we may want to introduce an AVRational64 for this, or,
   perhaps more flexibly, extend AVRational to an AVFloating type represented
   as { AVRational n; int exp; }, denoting the rational n scaled by 2^exp.
   This would preserve our ability to represent all pixel values exactly,
   while opening up the range arbitrarily.

5. Is there ever a situation where the use of floats introduces the risk of
   non-bit-exact output? For this reason, and for possible performance
   advantages, we may want to explore the use of a 16-bit fixed-point path as
   an alternative to the floating point math.

   So far, I have managed to avoid any bit-exactness issues inside the x86
   backend by ensuring that the order of linear operations is identical
   between the C backend and the x86 backend, but this may not be practical to
   guarantee on all backends. The x86 float code is also dramatically faster
   than the old fixed-point code, so I'm tentatively optimistic that a
   fixed-point path will not be needed.
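As a sketch of the LUT promotion idea from point 3 (hypothetical code, not a
planned API; eval_subchain() merely stands in for evaluating the intermediate
per-component operations, which could in principle be done via the exact
reference path mentioned earlier):

    #include <stdint.h>

    typedef uint16_t out_t;      /* example output type for the u8 -> u16 case */

    /* Placeholder for running CONVERT -> per-component ops -> CONVERT on a
     * single input value; here hard-coded to the gray -> rgb48 expand case. */
    static out_t eval_subchain(uint8_t value)
    {
        return (out_t) (value * 257);
    }

    /* Tabulate the whole sub-chain once per possible 8-bit input value. */
    static void build_lut(out_t lut[256])
    {
        for (int i = 0; i < 256; i++)
            lut[i] = eval_subchain((uint8_t) i);
    }

    /* The promoted operation then degenerates to a single table lookup. For
     * packed 8-bit formats (rgb8, rgb4_byte), the key would be the whole
     * packed byte and each entry would hold all components at once. */
    static void apply_lut(const out_t lut[256], const uint8_t *src,
                          out_t *dst, int num)
    {
        for (int i = 0; i < num; i++)
            dst[i] = lut[src[i]];
    }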