Otherwise, we always pass frames that already have buffers allocated, which
breaks the no-op refcopy optimizations.
Testing with -p 0.1 -threads 16 -bench 10, on an AMD Ryzen 9 9950X3D:
Before:
Overall speedup=2.776x faster, min=0.133x max=629.496x
yuv444p 1920x1080 -> yuv444p 1920x1080, flags=0x100000 dither=1
time=9 us, ref=9 us, speedup=1.043x faster
After:
Overall speedup=2.721x faster, min=0.140x max=574.034x
yuv444p 1920x1080 -> yuv444p 1920x1080, flags=0x100000 dither=1
time=0 us, ref=28 us, speedup=516.504x faster
(The slowdown in the legacy swscale case is from swscale's lack of a no-op
refcopy optimizaton, plus the fact that it's now actually doing memory
work instead of a no-op / redundant memset)
Signed-off-by: Niklas Haas <git@haasn.dev>