593 Commits

Author SHA1 Message Date
a039726c2a avutil/x86/aes: remove a few branches
The rounds value is constant and can be one of three hardcoded values, so
instead of checking it on every loop, just split the function into three
different implementations for each value.

Before:
aes_decrypt_128_aesni:                                  93.8 (47.58x)
aes_decrypt_192_aesni:                                 106.9 (49.30x)
aes_decrypt_256_aesni:                                 109.8 (56.50x)
aes_encrypt_128_aesni:                                  93.2 (47.70x)
aes_encrypt_192_aesni:                                 111.1 (48.36x)
aes_encrypt_256_aesni:                                 113.6 (56.27x)

After:
aes_decrypt_128_aesni:                                  71.5 (63.31x)
aes_decrypt_192_aesni:                                  96.8 (55.64x)
aes_decrypt_256_aesni:                                 106.1 (58.51x)
aes_encrypt_128_aesni:                                  81.3 (55.92x)
aes_encrypt_192_aesni:                                  91.2 (59.78x)
aes_encrypt_256_aesni:                                 109.0 (58.26x)

Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-10 12:02:34 -03:00
a35b4e8d29 avutil/x86/aes: ignore the upper bits in count
The argument is an int.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-06 11:02:09 -03:00
2ea3c51795 lavu/aes: add x86 AESNI optimizations
crypto_bench comparison for AES-128-ECB:

lavu_aesni AES-128-ECB  size: 1048576  runs:   1024  time:    0.596 +- 0.081
lavu_c     AES-128-ECB  size: 1048576  runs:   1024  time:   17.007 +- 2.131
crypto     AES-128-ECB  size: 1048576  runs:   1024  time:    0.612 +- 1.857
gcrypt     AES-128-ECB  size: 1048576  runs:   1024  time:    1.123 +- 0.224
tomcrypt   AES-128-ECB  size: 1048576  runs:   1024  time:    9.038 +- 0.790

Improved-By: Henrik Gramner <henrik@gramner.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-05 20:46:40 -03:00
892f64ad9b x86/tx_float: remove HAVE_AVX2_EXTERNAL checks
It'll always be enabled.
Thanks, nasm.
2024-10-06 01:32:49 +02:00
b17a240c8d Revert "x86/tx_float: set all operands for shufps"
This reverts commit 74f5fb6db899dbc4fde9ccf77f37256ddcaaaab9.
2024-10-06 01:32:49 +02:00
24c5a58e55 Revert "x86/tx_float: add missing check for AVX2"
This reverts commit f4097e4c1f1bb244cae78c363a69d5e84495b616.
2024-10-06 01:32:48 +02:00
bf643f989b Revert "x86/tx_float: add missing preprocessor wrapper for AVX2 functions"
This reverts commit 750f378becf15c0552c45a66a66aca7cc506d490.
2024-10-06 01:32:48 +02:00
b890482d05 Revert "x86/tx_float: change a condition in a preprocessor check"
This reverts commit 0d8f43c74d0b1039ba70aacb4c9c7768e8bebf9f.
2024-10-06 01:32:47 +02:00
9e7a93c6fd x86/intreadwrite: add SSE2 optimized AV_COPY128U
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-29 23:17:52 -03:00
70c6b904be x86/intreadwrite: add missing casts to pointer arguments
Should make strict compilers happy.

Also, make AV_COPY128 use integer operations while at it. Removing the
inclusion of immintrin.h ensures a lot less intrinsic related headers are
included as well, which fixes a clash of defines with some Clang versions.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-11 18:24:26 -03:00
1a86a7a48d x86/intreadwrite: fix include of config.h
Should fix make checkheaders.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:52:52 -03:00
15056dd650 x86/intreadwrite.h: add missing preprocessor checks
Removed by accident in the previous commits. This makes the code only run when
compiled with GCC and Clang like before. Support for other compilers like msvc
can be added later.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:49:21 -03:00
bd1bcb07e0 x86/intreadwrite: use intrinsics instead of inline asm for AV_COPY128
This has the benefit of removing any SSE -> AVX penalty that may happen when
the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
4a04cca69a x86/intreadwrite: use intrinsics instead of inline asm for AV_ZERO128
When called inside a loop, the inline asm version results in one pxor
unnecessarely emitted per iteration, as the contents of the __asm__() block are
opaque to the compiler's instruction scheduler.
This is not the case with intrinsics, where pxor will be emitted once with any
half decent compiler.

This also has the benefit of removing any SSE -> AVX penalty that may happen
when the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
4b57ea8fc7 avutil/common: assert that bit position in av_zero_extend is valid
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:36:09 -03:00
39c90d6466 avutil: rename av_mod_uintp2 to av_zero_extend
It's more descriptive of what it does.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:35:57 -03:00
0231097d1b lavu/x86: remove GCC 4.4- stuff
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
2024-06-13 21:16:16 +03:00
a14440867c x86/float_dsp: add SSE2 and AVX versions of scalarproduct_double
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-03 22:14:55 -03:00
790f793844 avutil/common: Don't auto-include mem.h
There are lots of files that don't need it: The number of object
files that actually need it went down from 2011 to 884 here.

Keep it for external users in order to not cause breakages.

Also improve the other headers a bit while just at it.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-31 00:08:43 +01:00
afa471d0ef x86: Update x86inc.asm
Make things up-to-date with upstream.

https://code.videolan.org/videolan/x86inc.asm
2024-03-24 14:53:57 +01:00
c3d3f0e697 avutil/x86util: Fix broken pre-SSE4.1 PMINSD emulation
Fixes yadif-16 which allows FATE to pass.

Broken since 2904db90458a1253e4aea6844ba9a59ac11923b6 (2017).
2024-03-17 13:52:27 +01:00
c00cd007e8 configure: Remove av_restrict
All versions of MSVC that support C11 (namely >= v19.27)
also support the restrict keyword, therefore av_restrict
is no longer necessary since 75697836b1db3e0f0a3b7061be6be28d00c675a0.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-15 12:51:15 +01:00
7ec2354c38 x86: Remove inline MMX assembly that clobbers the FPU state
These inline implementations of AV_COPY64, AV_SWAP64 and AV_ZERO64
are known to clobber the FPU state - which has to be restored
with the 'emms' instruction afterwards.

This was known and signaled with the FF_COPY_SWAP_ZERO_USES_MMX
define, which calling code seems to have been supposed to check,
in order to call emms_c() after using them. See
0b1972d4096df5879038f0af776f87f41e90ebd4,
29c4c0886d143790fcbeddbe40a23dfc6f56345c and
df215e575850e41b19aeb1fd99e53372a6b3d537 for history on earlier
fixes in the same area.

However, new code can use these AV_*64() macros without knowing
about the need to call emms_c().

Just get rid of these dangerous inline assembly snippets; this
doesn't make any difference for 64 bit architectures anyway.

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-02-09 23:55:52 +02:00
9af87828bd x86/tx_init: propely indicate the extended available transform sizes
Forgot to do this with the previous commit.

Actually makes the assembly being used.

Still the fastest FFT in the world, 15% faster than FFTW on the
largest available size.
2024-02-09 18:08:42 +01:00
bd3e71b21e x86/tx_float: enable SIMD for sizes over 131072
The tables for the new sizes were added last year due
to being required for SDR.
However, the assembly was never updated to use them.
2024-02-07 15:20:48 +01:00
ed8ddf0bd3 x86inc: Add REPX macro to repeat instructions/operations
When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw       m0, m4
    paddw       m1, m4
    paddw       m2, m4
    paddw       m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5

Commit taken from x264:
6d10612ab0

Signed-off-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2023-11-08 13:49:08 +01:00
5b85ca5317 avutil/x86/pixelutils: Empty MMX state in ff_pixelutils_sad_8x8_mmxext
We currently mostly do not empty the MMX state in our MMX
DSP functions; instead we only do so before code that might
be using x87 code. This is a violation of the System V i386 ABI
(and maybe of other ABIs, too):
"The CPU shall be in x87 mode upon entry to a function. Therefore,
every function that uses the MMX registers is required to issue an
emms or femms instruction after using MMX registers, before returning
or calling another function." (See 2.2.1 in [1])
This patch does not intend to change all these functions to abide
by the ABI; it only does so for ff_pixelutils_sad_8x8_mmxext, as this
function can by called by external users, because it is exported
via the pixelutils API. Without this, the following fragment will
assert (on x86/x64):
    uint8_t src1[8 * 8], src2[8 * 8];
    av_pixelutils_sad_fn fn = av_pixelutils_get_sad_fn(3, 3, 0, NULL);
    fn(src1, 8, src2, 8);
    av_assert0_fpu();

[1]: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/intel386-psABI-1.1.pdf

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-11-04 01:26:03 +01:00
f8503b4c33 avutil/internal: Don't auto-include emms.h
Instead include emms.h wherever it is needed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-09-04 11:04:45 +02:00
bbe95f7353 x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
90c17a05aa x86/tx_float: fix stray change in 15xM FFT and replace imul->lea
Thanks to rorgoroth for bisecting and kurosu for the lea suggestion.
2022-11-28 16:58:12 +01:00
87bae6b018 lavu/tx: refactor to explicitly track and convert lookup table order
Necessary for generalizing PFAs.
2022-11-24 15:58:34 +01:00
fab97faf02 x86/tx_float: implement striding in fft_15xM 2022-11-24 15:58:32 +01:00
92100eee5b x86/tx_float_init: properly specify the supported factors of 15xM FFTs
Only powers of two are currently supported.
2022-11-24 15:58:32 +01:00
cc1df4045e x86/tx_float: add a standalone 15-point AVX2 transform
Enables its use everywhere else in the framework.
2022-11-24 15:58:31 +01:00
877e575b5d x86/tx_float: optimize and macro out FFT15 2022-11-24 15:58:31 +01:00
a11e745b97 lavu/fixed_dsp: add missing av_restrict qualifiers
The butterflies_fixed function pointer declaration specifies av_restrict
for the first two pointer arguments. So the corresponding function
definitions should honor this declaration.

MSVC emits warning C4113 for this.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-10-04 10:56:12 +02:00
f21899db7d x86/tx_float: enable AVX-only split-radix FFT codelets
Sandy Bridge, Ivy Bridge and Bulldozer cores don't support FMA3.
2022-09-24 04:16:55 +02:00
d2f482965f x86/tx_float: fix some symbol names
Should fix compilation on MacOS

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 18:53:05 -03:00
0d8f43c74d x86/tx_float: change a condition in a preprocessor check
Fixes compilation with yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 16:05:07 -03:00
750f378bec x86/tx_float: add missing preprocessor wrapper for AVX2 functions
Fixes compilation with old assemblers.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-23 15:15:20 -03:00
74e8541bab x86/tx_float: generalize iMDCT
To support non-aligned buffers during the post-transform step, just iterate
backwards over the array.

This allows using the 15xN-point FFT, with which the speed is 2.1 times
faster than our old libavcodec implementation.
2022-09-23 12:35:28 +02:00
ace42cf581 x86/tx_float: add 15xN PFA FFT AVX SIMD
~4x faster than the C version.
The shuffles in the 15pt dim1 are seriously expensive. Not happy with it,
but I'm contempt.

Can be easily converted to pure AVX by removing all vpermpd/vpermps
instructions.
2022-09-23 12:35:27 +02:00
3241e9225c x86/tx_float: adjust internal ASM call ABI again
There are many ways to go about it, and this one seems optimal for both
MDCTs and PFA FFTs without requiring excessive instructions or stack usage.
2022-09-23 12:33:35 +02:00
4ba68639ca x86/tx_float: add asm call versions of the 2pt and 4pt transforms
Verified to be working.
2022-09-19 06:01:06 +02:00
892548e6a1 x86/tx_float: fully support 128bit regs in LOAD64_LUT
The gather path didn't support 128bit registers.
It's not faster on Zen 3, but it's here for completeness.
2022-09-19 06:01:04 +02:00
af42bb3d61 x86/tx_float: simplify and describe the intra-asm call convention 2022-09-19 06:01:02 +02:00
bda3a9faf4 x86/float_dsp: use three operand form for some instructions
Fixes compilation with old yasm

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-13 13:50:09 -03:00
72acff9f59 avutil/x86/float_dsp: add fma3 for scalarproduct 2022-09-13 17:43:15 +02:00
29c4c0886d avutil/x86/intreadwrite: Add ability to detect whether MMX code is used
It can be used to call emms_c() only when needed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-11 21:08:04 +02:00
f4097e4c1f x86/tx_float: add missing check for AVX2
Fixes compilation with old yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-06 14:06:33 -03:00