Store the intermediate values as words, clipped to the 0..255 range
instead.
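As an illustration only (not the code touched by this commit), a
minimal sketch of a first filter pass that stores clipped 16-bit
intermediates could look as follows. The function name, the 4-tap
layout, the weights summing to 128 and the "+ 64, >> 7" rounding are
all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an integer to the 0..255 range of an 8-bit sample. */
static int clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/*
 * Hypothetical first pass of a separable 4-tap filter.
 * src needs one readable sample before src[0] and two after src[n - 1].
 * The weights are assumed to sum to 128, hence the rounding by 64 and
 * the shift by 7. Every result is clipped before it is stored, so it
 * always fits into a 16-bit word.
 */
static void filter_row_to_words(int16_t *tmp, const uint8_t *src, int n,
                                const int16_t weights[4])
{
    assert(tmp && src && weights);
    for (int x = 0; x < n; x++) {
        int v = src[x - 1] * weights[0] + src[x]     * weights[1] +
                src[x + 1] * weights[2] + src[x + 2] * weights[3] + 64;
        /* Negative sums clip to 0; this also avoids shifting them. */
        tmp[x] = (int16_t)(v < 0 ? 0 : clip_uint8(v >> 7));
    }
}
```

Because the stored values are already limited to 0..255, a second pass
can read them as words without any further range handling.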
Old benchmarks:
filter_diag4_c: 353.4 ( 1.00x)
filter_diag4_sse2: 57.5 ( 6.15x)
New benchmarks:
filter_diag4_c: 350.6 ( 1.00x)
filter_diag4_sse2: 55.1 ( 6.36x)
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only the two middle coefficients are so huge that overflow can happen.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For most systems (particularly all x64), the stack is already
guaranteed to be sufficiently aligned. So just use x86inc's
stack feature which does the right thing.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
VP3's frame management is actually simple. It has three frame slots:
current, last and golden. After having decoded the current frame,
the old last frame will be freed and replaced by the current frame.
If the current frame is a keyframe, it also takes over the golden slot.
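The slot handling described above can be modelled roughly like this.
It is a simplified sketch, not the decoder's code: frames are reduced
to integer ids, 0 stands for an empty slot, and FrameSlots and
finish_frame() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: a frame is just an id, 0 means "slot is empty". */
typedef struct FrameSlots {
    int current;
    int last;
    int golden;
} FrameSlots;

/* Rotate the slots once the current frame has been decoded. */
static void finish_frame(FrameSlots *s, bool keyframe)
{
    assert(s);
    if (keyframe)
        s->golden = s->current; /* a keyframe also takes the golden slot */
    s->last    = s->current;    /* the old last frame is dropped here */
    s->current = 0;             /* assumption: slot is emptied for the next frame */
}
```

Decoding a keyframe with id 1 and then a non-keyframe with id 2 leaves
last == 2 and golden == 1.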
The VP3 decoder handled this as follows: In single-threaded mode,
the above procedure was carried out (on success). Doing so with
frame-threading is impossible, as it would lead to data races.
Instead vp3_update_thread_context() created new references
to these frames and then carried out said procedure.
This means that vp3_update_thread_context() is not just a "dumb"
function that only copies certain fields from src to dst; instead
it actually processes them. E.g. trying to copy the decoding state
from A to B and then from B to C (with no decode_frame call in between)
will not be equivalent to copying from A to C, as both current and last
frames will be blank in the first case.
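A toy model makes the non-equivalence visible. It is only a sketch
under stated assumptions: frames are integer ids with 0 meaning blank,
SlotState and copy_and_advance() are hypothetical names, and the
keyframe/golden update is left out for brevity.

```c
#include <assert.h>

/* Toy decoding state: frame ids, 0 means "blank". */
typedef struct SlotState {
    int current;
    int last;
    int golden;
} SlotState;

/*
 * Model of a copy that also processes the frames: dst receives src's
 * frames and the slot rotation is applied right away.
 */
static void copy_and_advance(SlotState *dst, const SlotState *src)
{
    assert(dst && src && dst != src);
    dst->golden  = src->golden;
    dst->last    = src->current; /* src's current frame becomes dst's last */
    dst->current = 0;            /* dst has not decoded anything yet */
}
```

Starting from A = {current 3, last 2, golden 1}, copying A to B and
then B to C yields last == 0 in C, whereas copying A to C directly
yields last == 3.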
This commit changes this: Because last_frame won't be needed after
decoding, no reference to it will be created in
vp3_update_thread_context(); instead it is now always unreferenced
after decoding (even on error). Replacing last_frame with the new
frame is now always performed when the new frame is allocated.
Replacing the golden frame is now done earlier, namely in decode_frame()
before ff_thread_finish_setup(), so that update_thread_context only
has to reference current frame and golden frame. Being dumb means
that update_thread_context also no longer checks whether the current
frame is valid, so that it can no longer error out.
This unifies the single- and multi-threaded codepaths; it can lead
to changes in output in single-threaded mode: When erroring out,
the current frame would be discarded and not be put into one
of the reference slots at all in single-threaded mode. The new
code meanwhile does everything as the frame-threaded code already did
in order to reduce discrepancies between the two. It would be possible
to keep the old single-threaded behavior (one would need to postpone
replacing the golden frame to the end of vp3_decode_frame and would
need to swap the current frame and the last frame on error,
unreferencing the former).
Reviewed-by: Peter Ross <pross@xvid.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The dimensions are only set at two places: theora_decode_header()
and vp3_decode_init(). These functions are called during init
and during dimension changes, but the latter is only supported
(and attempted) when frame threading is not active. This implies that
the dimensions of the various worker threads in
vp3_update_thread_context() always coincide, so that these checks
are dead and can be removed.
(These checks would of course need to be removed when support
for dimension changes during frame threading is implemented;
and in any case, a dimension change is not an error.)
Reviewed-by: Peter Ross <pross@xvid.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
6c7a344b65 made the VLCs shared between
threads and did so in a way that was designed to support stream
reconfigurations, so that the structure containing the VLCs was
synced in update_thread_context. The idea was that the currently
active VLCs would just be passed along between threads.
Yet this was broken by 5acbdd2264:
Before that commit, submit_packet() was a no-op during flushing
for VP3, as it is a no-delay decoder, so it won't produce any output
during flushing. This meant that prev_thread in pthread_frame.c
contained the last dst thread that update_thread_context()
was called for (so that these VLCs could be passed along between
threads). Yet after said commit, submit_packet was no longer
a no-op during flushing and changed prev_thread in such a way
that it did not need to contain any VLCs at all*. When flushing,
prev_thread is used to pass the current state to the first worker
thread which is the one that is used to restart decoding.
It could therefore happen that the decoding thread no longer contained
the VLCs at all when decoding restarted after flushing, leading to
a crash (this scenario was never anticipated and must not happen
at all).
There is a simple, easily backportable fix given that we do not
support stream reconfigurations (yet) when using frame threading:
Don't sync the VLCs in update_thread_context(), instead do it once
during init.
This fixes forgejo issue #20346 and trac issue #11592.
(I don't know why 5acbdd2264
changed submit_packet() to no longer be a no-op when draining
no-delay decoders.)
*: The exact condition for the crash is nb_threads > 2*nb_frames.
Reviewed-by: Peter Ross <pross@xvid.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It allows us to easily synchronize the software and hardware
decoders, by removing the abstraction the Vulkan layer added by changing
the values written.
The Vulkan spec requires that all accesses to push data are uniform for
all invocations (e.g. can't be based on gl_WorkGroupID or gl_LocalInvocationID).
This commit optimizes the Vulkan decoder by splitting up decoding
from iDCT, and merging the few tables needed directly into the shader.
The speedup on Intel is 10x.
The decoder will reinit the hwaccel upon pixfmt/dimension changes,
so we can remove the f->use32bit and is_rgb variants of all shaders.
This speeds up init time.