[ovs-discuss] [PATCH net-next] fast_hash: clobber registers correctly for inline function use
Hannes Frederic Sowa
hannes at stressinduktion.org
Fri Nov 14 15:46:18 UTC 2014
On Fri, 2014-11-14 at 07:33 -0800, Eric Dumazet wrote:
> On Fri, 2014-11-14 at 16:13 +0100, Hannes Frederic Sowa wrote:
> > >
> > >
> > > That's a lot of clobbers.
> > Yes, those are basically all callee-clobbered registers for the
> > particular architecture. I didn't look at the generated code for jhash
> > and crc_hash because I want this code to always be safe, independent of
> > the version and optimization levels of gcc.
> > > Alternative would be to use an assembly trampoline to save/restore them
> > > before calling __jhash2
> > This version provides the best hints on how to allocate registers to the
> > optimizers. E.g. it could avoid using callee-clobbered registers but use
> > callee-saved ones. If we build a trampoline, we need to save and reload
> > all registers all the time. This version just lets gcc decide how to do
> > that.
> > > __intel_crc4_2_hash2 can probably be written in assembly, it is quite
> > > simple.
> > Sure, but all the pre and postconditions must hold for both, jhash and
> > intel_crc4_2_hash and I don't want to rewrite jhash in assembler.
> We write optimized code for current cpus.
> With current generation of cpus, we have crc32 support.
__intel_crc4_2_hash(2) does already make use of crc32 instruction. I'll
have a closer look at what gcc generates.
> The fallback having to save/restore few registers, we don't care, as the
> fallback has huge cost anyway.
> You don't have to write jhash() in assembler, you misunderstood me.
Ok, understood: we only clobber the registers needed in the
crc32_hash implementation, and only if we branch to jhash do we save all
the other ones in a trampoline directly before jhash.
> We only have to provide a trampoline in assembler, with maybe 10
> Then gcc will know that we do not clobber registers for the optimized
Yes, makes sense.
I would still like to see the currently proposed fix applied, and
we can do this on top. After this patch the inline call resembles a
direct function call, so apart from the long list of clobbers it should
still be pretty fast.
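
For readers following the thread, the clobber-list approach can be
illustrated with a minimal userspace sketch. This is not the patch itself.
It assumes GCC or Clang on x86-64 ELF with the System V calling convention;
hypothetical_hash, hash_via_asm_call and hash_self_check are made-up names,
and the hash body is a simple FNV-1a style mix standing in for jhash or the
crc32 variant. Because userspace code (unlike the kernel) has a red zone and
the compiler cannot see the call, the sketch also skips the red zone and
realigns the stack by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical out-of-line hash, standing in for jhash or the crc32 variant.
 * FNV-1a style mix over 32-bit words; NOT the kernel implementation. */
__attribute__((used, noinline))
uint32_t hypothetical_hash(const uint32_t *data, size_t len, uint32_t seed)
{
    uint32_t h = seed ^ 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Calls hypothetical_hash from inline asm. The compiler does not know a
 * call happens here, so every register the callee may overwrite has to be
 * declared as an output or a clobber. */
static inline uint32_t hash_via_asm_call(const uint32_t *data, size_t len,
                                         uint32_t seed)
{
#if defined(__x86_64__) && defined(__ELF__) && defined(__GNUC__)
    uint32_t hash;
    __asm__ volatile(
        "mov %%rsp, %%rbx\n\t"      /* keep the old stack pointer in rbx    */
        "sub $128, %%rsp\n\t"       /* step over the userspace red zone     */
        "and $-16, %%rsp\n\t"       /* 16-byte alignment required at a call */
        "call hypothetical_hash\n\t"
        "mov %%rbx, %%rsp"          /* restore the stack pointer            */
        /* Argument registers are read-write: the callee may overwrite them. */
        : "=a"(hash), "+D"(data), "+S"(len), "+d"(seed)
        :
        /* rbx is our scratch register; the rest are the remaining
         * registers a System V callee is free to overwrite. */
        : "rbx", "rcx", "r8", "r9", "r10", "r11",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
          "xmm15", "cc", "memory");
    return hash;
#else
    /* Other targets: plain C call, the compiler handles the registers. */
    return hypothetical_hash(data, len, seed);
#endif
}

/* Usage example: both call paths must produce the same value. */
void hash_self_check(void)
{
    const uint32_t words[4] = { 1u, 2u, 3u, 4u };
    assert(hash_via_asm_call(words, 4, 0x9e3779b9u) ==
           hypothetical_hash(words, 4, 0x9e3779b9u));
}
```

With the full clobber list the compiler is free to keep live values in
callee-saved registers across the asm statement, which is the register
allocation hint discussed above. A trampoline would instead save and restore
the caller-saved registers on every call.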