[ovs-dev] [PATCH 2/6] compat: Simplify inet_fragment backports.

Joe Stringer joe at ovn.org
Mon Aug 1 18:46:14 UTC 2016


On 1 August 2016 at 10:21, pravin shelar <pshelar at ovn.org> wrote:
> On Tue, Jul 12, 2016 at 3:26 PM, Joe Stringer <joe at ovn.org> wrote:
>> The core fragmentation handling logic is exported on all supported
>> kernels, so it's not necessary to backport the latest version of this.
>> This greatly simplifies the code due to inconsistencies between the old
>> per-lookup garbage collection and the newer workqueue based garbage
>> collection.
>>
>> As a result of simplifying and removing unnecessary backport code, a few
>> bugs are fixed for corner cases such as when some fragments remain in
>> the fragment cache when openvswitch is unloaded.
>>
>> Some backported ip functions need a little extra logic beyond what is seen
>> in the latest code due to this, for instance on kernels <3.17:
>> * Call inet_frag_evictor() before defrag
>> * Limit hashsize in ip{,6}_fragment logic
>>
>> The pernet init/exit logic also differs a little from upstream. Upstream
>> ipv[46]_defrag logic initializes the various pernet fragment parameters
>> and its own global fragments cache. In the OVS backport, the pernet
>> parameters are shared while the fragments cache is separate. The
>> backport relies upon upstream pernet initialization to perform the
>> shared setup, and performs no pernet initialization of its own. When it
>> comes to pernet exit however, the backport must ensure that all
>> OVS-specific fragment state is cleared, while the shared state remains
>> untouched so that the regular ipv[46] logic may do its own cleanup. In
>> practice this means that OVS must have its own divergent implementation
>> of inet_frags_exit_net().
>>
>> Fixes the following crash:
>>
>> Call Trace:
>>  <IRQ>
>>  [<ffffffff810744f6>] ? call_timer_fn+0x36/0x100
>>  [<ffffffff8107548f>] run_timer_softirq+0x1ef/0x2f0
>>  [<ffffffff8106cccc>] __do_softirq+0xec/0x2c0
>>  [<ffffffff8106d215>] irq_exit+0x105/0x110
>>  [<ffffffff81737095>] smp_apic_timer_interrupt+0x45/0x60
>>  [<ffffffff81735a1d>] apic_timer_interrupt+0x6d/0x80
>>  <EOI>
>>  [<ffffffff8104f596>] ? native_safe_halt+0x6/0x10
>>  [<ffffffff8101cb2f>] default_idle+0x1f/0xc0
>>  [<ffffffff8101d406>] arch_cpu_idle+0x26/0x30
>>  [<ffffffff810bf3a5>] cpu_startup_entry+0xc5/0x290
>>  [<ffffffff810415ed>] start_secondary+0x21d/0x2d0
>> Code:  Bad RIP value.
>> RIP  [<ffffffffa0177480>] 0xffffffffa0177480
>>  RSP <ffff88003f703e78>
>> CR2: ffffffffa0177480
>> ---[ end trace eb98ca80ba07bd9c ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>>
>> Signed-off-by: Joe Stringer <joe at ovn.org>
>> ---
>> I've tested this on CentOS kernel 3.10.0-327 and Ubuntu kernels
>> 3.13.0-68, 3.16.0-70, 3.19.0-58, and 4.2.0-35. Some earlier kernel
>> versions may still trigger fragmentation-related crashes due to upstream
>> bugs, for instance on Ubuntu's 3.13.0-24.
>> ---
>>  acinclude.m4                                  |   1 +
>>  datapath/linux/compat/include/net/inet_frag.h |  58 ++-
>>  datapath/linux/compat/inet_fragment.c         | 486 ++------------------------
>>  datapath/linux/compat/ip_fragment.c           |  36 +-
>>  datapath/linux/compat/nf_conntrack_reasm.c    |  17 +
>>  5 files changed, 91 insertions(+), 507 deletions(-)
>>
> ...
>
>> -void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
>> -                                  const char *prefix)
>> -{
>> -       static const char msg[] = "inet_frag_find: Fragment hash bucket"
>> -               " list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
>> -               ". Dropping fragment.\n";
>> +       local_bh_disable();
>> +       inet_frag_evictor(nf, f, true);
>> +       local_bh_enable();
>>
>> -       if (PTR_ERR(q) == -ENOBUFS)
>> -               net_dbg_ratelimited("%s%s", prefix, msg);
>> +       nf->low_thresh = thresh;
>>  }
>
> Upstream inet_frags_exit_net() and the backported one look the same; we can
> just use the exported symbol.

We share nf->mem with the upstream frag code, so we shouldn't call
percpu_counter_destroy(&nf->mem) when we exit the namespace; the
upstream frag code will handle that.

My other concern was that we use the shared nf->low_thresh to clear
out our fragments cache, but if we are exiting our module and the
upstream frag code remains, we should restore the nf->low_thresh
afterwards so that we don't cause flushing of the upstream frag cache.
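To make that shape concrete, here's a minimal userspace sketch of the
divergent exit path; the struct and helpers are simplified stand-ins for
the real netns_frags/inet_frags code, not the actual backport:

```c
/* Toy stand-in for struct netns_frags: only the fields the exit path
 * touches in this sketch. */
struct netns_frags {
    int low_thresh;   /* threshold shared with the upstream frag code */
    int nqueues;      /* OVS-owned fragment queues still cached */
};

/* Toy evictor: with the threshold dropped to zero, every queue is over
 * the limit and gets flushed (mirrors inet_frag_evictor(nf, f, true)). */
static void toy_evictor(struct netns_frags *nf)
{
    if (nf->low_thresh == 0)
        nf->nqueues = 0;
}

/* Divergent exit: flush only OVS-owned state by temporarily dropping
 * the shared threshold, then restore it so the upstream cache is not
 * flushed later.  There is deliberately no percpu_counter_destroy()
 * equivalent here: nf->mem is shared and upstream owns its teardown. */
static void ovs_toy_frags_exit_net(struct netns_frags *nf)
{
    int thresh = nf->low_thresh;

    nf->low_thresh = 0;
    toy_evictor(nf);
    nf->low_thresh = thresh;
}
```

The key points are the saved-and-restored low_thresh and the deliberate
absence of a percpu_counter_destroy() step.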

>> +#endif /* HAVE_INET_FRAGS_WITH_FRAGS_WORK */
>>
>>  #endif /* !HAVE_CORRECT_MRU_HANDLING */
>> diff --git a/datapath/linux/compat/ip_fragment.c b/datapath/linux/compat/ip_fragment.c
>> index 8d01088abc0a..64e2cf23c327 100644
>> --- a/datapath/linux/compat/ip_fragment.c
>> +++ b/datapath/linux/compat/ip_fragment.c
>> @@ -76,12 +76,7 @@ struct ipfrag_skb_cb
>>
>>  /* Describe an entry in the "incomplete datagrams" queue. */
>>  struct ipq {
>> -       union {
>> -               struct inet_frag_queue q;
>> -#ifndef HAVE_INET_FRAG_QUEUE_WITH_LIST_EVICTOR
>> -               struct ovs_inet_frag_queue oq;
>> -#endif
>> -       };
>> +       struct inet_frag_queue q;
>>
>>         u32             user;
>>         __be32          saddr;
>> @@ -119,6 +114,12 @@ static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
>>                             (__force u32)saddr, (__force u32)daddr,
>>                             ip4_frags.rnd);
>>  }
>> +/* fb3cfe6e75b9 ("inet: frag: remove hash size assumptions from callers")
>> + * shifted this logic into inet_fragment, but prior kernels still need this.
>> + */
>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(3,17,0)
>> +#define ipqhashfn(a, b, c, d) (ipqhashfn(a, b, c, d) & (INETFRAGS_HASHSZ - 1))
>> +#endif
>>
>>  #ifdef HAVE_INET_FRAGS_CONST
>>  static unsigned int ip4_hashfn(const struct inet_frag_queue *q)
>> @@ -267,6 +268,23 @@ out:
>>         ipq_put(qp);
>>  }
>>
>> +/* Memory limiting on fragments.  Evictor trashes the oldest
>> + * fragment queue until we are back under the threshold.
>> + *
>> + * Necessary for kernels earlier than v3.17. Replaced in commit
>> + * b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue").
>> + */
>> +static void ip_evictor(struct net *net)
>> +{
>> +#ifdef HAVE_INET_FRAG_EVICTOR
>> +       int evicted;
>> +
>> +       evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
>> +       if (evicted)
>> +               IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
>> +#endif
>> +}
>> +
>>  /* Find the correct entry in the "incomplete datagrams" queue for
>>   * this IP datagram, and create new one, if nothing is found.
>>   */
>> @@ -281,6 +299,11 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
>>         arg.user = user;
>>         arg.vif = vif;
>>
>> +       ip_evictor(net);
>> +
> IP fragment eviction can be done after this lookup, so that there is a
> better chance of finding the fragment.

I think I was trying to follow upstream more closely, although it's
actually in ip_defrag() in v3.16. I could shift this there instead; I
don't remember specifically why I put it here in the first place.

This seems like a reasonable suggestion at face value. Affected
kernels maintain an LRU of frag queues, so if the current fragment
adds to the oldest queue then that queue is bumped to the front of the
LRU and saved; if it /would/ have been deleted by ip_evictor(), it no
longer is. The fragment could also end up reassembling the message,
releasing the memory and reducing the amount of work for ip_evictor()
to do.

The tradeoff is that if the fragment is actually from a brand new
message, and the fragment cache is full and we don't clean out old
fragments first, we'll fail to allocate a fragqueue for the new
message. We'll clean out the frags list afterwards, then receive later
fragments for the same message; those fragments won't be able to
reassemble the message because we dropped the first one.

I'm inclined to think that the latter is more problematic than the
former: fragments most likely arrive in quick succession, so the
oldest frag queue we delete is more likely to be stale and no longer
receiving its remaining fragments. So we should prioritize ensuring
that the new fragment can create a new fragqueue.
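To illustrate that tradeoff, here's a toy fixed-capacity cache (all
names hypothetical; nothing like the real ip_find()/ip_evictor()
internals) comparing the two orderings for a fragment from a new
message:

```c
#define CACHE_SLOTS 2

struct toy_cache {
    int ids[CACHE_SLOTS];   /* fragqueue ids, oldest first; -1 = free */
};

static int toy_find(struct toy_cache *c, int id)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->ids[i] == id)
            return 1;
    return 0;
}

static int toy_create(struct toy_cache *c, int id)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->ids[i] == -1) {
            c->ids[i] = id;
            return 1;
        }
    return 0;   /* cache full: the first fragment is dropped */
}

static void toy_evict_oldest(struct toy_cache *c)
{
    for (int i = 0; i < CACHE_SLOTS - 1; i++)
        c->ids[i] = c->ids[i + 1];
    c->ids[CACHE_SLOTS - 1] = -1;
}

/* Evict-first: guarantees room for a brand new message, at the cost of
 * possibly discarding the queue the current fragment belongs to. */
static int defrag_evict_first(struct toy_cache *c, int id)
{
    toy_evict_oldest(c);
    return toy_find(c, id) || toy_create(c, id);
}

/* Evict-after: an existing queue survives the lookup, but with a full
 * cache a new message fails to get a queue; eviction frees a slot only
 * after the first fragment has already been dropped. */
static int defrag_evict_after(struct toy_cache *c, int id)
{
    int ok = toy_find(c, id) || toy_create(c, id);
    toy_evict_oldest(c);
    return ok;
}
```

With a full cache, a fragment from a new message gets a queue only in
the evict-first ordering.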

Thoughts?
