[ovs-dev] request for review: protection against kernel flow thrashing

George Shuklin george.shuklin at gmail.com
Wed Apr 3 13:29:02 UTC 2013


Good day.

A few weeks ago I reported a serious flaw in OVS: a specially crafted 
flood, or even legitimate traffic with a large number of TCP connections 
from different sources, can cause a denial of service.

Initially I thought it was somehow caused by 'slow' NORMAL-mode rules, 
but further research showed that even a single rule 'priority=10, 
actions=drop' can cause absolutely unacceptable CPU load in 
ovs-vswitchd. A flood (or just a large number of legitimate connections) 
at a rate of about 5k new flows per second causes 100% CPU load, huge 
latency spikes and overall system performance degradation.

This issue is especially harmful for XenServer and XCP, which use OVS 
for management/SAN traffic. Under a flood load of about 30-40 Mbit/s it 
causes horrible network delays on all bridges; virtual machines start to 
get I/O timeouts and propagate them as I/O errors to VM userspace. Even 
leaving SAN traffic aside, 30 Mbit/s from the Internet can cause an 
outage of a server with 1G or 10G interfaces. This is ridiculous.

I've done some deep research and found the reason:

Every packet is inspected in kernel space by the flow_extract function. 
The extracted key is hashed and used to search for a corresponding rule 
in the flow table. If a rule is found, it is used; if not, the packet is 
sent to the userspace ovs-vswitchd daemon for inspection. This path is 
very slow compared to kernel-only processing. Normally the OVS kernel 
module can process up to 100 kpps per single VIF (and I think this is a 
limitation of Xen's netback/netfront driver, not OVS), but passing 
packets to userspace drops performance to 3-7 kpps, and not 'per single 
VIF' but 'per host'.
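The fast-path/slow-path split can be sketched with a toy model (the names `receive_packet`, `toy_hash` and the table below are mine for illustration, not actual OVS symbols; the real module uses jhash and a proper flow table):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024

struct toy_key { uint32_t src_ip, dst_ip; uint16_t tp_src, tp_dst; };

static struct toy_key table[TABLE_SIZE];
static int table_used[TABLE_SIZE];
static unsigned long fast_hits, slow_upcalls;

/* FNV-1a over the key bytes; stand-in for the kernel's jhash. */
static uint32_t toy_hash(const struct toy_key *k)
{
    const uint8_t *p = (const uint8_t *) k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof *k; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Kernel-side dispatch: an exact-match hit stays in the fast path,
 * a miss goes to userspace, which then installs a new flow entry. */
static void receive_packet(const struct toy_key *k)
{
    uint32_t slot = toy_hash(k) % TABLE_SIZE;

    if (table_used[slot] && !memcmp(&table[slot], k, sizeof *k)) {
        fast_hits++;            /* fast path: handled in the kernel */
    } else {
        slow_upcalls++;         /* slow path: upcall to ovs-vswitchd */
        table[slot] = *k;
        table_used[slot] = 1;
    }
}
```

Only the first packet of a flow pays the upcall cost; every repeat of the exact same key is a cheap table hit. The attack works by making sure no key ever repeats.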

The problem is that packets with different source IPs, different 
source/destination TCP/UDP ports, different TTLs, etc. produce different 
hashes. A change in any field of any header at any level results in a 
new key and an additional 'slow' query to ovs-vswitchd. A large number 
of TCP/UDP sessions means many different keys, and therefore a slow 
processing rate.
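This per-field sensitivity is easy to demonstrate with a toy key (field names are illustrative, not the real sw_flow_key layout): two packets share a kernel flow entry only if every extracted field matches, so each new ephemeral port, or even a different TTL, creates a new key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy flow key; illustrative field names only. */
struct toy_key {
    uint32_t src_ip, dst_ip;
    uint16_t tp_src, tp_dst;
    uint8_t ttl, tos;
};

static void make_key(struct toy_key *k, uint32_t src_ip,
                     uint16_t tp_src, uint8_t ttl)
{
    memset(k, 0, sizeof *k);  /* zero padding so memcmp is reliable */
    k->src_ip = src_ip;
    k->dst_ip = 0x0a000002;   /* same server every time */
    k->tp_src = tp_src;
    k->tp_dst = 80;           /* same service every time */
    k->ttl = ttl;
}

/* Two packets reuse a flow entry only on an exact full-key match. */
static int same_flow(const struct toy_key *a, const struct toy_key *b)
{
    return !memcmp(a, b, sizeof *a);
}
```

Even with the destination IP and port fixed, the next ephemeral source port or a one-hop TTL difference already misses the flow table and triggers an upcall.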

This over-deep packet inspection is the reason OVS performance is so 
inconsistent across load types. Under one kind of load it shows 
excellent performance, many times faster than the Linux bridge, but it 
drops to more than 10 times slower under another.

To fix the issue I decided to cut off some of the OVS key-extraction 
functionality. I understand this makes OVS a "cripple" (no longer a 
fully compliant OpenFlow switch), but a few weeks of testing in the lab 
and two weeks in production showed that the patch [1] significantly 
reduces the DoS vulnerability and fixes some strange latency 
fluctuations under heavy load.

The patch is built on a few assumptions:

1) OVS runs under XenServer/XCP, so port 0 is always connected to 
xenbrX, port 1 to the physical interface (eth), and all ports above 
1 are VM ports.
2) The number of source MACs/IPs used by guest VMs is limited.
3) Antispoofing rules are in place (VMs are restricted to their 
allowed source MAC/IP addresses).
4) There is no specific interest in xenbr/eth traffic.

Changes:

1) ARP and IPv6 traffic processing is left unchanged (no IPv6 DoS 
protection for now; I will worry about that only after the first 
serious activity).
2) For the xenbr/eth interfaces, no source/destination MAC/IP addresses 
are added to the flow.
3) For interfaces above 1 (VMs), only the source IP and source MAC are 
added to the key.
4) No TTL, QoS, etc. fields are taken into account.
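The effect of these changes on key cardinality can be sketched as follows (a toy model under the patch's assumptions; extract_full/extract_trimmed and the field names are mine, not the real OVS code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy packet and key; illustrative names only. */
struct pkt {
    uint16_t in_port;
    uint32_t src_ip, dst_ip;
    uint16_t tp_src, tp_dst;
    uint8_t ttl, tos;
};
struct key {
    uint32_t src_ip, dst_ip;
    uint16_t tp_src, tp_dst;
    uint8_t ttl, tos;
};

/* Pre-patch extraction: every volatile field lands in the key. */
static struct key extract_full(const struct pkt *p)
{
    struct key k;
    memset(&k, 0, sizeof k);  /* zero padding so memcmp is reliable */
    k.src_ip = p->src_ip;
    k.dst_ip = p->dst_ip;
    k.tp_src = p->tp_src;
    k.tp_dst = p->tp_dst;
    k.ttl = p->ttl;
    k.tos = p->tos;
    return k;
}

/* Post-patch extraction: VM ports (>1) keep only the source IP
 * (antispoofing); ports 0/1 record no addresses at all. */
static struct key extract_trimmed(const struct pkt *p)
{
    struct key k;
    memset(&k, 0, sizeof k);
    if (p->in_port > 1) {
        k.src_ip = p->src_ip;
    }
    return k;
}

/* Count distinct keys an extractor yields for n connections from one
 * VM that differ only in the ephemeral source port. */
static unsigned distinct(unsigned n,
                         struct key (*extract)(const struct pkt *))
{
    struct key seen[64];
    unsigned count = 0;

    for (unsigned i = 0; i < n && count < 64; i++) {
        struct pkt p = { .in_port = 3, .src_ip = 0x0a000001,
                         .dst_ip = 0x0a000002,
                         .tp_src = (uint16_t)(32768 + i), .tp_dst = 80,
                         .ttl = 64, .tos = 0 };
        struct key k = extract(&p);
        unsigned j;
        for (j = 0; j < count; j++) {
            if (!memcmp(&seen[j], &k, sizeof k)) {
                break;
            }
        }
        if (j == count) {
            seen[count++] = k;
        }
    }
    return count;
}
```

With full extraction every connection is a new flow (and a new upcall); with the trimmed key the whole stream from one VM collapses into a single flow entry.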

I'd like to hear comments on these changes, if possible. Thanks.

[1] Patch to openvswitch.ko and ovs-vswitchd (1.4.6):

diff --git a/datapath/flow.c b/datapath/flow.c
index 06df0f6..bea3af6 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -699,12 +699,11 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
                 }

                 nh = ip_hdr(skb);
-               key->ipv4.addr.src = nh->saddr;
-               key->ipv4.addr.dst = nh->daddr;
+               if (in_port > 1) { /* ports above eth0 and xenbr0 record source IP (antispoofing) */
+                       key->ipv4.addr.src = nh->saddr;
+               }

                 key->ip.proto = nh->protocol;
-               key->ip.tos = nh->tos;
-               key->ip.ttl = nh->ttl;

                 offset = nh->frag_off & htons(IP_OFFSET);
                 if (offset) {
@@ -720,25 +719,16 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
                         key_len = SW_FLOW_KEY_OFFSET(ipv4.tp);
                         if (tcphdr_ok(skb)) {
                                 struct tcphdr *tcp = tcp_hdr(skb);
-                               key->ipv4.tp.src = tcp->source;
-                               key->ipv4.tp.dst = tcp->dest;
                         }
                 } else if (key->ip.proto == IPPROTO_UDP) {
                         key_len = SW_FLOW_KEY_OFFSET(ipv4.tp);
                         if (udphdr_ok(skb)) {
                                 struct udphdr *udp = udp_hdr(skb);
-                               key->ipv4.tp.src = udp->source;
-                               key->ipv4.tp.dst = udp->dest;
                         }
                 } else if (key->ip.proto == IPPROTO_ICMP) {
                         key_len = SW_FLOW_KEY_OFFSET(ipv4.tp);
                         if (icmphdr_ok(skb)) {
                                 struct icmphdr *icmp = icmp_hdr(skb);
-                               /* The ICMP type and code fields use the 16-bit
-                                * transport port fields, so we need to store
-                                * them in 16-bit network byte order. */
-                               key->ipv4.tp.src = htons(icmp->type);
-                               key->ipv4.tp.dst = htons(icmp->code);
                         }
                 }

diff --git a/lib/flow.c b/lib/flow.c
index 5aaa353..a23332e 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -221,8 +221,6 @@ parse_tcp(struct ofpbuf *packet, struct ofpbuf *b, struct flow *flow)
  {
      const struct tcp_header *tcp = pull_tcp(b);
      if (tcp) {
-        flow->tp_src = tcp->tcp_src;
-        flow->tp_dst = tcp->tcp_dst;
          packet->l7 = b->data;
      }
  }
@@ -232,8 +230,6 @@ parse_udp(struct ofpbuf *packet, struct ofpbuf *b, struct flow *flow)
  {
      const struct udp_header *udp = pull_udp(b);
      if (udp) {
-        flow->tp_src = udp->udp_src;
-        flow->tp_dst = udp->udp_dst;
          packet->l7 = b->data;
      }
  }
@@ -365,19 +361,17 @@ flow_extract(struct ofpbuf *packet, uint32_t priority, ovs_be64 tun_id,
          const struct ip_header *nh = pull_ip(&b);
          if (nh) {
              packet->l4 = b.data;
-
-            flow->nw_src = get_unaligned_be32(&nh->ip_src);
-            flow->nw_dst = get_unaligned_be32(&nh->ip_dst);
+            if (ofp_in_port > 1) {
+                flow->nw_src = get_unaligned_be32(&nh->ip_src);
+            }
              flow->nw_proto = nh->ip_proto;

-            flow->nw_tos = nh->ip_tos;
              if (IP_IS_FRAGMENT(nh->ip_frag_off)) {
                  flow->nw_frag = FLOW_NW_FRAG_ANY;
                  if (nh->ip_frag_off & htons(IP_FRAG_OFF_MASK)) {
                      flow->nw_frag |= FLOW_NW_FRAG_LATER;
                  }
              }
-            flow->nw_ttl = nh->ip_ttl;

              if (!(nh->ip_frag_off & htons(IP_FRAG_OFF_MASK))) {
                  if (flow->nw_proto == IPPROTO_TCP) {
@@ -387,8 +381,6 @@ flow_extract(struct ofpbuf *packet, uint32_t priority, ovs_be64 tun_id,
                  } else if (flow->nw_proto == IPPROTO_ICMP) {
                      const struct icmp_header *icmp = pull_icmp(&b);
                      if (icmp) {
-                        flow->tp_src = htons(icmp->icmp_type);
-                        flow->tp_dst = htons(icmp->icmp_code);
                          packet->l7 = b.data;
                      }
                  }



