[ovs-discuss] Flow miss/Packet order question

Dmitry Fleytman dfleytma at redhat.com
Wed Oct 2 11:49:27 UTC 2013


On Apr 30, 2012, at 20:15, Ben Pfaff <blp at nicira.com> wrote:

> I think that your explanation stems from a misunderstanding.  Yes, if
> an OpenFlow controller uses a reactive model, then it cannot avoid the
> problem.  However, I think that Joji is raising a different issue, one
> that is an implementation detail within Open vSwitch and that
> controllers have no power to avoid.
> 
> Let me explain in detail.  When a packet arrives for which there is no
> kernel flow, the kernel sends it to userspace.  Userspace sends the
> packet and sets up a kernel flow.  In the meantime, more packets might
> have arrived and been queued to userspace.  Userspace will send these
> packets, but any packets that arrive after the kernel flow is set up
> will be forwarded directly by the kernel before those queued to
> userspace go out.
> 


This is exactly the problem we face while working on certification of the KVM paravirtualized network driver for Windows (NetKVM).
There are a few automated tests that send bursts of packets and expect to receive the same packets, in the same order, on the other side.

We have POC patches (pretty dirty) that solve the problem (below). The idea is simple: when the datapath makes an upcall, it queues subsequent packets in the kernel until user mode completes processing and installs the new flow. Queueing per datapath looks like overkill; queueing per vport would be enough, but this was easier to implement and it proves the concept just as well. Still, there is an obvious performance and scaling impact, so other ideas are highly welcome.

What do you think? Should we go for this solution and prepare clean patches for submission?

Regards,
Dmitry Fleytman

=========================================================================
Patch for kernel part:

---
 datapath.c |  149 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 datapath.h |    8 +++
 2 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/datapath.c b/datapath.c
index 22e30ef..19bb7c4 100644
--- a/datapath.c
+++ b/datapath.c
@@ -210,11 +210,27 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 	int error;
 	int key_len;
 
+#ifdef REORDER_HACK
+	unsigned long flags;
+	spin_lock_irqsave(&dp->pending_list_lock, flags);
+
+	if (dp->has_pending_upcalls) {
+		*((struct vport **)&skb->cb) = p;
+		skb_queue_head(&dp->pending_skb_list, skb);
+		spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+		return;
+	}
+
+#endif //REORDER_HACK
+
 	stats = this_cpu_ptr(dp->stats_percpu);
 
 	/* Extract flow from 'skb' into 'key'. */
 	error = ovs_flow_extract(skb, p->port_no, &key, &key_len);
 	if (unlikely(error)) {
+#ifdef REORDER_HACK
+		spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+#endif //REORDER_HACK
 		kfree_skb(skb);
 		return;
 	}
@@ -224,6 +240,11 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 	if (unlikely(!flow)) {
 		struct dp_upcall_info upcall;
 
+#ifdef REORDER_HACK
+		dp->has_pending_upcalls = 1;
+		spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+#endif //REORDER_HACK
+
 		upcall.cmd = OVS_PACKET_CMD_MISS;
 		upcall.key = &key;
 		upcall.userdata = NULL;
@@ -238,15 +259,77 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 
 	stats_counter = &stats->n_hit;
 	ovs_flow_used(OVS_CB(skb)->flow, skb);
+
+#ifdef REORDER_HACK
+	spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+#endif //REORDER_HACK
+
 	ovs_execute_actions(dp, skb);
 
 out:
+
 	/* Update datapath statistics. */
 	u64_stats_update_begin(&stats->sync);
 	(*stats_counter)++;
 	u64_stats_update_end(&stats->sync);
 }
 
+#ifdef REORDER_HACK
+
+static
+int __ovs_dp_process_pending_packet(struct vport *p, struct sk_buff *skb)
+{
+	struct datapath *dp = p->dp;
+	struct sw_flow *flow;
+	struct dp_stats_percpu *stats;
+	struct sw_flow_key key;
+	u64 *stats_counter;
+	int error;
+	int key_len;
+	int res = 1;
+
+	stats = this_cpu_ptr(dp->stats_percpu);
+
+	/* Extract flow from 'skb' into 'key'. */
+	error = ovs_flow_extract(skb, p->port_no, &key, &key_len);
+	if (unlikely(error)) {
+		kfree_skb(skb);
+		return res;
+	}
+
+	/* Look up flow. */
+	flow = ovs_flow_tbl_lookup(rcu_dereference(dp->table), &key, key_len);
+	if (unlikely(!flow)) {
+		struct dp_upcall_info upcall;
+
+		upcall.cmd = OVS_PACKET_CMD_MISS;
+		upcall.key = &key;
+		upcall.userdata = NULL;
+		upcall.pid = p->upcall_pid;
+		ovs_dp_upcall(dp, skb, &upcall);
+		consume_skb(skb);
+		stats_counter = &stats->n_missed;
+		res = 0;
+		goto out;
+	}
+
+	OVS_CB(skb)->flow = flow;
+
+	stats_counter = &stats->n_hit;
+	ovs_flow_used(OVS_CB(skb)->flow, skb);
+	ovs_execute_actions(dp, skb);
+
+out:
+	/* Update datapath statistics. */
+	u64_stats_update_begin(&stats->sync);
+	(*stats_counter)++;
+	u64_stats_update_end(&stats->sync);
+
+	return res;
+}
+
+#endif //REORDER_HACK
+
 static struct genl_family dp_packet_genl_family = {
 	.id = GENL_ID_GENERATE,
 	.hdrsize = sizeof(struct ovs_header),
@@ -746,6 +830,47 @@ err:
 	return err;
 }
 
+static int ovs_packet_cmd_push(struct sk_buff *skb, struct genl_info *info)
+{
+	struct ovs_header *ovs_header = info->userhdr;
+	struct datapath *dp;
+	int err;
+
+	rcu_read_lock();
+	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
+	err = -ENODEV;
+	if (!dp)
+		goto err_unlock;
+
+	rcu_read_unlock();
+
+#ifdef REORDER_HACK
+	{
+		unsigned long flags;
+		spin_lock_irqsave(&dp->pending_list_lock, flags);
+		while (!skb_queue_empty(&dp->pending_skb_list)) {
+			struct sk_buff *skb = skb_dequeue_tail(&dp->pending_skb_list);
+
+			spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+
+			if (!__ovs_dp_process_pending_packet(*((struct vport **)&skb->cb), skb)) {
+				return 0;
+			}
+
+			spin_lock_irqsave(&dp->pending_list_lock, flags);
+		}
+		dp->has_pending_upcalls = 0;
+		spin_unlock_irqrestore(&dp->pending_list_lock, flags);
+	}
+#endif //REORDER_HACK
+
+	return 0;
+
+err_unlock:
+	rcu_read_unlock();
+	return err;
+}
+
 static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 	[OVS_PACKET_ATTR_PACKET] = { .type = NLA_UNSPEC },
 	[OVS_PACKET_ATTR_KEY] = { .type = NLA_NESTED },
@@ -757,6 +882,11 @@ static struct genl_ops dp_packet_genl_ops[] = {
 	  .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
 	  .policy = packet_policy,
 	  .doit = ovs_packet_cmd_execute
+	},
+	{ .cmd = OVS_PACKET_CMD_PUSH,
+	  .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */
+	  .policy = packet_policy,
+	  .doit = ovs_packet_cmd_push
 	}
 };
 
@@ -1354,6 +1484,13 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 
 	ovs_net = net_generic(ovs_dp_get_net(dp), ovs_net_id);
 	list_add_tail(&dp->list_node, &ovs_net->dps);
+
+#ifdef REORDER_HACK
+	skb_queue_head_init(&dp->pending_skb_list);
+	dp->has_pending_upcalls = 0;
+	spin_lock_init(&dp->pending_list_lock);
+#endif //REORDER_HACK
+
 	rtnl_unlock();
 
 	genl_notify(reply, genl_info_net(info), info->snd_pid,
@@ -1394,6 +1531,14 @@ static void __dp_destroy(struct datapath *dp)
 				ovs_dp_detach_port(vport);
 	}
 
+#ifdef REORDER_HACK
+	while (!skb_queue_empty(&dp->pending_skb_list)) {
+		struct sk_buff *skb = skb_dequeue_tail(&dp->pending_skb_list);
+		kfree_skb(skb);
+	}
+	dp->has_pending_upcalls = 0;
+#endif //REORDER_HACK
+
 	list_del(&dp->list_node);
 	ovs_dp_detach_port(ovs_vport_rtnl(dp, OVSP_LOCAL));
 
diff --git a/datapath.h b/datapath.h
index a3c7b67..d8efdef 100644
--- a/datapath.h
+++ b/datapath.h
@@ -19,6 +19,8 @@
 #ifndef DATAPATH_H
 #define DATAPATH_H 1
 
+#define REORDER_HACK 1
+
 #include <asm/page.h>
 #include <linux/kernel.h>
 #include <linux/mutex.h>
@@ -81,6 +83,12 @@ struct datapath {
 	/* Stats. */
 	struct dp_stats_percpu __percpu *stats_percpu;
 
+#ifdef REORDER_HACK
+	int has_pending_upcalls;
+	struct sk_buff_head pending_skb_list;
+	spinlock_t pending_list_lock;
+#endif //REORDER_HACK
+
 #ifdef CONFIG_NET_NS
 	/* Network namespace ref. */
 	struct net *net;
-- 
1.7.1

Patch for user mode part:

---
 include/linux/openvswitch.h |    3 ++-
 lib/dpif-linux.c            |   39 +++++++++++++++++++++++++++++++++++++++
 lib/dpif-netdev.c           |    1 +
 lib/dpif-provider.h         |    2 ++
 lib/dpif.c                  |   17 +++++++++++++++++
 lib/dpif.h                  |    2 ++
 ofproto/ofproto-dpif.c      |    3 ++-
 7 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index adc7fa3..4caaadb 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -130,7 +130,8 @@ enum ovs_packet_cmd {
 	OVS_PACKET_CMD_ACTION,  /* OVS_ACTION_ATTR_USERSPACE action. */
 
 	/* Userspace commands. */
-	OVS_PACKET_CMD_EXECUTE  /* Apply actions to a packet. */
+	OVS_PACKET_CMD_EXECUTE, /* Apply actions to a packet. */
+	OVS_PACKET_CMD_PUSH     /* Flush packets queued behind an upcall. */
 };
 
 /**
diff --git a/lib/dpif-linux.c b/lib/dpif-linux.c
index 1519eb2..6466334 100644
--- a/lib/dpif-linux.c
+++ b/lib/dpif-linux.c
@@ -865,6 +865,21 @@ dpif_linux_encode_execute(int dp_ifindex, const struct dpif_execute *d_exec,
                       d_exec->actions, d_exec->actions_len);
 }
 
+static void
+dpif_linux_encode_push(int dp_ifindex,
+                       struct ofpbuf *buf)
+{
+    struct ovs_header *k_exec;
+
+    ofpbuf_prealloc_tailroom(buf, 64);
+
+    nl_msg_put_genlmsghdr(buf, 0, ovs_packet_family, NLM_F_REQUEST,
+                          OVS_PACKET_CMD_PUSH, OVS_PACKET_VERSION);
+
+    k_exec = ofpbuf_put_uninit(buf, sizeof *k_exec);
+    k_exec->dp_ifindex = dp_ifindex;
+}
+
 static int
 dpif_linux_execute__(int dp_ifindex, const struct dpif_execute *execute)
 {
@@ -881,6 +896,21 @@ dpif_linux_execute__(int dp_ifindex, const struct dpif_execute *execute)
 }
 
 static int
+dpif_linux_push__(int dp_ifindex)
+{
+    uint64_t request_stub[1024 / 8];
+    struct ofpbuf request;
+    int error;
+
+    ofpbuf_use_stub(&request, request_stub, sizeof request_stub);
+    dpif_linux_encode_push(dp_ifindex, &request);
+    error = nl_sock_transact(genl_sock, &request, NULL);
+    ofpbuf_uninit(&request);
+
+    return error;
+}
+
+static int
 dpif_linux_execute(struct dpif *dpif_, const struct dpif_execute *execute)
 {
     struct dpif_linux *dpif = dpif_linux_cast(dpif_);
@@ -888,6 +918,14 @@ dpif_linux_execute(struct dpif *dpif_, const struct dpif_execute *execute)
     return dpif_linux_execute__(dpif->dp_ifindex, execute);
 }
 
+static int
+dpif_linux_push(struct dpif *dpif_)
+{
+    struct dpif_linux *dpif = dpif_linux_cast(dpif_);
+
+    return dpif_linux_push__(dpif->dp_ifindex);
+}
+
 #define MAX_OPS 50
 
 static void
@@ -1325,6 +1363,7 @@ const struct dpif_class dpif_linux_class = {
     dpif_linux_recv,
     dpif_linux_recv_wait,
     dpif_linux_recv_purge,
+    dpif_linux_push,
 };
 
 static int
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 5b93580..8dcb7fc 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1307,6 +1307,7 @@ const struct dpif_class dpif_netdev_class = {
     dpif_netdev_recv,
     dpif_netdev_recv_wait,
     dpif_netdev_recv_purge,
+    NULL /* push */
 };
 
 static void
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index ffe084a..92e5a40 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -336,6 +336,8 @@ struct dpif_class {
     /* Throws away any queued upcalls that 'dpif' currently has ready to
      * return. */
     void (*recv_purge)(struct dpif *dpif);
+
+    int (*push)(struct dpif *dpif);
 };
 
 extern const struct dpif_class dpif_linux_class;
diff --git a/lib/dpif.c b/lib/dpif.c
index fa9b30a..9050e76 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -973,6 +973,17 @@ dpif_execute__(struct dpif *dpif, const struct dpif_execute *execute)
     return error;
 }
 
+static int
+dpif_push__(struct dpif *dpif)
+{
+    int error;
+
+    COVERAGE_INC(dpif_execute);
+    error = dpif->dpif_class->push ? dpif->dpif_class->push(dpif) : EOPNOTSUPP;
+
+    return error;
+}
+
 /* Causes 'dpif' to perform the 'actions_len' bytes of actions in 'actions' on
  * the Ethernet frame specified in 'packet' taken from the flow specified in
  * the 'key_len' bytes of 'key'.  ('key' is mostly redundant with 'packet', but
@@ -996,6 +1007,12 @@ dpif_execute(struct dpif *dpif,
     return dpif_execute__(dpif, &execute);
 }
 
+int
+dpif_push(struct dpif *dpif)
+{
+    return dpif_push__(dpif);
+}
+
 /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the order in
  * which they are specified, placing each operation's results in the "output"
  * members documented in comments.
diff --git a/lib/dpif.h b/lib/dpif.h
index 45c78a5..1f6ab4f 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -168,6 +168,8 @@ int dpif_execute(struct dpif *,
                  const struct nlattr *key, size_t key_len,
                  const struct nlattr *actions, size_t actions_len,
                  const struct ofpbuf *);
+
+int dpif_push(struct dpif *);
 
 /* Operation batching interface.
  *
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 5f04b69..d6466dc 100755
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -2864,7 +2864,7 @@ flow_miss_should_make_facet(struct ofproto_dpif *ofproto,
 static inline void
 buf_set_batch_index(struct ofpbuf *buf, size_t idx)
 {
-    buf->private_p = (void*) idx;
+    buf->private_p = (void *) idx;
 }
 
 static inline size_t
@@ -3177,6 +3177,7 @@ handle_miss_upcalls(struct ofproto_dpif *ofproto, struct dpif_upcall *upcalls,
     }
 
     dpif_operate(ofproto->dpif, dpif_ops, n_exec_ops + n_ops);
+    dpif_push(ofproto->dpif);
 
     /* Free memory and update facets. */
     for (i = 0; i < n_ops; i++) {
-- 
1.7.1




> On Fri, Apr 27, 2012 at 01:26:55PM -0700, Bob Lantz wrote:
>> It does indeed happen with UDP flows (or if a rule is removed while a flow is active) and it's easy to observe.
>> It's not a bug in Open vSwitch, which is in fact implementing the correct OpenFlow behavior.
>> In practice you can easily solve the problem either by pushing down routes proactively or by buffering somewhere (e.g., at the receiver).
>> 
>> On Apr 27, 2012, at 7:39 AM, Joji Mtt wrote:
>> 
>>> I currently do not have a real scenario where I have run into this problem.
>>> 
>>> However, it is easy to see real scenarios where I could run into this. As you said, most flows start off with a single packet and wait for a response. But there is also the flow eviction mechanism, which would bring this into play mid-flow, and there is UDP traffic, which has no ramp-up mechanism. Yes, the window is small, but I think it is real and could cause problems for some applications.
>>> 
>>> -Joji
>>> 
>>> 
>>> On Thu, Apr 26, 2012 at 10:22 PM, Ben Pfaff <blp at nicira.com> wrote:
>>> It isn't commonly a problem in practice because flows most often start
>>> off with a single packet and wait for a return packet before ramping up
>>> packet volume.  I've been aware of the issue for years, but you are the
>>> only other person I ever recall bringing it up on the mailing lists.
>>> 
>>> Do you have a real situation (not a hypothetical scenario) where you see
>>> this causing trouble?
>>> 
>>> On Fri, Apr 27, 2012 at 11:06:29AM +0600, junaid khalid wrote:
>>>> Are you planning to solve this problem in near future or do you have any
>>>> suggestions to mitigate this problem?
>>>> 
>>>> On Thu, Apr 26, 2012 at 2:37 AM, Ben Pfaff <blp at nicira.com> wrote:
>>>> 
>>>>> On Wed, Apr 25, 2012 at 01:33:56PM -0700, Joji Mtt wrote:
>>>>>> I am trying to figure out if there would be a packet order issue with the
>>>>>> current version of OVS. Consider a case where a controller has added a
>>>>>> forwarding rule for a specific flow (Flow A) and this rule is not yet
>>>>>> installed in the DP. In this scenario, it is conceivable that certain
>>>>>> (bursty) traffic patterns on Flow A can result in the packets being sent
>>>>>> out of order. E.g. consider an initial burst of 5 packets that miss the
>>>>>> kernel flow table, followed by several subsequent packets arriving at
>>>>>> random intervals. As soon as the userspace processes the flow miss, it
>>>>> will
>>>>>> install a rule in the kernel. Depending on the relative timing of the
>>>>> rule
>>>>>> installation, any of these subsequent packets could get switched directly
>>>>>> by the kernel before the previous packets that took the slow path could
>>>>> get
>>>>>> forwarded. I couldn't find any special handling to cover this case. Most
>>>>>> likely it is already handled and I am just missing the part where it is
>>>>>> done. Could anyone clarify this for me?
>>>>> 
>>>>> Yes, it's possible to get out-of-order packets for this reason.
>>>>> _______________________________________________
>>>>> discuss mailing list
>>>>> discuss at openvswitch.org
>>>>> http://openvswitch.org/mailman/listinfo/discuss
>>>>> 
>>> 
>>> _______________________________________________
>>> discuss mailing list
>>> discuss at openvswitch.org
>>> http://openvswitch.org/mailman/listinfo/discuss
>> 
> 



