[ovs-dev] [PATCH v10 0/5] dpcls func ptrs & optimizations

Ilya Maximets i.maximets at samsung.com
Fri Jul 12 10:49:45 UTC 2019


On 11.07.2019 17:40, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets at samsung.com]
>> Sent: Thursday, July 11, 2019 3:14 PM
>> To: Van Haaren, Harry <harry.van.haaren at intel.com>; dev at openvswitch.org
>> Cc: malvika.gupta at arm.com; Stokes, Ian <ian.stokes at intel.com>; Michal Orsák
>> <michal.orsak at gmail.com>
>> Subject: Re: [PATCH v10 0/5] dpcls func ptrs & optimizations
>>
>> On 09.07.2019 15:34, Harry van Haaren wrote:
>>> Hey All,
>>>
>>>
>>> Here a v10 of the DPCLS Function Pointer patchset, as has been
>>> presented at OVS Conf in Nov '18, and discussed on the ML since then.
>>> I'm aware of the soft-freeze for 2.12, I feel this patchset has had
>>> enough reviews/versions/testing to be merged in 2.12.
>>>
>>> Thanks Ilya and Ian for review comments on v9, they should all be addressed
>>> in this v10.
>>>
>>> Thanks Malvika Gupta for testing (Tested-by tag added to patches) and also
>>> for reporting ARM performance gains, see here for details:
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/360088.html
>>>
>>>
>>> Regards, -Harry
>>
>> Hi, Harry.
>> Thanks for working on this.
> 
> My pleasure - it’s a nice part of OVS. And there's lots more to do :)
> 
> 
>> I performed some tests with this version in my usual PVP with bonded PHY
>> setup and here are some observations:
>>
>> * The bug that redirected packets to wrong rules is gone. At least I
>>   can't reproduce it in my testing anymore, so I assume it's fixed now.
>>
>> * dpcls performance boost for 512B packets is around 12% compared with
>>   current master.
> 
> Ah great! Glad to hear it's giving you a performance boost.
> 
> 
>> A few remarks about the test scenario:
>> Most packets go through the NORMAL action with vlan push/pop.
>> Packets that go from the VM to the balanced-tcp bonded PHY go through
>> recirculation. The datapath flows for them look like this:
>>
>> Before recirculation:
>> recirc_id=0,eth,ip,vlan_tci=0x0000/0x1fff,dl_src=aa:16:3e:24:30:dd,dl_dst=aa:b
>> b:cc:dd:ee:11,nw_frag=no
>>
>> After recirculation:
>> recirc_id=0x1,dp_hash=0xf5/0xff,eth,ip,dl_vlan=42,dl_vlan_pcp=0,nw_frag=no
>>
>> I have 256 flows in datapath for different 'dp_hash'es.
>>
>> So, even if the number of ipv4 flows is as high as 256K, I have only
>> ~270 datapath flows in dpcls. (This gives a huge advantage to dpcls
>> over EMC and SMC).
> 
> Right - I'm a big fan of the consistent performance characteristic of DPCLS,
> which is due to its wildcarding capabilities and lack of caching concepts.
> 
> 
>> All the flows fit into the 5+1 case, i.e. the optimized function
>> dpcls_subtable_lookup_mf_u0w5_u1w1 is used.
>>
>> Most interesting observation:
>>
>> * The new version of dpcls lookup outperforms SMC in this setup even with
>>   a relatively small number of flows. With 8K flows dpcls is faster than
>>   SMC by 1.5%, and by 5.7% with 256K flows.
>>   Of course, SMC is 10% faster than dpcls with 8 flows, but that's not
>>   very interesting because no one can beat EMC in this area.
>>
>> I'd like to read the code more carefully tomorrow and probably give some
>> more feedback.
>>
>> Best regards, Ilya Maximets.
> 
> Thanks for your comments - please do prioritize feedback ASAP, because as
> you know the 2.12 soft-freeze is already in effect.
> 
> I'll work on Ian's comments on v10, but hold off sending v11 until there
> is some feedback from you too :)

A few comments:

1. I don't really like that the new dpcls depends on the internal structure
   of 'struct flowmap'. At the very least, we need to add a build assert to
   notify anyone who tries to change it.

2. No need to expose so much internal stuff to every module that includes
   'dpif-netdev.h'. Can we create 'dpif-netdev-private.h' instead and move
   all the subtable-related stuff there?

3. IMHO, it's better to apply some preprocessor tricks to make it easier
   to add new optimized functions and reduce code duplication.

4. Please, add a space between 'ULLONG_FOR_EACH_1' and '('.

5. A lot of comments need to be re-formatted according to the coding style,
   i.e. first letter capitalized and a period at the end.

6. "generic optimized" sounds weird.

Here is an incremental patch I'm suggesting that covers comments 1, 3 and 4:

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index 12406f4c1..2d210bf93 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -31,6 +31,9 @@
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
 
+/* Lookup functions below depend on the internal structure of a flowmap. */
+BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
+
 /* netdev_flow_key_flatten_unit:
  * Given a packet, table and mf_masks, this function iterates over each bit
  * set in the subtable, and calculates the appropriate metadata to store in the
@@ -129,8 +132,8 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
 {
     const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
     const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
-
     uint64_t not_match = 0;
+
     for (int i = 0; i < mf_bits_total; i++) {
         not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
     }
@@ -163,7 +166,7 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
     int i;
 
     /* Flatten the packet metadata into the blocks_scratch[] using subtable */
-    ULLONG_FOR_EACH_1(i, keys_map) {
+    ULLONG_FOR_EACH_1 (i, keys_map) {
             netdev_flow_key_flatten(keys[i],
                                     &subtable->mask,
                                     mf_masks,
@@ -173,9 +176,10 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
     }
 
     /* Hash the now linearized blocks of packet metadata */
-    ULLONG_FOR_EACH_1(i, keys_map) {
+    ULLONG_FOR_EACH_1 (i, keys_map) {
          uint32_t hash = 0;
          uint32_t i_off = i * bit_count_total;
+
          for (int h = 0; h < bit_count_total; h++) {
              hash = hash_add64(hash, blocks_scratch[i_off + h]);
          }
@@ -188,12 +192,13 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
      */
     uint32_t found_map;
     const struct cmap_node *nodes[NETDEV_MAX_BURST];
+
     found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
 
     /* Verify that packet actually matched rule. If not found, a hash
      * collision has taken place, so continue searching with the next node.
      */
-    ULLONG_FOR_EACH_1(i, found_map) {
+    ULLONG_FOR_EACH_1 (i, found_map) {
         struct dpcls_rule *rule;
 
         CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
@@ -236,41 +241,25 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
                                subtable->mf_bits_set_unit1);
 }
 
-static uint32_t
-dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    /* hard coded bit counts - enables compile time loop unrolling, and
-     * generating of optimized code-sequences due to loop unrolled code.
-     */
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               5, 1);
-}
+#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
+    static uint32_t                                                           \
+    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
+                                         struct dpcls_subtable *subtable,     \
+                                         uint64_t *blocks_scratch,            \
+                                         uint32_t keys_map,                   \
+                                         const struct netdev_flow_key *keys[],\
+                                         struct dpcls_rule **rules)           \
+    {                                                                         \
+        /* Hard coded bit counts - enables compile time loop unrolling, and   \
+         * generating of optimized code-sequences due to loop unrolled code.  \
+         */                                                                   \
+        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
+                                   keys, rules, U0, U1);                      \
+    }                                                                         \
 
-static uint32_t
-dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               4, 1);
-}
-
-static uint32_t
-dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
-                                   uint64_t *blocks_scratch,
-                                   uint32_t keys_map,
-                                   const struct netdev_flow_key *keys[],
-                                   struct dpcls_rule **rules)
-{
-    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
-                               4, 0);
-}
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1);
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1);
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0);
 
 /* Probe function to lookup an available specialized function.
  * If capable to run the requested miniflow fingerprint, this function returns
@@ -283,14 +272,15 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
 {
     dpcls_subtable_lookup_func f = NULL;
 
-    if (u0_bits == 5 && u1_bits == 1) {
-        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
-    } else if (u0_bits == 4 && u1_bits == 1) {
-        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
-    } else if (u0_bits == 4 && u1_bits == 0) {
-        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
+#define CHECK_LOOKUP_FUNCTION(U0, U1)                      \
+    if (!f && u0_bits == U0 && u1_bits == U1) {            \
+        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;    \
     }
 
+    CHECK_LOOKUP_FUNCTION(5, 1);
+    CHECK_LOOKUP_FUNCTION(4, 1);
+    CHECK_LOOKUP_FUNCTION(4, 0);
+
     if (f) {
         VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
                   u0_bits, u1_bits);
---

Best regards, Ilya Maximets.