[ovs-dev] [PATCH v10 0/5] dpcls func ptrs & optimizations

Ilya Maximets i.maximets at samsung.com
Fri Jul 12 11:22:17 UTC 2019


On 12.07.2019 13:49, Ilya Maximets wrote:
> On 11.07.2019 17:40, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Ilya Maximets [mailto:i.maximets at samsung.com]
>>> Sent: Thursday, July 11, 2019 3:14 PM
>>> To: Van Haaren, Harry <harry.van.haaren at intel.com>; dev at openvswitch.org
>>> Cc: malvika.gupta at arm.com; Stokes, Ian <ian.stokes at intel.com>; Michal Orsák
>>> <michal.orsak at gmail.com>
>>> Subject: Re: [PATCH v10 0/5] dpcls func ptrs & optimizations
>>>
>>> On 09.07.2019 15:34, Harry van Haaren wrote:
>>>> Hey All,
>>>>
>>>>
>>>> Here is v10 of the DPCLS Function Pointer patchset, as presented at
>>>> OVS Conf in Nov '18 and discussed on the ML since then. I'm aware of
>>>> the soft-freeze for 2.12, but I feel this patchset has had enough
>>>> reviews/versions/testing to be merged in 2.12.
>>>>
>>>> Thanks to Ilya and Ian for the review comments on v9; they should all
>>>> be addressed in this v10.
>>>>
>>>> Thanks to Malvika Gupta for testing (Tested-by tag added to the patches)
>>>> and for reporting ARM performance gains; see here for details:
>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/360088.html
>>>>
>>>>
>>>> Regards, -Harry
>>>
>>> Hi, Harry.
>>> Thanks for working on this.
>>
>> My pleasure - it's a nice part of OVS. And there's lots more to do :)
>>
>>
>>> I performed some tests with this version in my usual PVP with bonded PHY
>>> setup, and here are some observations:
>>>
>>> * The bug that redirected packets to wrong rules is gone. At least, I
>>>   can't catch it in my testing anymore, so I assume it's fixed now.
>>>
>>> * The dpcls performance boost for 512B packets is around 12% compared
>>>   with current master.
>>
>> Ah great! Glad to hear it's giving you a performance boost.
>>
>>
>>> A few remarks about the test scenario:
>>> Most packets go through the NORMAL action with vlan push/pop. Packets
>>> going from the VM to the balanced-tcp bonded PHY go through
>>> recirculation. The datapath flows for them look like this:
>>>
>>> Before recirculation:
>>> recirc_id=0,eth,ip,vlan_tci=0x0000/0x1fff,dl_src=aa:16:3e:24:30:dd,dl_dst=aa:bb:cc:dd:ee:11,nw_frag=no
>>>
>>> After recirculation:
>>> recirc_id=0x1,dp_hash=0xf5/0xff,eth,ip,dl_vlan=42,dl_vlan_pcp=0,nw_frag=no
>>>
>>> I have 256 flows in the datapath for different 'dp_hash'es.
>>>
>>> So, even if the number of ipv4 flows is as high as 256K, I have only
>>> about 270 datapath flows in dpcls. (This gives a huge advantage to dpcls
>>> over EMC and SMC.)
>>
>> Right - I'm a big fan of the consistent performance characteristic of DPCLS,
>> which is due to its wildcarding capabilities and lack of caching concepts.
>>
>>
>>> All the flows fit into the 5+1 case, i.e. the optimized function
>>> dpcls_subtable_lookup_mf_u0w5_u1w1 is used.
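>>>
>>> ("5+1" here means five one-bits in unit 0 of the subtable's miniflow map
>>> and one in unit 1, i.e. the specialization that, per the patchset, boils
>>> down to calling:
>>>
>>>     lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
>>>                         5, 1);
>>>
>>> with both bit counts known at compile time.)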
>>>
>>> Most interesting observation:
>>>
>>> * The new version of the dpcls lookup outperforms SMC in this setup even
>>>   with a relatively small number of flows: with 8K flows dpcls is faster
>>>   than SMC by 1.5%, and with 256K flows by 5.7%.
>>>   Of course, SMC is 10% faster than dpcls with 8 flows, but that's not
>>>   very interesting because no-one can beat EMC in this area.
>>>
>>> I'd like to read the code more carefully tomorrow and probably give some
>>> more feedback.
>>>
>>> Best regards, Ilya Maximets.
>>
>> Thanks for your comments - please do prioritize feedback ASAP, because as
>> you know the 2.12 soft-freeze is already in effect.
>>
>> I'll work on Ian's comments on v10, but hold off sending v11 until there
>> is some feedback from you too :)
> 
> A few comments:
> 
> 1. I don't really like that the new dpcls depends on the internal structure
>    of 'struct flowmap'. At the least, we need to add a build assert to
>    notify anyone who tries to change it.
> 
> 2. There is no need to expose so much internal stuff to every module that
>    includes 'dpif-netdev.h'. Can we create 'dpif-netdev-private.h' instead
>    and move all the subtable-related stuff there? (A rough sketch of what
>    I have in mind follows this list.)
> 
> 3. IMHO, it's better to apply some preprocessor optimizations to make it
>    easier to add new optimized functions and decrease code duplication.
> 
> 4. Please, add a space between 'ULLONG_FOR_EACH_1' and '('.
> 
> 5. A lot of comments need to be re-formatted according to the coding style,
>    i.e. first letter capitalized + period at the end.
> 
> 6. "generic optimized" sounds weird.

One more thing. Could you please rename the patches like:

dpif-netdev: implement function pointers/subtable

to

dpif-netdev: Implement function pointers/subtable.


This is not a strict requirement, but it is the common formatting for OVS
commit subjects (capital letter + period).
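
For the whole series, an interactive rebase is probably the quickest way to
do this, e.g. (assuming the five patches sit at the tip of your branch):

    git rebase -i HEAD~5   # mark each commit as 'reword', then fix subjects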

> 
> Here is an incremental patch I'm suggesting, which covers comments 1, 3
> and 4:
> 
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index 12406f4c1..2d210bf93 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -31,6 +31,9 @@
>  
>  VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
>  
> +/* Lookup functions below depend on the internal structure of a flowmap. */
> +BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
> +
>  /* netdev_flow_key_flatten_unit:
>   * Given a packet, table and mf_masks, this function iterates over each bit
>   * set in the subtable, and calculates the appropriate metadata to store in the
> @@ -129,8 +132,8 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
>  {
>      const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
>      const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
> -
>      uint64_t not_match = 0;
> +
>      for (int i = 0; i < mf_bits_total; i++) {
>          not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
>      }
> @@ -163,7 +166,7 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>      int i;
>  
>      /* Flatten the packet metadata into the blocks_scratch[] using subtable */
> -    ULLONG_FOR_EACH_1(i, keys_map) {
> +    ULLONG_FOR_EACH_1 (i, keys_map) {
>              netdev_flow_key_flatten(keys[i],
>                                      &subtable->mask,
>                                      mf_masks,
> @@ -173,9 +176,10 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>      }
>  
>      /* Hash the now linearized blocks of packet metadata */
> -    ULLONG_FOR_EACH_1(i, keys_map) {
> +    ULLONG_FOR_EACH_1 (i, keys_map) {
>           uint32_t hash = 0;
>           uint32_t i_off = i * bit_count_total;
> +
>           for (int h = 0; h < bit_count_total; h++) {
>               hash = hash_add64(hash, blocks_scratch[i_off + h]);
>           }
> @@ -188,12 +192,13 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>       */
>      uint32_t found_map;
>      const struct cmap_node *nodes[NETDEV_MAX_BURST];
> +
>      found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
>  
>      /* Verify that packet actually matched rule. If not found, a hash
>       * collision has taken place, so continue searching with the next node.
>       */
> -    ULLONG_FOR_EACH_1(i, found_map) {
> +    ULLONG_FOR_EACH_1 (i, found_map) {
>          struct dpcls_rule *rule;
>  
>          CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> @@ -236,41 +241,25 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                 subtable->mf_bits_set_unit1);
>  }
>  
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    /* hard coded bit counts - enables compile time loop unrolling, and
> -     * generating of optimized code-sequences due to loop unrolled code.
> -     */
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               5, 1);
> -}
> +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
> +    static uint32_t                                                           \
> +    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
> +                                         struct dpcls_subtable *subtable,     \
> +                                         uint64_t *blocks_scratch,            \
> +                                         uint32_t keys_map,                   \
> +                                         const struct netdev_flow_key *keys[],\
> +                                         struct dpcls_rule **rules)           \
> +    {                                                                         \
> +        /* Hard-coded bit counts enable compile-time loop unrolling and    \
> +         * generation of optimized code sequences from the unrolled code.  \
> +         */                                                                \
> +        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
> +                                   keys, rules, U0, U1);                      \
> +    }
>  
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               4, 1);
> -}
> -
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               4, 0);
> -}
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1);
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1);
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0);
>  
>  /* Probe function to lookup an available specialized function.
>   * If capable to run the requested miniflow fingerprint, this function returns
> @@ -283,14 +272,15 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
>  {
>      dpcls_subtable_lookup_func f = NULL;
>  
> -    if (u0_bits == 5 && u1_bits == 1) {
> -        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> -    } else if (u0_bits == 4 && u1_bits == 1) {
> -        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> -    } else if (u0_bits == 4 && u1_bits == 0) {
> -        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> +#define CHECK_LOOKUP_FUNCTION(U0, U1)                      \
> +    if (!f && u0_bits == U0 && u1_bits == U1) {            \
> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;    \
>      }
>  
> +    CHECK_LOOKUP_FUNCTION(5, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 0);
> +
>      if (f) {
>          VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
>                    u0_bits, u1_bits);
> ---
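> 
> With this incremental, supporting one more miniflow fingerprint (say, a
> hypothetical 9+1 subtable) should boil down to two added lines (untested,
> just to show the intent of the macros):
> 
>     DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 1);
> 
> and, inside dpcls_subtable_generic_probe():
> 
>     CHECK_LOOKUP_FUNCTION(9, 1);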
> 
> Best regards, Ilya Maximets.
> 

