[ovs-dev] [PATCH 1/1] sFlow: adds sFlow export

Neil McKee neil.mckee at inmon.com
Mon Nov 9 20:30:09 UTC 2009


On Nov 9, 2009, at 11:02 AM, Ben Pfaff wrote:

> Neil McKee <neil.mckee at inmon.com> writes:
>
>> sFlow (http://www.sflow.org) packet sampling/counter-polling added
>> (only to the kernel-module datapath).
>>
>> Some issues to be addressed are highlighted with '???' in the  
>> comment.
>
> First, thank you very much!  This is very helpful.  I think that
> sFlow will be a valuable addition to the OVS feature set.
>
> I couldn't apply this, because many lines were word-wrapped.
> Could you re-send it after reconfiguring your email client not to
> wrap lines?  Or you could use "git send-email" or "git
> imap-send", although sometimes they can be troublesome to set up.
>
> There is some information about configuring some email clients
> here:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/ 
> linux-2.6.git;a=blob_plain;f=Documentation/email-clients.txt
>
> Or you could push your patch to a public Git server (many are
> available, e.g. repo.or.cz) and tell us where to pull from.
>
> Once you do that, I think I'll be able to provide more
> constructive feedback, since then I can more easily provide
> suggestions in the form of patches.
>

Trying git-send-email now.


>> This is released under the sFlow license
>> (http://www.inmon.com/technology/sflowlicense.txt
>> ).
>
> That makes perfect sense for the files from InMon_Agent.  Could
> we have permission to put the rest of the changes under the
> Apache 2 license used elsewhere for the Open vSwitch code?  And
> to put the kernel code under GPLv2?  Thanks.
>

Yes.  We have no problem with that.

> I promised you a review today, so I'll take an initial look at
> the code without being able to compile it.
>
> First, the kernel code should use one hard tab per indentation
> level, but I'm seeing a few spaces per level instead.  Please use
> u32 (e.g.) instead of u_int32_t consistently in kernel code.

OK.

>
>> --- a/datapath/actions.c
>> +++ b/datapath/actions.c
>> @@ -350,18 +350,105 @@ output_control(struct datapath *dp, struct
>> sk_buff *skb, u32 arg, gfp_t gfp)
>>  	return dp_output_control(dp, skb, _ODPL_ACTION_NR, arg);
>>  }
>> +
>> +/* Send a copy of this packet up to the sFlow agent, along with
>> +   extra information about what happened to it. */
>> +void
>> +sflow_take_sample(struct datapath *dp, struct sk_buff *skb,
>> +		    struct odp_flow_key *key,
>> +		    const union odp_action *a, int n_actions,
>> +		    gfp_t gfp)
>> +{
>> +        /* ??? is skb->len always the packet length? */
>> +        u32 frame_len = skb->len;
>
> Yes, skb->len is always the packet length.  sk_buffs have lots of
> members and I still get confused myself, but that one's easy :-)

That's a relief.

>
>> +	struct sflow_packet_sample_header *hdr;
>> +	/* ??? use skb_cow() instead? */
>> +	struct sk_buff *skbcopy = skb_copy_expand(skb, sizeof(*hdr), 0,  
>> gfp);
>
> In general, I think that we can do most of what
> sflow_take_sample() does in userspace.  Userspace should have the
> set of flow actions (since it added them in the first place).  I
> guess the only trick, if there is one, is that sFlow packets
> might show up in userspace for a flow after it's been removed.
> However, that's not very likely (since flows usually get removed
> only after they've gone idle) so perhaps we could ignore it on a
> first pass.

I'll defer to you on that choice,  though it does seem like it might  
cause an awkward race someday.   sFlow samples should be processed at  
a low priority (e.g. compared to spanning-tree updates) so they could  
languish in a queue for some time.  I think the standard allows you  
take as long as a second.  If that ever happened then the forwarding  
details might have changed or disappeared.

>
> I'm mostly thinking out loud here--I don't really expect you to
> change this particular piece of the design yourself.  It's the
> kind of thing we might do here though.
>
>>  int execute_actions(struct datapath *dp, struct sk_buff *skb,
>>  		    struct odp_flow_key *key,
>>  		    const union odp_action *a, int n_actions,
>>  		    gfp_t gfp)
>>  {
>> +
>
> Please don't add extra blank lines.
>

OK.

>>  	/* Every output action needs a separate clone of 'skb', but
>> the common
>>  	 * case is just a single output action, so that doing a clone and
>>  	 * then freeing the original skbuff is wasteful.  So the
>> following code
>>  	 * is slightly obscure just to avoid that. */
>>  	int prev_port = -1;
>>  	int err;
>> +
>> +        if(dp->sflow_sampling_rate) {
>> +	    /* The sample-pool is the total number of packets that
>> *could* have been sampled. */
>> +	    dp->sflow_sample_pool++;
>> +	    if(--dp->sflow_countdown == 0) {
>> +		/* Sample this packet, then compute the next random skip. */
>> +		dp->sflow_countdown = (net_random() &
>> dp->sflow_sampling_mask) + dp-
>>> sflow_sampling_add;
>> +		sflow_take_sample(dp, skb, key, a, n_actions, gfp);
>> +	    }
>> +	}
>> +
>
> This code can run any number of CPUs at once, so at a minimum
> we'd need use to use a spinlock to protect the new "sflow_"
> members of 'dp'.  But that will likely introduce unnecessary
> cache line bouncing, so it would be better to use per-CPU data.
>
> But could we just use net_random() and avoid global data
> entirely?  I see that this is allowed by the sFlow specification.
> net_random() is already implemented in terms of per-CPU data, and
> the net_random() PRNG's randomness should be suitable for the
> requirements.
>
> I'm really tempted, in fact, to make sFlow sampling a flow
> action.  The ODPAT_CONTROLLER action (struct
> odp_action_controller), which sends a packet to userspace, has 16
> bits of "reserved" space.  We could use those bits for specifying
> a probability: 0xffff means always send, 0x8000 means send half
> the time, and so on.  Then userspace could turn sFlow on and off
> on a per-flow basis (that wouldn't necessarily be useful in the
> general case, but I could see wanting sFlow enabled for certain
> switch ports and not others, at the very least).
>
> Then we could also drop the configuration ioctls that you added.
>

Yes.  The code we added was assuming that a call to net_random()
is much more expensive than a decrement-and-test countdown,  but
if it works better to just test every packet against a sampling
probability then that is fine.   I think this may restrict you to  
choosing
powers-of-2 for your effective-sampling-rate settings,  but that is
OK.   If someone asks for 1-in-100 it is fine to choose 1-in-128,
and report that in the sFlow samples that you send.


>> +static int
>> +dpif_netdev_set_sflow_sampling_rate(struct dpif *dpif, int
>> sampling_rate)
>> +{
>> +    struct dp_netdev *dp = get_dp_netdev(dpif);
>> +    dp->sflow_sampling_rate = sampling_rate;
>> +    /* This is probably where we should compute the sampling  
>> countdown
>> +     *  vars just as we do for the linux module datapath, right???
>> +     */
>> +    return 0;
>> +}
>
> Yes, that's right.
>
>> +/* sFlow library callback to allocate memory. */
>> +static void *
>> +sflow_agent_alloc_cb(void *magic, SFLAgent *agent, size_t bytes)
>> +{
>> +    return calloc(1, bytes);
>> +}
>> +
>
> I guess that the sFlow library gracefully handles allocation
> failure?  If not, there is also an xcalloc() function that will
> abort on allocation failure.
>

The sFlow library only ever allocates memory using a callback,
so we can change it to use xcalloc in:
ofproto.c:sflow_agent_alloc_cb()


>> +	/* it seems like we may need to add functions to netdev.c and
>> netdev-
>> linux.c
>> +	   to preserve the abstraction and portability here??? Comment
>> it out for now.
>> +
>> +	if(port->netdev && port->netdev->class == &netdev_linux_class) {
>> +	    struct ethtool_cmd ecmd;
>> +	    int error;
>> +	    memset(&ecmd, 0, sizeof ecmd);
>> +	    error = netdev_linux_do_ethtool(port->netdev, &ecmd,
>> ETHTOOL_GSET, "ETHTOOL_GSET");
>> +	    if (!error) {
>> +		u_int64_t speed64 = 0L;
>> +		switch(ecmd.speed) {
>> +		case SPEED_10: speed64 = 10000000L; break;
>> +		case SPEED_100: speed64 = 100000000L; break;
>> +		case SPEED_1000: speed64 = 1000000000L; break;
>> +		case SPEED_10000: speed64 = 10000000000; break;
>> +		default: break;
>> +		}
>> +		genElem.counterBlock.generic.ifSpeed = speed64;
>> +		genElem.counterBlock.generic.ifDirection = ecmd.duplex ? 1 : 2;
>> +		genElem.counterBlock.generic.ifStatus = 3;
>> +	    }
>> +	}
>> +	*/
>
> I think all you need here is a call to netdev_get_features()
> followed by a look at the bits in the "current" return value.
> It's not in the form that you want, though, I guess, since it's a
> bitmap, but it should provide the information you need.
>

I should have spotted that.   OK.

>> +	/* Questions still to be answered:
>> +	   1. Is the multicast counter filled in?
>> +	   2. Does the multicast counter include broadcasts?
>> +	   3. Does the rx_packets counter include multicasts/broadcasts?
>> +	*/
>
> Good questions, I don't have answers at the moment.  Probably
> we'll have to either add back some of the kernel counters you had
> there before or count broadcasts based on flow data.
>

This decision can be postponed.  It's not so important.

>> +static int
>> +sflow_choose_agent_address(struct ofproto *p,
>> +			   const char *agent_device, const char *controlIP,
>> +			   SFLAddress *agent_addr)
>> +{
> ...
>> +    if(agent_device) {
>> +	struct netdev *netdev = NULL;
>> +	int error;
>> +	error = netdev_open(agent_device, NETDEV_ETH_TYPE_NONE, &netdev);
>> +	if(error == 0 && netdev) {
>
> You can just check !error here; if error is 0 then netdev is
> guaranteed to be nonnull.

OK.

>
>> +	    error = netdev_get_in4(netdev, paddr4, NULL);
>> +	    netdev_close(netdev);
>> +	    if(error == 0) {
>> +		return 0;
>> +	    }
>> +	}
>
>> +int
>> +ofproto_set_sflow(struct ofproto *p, const struct svec *collectors,
>> +		  uint32_t sampling_rate, uint32_t polling_interval,
>> +		  uint32_t header_len, int bridge_no,
>> +		  const char *agent_device, const char *controlIP)
>> +{
>
> With that many parameters I would ordinarily introduce a new
> struct.
>

Yeah,  the list kinda grew.

>> +      time_t now = time(NULL);
>
> Please use the OVS-specific function time_now() in place of
> time(NULL) here.
>

Well spotted.

>> +
>> +      p->sflow_agent = (SFLAgent *)xcalloc(1, sizeof(SFLAgent));
>
> Please write this as:
>         p->sflow_agent = xcalloc(1, sizeof *p->sflow_agent);
>

OK.

>> +      sfl_agent_init(p->sflow_agent,
>> +		     &agentIP,
>
>
>> +	      /* Try to get the global ifIndex here, as understood by
>> the OS. That way it will
>> +		 tie in properly with, say, an SNMP agent running on the host. */
>
> I guess it's time for the ifIndex discussion.  I'll follow up on
> your other email separately.
>

Yes,  worth a separate thread.

>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -460,6 +462,7 @@ bridge_reconfigure(void)
>>      struct svec old_br, new_br;
>>      struct bridge *br, *next;
>>      size_t i;
>> +    int sflow_bridge_number;
>
>>      COVERAGE_INC(bridge_reconfigure);
>
>> @@ -564,6 +567,7 @@ bridge_reconfigure(void)
>>          svec_destroy(&want_ifaces);
>>          svec_destroy(&add_ifaces);
>>      }
>> +    sflow_bridge_number = 0;
>>      LIST_FOR_EACH (br, struct bridge, node, &all_bridges) {
>>          uint8_t ea[8];
>>          uint64_t dpid;
>
> Is it important for the sflow_bridge_number to be stable?  This
> function gets called whenever OVS is reconfigured (whenever it
> receives SIGHUP, etc.).  The loop always runs in alphabetical
> order, so adding or removing bridges will renumber all the ones
> after that bridge in alphabetical order.
>

I was using the bridge-number to help in two ways:
1.  as the sFlow agent "SubId".  This goes out with every sample. It
indicates to the collector that there is a separate "sub-agent" here.
Two sub-agents can have the same IP address,  but their
sequence numbers will increment independently.
2.  To help make up ifIndex numbers that are unique across
all bridges.

So the answer is probably that it is desirable for the number to
be stable.  I think there was a suggestion made about using
some bits from the MAC address?   That would probably work.
If the bridge-number for an agent changes,  then it should
reset all it's sequence numbers to indicate a discontinuity to
the collector -- in case this new SubId is one that was being
used before for a different bridge.

>> @@ -626,6 +630,43 @@ bridge_reconfigure(void)
>>          }
>>          svec_destroy(&nf_hosts);
>
>> +	{
>> +	  uint32_t sampling_rate = SFL_DEFAULT_SAMPLING_RATE;
>> +	  uint32_t polling_interval = SFL_DEFAULT_POLLING_INTERVAL;
>> +	  uint32_t header_bytes = SFL_DEFAULT_HEADER_SIZE;
> ...
>
> Probably this should be in a new function.
>

OK.

>> @@ -640,7 +681,7 @@ bridge_reconfigure(void)
>>      LIST_FOR_EACH (br, struct bridge, node, &all_bridges) {
>>          for (i = 0; i < br->n_ports; i++) {
>>              struct port *port = br->ports[i];
>> -
>> +	
>>              port_update_vlan_compat(port);
>>              port_update_bonding(port);
>>          }
>
> I think you can drop that null change.


Yup.  This one is clearly not mission-critical :)


Regards,
Neil





More information about the dev mailing list