[ovs-dev] Cloning packets for "action: set field"

Saurabh Shah ssaurabh at vmware.com
Fri Aug 15 05:03:26 UTC 2014


Hi Samuel,

Thanks for your patience. I believe all of your concerns are arising due
to a fundamental misunderstanding of the NDIS apis & the buffer management
code. I thought it would useful if I walk you through the
OvsPartialCopyNBL logic by giving an example.
	   
Let's say there is a call to OvsPartialCopyNBL(), with copySize = 50. This
is presumably the size of the header that you want to be able to modify it.

Say, the original NBL looks like following:

Original NBL 
------------
	|-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
	     |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
MappedSystemVa = 0x1000, Next = NULL]


So, here is how the partial copy logic works -
 

01. Inside partial copy API, first we advance the original NBL by copySize
(50 bytes). 

After this operation the original NBL looks like following:

Original NBL
------------
	|-> NB1 [DataLength = 1450, CurrentMdlOffset = 50, Next = NULL]
	     |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
MappedSystemVa = 0x1000, Next = NULL]

Note: CurrentMdlOffset & DataLength get modified.

02. Allocate a clone from the Original NBL.

The original NBL is unchanged & the cloned NBL looks like following:



Clone NBL
---------
   |-> NB1 [DataLength = 1450, CurrentMdlOffset = 0, Next = NULL]
	|-> MdlChain --> MDL2 [ByteCount = 1450, ByteOffset = 50, MappedSystemVa
= 0x1000, Next = NULL]

Note: The cloned NBL has a new MDL shell (i.e MDL2), but is pointing to
the same data buffer VA:0x1000. The ByteOffset is appropriately adjusted
in MDL2 (by 50 bytes).

03. The original NBL gets retreated by copySize.

With this operation, the original NBL becomes *exactly* like how the
caller had given it.

Original NBL 
------------
	|-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
	     |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
MappedSystemVa = 0x1000, Next = NULL]


04. The cloned NBL is retreated by copySize bytes (assume headRoom = 0).

The clone NBL will look like following:

Clone NBL
---------
   |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
	|-> MdlChain --> MDL3 [ByteCount = 50, ByteOffset = 0, MappedSystemVa =
0x2000, Next = MDL2]
		     	   |--> MDL2 [ByteCount = 1450, ByteOffset = 50, MappedSystemVa =
0x1000, Next = NULL]



Note: A new memory descriptor (MDL3) is added in front of the chain and
the new descriptor points to a completely new data buffer, VA:0x2000.

05. We copy ŒcopySize¹ bytes from original NBL to the head of the cloned
NBL. For this we use the ndis call, NdisCopyFromNetBufferToNetBuffer. NDIS
api takes care that 'copySize¹ bytes get copied correctly. This means that
even if the bytes where split in multiple MDLs they will correctly be
copied to a contiguous new data buffer of the cloned NBL.

Note: At the end of this step, the cloned NBL is guaranteed to have
ŒcopySize¹ amount of data in a contiguous data buffer. And the data buffer
is solely owned by the cloned NBL.

06. Allocate NBL context for the cloned NBL, set appropriate flags in the
NBL context & take a ref on the parent NBL context.

To summarize, when partial copy returns, the caller gets a new NBL with
copySize amount of data in a contiguous *writeable* buffer.


Completion Path:
----------------

For most operations that require you to modify the packet, we would have
done a partial copy at some point. After manipulating the partially copied
NBL, the packet gets sent to the appropriate port. Completion is triggered
when NDIS calls our 'FilterSendNetBufferListsComplete' callback. The
complete callback will go ahead and complete the NBL it receives from
NDIS. This will trigger a chain of NBL free's, till it reaches the root
NBL. The root NBL is the original NBL that the extension received in the
ŒFilterSendNetBufferLists¹ callback. When the refcount of the root NBL
reaches 0, we propagate the completion by calling
NdisFSendNetBufferListsComplete for all those root NBLs whose refcount
have become 0.


In theory this is how things are supposed to work. If you still think that
this approach has holes, then let me know and we can do a separate IRC
session just to discuss this. :)

Thanks,
Saurabh


From:  Samuel Ghinet <sghinet at cloudbasesolutions.com>
Date:  Thursday, August 7, 2014 at 10:45 AM
To:  Saurabh Shah <ssaurabh at vmware.com>
Cc:  "dev at openvswitch.org" <dev at openvswitch.org>
Subject:  RE: Cloning packets for "action: set field"


>"it sometimes happens for packets to come in FilterSentNetBufferLists
>callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished
>being sent
>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to
>be called."
>
>I mean, to enter the FilterSentNetBufferLists callback 4 times, each time
>for a NBL. And only after the 4th was done, the
>FilterSentNetBufferListsComplete callback to start being called.
>
>This may mean that it is possible that:
>1) cloning NBL1 into NBL2
>2) call NdisFSendNetBufferLists on NBL1
>3) modify NBL2
>may still corrupt the buffer of NBL1.
>
>Reading from MSDN:
>"As soon as a filter driver calls the NdisFSendNetBufferLists function,
>it relinquishes ownership of the NET_BUFFER_LIST structures and all
>associated resources. "
>
>"Until NDIS calls FilterSendNetBufferListsComplete, the current status of
>the send request is not available to the filter driver. A filter driver
>temporarily releases ownership of all resources that are associated with
>a send request when it calls NdisFSendNetBufferLists. A filter driver
>should never try to examine the NET_BUFFER_LIST structures or any
>associated data after calling NdisFSendNetBufferLists."
>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-u
>s/library/windows/hardware/ff562616%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOoM
>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O0
>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=891f0324d2d101746989
>9e64476a53d80d5825ba5053c86f396bfdee07663515)
>________________________________________
>From: Samuel Ghinet
>Sent: Thursday, August 07, 2014 8:39 PM
>To: Saurabh Shah
>Cc: dev at openvswitch.org
>Subject: RE: Cloning packets for "action: set field"
>
>Hello Saurabh,
>
>[QUOTE]For the set case, we actually send out the packet to all the
>existing output ports before going ahead with the set. This is what the
>function OvsOutputBeforeSetAction does. So, at any given point we only
>have one NBL to deal with.[/QUOTE]
>This may still not solve the problem: when the NdisFSendNetBufferLists
>function is called, the packet is being forwarded to the underlying
>drivers, and thus may spend some time until it is actually sent out. I
>believe this is an asynchronous operation
>(meaning that you just request a "send packet to other drivers"  and may
>move forward before all is finished). I say this because I remember in my
>tests, it sometimes happens for packets to come in
>FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and NBL4, and only
>after NBL4 was finished being sent
>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to
>be called.
>
>[QUOTE]We actually can copy parts of the header (even if they are in a
>single MDL), in fact our code has the infrastructure to do this. Please
>take a look at OvsPartialCopyNBL.[/QUOTE]
>I believe you missunderstand me: I was speaking about how
>NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work.
>Both of them are being used by OvsPartialCopyNBL and the problem is not
>solved.
>The point with NdisAllocateCloneNetBufferList:
>"The clone NET_BUFFER_LIST structure describes the same data that is
>described by the NET_BUFFER_LIST structure at OriginalNetBufferList. NDIS
>does not copy the data that is described by the original MDLs to new data
>buffers. Instead, the cloned structures reference the original data
>buffers. "
>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-u
>s/library/windows/hardware/ff560705%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOoM
>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O0
>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=f85e9ea4310586763bef
>feb77684b3e99a81a8f4a0cb5eadfb4298ee124d7d4f)
>This means that by writing to the clone buffer you actually write to the
>original buffer. The buffer is REFERENCED, not a new one allocated.
>
>The point with NdisRetreatNetBufferDataStart:
>"If there isn't enough unused data space, this function allocates a new
>buffer and an MDL to describe the new buffer and chains the new MDL to
>the beginning of the MDL chain. "
>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-u
>s/library/windows/hardware/ff564527%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOoM
>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O0
>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=6cd96cceddd38b61d064
>fdc3c602d2b0e4e4d8c0f4629ff45f16361d29cf8418)
>This means that if you have an NB with data offset = 40, you retreat by
>40 and write 40 bytes from the "new beginning" of the buffer, you
>actually write to the same MDL.
>
>[QUOTE]The Buffer Management subsystem is one of the major difference
>between both the drivers. Since doing a deep copy of all packets is
>pretty sub-optimal, we (Guolin, in particular) designed the sophisticated
>buffer management system. Yes, it is intricate at first, but we already
>have it! (And for a while now. :))[/QUOTE]
>I particularly find it very intricate. Perhaps it should be refactored a
>bit and documented a bit more.
>For instance, functions such as OvsPartialCopyNBL and
>OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and
>OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I find
>very daunting.
>Perhaps some comments for several parameters used and some comments in
>the function body would help as well.
>What does "Partial copy" actually mean? I understand full copy should
>mean something like "deep copy" / duplicate, but partial copy means
>create new NBL and new NB, and copy src buffer -> dst buffer ( = src
>buffer, because it's the same buffer)??
>
>IMPORTANT:
>I have just made a test with your functionality from OvsPartialCopyNBL,
>and checked the cloned NBL against original NBL:
>original NBL: 0xffffe000`02e74b20
>cloned NBL: 0xffffe000`02a53ac0 (ok)
>
>NBL's FirstNetBuffer
>original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26;
>DataLength = 0x5c
>cloned FirstNetBuffer: 0xffffe000`02fe4890; DataOffset = 0; DataLength =
>0x5c (ok)
>
>CurrentMdl of the FirstNetBuffer
>original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used space
>starting after 0x26 bytes, see above); Next MDL count bytes = 0x32; (0x50
>- 0x26) + 0x32 = 0x5c
>cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL count
>bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below)
>
>If you study CurrentMdl of both, you will see:
>original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but
>excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0
>cloned: is spanning from ...74DC6 -> ...74DF0  (i.e. + 0x32).
>BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also writes to
>originalNb!
>
>Did you study the layout of NBLs, NBs and MDLs and buffers in your memory
>management functions?
>
>Sam
>
>________________________________________
>From: Saurabh Shah [ssaurabh at vmware.com]
>Sent: Wednesday, August 06, 2014 12:50 AM
>To: Samuel Ghinet
>Cc: dev at openvswitch.org
>Subject: RE: Cloning packets for "action: set field"
>
>Hi Samuel,
>
>> -----Original Message-----
>> From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
>> Sent: Tuesday, August 05, 2014 5:00 AM
>> To: Saurabh Shah
>> Cc: dev at openvswitch.org
>> Subject: RE: Cloning packets for "action: set field"
>>
>> Hello Saurabh,
>>
>> I believe there are more complexities associated to this packet cloning.
>> Consider for instance the following flow:
>> if (conditions on packet) =>
>> out_to_port: vm1;
>> set ip_dest: X1; out_to_port: vm2
>> set ip_dest: X2; out_to_port: vm3
>>
>> Given that multiple destinations are set, and only then the packet is
>> outputted, then, given the issue with cloning, all packets (that will be
>> outputted to vm1, vm2, vm3) will have the same, last destination ip.
>>
>
>For the set case, we actually send out the packet to all the existing
>output ports before going ahead with the set. This is what the function
>OvsOutputBeforeSetAction does. So, at any given point we only have one
>NBL to deal with.
>
>> I believe we cannot actually copy only the headers: if the packet
>>contains a
>> single MDL, there is nothing else we can do but to duplicate the packet.
>> For the case where the NB has multiple MDLs chained, we must copy the
>>first;
>> MDL(s), i.e. the MDL or MDLs that contain the headers that must be
>>copied.
>>
>
>We actually can copy parts of the header (even if they are in a single
>MDL), in fact our code has the infrastructure to do this. Please take a
>look at OvsPartialCopyNBL.
>
>> There is also a problem with NBL completion (say we do the MDL copy and
>> replacement):
>> for the case above, consider:
>> out_to_port: vm1;  (NB1)
>> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL, store
>>ptr
>> to original MDL) set ip_dest: X2; out_to_port: vm3(clone to NB3, create
>>new
>> MDL, store ptr to original MDL)
>>
>> At completion, we may need to do this:
>> NB3: Put back old MDL into original NB, free clone NB3
>> NB2: Put back old MDL (but already done before), and free NB2
>> NB1: it's the original, so nothing special.
>>
>> In order to implement this "but already done before" case, we need to
>>use
>> ref counting for NBL context data (which I have seen you do have some
>>use
>> of ref counting) However, consider the complexity of restoring an
>>original NB
>> when multiple MDLs have been replaced, by multiple clones. The big
>> problem here is that the whole MDL link would be broken.
>>
>
>The Buffer Management subsystem is one of the major difference between
>both the drivers. Since doing a deep copy of all packets is pretty
>sub-optimal, we (Guolin, in particular) designed the sophisticated buffer
>management system. Yes, it is intricate at first, but we already have it!
>(And for a while now. :)) So, all that you have mentioned here has
>already been implemented. Please take a look at our Buffer Management
>code & let me know if something is still unclear.
>
>> I strongly believe that building this component with the mind set
>>"optimize
>> first, fix bugs later" is not at all a good way to do it. Plus, it
>>would make the
>> code more and more intrincate, as we continue to both find new problems
>> and we get to have large pieces of code (dealing with NBLs) spread all
>>over.
>> I personally believe a better approach would be "start with something
>> simpler, which is fully functional even though suboptimal, and then
>>continue
>> by building on top of it, improving incrementally for each of various
>>cases".
>>
>
>Quite the contrary. The whole point of implementing the buffer management
>code was to layout the right infrastructure to help simplify NBL
>management for all the consumers & avoid doing a major enhancement at a
>later point. I can go off on a rant explaining our rationales in detail,
>but to allay your concerns, let me assure you that we don't have a
>mindset of optimize-over-bugfix. Once the Netlink integration is
>complete, I expect all of us to work towards closing all critical bugs at
>a priority.
>
>> Thanks!
>> Sam
>
>HTH,
>Saurabh




More information about the dev mailing list