[ovs-dev] [PATCH v3 1/3] datapath-windows: Support for custom VXLAN tunnel port

Nithin Raju nithin at vmware.com
Mon Apr 20 23:24:14 UTC 2015


hi Sorin,
I looked at your replies, and responded to the ones that required a response. I’ll await the next version of the patch. In general, we want to make sure the add/deletion workflow follows the pattern:

a. During OVS_VPORT_CMD_NEW, tunnel vport does not “exist” until OvsTunnelVportPendingInit() is called back successfully, and the vport gets insert into the vport hash tables. We can also handle duplicate requests by checking for duplicates before insertion into the hash tables.

b. During OVS_VPORT_CMD_DEL, we should delete the tunnel vport from the hash tables before calling into OvsCleanupVxlanTunnel(), and OvsTunnelVportPendingUninit() should only be responsible for deallocating the vport structure. Everything else should be done by that time.

It is ok for OvsInitVxlanTunnel() and OvsCleanupVxlanTunnel() to submit the requests to different WFP threads.

This was the important response. Otherwise were mostly Nits.

thanks for all the contribution and patience.

-- Nithin


> On Apr 15, 2015, at 10:58 AM, Sorin Vinturis <svinturis at cloudbasesolutions.com> wrote:
> 
> Nithin,
> 
> Thanks for reviewing my patch. In the future I will try to break big code changes into smaller patches to ease the code review process.
> 
> I will add more comments to the added data structures and functions. Also a diagram to explain the workflow.

Thanks Sorin. It helps readability.


>>   if (status == STATUS_PENDING) {
>> -        return status;
>> +        /* STATUS_PENDING is returned by the NL handler when the request is
>> +         * to be processed later, so we mark the IRP as pending and complete
>> +         * it in another thread when the request is processed. */
>> +        IoMarkIrpPending(irp);
> 
> Do we need to mark the IRP as pending explicitly? We already have working code where we returned STATUS_PENDING and completed the IRP later. My understanding is that as long as the IRP is not completed, it is in pending state.
> 
> I am fine if this is needed/nice-to-have/makes-things-clear.
> SV: If we return from the Dispatch routine without completing the IRP, we MUST (a) mark the IRP pending, before the IRP is completed to the caller, by calling IoMarkIrpPending(), and (b) return STATUS_PENDING from the Dispatch routine. That is a rule, not a nice-to-have thing.

[Nithin]: sounds good. thanks for explaining.


>>      return status;
>>   }
>>   return OvsCompleteIrpRequest(irp, (ULONG_PTR)replyLen, status);
>> }
>> diff --git a/datapath-windows/ovsext/Netlink/NetlinkError.h b/datapath-windows/ovsext/Netlink/NetlinkError.h
>> index 53c935f..fde46b0 100644
>> --- a/datapath-windows/ovsext/Netlink/NetlinkError.h
>> +++ b/datapath-windows/ovsext/Netlink/NetlinkError.h
>> @@ -195,8 +195,10 @@ typedef enum _NL_ERROR_
>>   NL_ERROR_TIMEDOUT = ((ULONG)-138),
>>   /* The given text file is busy */
>>   NL_ERROR_TXTBSY = ((ULONG)-139),
>> -    /*the operation would block */
>> +    /* The operation would block */
>>   NL_ERROR_WOULDBLOCK = ((ULONG)-140),
>> +    /* The operation is not finished */
>> +    NL_ERROR_PENDING = ((ULONG)-141),
>> } NL_ERROR;
>> 
>> static __inline
>> @@ -215,7 +217,11 @@ NlMapStatusToNlErr(NTSTATUS status)
>>   case STATUS_SUCCESS:
>>     ret = NL_ERROR_SUCCESS;
>>     break;
>> +    case STATUS_PENDING:
>> +      ret = NL_ERROR_PENDING;
>> +      break;
> 
> Do we need NL_ERROR_PENDING? Userspace does not really expect an output message if the return value is STATUS_PENDING, right?
> SV: I have added the new NL error for pending in order to make the behavior of the OvsNewVportCmdHandler function consistent. In the latter function the OvsInitTunnelVport returns STATUS_PENDING and NlMapStatusToNlErr is called to map the status to a NL error. Then NL error is used in all across the function for result interpretation. Also, in the cleanup phase the NL error is checked, not the status.

[Nithin]: sounds good. In that case, can we add an ASSERT() in NlBuildErrorMsg() to make sure that we don’t pack a NL_ERROR_PENDING into the netlink message?
> 

>>   default:
>> +        ret = NL_ERROR_OTHER;
>>     break;
>>   }
>> 
>> diff --git a/datapath-windows/ovsext/Switch.c b/datapath-windows/ovsext/Switch.c
>> index a228d8e..cf5e3c4 100644
>> --- a/datapath-windows/ovsext/Switch.c
>> +++ b/datapath-windows/ovsext/Switch.c
>> @@ -261,8 +261,8 @@ OvsDeleteSwitch(POVS_SWITCH_CONTEXT switchContext)
>>   if (switchContext)
>>   {
>>       dpNo = switchContext->dpNo;
>> -        OvsTunnelFilterUninitialize(gOvsExtDriverObject);
>>       OvsClearAllSwitchVports(switchContext);
>> +        OvsTunnelFilterUninitialize(gOvsExtDriverObject);
> 
> I see a little tricky situation here. If there are lots of tunnel ports, then the driver gets unloaded, OvsClearAllSwitchVports() would end up submitting a whole bunch of requests to the WFP threads. Soon after that, OvsTunnelFilterUninitialize() would try to stop the threads. Is this situation handled?
> SV: The situation that you have mentioned is not handled correctly. In that specific case all requests will be completed with STATUS_CANCELLED. That is OK for the remaining VXLAN tunnel creation requests, but not OK for the remaining tunnel deletion requests. I will fix this in the next version of this patch.

[Nithin]: sound good. If you are not fixing it in this patch itself, adding an XXX comment is also ok with me.


>> 
>> +typedef struct _OVS_TUNFLT_REQUEST {
>> +    LIST_ENTRY              entry;
>> +    /* Tunnel filter destination port. */
>> +    UINT16                  port;
>> +    /* Tunnel filter identification used for filter deletion. */
>> +    UINT64                  ID;
>> +    /* Pointer used to return filter ID to the caller on filter creation. */
>> +    UINT64*                 retID;
> 
> Nit: use
> UINT64 *retID or
> PUINT64 retID
> 
> I see why you have a retID and ID separated out. This seems to be because the filter create uses ‘retID', and filter delete uses ‘ID’. In OvsTunelFilterDelete(), you are passing the address of a stack variable, and storing it in ‘retID’. This can be confusing. If you want to use the same POVS_TUNFLT_REQUEST, you can create a union of these two values and use them accordingly. It can be confusing to debug later on.
> SV: Here, the '*retID' points to the '((POVS_VXLAN_VPORT) vxlan)->filterID' memory address from heap and this is how I can return the filter ID to the caller, when creating the tunnel filter at a later time by one of the tunnel filter threads. The problem of using this pointer comes when the tunnel filter needs to be deleted. In the latter case the OvsCleanupVxlanTunnel is called, the tunnel filter delete request is recorded but not yet processed, and the POVS_VXLAN_VPORT structure is released. Thus, by the time the filter delete request is processed, the 'retID' pointer will point to a freed memory space.

[Nithin]: I don’t disagree with the reasoning. All I am saying is, you could use a union since we know that only one of them will be active at any given time. XXX: explained later too.

>> +    /* Callback function called before completing the IRP. */
>> +    PFNTunnelVportPendingOp callback;
>> +    /* Context passed to the callback function. */
>> +    PVOID                   context;
>> +} OVS_TUNFLT_REQUEST, *POVS_TUNFLT_REQUEST;
> 
> Pls. add a L4 protocol also along with the port. So, VXLAN would say UDP. You can do it later, no worries.
> SV: I can add the L4 port also.

[Nithin]: thanks.

>> 
>> -/* Callout driver implementation */
>> +/*
>> + * Callout driver implementation.
>> + */
>> 
>> NTSTATUS
>> 	(HANDLE *handle)
>> @@ -122,14 +184,11 @@ OvsTunnelEngineOpen(HANDLE *handle)
>>   NTSTATUS status = STATUS_SUCCESS;
>>   FWPM_SESSION session = { 0 };
>> 
>> -    /* The session name isn't required but may be useful for diagnostics. */
>> -    session.displayData.name = OVS_TUNNEL_SESSION_NAME;
>>   /*
>>   * Set an infinite wait timeout, so we don't have to handle FWP_E_TIMEOUT
>>   * errors while waiting to acquire the transaction lock.
>>   */
>>   session.txnWaitTimeoutInMSec = INFINITE;
>> -    session.flags = FWPM_SESSION_FLAG_DYNAMIC;
> 
> Is there any specific reason to remove the ‘session.displayData.name’ and ‘session.flags’?
> SV: I have removed the session display name, because I am using the same function to create new engine sessions for each new thread and the session name should not be the same.
> Also, I have removed the session dynamic flag because I needed a static session, which is the default session type. And that was required in order for another session to reference the callout by ID. If the session was dynamic, one cannot add a filter with a callout created in another session.

[Nithin]: sounds good.

>> VOID
>> -OvsTunnelUnregisterCallouts(VOID)
>> +OvsTunnelUnregisterCallouts()
>> {
>> -    OvsTunnelRemoveFilter(&OVS_TUNNEL_FILTER_KEY,
>> -                          &OVS_TUNNEL_SUBLAYER);
>>   FwpsCalloutUnregisterById(gCalloutIdV4);
>> +    FwpmSubLayerDeleteByKey(gEngineHandle, &OVS_TUNNEL_SUBLAYER);
>>   FwpmCalloutDeleteById(gEngineHandle, gCalloutIdV4);
>> -    OvsTunnelEngineClose(&gEngineHandle);
> 
> There’s an assumption here that all the VXLAN ports and the corresponding filters have been removed. Can you pls. add some ASSERT to make sure this is true. i.e. all the filters have been removed before we unregister the callout and remove the sublayer. You can perhaps count the number of filters we added for this callout and ASSERT it is 0.
> SV: No WFP filters are added using the gEngineHandle session. All filters are added in the tunnel filter threads, which have their own requests queue and engine session.

[Nithin]: sounds good.

>> +NTSTATUS
>> +OvsTunnelRemoveFilterEx(HANDLE engineSession,
>> +                        UINT64 filterID)
>> +{
>> +    NTSTATUS status = STATUS_SUCCESS;
>> +    BOOLEAN  error = TRUE;
>> +
>> +    do {
>> +        if (0 == filterID) {
> 
> Nit: if you are not particular, we can use ‘filterID == 0’.
> SV: The reason I am accustomed to use this ordering is to avoid issues when adding 'if (filterID = 0) {' by mistake. The compiler won't complain, because the code is syntactically correct, but it is logically incorrect. I can change it if you don't like it.  

[Nithin]: if you don’t mind, please do change :)

>> +            OVS_LOG_INFO("No tunnel filter to remove.");
>> +            break;
>> +        }
>> +
>> +        status = FwpmFilterDeleteById(engineSession, filterID);
>> +        if (!NT_SUCCESS(status)) {
>> +            OVS_LOG_ERROR("Failed to remove tunnel with filter ID: %d,\
>> +                           status: %x.", filterID, status);
>> +            break;
>> +        }
>> +        OVS_LOG_INFO("Filter removed, filter ID: %d.",
>> +                     filterID);
>> +
>> +        error = FALSE;
>> +    } while (error);
> 
> ‘error’ is not really being used to signal anything. You can use a do {} while(0); if you wish to.
> SV: I did that, but the compiler won't let me. That is the reason for using the variable.

[Nithin]: interesting. that’s fine then.

>> +    case OVS_TUN_FILTER_DELETE:
>> +        status = OvsTunnelRemoveFilterEx(engineSession,
>> +                                         request->ID);
>> +        break;
>> +    default:
> 
> Add ASSERT() here, reason being the caller will end up waiting forever if a wrong code was specified.
> SV: I will set an error to the 'status' variable that will cause the request to be completed with the specified error.

[Nithin]: even better.

>> +POVS_TUNFLT_REQUEST
>> +OvsTunnelFilterRequestPop(POVS_TUNFLT_REQUEST_LIST listRequests)
>> +{
>> +    POVS_TUNFLT_REQUEST request = NULL;
>> +
>> +    NdisAcquireSpinLock(&listRequests->spinlock);
>> +
>> +    if (!IsListEmpty(&listRequests->head)) {
>> +        request = (POVS_TUNFLT_REQUEST) RemoveHeadList(&listRequests->head);
>> +        listRequests->numEntries--;
>> +    }
>> +
>> +    NdisReleaseSpinLock(&listRequests->spinlock);
>> +
>> +    return request;
>> +}
> 
> OvsTunnelFilterRequestPopCrtList() is an optimized version of OvsTunnelFilterRequestPop(). I think we can make do with just the former. Less code the better and easier to read.
> SV: The new function I have added is faster. It gets all current requests in the queue, while acquiring the spin lock only once. The thread's request queue is emptied and other requests are added as they come. There is no reason to pop the requests one by one, while waiting each time for the spin lock to be acquired, because all the requests are destined for the current thread and will be handled by the same thread.

[Nithin]: I meant to say that just do with OvsTunnelFilterRequestPopCrtList(). We can pop the list at once, and then loop over the list. We don’t need a version that pops one at a time. I think we are saying the same thing.

> Also, renaming OvsTunnelFilterRequestPopCrtList => OvsTunnelFilterRequestPopList() would make it easier to read.
> SV: Sure, I can rename the function.

[Nithin]: sounds good.


> 
>> +
>> +VOID
>> +OvsTunnelFilterRequestPush(POVS_TUNFLT_REQUEST_LIST listRequests,
>> +                           POVS_TUNFLT_REQUEST request)
>> +{
>> +    NdisAcquireSpinLock(&listRequests->spinlock);
>> +
>> +    InsertTailList(&listRequests->head, &(request->entry));
>> +    listRequests->numEntries++;
>> +
>> +    NdisReleaseSpinLock(&listRequests->spinlock);
>> +}
>> +
>> +VOID
>> +OvsTunnelFilterRoundRobinPush(POVS_TUNFLT_REQUEST request)
>> +{
>> +    /* If the new request contains the same filter port like the last one,
>> +     * then send the request to the same/previous thread. */
>> +    if ((0 != request->port) && (gLastQueuedFilterPort != request->port))
> 
> Is it possible for request->port to be 0? Shouldn’t it have been validated way before itself in VPORT add? You can add an ASSERT if you wish. Also, what happens if the request->port is 0? We go ahead and add the port anyway? This seems wrong.
> SV: It is not wrong. All tunnel delete requests have the 'request->port' equal to zero and that is how I am identifying it. Sure, I could use the 'request->operation' for that. I can change it.
> 
> I saw later that request->port == 0 is true during OVS_TUN_FILTER_DELETE. You might as well check for the request type than request->port == 0.
> SV: I will change to check for the operation instead of the port. Sure.

[Nithin]: ok.

> Also, I don’t see the necessity for using the ‘gLastQueuedFilterPort’ port. This can happen in the following cases:
> a. There are two different tunneling protocols that are using the same destination L4 port, and both of them are being created in succession.
> b. A tunneling port is getting deleted, and re-created.
> 
> The possibility of this is remote. I think you can nuke ‘gLastQueuedFilterPort’. The global variable is not protected anyway.
> 
> The hashing logic is good. I like it esp. if there’s a flurry of port-creation requests.
> 
> But, I am wondering if we really need a ‘gRoundRobinThreadIndex’. I see the following problems:
> 1. It is not protected.
> 2. You can just hash on the port number directly, and assuming that the port numbers are in host order and in a sequence, you should get good distribution.
> 
> In other words:
> queueIndex = request->port % OVS_TUNFLT_MAX_THREADS.
> 
> Are you worried about a DOS attack involving create-port & delete-port for the same L4 port#? That should be handled at a higher layer in Vport.c. Fail a delete-port request until a previous crete-port is successful.
> 
> Maybe I am missing something about the need for ‘gRoundRobinThreadIndex’ and ‘gLastQueuedFilterPort’. Pls. feel free to explain.
> SV: In case an existing VXLAN interface is set with a different destination port, userspace will issue two vport command operations, OVS_VPORT_CMD_NEW and OVS_VPORT_CMD_DEL - in this order. If a tunnel filter create request, corresponding to an OVS_VPORT_CMD_NEW operation, is assigned to thread 1 and the tunnel delete request is assigned to thread 2, this could be a potential problem if the tunnel thread delete request is processed before the tunnel create request.

[Nithin]: you mean, “problem if the tunnel thread *create* request is processed before the tunnel *delete* request”?

Sure, I didn’t think about this usecase, but it seems to be a valid problem. There seem to be 2 different problems when the port number gets updated in userspace:
1. Userspace calls into kernel using OVS_VPORT_CMD_NEW followed by OVS_VPORT_CMD_DEL
2. Userspace calls into kernel with OVS_VPORT_CMD_SET with updated port number.

#1 should be easy to handle. If we make clear as to when a Tunnel vport “exists" and “does not exist”, and synchronize the two operations, we can handle it neatly:
a. During OVS_VPORT_CMD_NEW, tunnel vport does not “exist” until OvsTunnelVportPendingInit() is called back successfully, and the vport gets insert into the vport hash tables. We can also handle duplicate requests by checking for duplicates before insertion into the hash tables.

b. During OVS_VPORT_CMD_DEL, we should delete the tunnel vport from the hash tables before calling into OvsCleanupVxlanTunnel(), and OvsTunnelVportPendingUninit() should only be responsible for deallocating the vport structure. Everything else should be done by that time.

It is ok for OvsInitVxlanTunnel() and OvsCleanupVxlanTunnel() to submit the requests to different WFP threads.

For #2, it can get a little tricky. I have to think through it.

My comments above about ‘gRoundRobinThreadIndex’ and ‘gLastQueuedFilterPort’ are still valid.

>> 
>> +VOID
>> +OvsTunnelFilterThreadStop(POVS_TUNFLT_THREAD_CONTEXT threadCtx)
>> +{
>> +    if (threadCtx->threadObject) {
>> +        /* Signal stop thread event. */
>> +        KeSetEvent(&threadCtx->stopEvent, IO_NO_INCREMENT, FALSE);
>> +
>> +        /* Wait for the tunnel thread to finish. */
>> +        KeWaitForSingleObject(threadCtx->threadObject,
>> +                              Executive,
>> +                              KernelMode,
>> +                              FALSE,
>> +                              NULL);
> 
> Just for my information, when does 'threadCtx->threadObject' get signaled?
> SV: The 'threadCtx->threadObject' is signaled when the thread exits.

[Nithin]: thanks for the explanation.

>> 
>>   vport->ovsState = OVS_STATE_PORT_CREATED;
>> @@ -849,11 +864,14 @@ OvsInitPhysNicVport(POVS_VPORT_ENTRY physExtVport,
>> * --------------------------------------------------------------------------
>> */
>> NTSTATUS
>> -OvsInitTunnelVport(POVS_VPORT_ENTRY vport,
>> +OvsInitTunnelVport(PVOID context,
>> +                   POVS_VPORT_ENTRY vport,
> 
> PVOID context can be
> POVS_USER_PARAMS_CONTEXT usrParamsCtx.
> 
> No reason for making it PVOID.
> SV: Yes, there is. The POVS_USER_PARAMS_CONTEXT is not available from the Vport.h header.

[Nithin]: sounds good. You can update the parameter name.

>> 
>> +    do {
>> +        if (NULL == tunnelContext) {
>> +            nlError = NL_ERROR_INVAL;
>> +            break;
>> +        }
> 
> Can this happen? Shouldn’t this be an ASSERT?
> 
>> +
>> +        if ((NULL == tunnelContext->vport) ||
>> +            (NULL == tunnelContext->inputBuffer) ||
>> +            (NULL == tunnelContext->outputBuffer)) {
>> +            nlError = NL_ERROR_INVAL;
>> +            break;
>> +        }
> 
> Can this happen? Shouldn’t this be an ASSERT?
> 
>> +
>> +        if (!NT_SUCCESS(status)) {
>> +            nlError = NlMapStatusToNlErr(status);
> 
> ASSERT(nlError != NL_ERROR_INVAL) would be useful IMO.

[Nithin]: do you think these comments are valid?

>> 
>> +        if ((NULL == tunnelContext->vport) ||
>> +            (NULL == tunnelContext->inputBuffer) ||
>> +            (NULL == tunnelContext->outputBuffer)) {
>> +            nlError = NL_ERROR_INVAL;
>> +            break;
>> +        }
> 
> Can these two conditions happen? Shouldn’t they be ASSERT?
> 
>> +
>> +        msgIn = (POVS_MESSAGE)tunnelContext->inputBuffer;
>> +        msgOut = (POVS_MESSAGE)tunnelContext->outputBuffer;
>> +
>> +        if (!NT_SUCCESS(status)) {
>> +            nlError = NlMapStatusToNlErr(status);
>> +            break;
>> +        }

> ASSERT(nlError != NL_ERROR_INVAL) would be useful IMO.

[Nithin]: do you think these comments are valid?

>>                              POVS_VPORT_ENTRY vport);
>> -NTSTATUS OvsInitTunnelVport(POVS_VPORT_ENTRY vport, OVS_VPORT_TYPE ovsType,
>> -                            UINT16 dstport);
>> +NTSTATUS OvsInitTunnelVport(PVOID usrParamsCtx, POVS_VPORT_ENTRY vport,
>> +                            OVS_VPORT_TYPE ovsType, UINT16 dstport);
>> NTSTATUS OvsInitBridgeInternalVport(POVS_VPORT_ENTRY vport);
>> 
>> POVS_VPORT_ENTRY OvsAllocateVport(VOID);
>> diff --git a/datapath-windows/ovsext/Vxlan.c b/datapath-windows/ovsext/Vxlan.c
>> index 1ce5af2..cd3dc45 100644
>> --- a/datapath-windows/ovsext/Vxlan.c
>> +++ b/datapath-windows/ovsext/Vxlan.c
>> @@ -49,15 +49,54 @@
>> /* Move to a header file */
>> extern POVS_SWITCH_CONTEXT gOvsSwitchContext;
>> 
>> +BOOLEAN
>> +OvsIsTunnelFilterCreated(POVS_SWITCH_CONTEXT switchContext,
>> +                         UINT16 udpPortDest)
> 
> Pls. add a description. Also, this is has static scope.
> 
> I am not clear as to the motivation for this function. It seems to be used in OvsInitVxlanTunnel() to decide on whether or not to insert the VXLAN WFP filter or not. If a VXLAN port exists in the hash tables, wouldn’t it have vxlanPort->filterID != NULL? The VXLAN vport would have got inserted into the hash tables in OvsTunnelVportPendingInit(), but what also means vxlanPort->filterID != NULL.
> 
> Are you trying to handle the case of duplicate request being made while the previous one is not complete yet? This won’t help for that.
> SV: OvsIsTunnelFilterCreated function was added to verify if a VXLAN WFP port already exists. When this function is called, the current vxlan vport already exist in the hash tables, but the port is not added yet. The port addition is delayed, due to high IRQL. If the port was not added yet, 'vxlanPort->filterID == 0’.

[Nithin]: vports added from userspace are added to the hash tables ONLY in InitOvsVportCommon(), and InitOvsVportCommon() is being called for delayed VXLAN ports only in OvsIsTunnelFilterCreated(). So, there’s no race as such. If a VXLAN port exists in the hash tables, it implies that vxlanPort->filterID != NULL.


More information about the dev mailing list