[ovs-dev] [PATCH V2] netdev-dpdk: Add Jumbo Frame Support.
Kavanagh, Mark B
mark.b.kavanagh at intel.com
Fri Jan 29 11:56:08 UTC 2016
Further clarifications below Aaron - please let me know if you have any additional comments.
Thanks,
Mark
>
>One addendum below, Aaron.
>
>>
>>>Hi Mark,
>>>
>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>my first-blush comments are inline.
>>
>>Thanks for the review Aaron!
>>
>>Any additional comments are welcome - my initial responses are inline, below.
>>
>>Thanks,
>>Mark
>>
>>>
>>>Mark Kavanagh <mark.b.kavanagh at intel.com> writes:
>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>> using single-segment-mbufs.
>>>>
>>>> Using this approach, the amount of memory allocated for each mbuf
>>>> to store frame data is increased to a value greater than 1518B
>>>> (typical Ethernet maximum frame length). The increased space
>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>> segments.
>>>>
>>>> The amount of space allocated to each mbuf to hold frame data is
>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>
>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>>> ---
>>>
>>>This is a new feature which has user-visible impact, so should be
>>>announced in NEWS (plus, it's great news).
>>
>>No problem - I'll add an entry.
>>
>>>
>>>> INSTALL.DPDK.md | 59 ++++++++-
>>>> lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------
>>>> 2 files changed, 328 insertions(+), 87 deletions(-)
>>>>
>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>> index 96b686c..2f23e27 100644
>>>> --- a/INSTALL.DPDK.md
>>>> +++ b/INSTALL.DPDK.md
>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>> to <interface> sections of all network devices used by DPDK. Parameter 'N'
>>>> determines how many queues can be used by the guest.
>>>>
>>>> +
>>>> +Jumbo Frames
>>>> +------------
>>>> +
>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>> +
>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl
>>>> +'add-port' command-line, along with the required MTU for the port.
>>>> +e.g.
>>>
>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>
>>Good catch, thanks.
>>
>>>
>>>> +
>>>> + ```
>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
>>>> + ```
>>>> +
>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>>>> +increased, such that a full Jumbo Frame may be accommodated inside a single
>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>
>>>Why no support? DPDK supports changing the mtu.
>>
>>I guess my rationale here is that an MTU change can't be triggered via OVS command-line, nor
>>can it be triggered programmatically via DPDK (apart from an explicit call to
>>rte_eth_dev_set_mtu).
>>So, while technically it's possibly, from a user's point of view, there's no way to
>configure
>>it, outside of modifying the code directly. If I've missed something here, please let me
>>know.
>>
>>>
>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as
>>>> +this feature is pending deprecation.
>>>> +
>>>> +
>>>> +vHost Ports and Jumbo Frames
>>>> +----------------------------
>>>> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional
>>>> +configuration is needed to take advantage of this feature:
>>>> +
>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
>>>> + the QEMU command line snippet below:
>>>> +
>>>> + ```
>>>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>> + '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>> + ```
>>>> +
>>>> + 2. Where virtio devices are bound to the Linux kernel driver in a guest
>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
>>>> + MTU of those logical network interfaces must also be increased. This
>>>> + avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
>>>> + to the length of the IP packet only, and not that of the entire frame.
>>>> +
>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
>>>> + header and CRC lengths (i.e. 18B) from the max supported frame size.
>>>> + So, to set the MTU for a 13312B Jumbo Frame:
>>>> +
>>>> + ```
>>>> + ifconfig eth1 mtu 13294
>>>> + ```
>>>> +
>>>> +
>>>> Restrictions:
>>>> -------------
>>>>
>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>> - Currently DPDK port does not make use any offload functionality.
>>>> - DPDK-vHost support works with 1G huge pages.
>>>>
>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>> the next release of DPDK (which includes the above patch) is available and
>>>> integrated into OVS.
>>>>
>>>> + Jumbo Frames:
>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
>>>> + this issue is currently being investigated.
>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
>>>> +
>>>> +
>>>> Bug Reporting:
>>>> --------------
>>>>
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index de7e488..5f23e28 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>> #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>
>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1
>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024
>>>> +
>>>> /*
>>>> * need to reserve tons of extra space in the mbufs so we can align the
>>>> * DMA addresses to 4KB.
>>>> * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>>>> * performance for standard Ethernet MTU.
>>>> */
>>>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \
>>>> - + sizeof(struct dp_packet) \
>>>> - + RTE_PKTMBUF_HEADROOM)
>>>> -#define MBUF_SIZE_DRIVER (2048 \
>>>> - + sizeof (struct rte_mbuf) \
>>>> - + RTE_PKTMBUF_HEADROOM)
>>>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
>>>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \
>>>> + + sizeof(struct dp_packet) \
>>>> + + RTE_PKTMBUF_HEADROOM)
>>>> +
>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>> + */
>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312
>>>>
>>>> /* Max and min number of packets in the mempool. OVS tries to allocate a
>>>> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define MIN_NB_MBUF (4096 * 4)
>>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>
>>>> +#define DPDK_VLAN_TAG_LEN 4
>>>> +
>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>
>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>> .header_split = 0, /* Header Split disabled */
>>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */
>>>> .hw_strip_crc = 0,
>>>> },
>>>> .rx_adv_conf = {
>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>> return class->construct == netdev_dpdk_construct;
>>>> }
>>>>
>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
>>>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are
>>>> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
>>>> + * driver needs to ensure that there is also sufficient space in the Rx buffer
>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>> + */
>>>> +static uint32_t
>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>> +{
>>>> + struct rte_eth_dev_info info;
>>>> + uint32_t buf_size;
>>>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>> +
>>>> + if(netdev->type == DPDK_DEV_ETH) {
>>>> + rte_eth_dev_info_get(netdev->port_id, &info);
>>>> + buf_size = (info.min_rx_bufsize == 0) ?
>>>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>> + info.min_rx_bufsize;
>>>
>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>
>>I'm not sure if I follow - the MTU isn't the issue here, but rather the buffer size used
>>when creating the rte_mempool, which must be a multiple of the driver's min_rx_bufsize.
>>
>>>
>>>> + } else {
>>>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>> + }
>>>> +
>>>> + if(len % buf_size) {
>>>> + len = buf_size * ((len/buf_size) + 1);
>>>> + }
>>>> +
>>>> + return len;
>>>> +}
>>>> +
>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>> * for all other segments data, bss and text. */
>>>>
>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>> }
>>>>
>>>> static void
>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>> - void *opaque_arg OVS_UNUSED,
>>>> - void *_m,
>>>> - unsigned i OVS_UNUSED)
>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>> {
>>>> - struct rte_mbuf *m = _m;
>>>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>> + struct rte_pktmbuf_pool_private default_mbp_priv;
>>>> + uint16_t roomsz;
>>>>
>>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>
>>>> - memset(m, 0, mp->elt_size);
>>>> + /* if no structure is provided, assume no mbuf private area */
>>>> +
>>>> + user_mbp_priv = opaque_arg;
>>>> + if (user_mbp_priv == NULL) {
>>>> + default_mbp_priv.mbuf_priv_size = 0;
>>>> + if (mp->elt_size > sizeof(struct dp_packet)) {
>>>> + roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>> + } else {
>>>> + roomsz = 0;
>>>> + }
>>>> + default_mbp_priv.mbuf_data_room_size = roomsz;
>>>> + user_mbp_priv = &default_mbp_priv;
>>>> + }
>>>> +
>>>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>> + user_mbp_priv->mbuf_data_room_size +
>>>> + user_mbp_priv->mbuf_priv_size);
>>>> +
>>>> + mbp_priv = rte_mempool_get_priv(mp);
>>>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>> +}
>>>> +
>>>> +/* Initialise some fields in the mbuf structure that are not modified by the
>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>> +static void
>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>> + void *opaque_arg OVS_UNUSED,
>>>> + void *_m,
>>>> + unsigned i OVS_UNUSED)
>>>> +{
>>>> + struct rte_mbuf *m = _m;
>>>> + uint32_t buf_size, buf_len, priv_size;
>>>> +
>>>> + priv_size = rte_pktmbuf_priv_size(mp);
>>>> + buf_size = sizeof(struct dp_packet) + priv_size;
>>>> + buf_len = rte_pktmbuf_data_room_size(mp);
>>>>
>>>> - /* start of buffer is just after mbuf structure */
>>>> - m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>> - sizeof(struct dp_packet);
>>>> - m->buf_len = (uint16_t)buf_len;
>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
>>>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>
>>>> - /* keep some headroom between start of buffer and data */
>>>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>> + memset(m, 0, mp->elt_size);
>>>>
>>>> - /* init some constant fields */
>>>> - m->pool = mp;
>>>> - m->nb_segs = 1;
>>>> - m->port = 0xff;
>>>> + /* start of buffer is after dp_packet structure and priv data */
>>>> + m->priv_size = priv_size;
>>>> + m->buf_addr = (char *)m + buf_size;
>>>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>> + m->buf_len = (uint16_t)buf_len;
>>>> +
>>>> + /* keep some headroom between start of buffer and data */
>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>> +
>>>> + /* init some constant fields */
>>>> + m->pool = mp;
>>>> + m->nb_segs = 1;
>>>> + m->port = 0xff;
>>>
>>>Please don't mix tabs and spaces in the file.
>>
>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>
>>>
>>>> }
>>>>
>>>> static void
>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>> {
>>>> struct rte_mbuf *m = _m;
>>>>
>>>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>
>>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>> }
>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>>> struct dpdk_mp *dmp = NULL;
>>>> char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>> unsigned mp_size;
>>>> + struct rte_pktmbuf_pool_private mbp_priv;
>>>>
>>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>>> dmp->socket_id = socket_id;
>>>> dmp->mtu = mtu;
>>>> dmp->refcount = 1;
>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
>>>> + mbp_priv.mbuf_priv_size = 0;
>>>>
>>>> mp_size = MAX_NB_MBUF;
>>>> do {
>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>>> return NULL;
>>>> }
>>>>
>>>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
>>>> MP_CACHE_SZ,
>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>> - rte_pktmbuf_pool_init, NULL,
>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>> ovs_rte_pktmbuf_init, NULL,
>>>> socket_id, 0);
>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>>n_txq)
>>>> {
>>>> int diag = 0;
>>>> int i;
>>>> + struct rte_eth_conf conf = port_conf;
>>>>
>>>> /* A device may report more queues than it makes available (this has
>>>> * been observed for Intel xl710, which reserves some of them for
>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>>>n_txq)
>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>> }
>>>>
>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
>>>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> + }
>>>> +
>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>> if (diag) {
>>>> break;
>>>> }
>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>> int sid;
>>>> int err = 0;
>>>> + uint32_t buf_size;
>>>>
>>>> ovs_mutex_init(&netdev->mutex);
>>>> ovs_mutex_lock(&netdev->mutex);
>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>>>> netdev->port_id = port_no;
>>>> netdev->type = type;
>>>> netdev->flags = 0;
>>>> +
>>>> + /* Initialize port's MTU and frame len to the default Ethernet values.
>>>> + * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>> + * netdev_dpdk_set_config.
>>>> + */
>>>> + netdev->max_packet_len = ETHER_MAX_LEN;
>>>> netdev->mtu = ETHER_MTU;
>>>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>
>>>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>>>> if (!netdev->dpdk_mp) {
>>>> err = ENOMEM;
>>>> goto unlock;
>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],
>>>> return 0;
>>>> }
>>>>
>>>> +static void
>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>> +{
>>>> + const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>> + int local_mtu;
>>>> +
>>>> + if(mtu_str) {
>>>> + local_mtu = atoi(mtu_str);
>>>
>>>Please use strtol or strtoul here, and detect errors in the string by
>>>checking endptr. That way, we can be sure that random garbage on the
>>>stack won't accidentally become the mtu.
>>
>>Will do, thanks.
>>
>>>
>>>> + }
>>>> + if(!mtu_str || local_mtu < ETHER_MTU ||
>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>> + local_mtu = ETHER_MTU;
>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
>>>> + local_mtu);
>>>> + }
>>>> + *mtu = local_mtu;
>>>> +}
>>>> +
>>>> static int
>>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>> {
>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap
>>>*args)
>>>> smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>> smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>>>> + smap_add_format(args, "mtu", "%d", dev->mtu);
>>>> ovs_mutex_unlock(&dev->mutex);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>> +static int
>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>> +{
>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> + int old_mtu, err;
>>>> + uint32_t buf_size;
>>>> + int dpdk_mtu;
>>>> + struct dpdk_mp *old_mp;
>>>> + struct dpdk_mp *mp;
>>>> +
>>>> + ovs_mutex_lock(&dpdk_mutex);
>>>> + ovs_mutex_lock(&dev->mutex);
>>>> + if (dev->mtu == mtu) {
>>>> + err = 0;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>> +
>>>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>> + if (!mp) {
>>>> + err = ENOMEM;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + rte_eth_dev_stop(dev->port_id);
>>>> +
>>>> + old_mtu = dev->mtu;
>>>> + old_mp = dev->dpdk_mp;
>>>> + dev->dpdk_mp = mp;
>>>> + dev->mtu = mtu;
>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> +
>>>> + err = dpdk_eth_dev_init(dev);
>>>
>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>rte_eth_dev_start() instead?
The rx queues for the device need to be assigned the mempool that was allocated for the jumbo frames; since dpdk_eth_dev_init handles the queue setup, rte_eth_dev configuration, etc. it makes sense to call it here.
>>
>>This is legacy code, but I can refactor if needs be.
>>
>>>
>>>> + if (err) {
>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
>>>> + "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>> + dpdk_mp_put(mp);
>>>> + dev->mtu = old_mtu;
>>>> + dev->dpdk_mp = old_mp;
>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> + err = dpdk_eth_dev_init(dev);
>>>
>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>will cause a queue reconfiguration action, which is not needed, and can
>>>error.
This call is needed, in case the eth_dev's queue configuration was changed by the previous call.
The error handling for this call is arguably outside the scope of this patch, and could be submitted as a separate patch - what do you think?
>>>
>>
>>As above.
>>
>>>If you are intending to set MTU, I would have expected calls to
>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>recognize that much of this is legacy, but since you are touching it :)
>>
>>I actually had a number of issues with rte_eth_dev_set_mtu early on in the patch; I can take
>>a second look now that the code is more stable though.
>
>I remembered why I didn't use this.
>
>If you look at the IXGBE driver's mtu_set function, it refuses frame lengths which are larger
>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>This might be avoided by configuring the device for scatter RX behavior, but this incurs a
>performance impact.
>
><snippet>
>ixgbe_dev_mtu_set(...) {
>
>...
>
> /* refuse mtu that requires the support of scattered packets when this
> * feature has not been enabled before. */
> if (!dev->data->scattered_rx &&
> (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
> dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
> return -EINVAL;
>
>...
>}
></snippet>
>
>
>>
>>>
>>>> + goto out;
>>>> + }
>>>> +
>>>> + dpdk_mp_put(old_mp);
>>>> + netdev_change_seq_changed(netdev);
>>>> +out:
>>>> + ovs_mutex_unlock(&dev->mutex);
>>>> + ovs_mutex_unlock(&dpdk_mutex);
>>>> + return err;
>>>> +}
>>>> +
>>>> +static int
>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>> +{
>>>> + int mtu;
>>>> +
>>>> + dpdk_dev_parse_mtu(args, &mtu);
>>>> +
>>>> + return netdev_dpdk_set_mtu(netdev_, mtu);
>>>> +}
>>>> +
>>>> static int
>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>> {
>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>>>>
>>>> return 0;
>>>> }
>>>> -
>>>> -static int
>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>> -{
>>>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> - int old_mtu, err;
>>>> - struct dpdk_mp *old_mp;
>>>> - struct dpdk_mp *mp;
>>>> -
>>>> - ovs_mutex_lock(&dpdk_mutex);
>>>> - ovs_mutex_lock(&dev->mutex);
>>>> - if (dev->mtu == mtu) {
>>>> - err = 0;
>>>> - goto out;
>>>> - }
>>>> -
>>>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>> - if (!mp) {
>>>> - err = ENOMEM;
>>>> - goto out;
>>>> - }
>>>> -
>>>> - rte_eth_dev_stop(dev->port_id);
>>>> -
>>>> - old_mtu = dev->mtu;
>>>> - old_mp = dev->dpdk_mp;
>>>> - dev->dpdk_mp = mp;
>>>> - dev->mtu = mtu;
>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>> -
>>>> - err = dpdk_eth_dev_init(dev);
>>>> - if (err) {
>>>> - dpdk_mp_put(mp);
>>>> - dev->mtu = old_mtu;
>>>> - dev->dpdk_mp = old_mp;
>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>> - dpdk_eth_dev_init(dev);
>>>> - goto out;
>>>> - }
>>>> -
>>>> - dpdk_mp_put(old_mp);
>>>> - netdev_change_seq_changed(netdev);
>>>> -out:
>>>> - ovs_mutex_unlock(&dev->mutex);
>>>> - ovs_mutex_unlock(&dpdk_mutex);
>>>> - return err;
>>>> -}
>>>> -
>>>> static int
>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>
>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap
>>>*args)
>>>> smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
>>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>> smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
>>>> - smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>>>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>> smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>>>> smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>>>> smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>> return 0;
>>>> }
>>>>
>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>> +static int
>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>> +{
>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> + int err = 0;
>>>> + struct dpdk_mp *old_mp;
>>>> + struct dpdk_mp *mp;
>>>> +
>>>> + ovs_mutex_lock(&dpdk_mutex);
>>>> + ovs_mutex_lock(&dev->mutex);
>>>> + if (dev->mtu == mtu) {
>>>> + err = 0;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + mp = dpdk_mp_get(dev->socket_id, mtu);
>>>> + if (!mp) {
>>>> + err = ENOMEM;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + old_mp = dev->dpdk_mp;
>>>> + dev->dpdk_mp = mp;
>>>> + dev->mtu = mtu;
>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> +
>>>> + dpdk_mp_put(old_mp);
>>>> + netdev_change_seq_changed(netdev);
>>>> +out:
>>>> + ovs_mutex_unlock(&dev->mutex);
>>>> + ovs_mutex_unlock(&dpdk_mutex);
>>>> + return err;
>>>> +}
>>>> +
>>>> +static int
>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
>>>> +{
>>>> + int mtu;
>>>> +
>>>> + dpdk_dev_parse_mtu(args, &mtu);
>>>> +
>>>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>> +}
>>>> +
>>>> static void
>>>> dpdk_common_init(void)
>>>> {
>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>> return err;
>>>> }
>>>>
>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, \
>>>> + GET_STATUS, RXQ_RECV) \
>>>> { \
>>>> NAME, \
>>>> INIT, /* init */ \
>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>> DESTRUCT, \
>>>> netdev_dpdk_dealloc, \
>>>> netdev_dpdk_get_config, \
>>>> - NULL, /* netdev_dpdk_set_config */ \
>>>> + SET_CONFIG, \
>>>> NULL, /* get_tunnel_config */ \
>>>> NULL, /* build header */ \
>>>> NULL, /* push header */ \
>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>> netdev_dpdk_set_etheraddr, \
>>>> netdev_dpdk_get_etheraddr, \
>>>> netdev_dpdk_get_mtu, \
>>>> - netdev_dpdk_set_mtu, \
>>>> + SET_MTU, \
>>>> netdev_dpdk_get_ifindex, \
>>>> GET_CARRIER, \
>>>> netdev_dpdk_get_carrier_resets, \
>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>> NULL,
>>>> netdev_dpdk_construct,
>>>> netdev_dpdk_destruct,
>>>> + netdev_dpdk_set_config,
>>>> netdev_dpdk_set_multiq,
>>>> netdev_dpdk_eth_send,
>>>> + netdev_dpdk_set_mtu,
>>>> netdev_dpdk_get_carrier,
>>>> netdev_dpdk_get_stats,
>>>> netdev_dpdk_get_features,
>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>> NULL,
>>>> netdev_dpdk_ring_construct,
>>>> netdev_dpdk_destruct,
>>>> + netdev_dpdk_set_config,
>>>> netdev_dpdk_set_multiq,
>>>> netdev_dpdk_ring_send,
>>>> + netdev_dpdk_set_mtu,
>>>> netdev_dpdk_get_carrier,
>>>> netdev_dpdk_get_stats,
>>>> netdev_dpdk_get_features,
>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class
>=
>>>> dpdk_vhost_cuse_class_init,
>>>> netdev_dpdk_vhost_cuse_construct,
>>>> netdev_dpdk_vhost_destruct,
>>>> + NULL,
>>>> netdev_dpdk_vhost_set_multiq,
>>>> netdev_dpdk_vhost_send,
>>>> + NULL,
>>>> netdev_dpdk_vhost_get_carrier,
>>>> netdev_dpdk_vhost_get_stats,
>>>> NULL,
>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class
>=
>>>> dpdk_vhost_user_class_init,
>>>> netdev_dpdk_vhost_user_construct,
>>>> netdev_dpdk_vhost_destruct,
>>>> + netdev_dpdk_vhost_set_config,
>>>> netdev_dpdk_vhost_set_multiq,
>>>> netdev_dpdk_vhost_send,
>>>> + netdev_dpdk_vhost_set_mtu,
>>>> netdev_dpdk_vhost_get_carrier,
>>>> netdev_dpdk_vhost_get_stats,
>>>> NULL,
>>_______________________________________________
>>dev mailing list
>>dev at openvswitch.org
>>http://openvswitch.org/mailman/listinfo/dev
>_______________________________________________
>dev mailing list
>dev at openvswitch.org
>http://openvswitch.org/mailman/listinfo/dev
More information about the dev
mailing list