* [PATCH 1/3] net/iavf: support Rx timestamp offload on AVX512 @ 2023-04-10 7:35 Zhichao Zeng 2023-04-12 6:49 ` [PATCH v2 " Zhichao Zeng 2023-04-12 8:46 ` Zhichao Zeng 0 siblings, 2 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-04-10 7:35 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 207 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 206 insertions(+), 7 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 09e2127db0..97b5e86f6e 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -44,7 +44,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index bd2788121b..334504f54b 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. - * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last; + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1278,7 +1301,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1411,10 +1552,70 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + break; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + break; + case 0: + break; + default: + printf("invalid inflection point for rx timestamp\n"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
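
[Context for readers, not part of the patch above: a minimal sketch of how an application could request the new offload and read the timestamps this Rx path stores in the mbuf dynamic field. It assumes the generic ethdev and mbuf dynamic-field APIs (rte_mbuf_dyn_rx_timestamp_register(), RTE_MBUF_DYNFIELD) and trims error handling.]

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_mbuf_dyn.h>

    static int ts_off = -1;   /* offset of the Rx timestamp dynamic field */
    static uint64_t ts_flag;  /* ol_flags bit set when the field is valid */

    /* Enable the offload only if the PMD advertises it, then register the
     * shared Rx timestamp dynamic field so the application reads the same
     * offset the driver writes. */
    static int
    rx_timestamp_setup(uint16_t port_id, struct rte_eth_conf *conf)
    {
    	struct rte_eth_dev_info info;

    	if (rte_eth_dev_info_get(port_id, &info) != 0)
    		return -1;
    	if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TIMESTAMP))
    		return -1;
    	conf->rxmode.offloads |= RTE_ETH_RX_OFFLOAD_TIMESTAMP;
    	return rte_mbuf_dyn_rx_timestamp_register(&ts_off, &ts_flag);
    }

    /* Print the 64-bit timestamp of each received packet that carries one. */
    static void
    rx_timestamp_poll(uint16_t port_id, uint16_t queue_id)
    {
    	struct rte_mbuf *pkts[32];
    	uint16_t nb = rte_eth_rx_burst(port_id, queue_id, pkts, 32);

    	for (uint16_t n = 0; n < nb; n++) {
    		if (pkts[n]->ol_flags & ts_flag)
    			printf("pkt %u ts %" PRIu64 "\n", n,
    			       *RTE_MBUF_DYNFIELD(pkts[n], ts_off,
    						  rte_mbuf_timestamp_t *));
    		rte_pktmbuf_free(pkts[n]);
    	}
    }
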
* [PATCH v2 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-04-10 7:35 [PATCH 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng @ 2023-04-12 6:49 ` Zhichao Zeng 2023-04-12 8:46 ` Zhichao Zeng 1 sibling, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-04-12 6:49 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, yingyax.han, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 199 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 198 insertions(+), 7 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 09e2127db0..97b5e86f6e 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -44,7 +44,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index bd2788121b..a1b7ddd09b 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last; + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. */ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1278,7 +1301,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, 
mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 
0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1411,10 +1552,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + printf("invalid inflection point for rx timestamp\n"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_ICE_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
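
[Background for the timestamp handling in the patch above: the Rx descriptor carries only the low 32 bits of the PHC time, so the driver extends each value against a cached 64-bit reference (rxq->phc_time) and bumps the high word when the low word rolls over. The following scalar sketch illustrates that idea; it is not the driver's actual helper, only an illustration of the logic the vector code implements.]

    #include <stdint.h>

    static inline uint64_t
    extend_hw_timestamp(uint64_t cached_phc_time, uint32_t hw_low)
    {
    	uint64_t high = cached_phc_time & 0xFFFFFFFF00000000ULL;

    	/* A low word smaller than the cached one means the 32-bit counter
    	 * wrapped since the reference was taken. */
    	if (hw_low < (uint32_t)cached_phc_time)
    		high += (uint64_t)1 << 32;

    	return high | hw_low;
    }
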
* [PATCH v2 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-04-10 7:35 [PATCH 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng 2023-04-12 6:49 ` [PATCH v2 " Zhichao Zeng @ 2023-04-12 8:46 ` Zhichao Zeng 2023-04-27 3:12 ` [PATCH v3 " Zhichao Zeng 1 sibling, 1 reply; 22+ messages in thread From: Zhichao Zeng @ 2023-04-12 8:46 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, yingyax.han, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 200 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 09e2127db0..97b5e86f6e 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -44,7 +44,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index bd2788121b..c0a4fce120 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last; + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1278,7 +1301,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1411,10 +1552,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + printf("invalid inflection point for rx timestamp\n"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
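
[The AVX512 block above detects a rollover inside an 8-packet group with a vectorized compare followed by a binary search over lane masks. The scalar equivalent below shows what that search computes; the helper name is hypothetical and the code is only an illustration. It takes the eight low 32-bit timestamps in packet order plus the last low word of the previous group.]

    #include <stdint.h>

    /* Return the 1-based index of the first packet whose low 32-bit timestamp
     * is smaller than its predecessor's (i.e. where the counter wrapped inside
     * the group), or 0 if no wrap occurred. */
    static inline uint8_t
    find_inflection_point(const uint32_t ts_low[8], uint32_t prev_low)
    {
    	for (uint8_t idx = 0; idx < 8; idx++) {
    		if (ts_low[idx] < prev_low)
    			return idx + 1;
    		prev_low = ts_low[idx];
    	}
    	return 0;
    }
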
* [PATCH v3 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-04-12 8:46 ` Zhichao Zeng @ 2023-04-27 3:12 ` Zhichao Zeng 2023-05-26 2:42 ` [PATCH v4 " Zhichao Zeng ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-04-27 3:12 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 200 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index f205a2aaf1..9b7363724d 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -47,7 +47,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index 4fe9b97278..2e09179fae 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last; + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1275,7 +1298,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1408,10 +1549,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
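
[The switch above falls through deliberately (hence the -Wimplicit-fallthrough pragma): every packet from the inflection point to the end of the 8-packet group gets its high word bumped, and case 8 is always reached, so the cached reference advances as well. A scalar restatement of that effect, as a sketch that reuses the names from the patch:]

    /* Sketch only: equivalent effect of the fall-through switch. */
    if (inflection_point > 0) {
    	/* Packets at and after the wrap belong to the next 32-bit epoch. */
    	for (uint8_t k = inflection_point - 1; k < 8; k++)
    		*RTE_MBUF_DYNFIELD(rx_pkts[i + k],
    			iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1;
    	/* Reached via fall-through from every non-zero case, so the cached
    	 * reference moves to the new epoch too. */
    	rxq->phc_time += (uint64_t)1 << 32;
    }
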
* [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-04-27 3:12 ` [PATCH v3 " Zhichao Zeng @ 2023-05-26 2:42 ` Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng 2 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 2:42 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch base on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 200 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 547b68f441..0345a6a51d 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -47,7 +47,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index 4fe9b97278..2e09179fae 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last; + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1275,7 +1298,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1408,10 +1549,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
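The vector code above extends the 32-bit timestamp carried in each Rx descriptor to the 64-bit mbuf timestamp by tracking rollovers against the queue's cached PHC time. A minimal scalar sketch of that extension idea is shown below; it is illustrative only and is not the driver's iavf_tstamp_convert_32b_64b helper, whose body does not appear in the diff.

#include <stdint.h>

/*
 * Sketch: extend a 32-bit descriptor timestamp to 64 bits against a
 * previously known 64-bit reference time. If the new low word is smaller
 * than the reference's low word, the 32-bit counter has wrapped, so the
 * high word advances by one.
 */
static inline uint64_t
tstamp_extend_32b_64b(uint64_t prev_time, uint32_t in_timestamp)
{
	uint64_t high = prev_time & 0xFFFFFFFF00000000ULL;

	if (in_timestamp < (uint32_t)prev_time)	/* rollover detected */
		high += (uint64_t)1 << 32;

	return high | in_timestamp;
}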
* [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path 2023-04-27 3:12 ` [PATCH v3 " Zhichao Zeng 2023-05-26 2:42 ` [PATCH v4 " Zhichao Zeng @ 2023-05-26 9:50 ` Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng ` (3 more replies) 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng 2 siblings, 4 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 9:50 UTC (permalink / raw) To: dev; +Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and SSE path Zhichao Zeng (3): net/iavf: support Rx timestamp offload on AVX512 net/iavf: support Rx timestamp offload on AVX2 net/iavf: support Rx timestamp offload on SSE drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - drivers/net/iavf/iavf_rxtx_vec_sse.c | 159 ++++++++++++++++++- 5 files changed, 538 insertions(+), 16 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
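As a usage note for the command mentioned in the cover letter: with testpmd the offload is requested at start-up, for example (core list, channel count and PCI address below are placeholders):

dpdk-testpmd -l 1-2 -n 4 -a 0000:18:01.0 -- -i --enable-rx-timestamp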
* [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng @ 2023-05-26 9:50 ` Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng ` (2 subsequent siblings) 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 9:50 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 200 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 547b68f441..0345a6a51d 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -47,7 +47,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index 4fe9b97278..f9961e53b8 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1275,7 +1298,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1408,10 +1549,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
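The AVX512 block in this patch locates the first descriptor in a batch of eight whose low timestamp word wrapped (the "inflection point") using an unsigned max/compare followed by a binary search over lane masks. What that search computes can be sketched in scalar form as below; ts[] stands for the eight low words in packet order and prev for the last low word of the previous batch, both hypothetical names.

#include <stdint.h>

/*
 * Sketch: return 0 when no rollover happened inside the batch, otherwise
 * the 1-based index of the first packet whose low timestamp word is
 * smaller than the one before it (i.e. where the 32-bit counter wrapped).
 */
static inline uint8_t
find_inflection_point(const uint32_t ts[8], uint32_t prev)
{
	for (uint8_t i = 0; i < 8; i++) {
		if (ts[i] < prev)
			return i + 1;
		prev = ts[i];
	}
	return 0;
}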
* [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng @ 2023-05-26 9:50 ` Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 9:50 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on AVX2 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++++++- 1 file changed, 182 insertions(+), 4 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx2.c b/drivers/net/iavf/iavf_rxtx_vec_avx2.c index 22d4d3a90f..86290c4bbb 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx2.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx2.c @@ -532,7 +532,9 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *type_table = adapter->ptype_tbl; const __m256i mbuf_init = _mm256_set_epi64x(0, 0, @@ -558,6 +560,21 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (!(rxdp->wb.status_error0 & rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + bool is_tsinit = false; + uint8_t inflection_point = 0; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + } + } +#endif /* constants used in processing loop */ const __m256i crc_adjust = @@ -967,10 +984,11 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (offload) { #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1053,7 +1071,7 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, rss_hash4_5); mb2_3 = _mm256_or_si256(mb2_3, rss_hash2_3); mb0_1 = _mm256_or_si256(mb0_1, rss_hash0_1); - } + } /* if() on RSS hash parsing */ if (rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* merge the status/error-1 bits into one register */ @@ -1132,8 +1150,121 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, vlan_tci4_5); mb2_3 = _mm256_or_si256(mb2_3, vlan_tci2_3); mb0_1 = _mm256_or_si256(mb0_1, vlan_tci0_1); - } - } /* if() on RSS hash parsing */ + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = 
(uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, _mm256_set1_epi32(iavf_timestamp_dynflag)); + } /* if() on Timestamp parsing */ + } #endif } @@ -1265,10 +1396,57 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
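Both vector paths above start a burst by checking how long ago the queue's PHC time was last refreshed. The division by (rte_get_timer_hz() / 1000) turns the cycle counter into a millisecond count, so the comparison against 4 is a roughly 4 ms staleness window; when it is exceeded, the cached value is dropped and the PHC clock is re-read via iavf_get_phc_time() for the first packet of the burst. A condensed sketch of that decision, with hw_time_update mirroring the field used in the diff:

#include <stdbool.h>
#include <stdint.h>
#include <rte_cycles.h>

/* Sketch: true when the cached PHC time is too old to extrapolate from. */
static inline bool
phc_cache_is_stale(uint64_t hw_time_update_ms)
{
	uint64_t now_ms = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000);

	return now_ms - hw_time_update_ms > 4;
}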
* [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng @ 2023-05-26 9:50 ` Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 9:50 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on SSE data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and timestamp error --- drivers/net/iavf/iavf_rxtx_vec_sse.c | 159 ++++++++++++++++++++++++++- 1 file changed, 156 insertions(+), 3 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_sse.c b/drivers/net/iavf/iavf_rxtx_vec_sse.c index 3f30be01aa..b754122c51 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_sse.c +++ b/drivers/net/iavf/iavf_rxtx_vec_sse.c @@ -392,6 +392,11 @@ flex_desc_to_olflags_v(struct iavf_rx_queue *rxq, __m128i descs[4], _mm_extract_epi32(fdir_id0_3, 3); } /* if() on fdir_enabled */ +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) + flags = _mm_or_si128(flags, _mm_set1_epi32(iavf_timestamp_dynflag)); +#endif + /** * At this point, we have the 4 sets of flags in the low 16-bits * of each 32-bit value in flags. 
@@ -723,7 +728,9 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, int pos; uint64_t var; struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *ptype_tbl = adapter->ptype_tbl; __m128i crc_adjust = _mm_set_epi16 (0, 0, 0, /* ignore non-length fields */ @@ -793,6 +800,24 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m128i hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm_setzero_si128(); + is_tsinit = 1; + } else { + hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + } + } + +#endif + /** * Compile-time verify the shuffle mask * NOTE: some field positions already verified above, but duplicated @@ -825,7 +850,7 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rxdp += IAVF_VPMD_DESCS_PER_LOOP) { __m128i descs[IAVF_VPMD_DESCS_PER_LOOP]; #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC - __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP]; + __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP] = {_mm_setzero_si128()}; #endif __m128i pkt_mb0, pkt_mb1, pkt_mb2, pkt_mb3; __m128i staterr, sterr_tmp1, sterr_tmp2; @@ -895,10 +920,11 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ descs_bh[3] = _mm_load_si128 @@ -964,7 +990,94 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, pkt_mb2 = _mm_or_si128(pkt_mb2, vlan_tci2); pkt_mb1 = _mm_or_si128(pkt_mb1, vlan_tci1); pkt_mb0 = _mm_or_si128(pkt_mb0, vlan_tci0); - } + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m128i ts; + __m128i ts_low = _mm_setzero_si128(); + __m128i ts_low1; + __m128i max_ret; + __m128i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 4; + __m128i ts_desp_mask = _mm_set_epi32(mask, 0, 0, 0); + __m128i cmp_mask = _mm_set1_epi32(mask); + + ts = _mm_and_si128(descs_bh[0], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 3 * 4)); + ts = _mm_and_si128(descs_bh[1], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 2 * 4)); + ts = _mm_and_si128(descs_bh[2], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 1 * 4)); + ts = _mm_and_si128(descs_bh[3], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, ts); + + ts_low1 = _mm_slli_si128(ts_low, 4); + ts_low1 = _mm_and_si128(ts_low, _mm_set_epi32(mask, mask, mask, 0)); + ts_low1 = _mm_or_si128(ts_low1, hw_low_last); + hw_low_last = _mm_and_si128(ts_low, _mm_set_epi32(0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 3); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm_max_epu32(ts_low, ts_low1); + cmp_ret = _mm_andnot_si128(_mm_cmpeq_epi32(max_ret, ts_low), cmp_mask); + + if (_mm_testz_si128(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m128i mask_low; + __m128i mask_high; + switch (shift) { + case 2: + mask_low = _mm_set_epi32(0, 0, mask, mask); + mask_high = _mm_set_epi32(mask, mask, 0, 0); + break; + case 1: + mask_low = _mm_srli_si128(cmp_mask, 4); + mask_high = _mm_slli_si128(cmp_mask, 4); + break; + } + ret = _mm_testz_si128(cmp_ret, mask_low); + if (ret) { + ret = _mm_testz_si128(cmp_ret, mask_high); + inflection_point += ret ? 
0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + } /* if() on Timestamp parsing */ flex_desc_to_olflags_v(rxq, descs, descs_bh, &rx_pkts[pos]); #else @@ -1011,10 +1124,50 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, /* C.4 calc available number of desc */ var = __builtin_popcountll(_mm_cvtsi128_si64(staterr)); nb_pkts_recd += var; + +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= var) ? inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif + if (likely(var != IAVF_VPMD_DESCS_PER_LOOP)) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (nb_pkts_recd > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[nb_pkts_recd - 1], + iavf_timestamp_dynfield_offset, uint32_t *); +#endif +#endif + /* Update our internal tail pointer */ rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd); rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1)); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
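On the receive side of an application, the timestamps these paths write end up in the standard Rx timestamp dynamic mbuf field, with the matching dynamic flag set in ol_flags. A minimal sketch of how a consumer might read them (assuming the offload is enabled on the port; error handling trimmed):

#include <rte_mbuf.h>
#include <rte_mbuf_dyn.h>

static int ts_off;		/* dynamic field offset */
static uint64_t ts_flag;	/* ol_flags bit: timestamp is valid */

static int
rx_timestamp_setup(void)
{
	/* Look up (or register) the standard Rx timestamp field and flag. */
	return rte_mbuf_dyn_rx_timestamp_register(&ts_off, &ts_flag);
}

static inline uint64_t
rx_timestamp_read(const struct rte_mbuf *m)
{
	if (!(m->ol_flags & ts_flag))
		return 0;	/* no timestamp attached to this packet */
	return *RTE_MBUF_DYNFIELD(m, ts_off, const rte_mbuf_timestamp_t *);
}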
* [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng ` (2 preceding siblings ...) 2023-05-26 9:50 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng @ 2023-06-14 1:49 ` Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng ` (3 more replies) 3 siblings, 4 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-06-14 1:49 UTC (permalink / raw) To: dev; +Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng This patch enables Rx timestamp offload on the vector data path. It significantly reduces the performance drop when RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. --- v5: fix CI errors --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and SSE path Zhichao Zeng (3): net/iavf: support Rx timestamp offload on AVX512 net/iavf: support Rx timestamp offload on AVX2 net/iavf: support Rx timestamp offload on SSE drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 191 +++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 208 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - drivers/net/iavf/iavf_rxtx_vec_sse.c | 160 +++++++++++++++++- 5 files changed, 549 insertions(+), 16 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
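Outside testpmd, an application requests the offload that this series enables through the normal ethdev configuration path. A minimal sketch, assuming default values for everything except the Rx timestamp offload bit:

#include <errno.h>
#include <string.h>
#include <rte_ethdev.h>

static int
enable_rx_timestamp(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_conf conf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	/* Only request the offload if the port reports the capability. */
	if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_TIMESTAMP))
		return -ENOTSUP;

	memset(&conf, 0, sizeof(conf));
	conf.rxmode.offloads = RTE_ETH_RX_OFFLOAD_TIMESTAMP;

	return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
}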
* [PATCH v5 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng @ 2023-06-14 1:49 ` Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng ` (2 subsequent siblings) 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-06-14 1:49 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on the AVX512 data path. It significantly reduces the performance drop when RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v5: fix CI errors --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 208 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 205 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 547b68f441..0345a6a51d 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -47,7 +47,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index bd2788121b..3e66df5341 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1278,7 +1301,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low = _mm256_setzero_si256(); + __m256i mask_high = _mm256_setzero_si256(); + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1411,10 +1552,67 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + /* fallthrough */ + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
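The fallthrough switch added in this version is a compact way of applying a detected rollover to the tail of the burst: every packet from the inflection point onward gets its high timestamp word bumped, and the cached PHC time moves forward by one 2^32 epoch. A scalar sketch of the same update, where hi[] stands in for the per-packet pointers at iavf_timestamp_dynfield_offset + 4 (a hypothetical stand-in, not driver code):

#include <stdint.h>

/*
 * Sketch: 'ip' is the 1-based index of the first wrapped packet in the
 * burst (0 means no wrap). Bump the high word of that slot and of every
 * later slot, then advance the cached PHC time by one epoch.
 */
static inline void
apply_inflection(uint32_t *hi[8], uint64_t *phc_time, uint8_t ip, uint16_t burst)
{
	if (ip == 0 || ip > burst)
		return;

	for (uint8_t k = ip - 1; k < 8; k++)
		*hi[k] += 1;

	*phc_time += (uint64_t)1 << 32;
}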
* [PATCH v5 2/3] net/iavf: support Rx timestamp offload on AVX2 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng @ 2023-06-14 1:49 ` Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng 2023-06-19 1:03 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-06-14 1:49 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on the AVX2 data path. It significantly reduces the performance drop when RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. --- v5: fix CI errors --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 191 +++++++++++++++++++++++++- 1 file changed, 187 insertions(+), 4 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx2.c b/drivers/net/iavf/iavf_rxtx_vec_avx2.c index c7f8b6ef71..c10f24036e 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx2.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx2.c @@ -532,7 +532,9 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *type_table = adapter->ptype_tbl; const __m256i mbuf_init = _mm256_set_epi64x(0, 0, @@ -558,6 +560,21 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (!(rxdp->wb.status_error0 & rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + bool is_tsinit = false; + uint8_t inflection_point = 0; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + } + } +#endif /* constants used in processing loop */ const __m256i crc_adjust = @@ -967,10 +984,11 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (offload) { #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1053,7 +1071,7 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, rss_hash4_5); mb2_3 = _mm256_or_si256(mb2_3, rss_hash2_3); mb0_1 = _mm256_or_si256(mb0_1, rss_hash0_1); - } + } /* if() on RSS hash parsing */ if (rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* merge the status/error-1 bits into one register */ @@ -1132,8 +1150,121 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, vlan_tci4_5); mb2_3 = _mm256_or_si256(mb2_3, vlan_tci2_3); mb0_1 = _mm256_or_si256(mb0_1, vlan_tci0_1); - } - } /* if() on RSS hash parsing */ + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = 
(uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low = _mm256_setzero_si256(); + __m256i mask_high = _mm256_setzero_si256(); + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, _mm256_set1_epi32(iavf_timestamp_dynflag)); + } /* if() on Timestamp parsing */ + } #endif } @@ -1265,10 +1396,62 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + /* fallthrough */ + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
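The max/compare/test sequence in the patch above, together with the shrinking-mask loop, is a branch-free way of locating where the 32-bit hardware timestamp wraps inside a burst of eight packets. For readers following the vector code, a scalar sketch of the same computation is given below; the helper name and signature are illustrative only and do not exist in the driver.

#include <stdint.h>

/*
 * Scalar equivalent (illustrative only) of the vector inflection-point
 * search: return the 1-based index of the first packet whose low 32-bit
 * timestamp is smaller than its predecessor's, i.e. where the 32-bit
 * hardware counter wrapped; 0 means no wrap inside this burst.
 * prev_last plays the role of hw_low_last, the last timestamp seen in
 * the previous burst.
 */
static inline uint8_t
find_inflection_point(const uint32_t *ts_low, uint32_t prev_last,
		      uint8_t nb_pkts)
{
	uint8_t i;

	for (i = 0; i < nb_pkts; i++) {
		uint32_t prev = (i == 0) ? prev_last : ts_low[i - 1];

		if (ts_low[i] < prev)
			return i + 1;
	}
	return 0;
}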
* [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng 2023-06-14 1:49 ` [PATCH v5 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng @ 2023-06-14 1:49 ` Zhichao Zeng 2023-06-15 9:28 ` Tang, Yaqi 2023-06-19 1:03 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z 3 siblings, 1 reply; 22+ messages in thread From: Zhichao Zeng @ 2023-06-14 1:49 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on the SSE data path. It significantly reduces the performance drop when RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. --- v5: fix CI errors --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and timestamp error Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- drivers/net/iavf/iavf_rxtx_vec_sse.c | 160 ++++++++++++++++++++++++++- 1 file changed, 157 insertions(+), 3 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_sse.c b/drivers/net/iavf/iavf_rxtx_vec_sse.c index 3f30be01aa..892bfa4cf3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_sse.c +++ b/drivers/net/iavf/iavf_rxtx_vec_sse.c @@ -392,6 +392,11 @@ flex_desc_to_olflags_v(struct iavf_rx_queue *rxq, __m128i descs[4], _mm_extract_epi32(fdir_id0_3, 3); } /* if() on fdir_enabled */ +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) + flags = _mm_or_si128(flags, _mm_set1_epi32(iavf_timestamp_dynflag)); +#endif + /** * At this point, we have the 4 sets of flags in the low 16-bits * of each 32-bit value in flags. 
@@ -723,7 +728,9 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, int pos; uint64_t var; struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *ptype_tbl = adapter->ptype_tbl; __m128i crc_adjust = _mm_set_epi16 (0, 0, 0, /* ignore non-length fields */ @@ -793,6 +800,24 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m128i hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm_setzero_si128(); + is_tsinit = 1; + } else { + hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + } + } + +#endif + /** * Compile-time verify the shuffle mask * NOTE: some field positions already verified above, but duplicated @@ -825,7 +850,7 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rxdp += IAVF_VPMD_DESCS_PER_LOOP) { __m128i descs[IAVF_VPMD_DESCS_PER_LOOP]; #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC - __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP]; + __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP] = {_mm_setzero_si128()}; #endif __m128i pkt_mb0, pkt_mb1, pkt_mb2, pkt_mb3; __m128i staterr, sterr_tmp1, sterr_tmp2; @@ -895,10 +920,11 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ descs_bh[3] = _mm_load_si128 @@ -964,7 +990,94 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, pkt_mb2 = _mm_or_si128(pkt_mb2, vlan_tci2); pkt_mb1 = _mm_or_si128(pkt_mb1, vlan_tci1); pkt_mb0 = _mm_or_si128(pkt_mb0, vlan_tci0); - } + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m128i ts; + __m128i ts_low = _mm_setzero_si128(); + __m128i ts_low1; + __m128i max_ret; + __m128i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 4; + __m128i ts_desp_mask = _mm_set_epi32(mask, 0, 0, 0); + __m128i cmp_mask = _mm_set1_epi32(mask); + + ts = _mm_and_si128(descs_bh[0], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 3 * 4)); + ts = _mm_and_si128(descs_bh[1], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 2 * 4)); + ts = _mm_and_si128(descs_bh[2], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 1 * 4)); + ts = _mm_and_si128(descs_bh[3], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, ts); + + ts_low1 = _mm_slli_si128(ts_low, 4); + ts_low1 = _mm_and_si128(ts_low, _mm_set_epi32(mask, mask, mask, 0)); + ts_low1 = _mm_or_si128(ts_low1, hw_low_last); + hw_low_last = _mm_and_si128(ts_low, _mm_set_epi32(0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 3); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm_max_epu32(ts_low, ts_low1); + cmp_ret = _mm_andnot_si128(_mm_cmpeq_epi32(max_ret, ts_low), cmp_mask); + + if (_mm_testz_si128(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m128i mask_low = _mm_setzero_si128(); + __m128i mask_high = _mm_setzero_si128(); + switch (shift) { + case 2: + mask_low = _mm_set_epi32(0, 0, mask, mask); + mask_high = _mm_set_epi32(mask, mask, 0, 0); + break; + case 1: + mask_low = _mm_srli_si128(cmp_mask, 4); + mask_high = _mm_slli_si128(cmp_mask, 4); + break; + } + ret = _mm_testz_si128(cmp_ret, mask_low); + if (ret) { + ret = _mm_testz_si128(cmp_ret, mask_high); + inflection_point += ret ? 
0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + } /* if() on Timestamp parsing */ flex_desc_to_olflags_v(rxq, descs, descs_bh, &rx_pkts[pos]); #else @@ -1011,10 +1124,51 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, /* C.4 calc available number of desc */ var = __builtin_popcountll(_mm_cvtsi128_si64(staterr)); nb_pkts_recd += var; + +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= var) ? inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + /* fallthrough */ + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + /* fallthrough */ + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#endif + if (likely(var != IAVF_VPMD_DESCS_PER_LOOP)) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (nb_pkts_recd > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[nb_pkts_recd - 1], + iavf_timestamp_dynfield_offset, uint32_t *); +#endif +#endif + /* Update our internal tail pointer */ rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd); rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1)); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
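The per-packet stores in the hunk above implement a simple split: the low 32 bits of the mbuf timestamp dynfield come straight from the Rx descriptor, while the high 32 bits come from the queue's cached PHC time, with packets at or after the wrap point getting the high word bumped by one. A minimal sketch of that assembly on a little-endian host follows; the function is hypothetical and not a driver API.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: build the 64-bit timestamp stored for one mbuf. */
static inline uint64_t
assemble_rx_timestamp(uint32_t desc_ts_low, uint64_t phc_time, bool after_wrap)
{
	uint32_t hi = (uint32_t)(phc_time >> 32) + (after_wrap ? 1 : 0);

	return ((uint64_t)hi << 32) | desc_ts_low;
}

The code also refreshes rxq->phc_time from the last returned packet before leaving the receive loop, so the next call starts from a current reference.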
* RE: [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE 2023-06-14 1:49 ` [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng @ 2023-06-15 9:28 ` Tang, Yaqi 0 siblings, 0 replies; 22+ messages in thread From: Tang, Yaqi @ 2023-06-15 9:28 UTC (permalink / raw) To: Zeng, ZhichaoX, dev Cc: Zhang, Qi Z, Richardson, Bruce, Konstantin Ananyev, Wu, Jingjing, Xing, Beilei > -----Original Message----- > From: Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Sent: Wednesday, June 14, 2023 9:50 AM > To: dev@dpdk.org > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; Tang, Yaqi <yaqi.tang@intel.com>; > Zeng, ZhichaoX <zhichaox.zeng@intel.com>; Richardson, Bruce > <bruce.richardson@intel.com>; Konstantin Ananyev > <konstantin.v.ananyev@yandex.ru>; Wu, Jingjing <jingjing.wu@intel.com>; Xing, > Beilei <beilei.xing@intel.com> > Subject: [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE > > This patch enables Rx timestamp offload on the SSE data path. > > It significantly reduces the performance drop when > RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. > > --- > v5: fix CI errors > --- > v4: rework avx2 patch based on offload path > --- > v3: logging with driver dedicated macro > --- > v2: fix compile warning and timestamp error > > Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> > --- Functional test passed. Cover SSE, AVX2 and AVX512 paths. Tested-by: Yaqi Tang <yaqi.tang@intel.com> ^ permalink raw reply [flat|nested] 22+ messages in thread
* RE: [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path 2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng ` (2 preceding siblings ...) 2023-06-14 1:49 ` [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng @ 2023-06-19 1:03 ` Zhang, Qi Z 3 siblings, 0 replies; 22+ messages in thread From: Zhang, Qi Z @ 2023-06-19 1:03 UTC (permalink / raw) To: Zeng, ZhichaoX, dev; +Cc: Tang, Yaqi > -----Original Message----- > From: Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Sent: Wednesday, June 14, 2023 9:50 AM > To: dev@dpdk.org > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; Tang, Yaqi <yaqi.tang@intel.com>; > Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Subject: [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path > > This patch enables Rx timestamp offload on the vector data path. > > It significantly reduces the performance drop when > RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. > > --- > v5: fix CI errors > --- > v4: rework avx2 patch based on offload path > --- > v3: logging with driver dedicated macro > --- > v2: fix compile warning and SSE path > > Zhichao Zeng (3): > net/iavf: support Rx timestamp offload on AVX512 > net/iavf: support Rx timestamp offload on AVX2 > net/iavf: support Rx timestamp offload on SSE > > drivers/net/iavf/iavf_rxtx.h | 3 +- > drivers/net/iavf/iavf_rxtx_vec_avx2.c | 191 +++++++++++++++++++++- > drivers/net/iavf/iavf_rxtx_vec_avx512.c | 208 +++++++++++++++++++++++- > drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - > drivers/net/iavf/iavf_rxtx_vec_sse.c | 160 +++++++++++++++++- > 5 files changed, 549 insertions(+), 16 deletions(-) > > -- > 2.34.1 Acked-by: Qi Zhang <qi.z.zhang@intel.com> Applied to dpdk-next-net-intel. Thanks Qi ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path 2023-04-27 3:12 ` [PATCH v3 " Zhichao Zeng 2023-05-26 2:42 ` [PATCH v4 " Zhichao Zeng 2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng @ 2023-05-29 2:23 ` Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng ` (3 more replies) 2 siblings, 4 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-29 2:23 UTC (permalink / raw) To: dev; +Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and SSE path Zhichao Zeng (3): net/iavf: support Rx timestamp offload on AVX512 net/iavf: support Rx timestamp offload on AVX2 net/iavf: support Rx timestamp offload on SSE drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - drivers/net/iavf/iavf_rxtx_vec_sse.c | 159 ++++++++++++++++++- 5 files changed, 538 insertions(+), 16 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
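Since the cover letter only names the option, a concrete way to exercise these paths with dpdk-testpmd is sketched below; the core list, memory channel count and PCI address are placeholders, and the exact EAL options depend on the setup. The --enable-rx-timestamp flag requests RTE_ETH_RX_OFFLOAD_TIMESTAMP on the Rx queues.

    dpdk-testpmd -l 0-1 -n 4 -a 0000:17:01.0 -- -i --enable-rx-timestamp
    testpmd> start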
* [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng @ 2023-05-29 2:23 ` Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng ` (2 subsequent siblings) 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-29 2:23 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Wenjun Wu, Jingjing Wu, Beilei Xing, Bruce Richardson, Konstantin Ananyev This patch enables Rx timestamp offload on AVX512 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Wenjun Wu <wenjun1.wu@intel.com> Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - 3 files changed, 200 insertions(+), 9 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h index 547b68f441..0345a6a51d 100644 --- a/drivers/net/iavf/iavf_rxtx.h +++ b/drivers/net/iavf/iavf_rxtx.h @@ -47,7 +47,8 @@ RTE_ETH_RX_OFFLOAD_CHECKSUM | \ RTE_ETH_RX_OFFLOAD_SCTP_CKSUM | \ RTE_ETH_RX_OFFLOAD_VLAN | \ - RTE_ETH_RX_OFFLOAD_RSS_HASH) + RTE_ETH_RX_OFFLOAD_RSS_HASH | \ + RTE_ETH_RX_OFFLOAD_TIMESTAMP) /** * According to the vlan capabilities returned by the driver and FW, the vlan tci diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx512.c b/drivers/net/iavf/iavf_rxtx_vec_avx512.c index 4fe9b97278..f9961e53b8 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx512.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx512.c @@ -16,18 +16,20 @@ /****************************************************************************** * If user knows a specific offload is not enabled by APP, * the macro can be commented to save the effort of fast path. 
- * Currently below 2 features are supported in RX path, + * Currently below 6 features are supported in RX path, * 1, checksum offload * 2, VLAN/QINQ stripping * 3, RSS hash * 4, packet type analysis * 5, flow director ID report + * 6, timestamp offload ******************************************************************************/ #define IAVF_RX_CSUM_OFFLOAD #define IAVF_RX_VLAN_OFFLOAD #define IAVF_RX_RSS_OFFLOAD #define IAVF_RX_PTYPE_OFFLOAD #define IAVF_RX_FDIR_OFFLOAD +#define IAVF_RX_TS_OFFLOAD static __rte_always_inline void iavf_rxq_rearm(struct iavf_rx_queue *rxq) @@ -587,9 +589,9 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, bool offload) { struct iavf_adapter *adapter = rxq->vsi->adapter; - +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; - +#endif #ifdef IAVF_RX_PTYPE_OFFLOAD const uint32_t *type_table = adapter->ptype_tbl; #endif @@ -618,6 +620,25 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, (uint32_t)rxq->phc_time); + } + } +#endif +#endif + /* constants used in processing loop */ const __m512i crc_adjust = _mm512_set_epi32 @@ -1081,12 +1102,13 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC if (offload) { -#ifdef IAVF_RX_RSS_OFFLOAD +#if defined(IAVF_RX_RSS_OFFLOAD) || defined(IAVF_RX_TS_OFFLOAD) /** * needs to load 2nd 16B of each desc for RSS hash parsing, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1138,6 +1160,7 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm256_castsi128_si256(raw_desc_bh0), raw_desc_bh1, 1); +#ifdef IAVF_RX_RSS_OFFLOAD if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH) { /** * to shift the 32b RSS hash value to the @@ -1275,7 +1298,125 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, mb0_1 = _mm256_or_si256 (mb0_1, vlan_tci0_1); } - } /* if() on RSS hash parsing */ +#endif /* IAVF_RX_RSS_OFFLOAD */ + +#ifdef IAVF_RX_TS_OFFLOAD + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + 
*RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, + _mm256_set1_epi32(iavf_timestamp_dynflag)); + } +#endif /* IAVF_RX_TS_OFFLOAD */ + } /* if() on RSS hash or RX timestamp parsing */ #endif } #endif @@ -1408,10 +1549,62 @@ _iavf_recv_raw_pkts_vec_avx512_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], + iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); diff --git a/drivers/net/iavf/iavf_rxtx_vec_common.h b/drivers/net/iavf/iavf_rxtx_vec_common.h index cc38f70ce2..ddb13ce8c3 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_common.h +++ b/drivers/net/iavf/iavf_rxtx_vec_common.h @@ -231,9 +231,6 @@ iavf_rx_vec_queue_default(struct iavf_rx_queue *rxq) if (rxq->proto_xtr != IAVF_PROTO_XTR_NONE) return -1; - if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) - return -1; - if (rxq->offloads & IAVF_RX_VECTOR_OFFLOAD) return IAVF_VECTOR_OFFLOAD_PATH; -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
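When the cached PHC time is considered stale, the code above re-reads it with iavf_get_phc_time() and re-anchors it to the first packet's 32-bit hardware timestamp through iavf_tstamp_convert_32b_64b(). The sketch below only illustrates the usual pattern for such an extension, namely shifting a 64-bit reference by the modulo-2^32 distance to the hardware sample; it is not a copy of the driver helper.

#include <stdint.h>

/* Illustrative 32-bit to 64-bit timestamp extension. */
static inline uint64_t
extend_hw_timestamp(uint64_t phc_time, uint32_t hw_low)
{
	uint32_t phc_low = (uint32_t)phc_time;
	uint32_t fwd = hw_low - phc_low;	/* wraps modulo 2^32 */

	if (fwd > UINT32_MAX / 2)		/* sample is older than the reference */
		return phc_time - (uint32_t)(phc_low - hw_low);

	return phc_time + fwd;
}

The returned value always carries hw_low in its low 32 bits, and the epoch chosen is simply the one closest to the PHC reading.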
* [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng @ 2023-05-29 2:23 ` Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng 2023-06-06 5:41 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z 3 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-29 2:23 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on AVX2 data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning --- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++++++- 1 file changed, 182 insertions(+), 4 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_avx2.c b/drivers/net/iavf/iavf_rxtx_vec_avx2.c index 22d4d3a90f..86290c4bbb 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_avx2.c +++ b/drivers/net/iavf/iavf_rxtx_vec_avx2.c @@ -532,7 +532,9 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *type_table = adapter->ptype_tbl; const __m256i mbuf_init = _mm256_set_epi64x(0, 0, @@ -558,6 +560,21 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (!(rxdp->wb.status_error0 & rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + bool is_tsinit = false; + uint8_t inflection_point = 0; + __m256i hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm256_setzero_si256(); + is_tsinit = 1; + } else { + hw_low_last = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, rxq->phc_time); + } + } +#endif /* constants used in processing loop */ const __m256i crc_adjust = @@ -967,10 +984,11 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, if (offload) { #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ const __m128i raw_desc_bh7 = @@ -1053,7 +1071,7 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, rss_hash4_5); mb2_3 = _mm256_or_si256(mb2_3, rss_hash2_3); mb0_1 = _mm256_or_si256(mb0_1, rss_hash0_1); - } + } /* if() on RSS hash parsing */ if (rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* merge the status/error-1 bits into one register */ @@ -1132,8 +1150,121 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, mb4_5 = _mm256_or_si256(mb4_5, vlan_tci4_5); mb2_3 = _mm256_or_si256(mb2_3, vlan_tci2_3); mb0_1 = _mm256_or_si256(mb0_1, vlan_tci0_1); - } - } /* if() on RSS hash parsing */ + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m256i ts; + __m256i ts_low = _mm256_setzero_si256(); + __m256i ts_low1; + __m256i ts_low2; + __m256i max_ret; + __m256i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 8; + __m256i ts_desp_mask = _mm256_set_epi32(mask, 0, 0, 0, mask, 0, 0, 0); + __m256i cmp_mask = _mm256_set1_epi32(mask); + __m256i ts_permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0); + + ts = _mm256_and_si256(raw_desc_bh0_1, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 3 * 4)); + ts = _mm256_and_si256(raw_desc_bh2_3, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 2 * 4)); + ts = _mm256_and_si256(raw_desc_bh4_5, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, _mm256_srli_si256(ts, 4)); + ts = _mm256_and_si256(raw_desc_bh6_7, ts_desp_mask); + ts_low = _mm256_or_si256(ts_low, ts); + + ts_low1 = _mm256_permutevar8x32_epi32(ts_low, ts_permute_mask); + ts_low2 = _mm256_permutevar8x32_epi32(ts_low1, + _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7)); + ts_low2 = _mm256_and_si256(ts_low2, + _mm256_set_epi32(mask, mask, mask, mask, mask, mask, mask, 0)); + ts_low2 = _mm256_or_si256(ts_low2, hw_low_last); + hw_low_last = _mm256_and_si256(ts_low1, + _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 3); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 4); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 5); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 6); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm256_extract_epi32(ts_low1, 7); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = 
(uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm256_max_epu32(ts_low2, ts_low1); + cmp_ret = _mm256_andnot_si256(_mm256_cmpeq_epi32(max_ret, ts_low1), cmp_mask); + + if (_mm256_testz_si256(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m256i mask_low; + __m256i mask_high; + switch (shift) { + case 4: + mask_low = _mm256_set_epi32(0, 0, 0, 0, mask, mask, mask, mask); + mask_high = _mm256_set_epi32(mask, mask, mask, mask, 0, 0, 0, 0); + break; + case 2: + mask_low = _mm256_srli_si256(cmp_mask, 2 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 2 * 4); + break; + case 1: + mask_low = _mm256_srli_si256(cmp_mask, 1 * 4); + mask_high = _mm256_slli_si256(cmp_mask, 1 * 4); + break; + } + ret = _mm256_testz_si256(cmp_ret, mask_low); + if (ret) { + ret = _mm256_testz_si256(cmp_ret, mask_high); + inflection_point += ret ? 0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + mbuf_flags = _mm256_or_si256(mbuf_flags, _mm256_set1_epi32(iavf_timestamp_dynflag)); + } /* if() on Timestamp parsing */ + } #endif } @@ -1265,10 +1396,57 @@ _iavf_recv_raw_pkts_vec_avx2_flex_rxd(struct iavf_rx_queue *rxq, (_mm_cvtsi128_si64 (_mm256_castsi256_si128(status0_7))); received += burst; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= burst) ? 
inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 5: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 4], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 6: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 5], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 7: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 6], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 8: + *RTE_MBUF_DYNFIELD(rx_pkts[i + 7], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif if (burst != IAVF_DESCS_PER_LOOP_AVX) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (received > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[received - 1], iavf_timestamp_dynfield_offset, rte_mbuf_timestamp_t *); +#endif + /* update tail pointers */ rxq->rx_tail += received; rxq->rx_tail &= (rxq->nb_rx_desc - 1); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
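The fall-through switch above (kept warning-free here with the -Wimplicit-fallthrough pragma, and with explicit fallthrough comments in the v5 series) applies the wrap correction after the burst has been parsed. Written as a plain loop over already-assembled 64-bit timestamps, and simplified into a standalone helper instead of the mbuf dynfield accesses used in the driver, its effect is:

#include <stdint.h>

/*
 * Illustrative only: for a wrap detected at 1-based position
 * inflection_point within a burst of 'burst' returned packets, move every
 * packet from that position onward into the next 2^32 epoch and advance
 * the cached PHC time accordingly.
 */
static inline void
apply_wrap_correction(uint64_t *ts, uint16_t burst, uint8_t inflection_point,
		      uint64_t *phc_time)
{
	uint16_t k;

	if (inflection_point == 0 || inflection_point > burst)
		return;

	for (k = inflection_point - 1; k < burst; k++)
		ts[k] += (uint64_t)1 << 32;
	*phc_time += (uint64_t)1 << 32;
}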
* [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng 2023-05-29 2:23 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng @ 2023-05-29 2:23 ` Zhichao Zeng 2023-06-01 2:49 ` Tang, Yaqi 2023-06-06 5:41 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z 3 siblings, 1 reply; 22+ messages in thread From: Zhichao Zeng @ 2023-05-29 2:23 UTC (permalink / raw) To: dev Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng, Bruce Richardson, Konstantin Ananyev, Jingjing Wu, Beilei Xing This patch enables Rx timestamp offload on SSE data path. Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> --- v4: rework avx2 patch based on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and timestamp error --- drivers/net/iavf/iavf_rxtx_vec_sse.c | 159 ++++++++++++++++++++++++++- 1 file changed, 156 insertions(+), 3 deletions(-) diff --git a/drivers/net/iavf/iavf_rxtx_vec_sse.c b/drivers/net/iavf/iavf_rxtx_vec_sse.c index 3f30be01aa..b754122c51 100644 --- a/drivers/net/iavf/iavf_rxtx_vec_sse.c +++ b/drivers/net/iavf/iavf_rxtx_vec_sse.c @@ -392,6 +392,11 @@ flex_desc_to_olflags_v(struct iavf_rx_queue *rxq, __m128i descs[4], _mm_extract_epi32(fdir_id0_3, 3); } /* if() on fdir_enabled */ +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) + flags = _mm_or_si128(flags, _mm_set1_epi32(iavf_timestamp_dynflag)); +#endif + /** * At this point, we have the 4 sets of flags in the low 16-bits * of each 32-bit value in flags. 
@@ -723,7 +728,9 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, int pos; uint64_t var; struct iavf_adapter *adapter = rxq->vsi->adapter; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC uint64_t offloads = adapter->dev_data->dev_conf.rxmode.offloads; +#endif const uint32_t *ptype_tbl = adapter->ptype_tbl; __m128i crc_adjust = _mm_set_epi16 (0, 0, 0, /* ignore non-length fields */ @@ -793,6 +800,24 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rte_cpu_to_le_32(1 << IAVF_RX_FLEX_DESC_STATUS0_DD_S))) return 0; +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC + uint8_t inflection_point = 0; + bool is_tsinit = false; + __m128i hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint64_t sw_cur_time = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + + if (unlikely(sw_cur_time - rxq->hw_time_update > 4)) { + hw_low_last = _mm_setzero_si128(); + is_tsinit = 1; + } else { + hw_low_last = _mm_set_epi32(0, 0, 0, (uint32_t)rxq->phc_time); + } + } + +#endif + /** * Compile-time verify the shuffle mask * NOTE: some field positions already verified above, but duplicated @@ -825,7 +850,7 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, rxdp += IAVF_VPMD_DESCS_PER_LOOP) { __m128i descs[IAVF_VPMD_DESCS_PER_LOOP]; #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC - __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP]; + __m128i descs_bh[IAVF_VPMD_DESCS_PER_LOOP] = {_mm_setzero_si128()}; #endif __m128i pkt_mb0, pkt_mb1, pkt_mb2, pkt_mb3; __m128i staterr, sterr_tmp1, sterr_tmp2; @@ -895,10 +920,11 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, #ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC /** - * needs to load 2nd 16B of each desc for RSS hash parsing, + * needs to load 2nd 16B of each desc, * will cause performance drop to get into this context. 
*/ if (offloads & RTE_ETH_RX_OFFLOAD_RSS_HASH || + offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP || rxq->rx_flags & IAVF_RX_FLAGS_VLAN_TAG_LOC_L2TAG2_2) { /* load bottom half of every 32B desc */ descs_bh[3] = _mm_load_si128 @@ -964,7 +990,94 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, pkt_mb2 = _mm_or_si128(pkt_mb2, vlan_tci2); pkt_mb1 = _mm_or_si128(pkt_mb1, vlan_tci1); pkt_mb0 = _mm_or_si128(pkt_mb0, vlan_tci0); - } + } /* if() on Vlan parsing */ + + if (offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + uint32_t mask = 0xFFFFFFFF; + __m128i ts; + __m128i ts_low = _mm_setzero_si128(); + __m128i ts_low1; + __m128i max_ret; + __m128i cmp_ret; + uint8_t ret = 0; + uint8_t shift = 4; + __m128i ts_desp_mask = _mm_set_epi32(mask, 0, 0, 0); + __m128i cmp_mask = _mm_set1_epi32(mask); + + ts = _mm_and_si128(descs_bh[0], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 3 * 4)); + ts = _mm_and_si128(descs_bh[1], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 2 * 4)); + ts = _mm_and_si128(descs_bh[2], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, _mm_srli_si128(ts, 1 * 4)); + ts = _mm_and_si128(descs_bh[3], ts_desp_mask); + ts_low = _mm_or_si128(ts_low, ts); + + ts_low1 = _mm_slli_si128(ts_low, 4); + ts_low1 = _mm_and_si128(ts_low, _mm_set_epi32(mask, mask, mask, 0)); + ts_low1 = _mm_or_si128(ts_low1, hw_low_last); + hw_low_last = _mm_and_si128(ts_low, _mm_set_epi32(0, 0, 0, mask)); + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 0); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 1); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 2); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset, uint32_t *) = _mm_extract_epi32(ts_low, 3); + + if (unlikely(is_tsinit)) { + uint32_t in_timestamp; + + if (iavf_get_phc_time(rxq)) + PMD_DRV_LOG(ERR, "get physical time failed"); + in_timestamp = *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset, uint32_t *); + rxq->phc_time = iavf_tstamp_convert_32b_64b(rxq->phc_time, in_timestamp); + } + + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) = (uint32_t)(rxq->phc_time >> 32); + + max_ret = _mm_max_epu32(ts_low, ts_low1); + cmp_ret = _mm_andnot_si128(_mm_cmpeq_epi32(max_ret, ts_low), cmp_mask); + + if (_mm_testz_si128(cmp_ret, cmp_mask)) { + inflection_point = 0; + } else { + inflection_point = 1; + while (shift > 1) { + shift = shift >> 1; + __m128i mask_low; + __m128i mask_high; + switch (shift) { + case 2: + mask_low = _mm_set_epi32(0, 0, mask, mask); + mask_high = _mm_set_epi32(mask, mask, 0, 0); + break; + case 1: + mask_low = _mm_srli_si128(cmp_mask, 4); + mask_high = _mm_slli_si128(cmp_mask, 4); + break; + } + ret = _mm_testz_si128(cmp_ret, mask_low); + if (ret) { + ret = _mm_testz_si128(cmp_ret, mask_high); + inflection_point += ret ? 
0 : shift; + cmp_mask = mask_high; + } else { + cmp_mask = mask_low; + } + } + } + } /* if() on Timestamp parsing */ flex_desc_to_olflags_v(rxq, descs, descs_bh, &rx_pkts[pos]); #else @@ -1011,10 +1124,50 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq, /* C.4 calc available number of desc */ var = __builtin_popcountll(_mm_cvtsi128_si64(staterr)); nb_pkts_recd += var; + +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wimplicit-fallthrough" + if (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP) { + inflection_point = (inflection_point <= var) ? inflection_point : 0; + switch (inflection_point) { + case 1: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 0], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 2: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 1], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 3: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 2], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + case 4: + *RTE_MBUF_DYNFIELD(rx_pkts[pos + 3], + iavf_timestamp_dynfield_offset + 4, uint32_t *) += 1; + rxq->phc_time += (uint64_t)1 << 32; + case 0: + break; + default: + PMD_DRV_LOG(ERR, "invalid inflection point for rx timestamp"); + break; + } + + rxq->hw_time_update = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000); + } +#pragma GCC diagnostic pop +#endif + if (likely(var != IAVF_VPMD_DESCS_PER_LOOP)) break; } +#ifndef RTE_LIBRTE_IAVF_16BYTE_RX_DESC +#ifdef IAVF_RX_TS_OFFLOAD + if (nb_pkts_recd > 0 && (rxq->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP)) + rxq->phc_time = *RTE_MBUF_DYNFIELD(rx_pkts[nb_pkts_recd - 1], + iavf_timestamp_dynfield_offset, uint32_t *); +#endif +#endif + /* Update our internal tail pointer */ rxq->rx_tail = (uint16_t)(rxq->rx_tail + nb_pkts_recd); rxq->rx_tail = (uint16_t)(rxq->rx_tail & (rxq->nb_rx_desc - 1)); -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
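The 4 ms rule that decides between reusing the cached PHC time and re-reading it from firmware is the same in all three patches. A small standalone sketch of that decision, using the same DPDK timer APIs as the driver, is shown below; the helper itself is hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <rte_cycles.h>

/*
 * Illustrative only: hw_time_update_ms is the millisecond-granularity
 * software clock value recorded when the PHC cache was last refreshed.
 * If more than 4 ms have passed, the cached value is treated as stale
 * and the PHC is read again before timestamps are reconstructed.
 */
static inline bool
phc_cache_is_stale(uint64_t hw_time_update_ms)
{
	uint64_t now_ms = rte_get_timer_cycles() / (rte_get_timer_hz() / 1000);

	return (now_ms - hw_time_update_ms) > 4;
}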
* RE: [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE 2023-05-29 2:23 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng @ 2023-06-01 2:49 ` Tang, Yaqi 0 siblings, 0 replies; 22+ messages in thread From: Tang, Yaqi @ 2023-06-01 2:49 UTC (permalink / raw) To: Zeng, ZhichaoX, dev Cc: Zhang, Qi Z, Richardson, Bruce, Konstantin Ananyev, Wu, Jingjing, Xing, Beilei > -----Original Message----- > From: Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Sent: Monday, May 29, 2023 10:23 AM > To: dev@dpdk.org > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; Tang, Yaqi <yaqi.tang@intel.com>; > Zeng, ZhichaoX <zhichaox.zeng@intel.com>; Richardson, Bruce > <bruce.richardson@intel.com>; Konstantin Ananyev > <konstantin.v.ananyev@yandex.ru>; Wu, Jingjing <jingjing.wu@intel.com>; Xing, > Beilei <beilei.xing@intel.com> > Subject: [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE > > This patch enables Rx timestamp offload on SSE data path. > > Enable timestamp offload with the command '--enable-rx-timestamp', pay > attention that getting Rx timestamp offload will drop the performance. > > Signed-off-by: Zhichao Zeng <zhichaox.zeng@intel.com> > > --- > v4: rework avx2 patch based on offload path > --- > v3: logging with driver dedicated macro > --- > v2: fix compile warning and timestamp error > --- Functional test passed. Cover SSE, AVX2 and AVX512 paths. Tested-by: Yaqi Tang <yaqi.tang@intel.com> ^ permalink raw reply [flat|nested] 22+ messages in thread
* RE: [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path 2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng ` (2 preceding siblings ...) 2023-05-29 2:23 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng @ 2023-06-06 5:41 ` Zhang, Qi Z 3 siblings, 0 replies; 22+ messages in thread From: Zhang, Qi Z @ 2023-06-06 5:41 UTC (permalink / raw) To: Zeng, ZhichaoX, dev; +Cc: Tang, Yaqi > -----Original Message----- > From: Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Sent: Monday, May 29, 2023 10:23 AM > To: dev@dpdk.org > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; Tang, Yaqi <yaqi.tang@intel.com>; > Zeng, ZhichaoX <zhichaox.zeng@intel.com> > Subject: [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path > > Enable timestamp offload with the command '--enable-rx-timestamp', pay > attention that getting Rx timestamp offload will drop the performance. Performance drop when enabling Rx timestamp offloading is a known issue, actually, the patch reduces the downgrade. Refined the commit log to : This patch enables Rx timestamp offload on the vector data path. It significantly reduces the performance drop when RTE_ETH_RX_OFFLOAD_TIMESTAMP is enabled. > > --- > v4: rework avx2 patch based on offload path > --- > v3: logging with driver dedicated macro > --- > v2: fix compile warning and SSE path > > Zhichao Zeng (3): > net/iavf: support Rx timestamp offload on AVX512 > net/iavf: support Rx timestamp offload on AVX2 > net/iavf: support Rx timestamp offload on SSE > > drivers/net/iavf/iavf_rxtx.h | 3 +- > drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++- > drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- > drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - > drivers/net/iavf/iavf_rxtx_vec_sse.c | 159 ++++++++++++++++++- > 5 files changed, 538 insertions(+), 16 deletions(-) > > -- > 2.34.1 Acked-by: Qi Zhang <qi.z.zhang@intel.com> Applied to dpdk-next-net-intel. Thanks Qi ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path @ 2023-05-26 2:41 Zhichao Zeng 0 siblings, 0 replies; 22+ messages in thread From: Zhichao Zeng @ 2023-05-26 2:41 UTC (permalink / raw) To: dev; +Cc: qi.z.zhang, yaqi.tang, Zhichao Zeng Enable timestamp offload with the command '--enable-rx-timestamp', pay attention that getting Rx timestamp offload will drop the performance. --- v4: rework avx2 patch base on offload path --- v3: logging with driver dedicated macro --- v2: fix compile warning and SSE path Zhichao Zeng (3): net/iavf: support Rx timestamp offload on AVX512 net/iavf: support Rx timestamp offload on AVX2 net/iavf: support Rx timestamp offload on SSE drivers/net/iavf/iavf_rxtx.h | 3 +- drivers/net/iavf/iavf_rxtx_vec_avx2.c | 186 +++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_avx512.c | 203 +++++++++++++++++++++++- drivers/net/iavf/iavf_rxtx_vec_common.h | 3 - drivers/net/iavf/iavf_rxtx_vec_sse.c | 161 ++++++++++++++++++- 5 files changed, 539 insertions(+), 17 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread. Thread overview: 22+ messages
2023-04-10 7:35 [PATCH 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng
2023-04-12 6:49 ` [PATCH v2 " Zhichao Zeng
2023-04-12 8:46 ` Zhichao Zeng
2023-04-27 3:12 ` [PATCH v3 " Zhichao Zeng
2023-05-26 2:42 ` [PATCH v4 " Zhichao Zeng
2023-05-26 9:50 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng
2023-05-26 9:50 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng
2023-05-26 9:50 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng
2023-05-26 9:50 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng
2023-06-14 1:49 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhichao Zeng
2023-06-14 1:49 ` [PATCH v5 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng
2023-06-14 1:49 ` [PATCH v5 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng
2023-06-14 1:49 ` [PATCH v5 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng
2023-06-15 9:28 ` Tang, Yaqi
2023-06-19 1:03 ` [PATCH v5 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z
2023-05-29 2:23 ` [PATCH v4 " Zhichao Zeng
2023-05-29 2:23 ` [PATCH v4 1/3] net/iavf: support Rx timestamp offload on AVX512 Zhichao Zeng
2023-05-29 2:23 ` [PATCH v4 2/3] net/iavf: support Rx timestamp offload on AVX2 Zhichao Zeng
2023-05-29 2:23 ` [PATCH v4 3/3] net/iavf: support Rx timestamp offload on SSE Zhichao Zeng
2023-06-01 2:49 ` Tang, Yaqi
2023-06-06 5:41 ` [PATCH v4 0/3] Enable iavf Rx Timestamp offload on vector path Zhang, Qi Z
2023-05-26 2:41 Zhichao Zeng