I modified the dpdkdump project's code (which uses the pdump framework) to scale it and support the highest throughput possible. I enabled the pdump framework (which creates a hook in rx/tx in the background) with a big rte_ring, and then fanned the packets out to several rte_rings (I tried with at most 10). Each of those rings is continuously polled by a separate process, each running on its own core, which writes the packets out.

When my primary application was handling around 1 million pps in each direction, I got around 2 million pps (in+out) in the main rte_ring. But when I increased the load to 2 million pps in each direction, I was only getting around 2.6-2.8 million pps from the hook, though I should get around 4 million pps (in+out). I am seeing a big "ring full" count. Is that because my fan-out is too slow, or does pdump itself have a bottleneck? Please let me know.
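To make the question more concrete, here is a simplified model of the fan-out stage. This is not my actual code and it makes no DPDK calls: the worker rings are reduced to plain fill counters, and `fanout_burst`, `fanout_stats` and `WORKER_RING_CAP` are hypothetical names chosen for illustration. It only shows the idea of counting drops per worker ring, separately from the "ring full" count on the main ring, so the two stages can be told apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_WORKERS     10 /* same as the 10 writer processes described above */
#define WORKER_RING_CAP 2  /* hypothetical; tiny on purpose so drops are easy to see */

/* Per-worker counters for the fan-out stage only. */
struct fanout_stats {
    uint64_t enqueued[NUM_WORKERS];
    uint64_t dropped[NUM_WORKERS];
};

/*
 * Distribute one burst of packets (represented only by their flow hashes)
 * over the worker rings. ring_fill[w] models how many slots of worker
 * ring w are currently occupied. Returns the number of packets accepted.
 */
static size_t
fanout_burst(const uint32_t *flow_hash, size_t n,
             uint32_t *ring_fill, struct fanout_stats *st)
{
    size_t accepted = 0;

    assert(flow_hash != NULL && ring_fill != NULL && st != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Same flow always goes to the same worker. */
        uint32_t w = flow_hash[i] % NUM_WORKERS;

        if (ring_fill[w] < WORKER_RING_CAP) {
            ring_fill[w]++;
            st->enqueued[w]++;
            accepted++;
        } else {
            /* Worker ring full: its writer is not draining fast enough. */
            st->dropped[w]++;
        }
    }
    return accepted;
}
```

In a model like this, large `dropped[]` values would point at slow writers or a skewed hash, while `dropped[]` staying near zero with the main ring still reporting "ring full" would suggest the loop that drains the main ring (or the hook feeding it) is the limit.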
I took a look at dumpcap, which aims at capturing 10 Gbit/s, but as far as I understand it needs to be run as a primary process. I need a secondary DPDK application that can capture around 5-10 million pps.