[ixpmanager] SFLOW Under Reporting?

Tue Jun 27 10:18:05 IST 2023

Hi,

So... since my earlier message (below), I whipped up a little test
script to try out threads in Perl.

On the back of that, I tried a little hack:

I added `use threads;` and made this one-line change:

-               process_rrd($interval, $matrix, $rrdcached);
+               threads->create('process_rrd', $interval, $matrix, $rrdcached);

Bingo!

We're currently doing 736G according to MRTG. The live sflow graphs are
reporting 332G, whereas my new instance is giving 703G in a VM and 705G
on bare metal.

I still think it's worth re-writing this in Go for performance, which I 
can look at in the future, but for now that appears to have resolved 
things.

I'll do a pull-request on GitHub, as long as this seems to continue to 
work ok.

Ian

On 2023-06-27 08:01, Ian Chilton wrote:
> Hi,
> 
>> On 2023-06-27 00:31, Nick Hilliard (INEX) wrote:
>> My initial suspicion would be tail drops due to a buffer overflow on
>> the pipe between the sflowtool process and sflow-to-rrd-handler.
> 
> You beat me to posting, but I also came to this conclusion and I
> believe I've proved this to be the case.
> 
> I wrote this to simply parse the sflow data and sum it. Every
> 'interval', it takes the total and zeros it. In a thread (which is
> irrelevant here, but I was testing for the bigger workload), it prints
> it out.
> 
> https://gist.github.com/ichilton/b53cde596bb02289fca88fb61480c58f
> 
> It was 1AM when I did this and it's now only 07:30, so I've not tested
> it at peak traffic, but at the lower traffic at those times I'm seeing
> between 5-8% of the total traffic reported by MRTG (obviously that is
> over a 5 minute interval and this is at a 1 minute interval, but it's
> in the ballpark).
> 
> So I believe, assuming it can even keep up normally, what is happening
> is that while it's busy executing the periodic flush / MAC table code,
> the buffer is overflowing and it's missing samples.
> 
> That explains why the shape/trend of the graphs is correct while the
> numbers are lower than they should be.
> 
> A quick workaround would be to hack the current script to do the 
> periodic flush/reload in a thread so it happens concurrently to the 
> flow parsing.
> 
> Ultimately, we'll keep hitting this in the future as the number of
> interfaces and the traffic increase.
> 
> I plan to do the following:
> 
> - Re-write this in Go for better performance.
> 
> - I have an idea that we could have a thread per switch: each switch
> sends sflow to a different port and the script manages one sflowtool
> per switch, so that each thread only parses a subset of the overall
> data (we have ~20 switches), which would make it more scalable.
> 
> - Nicolaas posted to the list about goflow2, which is worth comparing 
> to sflowtool.
> 
> I'm busy for the rest of this week with travel/DCs/event and prep for 
> that, but I plan to work on this more in the coming weeks.
> 
> In the meantime, any thoughts/ideas/suggestions are welcome.
> 
> Ian