https://support.vyos.io/en/kb/articles/system-optimization-4
I've found that adjusting the “ring buffer” size of a NIC does what the article states: a higher value lowers CPU utilization, while a lower value reduces latency. If you run “top” during an iperf3 test you'll see the process “ksoftirqd” pop to/towards the top of the list; with higher “ring buffer” values (assuming your card supports them) you'll notice the CPU usage is lower.
Since latency is important and is normally limited by the WAN connection, perhaps use lower “ring buffer” sizes on the WAN interfaces and higher “ring buffer” sizes on the LAN interfaces? This still needs to be tested.
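A quick way to see the effect for yourself (the server address here is just a placeholder): push traffic through the router with iperf3 from a host on one side to a host on the other, and watch the router at the same time:
iperf3 -c 203.0.113.10
top
In top, press “1” to show each core separately; watch ksoftirqd and the “si” (softirq) percentage while you change the ring buffer size.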
To find the current and supported settings:
sudo ethtool -g eth0
It should give something like:
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             128
RX Mini:        0
RX Jumbo:       0
TX:             128
To test a different size (here we're doing the maximum size):
sudo ethtool -G eth0 tx 4096 rx 4096
If the result fits your needs, you can make it permanent by editing:
sudo vi /config/scripts/vyos-postconfig-bootup.script
And adding:
ethtool -G eth0 tx 4096 rx 4096
Be aware that if you have different model network cards in your device, they may support different maximum sizes. Check each one to be sure before you start modifying settings.
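To check them all in one go (adjust the interface list to whatever your device actually has):
for i in eth0 eth1 eth2; do sudo ethtool -g $i; done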
I found that when everything is logged it eats up a bunch of CPU time. You'll see this if you run top during an iperf3 test: “journald” may be at the top of the process list and the CPU “id” (idle) value will be something like 20. After disabling all firewall logging this went from the 20s to the 80s. After disabling logging only on the established/related/invalid rules and on ping, it stayed in the 80s when running an iperf3 test @ 940Mbit/s. When it was down in the 20s, running “journalctl” showed a lot of “kernel missed something” type messages; I'm guessing these are the logs it couldn't record, but that's just a guess…
Obviously we want to log things; just log what is important if your resources are limited. One thing you don't want to log is the traffic from your VyOS device to your syslog server (if you've set one up). That generates so much logged traffic that even on a device with low traffic volume it will chew up CPU resources.
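For example, to switch off logging on an individual rule (VyOS 1.3.x syntax from configure mode; “WAN_IN” and rule 10 are placeholders for whatever your established/related rule actually is):
set firewall name WAN_IN rule 10 log disable
commit
save
You can watch what journald is actually writing with:
journalctl -f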
So as of today (07-14-2022) I'm going to plan on logging everything except
https://support.vyos.io/en/kb/articles/system-optimization-4
Per VyOS' support article, VyOS will attempt to split the IRQs across CPUs/cores automatically. This appears to apply to physical interfaces, not to something like bond0.
Per the article, you want a CPU/core per interface, and on newer systems each interface may be able to split its work across 3 CPUs/cores (rx, tx and control).
So the moral of the story is to have at least 1 CPU/core per physical interface, or 3 per interface if your system supports it. I think I'll start using 10Gbit interfaces instead of bonding a bunch of gigabit ports; that should be cheaper than putting a Threadripper in a router.
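To see how the interrupts are actually being spread, compare the per-CPU columns in /proc/interrupts against your core count:
grep eth /proc/interrupts
nproc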
If using iperf3, running it without -R will show upload speed; with -R it will show download speed.
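For example (the server address is a placeholder):
iperf3 -c 203.0.113.10
iperf3 -c 203.0.113.10 -R
The first run has the client sending (upload); with -R the server sends back to the client (download).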
If you use SNAT and try to match on a SNATed subnet/address, it won't work; the default rule will apply instead. Here you'll have to mark the traffic and match on the mark. See this forum post: https://forum.vyos.io/t/limit-bandwith-for-indivindual-ips-on-1-2-5/5947/25 and how to mark here: https://blog.vyos.io/using-the-policy-route-and-packet-marking-for-custom-qos-matches
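Roughly what the marking looks like (a sketch in VyOS 1.3.x-style syntax; the names, rule numbers, subnet and interface are all placeholders, so check the links above for the exact commands on your version):
set policy route PBR rule 10 source address 192.168.1.0/24
set policy route PBR rule 10 set mark 10
set interfaces ethernet eth1 policy route PBR
set traffic-policy shaper WAN-OUT class 20 match MARKED mark 10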
If you use WAN load balancing/failover then you're screwed: you won't be able to mark packets to have them matched. This is due to how WAN load balancing uses the kernel's mangle table (though I don't really understand this). Here is info from people who know more than I do, along with a possible hack fix: https://forum.vyos.io/t/solved-wan-load-balancing-with-policy-route-rules-previously-wan-load-balancing-with-2-pppoe-connections-with-tcp-mss-clamping/1968
The moral here is: if you want to do QoS, set up an internal router with a single “external” interface that connects to another router that does NAT and WAN failover; don't do SNAT or WAN failover on a device that needs anything but basic QoS.
When using fq_codel, a higher queue limit means more throughput. If you're setting a ceiling of 240mbit, for instance, and your tests only give you 150Mbit/s, increase your queue limit to 1000 and see if that allows it to reach full bandwidth; then back it off until it starts to affect throughput, since a lower queue limit typically means lower latency.
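For example (VyOS 1.3.x traffic-policy syntax; the policy and class names/numbers are placeholders):
set traffic-policy shaper WAN-OUT class 10 queue-type fq-codel
set traffic-policy shaper WAN-OUT class 10 queue-limit 1000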
When using fq_codel, if your shaper bandwidth is set to 1000mbit and you have 3 classes with minimum and ceiling bandwidths, but your actual connection can never reach 1000mbit due to a bottleneck elsewhere, then your shaping won't work: the shaper itself has to be the bottleneck for the class guarantees to take effect. You'll probably notice that bandwidth is simply handed out evenly among the 3 classes instead of being “shaped” as you wish.
When using a shaper, for classes it appears you can use a higher ceiling than the bandwidth available. This would allow you to for instance…. wait, let me test.