I’ve started encountering a problem that I should use some assistance troubleshooting. I’ve got a Proxmox system that hosts, primarily, my Opnsense router. I’ve had this specific setup for about a year.
Recently, I’ve been experiencing sluggishness and noticed that the IO wait is through the roof. Rebooting the Opnsense VM, which normally only takes a few minutes is now taking upwards of 15-20. The entire time my IO wait sits between 50-80%.
The system has 1 disk in it that is formatted ZFS. I’ve checked dmesg, and the syslog for indications of disk errors (this feels like a failing disk) and found none. I also checked the smart statistics and they all “PASSED”.
Any pointers would be appreciated.
Edit: I believe I’ve found the root cause of the change in performance and it was a bit of shooting myself in the foot. I’ve been experimenting with different tools for log collection and the most recent one is a SIEM tool called Wazuh. I didn’t realize that upon reboot it runs an integrity check that generates a ton of disk I/O. So when I rebooted this proxmox server, that integrity check was running on proxmox, my pihole, and (I think) opnsense concurrently. All against a single consumer grade HDD.
Thanks to everyone who responded. I really appreciate all the performance tuning guidance. I’ve also made the following changes:
- Added a 2nd drive (I have several of these lying around, don’t ask) converting the zfs pool into a mirror. This gives me both redundancy and should improve read performance.
- Configured a 2nd storage target on the same zpool with compression enabled and a 64k block size in proxmox. I then migrated the 2 VMs to that storage.
- Since I’m collecting logs in Wazuh I set Opnsense to use ram disks for /tmp and /var/log.
Rebooted Opensense and it was back up in 1:42 min.
Check the ZFS pool status. You could lots of errors that ZFS is correcting.
I’m starting to lean towards this being an I/O issue but I haven’t figure out what or why yet. I don’t often make changes to this environment since it’s running my Opnsens router.
root@proxmox-02:~# zpool status pool: rpool state: ONLINE status: Some supported and requested features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(7) for details. scan: scrub repaired 0B in 00:56:10 with 0 errors on Sun Apr 28 17:24:59 2024 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 ata-ST500LM021-1KJ152_W62HRJ1A-part3 ONLINE 0 0 0 errors: No known data errors