Forum Discussion

Lewis_Beard
2 months ago
Solved

SNMP collector performance: SHA/AES vs MD5/DES

Is there a significant difference in the collector processing load or overhead that would impact performance, when switching from MD5/DES to SHA/AES?

https://www.logicmonitor.com/support/collectors/collector-overview/collector-capacity

I was looking at the collector capacity page, and while v3 is obviously more of a burden on the collector than v2, I don’t see anything about MD5/DES vs SHA/AES. I’m wondering if we can simply change a collector’s SNMP properties and assume a fairly (but not overly) loaded collector can handle this?

Or is that taking a huge chance? Does anyone have any experience with this?

  • I don’t actually know which would be more demanding offhand. The simplest way to test it would be to pick a resource or small set of resources and make the change, then measure what we call the ival and see if it’s significantly different. Full explanation below:

    Collector capacity for SNMP is primarily constrained by the availability of the collector’s SNMP worker threads. A collector’s SNMP polling workload is the number of instances that use SNMP as a collection type on the resources assigned to that collector, multiplied by the polling frequency (the inverse of the polling interval). Each of these instances creates a recurring task on the collector, scheduled at the proper interval. A pool of dedicated worker threads waits to serve these tasks. When a task’s scheduled time occurs (as dictated by the interval), it requests a thread from the pool. If a thread is available, the task runs on it: it sends the request from the collector to the SNMP agent on the target resource, waits for the response, processes the data when the response arrives, and releases the thread back into the pool.
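
    To put rough numbers on that workload description, here is a back-of-the-envelope sketch. It is not LogicMonitor’s own capacity formula; it just applies Little’s law (average busy threads = task arrival rate × time each task holds a thread). The instance counts, intervals, and execution times are made-up values, and the function names are hypothetical.

```python
# Hypothetical workload: (instance count, polling interval in seconds).
# These numbers are illustrative only, not taken from any real collector.
workload = [
    (2000, 60),    # 2000 instances polled every minute
    (6000, 300),   # 6000 instances polled every 5 minutes
]

def polls_per_second(workload):
    """Sum of instances times polling frequency (1 / interval)."""
    return sum(instances / interval for instances, interval in workload)

def average_busy_threads(rate_per_s, exec_time_ms):
    """Little's law: busy threads = arrival rate * time each task holds a thread."""
    return rate_per_s * exec_time_ms / 1000.0

rate = polls_per_second(workload)
print(f"{rate:.1f} polls per second")
for exec_ms in (2, 4, 20):  # assumed per-poll execution times in ms
    busy = average_busy_threads(rate, exec_ms)
    print(f"{exec_ms} ms per poll -> {busy:.2f} threads busy on average")
```

    This only gives the average; bursts of tasks scheduled at the same moment will need more threads than the average suggests.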

    If a collector has enough tasks running that all threads are busy when a scheduled task makes its request, the task will wait briefly for a thread to become available. If some amount of time passes and no thread becomes available, the scheduler gives up on the task (by then, the task’s next scheduled run is likely imminent in any case). At that point the collector is working beyond its capacity: task executions are being requested, but there are no idle threads in the pool to execute them, so data is not being collected. Because the collector’s scheduler does the best it can, it’s normal at this point to see gaps in graphs with some successful executions sprinkled in.
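
    The give-up behavior can be illustrated with a toy simulation. This is only a sketch of the idea, not the collector’s actual scheduler: the pool size, the wait limit, and the `simulate` function are all assumptions made for illustration.

```python
import heapq

def simulate(arrivals, exec_time, pool_size, max_wait):
    """Return (completed, dropped) for tasks arriving at the given times (seconds).

    Each task takes the thread that frees up soonest. If it would have to
    wait longer than max_wait, the scheduler gives up and the poll is lost.
    """
    free_at = [0.0] * pool_size          # time at which each thread becomes idle
    heapq.heapify(free_at)
    completed = dropped = 0
    for t in sorted(arrivals):
        start = max(t, free_at[0])       # free_at[0] is the soonest-free thread
        if start - t > max_wait:
            dropped += 1                 # no thread in time: a gap in the graph
            continue
        heapq.heapreplace(free_at, start + exec_time)
        completed += 1
    return completed, dropped

# 100 polls, one every 10 ms, each holding a thread for 2 ms.
schedule = [i * 0.01 for i in range(100)]
print(simulate(schedule, exec_time=0.002, pool_size=2, max_wait=1.0))

# Same schedule, but every poll holds its thread for a 5 s timeout:
# the two threads fill up and most polls never get one.
print(simulate(schedule, exec_time=5.0, pool_size=2, max_wait=1.0))
```

    The second call shows the pattern described above: a few executions succeed and the rest are dropped.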

    There are a few collector datasources that can help show when this is happening. The LogicMonitor_Collector_DataCollectingTasks datasource has a datapoint, UnavailableScheduleTaskRate, which by default triggers a warning when tasks aren’t getting scheduled properly. It’s a good one to keep an eye on, but once it alerts, you’re already losing data because of capacity constraints.

    From a workload perspective, the longer a response takes, the longer it occupies a thread, so switching the encryption, if it slows things down, could certainly have an impact. (That said, non-responsive resources use much, much more thread time, because each thread waits for an entire timeout interval; a typical SNMP response time is in single-digit ms, while a timeout takes many thousands of ms.) As you can imagine, there are a lot of variables that go into this.
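
    To see why unresponsive resources matter so much more than a slower cipher, here is a rough arithmetic sketch. The 2 ms response time matches the lab figure mentioned in this reply; the 5000 ms timeout, the instance count, and the function name are assumptions, and retries are ignored.

```python
def thread_seconds_per_cycle(n_instances, unresponsive_fraction,
                             response_ms=2.0, timeout_ms=5000.0):
    """Total thread time (seconds) needed to poll every instance once.

    Responsive instances hold a thread for response_ms; unresponsive ones
    hold it for the full timeout_ms (assumed value, no retries modeled).
    """
    responsive = n_instances * (1 - unresponsive_fraction) * response_ms
    unresponsive = n_instances * unresponsive_fraction * timeout_ms
    return (responsive + unresponsive) / 1000.0

n = 10_000  # hypothetical number of SNMP instances on the collector
print("all responsive, 2 ms each:", thread_seconds_per_cycle(n, 0.0))
print("all responsive, 4 ms each:", thread_seconds_per_cycle(n, 0.0, response_ms=4.0))
print("1% timing out, 2 ms each: ", thread_seconds_per_cycle(n, 0.01))
```

    Under these assumptions, doubling the per-poll time doubles the thread time, while 1% of instances timing out raises it by a much larger factor.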

    For your testing, though, the easiest approach is to use the collector debug facility (only available to administrators with the proper role): check how long execution tends to take on one or a few resources, make your changes, wait for a new set of polls, and then check again.

    [Screenshot: !tlist with filter for SNMP and resource]

    Here’s a screenshot of a lab collector’s response to a !tlist command with filter arguments for SNMP collection and a particular resource. The execution time and status are the two rightmost columns. (Note that SNMP is fast here, about 2 ms.)

    There are also constraints when it comes to memory and CPU, but you’re less likely to see those directly. All of these constraints can be managed easily by changing the collector size.
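
    For the before/after test, once you have noted the execution times from the debug output by hand, something like this sketch can summarize the difference. The sample values below are placeholders, not measurements, and `compare` is a hypothetical helper; replace the lists with the times you actually observe.

```python
from statistics import mean, median

def compare(before_ms, after_ms):
    """Summarize the change in execution time between two sets of samples."""
    b_med, a_med = median(before_ms), median(after_ms)
    return {
        "before_median_ms": b_med,
        "after_median_ms": a_med,
        "median_change_pct": (a_med - b_med) / b_med * 100.0,
        "before_mean_ms": mean(before_ms),
        "after_mean_ms": mean(after_ms),
    }

# Placeholder values only: substitute the execution times you record
# before the change (MD5/DES) and after it (SHA/AES).
before = [2, 2, 3, 2, 2, 4, 2]
after = [2, 3, 3, 2, 3, 4, 3]

for key, value in compare(before, after).items():
    print(f"{key}: {value:.2f}")
```

    The median is used for the headline comparison because a single timeout in either sample would skew the mean heavily.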
