Forum Discussion

wfaulk
3 months ago

Collecting a very large number of datapoints

I have a need to collect data about CPU P-levels in VMware hosts.

The way that VMware is structured in LogicMonitor relies on a vCenter server, and then all of its hosts are created in Active Discovery. There does not seem to be a way to create a datasource that uses those AD-found hosts as target resources.

So I have a script that hits my vCenter, loops through each of a few dozen hosts, each of which has around 80 CPUs, each of which has around 16 P-levels. When you multiply that all up, that's about 30,000 instances. The script runs in the debug environment in about 20 seconds and completes without error, but the output is truncated and ends with "(data truncated)".

When I try to "Test Active Discovery", it just spins and spins, never completing. I've waited up to about 20 minutes, I think.

It seems likely that this is too much data for LogicMonitor to deal with all at once. However, I don't seem to have the option to be more precise in my target for the script. It would make more logical sense to collapse some of these instances down to multiple datapoints in fewer instances, but there isn't a set number of P-levels per CPU, and there isn't a set number of CPUs per host, so I don't see any way to do that.

There doesn't seem to be any facility to collect this data in batches.

What can I do?
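For scale, the instance arithmetic above can be sketched as follows. The host count of 24 is an assumption chosen to match "a few dozen" and "about 30,000"; the per-host and per-CPU figures are the approximate ones from the post. The second figure shows what one instance per CPU (with the P-levels folded into datapoints) would come to under the same assumptions.

```python
def instance_count(hosts: int, cpus_per_host: int, pstates_per_cpu: int = 1) -> int:
    """Number of instances if every (host, CPU, P-level) combination is an instance."""
    return hosts * cpus_per_host * pstates_per_cpu


HOSTS = 24          # assumption: "a few dozen" hosts
CPUS_PER_HOST = 80  # "around 80 CPUs" per host
PSTATES = 16        # "around 16 P-levels" per CPU

# One instance per P-level per CPU per host.
per_pstate = instance_count(HOSTS, CPUS_PER_HOST, PSTATES)

# One instance per CPU, with the P-levels carried as datapoints instead.
per_cpu = instance_count(HOSTS, CPUS_PER_HOST)

print(per_pstate)  # 30720
print(per_cpu)     # 1920
```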

  • Are you referring to P-states on the CPU, which control its running frequency? Would that need to be a separate datapoint for each state? I would think it would just be one PState datapoint that would store 0, 1, 2, etc. For 30 servers with 80 cores, that would be 2,400 instances. Or, if you create a device per host, it would be only 80 instances per device.

    Otherwise, it does sound like you'll have to break it up into multiple devices, unless you can accept some sort of summary collection. Even if you got this working, you might run into the metrics-per-device limit (see https://www.logicmonitor.com/support/logicmonitor-limits-quotas-and-constraints), which I haven't hit myself, but this sounds like it would.

    You said you need to collect data, so I assume that setting up the Datasource to create instances only during an alert event is not enough. Otherwise, you could have the AD script create the instance only when it's important, but that also means you can only do that at 15-minute intervals, and it's not really useful for graphing data.

    If you do need to collapse into multiple datapoints, you can just create as many as the maximum you need. So if there are 16 at most, create the 16 datapoints, and if a CPU doesn't have that many, just skip the ones that don't apply. They will report NaN and won't even show in a graph.

    • wfaulk

      I responded to these questions as a reply to the main question, which I guess is not the correct culture here. Sorry.

  • What is the outcome you aim to achieve? What do you want to do with the data?

    Additionally, for most customers, their production ESXi servers are also distinct LogicMonitor devices. On those, we already have a Logical Processors module with an instance for each processor, so maybe we could add another datapoint to that if you could explain what you aim to do with the data. Thanks.

  • I just discovered that I can't even prematurely aggregate the data because it's a counter; I need the previous value to make any sense of the new value.

  • The data is provided as a list of CPUs, each with its own list of P-states, each of which has a numeric value. Part of the reason that I wanted to collect all of that data is because it's not documented what the numeric values are and I wanted to see what I could determine from collecting them for a while and seeing what they showed.

    I wasn't able to get the data into LogicMonitor, so I ended up collecting the data manually and figuring out what kind of data it was outside of LM. It turns out, as per my previous update, that each value is an ever-increasing counter, something along the lines of total time spent in each P-state or total number of operations run in that P-state.

    (For what it's worth, the reason I'm collecting this data is that I have a recurrent hardware problem that is preventing processors from running at full speed, and this is the only statistic I could find that was reporting the problem accurately.)

    Since the data is presented as counters, in order for the numbers to be meaningful, they have to be compared with previous numbers. Imagine a situation where a particular CPU has been running at full speed (P-state 0) for a month. Let's say that the number is seconds spent in that P-state (it's probably not that, but it's something like that), and the CPU has four P-states. So you'd like the data for that CPU to be {2592000,0,0,0}. But when it's acting normally, it still spends some small amount of time in the other states, so it would actually look something like {2525087,21254,19083,26576}. Then something happens and it starts running almost entirely in P-state 2 (the third value). After 6 hours, the data now looks like {2525087,21254,38695,28564}. P0 and P1 have not increased at all, but that fact is not at all obvious when looking at the raw data without comparing it to previous data.

    I have tried aggregating the data into two datapoints per CPU: totalCount = sum(plevel_counter) and weightedCount = sum((plevel+100)*plevel_counter). I can then construct a complex datapoint that is ((weightedCount/totalCount)-100). (The "100" is because the smaller the multiplier is, the more it can get overwhelmed by other counts. The extreme of this is when the multiplier would be 0, but even just using multipliers 1-17, the 17 is way bigger than 1, while 117 is not all that much bigger than 100.) This is a marginally okay approximation, but it sure would be better if LogicMonitor had real data. (I could get 100% accurate data this way by using "*(max(plevel_counter)+1)^plevel" instead of "+100", but those numbers feel like they'd be far too large for, well, anything. The real values can be at least 14 digits long.)

    It does turn out that there is a maximum number of P-states, 16, so I could just implement it that way, I think. I think there's probably a way to deal with that in a complex datapoint.

    I can also talk with our LM admin again to see what he thinks about directly monitoring each ESXi host instead of basing everything on the vCenter server.

    (Also sorry about the slow response. I was unknowingly looking directly at my other response, which doesn't show other responses to the main post.)
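
The counter handling discussed in this thread can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a LogicMonitor module: the function names are hypothetical, a counter that goes backwards is assumed to mean a reset, and the snapshot values are the ones from the four-P-state example above. It shows the per-interval deltas, the count-weighted average P-state (with and without the "+100" offset, which cancels out in exact arithmetic), and padding a shorter list out to the 16-slot maximum so that unused slots can be left unreported.

```python
from typing import List, Optional

MAX_PSTATES = 16  # maximum number of P-states mentioned in the thread


def counter_deltas(prev: List[int], curr: List[int]) -> List[Optional[int]]:
    """Growth of each P-state counter between two polls.

    Assumption: a counter that went backwards was reset, so it yields None.
    """
    deltas: List[Optional[int]] = []
    for before, after in zip(prev, curr):
        diff = after - before
        deltas.append(diff if diff >= 0 else None)
    return deltas


def mean_pstate(counts: List[Optional[int]], offset: int = 0) -> Optional[float]:
    """Count-weighted average P-state index: sum((p+offset)*c)/sum(c) - offset."""
    pairs = [(p, c) for p, c in enumerate(counts) if c is not None]
    total = sum(c for _, c in pairs)
    if total == 0:
        return None  # nothing counted in this interval
    weighted = sum((p + offset) * c for p, c in pairs)
    return weighted / total - offset


def pad_slots(counts: List[Optional[int]], slots: int = MAX_PSTATES) -> List[Optional[int]]:
    """Pad a per-CPU list to a fixed width; None marks a slot to skip."""
    if len(counts) > slots:
        raise ValueError("more P-states than available slots")
    return list(counts) + [None] * (slots - len(counts))


# Snapshots from the example in the thread (four P-states, six hours apart).
prev = [2525087, 21254, 19083, 26576]
curr = [2525087, 21254, 38695, 28564]

deltas = counter_deltas(prev, curr)
print(deltas)                           # [0, 0, 19612, 1988]
print(mean_pstate(deltas))              # average P-state over the interval
print(mean_pstate(deltas, offset=100))  # same value up to rounding
print(pad_slots(deltas))                # four values followed by twelve None slots
```

Computed on the deltas rather than the lifetime totals, the average reflects only the most recent interval, which is the comparison with previous values that the raw counters need.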