Forum Discussion

Jesse_Shumaker, 3 years ago

More or Less Collectors Moving to Azure

We are looking to move our on-prem bare-metal collectors in our datacenters up to Azure Ubuntu VMs, and we've come up with the estimates below. Our largest ongoing problem is dealing with "No Data" issues, where we have gaps in data collection on collectors, along with "All NaN" data collection outputs. The standard tagline from LM support is to place collectors as close to the resource as possible. These collectors will be monitoring the resources in their region over MPLS connections, so our latency will be around 50-100 ms. What recommendations can you give on the direction we should take? (A rough sketch of the sizing arithmetic is included after the tables below.)

Small Collectors Needed Per Region - 141 Collectors Needed

Central US - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Small Collector 7k  = 39
Germany West Central - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Small Collector 7k  = 15
North Europe - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Small Collector 7k  = 15
South India - 722 Devices * Average Instance size 325 / Small Collector 7k = 33
West US2 - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Small Collector 7k  = 39

Medium Collectors Needed Per Region - 97 Collectors Needed

Central US - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Medium Collector 10k  = 27
Germany West Central - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Medium Collector 10k  = 10
North Europe - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Medium Collector 10k = 10
South India - 722 Devices * Average Instance size 325 / Medium Collector 10k = 23
West US2 - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Medium Collector 10k  = 27

Large Collectors Needed Per Region - 68 Collectors Needed

Central US - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Large Collector 14k  = 19
Germany West Central - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Large Collector 14k  = 7
North Europe - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Large Collector 14k = 7
South India - 722 Devices * Average Instance size 325 / Large Collector 14k = 16
West US2 - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Large Collector 14k  = 19

Extra Large Collectors Needed Per Region - 47 Collectors Needed

Central US - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Extra Large Collector 20k = 13
Germany West Central - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Extra Large Collector 20k = 5
North Europe - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Extra Large Collector 20k = 5
South India - 722 Devices * Average Instance size 325 / Extra Large Collector 20k = 11
West US2 - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Extra Large Collector 20k = 13

Double Extra Large Collectors Needed Per Region - 32 Collectors Needed

Central US - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Double Extra Large Collector 28k = 9
Germany West Central - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Double Extra Large Collector 28k = 3
North Europe - EMEA Devices 670 / 2 = 335 * Average Instance size 325 / Double Extra Large Collector 28k = 3
South India - 722 Devices * Average Instance size 325 / Double Extra Large Collector 28k = 8
West US2 - North/South America Devices 1692 / 2 = 846 Devices * Average Instance size 325 / Double Extra Large Collector 28k = 9
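
For reference, here is a rough Python sketch of the arithmetic behind the tables above. The 325 average instances per device and the per-size capacities (7k/10k/14k/20k/28k) come straight from the tables; the rounding (floor, to match the tables) and the loop are my own, and rounding up would be the safer choice for real capacity planning:

    # Reproduce the per-region collector estimates above:
    # collectors ~= (devices * avg instances per device) / collector instance capacity

    AVG_INSTANCES_PER_DEVICE = 325

    # Devices per region: NA/SA devices split across Central US and West US2,
    # EMEA devices split across Germany West Central and North Europe.
    regions = {
        "Central US": 1692 // 2,
        "Germany West Central": 670 // 2,
        "North Europe": 670 // 2,
        "South India": 722,
        "West US2": 1692 // 2,
    }

    # Assumed instance capacity per collector size, taken from the tables above.
    capacities = {"Small": 7000, "Medium": 10000, "Large": 14000, "XL": 20000, "2XL": 28000}

    for size, cap in capacities.items():
        # Floor division matches the tables; rounding up (math.ceil) would avoid
        # running the last collector in each region over capacity.
        per_region = {r: d * AVG_INSTANCES_PER_DEVICE // cap for r, d in regions.items()}
        print(f"{size}: total={sum(per_region.values())} {per_region}")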

3 Replies

  • I guess I'm not clear: is your existing gap issue from monitoring remote sites from a more central location, or are the collectors overloaded because there are not enough of them? Or both? Also, can you just run the collector servers directly in Azure? That would also lower your bandwidth use over the MPLS.

    You might be able to push each collector to hold more instances than you list. I can't check anymore, but I think I've pushed 30k+ instances per large Windows collector (before redundancy) in some situations. I personally tend to go with larger collectors rather than tons of small ones, but you still need to take redundancy for collector failures into account (a rough sizing sketch follows this reply).

    But the good thing is that you can add and remove collectors on the fly without downtime. You can start with fewer and add more if you see the load is too much, or add too many and remove some if they are too idle. Although that might not help when working out budgets.
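
    To make the redundancy point concrete, here is a small sketch (my own math, not an official LM sizing formula) that rounds up and then adds spare capacity so the group can absorb a collector failure. The device count, 325 instances/device, and 14k large-collector capacity are taken from the original post; the number of spares is an assumption:

        import math

        def collectors_needed(devices, avg_instances, capacity, spares=1):
            # Collectors to carry the load, plus `spares` extra so a failure
            # doesn't push the remaining collectors over capacity.
            return math.ceil(devices * avg_instances / capacity) + spares

        # Hypothetical example: South India on large (14k-instance) collectors
        print(collectors_needed(722, 325, 14000))   # 17 to carry the load + 1 spare = 18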

  • In reply to Mike Moniz's post above:

    In our environment we don't have bandwidth or latency issues, all of our collectors are 2XL, and they average around 8k instances each, so the collectors aren't overloaded. The pics below are from a device that has 5 ms latency to the collector, and it's showing gaps from 2 different datasources at the same time. My questions are the following:

    1. Are gaps in polling expected behavior in LM? Is this something we should just get used to? Does your environment constantly have them?

    2. If they can be fixed, which deployment model above is recommended for moving into Azure so we don't see them any longer?

  • Gaps are not normal; that is a sign that you have either collector issues or network issues. If I read those graphs correctly, those are gaps of 12+ hours, which is huge.

    I would first look for patterns in the gaps. Are the gaps always at the same time? Do they occur only during the day, or also at night? Is there any difference between weekdays and weekends? Do they occur across multiple devices at the same time? Do devices on different collectors show the same gap at the same time? Do the gaps show up in basic checks like ping, or only in more complex checks? Do the gaps only show up for particular types of devices (network vs. servers)? Do they only occur for particular network segments?
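
    If you have the gap timestamps exported somewhere (for example a CSV pulled from a report or the API), a quick script can surface those patterns. This is only a sketch; the file name and column names (collector, gap_start) are assumptions about your export, not anything LM-specific:

        import csv
        from collections import Counter
        from datetime import datetime

        by_hour, by_weekday, by_collector = Counter(), Counter(), Counter()

        with open("nodata_gaps.csv", newline="") as f:
            for row in csv.DictReader(f):
                start = datetime.fromisoformat(row["gap_start"])
                by_hour[start.hour] += 1                  # time-of-day pattern?
                by_weekday[start.strftime("%A")] += 1     # weekday vs weekend?
                by_collector[row["collector"]] += 1       # concentrated on certain collectors?

        print("Gaps by hour of day:", dict(sorted(by_hour.items())))
        print("Gaps by weekday:", dict(by_weekday))
        print("Gaps by collector:", by_collector.most_common(10))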

    If you can catch the issue when it's actively showing No Data, try doing a Poll Now and see if you are getting error messages, since Poll Now can provide extra details. Run the collector debug console and check !tlist with !tdetail to see what the last results of those checks have been. Log into the collector server directly and try running equivalent commands. If WMI checks fail in LM, try using Get-WmiObject in PowerShell. If SNMP fails, try the !snmpdiagnose debug command or a third-party SNMP tool.
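
    You can also script that out-of-band check from the collector VM itself. The sketch below just shells out to the OS ping and net-snmp's snmpget (assuming both are installed on the Ubuntu collector); the device IP and community string are placeholders:

        import subprocess

        def cross_check(host, community="public"):
            # Basic reachability from the collector host
            ping = subprocess.run(["ping", "-c", "3", "-W", "2", host],
                                  capture_output=True, text=True)
            print(f"{host} ping: {'ok' if ping.returncode == 0 else 'FAILED'}")

            # Quick SNMP sanity check against sysUpTime.0
            snmp = subprocess.run(["snmpget", "-v2c", "-c", community, "-t", "2", "-r", "1",
                                   host, "1.3.6.1.2.1.1.3.0"],
                                  capture_output=True, text=True)
            print(f"{host} snmp: {snmp.stdout.strip() or snmp.stderr.strip()}")

        cross_check("10.0.0.10")   # hypothetical device IP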

    Also look at the collector log files for the time of the gap. They should show lots of information about what the collector attempted and whether it got any errors.
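
    One quick way to do that across all of the log files at once is to scan the collector's logs directory for warnings and errors around the gap window. Treat the path below as an assumption about a default Linux install, and adjust the timestamp prefix to match your logs' format and the gap you're chasing:

        from pathlib import Path

        GAP_PREFIX = "2022-06-14 03:"                           # hypothetical gap window; match your logs' timestamp format
        LOG_DIR = Path("/usr/local/logicmonitor/agent/logs")    # assumed default Linux collector install path

        for log_file in LOG_DIR.glob("*.log*"):
            with open(log_file, errors="replace") as f:
                for line in f:
                    if GAP_PREFIX in line and any(lvl in line for lvl in ("WARN", "ERROR", "SEVERE")):
                        print(f"{log_file.name}: {line.rstrip()}")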

    Good luck!