Forum Discussion

Kelemvor
Expert
27 days ago

How do you organize your resources? Folder structure...

Hi,

We are looking into our LM system and trying to figure out if we are using it in the most efficient way.  I'm wondering if others might share how you lay out your devices in folders to make it easier to handle alerting, assign tickets to the right departments, and things like that.

We have top-level folders like: Data Center Hardware & Network Hardware which send tickets to two different teams.

We then have folders broken up by Tier like:

  • Prod Database Servers
  • Prod Kubernetes Servers
  • Prod AIX Servers
  • Prod Servers

We then have UAT versions of the above and Non-Prod versions of the above.  They all have various subfolders underneath them based on product or client.  All the alerts for any of the Database Server groups go to our DBA team, the Kubernetes alerts go to a different team, etc.  That's all fine.

Under the more general "Prod Servers" group, most alerts go to one team, but there are some specialized ones that go to other teams.  We handle this by having a bunch of Alert Rules to pull out those custom ones before they get to the main Rule that sends everything to the main team.

We also send alerts to different teams based on the severity of the alert.  Anything that is an Error alert generally goes to the team that will resolve the issue.  However, all Critical alerts go to a 24/7 team specifically so they can alert someone and then send them the ticket.  This is done by having even more alert rules to handle the alerts based on the severity.
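The rule ordering described above can be sketched as a first-match lookup. This is only a rough illustration, assuming rules are checked in ascending priority order and the first match wins; the priority numbers, the "Prod Servers/Special App" group, and the team names are made up for the example and are not taken from any real portal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    priority: int   # lower numbers are checked first (assumed ordering)
    severity: str   # "Critical", "Error", or "*" for any severity
    group: str      # resource group path, or "*" for any group
    team: str       # hypothetical destination team


def in_group(path: str, group: str) -> bool:
    """True if the resource path is the group itself or sits below it."""
    return group == "*" or path == group or path.startswith(group + "/")


def route(rules, severity: str, path: str) -> Optional[str]:
    """Return the team of the first matching rule, or None if nothing matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.severity in ("*", severity) and in_group(path, rule.group):
            return rule.team
    return None


# Illustrative rules only: specialized rules sit ahead of the main catch-all.
RULES = [
    AlertRule(10, "Critical", "*", "24/7 Team"),
    AlertRule(20, "Error", "Prod Servers/Special App", "Specialist Team"),
    AlertRule(30, "Error", "Prod Database Servers", "DBA Team"),
    AlertRule(90, "Error", "Prod Servers", "Main Server Team"),
]

print(route(RULES, "Error", "Prod Servers/Special App/host01"))
print(route(RULES, "Error", "Prod Servers/Client B/host02"))
```

Every special case costs one more rule ahead of the catch-all, which is why this layout tends to grow a long rule list over time.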

The ticket routing is done by having a ton of Integrations with the ticketing system that are all hard-coded to send tickets to each individual team.

Anyway, just curious how everyone else organizes things, how you handle alerts, how you route tickets, etc.  Our system was set up many years ago, and I'm sure there are improvements we can make, so just looking for other thoughts.

Thanks.

2 Replies

  • We're a service provider so we have our structure split out first by the customer support type (UK only support, or globally supported).  We assign access at the "Global Customers" and "UK Only Customers" groups, so support staff in the correct region see only the customer devices they should.  We also set a property for the Service Now assignment group here.

    There are customer sub-folders under each of these region folders.  On each customer folder, we apply a property that is sent to our Service Now integration which aligns with the name of the customer as configured in Service Now.  Very occasionally, we might give a customer read access to their respective folders.

    We often also break down into device type under each customer, really just for setting device properties that wouldn't make sense being inherited by everything.  So we would set wmi.user/wmi.pass on the "Windows Server" folder, snmp details on the "Network Devices" folder, etc.

    For example:

    - Global Customers
        - Customer A
              - Collectors
              - Windows Servers
              - Linux Servers
              - Network Devices
              - Virtualisation
        - Customer B
    - UK Only Customers
        - Customer C
        - Customer D


    Alert routing wise, it's all going into Service Now, usually through a single integration, with the assignment group taken from the inherited property set at "Global Customers" and "UK Only Customers".  We've fairly heavily modified the payloads being sent to Service Now and made changes to the scripts that run on the Service Now side so the behaviour better aligns with our processes.

    There are a couple of exceptions where we route to Service Now but via a separate copy of the integration:

    • Alerts related to SQL Server specific datasources - the assignmentgroup is hard-coded to our DBA Team rather than the inherited assignment group.  So the "normal" assignment group would get the alerts related to the OS, but the DBA team get the database specific alerts.
    • Alerts related to Azure - Has slightly different tokens used in short_description and description (because we like to pass Azure Sub and Resource Group there - which would be blank for non Azure stuff).  It also has the assignmentgroup hard-coded to our public cloud team.
    • A handful of customer specific alert rules where we want to route some things both to our Service Now and to a customer email address.
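The first two exceptions above amount to "use a hard-coded assignment group for certain datasources, otherwise use the inherited one". A minimal sketch of that decision follows, assuming datasources can be recognized by a name prefix; the prefixes, the team labels, and the function name are illustrative guesses, not the reply author's actual configuration.

```python
# Hypothetical (datasource name prefix, hard-coded assignment group) pairs.
OVERRIDES = [
    ("Microsoft_SQLServer_", "DBA Team"),
    ("Microsoft_Azure_", "Public Cloud Team"),
]


def assignment_group(datasource_name: str, inherited_group: str) -> str:
    """Pick the hard-coded group for special datasources, else the inherited one."""
    for prefix, group in OVERRIDES:
        if datasource_name.startswith(prefix):
            return group
    return inherited_group


# OS-level alert on a server keeps the group inherited from the region folder.
print(assignment_group("WinCPU", "Global Support"))
# Database-specific alert on the same server goes to the DBA team instead.
print(assignment_group("Microsoft_SQLServer_Databases", "Global Support"))
```

In the setup described, this choice is made by alert rules that select a separate copy of the integration rather than by code, so the function only mirrors the outcome.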

    We do also have an "Internal" folder structure, further broken down by the internal teams for whom we monitor devices/services.  Those send to Service Now as well, inheriting the assignmentgroup from the sub-folders.

  • Overall we break up our resources similar to what you described.  One big difference is we generally only monitor Production-things with LogicMonitor.  We are just getting to the maturity level where we now think about application-level observability and coordinate with teams as they promote things from dev to stage to prod, so we do have a few non-production things being monitored for our "big stuff."

    I'll describe what we do, but I am going to split it into two areas:  Resource Grouping and Alert Routing

    Resource Grouping
    The following is a generalization of our Resource Tree within LM:

    - Devices by Team
    -- Team A
    -- Team B
    -- Application 1
    -- Application 2
    -- etc
    
    - Devices by Type
    -- Type 1 (*)
    -- Type 2 (*)
    -- Type 3
    -- etc
    
    - Cloud Things
    
    - Infrastructure
    -- Region 1
    --- Location 1
    ---- Network Devices
    ---- Compute Devices
    --- Location 2
    ---- Network Devices
    -- Region 2
    -- Non-Production
    
    - Line of Business A
    -- AMER
    --- Region 1
    ---- Location A
    ---- Location B
    --- Region 2
    ---- Location C
    -- EMEA
    --- Region 3
    --- Region 4
    -- APAC
    --- Region 5
    
    - Sandbox & Setup

    We have some of the Devices by * folders that LM suggests for best practices and to work with their default dashboards. There are also a couple of Sandbox/Setup folders that catch new VMs or Network devices that appear in the portal and we haven't classified yet.

    I'm assuming you know this, and just repeating for anyone newer to LM reading:  Devices can (and should) be in lots of different Resource Groups. I love this feature of LM and how it can be used to control properties, routing of alerts, etc.

    Our other top-level resource grouping generally matches some organizational setup (aka VPs and their teams...but not as specific as Devices by Team that might drill down to a couple of people and their manager), mixed with a bit of geography. These are what I think of as our business folders because they mirror various in-real-life people-organizations of our business.

    For example, in our Infrastructure group, everything related to our back offices or warehouses is grouped together in one offshoot of folders. Because our Infrastructure teams (like network/systems) are Tier 2 support for most applications, we have a top-level resource group just for them, organized however they want.  Like I mentioned, some of this is organization-based and some is geography-based.

    Other top-level folders are based around certain lines of business.  For example, retail stores.  There's a breakdown of folders based on global-geography, regional-geography, location, and the devices within those locations.

    Again - I'll mention the benefit of devices existing in more than one resource group because we have lots of tagging that happens at various levels in these folders.  We have some properties being set in our Devices by Team and Devices by Type folders, but often these might be overridden by the same property being set in our Infrastructure or Line of Business resource group tree. 

    Because our business folders generally end up having a deeper tree than any of the Devices by * folders, we rarely have property conflicts where we expected a property to be set and it is not. For example, all network devices get a servicenow.group property saying they belong to our Network Team (Tier 2 support). This is set on a folder within the Devices by Type resource group.  But one of those devices might be specific to a particular retail store, so it shows up again six levels deep in one of our line of business folders, and there the servicenow.group property now says it belongs to our Retail Support Region 3 Team (Tier 3).
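The override behaviour described here can be modelled roughly as "the deepest group that sets the property wins". The sketch below assumes exactly that rule, as implied by the post; the group paths, the property values, and the `resolve_property` helper are hypothetical and are not LogicMonitor API objects.

```python
# Hypothetical model: properties set directly on a group, keyed by group path.
GROUP_PROPS = {
    "Devices by Type/Network Devices": {"servicenow.group": "Network Team"},
    "Line of Business A/EMEA/Region 3": {
        "servicenow.group": "Retail Support Region 3 Team"
    },
}


def resolve_property(name, memberships, group_props=GROUP_PROPS):
    """Return the value of `name` from the deepest group that sets it, or None."""
    best_depth, best_value = 0, None
    for leaf in memberships:
        parts = leaf.split("/")
        # Walk from the leaf group up towards the top-level group.
        for depth in range(len(parts), 0, -1):
            props = group_props.get("/".join(parts[:depth]), {})
            if name in props:
                # Assumption: a deeper definition beats a shallower one;
                # on a tie the first one seen is kept.
                if depth > best_depth:
                    best_depth, best_value = depth, props[name]
                break  # nearest definition on this branch; stop climbing
    return best_value


# A switch that lives in a retail store is a member of both trees.
store_switch = [
    "Devices by Type/Network Devices",
    "Line of Business A/EMEA/Region 3/Location X/Store 12/Network Devices",
]
# A core switch only appears under Devices by Type.
core_switch = ["Devices by Type/Network Devices"]

print(resolve_property("servicenow.group", store_switch))
print(resolve_property("servicenow.group", core_switch))
```

How the product itself breaks ties between groups at equal depth is not covered by the post, so the tie handling here is arbitrary.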

    Another benefit with these business folders is using RBAC for the various teams that just want to see their stuff.  There is a lot of granularity within the Role configuration for your Resource Tree.

    Alert Routing
    We are using ServiceNow to catch alerts coming out of LogicMonitor, and try and keep that path as simple as possible. We have one main Integration defined with ServiceNow, and one main Escalation Chain to get alerts to ServiceNow that covers 99% of our alerts.  

    As far as routing to different teams, we use that servicenow.group property I mentioned on all of our devices to decide which Assignment Group in ServiceNow the Incident should go to when created.  We modified the JSON being sent for events between LM and SNow to include some extra data in values that aren't officially documented by LM; we figured out they were available by looking at the SNow code/trigger receiving these alerts.  Also important to note is that the servicenow.group property matches up with our Tier 2 & Tier 3 support teams, because we try to have any automated alert skip over our Tier 1 Service Desk team for assignment.

    This also means our Alert Rules are fairly simple: we have a catch-all to ignore Warnings, and a catch-all to route all Criticals to ServiceNow. Most Rules deal with which Errors we care about and whether they should get ignored or sent on to ServiceNow. Thus, most of our rules have criteria based around our resource group structure, e.g. "Warnings for Palo Alto devices should go to SNow."  One important thing with our Alert Rules is that we've grouped the Priority numbers based on teams (picture the Dewey Decimal system).  So rules with Priority in the 400s might be a Line of Business, rules with Priority in the 500s are Networking, etc.  Anything under 100 is a super-special override that might have its "why" documented somewhere.

    ----

    Overall, I would say our setup philosophy is this:  if I need to accommodate a special case, business rule, etc., then I want that weird thing to be represented in an obvious place and be well documented.

    For example, we might make a special Resource Group folder under a Line of Business with a subset of devices within to get certain things to route properly.  Or, we create a very special Alert Rule with Priority < 100.

    I feel lucky that I can treat LM only as an O11y and Alerting tool, and we have ServiceNow to handle the response processes, and also that we are in a phase of standardizing incident response across the IT org. 😀 I want LM to watch what should be normal, and raise an alarm to a human when things are bad.