Incident management - thresholds, alerts, and actions

As discussed in View custom robot data, in addition to the default robot data sources, you might have created other sources of data from your robots that you want to see on InOrbit Mission Control.

On each data source you can set thresholds that trigger a warning or an error when exceeded, choose the channels that alert you about those warnings or errors, and define actions to take when a warning or error is triggered. An exceeded threshold is called an incident and is recorded in the incidents total shown on Mission Control.

Example. For a robot’s CPU usage, you might want to set these values:

Incident type | Threshold                            | Alert channel | Action
Warning       | 85% CPU usage for 60 seconds or more | In app        | Run script on robot
Error         | 95% CPU usage for 30 seconds or more | Slack         | Restart InOrbit agent

About on-robot shell scripts

The on-robot shell scripts that you can run after an incident are Linux scripts that you write yourself and that are completely under your control.

Continuing the example above about a warning of exceeded CPU usage, you might write a script that lists the top 10 running processes on the robot and saves the list to a file. Such a script could look like the following.

#!/usr/bin/bash
# where to save the script output
SAVEFILE="/tmp/top10cpu.txt"
# list top 10 processes' percentage of CPU, PID, username,
# and running program with arguments, sorted by percentage of CPU
# (highest first), and saved to a file
ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -n 10 \
> "$SAVEFILE"

Setting up thresholds, alerts, and actions

These are the steps to define thresholds, alert channels, and actions.

What you need

Steps

Automated incident responses have two general parts: defining thresholds, and defining alert channels and actions.

Define thresholds

  1. As administrator, log in to Mission Control.

  2. In the upper right, click Settings.

  3. Click the Robot Data tab.

  4. Scroll to the robot data source whose thresholds you want to define.

    Not all data sources have the same definable fields, which depend on the type of data source.

  5. From the function pulldown menu, select the desired function. The function depends on the data source.

    For example, CPU and disk usage use the max-time function.

  6. For the error at field, enter the error threshold.

    For example, for CPU usage, enter a percentage.

  7. For the warning at field, enter the warning threshold.

    For example, for CPU usage, enter a percentage.

  8. For the for sec. field, enter the length of time that the warning or error condition must last before the threshold is exceeded.

  9. For status label, keep the default label or enter a different label for this data source.
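
To make the interplay of the error at, warning at, and for sec. fields concrete, here is a minimal sketch in bash. It is an illustration only, not InOrbit's implementation: the function name sustained_status is hypothetical, samples are assumed to be whole numbers taken one second apart, and a value equal to the threshold is assumed to count toward the condition.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of a sustained-threshold check.
# Assumptions (not from InOrbit): integer samples, one per second,
# and "at or above" the threshold counts toward the condition.
# Usage: sustained_status WARNING_AT ERROR_AT FOR_SEC SAMPLE...
sustained_status() {
  local warning_at=$1 error_at=$2 for_sec=$3
  shift 3
  local warn_run=0 err_run=0 status="OK" value
  for value in "$@"; do
    # Count consecutive seconds at or above each threshold;
    # a sample below a threshold resets that threshold's counter.
    if (( value >= error_at )); then
      err_run=$(( err_run + 1 ))
    else
      err_run=0
    fi
    if (( value >= warning_at )); then
      warn_run=$(( warn_run + 1 ))
    else
      warn_run=0
    fi
    # The status reflects the state after the most recent sample.
    if (( err_run >= for_sec )); then
      status="ERROR"
    elif (( warn_run >= for_sec )); then
      status="WARNING"
    else
      status="OK"
    fi
  done
  echo "$status"
}
```

For example, sustained_status 85 95 3 50 90 90 90 prints WARNING, because the last three samples are all at or above 85 but none reaches 95. A single dip below the threshold resets the count, so a brief spike shorter than the for sec. value does not produce an incident in this sketch.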

Define alert channels and actions

  1. As administrator, log in to Mission Control.
  2. In the upper right, click Settings.
  3. Click the Robot Data tab.
  4. Scroll to find the Status section.
  5. Scroll to the robot data source whose alert channels and actions you want to define.
  6. For the ERROR block, select values for the following fields:
    • Alert channels
      • In App
      • PagerDuty (optional)
      • Slack (optional)
      • All of these channels
    • When triggered run:
      • Restart agent
      • Run script on robot. Enter the name of the script in ${HOME}/.inorbit/local/user_scripts on the robot.
    • When notified, user can run
      • Restart agent
      • Run script on robot. Enter the name of the script in ${HOME}/.inorbit/local/user_scripts on the robot.
  7. For the WARNING block, repeat the preceding step.
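
The Run script on robot action refers to a script by name in ${HOME}/.inorbit/local/user_scripts, so the script has to be in that directory on the robot first. The following is a minimal sketch of one way to put it there. The helper name install_user_script and the file name top10cpu.sh are hypothetical, and marking the script executable is an assumption about what the agent needs, not something this page states.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a script into the on-robot user scripts
# directory named in the steps above and print its installed path.
# Usage: install_user_script PATH_TO_SCRIPT
install_user_script() {
  local src=$1
  local dest_dir="${HOME}/.inorbit/local/user_scripts"
  local dest
  dest="$dest_dir/$(basename "$src")"
  # Create the directory if it does not exist yet.
  mkdir -p "$dest_dir"
  cp "$src" "$dest"
  # Assumption: the script must be executable to be run.
  chmod +x "$dest"
  echo "$dest"
}
```

Run on the robot as the user whose home directory holds the .inorbit directory, install_user_script ./top10cpu.sh would copy the script into place. You would then enter top10cpu.sh as the script name in the ERROR or WARNING block.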