Bots Rising - Part 6

Our internal bot has been running for a while and getting some use.

We've gotten it so that it runs for 7 - 10 days without issue, but after that length of time it seems to quietly lose the connection and we previously weren't checking for that. I'm not sure if that is a feature of Slack itself or the library we're using. In any case, 7 - 10 days seems pretty good and we should probably add some code to recover if something happens.

To get to 7 - 10 days with a Slack bot, the key was to use the Ping/Pong functionality you can see here. It's part way down the page, just look for 'pong'.

We're currently pinging every 10 minutes. This is a lot less than I've seen recommended, but I believe those recommendations were based on building chat clients with the RTM API, not long-running bots with lots of downtime. I don't want to get in trouble with Slack for sending a million ping messages for no reason.

Instead of upping the number of pings we send (which may have helped) we've added some functionality to check as part of our ping/pong functionality to check for the health of the connection. This seems to be the more responsible way to run the health check and reconnect if the connection is offline.

We've tested our reconnect, which works well. Now we just need to wait 7-10 days to see the production failure occur and see if it recovers.