DevOps Postmortem: When the Ops Is Down


ALX Software Engineering 0x19. Postmortem Task

A simple guide to writing a postmortem

This report covers two servers and one load balancer:

  • server web-00
  • server web-02
  • load balancer lb-00


Sick Server

Issue Summary: From 2:30 PM to 5:30 PM, lb-00 carried a wrong redirection configuration: it pointed to the site moh.tech on web-00 instead of the intended mmohs.tech as configured on web-02. A permanent redirect (HTTP 301) had been configured for moh.tech on web-00, and this affected mmohs.tech. Users were redirected to an entirely different site from the one they intended to visit, either after a timeout or when they refreshed the page. The SRE team rectified the error. Customers who accessed the site between 2:30 PM and 5:30 PM mostly experienced the wrong redirection.
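The symptom described above (a 301 whose target is not the intended site) can be checked mechanically. The following is a minimal sketch, not the team's actual tooling: the function name `redirect_is_wrong` and the constant `EXPECTED_HOST` are hypothetical, and the check assumes an exact host match, so a variant such as a `www.` prefix would also be flagged.

```python
from urllib.parse import urlsplit

# Assumption: mmohs.tech is the intended site, as stated in the summary.
EXPECTED_HOST = "mmohs.tech"

# Status codes that tell the client to go to another URL.
REDIRECT_CODES = (301, 302, 307, 308)


def redirect_is_wrong(status, location, expected_host=EXPECTED_HOST):
    """Return True if the response redirects to a host other than expected_host."""
    if status not in REDIRECT_CODES:
        return False  # not a redirect at all
    host = urlsplit(location).hostname
    if host is None:
        return False  # relative Location header: stays on the same host
    return host.lower() != expected_host
```

Fed with the status code and `Location` header of a response from lb-00, a check like this would have reported `redirect_is_wrong(301, "http://moh.tech/")` as `True`.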

Timeline (specific times):

  • 2:30 PM (Configuration begins)
  • 3:05 PM (Wrong redirection experienced)
  • 3:05 PM (Alert sent to teams)
  • 3:20 PM (Server down)
  • 3:20 PM (Personnel arrive to assess the situation)
  • 4:00 PM (Reconfiguration of lb-00)
  • 4:30 PM (Server restart begins)
  • 5:03 PM (Tested and works fine)
  • 5:30 PM (Usual site traffic resumes; back online)
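The intervals between these timestamps are the numbers a postmortem usually reports. Here is a small sketch of the arithmetic, assuming all events happened on the same day; the helper `minutes_between` and the variable names are illustrative only.

```python
from datetime import datetime

TIME_FORMAT = "%I:%M %p"  # 12-hour clock, e.g. "2:30 PM"


def minutes_between(start, end):
    """Whole minutes from start to end, both on the same day."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


# Configuration begins -> wrong redirection first experienced
time_to_impact = minutes_between("2:30 PM", "3:05 PM")    # 35 minutes
# Alert sent -> fix tested and working
alert_to_fix = minutes_between("3:05 PM", "5:03 PM")      # 118 minutes
# Full incident window from the issue summary
incident_window = minutes_between("2:30 PM", "5:30 PM")   # 180 minutes
```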


Errors are errors if the same error is repeated twice.

Root Cause: The server web-00 was taken down for an hour to reconfigure the new load balancer lb-00 so that it would distribute the traffic received by that server to a second server, web-02. During this work the domain name being pointed to was misspelled, which produced a redirection error for some customers, as gathered by the system administration personnel. Because the system was automated with Puppet, it was fairly quick to debug the error from the general Puppet code used to point to a server.
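A misspelled domain of this kind can be caught before a configuration is applied. The sketch below is only an illustration in Python, not the Puppet code the team used: `KNOWN_DOMAINS` and `check_domain` are hypothetical names, and the allow-list contents and the 0.8 similarity cutoff are assumptions. It uses the standard library's `difflib.get_close_matches` to suggest the nearest known domain.

```python
import difflib

# Assumption: a hand-maintained allow-list of domains the load balancer may point to.
KNOWN_DOMAINS = ["mmohs.tech"]


def check_domain(configured, known=KNOWN_DOMAINS):
    """Return (ok, suggestion) for a domain found in a configuration.

    ok is True when the domain is on the allow-list. Otherwise suggestion is
    the closest known domain if one is similar enough, or None.
    """
    if configured in known:
        return True, None
    close = difflib.get_close_matches(configured, known, n=1, cutoff=0.8)
    return False, (close[0] if close else None)
```

Run against the incident's values, `check_domain("moh.tech")` reports the domain as not allowed and suggests `mmohs.tech`, which is the kind of hint that shortens debugging.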

Corrective and preventative measures

  • Internal reviews were conducted to analyze and study the cause of the errors.
  • Use AI tools such as machine learning to raise awareness in time when the servers are about to encounter such faults.
  • Automate all tasks; automation already helped to debug this error within a limited time frame.