IT systems issues: strategies to minimize data center downtime - Cavern Technologies

Software and IT systems issues are the third most common cause of data center downtime, according to a recent Uptime Institute report [1]. Today we continue our blog series on minimizing the potential for data center downtime by exploring ways to mitigate the IT system errors that lead to outages.

The three most common causes of data center downtime [1]

  1. Uninterruptible Power Supply (UPS) Battery Failure (we cover this topic in this post)
  2. Network Failure Due to DDoS Attacks (we take an in-depth look here)
  3. Software & IT Systems Issues (we break out software errors in this post)

IT system downtime can be prevented

When asked about their last major downtime incident, 60% of respondents to an April 2019 Uptime Institute survey [2] believed that the issue could have been prevented with better management, tighter processes or improved configuration. For large-scale, mission-critical systems, that implies your IT team needs to rigorously analyze your entire IT infrastructure. Your objective is to plan for potential failures, then develop business processes and an array of safeguards to avoid a mishap.

To get you started, we’ve created a list of strategies to help you identify areas of vulnerability in your data center and reduce frequent mistakes that lead to data center downtime from IT system failure.

Strategies to prevent IT system failure

  1. As your network becomes more complex, developing and maintaining complete, appropriate procedures is essential to achieving performance and service availability. These should include best practices for initial hardware and application setup, a pilot network for fully testing and debugging new features, and staged deployments with a rollback plan ready. To succeed, you’ll need full buy-in from your team so that everyone shows the discipline it takes to adhere to the procedures once they’re in place.
  2. Create an equipment inventory and review it annually. Document the business functions each piece of equipment supports, its intended use, available capacity and life expectancy. Be sure to document changes.
  3. Replace equipment approaching its performance limits. (Keep in mind that if a system or operating system is older than five or six years, it is most likely no longer supported by the manufacturer and is a potential vulnerability.)
  4. Aggressively cull temperamental, bottleneck or underutilized servers that slow response time or create a drag on your data center environment.
  5. Make sure you have no single point of failure in your hardware and device configurations.
  6. Assess IT infrastructure lifecycles to make sure they’re in sync with the pace of innovation. Plan and manage upgrades to combat asymmetric criticality, where the infrastructure and processes have not been upgraded or updated to reflect the growing draw of the applications or business processes they support. Consider upgrading or right-sizing servers to accommodate increased data requirements and workloads like virtualization, data analytics, artificial intelligence and storage.
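  7. If you’re not already doing so, use Dynamic Host Configuration Protocol (DHCP) servers to assign IP addresses and network configurations to the devices in your network. Centralized address management reduces the manual configuration errors and address conflicts that undermine availability.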
  8. Integrate cloud applications to provide a more distributed approach to resiliency, including replicating data across zones.
  9. Outsource or staff for the skills needed to ensure your enterprise has a dedicated workforce to monitor and manage updates. Define ownership of specific tasks. Prioritize ongoing training and assessments for your internal IT teams to help them stay on top of the latest updates, device configurations and security challenges.
  10. Perform preventative, proactive maintenance as scheduled. This includes tracking service and support intervals for hardware and software, including operating systems.
  11. Make sure your emergency power-off buttons are labeled and shielded.
  12. Clean your hardware regularly. Check the functionality of intakes, exhausts and fans. Assess your data center’s floor temperature and airflow and verify that all fans are exhausting into the hot aisle. (If they’re not, your IT equipment is drawing in warm air, which shortens its expected useful life.) Confirm cables and cords are secure. Examine cables for bends and kinks and make sure they’re not too tightly packed.
  13. Annually review your enterprise’s regulatory compliance requirements and confirm you are meeting or exceeding them.
  14. Develop a comprehensive disaster recovery plan. Review and update it annually.
  15. Create downtime simulations and have your team rehearse their response.
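
As a rough illustration of how the inventory review in strategies 2 through 5 could be automated, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Equipment` fields, the sample records and the threshold values are hypothetical, and the five-year cutoff simply mirrors the rule of thumb in strategy 3. Treat it as a starting point, not a finished tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Equipment:
    """One inventory record (hypothetical fields for this sketch)."""
    name: str
    business_function: str
    installed: date
    peak_utilization: float  # fraction of capacity used at peak, 0.0 to 1.0
    redundant_peers: int     # other units able to take over this unit's role


# Illustrative thresholds (assumptions); tune them to your own environment.
MAX_AGE_YEARS = 5
HIGH_UTILIZATION = 0.85
LOW_UTILIZATION = 0.10


def review(item: Equipment, today: date) -> list[str]:
    """Return the review flags that apply to one piece of equipment."""
    flags = []
    age_years = (today - item.installed).days / 365.25
    if age_years > MAX_AGE_YEARS:
        flags.append("aging: confirm vendor support")
    if item.peak_utilization >= HIGH_UTILIZATION:
        flags.append("near capacity: plan replacement or upgrade")
    elif item.peak_utilization <= LOW_UTILIZATION:
        flags.append("underutilized: candidate for consolidation")
    if item.redundant_peers == 0:
        flags.append("single point of failure")
    return flags


# Made-up sample inventory, for demonstration only.
inventory = [
    Equipment("db-01", "order database", date(2014, 3, 1), 0.91, 0),
    Equipment("web-01", "storefront", date(2019, 6, 15), 0.40, 2),
    Equipment("batch-02", "legacy reports", date(2018, 1, 10), 0.05, 1),
]

# A fixed review date keeps the example reproducible.
report = {item.name: review(item, date(2020, 6, 1)) for item in inventory}
for name, flags in report.items():
    print(name, "->", flags or ["ok"])
```

In practice the records would come from your asset management system or a spreadsheet export rather than a hard-coded list, and a scheduled run would surface aging, overloaded, underutilized or non-redundant equipment before the annual review.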

While you may not be able to avoid every mishap, implementing these strategies will help you recognize IT systems issues and vulnerabilities and fortify a response. It gives you the latitude to create safeguards and make proactive, planned changes before you’re faced with a failure in a business-critical system.

In the next installment in our blog series about the top reasons for data center downtime, we review the role of human error in data center outages.

Looking for advice on how to manage your data center? We’re here to help.

  1. Lawrence, A. (2020, January). Houston We Have a Problem. Uptime Institute. Retrieved from http://journal.uptimeinstitute.com/outages-drive-authorities-and-businesses-to-act
  2. Heslin, K. (2019, September). How to avoid outages: Try harder! Uptime Institute. Retrieved from https://journal.uptimeinstitute.com/how-to-avoid-outages-try-harder
