Solutions Architect's Handbook

Building resilient architecture

Design for failure, and nothing will fail. A resilient architecture keeps your application available to customers and lets it recover from failure. Making your architecture resilient means applying best practices in every aspect so that your application is recoverable. Resiliency needs to be applied in all layers of the architecture, including infrastructure, application, database, security, and networking.

From the security perspective, a Distributed Denial of Service (DDoS) attack has the potential to impact the availability of services and applications. A DDoS attack usually floods your server with fake traffic and keeps it busy, so that legitimate users are unable to access your application. This can happen at the network layer or the application layer. You will learn more about DDoS attacks and their mitigation in Chapter 8, Security Considerations.

It's essential to take a proactive approach to prevent DDoS attacks. The first rule is to keep as much of the application workload as possible in private networks and, wherever possible, not to expose your application endpoints to the internet. To take early action, it is essential to know your regular traffic and to have a mechanism in place that detects substantial suspicious traffic at the application and network packet levels.
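Knowing your regular traffic can be as simple as keeping a baseline of request rates and flagging values far above it. The following is a minimal sketch under stated assumptions, not a production detector: the function name `is_suspicious`, the sample numbers, and the three-standard-deviation threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def is_suspicious(baseline_rps, current_rps, threshold_sigmas=3.0):
    """Return True when current_rps is far above the recorded baseline.

    baseline_rps: requests-per-second samples taken during normal operation.
    threshold_sigmas: how many standard deviations above the mean counts
    as suspicious (an assumed default, tune it for your own traffic).
    """
    if len(baseline_rps) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline_rps)
    # Use a floor of 1.0 so a perfectly flat baseline does not flag
    # every tiny increase as an attack.
    sigma = max(stdev(baseline_rps), 1.0)
    return current_rps > mu + threshold_sigmas * sigma


# Hypothetical samples of normal traffic, in requests per second.
baseline = [100, 110, 95, 105, 98, 102]

print(is_suspicious(baseline, 108))  # within normal variation
print(is_suspicious(baseline, 900))  # far above the baseline
```

A real deployment would feed this kind of check from monitoring data and combine it with packet-level signals; the sketch only shows the idea of comparing current traffic against a known baseline.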

Exposing your application through a content distribution network (CDN) provides built-in protection, and adding Web Application Firewall (WAF) rules can help to block unwanted traffic. Scaling should be your last resort, but be ready with an auto-scaling mechanism that lets you scale your servers if such an event occurs.

To achieve resiliency at the application level, the first thing that comes to mind is redundancy, which makes your application highly available by spreading the workload across geographic locations. To achieve redundancy, you can have a duplicate server fleet in a different rack in the same data center and in a different region. If servers are spread across different physical locations, the first level of traffic routing can be handled using the Domain Name System (DNS) before traffic reaches the load balancer:

Application architecture resiliency

As you can see in the preceding architecture, resiliency needs to be applied in all the critical layers that affect the application's availability in order to implement design for failure. To achieve resiliency, the following best practices need to be applied to create a redundant environment:

  • Use the DNS server to route traffic between different physical locations so that your application will still be able to run in the case of entire region failure.
  • Use the CDN to distribute and cache static content such as videos, images, and static web pages near the user location, so that your application will still be available in case of a DDoS attack or local point of presence (PoP) location failure.
  • Once traffic reaches a region, use a load balancer to route traffic to a fleet of servers so that your application can still run even if one location fails within your region.
  • Use auto-scaling to add or remove servers based on user demand. As a result, your application is not affected by individual server failure.
  • Create a standby database to ensure the high availability of the database, meaning that your application remains available in the event of a database failure.

In the preceding architecture, if any component fails, you should have a backup to recover it and achieve architecture resiliency. The load balancer and the routers at the DNS layer perform health checks to make sure that traffic is routed only to healthy application instances. You can configure them to perform a shallow health check, which monitors local host failures, or a deep health check, which can also detect dependency failures. However, a deep health check takes more time and is more resource-intensive than a shallow health check.
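The difference between the two kinds of health check can be sketched as follows. This is an illustrative example only: the function names and the `database` and `cache` probes are hypothetical stand-ins, and a real check would be exposed through whatever endpoint your load balancer polls.

```python
from typing import Callable, Dict


def shallow_health_check() -> bool:
    """Report only that the local process is up and able to respond."""
    return True


def deep_health_check(dependencies: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Probe every dependency and report the status of each one.

    dependencies maps a name to a zero-argument probe that returns True
    when the dependency is reachable. A probe that raises is treated as
    unhealthy instead of crashing the health check itself.
    """
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """The instance is healthy only if every dependency passed."""
    return all(results.values())


# Hypothetical probes: the database answers, the cache connection fails.
def probe_database() -> bool:
    return True


def probe_cache() -> bool:
    raise ConnectionError("cache unreachable")


results = deep_health_check({"database": probe_database, "cache": probe_cache})
print(shallow_health_check())  # the host itself is fine
print(results)                 # per-dependency status
print(is_healthy(results))     # the deep check fails because of the cache
```

The sketch also shows where the extra cost comes from: the deep check makes one call per dependency on every poll, while the shallow check does no outbound work at all.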

At the application level, it is essential to avoid cascading failure, where the failure of one component can bring down the entire system. There are different mechanisms available to handle cascading failures, such as applying timeouts, rejecting traffic, implementing idempotent operations, and using the circuit breaker pattern. You will learn more about these patterns in Chapter 6, Solution Architecture Design Patterns.
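As a rough illustration of one of these mechanisms, the following is a minimal circuit breaker sketch. It is not the implementation discussed in Chapter 6: the class names `CircuitBreaker` and `CircuitOpenError`, the threshold of failures, and the reset timeout are assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a failing dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each request to a dependency, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function that should itself apply a timeout. While the circuit is open, callers receive `CircuitOpenError` immediately, which keeps a slow or failing component from tying up the rest of the system.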