This is a guest blog post by Linas Klimaitis, SysAdmin here at Tesonet.
The evolution of microservice architecture and containers can create quite a headache for the operations team. The way developers now develop, build, test and deploy applications is becoming more complex. Thousands of interconnected containers may prevent applications from crashing, but this means operations teams have to manage even more complicated processes. With that, debugging becomes an even bigger hassle – finding what causes a bug often requires more thorough investigation and critical thinking than the fix itself.
Where to Begin?
Orchestrating your containers using Kubernetes, Swarm or Mesos Marathon will greatly help with the management part – as long as you have the resources needed to scale certain applications. With an orchestrator, you can scale at the press of a button – new containers will be created and prepared to serve requests from your clients almost immediately. But what if the application containers work just fine in the CI and QA stages, but start crashing in production?
With CI/CD infrastructure in place, development teams can deploy 10, 20 or even more than 100 times in a single day. So it is very important to make sure your monitoring services are aware of new deployments and notify you of any anomalies that appear after a deployment. They should also allow you to roll back the application before the clients start noticing issues.
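As a rough illustration of deployment-aware monitoring, the sketch below compares the error rate shortly before and shortly after a deployment and flags a roll-back when it rises too much. The function names, the tuple format and the one-percentage-point threshold are assumptions made for this example, not part of any particular monitoring tool.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_roll_back(before, after, max_increase=0.01):
    """Decide whether a deployment made things noticeably worse.

    `before` and `after` are (errors, total_requests) tuples measured
    around the deployment. `max_increase` is a hypothetical tolerance:
    the largest acceptable rise in error rate (0.01 = one percentage point).
    """
    return error_rate(*after) - error_rate(*before) > max_increase


# Hypothetical numbers: 0.5% errors before the deployment, 6% after it.
if should_roll_back(before=(5, 1000), after=(60, 1000)):
    print("error rate jumped after the deployment, consider rolling back")
```

In practice the two windows would come from your monitoring system, tagged with the deployment timestamp so the comparison is made automatically.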
Upgrading an existing application to a new version can create a very chaotic environment for the people using the service, as requests started against the previous version of the application can end up being served by the new version. Thus, it’s very important to set up canary deployments that allow users to finish their requests using the old version and re-route them to the new version once those requests are finished. Otherwise, changes between versions might create undesired and unexpected behaviour which will impact the user experience.
Prepare for the Unexpected
However, even with canary deployments in place and monitoring tuned, the unexpected happens – timeouts, connection failures and unresponsive middleware components are the most common examples.
There are many possible reasons for this. Garbage collection, master election and packet retransmissions are some of the most common reasons why distributed systems fail to respond to clients’ requests. Developers often spend a lot of time debugging their application because the 50th, 75th or even 90th percentile of connection times from the application to the middleware looks just fine, yet the application still fails to connect 1 in 1000 times. Misconfigured firewalls or misconfigured middleware servers are the primary suspects, but it can be difficult to tell exactly whether the client or the server is misbehaving.
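To see why percentiles can hide a 1-in-1000 failure, here is a small sketch with made-up numbers: 999 requests that connect in 5 ms and one that hits a 30-second timeout. The sample values and the helper name are assumptions chosen purely for illustration.

```python
import statistics

# Hypothetical samples in milliseconds: 999 fast connections and one
# connection that ran into a 30-second timeout (1 in 1000).
TIMEOUT_MS = 30_000.0
latencies_ms = [5.0] * 999 + [TIMEOUT_MS]


def percentile(samples, p):
    """Return the p-th percentile (1 to 99) from 99 quantile cut points."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
worst = max(latencies_ms)
failure_rate = sum(1 for x in latencies_ms if x >= TIMEOUT_MS) / len(latencies_ms)

print(f"p50={p50} ms, p90={p90} ms, max={worst} ms, failures={failure_rate:.1%}")
```

Both percentiles sit at the fast value, so a dashboard built on them alone would show nothing wrong. Only the maximum (or an explicit failure count) reveals the request that timed out.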
That’s why it’s very important to implement some form of tracing between the components. This allows both the development and operations teams to see whether the application sent the request, whether it was acknowledged by the middleware server, and which application container and middleware server exactly served which request. It will allow you to find the culprits without going through every server and every container manually.
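A minimal sketch of the idea, assuming the application tags every request with a unique ID that the middleware logs as well. The header name X-Request-ID, the component names and the in-memory log are all assumptions for this example; a real setup would use a tracing system rather than a Python list.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # assumed header name for this sketch


def app_send(payload, log, container="app-container-1"):
    """Application side: attach a fresh trace ID and record the send."""
    trace_id = uuid.uuid4().hex
    headers = {TRACE_HEADER: trace_id}
    log.append({"component": container, "event": "sent", "trace_id": trace_id})
    return headers, payload


def middleware_handle(headers, payload, log, server="middleware-1"):
    """Middleware side: record which server acknowledged which request."""
    trace_id = headers.get(TRACE_HEADER, "missing")
    log.append({"component": server, "event": "acknowledged", "trace_id": trace_id})
    return {"status": "ok", TRACE_HEADER: trace_id}


def events_for(trace_id, log):
    """List (component, event) pairs recorded for one request."""
    return [(e["component"], e["event"]) for e in log if e["trace_id"] == trace_id]


shared_log = []
headers, payload = app_send("ping", shared_log)
middleware_handle(headers, payload, shared_log)
print(events_for(headers[TRACE_HEADER], shared_log))
```

If a request shows up as "sent" but never as "acknowledged", you know to look at the network or the middleware rather than at the application code.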
Retain Good User Experience
Sometimes even being able to reproduce the bug does not solve the issue. Distributed systems are not 100% reliable – even the internet is operating at around 1% packet loss. So our applications should be resilient – designed and planned around transient failures. Implementing timeouts and retries in the application will make sure that transient errors won’t impact user experience, and circuit breaking will make sure that the application won’t try to contact failed services until their issues have been resolved and it’s safe to do so. Ideally, though, the code that handles and tries to recover from network failures should be decoupled from the application code and moved outside the application by using a service mesh.
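To make these patterns concrete, here is a minimal in-application sketch of retries with exponential backoff and a simple circuit breaker. The class and function names, thresholds and delays are assumptions for illustration; the wrapped call is expected to enforce its own timeout (for example a socket or HTTP client timeout) and raise ConnectionError on failure.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a failing service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(func, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry transient connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay_s * 2 ** attempt)
```

A caller could combine the two as breaker.call(lambda: call_with_retries(fetch)), where fetch is whatever function talks to the remote service. Writing and tuning this in every service is exactly the kind of work a service mesh can take over.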
One of the most popular service meshes today is Istio. It allows you to implement circuit breaking, tracing, traffic shifting, canary deployments and even fault injection from outside the application code. By leveraging the Envoy proxy developed by Lyft, Istio allows the operations team to configure services to temporarily have longer timeouts, gradually shift traffic to newly upgraded database servers and roll traffic back to the previous ones if needed, without sacrificing the availability of the application or disrupting the service for the client. It also allows developers to see how requests inside the mesh are routed and which endpoints are most error-prone or sluggish, so they can optimise the resiliency of the application without blindly guessing why some services tend to be much slower when certain requests are made.
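The sketch below illustrates the idea behind gradual traffic shifting and roll-back using percentage weights. It is plain Python written for this post, not Istio's API or configuration format, and the backend names db-v1 and db-v2 are hypothetical.

```python
import random


def pick_backend(weights, rng=random):
    """Pick a backend name according to its percentage weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def shift_traffic(weights, old, new, step):
    """Move up to `step` percentage points of traffic from `old` to `new`."""
    moved = min(step, weights[old])
    return {**weights, old: weights[old] - moved, new: weights[new] + moved}


# Start with all traffic on the old database servers.
weights = {"db-v1": 100, "db-v2": 0}

# Send 10% of requests to the upgraded servers and watch the metrics.
weights = shift_traffic(weights, old="db-v1", new="db-v2", step=10)

# Something looks wrong: move everything back to the previous version.
weights = shift_traffic(weights, old="db-v2", new="db-v1", step=100)
```

A mesh applies the same kind of weighting in the proxy layer, so the application itself never needs to know that two versions exist.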
While there is no CI system that will ensure all of the bugs are fixed before the application is released, there’s still something you can do. Designing your application to be lean, modular and resilient allows you not only to spend less time troubleshooting, but also to work more on improving the user experience and delivering the best possible product in the shortest possible time.