
Jesse's Software Engineering Blog

Apr 10

Jesse

AWS CloudWatch Metrics

AWS CloudWatch metrics are an extremely useful tool for monitoring both application code and AWS infrastructure. AWS publishes metrics for all of its services, and it also lets you push custom metrics and parse CloudWatch Logs to build custom metric filters. These metrics can be used to build dashboards, message team members, or trigger a pager when things are not working as expected. I am going to briefly outline how to set up custom metrics and how they can be incorporated to monitor overall system health, along with a couple of Node.js code samples.

Custom Log Filters

Building custom metric filters on top of logs can help monitor distributed workflows by parsing centralized logs and aggregating frequency counts. For example, if different pieces of software handle different parts of the core application logic, each part can publish a “status” to a shared CloudWatch log group. If those messages share the same structure, they can be parsed and grouped. Let’s say we use the following message structure:

{
  "status": "ERROR",
  "message": "Db call failed"
}

We could then set up a custom metric filter on the log group (metric filters are defined at the log group level and apply to every stream in that group) to match entries containing the word “ERROR”. Once the metric has been created, it is available to both CloudWatch dashboards and CloudWatch alarms. CloudWatch also provides various aggregation options, including sum, average, percentiles (pX), and min/max. As an example, these custom filters could be used to proactively ping a Slack channel if there are more than 5 ERROR messages per hour, or trigger a pager if there are no SUCCESS messages for more than an hour.
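As a rough sketch, a filter like this could also be created programmatically with the AWS SDK for Node.js using putMetricFilter. The log group, filter, and namespace names below are placeholders, not values from a real system:

const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs({ region: 'us-east-1' });

logs.putMetricFilter({
  logGroupName: '/my-app/workflow',        // hypothetical shared log group
  filterName: 'ErrorCount',
  filterPattern: '{ $.status = "ERROR" }', // match JSON events whose status field is ERROR
  metricTransformations: [{
    metricName: 'ErrorCount',
    metricNamespace: 'SystemMetrics',
    metricValue: '1'                       // count 1 per matching log event
  }]
}, (err) => {
  if (err) console.error('Failed to create metric filter', err);
  else console.log('Metric filter created');
});

The filter pattern above assumes the log events are JSON; a plain-text pattern such as "ERROR" would also work for unstructured entries.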

Once the error message structure has been defined, it can be encapsulated in a shared resource and distributed to your application or across a fleet of microservices, ensuring that log messages stay consistent.
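For illustration, a shared logger along these lines (the module and helper names are hypothetical) keeps every service emitting the same structure:

// Hypothetical shared logger module, published internally so every service
// emits the same { status, message } structure to CloudWatch Logs.
function log(status, message) {
  // stdout from Lambda/ECS/EC2 log agents ends up in CloudWatch Logs
  console.log(JSON.stringify({ status, message }));
}

module.exports = {
  error: (message) => log('ERROR', message),
  success: (message) => log('SUCCESS', message)
};

A call such as logger.error('Db call failed') then produces exactly the kind of entry the metric filter above counts.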

Custom Data Metrics

Another approach is to push custom data directly into CloudWatch. Let’s say we have a component in our workflow that makes an API call. We may be interested in the response codes and the latency of each call.

{
  "Namespace": "SystemMetrics",
  "MetricData": [{
    "MetricName": "APILatency",
    "Dimensions": [{
      "Name": "Code",
      "Value": "200"
    }],
    "Timestamp": new Date().toISOString(),
    "Value": 234
  }]
}

Here we are monitoring the latency as well as grouping by status code. By using dimensions we can add more depth to our metric data; in this case we will be able to graph things such as average latency or p90 latency per status code. Alternatively, we could use no dimensions and simply graph latencies, or add more dimensions for more detailed output. Note that each dimension acts like part of a GROUP BY in a query, so getting too granular often isn’t helpful, and dimension values must be strings (which is why the status code above is "200" rather than 200). I’ve put together a couple of simple examples to demonstrate different metric patterns. There is also a throughput limit on metric data you should be aware of, which can be raised with a request to AWS support.
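As a minimal sketch, the params object above could be sent with the SDK’s putMetricData call. Here callApi() and the region are placeholders for whatever outgoing request is actually being measured:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

// callApi() is a stand-in for the outbound request being measured.
async function instrumentedCall() {
  const start = Date.now();
  const response = await callApi();
  const latencyMs = Date.now() - start;

  await cloudwatch.putMetricData({
    Namespace: 'SystemMetrics',
    MetricData: [{
      MetricName: 'APILatency',
      Dimensions: [{ Name: 'Code', Value: String(response.statusCode) }], // dimension values are strings
      Timestamp: new Date(),
      Unit: 'Milliseconds',
      Value: latencyMs
    }]
  }).promise();

  return response;
}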

Monitoring Metrics

Once the metrics are set up, either with custom log filters or custom data metrics, they can be monitored. Monitoring can be done passively, i.e. building out dashboards and checking them manually, or actively, i.e. sending an SNS notification or email when a metric falls outside a given threshold. Integrating AWS Lambda with Slack is now a predefined template that can be triggered by SNS. There are various approaches to monitoring depending on the severity of the issue.
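For example, an alarm on the p90 of the APILatency metric from earlier could page a team through an SNS topic. The alarm name, threshold, and topic ARN below are illustrative only:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

cloudwatch.putMetricAlarm({
  AlarmName: 'APILatency-p90-high',
  Namespace: 'SystemMetrics',
  MetricName: 'APILatency',
  Dimensions: [{ Name: 'Code', Value: '200' }],
  ExtendedStatistic: 'p90',          // percentile statistic
  Period: 300,                       // evaluate in 5 minute buckets
  EvaluationPeriods: 3,              // alarm after 3 consecutive breaches
  Threshold: 300,                    // milliseconds
  ComparisonOperator: 'GreaterThanThreshold',
  AlarmActions: ['arn:aws:sns:us-east-1:123456789012:ops-alerts'] // placeholder SNS topic
}, (err) => {
  if (err) console.error('Failed to create alarm', err);
  else console.log('Alarm created');
});

The SNS topic can then fan out to email, a pager, or the Lambda-to-Slack template mentioned above.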

First, it should be noted that AWS offers a large number of metrics for all of its services, from CPU and memory usage on RDS to network in/out on EC2 instances. Basically every piece of data you need (with a few exceptions) is available to confirm that things are running as expected. Whenever you create a new piece of AWS infrastructure, it is a good idea to set up monitoring for it so that you are notified when something goes wrong.

In addition to infrastructure monitoring, the custom metrics can be used to monitor the application logic. From the above example, are our outgoing API calls averaging less than 300ms? What percentage of responses are returning 200 status codes? By using the custom metrics, acceptable business behaviour can be monitored and validated against predefined thresholds.

While dashboards inside your AWS account are great, adding users with the correct permissions to multiple accounts creates a bit of overhead. CloudWatch integrates well with Elasticsearch, so logs from many accounts can be funneled into a single cluster, and a shared front end can be set up with Kibana or built in-house. Having the data inside Elasticsearch also makes it easy to query, which allows APIs to be built on top of the aggregated log data and monitored automatically with tools such as Runscope to ensure business expectations are being met.
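One common way to move log data out of CloudWatch is a subscription filter on the log group that forwards events to a Lambda function, which in turn indexes them into the cluster. A rough sketch with placeholder names and ARN (the forwarding Lambda and its invoke permission are assumed to exist already):

const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs({ region: 'us-east-1' });

logs.putSubscriptionFilter({
  logGroupName: '/my-app/workflow',    // hypothetical log group
  filterName: 'StreamToElasticsearch',
  filterPattern: '',                   // empty pattern forwards every event
  // placeholder ARN of a Lambda that indexes events into the Elasticsearch cluster
  destinationArn: 'arn:aws:lambda:us-east-1:123456789012:function:LogsToElasticsearch'
}, (err) => {
  if (err) console.error('Failed to create subscription filter', err);
  else console.log('Subscription filter created');
});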

Conclusion

While there are many great tools for monitoring infrastructure (I have had success with DataDog and New Relic), AWS CloudWatch is a great addition to any software monitoring system. It’s easy to set up, scalable, and fully managed. CloudWatch Logs can also be used in some creative ways as a time-series database, but be advised that when every piece of data matters, CloudWatch may not be the best solution. If the occasional missed data point (very uncommon) is acceptable, CloudWatch is hard to beat.
