CloudWatch Log Export

AWS services send their logs to CloudWatch whether they are configured to do so or not. AWS Step Functions pushes all transition events to CloudWatch under the log group /aws/vendedlogs/states/*. If your code is wrapped in one of these services, your logs will show up there as well. AWS Batch sends stdout logs to the log group /aws/batch/*.

Note: if you are using the AWS CDK or CloudFormation, you will need to declare these log groups explicitly in order to configure the retention period and removal policy, as they will otherwise be created for you automatically. See LambdaLogGroupConstruct for an example.

You can write your code to push your logs elsewhere, but for provider services, CloudWatch will be the interface unless you export your log group to Amazon S3 via the CreateExportTask API.

Clusterless now provides an Activity (think cron job) specifically for exporting a declared CloudWatch log group to an S3 bucket: aws:core:cloudWatchExport.

Unfortunately, there is a quota on this task: One active (running or pending) export task at a time, per account. This quota can't be changed.

This doesn’t mean the task is unusable, but it now becomes tricky. If there is already a task in play, a LimitExceededException will be thrown. It is unfortunate that this quota limit isn’t promoted up into the docs.
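A minimal sketch of what a submitter has to deal with, assuming boto3 is the client library (Clusterless itself may do this differently). The helper name try_create_export_task is hypothetical; create_export_task and the LimitExceededException class on the client are the boto3 names for the CloudWatch Logs API.

```python
def try_create_export_task(logs_client, **task_args):
    """Submit one export task.

    Returns the new task id, or None when another export task is already
    running or pending in the account (the one-at-a-time quota).
    `logs_client` is assumed to be a boto3 CloudWatch Logs client, e.g.
    boto3.client("logs"); `task_args` are passed through unchanged
    (taskName, logGroupName, fromTime, to, destination, ...).
    """
    try:
        response = logs_client.create_export_task(**task_args)
        return response["taskId"]
    except logs_client.exceptions.LimitExceededException:
        # Quota hit: the caller decides whether and when to retry.
        return None
```

Returning None instead of raising keeps the retry decision with the caller, which is where the waiting policy discussed below belongs.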

Now the question becomes: how many CloudWatch log groups can be exported via cron-like processes (via a Clusterless Activity) within an account?

An Activity requires as a parameter the frequency at which to create the export task, say every 5 or 10 minutes. The Activity will set the from and to timestamps for the previous interval of arrived logs. If the frequency is 10 minutes, then the prior 10 minute interval is exported.
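As a rough illustration of the interval arithmetic, here is a sketch that computes the from and to values in epoch milliseconds, which is the unit CreateExportTask expects. The function name is hypothetical, and aligning the window to the interval boundary is an assumption; the Activity may pick its window differently.

```python
from datetime import datetime, timezone


def previous_interval_ms(now: datetime, minutes: int) -> tuple[int, int]:
    """Return (from_ms, to_ms) for the last fully elapsed interval before `now`.

    Assumption: windows are aligned to multiples of `minutes` since the epoch.
    """
    step_ms = minutes * 60 * 1000
    now_ms = int(now.timestamp() * 1000)
    to_ms = now_ms - (now_ms % step_ms)  # floor to the most recent boundary
    return to_ms - step_ms, to_ms


# At 12:07:30 UTC with a 10 minute frequency, the window is 11:50:00 to 12:00:00.
window = previous_interval_ms(
    datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc), 10
)
```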

Assuming the task creator/submitter (an AWS Lambda in the Clusterless case) will retry failed task attempts, how long should it wait before trying again so that the retry is more likely to succeed and does not trigger API rate limiting? This is in part a function of how long a task takes to complete and how many competing Activities there are.

The describe-export-tasks AWS CLI command lists all export task metadata. The example here uses data from only one Activity (exporting one log group), where the log data is quite small.

aws logs describe-export-tasks | jq '.exportTasks[].executionInfo | .completionTime - .creationTime' | ministat -n
x <stdin>
  N      Min      Max    Median      Avg    Stddev
x 2663     1549    608539     2326   7376.5332   37206.281

The durations above are in milliseconds. The median (p50) is about 2.3 seconds, but a duration of just over 10 minutes was also encountered.
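For anyone who would rather do this without jq and ministat, the same durations can be pulled out of the describe-export-tasks JSON in a few lines. This is only a sketch: the function name is hypothetical, the sample values are made up for illustration, and skipping tasks that have no completionTime yet is my assumption about how unfinished tasks appear.

```python
import statistics


def export_durations_ms(describe_output: dict) -> list[int]:
    """Return completionTime - creationTime (milliseconds) per finished task."""
    durations = []
    for task in describe_output.get("exportTasks", []):
        info = task.get("executionInfo", {})
        # Assumption: tasks that have not finished lack a completionTime.
        if "creationTime" in info and "completionTime" in info:
            durations.append(info["completionTime"] - info["creationTime"])
    return durations


# Made-up sample in the shape of `aws logs describe-export-tasks` output.
sample = {
    "exportTasks": [
        {"executionInfo": {"creationTime": 1000, "completionTime": 3326}},
        {"executionInfo": {"creationTime": 5000}},  # not finished, skipped
        {"executionInfo": {"creationTime": 9000, "completionTime": 10549}},
    ]
}
durations = export_durations_ms(sample)
median_seconds = statistics.median(durations) / 1000
```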

Unfortunately, any given Lambda attempting to create an export task has at most 15 minutes (the maximum Lambda timeout) in which to succeed. If another Activity was exporting a different log group, and it was submitted milliseconds before the current Activity, the current Activity may need to wait (continue retrying) for 10 minutes before it can succeed and exit. This would be an expensive blocking operation.

Currently Clusterless does nothing clever to maximize the number of log groups that can be exported in an account. Clusterless only uses an exponential backoff retry policy starting at 3 seconds, up to a maximum wait of 5 minutes. This is all bounded by the Lambda's configured timeout. That is, retries will continue until the Lambda function is configured to stop (its timeout), at which point log data is emitted and the Lambda invocation is marked as a failure.
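To get a feel for how few attempts fit into one invocation, here is a sketch of such a backoff schedule. The 3 second start, 5 minute cap, and 15 minute budget come from the text above; the doubling factor is an assumption (the actual multiplier may differ), and the time spent in the API calls themselves is ignored.

```python
def backoff_delays(initial_s=3.0, max_delay_s=300.0, budget_s=900.0, factor=2.0):
    """Yield successive waits (seconds) while the total wait fits the budget.

    `factor` is an assumed multiplier, not a documented Clusterless setting.
    """
    delay, elapsed = initial_s, 0.0
    while elapsed + delay <= budget_s:
        yield delay
        elapsed += delay
        delay = min(delay * factor, max_delay_s)  # cap each wait at the maximum


delays = list(backoff_delays())
# → [3.0, 6.0, 12.0, 24.0, 48.0, 96.0, 192.0, 300.0], 681 seconds of waiting
```

Under these assumptions only eight retries fit before the Lambda times out, and the later ones are minutes apart, so a single slow export from a competing Activity can consume most of the budget.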

A more clever arrangement can be devised in order to reliably support multiple log group exports without costly blocking operations. Reach out if this would be of value.

Chris K Wensel
Data and Analytics Architect