I’m here to tell you the truth, the good, the bad and the ugly of AWS X-Ray and Lambda
While there is plenty of value in using AWS X-Ray with Lambda, its current limitations restrict its usefulness for complex systems. Learn more!
Jun 08, 2023 • 13 Minute Read
AWS announced general availability of AWS Lambda support for AWS X-Ray back in May — and it’s taken me a while to try it out. While I see plenty of value, its current limitations significantly restrict its usefulness in a complex system.
The AWS Lambda-specific documentation was a bit fragmented, so I relied on some experimentation and pieced together clues from several sources.
I discovered that recording annotations and metadata didn’t work as advertised in the sample — although I could be doing something wrong.
Update July 03, 2017: after this post was published the folks at AWS got in touch and kindly cleared up some of the issues highlighted here, which were caused by poor documentation that they’ll rectify in the near future. Scroll down to see the clarification on the relevant sections.
The Sample Application
The sample project I created centers around a Lambda function called service-a, which in turn calls a number of downstream systems:
- publishing to an SNS topic
- GET’ng and PUT’ng an object in S3
- GET’ng and PUT’ng a row in DynamoDB
- invoking another Lambda function (service-c) using the Lambda API
- making an HTTP request to an API Gateway endpoint backed by another Lambda function
The service-b, error and timeout functions in the diagram represent the success, error and timeout cases respectively. A rough sketch of what the service-a handler might look like follows below.
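For context, here is that sketch. This is not the exact code from the sample project: the resource names and environment variables are placeholders, and it uses the X-Ray SDK’s captureAWS wrapper for brevity, whereas the sample wraps each call in a custom subsegment (shown later in this post).

```js
// Illustrative sketch only: resource names and env vars are placeholders.
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk')); // trace all AWS SDK calls

const sns = new AWS.SNS();
const s3 = new AWS.S3();
const dynamodb = new AWS.DynamoDB.DocumentClient();
const lambda = new AWS.Lambda();

module.exports.handler = (event, context, callback) => {
  Promise.all([
    sns.publish({ TopicArn: process.env.TOPIC_ARN, Message: 'test' }).promise(),
    s3.putObject({ Bucket: process.env.BUCKET, Key: 'test.txt', Body: 'hello' }).promise(),
    dynamodb.put({ TableName: process.env.TABLE, Item: { id: 'test' } }).promise(),
    lambda.invoke({ FunctionName: 'service-c', Payload: '{}' }).promise()
    // plus an HTTPS request to the API Gateway endpoint backing service-b
  ])
    .then(() => callback(null, { statusCode: 200 }))
    .catch(callback);
};
```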
The Good
Once I figured out the magic incantations, I was able to get the results I wanted. It took more time and energy than it should have — but by and large, most features worked as advertised at the first (or second) time of asking.
Below is a trace of the service-a function, which includes the time it takes for Lambda to initialize the function, and the various downstream systems it talked to — all nested under custom subsegments.
It even includes the trace of the service-c function — which was invoked using the Lambda API — and the time it spent publishing to SNS.
The service map for service-a includes service-c as a downstream dependency, as well as service-c’s dependency on SNS.
The Bad — it’s always a 200 response code
When the service-a function is invoked through its API Gateway endpoint and errors, the corresponding trace still reports a 200 response code.
Presumably what X-Ray sees is a 200 response from the Lambda service, whose payload indicates a 502 response to the API Gateway invocation, and so it thought “hey, it’s a 200!”.
Oddly enough, the trace for the error function also reports a 200 even though its own status field indicates it had erred.
This behavior is both confusing and inconsistent to me — perhaps I have misunderstood how it works. Sadly, the X-Ray concepts page also does not explain the difference between an Error and a Fault.
While this might seem like a small nuisance now, the inability to quickly identify error traces will hurt you most when you need to diagnose problems in production. Most likely, when you’re under the most time pressure.
Update July 03, 2017: AWS confirmed that the reason the errors are reported as 200 is that the Lambda service returns a 200 response (with a payload that indicates an error). One workaround is to use the filter expression service() { fault }, which returns all traces that contain a fault.
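If you want to pull those fault traces programmatically rather than through the console, a minimal sketch with the AWS SDK for Node.js might look like this (the one-hour window is arbitrary):

```js
// Sketch: fetch trace summaries from the last hour that contain a fault,
// using the same filter expression as the console workaround above.
const AWS = require('aws-sdk');
const xray = new AWS.XRay();

const endTime = new Date();
const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);

xray.getTraceSummaries({
  StartTime: startTime,
  EndTime: endTime,
  FilterExpression: 'service() { fault }'
}, (err, data) => {
  if (err) { return console.error(err); }
  console.log(`found ${data.TraceSummaries.length} traces with faults`);
});
```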
Traces don’t span over API Gateway
When the service-a function makes an outgoing HTTP request to an API Gateway endpoint, the trace stops at the API Gateway endpoint and doesn’t extend to the Lambda functions that are triggered by API Gateway.
This behavior was consistent with all 3 endpoints I tested — service-b, error and timeout.
For this test, I followed the X-Ray documentation and used the X-Ray SDK to wrap the Node.js https module when making the HTTP request.
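For reference, the wrapping looks roughly like this; the hostname and path below are placeholders for the API Gateway endpoint:

```js
// Sketch: wrap the https module so outgoing requests are traced and the
// X-Ray trace headers are added to them automatically.
const AWSXRay = require('aws-xray-sdk');
const https = AWSXRay.captureHTTPs(require('https'));

https.get({
  hostname: 'xxx.execute-api.us-east-1.amazonaws.com', // placeholder
  path: '/dev/service-b'                               // placeholder
}, res => {
  console.log(`service-b responded with ${res.statusCode}`);
});
```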
I can see the trace IDs are correctly passed along in the outgoing HTTP request and received by the handling Lambda function.
This is the service map I expected to see in this case — where service-a’s trace follows through the HTTP request to API Gateway and includes the invocation of the timeout function.
Instead, the actual service map stops at the API Gateway.
However, when invoking another Lambda function directly (using the Lambda API and wrapped AWS SDK) the trace worked as expected.
Perhaps the limitation lies with API Gateway?
Below you can see the breakdown of time spent in service-c when it’s invoked directly, but the tracing to service-b stopped at the API Gateway layer.
The Ugly — no sampling
According to the Lambda’s documentation on X-Ray, requests should be sampled at 1 request per minute.
However, that wasn’t the case in my experiments. EVERY request was sampled, as you can see from the age
of the traces in the screenshot below. This behavior was consistent when invoking Lambda via API Gateway as well as via the Lambda management console.
As an example, I worked on a moderately successful social game at Gamesys with ~1M DAU. At roughly 250M user requests per day and X-Ray’s price of $5 per million traces recorded, tracing every request would have cost $5 * 250 * 30 days = $37,500 a month, which was more than our entire AWS bill at the time!
Update July 03, 2017: this turns out to be a problem with the documentation, which doesn’t mention that sampling is volume-based and only kicks in once you reach a certain volume of requests.
Annotations and Metadata only work on subsegments
The one thing that just refused to work — even though I followed the examples — was adding annotations and metadata to the root segment:
const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  ...
  // fetch what should be the root segment for this invocation
  let segment = AWSXRay.getSegment();
  let n = Math.random() * 3;
  segment.addMetadata('random', `${n}`);      // this doesn't work
  segment.addAnnotation('path', event.path);  // this doesn't work
  ...
}
Interestingly, adding annotations and metadata to subsegments works fine.
let publishSNS = segment => {
  return new Promise((resolve, reject) => {
    console.log('publishing to SNS topic');

    // the X-Ray SDK passes a new subsegment to the wrapped function
    let f = co.wrap(function* (subsegment) {
      ...
      // annotations and metadata attached to the subsegment show up in the trace
      subsegment.addAnnotation('topic', topicArn);
      subsegment.addMetadata('message', 'test');

      let req = {
        Message: message,
        TopicArn: topicArn
      };
      yield sns.publishAsync(req);

      subsegment.close();
      resolve();
    });

    AWSXRay.captureAsyncFunc("## publishing to SNS", f, segment);
  });
};
Looking at the logs, I found something interesting: the segment ID for the root segment doesn’t match the segment ID in the X-Ray trace.
For instance, I fetch the root segment for my function in the handler and log it to CloudWatch Logs.
const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  ...
  // this should be the root segment for my function
  let segment = AWSXRay.getSegment();
  console.log(JSON.stringify(segment));
  ...
}
In the logs I can see the segment ID is 05b2b9ac6c9e5682.
But in the X-Ray trace, the segment ID for the root segment is 2b7d5b4a2a2d96e9.
Furthermore, the trace ID is also different:
- in the logs it’s 1-59504311-d765e7accb8558871fa89d6d
- in the X-Ray console it’s 1-59504312-5ef2a3eda0c1b2c4d64dda00
This was very odd, so I decided to track the trace ID in the logs vs in the X-Ray console, starting with a coldstart.
Bingo! Looks like it’s a bug in the X-Ray SDK for Node.js where AWSXRay.getSegment() returns the root segment from the previous invocation.
Update July 03, 2017: whilst there was a bug in the X-Ray SDK with regards to tracking the trace ID, adding annotations and metadata to the root segment is simply not supported, which the documentation doesn’t explicitly state. You can work around this by creating a subsegment that covers the entire span of your function invocation to act as your custom root segment, and attach any annotations and metadata related to the invocation there.
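Here’s a minimal sketch of that workaround, where doWork is a placeholder for the function’s actual logic:

```js
// Sketch: open a subsegment that spans the whole invocation and attach
// annotations/metadata to it instead of the root segment.
const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  const segment = AWSXRay.getSegment();

  AWSXRay.captureAsyncFunc('## handler', subsegment => {
    subsegment.addAnnotation('path', event.path);
    subsegment.addMetadata('event', event);

    doWork(event)                 // placeholder for the real work
      .then(result => {
        subsegment.close();
        callback(null, result);
      })
      .catch(err => {
        subsegment.close(err);    // record the error on the subsegment
        callback(err);
      });
  }, segment);
};
```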
The Truth
So there you have it, my weekend escapade with AWS X-Ray and Lambda :-)
“I came here to tell you the truth, the good, the bad and the ugly.” — Oliver North
Overall I’m impressed with what I saw, and feel X-Ray would have added a lot of value to the serverless architecture I was building at Yubl. However, the inability to span traces over API Gateway endpoints makes it a far less useful addition to our ecosystem.
Furthermore, the X-Ray service is focused on execution time and helping you identify performance bottlenecks. However, there’s another important aspect to distributed tracing that X-Ray does not concern itself with — helping you debug your system by ensuring a set of correlation IDs are captured in all log messages.
Although you can probably use the trace IDs that X-Ray provides, it’s still your responsibility to capture them in all log messages and propagate all your logs to one easily searchable place. We invested some effort into enabling distributed tracing in our serverless architecture at Yubl, which you can read about in detail in this post.
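For example, a cheap way to get started is to pull the trace ID from the _X_AMZN_TRACE_ID environment variable that the Lambda runtime sets and include it in every log message; the structured log format here is just an illustration:

```js
// Sketch: include the X-Ray trace ID in every log message so logs can be
// correlated with traces. The log format is illustrative.
const log = (message, data) => {
  console.log(JSON.stringify(Object.assign({
    message: message,
    traceId: process.env._X_AMZN_TRACE_ID // set by the Lambda runtime
  }, data)));
};

log('publishing to SNS topic', { topic: 'my-topic' }); // 'my-topic' is a placeholder
```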
Are you using X-Ray in production? I’d love to hear your thoughts and feedback on the service, and any pitfalls to look out for, in the comments section!