When it comes to order confirmation, we have an overview of whether our production or consumption is generally breaching the SLO that we've defined for this exact use case. That gives you an overview of how many messages you actually have, what the message rate per queue is, and what footprint you have for this queue. Customer inbox is mainly the team that builds the overall tooling around communication with our consumers. A message that's already rendered, on the contrary, is just an email that's about to go out. That's a one-hour meeting where we go through our SLOs, product SLOs alongside specific service SLOs. We have, in the past quarter, more than a billion site visits. Generally, just killing the connection, which is something that we do automatically as well, solves this issue. We thought about actually contributing the filtering of metrics so that it becomes a little bit more similar to what we have in our Lambda function. What an interesting coincidence. 100% means that you're utilizing your system at its best. Your consumers are able to cope with the production throughput of your producers. Not all of those are connected to the RabbitMQ broker. In total, we have around 15,500 employees. We also utilize the LightStep plugin in Grafana to oversee what's going on.
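To make the consumer-utilization and per-queue numbers concrete, here is a minimal sketch of polling the RabbitMQ management API the way a periodic check might; the host, credentials, and the 0.8 utilization threshold are placeholders, not the actual setup described in the talk.

```python
# Sketch: poll the RabbitMQ management API for per-queue metrics
# (message count, publish rate, consumer utilisation) and flag queues
# whose consumer utilisation falls below an assumed SLO threshold.
import requests

MGMT_URL = "http://rabbitmq.example.internal:15672"  # assumed host
AUTH = ("monitoring", "secret")                      # assumed read-only user

def check_queues(utilisation_slo=0.8):
    queues = requests.get(f"{MGMT_URL}/api/queues", auth=AUTH, timeout=10).json()
    for q in queues:
        name = q["name"]
        messages = q.get("messages", 0)
        # message_stats is absent on idle queues, so guard the lookup
        publish_rate = q.get("message_stats", {}).get("publish_details", {}).get("rate", 0.0)
        # consumer_utilisation can be null when there are no consumers
        utilisation = q.get("consumer_utilisation") or 0.0
        if utilisation < utilisation_slo:
            print(f"{name}: consumer utilisation {utilisation:.2f} below SLO "
                  f"({messages} messages, {publish_rate:.1f} msg/s published)")

if __name__ == "__main__":
    check_queues()
```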
Those are services that actually scale out on their own. We have this overview per application. We rely heavily on the management endpoint plugin. That's the actual official logo of an EC2 instance, by the way. And then, whenever there is a price reduction happening on the platform, we try to match those reductions with customers that are interested in them. RabbitMQ generally has a long-lasting TCP connection from the application to the broker itself. Templates are in one system, rendering happens in another system, and there is an orchestrator in between. And the CPU usage of all nodes. And then, afterwards, if that does not exist anymore, your downscaling kicks in, depending on how you downscale and how you cool down. We have a rendering system that renders all of this.
You just need to be careful around what kind of applications you're aiming to scale dynamically without caring for this. Generally, there is a stream running on order placement. Based on this alarm, your Auto Scaling kicks in. Has anyone shopped at Zalando before? Towards the end of the debriefing session, we asked ourselves, what would have been a topic that we could have contributed to, or that was missing last year? We came to the conclusion, one year ago, that monitoring was not really covered in the first Summit edition. One of those dependencies actually scales based on queue length, because that was a better indicator in our case than request latency or CPU usage. As I mentioned earlier, we run on AWS. Whenever you actually see a bump over here that goes beyond your limit, that's an alarming thing to look at, because this is where RabbitMQ would probably start throttling your producers, alongside your consumers not being able to catch up. A little bit of context as well around channels. Just going through these monitoring pieces of the organization generally gives you a more holistic overview of what's actually going on. If we have a mismatch between those numbers, then we have an instance missing its queue, which we then have to either intervene on or replace. What was written in the post mortem? This is what we do. That's a very good view, actually, of the consumer utilization. That's something that was mentioned earlier in another talk. We also want our commercial communication with our customers to build a long-lasting relationship, so that the messages we send come to them at the right point in time, whether that's the moment when they're waiting for a sale, or the moment when they've bought something and want to couple it with something else.
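As a rough illustration of scaling on queue length rather than CPU, the sketch below pushes the backlog of one queue into CloudWatch as a custom metric and defines an alarm that an Auto Scaling policy could act on. The namespace, queue name, thresholds, and policy ARN are all assumptions, not the configuration actually used here.

```python
# Sketch: publish queue depth as a custom CloudWatch metric and create an
# alarm that triggers a (pre-existing, hypothetical) scaling policy.
import boto3
import requests

MGMT_URL = "http://rabbitmq.example.internal:15672"  # assumed host
AUTH = ("monitoring", "secret")                      # assumed credentials

cloudwatch = boto3.client("cloudwatch")

def publish_queue_depth(vhost="%2F", queue="price-reductions"):
    q = requests.get(f"{MGMT_URL}/api/queues/{vhost}/{queue}", auth=AUTH, timeout=10).json()
    cloudwatch.put_metric_data(
        Namespace="CustomerInbox/RabbitMQ",          # made-up namespace
        MetricData=[{
            "MetricName": "QueueDepth",
            "Dimensions": [{"Name": "Queue", "Value": queue}],
            "Value": q.get("messages", 0),
            "Unit": "Count",
        }],
    )

def ensure_scale_out_alarm(scaling_policy_arn):
    # Alarm fires when the backlog stays above 10k messages for two minutes;
    # the attached scaling policy then adds consumer instances.
    cloudwatch.put_metric_alarm(
        AlarmName="price-reductions-backlog-high",
        Namespace="CustomerInbox/RabbitMQ",
        MetricName="QueueDepth",
        Dimensions=[{"Name": "Queue", "Value": "price-reductions"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=2,
        Threshold=10000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[scaling_policy_arn],
    )
```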
Within our team alone, which is one of four multidisciplinary teams, we run 70 services. And then, we generally have a knowledge sharing session within our team around what knowledge was gained from the Summit and what kind of things we could include in our normal day-to-day work. This is customer inbox, at a glance. And then, once the backlog was processed, the service gradually scaled down. It's called CloudWatch Exporter. The majority of our applications are either producers or consumers, but we have an overlapping set of applications that both produce and consume, where I would generally recommend that you run the production and consumption on different connections with their own channels, because that will give you a better throughput in general, and your application's publishing is not impacted if the consumption is throttled by RabbitMQ while publishing is not throttled at all. We monitor the queues, the exchanges, the overall footprint of every queue and exchange, and the number of messages and messages in RAM around all of this, alongside the clients. And then, we process the message. This is directly hooked up to any of our checks, in ZMON, that we consider mission critical. That was the learning, actually, that we've gained around monitoring the number of exclusive queues, the number of instances running, and how to have better visibility around the usage of exclusive queues and what you expect from them. This is, by practice, what we figured out as well within our system. What kind of action items were there? Because we have different message sizes: an entry message coming into our system, before rendering, or templating, or figuring out what's eventually going out, is very slim. That takes me, actually, to the next point of what we've done with the solution. Every application generally has its own dashboard where it monitors its own performance, data stores, and everything, alongside its own connections, channels, and publish and consumption rates from RabbitMQ. We seem to be on the right track. That gives you a very holistic view of what's going on.
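The "separate connections for producing and consuming" recommendation could look roughly like this with the Python pika client; the host, queue, and exchange names are illustrative, and heartbeats are disabled only to keep the single-threaded sketch simple.

```python
# Sketch: consume on one connection, publish on another, so that broker
# throttling on one side never stalls the other.
import pika

params = pika.ConnectionParameters(host="rabbitmq.example.internal", heartbeat=0)

# One connection (and channel) dedicated to consuming ...
consume_conn = pika.BlockingConnection(params)
consume_ch = consume_conn.channel()
consume_ch.basic_qos(prefetch_count=50)

# ... and a second connection dedicated to publishing.
publish_conn = pika.BlockingConnection(params)
publish_ch = publish_conn.channel()

def handle(ch, method, properties, body):
    rendered = body.upper()  # placeholder for the real rendering work
    publish_ch.basic_publish(exchange="outbox", routing_key="email", body=rendered)
    ch.basic_ack(delivery_tag=method.delivery_tag)

consume_ch.basic_consume(queue="render-requests", on_message_callback=handle)
consume_ch.start_consuming()
```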
We've built a templating solution in-house with our own UI, where marketeers can go inside the tool and drag and drop any base components they want to fill in a template. Because, based on those messages on those queues, we already know that this service will start working now. Coming to the price reduction use case, for instance. That will trigger the alarm. It could be either a push notification, an email, or anything else. We run almost 6,000 microservices in production. This is actually a view from a busy time of our queues, at a glance. You just annotate when you're publishing or consuming. If we still have messages over there, we actually throttle our consuming application that brings messages in, until our services inside have processed this backlog. Within customer inbox, we have four multidisciplinary teams working on the same product. We've adopted OpenTracing and we've settled on LightStep as one of our means to visualize our traces. I would like to share with you a couple more examples of what we learned while monitoring, and the use cases behind those learnings. You pay more for the first couple of tiers. Monitoring seems to be a first-class citizen in 3.8, with native Prometheus support and, at the same time, a Grafana plugin that gives you all the dashboards you could need. Zalando's communication platform products are powered by a RabbitMQ cluster. This is the whole premise of building our platform in-house, that we can easily extend it towards any channel in the future. At that point in time, that might pose another challenge of how we report those metrics. The magnitude of services that we're running in production, with different production loads, is becoming unmanageable from a bird's eye viewpoint when trying to get an overview of the overall architecture and the different dependencies between microservices and systems.
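A hedged sketch of that throttling idea: before pulling the next batch from the upstream event stream, check the internal queue backlog through the management API and wait until it drains. The helper names (fetch_batch, publish_batch) and the backlog limit are assumptions for illustration.

```python
# Sketch: back-pressure the ingesting application while the internal
# RabbitMQ queue still holds a large backlog.
import time
import requests

MGMT_URL = "http://rabbitmq.example.internal:15672"  # assumed host
AUTH = ("monitoring", "secret")                      # assumed credentials
BACKLOG_LIMIT = 50_000                               # assumed comfortable backlog

def queue_depth(vhost="%2F", queue="render-requests"):
    q = requests.get(f"{MGMT_URL}/api/queues/{vhost}/{queue}", auth=AUTH, timeout=10).json()
    return q.get("messages", 0)

def ingest_loop(fetch_batch, publish_batch):
    while True:
        # Throttle the application that brings messages in while the
        # internal services still have a backlog to work through.
        while queue_depth() > BACKLOG_LIMIT:
            time.sleep(5)
        batch = fetch_batch()      # e.g. read a cursor batch from the event stream
        if batch:
            publish_batch(batch)   # hand the batch to the RabbitMQ-backed pipeline
```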
In the end, we have the product that builds the audience and, at the same time, builds the template, and also sends this out to our providers, and collects the asynchronous and synchronous feedback about what's going on. And then, when we wake up in the morning, our services start up. We're able to see that we're meeting what we committed to as a team or as a product. But if you're running a multi-cluster setup, your monitoring system is ideally aware of what your nodes are, so that it can automatically poll each and every node on its own. We have, within each service and our product overview Grafana dashboard, that overview of our RabbitMQ performance within the context of what we want to serve as a business. Then, we have an overall panel around the queues - the message rate per queue, the consumer utilization of every consumer, the number of messages per queue, the actual RAM usage per queue - what each of those actually is. Our event streaming platform is built on top of Kafka and is built, actually, for different teams to consume from, based on whichever cursor they're at. One year ago, we decided on adopting OpenTracing as an organization. If you look at a bigger timeframe, you would see this scaling up and down rapidly, very nicely. Since we don't have commercial communication running at night, we generally park all of this on our event streaming platform. The color matches the RabbitMQ color. That gives you either a higher throughput while publishing or a higher throughput while consuming. Within Zalando, in general, we have another event streaming platform, which is called Nakadi. It's not the perfect state, but it doesn't mean that your system is not functioning, because a consumer can slowly lag but then, all of a sudden, it can start picking up again. And then, someone actually shares a tracing overview, or a Grafana dashboard, or a check that goes flaky. Now, that's interesting branding, I know. That's enabled on our cluster.
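Making the monitoring side aware of every node could look something like the following: list the nodes from /api/nodes and flag any that is down or close to its memory high watermark, which is where RabbitMQ starts throttling publishers. The host and thresholds are placeholders.

```python
# Sketch: discover cluster nodes from the management API and flag nodes
# that are not running or are approaching the memory high watermark.
import requests

MGMT_URL = "http://rabbitmq.example.internal:15672"  # assumed host
AUTH = ("monitoring", "secret")                      # assumed credentials

def check_nodes(mem_warn_ratio=0.8):
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=10).json()
    for node in nodes:
        if not node.get("running", False):
            print(f"{node['name']}: NOT RUNNING")
            continue
        mem_used, mem_limit = node.get("mem_used", 0), node.get("mem_limit", 1)
        if mem_used / mem_limit > mem_warn_ratio:
            print(f"{node['name']}: memory at {mem_used / mem_limit:.0%} of the high watermark")

if __name__ == "__main__":
    check_nodes()
```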
Sometimes, we're just woken up at night for someone to just add a couple more instances, and the messages in the queue are just drained. At some point, our application introduced publish confirms, because it was an application producing messages that we didn't want to lose at all. You can look into it for details. A very interesting learning forum for me, personally, within our organization, is our 24/7 chat, because this is generally where someone gets pulled in.
We're not only supporting emails and push notifications, but those are the most prominent channels that we use. That's generally sufficient, in our case, to have the visibility that we need over our cluster at that point in time. This is basically what I'm going to be talking about, how we monitor our system and its production state nowadays. We started off 11 years ago. This is something that a customer expects. Within the combination of services that we have, that's the primary scaling strategy. We generally report the number of consumers, number of connections, number of channels, publish rate, and consumption rate. This is not the only dashboard that we use to monitor. We don't want our communication to be only transactional. Or, the more that bad things happen, the more you get to know what's going on. Not in the RabbitMQ Git repository. We ended up actually with local caching as well on the render services themselves. We instrumented our services a year ago. Probably, you've interacted with our product. Those 10 million messages are 10 million price reductions that happened overnight. If you want to use what I've just presented, right away, you can just enable this plugin in your test setup and start seeing metrics over there. By default, we have tier definitions for applications and what an application's criticality is within the whole Zalando realm. Most of the screenshots that I'm showing now are actually just from our RabbitMQ Grafana dashboard. Microservice architecture, like every service is its own bounded context. Now, we're doing this and here we are today. You could combine two metrics, merge them together. That was our next learning. That would be something that your check system needs to be aware of while polling this. While the application starts opening a channel, publishing into it, and waiting for acknowledgement, it started creating way more channels than we expected. Before starting, I would like to share the motivation of giving this talk. Our checks are generally running at a 30- to 60-second interval. Why was this person woken up? We capped the number of channels, actually, to 200, which gave us the right amount of throughput that we need, with the right configuration of threads running in this application and the right batch consumption size from our external event processing platform. Just running a check every second might not be that beneficial because the management endpoint, at the moment, is highly coupled with the runtime of RabbitMQ itself. What the plugin also lacks is the ability to combine multiple metrics together. If you actually look in between, that was almost around midnight, where we're generally not sending anything to our customers from 12:00 to 1:00 am because this is generally when most people are sleeping. This is actually how it looks in practice. Basically, we use it in our own product. They're there. It's better to have your topology in a centralized place so that you're able to just deploy another broker right away. How we did this, within our team, is that we built a small client library for tracing everything going inside RabbitMQ with an annotation, so that it's not that intrusive within our code base. That's the audience building part. And we know how many entities are actually running for this application. That's Zalando, basically, at a glance.
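One way to picture the "reuse channels and cap their number" learning is a small pool of long-lived channels with publisher confirms enabled, instead of opening a fresh channel per publish. This is only a sketch: the pool size of 8 is arbitrary (the talk mentions a cap of 200 in their setup), and pika's BlockingConnection is not thread-safe, so this assumes a single-threaded publisher.

```python
# Sketch: a capped pool of reusable channels with publisher confirms.
import queue
import pika

class ConfirmedChannelPool:
    def __init__(self, params, size=8):
        self._conn = pika.BlockingConnection(params)
        self._channels = queue.Queue()
        for _ in range(size):           # hard cap on channels per connection
            ch = self._conn.channel()
            ch.confirm_delivery()       # broker acknowledges every publish
            self._channels.put(ch)

    def publish(self, exchange, routing_key, body):
        ch = self._channels.get()       # blocks when all channels are in use
        try:
            # With confirms on, basic_publish waits for the broker's confirm
            # and raises on nack, so messages we don't want to lose surface errors.
            ch.basic_publish(exchange=exchange, routing_key=routing_key, body=body)
        finally:
            self._channels.put(ch)      # always return the channel for reuse

pool = ConfirmedChannelPool(pika.ConnectionParameters("rabbitmq.example.internal"))
pool.publish("communication", "order.confirmation", b'{"order_id": "123"}')
```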
Our template service generally publishes a template change event, inside our RabbitMQ, for coordination between the template service and the render service. It's actually an open source plugin developed by an individual contributor. Has anyone been just woken up at night, to just add a couple more instances? The other thing that we did. Once an application passes the health check, becomes hooked up to the load balancer, and actually starts getting traffic, we know that it exists and we know that it has an exclusive queue bound right next to it. But that was another interesting usage of the management endpoint, to back-pressure processing and park messages outside of our brokers, so that we make sure our broker is performing in line with our expected throughput and SLOs.
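The "instances versus exclusive queues" check could be sketched like this: count the per-instance queues on the broker and compare with what the deployment platform reports. The queue-name prefix and the get_running_instance_count helper are purely hypothetical; the actual check in the talk lives in their own monitoring setup.

```python
# Sketch: compare the number of running render-service instances against
# the number of their per-instance queues visible on the broker.
import requests

MGMT_URL = "http://rabbitmq.example.internal:15672"   # assumed host
AUTH = ("monitoring", "secret")                       # assumed credentials
INSTANCE_QUEUE_PREFIX = "render-service.cache-bus."   # assumed naming convention

def count_instance_queues():
    queues = requests.get(f"{MGMT_URL}/api/queues", auth=AUTH, timeout=10).json()
    return sum(1 for q in queues if q["name"].startswith(INSTANCE_QUEUE_PREFIX))

def check_instances(get_running_instance_count):
    expected = get_running_instance_count()  # e.g. healthy targets behind the load balancer
    actual = count_instance_queues()
    if actual != expected:
        print(f"mismatch: {expected} instances running but {actual} instance queues bound")
```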
Since we don't send communication overnight, we start in the morning, but we want to go through the whole backlog before the peak of the normal transactional flow of the shop and people ordering. That's actually Zalando's open-source monitoring and alerting tooling, which kickstarted as a Hack Week project maybe five years ago, because normal monitoring tooling lacked the possibility to define any entity that you're interested in. There is actually a RabbitMQ plugin. As I mentioned earlier, we rely on RabbitMQ as an inter-product broker. You actually still pay more for the first couple of tiers, if you've moved to another tier. That was very evident in our graph. In our case, it made sense because we had a default that we could fall back to and it made sense in our normal processing pipeline. Most of our traffic actually comes from mobile devices, whether that's mobile web or our native apps. If you set a timeout, then you actually reach a cached number of channels and, at the same time, a cap on those channels. How we're collecting metrics. We were running at four instances at that point in time. This is where most of the products communicate, when the messages between systems are relevant for other products as well. We were able to drill down to the root cause of this - we were not reusing channels at all. An intro to Zalando.
And then, we process everything that's within our domain, within our RabbitMQ cluster. We actually went a little bit higher because the application in question is actually consuming from our event-streaming platform, where we might have a backlog of 10 million messages. And then, by default, that starts shipping the traces. If you want to try out what I just mentioned today, you can just enable this plugin and just export everything. Afterwards, that moves further. And then, based on this, we decide whether we want to throttle or not, because that also mixes your production load with telemetry information. We might handle this in our own controller that operates our cluster. Alongside the message rates themselves. Zalando has 200 teams. At the same time, we have a higher message rate per queue. I think around 22 of those actually have a direct connection to RabbitMQ, but the whole platform interconnects and works together. We're still debating this. Another thing: this is something that you shouldn't be doing very aggressively. We rely on a couple of exclusive queues as well. That's actually a feature available in CloudWatch itself. 80% is the value we've observed, within our domain, where scaling made sense. And then, you have queues. This slide was actually just updated the day before yesterday, because I had October numbers but we've shared our November numbers. So, 100% is the optimum that you should try to optimize for. Managing cost efficiency and performance at Zalando has been a seamless effort thanks to our setup. Thanks. We have in-house-built tooling around template design, audience building, campaign setup, and sending transactional and commercial communication to our customers. We started caching the actual template state and, within the rendering engine, which template is being rendered at the moment. That's just from our RabbitMQ Grafana dashboard, which is generally a very good starting point for me to check where things are going wrong, if something is going wrong, because it gives you very high visibility on the overall broker. To give you a little bit more context around how we do this and why we use exclusive queues, the most computationally intensive process in our pipeline is actually rendering the message itself. In order to have visibility on a specific use case that you're interested in, we figured out that relying on tracing and seeing all of this in our Grafana dashboard made more sense. In most cases, it's legally binding as well, and you don't want to miss out on sending an order confirmation. We poll the Management API endpoint to get all of the information that we store. And then, they start processing. Within the render service, the first instance that picks up the message fans it out, through Spring Cloud Bus on RabbitMQ, to the other render service instances to invalidate their local caches.
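As a conceptual stand-in for that Spring Cloud Bus fan-out (sketched here with pika rather than Spring), each render-service instance binds its own server-named exclusive queue to a fanout exchange, so a single template-change event reaches every instance and each one invalidates its local cache. The exchange name and callback are illustrative.

```python
# Sketch: per-instance exclusive queue bound to a fanout exchange for
# cache invalidation on template changes.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.internal"))
ch = conn.channel()

ch.exchange_declare(exchange="template-changes", exchange_type="fanout")

# Server-named exclusive queue: it exists only for this instance's connection
# and disappears when the instance goes away.
result = ch.queue_declare(queue="", exclusive=True)
ch.queue_bind(exchange="template-changes", queue=result.method.queue)

local_template_cache = {}

def on_template_change(channel, method, properties, body):
    # Drop the locally cached template so the next render re-fetches it.
    local_template_cache.pop(body.decode(), None)

ch.basic_consume(queue=result.method.queue, on_message_callback=on_template_change, auto_ack=True)
ch.start_consuming()
```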