How not to use Spring Reactive

A tale of mistakes we made

Dear esteemed reader, this is not just another expertly crafted technical article about the wonders and intricacies of reactive systems and their various tools and frameworks. Quite the opposite. Lacking such expertise, one can only offer you a mere example of how not to use a reactive framework. Well, now you wonder, why would I read such a thing? In reality, if you already know about reactive systems, then this article is probably not for you, and you can just click the cross in your right or left top-most corner. However, if you have no prior knowledge on the matter or are just starting out, this may be useful to avoid our mistakes. Also, ... I'll leave at the end some references to expertly crafted technical articles about the wonders and intricacies of reactive systems and their various tools and frameworks.

Our problem

At Volkswagen Digital Solutions, I’m part of the MAN Digital Hub, more specifically of the MAN Intelligence team, where we are responsible for providing users the possibility of creating MAN trucks by combining all available MAN parts. As you can imagine, we handle large amounts of data, and for that and a few other reasons we needed our infrastructure to scale accordingly. The process of enabling an existing application to scale up/down comes with its own set of challenges.

So, we were coming off a major migration from an on-premises infrastructure to an AWS infrastructure, and it was time to start testing our application. We took our time and tried to be thorough. All seemed to be working; we were thrilled. Until, … (there’s always something).
We started to observe timeout issues with long-running computations.

Short context: we communicate synchronously with a system outside of our AWS accounts, which provides us with the results of complex computations. Yeah, I know, I know, … this was bound to cause trouble at some point.
Remember, “Anything that can go wrong will go wrong” (Murphy’s first law)

We dug deep to understand what could be causing this. After some research and exchanges of messages with AWS support, we figured out the problem was CloudFront’s per-origin request timeout, which was already maxed out and could not be increased any further.

Well, … now what?

We needed an asynchronous mechanism, but one also capable of reacting to responses from the external system, so we could return the result to the user once the computation was done.

What we wanted our solution to be

Oh wait, did you say you need a mechanism “capable of reacting to responses”? I know just the solution.
And so, our issues began ….
We were so excited about the opportunity to work with a reactive framework that we forgot to really understand the nature of our issue.

We needed our request to become asynchronous, which meant:

  • The frontend would need to display the results only once the external system replies. So, blocking while waiting for a response was no longer an option. We studied different possibilities (e.g., SSE, WebSockets, polling, …) and, given our time constraints, opted for polling, as we considered it to be the fastest solution (a rough sketch of the polling loop follows this list).
  • The backend would need to send the same request and make the results available for the frontend to consume, at some point. Given that we chose to go with polling and our backend is only used by a frontend, we opted to respond straight away with an identifier to the frontend on the first call, so that we could then poll the result on a second call.
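
To make that flow a bit more concrete, here is a minimal sketch of the polling side in Kotlin. Everything in it (the endpoint names, the PollResponse shape, the interval) is a made-up illustration of the idea, not our actual code:

```kotlin
import kotlinx.coroutines.delay

// Hypothetical response shape for the poll endpoint: either the result is ready or not yet.
data class PollResponse(val ready: Boolean, val body: String?)

// The two calls the frontend relies on; names and shapes are made up for illustration.
interface ComputationApi {
    suspend fun submit(request: String): String   // first call: returns an id straight away
    suspend fun poll(id: String): PollResponse    // second call: returns the result once available
}

// Keep polling with the id obtained from the first call until the result is ready.
suspend fun fetchResult(api: ComputationApi, request: String, intervalMs: Long = 2_000): String {
    val id = api.submit(request)
    while (true) {
        val response = api.poll(id)
        if (response.ready && response.body != null) return response.body
        delay(intervalMs) // wait before asking again, so we don't hammer the backend
    }
}
```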

But we failed to comprehend (or maybe even read) the 4 main principles that make a system reactive (https://www.reactivemanifesto.org/):

  • Responsive – a system must have rapid and consistent response times, in order to provide quality of service
  • Resilient – a system must stay responsive in case of failure, through replication, containment, isolation and delegation
  • Elastic – a system must stay responsive under varying workloads, scaling and adjusting via sharding, replication and input distribution
  • Message Driven – a system must use asynchronous message passing, as a means to guarantee loosely coupled and isolated components

From this, we can already see there’s a big overhead with making a system truly reactive.
Well, we must start somewhere, right?

So, how about narrowing things down (albeit losing precision while doing so) and simply comparing async programming with reactive programming?
Keep in mind, reactive programming is a possible programming paradigm to use when building a reactive system.

Asynchronous programming:

… is a technique that enables your program to start a potentially long-running task and still be able to be responsive to other events while that task runs, rather than having to wait until that task has finished

(source: https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/Introducing)
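
As a rough illustration of that definition, in Kotlin (the language we use), assuming kotlinx.coroutines is available: the long-running task is started without blocking, and the caller only waits when it actually needs the result.

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    // Start a potentially long-running task without blocking the caller.
    val computation = async(Dispatchers.Default) {
        delay(3_000) // stand-in for the expensive work
        42
    }

    // The program stays responsive to other events while the task runs.
    println("Still doing other things...")

    // Suspend (not block) only when the result is actually needed.
    println("Result: ${computation.await()}")
}
```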

Reactive programming:

… is an event-based model in which data is pushed to a consumer as it becomes available, turning it into an asynchronous sequence of events. … Reactive programming is a useful implementation technique for managing internal logic and dataflow transformation locally within components (intracomponent), through asynchronous and nonblocking execution.

(source: https://www.ibm.com/downloads/cas/YEEQBXND)
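
And, as a rough counterpart, a tiny Project Reactor example in Kotlin: the consumer does not ask for each value; values are pushed to it as they become available, as an asynchronous sequence of events. (The interval source and the printouts are just placeholders.)

```kotlin
import reactor.core.publisher.Flux
import java.time.Duration

fun main() {
    // Values are pushed to the subscriber as they become available.
    Flux.interval(Duration.ofMillis(500))       // emits 0, 1, 2, ... over time
        .map { tick -> "event #$tick" }         // transform the dataflow
        .take(5)                                // bound the stream for the example
        .subscribe { event -> println("received $event") }

    // subscribe() does not block; keep the JVM alive long enough to see the events.
    Thread.sleep(3_000)
}
```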

Turns out asynchronous and reactive are not that interchangeable. Who would have guessed?

Hmm, … okay, … uhmm, … not sure this is exactly what we need, at least not right now. What we really want is to leave something running in the background, and then make the one single item we receive as a response available when it’s ready.

We’re not really dealing with dataflows, unless we consider splitting the data into individual bytes and ignoring its business representation. But we would still need all the bytes to be received before making any sense of the response, so ….

The other way of looking at this would be to consider each item received from the external system (1 request -> 1 response item) as an independent event that is part of a dataflow, and that would be correct, since each user input (i.e., user request) is independent of other user inputs.
By doing so, we would be capable of continuously receiving many simultaneous user inputs and responding when the results are ready, as if our application were a pipeline receiving and responding to different inputs.

However, that’s not really the problem we wanted to solve. We were not having issues with the amount of user inputs. What we were having issues with was the time it took to process a single user input, which inevitably exceeded what the infrastructure supported.

We should’ve made this analysis to understand what we were really getting into, but…

So, we began working on the implementation.
We chose to use the Spring Reactive Stack, which includes Project Reactor as the foundation for the reactive system and reactive programming, and Spring WebFlux for the non-blocking I/O operations.

We had to make some significant changes:

  • Change from Spring Data Repositories to Spring Data Reactive Repositories (e.g., Redis and Mongo in our case)
  • Change from Spring Security to Spring Security Reactive
  • Add a WebClient to be able to call the external system (see the sketch after this list)
  • Change our exception handler to now support a RouterFunction
  • Change all our controllers from Spring MVC to Spring WebFlux.
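
To give an idea of what those changes look like in practice, here is a hedged sketch of a WebFlux-style controller calling the external system through a WebClient. The controller, path and external URL are hypothetical; the point is only that handlers now return a Mono/Flux and outbound calls are non-blocking.

```kotlin
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import org.springframework.web.reactive.function.client.WebClient
import reactor.core.publisher.Mono

@RestController
class ComputationController {

    // Hypothetical base URL for the external computation system.
    private val webClient = WebClient.create("https://external-system.example.com")

    @GetMapping("/computation")
    fun compute(): Mono<String> =
        webClient.get()
            .uri("/compute")
            .retrieve()
            .bodyToMono(String::class.java) // non-blocking: we return a publisher, not a value
}
```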

All in all, it was not as pleasant an experience as we thought.

And then, just when we thought we’d made it… it all came crumbling down, as it did not solve our problem. We were still getting a timeout. Once again, we were at a crossroads.

Our real solution

With no more time to waste, we took a few steps back and went with the simplest and quickest solution we could’ve thought of at the time.

If async is what we need, then let’s use the constructs of the language and framework that allow us to do just that.

So, we opted for Kotlin Coroutines, as we already had a few in other parts of our system, and we knew they would suffice to solve our issue.

It was very simple, all we had to do was:

  • Generate an id
  • Issue the request to the external system within a coroutine scope, using that id to track the request
  • Return the id as response to the frontend
  • Still within the scope of the coroutine, once the response from the external system comes through, we store the results, associating them with that same id
  • Finally, on the frontend we keep on polling another endpoint, using the id received, until we get the results to display them
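
Put together, the whole thing fits in a small service. The sketch below captures the idea described above; the class name, the in-memory map (a stand-in for a real datastore) and the externalSystem call are hypothetical stand-ins for our actual components:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

class ComputationService(private val externalSystem: suspend (String) -> String) {

    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    private val results = ConcurrentHashMap<String, String>() // stand-in for a real datastore

    // First call: generate an id, fire the long-running request in the background
    // and return the id straight away.
    fun submit(request: String): String {
        val id = UUID.randomUUID().toString()
        scope.launch {
            val response = externalSystem(request) // long-running external computation
            results[id] = response                 // store the result under the same id
        }
        return id
    }

    // Second call (polled by the frontend): return the result once it is available.
    fun poll(id: String): String? = results[id]
}
```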

The mistakes we made would’ve been avoidable if only we had chosen the right solution based on our use case, requirements and the actual problem, instead of trying to squeeze a technology into our stack just because it has some similarities (fewer than we thought, as seen above) and is shiny.
Also, this doesn’t mean we shouldn’t consider turning our system reactive at some point, but first we must understand the overhead of doing so and whether our use case justifies it.

P.S.: This is not really the end of it, as at a later point, the solution evolved into something more robust, but that’s a tale for another day.

Useful links (as promised):

Thank you for reading and feel free to provide feedback.

This is an opinion article and doesn't necessarily reflect the Volkswagen Group view.