Platform Engineering, Cloud Costs, and the "Copy-Paste Problem": An Interview with Alex Kropp

Alexander Kropp, a platform engineer at DataFlow Academy, discusses cloud costs, the "copy-paste problem" in tech stacks, and why platform teams are the most important tech multipliers in a company.

Alex, you describe yourself as a platform engineer. What do you actually do all day?

Alexander Kropp: (laughs) I do platform things. That means I build and orchestrate complex systems in the cloud, primarily in Azure. But nobody has the cloud just to have the cloud; it’s there so that things can run on it.

So I provide complete platforms for product teams and developers. At the center, there’s often a Kubernetes cluster, along with databases, storage, and also Kafka.

My main task is to abstract complex problems, preferably through Infrastructure as Code (IaC). I build self-service solutions so that developers can order new resources as easily as possible, whether it’s a database, compute resources, or a Kafka topic. This keeps the entry barrier as low as possible.

What is the concrete value you bring to customers? What does an environment look like after you’ve been there for 6 months?

Alexander Kropp: Of course, that depends on where the customer starts. If we start from scratch, I bring the experience to build a robust, highly available cloud platform, including monitoring and logging, that doesn’t need five people just for maintenance.

In the end, you have a simple, extensible codebase that abstracts complex problems.

But perhaps the most important point is knowledge transfer. You don’t learn cloud and infrastructure topics in university. When companies migrate, this knowledge is often missing internally. My goal is always to bring the internal employees along. I do a lot of pair programming so that the team can work without me after a certain period.

About Alexander Kropp: Alexander Kropp has been a passionate computer scientist since childhood and has been programming since he was 10 years old. As a researcher and consultant, Alexander has been supporting well-known companies in digitalization and prototype development for over a decade. In parallel, he works as a lecturer and trainer in the cloud environment.

There is a lot of talk about infrastructure costs. Why do companies often spend too much money on the cloud?

Alexander Kropp: With customers, I often see cloud costs that are three times higher than they need to be. The problem is often that individual employees have no direct budget responsibility and give little thought to the costs.

This adds up extremely quickly. Often, a setup is used as a blueprint for future services. If it was incorrectly configured or oversized, the problem propagates to every copy. Before you know it, a simple NGINX that gets three requests per hour is running on a VM that costs 300 euros a month.
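To put the NGINX example into numbers, here is a back-of-the-envelope calculation. The 300 euros and three requests per hour come from the interview; the 30-day month and the function name are assumptions made for illustration.

```python
def cost_per_request(monthly_cost_eur, requests_per_hour, days_per_month=30):
    """Return the cost in euros of serving a single request.

    days_per_month=30 is an assumption for illustration.
    """
    requests_per_month = requests_per_hour * 24 * days_per_month
    return monthly_cost_eur / requests_per_month


# Figures from the interview: a 300 EUR/month VM serving 3 requests per hour.
print(f"{cost_per_request(300.0, 3):.3f} EUR per request")
```

Under these assumptions the VM serves 2,160 requests a month, so each request costs roughly 14 cents.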

You are not only a platform engineer but also the co-author of Kafka in Action. How do you see Kafka being used by your customers?

Alexander Kropp: Kafka is used by many product teams to exchange data. I’ve seen it all: data being pushed for analytics, or even synchronous processes being mapped via Kafka.

Kafka often becomes the go-to solution for every problem. When it’s simple for teams to get a new topic or user, they’ll often choose Kafka out of convenience, even if it’s not the ideal tool for the job. If Kafka is already in place, it’s seen as the path of least resistance compared to introducing something new.

What are the biggest challenges you see for development teams when dealing with complex technology?

Alexander Kropp: The biggest challenge is that the tech stack in the cloud is damn big. It has become rare that people only need to be really good at one technology. It’s impossible to ensure that everyone knows everything.

This creates what I call the "copy-paste problem." Everyone, intentionally or not, creates blueprints that get copied. But if there’s no deep understanding of the technology, things get copied that might have been suitable for the original use case, but not for the new one.

You can get by with copy-pasting for a while if the load is low and costs don’t matter. But eventually, you’ll fall flat on your face. And then ten people will spend three months on a performance bottleneck.

Can you give a typical example of such a mistake that arises from half-knowledge?

Alexander Kropp: Oh yes, you see that often, especially with Kafka. My "favorite example": I’ve seen offsets being committed manually. They would fetch and process 1,000 messages as a batch, but the committed offset was only advanced by 1 in the code instead of by 1,000. What happens? The next fetch of "new" messages loads 1,000 messages, of which 999 have already been read. This creates an extreme load.
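The effect of that offset bug can be illustrated without a real Kafka cluster. The following is a minimal pure-Python simulation, not Kafka client code: the log size of 10,000 messages and the function name `consume` are assumptions chosen for illustration, while the batch size of 1,000 is the one from the example.

```python
def consume(log_size, batch_size, advance):
    """Simulate a consumer that always fetches from its committed offset.

    `advance` receives the batch length and returns how far the committed
    offset moves after the batch. Returns (fetch_requests, messages_transferred).
    """
    committed = 0      # next offset the consumer will fetch from
    fetches = 0
    transferred = 0
    while committed < log_size:
        batch_len = min(batch_size, log_size - committed)
        fetches += 1
        transferred += batch_len
        committed += advance(batch_len)
    return fetches, transferred


# Buggy variant: the whole batch is processed, but the offset moves by 1.
buggy = consume(10_000, 1_000, lambda batch_len: 1)

# Correct variant: the offset moves past every message in the batch.
correct = consume(10_000, 1_000, lambda batch_len: batch_len)

print("buggy  :", buggy)
print("correct:", correct)
```

In this simulation the correct variant needs 10 fetches and transfers 10,000 messages, while the buggy variant needs 10,000 fetches and transfers about 9.5 million messages for the same log.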

Another classic: developers tinker with the configuration to get more performance. I have repeatedly seen the consumer parameter fetch.max.wait.ms set to 0. With that setting, the broker no longer waits for new data before answering a fetch request, so even empty responses come back immediately. In effect, they start to DDoS their own Kafka cluster because the consumers send requests to the brokers in a tight loop.
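A rough model shows why this hurts. The sketch below estimates how many fetch requests idle consumers send when there is no new data to read. It assumes each consumer has one fetch request in flight at a time, a network round trip of 2 ms, and 50 consumers; all three are illustrative assumptions, not measurements. The value 500 is the documented default of fetch.max.wait.ms.

```python
def idle_fetch_requests_per_second(fetch_max_wait_ms, round_trip_ms=2.0):
    """Approximate fetch requests per second from one idle consumer.

    With no new data, the broker holds each fetch for fetch_max_wait_ms
    before returning an empty response. round_trip_ms is an assumed value.
    """
    return 1000.0 / (fetch_max_wait_ms + round_trip_ms)


consumers = 50  # hypothetical number of idle consumers
for wait_ms in (500, 0):
    total = idle_fetch_requests_per_second(wait_ms) * consumers
    print(f"fetch.max.wait.ms={wait_ms}: ~{total:.0f} requests/s")
```

Under these assumptions, the same 50 idle consumers go from about 100 requests per second to about 25,000 requests per second, without delivering a single additional message.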

Is this a Kafka problem, or does the cause lie elsewhere?

Alexander Kropp: No, this is not a Kafka problem; you see the same thing with databases or Kubernetes. I personally find the official Kafka documentation, or our book (laughs), very good for getting a basic understanding.

The problem is organizational. The daily work of development teams is often driven by the business. The product owners are breathing down their necks for the next feature, so there is rarely time to build a deep understanding of the technology. This is a problem with how companies work.

That sounds a lot like the DevOps philosophy. But you emphasize the role of the platform team as a "Center of Excellence." So is "You build it, you run it" overrated?

Alexander Kropp: I wouldn’t say "less DevOps." But leaving development teams on their own is the wrong approach. You need a central point of contact.

Open communication is extremely important within departments. The platform team must take responsibility for its services and be proud of them. The attitude must be: "I am the expert. I’m happy to help you," and not "Oh, that stupid developer wants something from me again."

This is exactly where I see the platform team as a multiplier for the entire company. You can’t just say, "Every team is responsible for its own tech stack." Instead, you should have a team of experts responsible for operating Kafka, Kubernetes, or databases.

Very important: You can’t just stop at provisioning with a "take it or leave it" attitude. These teams must be seen as multipliers and a point of contact. One additional person on a platform team may well save the work of three or four developers in the product teams, because it frees up their time for their actual domain work.

Focus: Platform Operations

80% of the work is enablement

The purely technical operation (with IaC) often only accounts for 20%. The rest is governance, orchestration, and knowledge transfer.

How much effort is it actually to operate Kafka?

Alexander Kropp: Once you have a proper setup in place with Infrastructure as Code and scalable solutions, for example, with the Strimzi operator on Kubernetes, pure operations require less than one full-time person.

The point is: A large part of the work is not in hosting the Kafka cluster itself, but in everything that comes with it: the orchestration of topics and users, governance, and above all, the enablement of the teams. No Kafka as a Service provider will do this work for you. Once the system is in place, 80% of the work is enablement.
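As an illustration of what topic orchestration with guardrails can look like in a Strimzi-based setup, the sketch below builds a KafkaTopic manifest from a self-service request. The function name, the `team.topic` naming scheme, the cluster name `platform-kafka`, and the partition limit are hypothetical and not taken from the interview. The `apiVersion` shown is the one used by many Strimzi releases and may differ in yours.

```python
import json

MAX_PARTITIONS = 48  # hypothetical guardrail set by the platform team


def kafka_topic_manifest(team, topic, partitions=3, replicas=3,
                         retention_ms=604_800_000, cluster="platform-kafka"):
    """Build a Strimzi KafkaTopic manifest as a plain dict.

    The default retention of 604,800,000 ms equals seven days.
    Raises ValueError when the request violates the partition guardrail.
    """
    if not 1 <= partitions <= MAX_PARTITIONS:
        raise ValueError(f"partitions must be between 1 and {MAX_PARTITIONS}")
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",  # check your Strimzi version
        "kind": "KafkaTopic",
        "metadata": {
            "name": f"{team}.{topic}",
            # Label that tells the Strimzi Topic Operator which cluster to use.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": {"retention.ms": retention_ms},
        },
    }


manifest = kafka_topic_manifest("payments", "orders", partitions=6)
print(json.dumps(manifest, indent=2))
```

The printed JSON is also valid YAML, so it could be committed to a Git repository and applied by the usual deployment pipeline. The design point is that the platform team encodes its defaults and limits once, and product teams only supply the few values they actually care about.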

What is your personal favorite tool from the Kafka ecosystem that maybe not everyone knows?

Alexander Kropp: Karapace. It’s an open-source REST proxy and a schema registry under the Apache license. There aren’t many alternatives, and compared to some proprietary solutions that are heavily pushed by vendors, I consider Karapace to be the much better choice.

Your final appeal to all platform teams out there?

Alexander Kropp: Two things. First, always ask yourselves: If you had to use your own services, would you be satisfied with them? Are they convenient to use?

Second, rely on automation! Absolutely. Use Infrastructure as Code. It makes everything maintainable, it’s living documentation, it has only advantages. Someone clicking around in a GUI? Maybe for debugging, okay. But for operations? Never.

A GUI is great for looking, not for touching in production. If everything is in code, preferably in readable YAML with comments, you can always trace who built what and why.

Thanks for the interview, Alex! Where can we find you online?

Alexander Kropp: You can find me on LinkedIn.

About Anatoly Zelenin
Hi, I’m Anatoly! I love to spark that twinkle in people’s eyes. As an Apache Kafka expert and book author, I’ve been bringing IT to life for over a decade—with passion instead of boredom, with real experiences instead of endless slides.
