Why Cloud Data Platforms?
March 5, 2023
Always ask “why.”
I really couldn’t think of a better blog title. I usually do technical stuff and the title writes itself in most cases. I also write more technical blog posts because that’s my comfort zone; I prefer those to “thought” blogs. But a recent blog about an organisation moving their infrastructure away from the cloud and onto their own managed environment got me thinking about the “why” of the cloud. And more specifically in my work world, the constant asks of “Should we move our data warehouse into the cloud?”, “Do we need to be using cloud data platforms?” and “Should we put all our data in the cloud?”
I’m not going to answer those questions in this blog post because…well, it depends. And anyone who is vaguely familiar with what I do knows I work with Microsoft products, including their cloud data products. Have you ever heard of Synapse? It’s quite good…
Anyway, back to thoughts on cloud data platforms.
Vendors will tell you your current data infrastructure isn’t working for you, that you’re not able to fully realise the potential of your organisation’s data and that crucially your competitors are. That the cloud is where you need to be as it’ll free up your data to become truly valuable and unlock untold insights to drive your org to new heady heights of success (btw when I say free, I don’t mean everyone has access to it or you’re giving it away). You’re being sold a vision of the world that they shape, you’re being sold a solution to a problem you might not have in the first place.
So, we ask ourselves “why”
Why migrate to the cloud? Why use Synapse Analytics? Why use Databricks, DuckDB, Snowflake, any of them? Don’t ask the vendors though…you know what happens then 😉 Ask yourself what problem(s) are you trying to solve? What opportunities are you trying to unlock? What can’t you do now that you feel you need to do? And why can’t you do it with your current data platform?
Do you listen to people like me? Well, with a pinch of salt…In my line of work I usually come in when the decision has been made to use Microsoft products, I’m not the decision maker. I’m the do-er, the one that gets pointed at and asked/told “right, now you go and make that thing happen”. I don’t endlessly debate about different cloud vendors and data products, although I have done in the past. Oh and don’t get me started on the endless discussions about which modelling method is best used in data warehousing/analytics – I’ve been hearing that Dimensional modelling has died/been dying, will die, for many many years (my hairline can attest to that).
Having said that, I have on several occasions said “nope, this isn’t right for you…take another look around and something else will better fit your scenario, your organisation, your direction.”
Shiny Thing…Shiny Thing
This one is as old as the hills, this could have been written last month, last year, 10 years ago…20…30 you get the picture. Going back to vendors (and some consultants, let’s not forget about those consultants) telling you that you need a new vision, a new product, one with the shiniest of shiny features. I get it though, I get caught up in the next new shiny thing, because it’s fun to get involved and be hands-on, to learn something new. To be the one that solves the challenges, to unlock the opportunities is a thrill.
Some of those shiny things do indeed hit the mark, some not so much.
So does that mean we should jump in feet first? Well, how coherent a data strategy can you really create if you’re always looking at the new things? Are organisations starting to look at technology in a more agile way in terms of swapping out components/platforms more frequently, being riskier with their choices? Hey, no-one got fired for buying IBM right? Well maybe if it didn’t work out…
The world of data products has exploded in recent years and the wealth of choice can be dizzying, from warehousing to big data processing, to analytics and visualisation platforms. I’m sure it didn’t use to be like this…have we all been caught off guard? Stitching together a data infrastructure from all the best-in-class products and all the new shiny things across all the vendors can be a lot of work and introduce more complexity.
The Balance of the Team
What about the skillset of people who work with cloud data platforms? It can be a little dizzying sometimes with people on the social medias telling you about the next new thing, the skills you need, what you must do to stay ahead of the competition.
If you work in a data team what skills do you need? If you lead a data team how do you cross/upskill? Are the people in your team willing? Do you need to augment the team with new skills, swap out roles completely? How do you convince the higher-ups you need to do this? If you’re a “higher-up”, what convinces you? CIOs/CTOs get bombarded with “your data team/data infrastructure needs to do these 10 things now or you’ll get left behind!” Those poor higher-ups…
In terms of coding skills for cloud data platforms, I went out there to ask about SQL data engineering (personally I work with SQL) and was told “no, you’re not actually a data engineer unless you can write a recursive data processing pipeline in 2 lines of pyspark.” Really? Does every data engineering process really have to involve the overhead of Spark? SQL is quite good so I’ve been told, I wonder whether it will catch on. Don’t get me wrong, it’s not that I don’t like Spark, it’s just that I often see it being used when it’s not really needed.
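To make the “SQL is quite good” point a bit more concrete, here’s a minimal sketch of recursive processing done in plain SQL with a recursive common table expression. I’m using SQLite here only because it ships with Python and needs no cluster; the `category` table and its rows are made up for illustration, and recursive CTE syntax and support vary between database engines, so treat this as a sketch rather than something to paste into your platform of choice.

```python
import sqlite3

# Hypothetical example data: a small parent/child product hierarchy.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO category (id, parent_id, name) VALUES (?, ?, ?)",
    [
        (1, None, "All Products"),
        (2, 1, "Bikes"),
        (3, 2, "Road Bikes"),
        (4, 1, "Accessories"),
    ],
)

# The recursive CTE starts at the root row and repeatedly joins children
# onto rows already found, building a full path and a depth for each row.
RECURSIVE_SQL = """
WITH RECURSIVE tree (id, path, depth) AS (
    SELECT id, name, 0 FROM category WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, tree.path || ' > ' || c.name, tree.depth + 1
    FROM category AS c
    JOIN tree ON c.parent_id = tree.id
)
SELECT id, path, depth FROM tree ORDER BY id
"""

rows = conn.execute(RECURSIVE_SQL).fetchall()
for row in rows:
    print(row)
```

No Spark session, no cluster start-up time. Whether that is enough for your workload is, of course, another “it depends”.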
Balance can also encapsulate an organisation’s own internal shiny things, the products and processes they build and the constant desire to deliver deliver deliver. So some organisations’ view of certain data roles can be quite negative… “so what, they just sit there and make sure everything is running OK and they fix a few issues here or there? Well, what business impact are they REALLY making, what shiny things are they contributing?” That’s like saying “so what, a car mechanic just makes sure my car is working well and if there’s something wrong they fix it? What value is that…” Moving to the cloud doesn’t mean stuff just works and your data platform will run along nice and smooth without any intervention and maintenance.
Nonetheless, the desire for a business to swap out a role like this for another “developer” focused role is strong, and yes that can work in the cloud arena. But be warned, these products still need to be maintained and optimised and looked after (Synapse Dedicated SQL Pool, I’m looking at you…table distributions…Delta Lake, yes, I’m also looking at you…VACUUM, OPTIMIZE…).
I couldn’t very well write this without an example of what I’m dealing with on a day-to-day basis in terms of what people are looking at as the answer to all their data management and analytics “problems.” The Lakehouse…the scale and flexibility of the data lake, with the schema enforcement and transactional capability of the data warehouse.
Lakehouse or Relationallake?
Couldn’t we call it the Relationallake then? Doesn’t have as good a ring to it, if I’m honest, and I’m glad I don’t work in marketing. I’m not going to go into too much detail here; here’s a great article from Leo Lachev about the Good, the Bad, and the Ugly of Lakehouses that I very much enjoyed reading. This is not about being disparaging about new methods, ways of thinking, ways to implement solutions…it’s simply to step back and ask why. I love the Lakehouse pattern and the value it can bring to an organisation…in the right scenario.
But I will say that we are very much in the hype-cycle with the Lakehouse right now and there are going to be organisations “building a Lakehouse” architecture when they probably don’t really need to. Saying that “we are future proofing” isn’t always the answer, and who’s to say technology like Synapse, Databricks, Snowflake, Delta Lake, Iceberg or Hudi is any more relevant to a successful data platform than data technology that we’ve been using for the last X number of years? Again, not saying I don’t value the Lakehouse pattern, I wrote an MSc thesis on it for starters! I do value it, in the right way, for the right reasons, in the right scenario. I might publish some of that thesis at some point…
Second Star to the Right, And Straight on Till Morning
So what now? Do you need to jump on board with a cloud data platform if you’re not already there? Maybe…maybe not. Always investigate what benefit you’ll get, always do your due diligence, get that Excel spreadsheet out and do your cost and value comparisons. And be wary of all those ChatGPT-generated blog posts publishing very generic quotes such as “cloud platforms reduce your costs”, “cloud data platforms make your data secure”, and “cloud data platforms unlock valuable business insights.” With any of this, you need to find out why and then find out how.
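If the spreadsheet isn’t to hand, the same cost and value comparison can be roughed out in a few lines of Python. This is only a sketch: the `PlatformOption` class, the option names and every figure below are invented for illustration, and a real comparison would need your own estimates (and probably a lot more rows).

```python
from dataclasses import dataclass


@dataclass
class PlatformOption:
    name: str
    upfront_cost: float     # one-off spend, e.g. migration or hardware
    annual_run_cost: float  # licences, compute, storage, support, people
    annual_value: float     # benefit you estimate you'd realise per year

    def net_value(self, years: int) -> float:
        """Estimated value minus total cost over the given number of years."""
        total_cost = self.upfront_cost + self.annual_run_cost * years
        return self.annual_value * years - total_cost


# Made-up figures purely for illustration; replace with your own estimates.
options = [
    PlatformOption("Stay on current platform", 0, 120_000, 150_000),
    PlatformOption("Move to cloud platform", 200_000, 90_000, 190_000),
]

for years in (1, 3):
    for option in options:
        print(f"{years} year(s) | {option.name}: {option.net_value(years):,.0f}")
```

With these invented numbers the cloud option is well behind after one year and only slightly ahead after three, which is rather the point: the answer moves with the time horizon and with the assumptions you feed in.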
As Alan Partridge said “Evolution not revolution….I evolve but I don’t…er…revolve.”
Happy Sunday everyone, and always ask “why.”
By the way, if you need help with the why just reach out 🙂