Many products rely on data to function correctly and to show information that’s useful. Getting that data right is so incredibly vital. That seems like a really obvious thing to say, but you may be surprised to learn that many data-reliant product teams aren’t focused on clean, consistent, and organized data.
Why is this so important? Because if you don't have the right data, your system is not going to work correctly. Your product won't show the right information. It won't do what people are expecting it to do.
But it gets worse than that, because if the product shows incorrect data, users don’t think “gee, maybe this data point hasn’t been updated in the database.” Instead, they assume that the product itself is broken, and may even write it off as useless. Users don’t usually understand the distinction between the data and the functionality, nor should they have to!
So you can build the best product on Earth, but if it's using bad data, it's going to look like the product itself is broken and bad.
Lots of factors can lead to poor data quality in a product. Sometimes the data is coming from multiple places, and fed through different processes that produce slightly different outcomes. There might be a normalization process that isn’t happening. In fact, I just worked on a product that integrated data from several sources, from spreadsheets to file inputs to outside databases. One of the biggest challenges was understanding and agreeing on the source of truth for each data point!
Sometimes a product is exposing data that no one has looked at in a long time, if ever. And the assumption is that the data is fine, because it’s never been touched, but it could be outdated, or maybe it was never correct but no one checked it.
Sometimes the team responsible for the data is completely separate from the product team. And the data team may be so confident that everything is correct and up to date that they refuse to verify anything. And if there is a problem, it will be blamed on the product team, so the data team has little incentive to launch a cleanup initiative.
And sometimes it’s a little bit of everything.
Story time! Once upon a time I built a really cool store locator. My team replaced a basic search feature with a much more modern locator, including more robust search and filter options, plus a map displaying pins at each store location. Each store also had icons indicating the various services they offered.
Because we were modernizing the locator so dramatically, we were exposing several data points that had previously existed in the database but had never been used or exposed in any way. For example, the services had never been shown in any public place. And the map pins relied on latitude and longitude. This data had always existed in the database, but had never been used for anything.
This was an unreasonably fast project, with just a few weeks to build the entire locator from zero to final launched product. So there wasn't enough time to do many of the basics that I would normally do when building a new product. One of the things that I really regret not being able to look at more closely was the data cleanup.
Which was controlled by an entirely separate team.
While I had no authority over the data team, I did make it very clear that the new store locator would be using additional data points, and listed the ones we planned to use. I stressed the extreme importance of ensuring the data was accurate. I explained that the new store locator would not appear to function correctly if the data quality was poor.
The data team replied with confidence that their data was complete, correct, and up to date. Every store updated its data regularly, they said, so nothing could possibly be incorrect. Even when I reminded them that we’d be exposing data points that no one had ever used, so stores may have been lax in updating them, they remained adamant that all their data was completely buttoned up. They seemed almost insulted by my implication that their data might not be perfect.
At that point there was nothing I could do except raise the risk with my own leadership, and move on with the work that was within my control.
You can guess where this is going.
Fast forward to our first stakeholder demo. Everyone noticed something strange: several dozen stores were pinned in the middle of the Atlantic Ocean, where the company definitely did not have stores.
The cause was immediately obvious to me: that location is zero latitude and zero longitude. The lat/long data was missing for all of those stores, and the locator was defaulting the missing data to zero/zero. But the cause was less obvious to the stakeholders, who proclaimed that the locator didn’t work at all. They told anyone who would listen that this whole project was a giant failure.
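For the technically curious, here is a minimal sketch of one way to keep missing coordinates from turning into pins at zero/zero. It is an illustration only, not the code from this project: the function names and the `id`, `lat`, and `lng` field names are hypothetical. The idea is to treat a missing value as missing instead of defaulting it to zero, and to set those stores aside for review rather than plotting them.

```python
from typing import Optional, Tuple


def parse_coordinates(raw_lat, raw_lng) -> Optional[Tuple[float, float]]:
    """Return (lat, lng), or None when the pair is missing or implausible."""
    try:
        lat = float(raw_lat)
        lng = float(raw_lng)
    except (TypeError, ValueError):
        # None, empty strings, and non-numeric text land here.
        # Deliberately no fallback to 0.0.
        return None
    # Out-of-range values (and NaN, which fails every comparison) are rejected.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        return None
    # Assumption: no real store sits at exactly (0, 0), so treat it as a placeholder.
    if lat == 0.0 and lng == 0.0:
        return None
    return (lat, lng)


def split_mappable(stores):
    """Separate stores that can be pinned from stores whose data needs review."""
    mappable, needs_review = [], []
    for store in stores:
        coords = parse_coordinates(store.get("lat"), store.get("lng"))
        if coords is None:
            needs_review.append(store["id"])
        else:
            mappable.append((store["id"], coords))
    return mappable, needs_review
```

With something like this in place, the demo would have shown fewer pins plus a list of stores to chase, which is a much easier conversation than "why are there stores in the ocean?"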
The data team still refused to check the data, even though I now had obvious evidence of data quality problems. They would not believe that their data could possibly be incorrect or missing. My engineers had to spend days re-testing and reviewing code, to prove beyond a shadow of any doubt that the product’s logic was fine. Only then would the data team consent to a data review.
We lost precious days from an already overly tight schedule, and my team took a reputational hit as stakeholders complained about the locator.
Then we had problems with the services icons, because a lot of stores had the wrong services listed in the database. Here again there were obvious errors, such as a store in a completely landlocked country offering marine services, as well as less obvious errors that showed up over time. And once again the data team refused even to look at the data until my team proved the functionality was correct. We lost more time and reputation.
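The obvious kind of error, at least, can be caught by a simple cross-field check. The sketch below is hypothetical: the `country` and `services` field names, the `"marine"` service label, and the short list of landlocked country codes are all assumptions made for illustration. It only flags records for a human to look at; a landlocked country can have lakes, so a flag is a question, not a verdict.

```python
# Illustrative subset of landlocked countries (ISO 3166-1 alpha-2 codes):
# Switzerland, Austria, Bolivia. A real check would need a complete list.
LANDLOCKED_COUNTRIES = {"CH", "AT", "BO"}


def find_suspect_services(stores):
    """Return (store id, reason) pairs for records that look inconsistent."""
    suspects = []
    for store in stores:
        services = store.get("services") or []
        country = store.get("country")
        if "marine" in services and country in LANDLOCKED_COUNTRIES:
            suspects.append((store["id"], "marine services in a landlocked country"))
    return suspects
```

A check like this will never find the less obvious errors, such as a store that simply stopped offering a service. Those still need the stores themselves to confirm their data.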
It turned out that the product’s functionality and logic were perfectly correct all along; there was no problem at all with the locator itself. But the bad data had eroded everyone’s confidence in the product overall.
By the time we were able to prove that the locator worked AND get the data corrected, the launch deadline was very close. And because of the eroded confidence, the primary stakeholder did not feel the product was ready for launch.
The locator did eventually launch, to many compliments, and years later is still working well. But at the time, my team took an L because we were not able to meet the launch deadline, and so many key stakeholders, who didn’t understand the nuance of data vs. logic, thought we just plain did a bad job.
All of this could have been avoided with a data review up front. Had the data team asked stores to update ALL data points when I made the initial request, we would have launched on time and the product team would have been heroes for the exceptionally fast turnaround.
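Even a very small completeness report can start that conversation. As a rough sketch, assuming the store records can be exported as a list of dictionaries (the function and field names here are hypothetical), something like this counts how many records have no usable value for each field the new product plans to expose:

```python
def missing_counts(records, fields):
    """Count, per field, how many records have no usable value."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            is_blank_text = isinstance(value, str) and not value.strip()
            is_empty_list = isinstance(value, list) and not value
            if value is None or is_blank_text or is_empty_list:
                counts[field] += 1
    return counts
```

A report like this cannot say whether a filled-in value is correct, but "several dozen stores have no latitude" is hard evidence that is available on day one instead of at the first demo.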
Data cleanup is a lot of work and it is not sexy work. It's not the work that's going to make for a beautiful PowerPoint slide or a really cool demo. It's not the kind of work that is easy to brag about, but it is so incredibly vital. It will make the difference between a data product that is helpful and valuable, and a data product that everyone assumes is trash.
If this sounds like something that you've been struggling with, or if you think this could be a problem for you, please reach out! Let's talk about how to understand and plan your data cleanup effort alongside your product build out, and how to catch these problems before they become crises.