It’s not AI that’s the problem, it’s the data.

Tyna Hope
3 min read · Sep 27, 2023

This is an opinion piece based on my own experiences; yours may be different.

I think there is a fundamental lack of understanding around AI. More and more non-technical people understand that models are the basic building blocks of AI, and that they are the codified results of a learning algorithm applied to data. What I don’t think people and organizations have a good understanding of are the data. While it is true that the technology involved in learning has become more complex, making it seem amazing and frightening at the same time, these tools would not even be considered if it were not for the massive quantities of data that are available. More and more organizations are incorporating models into everything, and this will only increase with the latest advances in LLMs.

These data that the models rely upon are dirty. Not only are they filled with errors, biased, and collected through the lens of a privileged class, they are then segmented, imputed, and transformed before the models are applied. To my knowledge, these changes are made in a largely unregulated environment. As a result, the models are only a reflection of the data used for training, validation, and testing, and only up to a point. Model designers tweak the learning parameters so that the models learn the data, but not too well, in the hope that the results will generalize to new, as-yet-unseen data. Only once the models are live, monitored for problems, and/or receiving feedback on their outputs do we truly understand how useful they are for the task they have been expected to perform.
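To make that concrete, here is a minimal sketch of the kind of pipeline I am describing, written in Python with scikit-learn on an entirely made-up dataset. The imputation strategy, regularization strength, and split sizes are illustrative assumptions, not anyone's actual practice.

```python
# Minimal sketch (assumptions: a hypothetical tabular dataset with missing
# values; scikit-learn available). Illustrates the impute -> transform ->
# split -> "learn, but not too well" flow described above.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical data: 1,000 rows, 5 features, with some values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.1] = np.nan          # simulate dirty/missing values
y = (rng.random(1000) > 0.5).astype(int)       # stand-in labels

# Segment the data: training, validation (for tuning), and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Impute and transform before the model is applied.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the median
    ("scale", StandardScaler()),                   # normalize feature ranges
    ("model", LogisticRegression(C=1.0)),          # C controls regularization:
])                                                 # smaller C = learn "less well"

pipeline.fit(X_train, y_train)
print("validation accuracy:", pipeline.score(X_val, y_val))
print("held-out test accuracy:", pipeline.score(X_test, y_test))
```

Even in this toy version, every choice (which imputation, which transform, how hard to fit) shapes what the model reflects back, and none of it is visible in the final predictions.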

Photo by Lukas Blazek on Unsplash

In business, we refer to our preferred source of data as “the single source of truth”. This loaded phrase admits that the data are variable, and that there are arguably many “truths” available. I don’t think that in and of itself is problematic, but rather that the results of using one data set over another are generally not fully explored. There is usually not time to do this analysis, and the hope is that the preferred data source contains enough directional information to help with decision making at least some of the time.
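As a toy illustration of what exploring that question could look like, the sketch below trains the same model on two hypothetical candidate “sources of truth” and measures how often their decisions disagree on the same new cases. The data and the disagreement metric are assumptions made for illustration, not a standard procedure.

```python
# Toy sketch: how much do decisions change if we swap the "source of truth"?
# (Hypothetical data; disagreement rate is one simple impact measure.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Two candidate data sources describing the "same" phenomenon, but collected
# or labelled slightly differently (simulated here by noisier labels in B).
X = rng.normal(size=(2000, 4))
signal = X[:, 0] + 0.5 * X[:, 1]
y_source_a = (signal + rng.normal(scale=0.5, size=2000) > 0).astype(int)
y_source_b = (signal + rng.normal(scale=1.5, size=2000) > 0).astype(int)

model_a = LogisticRegression().fit(X, y_source_a)
model_b = LogisticRegression().fit(X, y_source_b)

# Score the same new cases with both models and see how often they disagree.
X_new = rng.normal(size=(500, 4))
disagreement = (model_a.predict(X_new) != model_b.predict(X_new)).mean()
print(f"decisions that flip depending on the data source: {disagreement:.1%}")
```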

However, as we have become a society that relies more and more on data for all services, from banking, insurance, policing, and healthcare to even food bank activities, the effect of this incomplete source of truth becomes more insidious. While having no data to work from, with the only alternative being gut reaction and personal experience, has been shown to be problematic, an acknowledgement of the challenges facing a society dependent on this faulty resource has yet to happen.

This type of discussion has often been isolated within ethics groups inside academia, think tanks, or other organizations that neither run businesses nor deliver services. That is, away from the organizations that deliver the products and services affected by the data. Organizations typically look for incremental improvement to their existing decision making, but that improvement is measured in terms of business goals such as profit or customer retention.

While there are some notable exceptions on “data responsibility”, e.g. some police organizations, I think that until we make data source impact analyses a part of every organization, there is the potential for a dystopia to emerge. Perhaps not a sci-fi-style dystopia with AI taking over the world, but rather a non-functioning, piecewise-broken set of services. This data-induced dysfunction is no less scary; it's just not the AI dysfunction that I hear everyone talking about.


Tyna Hope

Electrical Engineer who worked as a data scientist then as a product manager, on LinkedIn. Opinions expressed are my own. See Defy Magazine for more: defymag.ca