May 13, 2025
Seven questions to help optimize a data training set for artificial intelligence
Organizations are rushing to take advantage of AI applications in hopes of transforming their operations, products, and customer experiences. According to some estimates, of all global businesses are now using or exploring the use of at least one AI tool in areas ranging from accounting, inventory, and supply chain management to customer service, recruiting, and more.
Regardless of the use case, data is critical to the success of any AI application. Issues with data quality and relevance are frequently cited as the most common causes of failure in AI projects. Without good data, companies that implement AI applications risk wasting time and money and eroding organizational motivation and trust. Creating a strong foundation for a new AI tool means focusing first on whether you have enough of the right kind of data for training and development, which is often more important than the AI tool itself.
What do we mean by AI?
Today, "AI" is often used as a blanket term that covers many different technologies, from simple algorithms that offer results based on user interaction to complex large language models (LLMs) that can hold coherent conversations. In this article, we'll explore machine learning (ML), a broad branch of AI that uses algorithms to find patterns in data. ML algorithms make predictions about new data based on the patterns they have learned from existing data. ML is used for speech recognition, recommendation systems, fraud detection, image processing, medical imaging, and more.
The importance of data quality and fitness
Data quality is key to training ML models. Data errors and uncertainty can propagate from the point of measurement to the training dataset, leading to poor results. The first step in vetting data quality is to assess data completeness. For example, if patient information in a database is required to include quantities of medication prescribed, any record missing quantities is considered incomplete. Likewise, data should be validated, meaning it should conform to data or business rules, such as format, allowable data types, and numerical ranges. Incorrect data values — like the age of physical assets derived from survey records or values derived from medical testing — can degrade model performance, potentially resulting in consequential incorrect predictions.
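As a minimal sketch of the completeness and validation checks described above, the following Python snippet flags records that are missing a required field or that break a simple rule. The field names (`patient_id`, `medication`, `quantity`), the allowed quantity range, and the helper `check_record` are hypothetical and chosen only for illustration; real rules would come from your own data and business requirements.

```python
# Hypothetical rules for illustration only.
REQUIRED_FIELDS = ("patient_id", "medication", "quantity")
QUANTITY_RANGE = (1, 1000)  # assumed allowable range, inclusive


def check_record(record):
    """Return a list of problems found in one record (empty list means it passed)."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Validation: quantity must be an integer within the allowed range.
    quantity = record.get("quantity")
    if quantity not in (None, ""):
        if isinstance(quantity, bool) or not isinstance(quantity, int):
            problems.append("quantity is not an integer")
        elif not QUANTITY_RANGE[0] <= quantity <= QUANTITY_RANGE[1]:
            problems.append("quantity out of range")

    return problems


records = [
    {"patient_id": "P001", "medication": "drug_a", "quantity": 30},
    {"patient_id": "P002", "medication": "drug_b", "quantity": None},
    {"patient_id": "P003", "medication": "drug_a", "quantity": 5000},
]

# Map each failing record's ID to its list of problems.
report = {r["patient_id"]: check_record(r) for r in records if check_record(r)}
print(report)
```

Returning a list of problems per record, rather than a single pass/fail flag, makes it easier to see which rule is failing most often and where collection processes may need attention.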
In many cases, data should be timely: current, or at least all collected over the same period. Things change over time, and outdated data can skew AI model results, leading to poor decisions and business outcomes. Data consistency is also key, and all representations of a particular item across multiple data stores should ideally match. For example, if information about physical assets is stored both in inspection and maintenance records and in a separate system that documents repairs, the key details must match so that records can be joined on their overlapping fields.
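The consistency check described above can be sketched in a few lines of Python. This example compares two hypothetical data stores before joining them; the field names (`asset_id`, `install_year`, `material`) and the helper `find_mismatches` are assumptions made for illustration, and the sketch assumes each key appears at most once per store.

```python
def find_mismatches(inspections, repairs, key, shared_fields):
    """Compare records from two stores that share `key`.

    Returns (mismatches, unmatched_keys). Each mismatch is a tuple of
    (key value, field, value in inspections, value in repairs).
    unmatched_keys is the sorted list of keys present in only one store.
    Assumes each key value appears at most once per store.
    """
    by_key_a = {row[key]: row for row in inspections}
    by_key_b = {row[key]: row for row in repairs}

    mismatches = []
    for k in sorted(by_key_a.keys() & by_key_b.keys()):
        for field in shared_fields:
            a_val, b_val = by_key_a[k].get(field), by_key_b[k].get(field)
            if a_val != b_val:
                mismatches.append((k, field, a_val, b_val))

    unmatched = sorted(by_key_a.keys() ^ by_key_b.keys())
    return mismatches, unmatched


# Hypothetical records from two separate systems.
inspections = [
    {"asset_id": "A1", "install_year": 1998, "material": "steel"},
    {"asset_id": "A2", "install_year": 2005, "material": "pvc"},
    {"asset_id": "A3", "install_year": 2011, "material": "steel"},
]
repairs = [
    {"asset_id": "A1", "install_year": 1998, "material": "steel"},
    {"asset_id": "A2", "install_year": 2006, "material": "pvc"},
    {"asset_id": "A4", "install_year": 2015, "material": "copper"},
]

mismatches, unmatched = find_mismatches(
    inspections, repairs, key="asset_id", shared_fields=["install_year", "material"]
)
print(mismatches)
print(unmatched)
```

Here asset `A2` has conflicting install years, and `A3` and `A4` each appear in only one system, so a join on `asset_id` would silently drop or mislabel them unless the discrepancies are resolved first.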
Seven questions to ask about your AI datasets
- Do you have labeled data? It's essential to know whether you have data that has been labeled to provide machine learning models with context. Have subject matter experts reviewed and approved these labels? Have data scientists reviewed the labels? Do you have enough of the right labeled datapoints to train your ML model?
- Are there gaps in your data? Assess your data for completeness and consistency. Are there missing values or errors that need to be addressed? Can those missing values be added using other tools before moving on to AI models? Should those records be dropped, or can the missing values be imputed? If so, what imputation method should be used?
- What is the source of your data? Understand the origin of your data and whether it was collected for the current project or repurposed from another source. Repurposed data can be used for many projects, but businesses should know how and why it was repurposed, which helps in evaluating its relevance and suitability.
- What regulatory environments do you operate in? Consider the regulatory requirements that apply to your data and AI model. This includes ensuring explainability (how the data was used and why) and compliance with relevant laws. Data privacy laws may mean more stringent requirements for security, redaction of personal information, etc.
- Do you have domain expertise? Bringing together data science and relevant domain expertise is crucial for understanding the nuances and context of the data. For example, a medical device manufacturer may not have expertise with epidemiological health data, or a manufacturer may need assistance with data related to human factors.
- How will you ensure data privacy and security? Address data privacy and security concerns, especially if you are using sensitive or personal data. This includes complying with relevant data protection regulations, implementing encryption protocols for data at rest and in transit, and conducting regular security audits.
- What are the potential biases in your data? Identify and mitigate any potential biases in your data. For example, you can analyze the demographic distribution of your dataset to determine whether it represents the target population accurately and implement techniques to correct for any overrepresented or underrepresented groups as necessary.
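Two of the questions above, on data gaps and on bias, lend themselves to small exploratory checks. The sketch below shows one possible approach to each: median imputation for missing numeric values, and a comparison of each group's share of the dataset against a target population share. Median imputation is only one of many imputation methods and is not always appropriate; the group labels, the target shares, and the helper names `impute_median` and `representation_gaps` are hypothetical.

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def representation_gaps(groups, target_shares):
    """Return each group's dataset share minus its target population share."""
    if not groups:
        raise ValueError("no group labels provided")
    counts = Counter(groups)
    total = len(groups)
    return {g: counts.get(g, 0) / total - share for g, share in target_shares.items()}


ages = [34, None, 51, 47, None, 29]
print(impute_median(ages))

# Hypothetical group labels and target population shares.
sample_groups = ["a"] * 7 + ["b"] * 2 + ["c"] * 1
target = {"a": 0.5, "b": 0.3, "c": 0.2}
gaps = representation_gaps(sample_groups, target)
print({g: round(d, 2) for g, d in gaps.items()})
```

A positive gap means a group is overrepresented relative to the target population, and a negative gap means it is underrepresented. Whether to correct for that by collecting more data, reweighting, or resampling depends on the project and should involve domain experts.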
Improving your data quality and fitness
Every project is unique, and data quality is ultimately defined by how the data will be used and by the business goals. In many cases, the best way to improve data quality is to improve data collection processes moving forward. New data can be collected to supplement the existing dataset if appropriate, using defined and accurate sources to help ensure results remain valid and usable. If existing data does not meet all standards for the project, process knowledge and domain expertise can be used to ensure ML can still take advantage of it.
There are also many technical ways to improve and prepare datasets for use in ML training. Data processing serves to clean, harmonize, or otherwise transform data to engineer features appropriate for ML algorithms. For example, images used for training may first need to be centered, cropped, filtered, or converted to monochrome. Information can also be transformed to reduce unnecessary data that could cause issues during machine learning. Processing data in this manner is critically important and should be well planned and executed.
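To make the image example concrete, here is a minimal sketch of two of the steps mentioned above: cropping to the center of an image and converting it to monochrome. It represents an image as plain Python lists so that it runs without extra dependencies; in practice an imaging library would typically handle this. The helper names and the tiny test image are hypothetical, and the grayscale conversion assumes the ITU-R BT.601 luma weights.

```python
def to_grayscale(image):
    """Convert rows of (R, G, B) pixels to rows of single luminance values.

    Uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def center_crop(image, height, width):
    """Return the central height x width region of a 2-D image (list of rows)."""
    rows, cols = len(image), len(image[0])
    if height > rows or width > cols:
        raise ValueError("crop size exceeds image size")
    top = (rows - height) // 2
    left = (cols - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]


# A hypothetical 4x4 RGB image: white border around a 2x2 red center.
WHITE, RED = (255, 255, 255), (255, 0, 0)
image = [
    [WHITE, WHITE, WHITE, WHITE],
    [WHITE, RED, RED, WHITE],
    [WHITE, RED, RED, WHITE],
    [WHITE, WHITE, WHITE, WHITE],
]

# Crop first, then convert, so only the retained pixels are processed.
processed = to_grayscale(center_crop(image, 2, 2))
print(processed)
```

Applying the same fixed sequence of steps to every training image, and later to every image the model sees in use, helps keep the model's inputs consistent.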
When preparing to deploy AI, it's wise to take a measured approach. Begin with the fundamental business question AI is meant to answer and ensure you have the data to answer it before looking for an AI tool. In some cases, AI might not be the best tool for the job, and there may be other methods that will work just as well. For example, engineering models, rule-based systems, improved data management processes, and even human expertise could deliver the answers you need without the time and energy required to train a large-scale AI model. Whichever approach you choose, it's crucial to ensure that the data you use is accurate, readable, and abundant.
What Can We Help You Solve?
Our AI/ML consultants and data scientists help businesses evaluate AI/ML tools and prepare their data for cutting-edge algorithms. With expertise in a wide variety of scientific and engineering disciplines, we can bring specific knowledge to assessing training data for your AI project.
