Refining the Data Journey: The Synergy of Labeling and Quality Assurance

Case study

Daniel Kondermann
Managing Director
24 October 2023
11 min read
In today's data-dominated era, the need for meticulously annotated data keeps growing. Robust, accurately labeled datasets are the cornerstone of refining machine learning (ML) algorithms and sharpening organizational decision-making. Their critical role is unequivocal; however, creating large, diverse datasets with precise annotations remains a considerable challenge.

The data labeling industry is currently in a dynamic phase of growth and innovation, exploring new methodologies, best practices, and pioneering research. Engineers are actively seeking knowledgeable guidance to successfully navigate through this ever-changing landscape.

This article aims to explore the importance of data labeling in making AI projects successful, highlighting the need for a strong Quality Assurance (QA) process. We will share useful tips and advice, showing how to use tools like Kili Technology and Quality Match to improve and simplify these processes.
Why is data quality so crucial?
The precision of data labeling is paramount in machine learning and artificial intelligence, as it directly influences model performance and outcomes. Research consistently shows that even a 10% drop in labeling accuracy can cascade into a 2 to 5% degradation in model performance. This variance is not trivial: it can be likened to a self-driving car causing harm to five out of a hundred people.
Addressing this issue often involves the integration of additional data into datasets by machine learning teams. However, this not only escalates the size of datasets and the associated labeling costs but also doesn’t guarantee an enhancement in model performance, particularly if the new data is inaccurately labeled.
This highlights the strategic importance of prioritizing data quality from the outset. A focus on enhancing data quality alleviates the need for exhaustive labeling efforts, reducing infrastructure costs while fostering improved model performance. It leads to more concise and efficient datasets, optimizing resource allocation and performance.
Furthermore, prioritizing label quality carries profound implications beyond financial and performance considerations—it plays a vital role in minimizing biases within datasets. Such biases can significantly influence machine learning models, affecting their performance, validity, and ethical standing. Ensuring better label quality is more than a technical requirement—it’s a pivotal step towards building AI systems that embody both ethical integrity and unbiased accuracy.
Why is achieving data quality a challenge?
Creating effective AI models is no longer merely about excellent coding. Data is taking center stage, an evolution from the 2000s revolution in which code was the backbone of digital platforms.
This evolution underscores the critical role of data management. While crafting accurate code is still vital, it’s becoming increasingly clear that enhancing datasets is more instrumental for augmenting AI models' performance than merely improving algorithms.
As a result, the focus of machine learning teams is gradually pivoting towards nurturing datasets that optimize performance.
However, attaining good data quality is a considerable challenge. Current estimates indicate that about 3.4% of the data in public datasets bears incorrect labels.
For instance, Google's Quickdraw dataset has a 10% error rate, and ImageNet a 6% error rate. Additionally, LAION, the dataset behind models such as Stable Diffusion, is riddled with inaccuracies such as incorrect language detection and unsafe content.
But why does this happen? The reality is that even labeling tasks that appear simple are riddled with intricacies.
Consider an image in which a kite surfer needs to be identified. Questions arise: should the bounding box include the surfer, the kite, or both? It becomes crucial to establish comprehensive, accurate labeling guidelines from the beginning and to adhere to them steadfastly throughout the project. Even classification tasks that seem straightforward, such as binary 'yes'/'no' decisions, hide complexities. Is a chocolate Easter bunny to be classified as a 'bunny'? Does buttered toast still fall under the bread category, or is it a modified form of bread? And should the butter spread on the toast be labeled separately as butter?
Ensuring a balanced distribution of data within a dataset is fundamentally important, as disparities frequently exist. Take the automotive industry as an example; disparities may appear due to insufficient data in situations like vehicles using turn signals, encountering yellow traffic lights, or unusual vehicles such as sidecars and quads being present. Addressing rare yet significant scenarios, like zigzagging road lines or encountering clusters of multiple traffic lights, is also essential, particularly in training models for autonomous vehicles.
Although data quality is paramount to the performance of ML models, achieving good data quality is a multifaceted challenge that the industry is still striving to master.
Kili Technology's strengths as a top-tier training data platform
Selecting an appropriate tool for data labeling from start to finish is crucial. It’s important to opt for tools capable of facilitating the achievement of good data quality.
Kili Technology emerges as a multifaceted tool, tailored to simplify the data labeling process, enhance productivity, and optimize label quality, while seamlessly integrating into your existing ML ecosystem. It acknowledges the taxing nature of labeling tasks and therefore puts the labeler's experience at the center, providing a platform that is both efficient and user-friendly.
It encompasses features designed to boost productivity such as keyboard shortcuts and ML automated labeling functionalities (model-in-the-loop, GPT-automated text labeling, automated object detection utilizing the potent SAM model, etc.).
These features are meticulously incorporated to save time and effort, enabling labelers to maintain a heightened focus over extended periods.
Kili Technology also underscores the significance of robust quality safeguarding mechanisms. Recognizing the intricate and repetitive aspects of labeling, it considers the susceptibility of even the most proficient labelers to errors. To counteract this, Kili Technology equips users with various strategies to bolster quality, reinforcing resilience against potential inaccuracies.
Review
Use reviewer permission levels to go over the labels created by the labeling team, add comments, raise issues, validate, or send back assets that are erroneous. Keep a tight grip on quality through monitoring.
Quality metrics
Generate quality metrics such as consensus or honeypot to get a global view of the label agreement between labelers or a labeling score based on ground truth.
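To make these metrics concrete, here is a minimal sketch of how a consensus score and a honeypot score can be computed for a simple classification task. The function names and formulas are illustrative assumptions for this article, not Kili Technology's internal implementation.

```python
from itertools import combinations

def consensus_score(labels_per_annotator):
    """Mean pairwise agreement across annotators for one asset.

    labels_per_annotator: dict mapping annotator id -> class label.
    Returns a value in [0, 1]; 1.0 means all annotators agree.
    """
    pairs = list(combinations(labels_per_annotator.values(), 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)

def honeypot_score(annotator_labels, ground_truth):
    """Fraction of honeypot (gold) assets an annotator labeled correctly.

    Both arguments are dicts mapping asset id -> class label.
    """
    common = set(annotator_labels) & set(ground_truth)
    if not common:
        return None  # this annotator saw no honeypot assets
    return sum(annotator_labels[a] == ground_truth[a] for a in common) / len(common)

# Two of three labelers agree on "bunny": 1 agreeing pair out of 3 -> 0.33
print(consensus_score({"a1": "bunny", "a2": "bunny", "a3": "not_bunny"}))
# One of two gold assets labeled correctly -> 0.5
print(honeypot_score({"img1": "bunny", "img2": "bread"},
                     {"img1": "bunny", "img2": "toast"}))
```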
Plugins & programmatic QA
Automate QA steps with custom plugins, i.e., custom code modules that automatically detect errors and flag them for your reviewers' attention.
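To illustrate the idea, here is a sketch of the kind of check such a plugin might run on submitted bounding boxes. The data structures and function are hypothetical; Kili's actual plugin interface is described in its documentation.

```python
def find_suspect_boxes(annotations, image_width, image_height, min_area=25):
    """Flag bounding boxes that are degenerate or fall outside the image.

    annotations: list of dicts with "label" and "bbox" as (x, y, w, h) pixels.
    Returns (annotation, reason) pairs for a reviewer to inspect.
    """
    issues = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            issues.append((ann, "zero or negative size"))
        elif w * h < min_area:
            issues.append((ann, "suspiciously small box"))
        elif x < 0 or y < 0 or x + w > image_width or y + h > image_height:
            issues.append((ann, "box extends outside the image"))
    return issues

# A plugin would run checks like this on every submitted label and
# raise an issue on the asset whenever the returned list is non-empty.
for ann, reason in find_suspect_boxes(
    [{"label": "kite_surfer", "bbox": (510, -3, 40, 60)}],
    image_width=640, image_height=480,
):
    print(ann["label"], "->", reason)  # kite_surfer -> box extends outside the image
```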
Orchestration of all of the above
Run all quality strategies consecutively or in parallel and automate each step of the way as you see fit.
Kili Technology maintains its commitment to the precision of labeling, acknowledging it as a continuous task needing ongoing refinement and enhancement as datasets evolve.
It is imperative to continuously update and improve labels, as well as adapt them to encompass new scenarios and applications. The effectiveness and quality of labeling depend heavily on the establishment of a resilient iterative cycle, crucial for the timely rectification of issues.
Kili Technology fosters this iterative model by integrating convenient features such as Single Sign-On (SSO), uninterrupted cloud storage connectivity, and the provision of simplified, customizable data export options. Ultimately, Kili Technology emerges as an all-encompassing solution that streamlines the labeling workflow, augments productivity levels, and uncompromisingly sustains rigorous quality benchmarks.
The strengths of Quality Match, the first dataset quality assurance platform
To unlock the full capabilities of data annotation, it is essential to intertwine it seamlessly with Quality Assurance (QA). Together, they form a unified approach that is instrumental in creating highly performant training datasets.
Quality assurance operates as the diligent overseer of the labeling process, promptly identifying and correcting errors and inconsistencies, thereby enhancing the overall reliability of data annotation.
Quality assurance’s role extends beyond merely spotting errors. In an evolving regulatory landscape marked by initiatives like the EU AI Act, QA brings a level of accountability, ensuring that datasets meet increasingly rigorous standards for transparency and reliability in AI applications.
In this context, Quality Match (QM) offers a robust platform that provides the necessary tools for establishing technology-driven, consistent processes. It ensures that datasets meet stringent quality benchmarks, blending transparency with cost-effectiveness to enhance the precision and trustworthiness of AI applications.
An effective QA process is iterative and focuses on quick, actionable feedback to improve data quality continuously. Such a methodology enables real-time improvements, ensuring datasets are refined and ready for AI/ML model training.
Quality assurance specializes in identifying common error sources such as human mistakes, faulty sensor data, and unclear labeling instructions or taxonomies.
Human mistakes: People make mistakes, and it's not uncommon to find that 10-30% of errors in human-labeled datasets stem from simple human error.
Faulty sensor data: Sometimes the data itself presents challenges. Whether it's images or lidar scans that are dark, blurry, or obstructed, even the most skilled human labeler may struggle to interpret them correctly.
Incomplete taxonomies: This occurs when the labeling guide lacks sufficient clarity or completeness. Even with an expert labeler and good sensor data, questions arise if the guidelines fail to address every possible situation. For instance, should a person in a wheelchair be categorized as a pedestrian? What about a police officer walking across the road?
QM does more than simply correct these errors. It also:
Compares error rates before and after quality assurance, particularly for human error.
Identifies ambiguous cases arising from poor sensor data.
Highlights edge cases that result from inadequate taxonomies.
Assesses and categorizes each error by type and severity, along with other potential attributes.
Establishes statistical confidence for each assessed annotation.
Quality Match (QM) has structured its Quality Assurance (QA) processes to foster accuracy and efficiency throughout each stage of data handling. The process begins with a thorough analysis of label guides, ensuring that each object class is precisely defined and scrutinized. This foundational step is essential for establishing clear guidelines and criteria for the subsequent stages.
Following the initial analysis, QM focuses on crafting comprehensive taxonomies for each object under review, setting the stage for detailed QA metrics and evaluations. This involves categorizing and defining every aspect that needs examination, ensuring a structured approach to quality assurance.
In the subsequent phase, QM delves into designing robust QA processes and pipelines, each one meticulously customized for specific tasks or object classes.
After thorough testing and refinement, the QA processes and object taxonomies are consolidated. This integration ensures a seamless and coherent workflow, enhancing the overall efficacy and precision of the QA process.
Upon consolidation, the next phase involves ramping up the refined QA processes, scaling and implementing them as necessary to meet the demands of the dataset. Each process is adapted and expanded based on its performance and the specific needs of the data being handled.
Central to QM’s technology are two pivotal concepts: nano-tasking and the repetition of these nano-tasks. Nano-tasking involves breaking down complex QA tasks into more manageable, singular decisions, significantly easing the reviewer’s workload. This detailed segmentation allows for each characteristic within an image to be treated as an independent task, reducing the chances of errors or inconsistencies being missed due to mental fatigue. This meticulous approach ensures a more thorough review process, enabling easier identification of ambiguous areas or edge cases, thereby enhancing the overall precision and reliability of the data labeling process.
Example of a nano-task
Nano-tasks are methodically arranged into coherent decision trees, shaping the decision-making process hierarchically. Each tree starts with overarching questions and progressively narrows to finer details based on the preceding responses, guaranteeing that no unnecessary questions are posed. This optimizes both time and resources, enhancing the efficiency of the process.
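As a minimal sketch, the snippet below models such a decision tree for the pedestrian questions raised earlier; the node ids, questions, and labels are hypothetical examples, not Quality Match's actual task format.

```python
# Each node asks one yes/no nano-task; the answer selects the next node,
# so reviewers never see questions made irrelevant by earlier answers.
TREE = {
    "is_person": {
        "question": "Does the box contain a person?",
        "yes": "is_on_road",
        "no": "LABEL:not_pedestrian",
    },
    "is_on_road": {
        "question": "Is the person on or near the roadway?",
        "yes": "uses_mobility_aid",
        "no": "LABEL:pedestrian_off_road",
    },
    "uses_mobility_aid": {
        "question": "Is the person using a wheelchair or similar aid?",
        "yes": "LABEL:pedestrian_mobility_aid",
        "no": "LABEL:pedestrian",
    },
}

def run_tree(answers, root="is_person"):
    """Walk the tree with a dict of node id -> bool; return the final label."""
    node = root
    while not node.startswith("LABEL:"):
        node = TREE[node]["yes" if answers[node] else "no"]
    return node.removeprefix("LABEL:")

# A person in a wheelchair crossing the road:
print(run_tree({"is_person": True, "is_on_road": True, "uses_mobility_aid": True}))
# -> pedestrian_mobility_aid
```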
The second fundamental concept of the technology is the repeated execution of nano-tasks by various reviewers. This repetition helps uncover ambiguities or edge cases and contributes statistical robustness to the QA outcomes.
In this context, statistical power equates to the reliability with which we can trust the QA results. Gathering a variety of responses allows for an aggregated and analyzed outcome, reflecting the reliability of the combined answer. The more robust the statistical power, the higher the confidence in the result.
If a task is straightforward, early stopping of repeats can lead to cost savings. If reviewers disagree, it signifies potential ambiguity or edge cases, warranting further scrutiny through decision trees to uncover hidden complexities.
By deconstructing label guides into decision trees, each nano-task can be repeated until statistical convergence is reached. If convergence isn't reached, the annotation is ambiguous, exposing edge cases. Recognizing edge cases and ambiguity early preemptively mitigates potential error sources.
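To illustrate what "repeating until convergence" can look like, here is a sketch that aggregates repeated yes/no answers and stops once the majority answer is statistically solid, otherwise flagging the task as ambiguous. The Wilson interval and the 0.7 floor are illustrative choices, not Quality Match's published method.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def aggregate(answers, confidence_floor=0.7):
    """Decide a repeated yes/no nano-task, or flag it as ambiguous.

    answers: list of booleans from independent reviewers. Returns the
    majority answer once its Wilson lower bound clears the floor.
    """
    n = len(answers)
    majority = sum(answers) >= n / 2
    agreeing = sum(a == majority for a in answers)
    lb = wilson_lower_bound(agreeing, n)
    if lb >= confidence_floor:
        return ("yes" if majority else "no", lb)
    return ("ambiguous", lb)

print(aggregate([True] * 10))                       # ('yes', ~0.72): stop repeating
print(aggregate([True, False, True, False, True]))  # ('ambiguous', ~0.23): escalate
```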
Combining both
Integrating both solutions into a unified tooling suite is seamless, thanks to their robust and readily available APIs. For a detailed exploration of Kili Technology’s API, you can refer to their comprehensive documentation.
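As a rough sketch of what such an integration can look like: the Kili client usage below follows its publicly documented Python SDK (field names may vary by version), while the Quality Match endpoint, payload schema, and token are hypothetical placeholders.

```python
import requests
from kili.client import Kili  # pip install kili

kili = Kili(api_key="YOUR_KILI_API_KEY")

# Pull the latest labels for each asset in a project.
assets = kili.assets(
    project_id="YOUR_PROJECT_ID",
    fields=["externalId", "latestLabel.jsonResponse"],
)

# Forward them to a QA pipeline. The URL and payload schema below are
# hypothetical placeholders, not a documented Quality Match API.
response = requests.post(
    "https://qa.example.com/pipelines/my-pipeline/batches",
    json={"annotations": assets},
    headers={"Authorization": "Bearer YOUR_QM_TOKEN"},
    timeout=30,
)
response.raise_for_status()
print(f"Submitted {len(assets)} assets for QA")
```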
The ROI of great labeling and QA
Sharing real-world experiences often resonates more profoundly than theoretical discussions. Here, we delve into two illustrative customer experiences where Kili Technology and Quality Match have been instrumental in refining dataset labeling to unparalleled standards.
Our first story involves a tech-savvy agricultural firm on a quest to redefine sustainable farming techniques with a dependable computer vision model. Their foremost hurdle was the meticulous identification of weeds amid soil, in order to bolster crop yields and optimize resource utilization.
Before adopting Kili Technology, they grappled with the constraints of open-source tools, an issue that worsened as their project scaled. Kili Technology transformed their operations, giving them a flexible, intuitive interface for varied labeling tasks, complemented by the automation capabilities of its API. A notable win was a 30% reduction in labeling costs within just three months, alongside improved labeling quality.
Our second use case involves Quality Match. They collaborated with a customer on a project that compared the efficiency and cost-effectiveness of two quality assurance methods: the customer's existing object detection model paired with Quality Match's QA system versus the traditional adder-approver method. Additionally, the customer sought to reuse the corrected data to bootstrap existing machine-learning models.
The project revolved around the detection and correction of both False Negatives (FNs) and False Positives (FPs) within three object classes. Given the safety-critical nature of the project, high precision was of utmost importance, especially in edge cases such as object reflections and illustrations.
Our QA project began with an in-depth analysis and collaboration with the customer to understand their goals and extract key questions for the QA. These questions, transformed into nano-tasks, were connected and visualized as decision trees. We call this a QA pipeline.
We created two distinct pipelines:
Detecting and Correcting FNs: We broke down the frames into "image tiles" or sub-frames and QAed them individually, as sketched after this list. This allowed reviewers to concentrate on specific areas and detect even the smallest objects.
Detecting and Correcting FPs: We crafted a series of nano-tasks for this pipeline, culminating in a decision tree to identify true positives or false positives.
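Here is a minimal sketch of the tiling step used in the FN pipeline; the tile size and overlap values are illustrative, not the project's actual parameters.

```python
def make_tiles(width, height, tile=512, overlap=64):
    """Split a frame into overlapping sub-frames ("image tiles").

    Overlap ensures an object straddling a tile boundary still appears
    whole in at least one tile. Returns (x0, y0, x1, y1) pixel boxes.
    """
    step = tile - overlap
    tiles = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            tiles.append((x, y, min(x + tile, width), min(y + tile, height)))
    return tiles

# A 1920x1080 frame becomes a grid of 512 px tiles with 64 px overlap,
# each small enough for a reviewer to scan for missed objects.
for box in make_tiles(1920, 1080)[:3]:
    print(box)  # (0, 0, 512, 512), (448, 0, 960, 512), (896, 0, 1408, 512)
```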
The project yielded promising results:
FP Detection: With statistical power higher than 98%, we identified 15.88% of annotated objects as FPs, and their taxonomies and error severity were precisely cataloged.
FN Detection: The rate of FNs was extremely low.
We recommended that the customer focus on FP detection in the future.
Conclusion
As the technological landscape transitions from code-driven to data-driven paradigms, the construction of datasets remains a challenge. Professionals globally are exploring optimal strategies to enhance their data labeling workflows, gleaning insights from pragmatic experiences and iterative refinements.
Addressing this, numerous organizations globally are striving to streamline the dataset creation process, with Kili Technology and Quality Match emerging as frontrunners. Kili Technology exemplifies a steadfast dedication to labeling efficacy and excellence, while Quality Match innovates with a strategic approach to QA processing and insights extraction. Their synergistic integration ushers in a new era of efficiency and precision, embodying the solutions that ML engineers aspire to.