LLMClean: An AI approach to automated context model generation using large language models to analyze and understand diverse data sets

The rapid expansion of the data landscape, driven by the Internet of Things (IoT), poses an urgent challenge: ensuring data quality amid the flood of information. As more IoT devices come online and the cost of data collection falls, companies are using this wealth of data to make strategic decisions.

However, the quality of this data is of utmost importance, especially given the increasing reliance on machine learning (ML) across various industries. Poor-quality training data can introduce bias and inaccuracy, undermining the effectiveness of ML applications. Real-world data often contains errors such as duplicates, null entries, anomalies and inconsistencies, which significantly degrade data quality.
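
To make two of these error types concrete, the following is a minimal sketch that counts null entries and duplicate records in a small table. It uses only the Python standard library; the function name `profile_rows`, the column names, and the sample rows are illustrative assumptions and do not come from the LLMClean paper.

```python
from collections import Counter


def profile_rows(rows, key_fields):
    """Count null entries and duplicate records in a list of dict rows.

    A row counts as a duplicate when another row has the same values
    for every field in key_fields (an assumption made for this sketch).
    """
    null_count = sum(
        1
        for row in rows
        for value in row.values()
        if value is None or value == ""
    )
    keys = Counter(tuple(row.get(field) for field in key_fields) for row in rows)
    duplicate_count = sum(n - 1 for n in keys.values() if n > 1)
    return {"nulls": null_count, "duplicates": duplicate_count}


# Hypothetical sensor readings, invented for illustration only.
rows = [
    {"device_id": "s1", "timestamp": 1, "temp_c": 21.5},
    {"device_id": "s1", "timestamp": 1, "temp_c": 21.5},  # duplicate record
    {"device_id": "s2", "timestamp": 1, "temp_c": None},  # null entry
]
report = profile_rows(rows, ["device_id", "timestamp"])
print(report)
```

Checks like these catch only syntactic problems. Anomalies and inconsistencies usually require knowledge about what the values mean, which is where context comes in.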

Efforts to mitigate data quality issues have led to the development of automated data cleaning tools. However, many of these tools lack contextual awareness, which is critical for effective data cleaning in ML workflows. Contextual information clarifies the meaning, relevance and relationships of the data, ensuring consistency with real-world phenomena.

Context-aware data cleaning tools show promise by leveraging ontological functional dependencies (OFDs) extracted from context models. OFDs capture semantic relationships between attributes, which improves the accuracy of error detection and correction.
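
The idea behind an OFD can be illustrated with a small sketch. A plain functional dependency such as code -> country flags any two rows that share a code but differ in the country string. An ontological version first maps each value to the concept it denotes, so synonyms are not reported as errors. The synonym table, column names, and helper functions below are hypothetical and only illustrate the general idea; they are not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical mini-ontology: surface value -> canonical concept.
SYNONYMS = {
    "USA": "United States",
    "U.S.": "United States",
    "United States": "United States",
    "Deutschland": "Germany",
    "Germany": "Germany",
}


def canonical(value):
    """Map a value to its concept; unknown values map to themselves."""
    return SYNONYMS.get(value, value)


def find_violations(rows, lhs, rhs, normalize=lambda v: v):
    """Return lhs values that map to more than one distinct rhs value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(normalize(row[rhs]))
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


rows = [
    {"code": "US", "country": "USA"},
    {"code": "US", "country": "United States"},  # synonym, not an error
    {"code": "DE", "country": "Germany"},
    {"code": "DE", "country": "France"},         # genuine inconsistency
]

strict = find_violations(rows, "code", "country")
ontological = find_violations(rows, "code", "country", canonical)
print(strict)
print(ontological)
```

In this sketch the strict check reports both the US and the DE groups, while the ontology-aware check reports only DE, because "USA" and "United States" resolve to the same concept.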

Despite the effectiveness of OFD-based cleaning tools, manually building context models presents practical challenges, especially for real-time applications. The labor-intensive nature of manual methods, coupled with the need for domain expertise and scalability concerns, highlights the need for automation.

In response, the proposed solution, LLMClean, leverages large language models (LLMs) to automatically generate context models from real-world data, eliminating the need for additional meta-information. By automating this process, LLMClean addresses the scalability, adaptability, and consistency issues inherent in manual methods.

LLMClean uses a three-tier architecture that integrates LLMs, context models, and data cleaning tools to identify erroneous instances in tabular data. The method consists of three steps: dataset classification, model extraction or mapping, and context model generation.
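
The overall flow of such a method can be sketched roughly as follows. This is not LLMClean's actual code or API: the prompts, the IoT versus non-IoT labels, the `lhs->rhs` answer format, and every function name are assumptions made for illustration, and the language model is replaced by a canned stub so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a function that takes a prompt and returns text.
AskLLM = Callable[[str], str]


def classify_dataset(columns: List[str], ask: AskLLM) -> str:
    """Step 1 (assumed form): ask the model what kind of dataset this is."""
    answer = ask(f"Is a table with columns {columns} IoT or non-IoT? One word.")
    return answer.strip().lower()


def propose_dependencies(columns: List[str], ask: AskLLM) -> List[Tuple[str, str]]:
    """Step 2 (assumed form): ask for column dependencies as 'lhs->rhs' lines."""
    answer = ask(f"List dependencies between columns {columns} as lhs->rhs lines.")
    deps = []
    for line in answer.splitlines():
        if "->" not in line:
            continue
        lhs, rhs = (part.strip() for part in line.split("->", 1))
        # Discard suggestions that mention columns the table does not have.
        if lhs in columns and rhs in columns:
            deps.append((lhs, rhs))
    return deps


def build_context_model(columns: List[str], ask: AskLLM) -> Dict[str, object]:
    """Step 3 (assumed form): bundle the results into a simple context model."""
    return {
        "category": classify_dataset(columns, ask),
        "dependencies": propose_dependencies(columns, ask),
    }


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model, for demonstration only."""
    if "IoT or non-IoT" in prompt:
        return "IoT\n"
    return "sensor_id->location\nsensor_id->owner\n"


model = build_context_model(["sensor_id", "location", "reading"], fake_llm)
print(model)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider, and validating the suggested column names guards against answers that refer to attributes the table does not contain.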

By leveraging automatically generated OFDs, LLMClean provides a robust data cleansing and analysis framework tailored to the evolving nature of real-world data, including IoT datasets. Additionally, LLMClean introduces sensor capability dependencies and device link dependencies, which are critical for accurate fault detection.

All credit for this research goes to the researchers of this project.

Arshad is an intern at MarktechPost. He is currently pursuing an Int. MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to technological advances. His passion lies in understanding nature at a fundamental level using tools such as mathematical models, ML models and AI.