Researchers at Stanford University introduce SUQL: a formal query language for integrating structured and unstructured data

Large Language Models (LLMs) have gained popularity due to their exceptional performance in various tasks. Recent research aims to improve their factuality by integrating external resources, including structured data and free text. However, many data sources, such as patient records and financial databases, contain a mix of both types of information. For example, to answer “Can you find me an Italian restaurant with a romantic atmosphere?”, an agent must combine the structured attribute “cuisines” with the free-text attribute “reviews”.

Previous chat systems typically use classifiers to direct queries to specialized modules for handling structured data, unstructured data, or chitchat. This method is inadequate for questions that require both structured and free-text data. Another approach is to convert structured data into free text, but this gives up the use of SQL for database queries and reduces the effectiveness of free-text retrievers. The need for hybrid data queries is highlighted by datasets like HybridQA, which contain questions that require information from both structured and free-text sources. Previous efforts to ground question-answering systems in hybrid data either rely on small datasets, sacrifice the richness of structured data queries, or support only limited combinations of structured and unstructured knowledge queries.
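To make the linearization approach concrete, here is a minimal sketch of flattening a structured row into free text. The `linearize` helper, the field names, and the sample row are all hypothetical illustrations, not the baseline used in the paper.

```python
def linearize(row: dict) -> str:
    """Flatten a structured record into one line of free text (hypothetical helper)."""
    # Each column becomes a "key: value" fragment; dicts keep insertion order.
    return "; ".join(f"{key}: {value}" for key, value in row.items())


# Illustrative record, not taken from any real dataset.
row = {"name": "Luigi's", "cuisine": "italian", "price": "$$"}
text = linearize(row)
print(text)
```

Once the row is flattened like this, a condition such as a cuisine or price filter can no longer be expressed as a SQL predicate and has to be recovered by a text retriever, which is the limitation described above.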

Stanford researchers present an approach to grounding conversational agents in hybrid data sources, using both structured data queries and free-text retrieval techniques. They show empirically that in real-world conversations, users often ask questions that involve both structured and unstructured data, with over 49% of queries requiring knowledge of both types. To improve expressiveness and precision, they propose SUQL (Structured and Unstructured Query Language), a formal language that extends SQL with free-text processing primitives, enabling a combination of off-the-shelf retrieval models and LLMs with SQL semantics and operators.

The design of SUQL aims for expressiveness, accuracy, and efficiency. SUQL extends SQL with NLP operators such as SUMMARY and ANSWER, facilitating comprehensive queries on hybrid knowledge sources. Because LLMs are already competent at translating complex natural-language requests into SQL, they can also generate SUQL for complex queries. While SUQL queries can run on standard SQL compilers, a naive implementation may be inefficient. The researchers outline SUQL’s free-text primitives and highlight how they differ from retrieval-based methods by expressing queries comprehensively.
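The idea of adding free-text operators to SQL can be sketched with Python's built-in `sqlite3` module, which allows ordinary Python functions to be registered as SQL functions. This is only a toy illustration: the `answer` and `summary` functions below are keyword and first-sentence stand-ins for what would be LLM or retriever calls, and the table, column names, and rows are assumptions made up for the example. It is not the SUQL implementation.

```python
import sqlite3


def answer(reviews: str, question: str) -> str:
    """Toy stand-in for an LLM-backed ANSWER operator (assumption, not SUQL's code)."""
    # Hypothetical mapping from a question to a keyword we look for in the text.
    keywords = {"is it romantic?": "romantic"}
    keyword = keywords.get(question.lower())
    return "Yes" if keyword and keyword in reviews.lower() else "No"


def summary(reviews: str) -> str:
    """Toy stand-in for a SUMMARY operator: return the first sentence only."""
    return reviews.split(".")[0].strip() + "."


conn = sqlite3.connect(":memory:")
# Register the Python functions so they can be called from inside SQL.
conn.create_function("answer", 2, answer)
conn.create_function("summary", 1, summary)

conn.execute("CREATE TABLE restaurants (name TEXT, cuisine TEXT, reviews TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [
        ("Luigi's", "italian", "Candlelit and romantic. Great pasta."),
        ("Pizza Hub", "italian", "Loud and busy. Good for groups."),
        ("Sakura", "japanese", "Quiet, romantic booths. Fresh sushi."),
    ],
)

# One query mixes a structured filter (cuisine) with a free-text condition (reviews).
rows = conn.execute(
    "SELECT name, summary(reviews) FROM restaurants "
    "WHERE cuisine = 'italian' AND answer(reviews, 'Is it romantic?') = 'Yes'"
).fetchall()
print(rows)
```

The structured predicate and the free-text predicate sit in the same WHERE clause, so the database engine combines them with ordinary SQL semantics. Calling the free-text function on every row, as this sketch does, also hints at why a naive implementation can be inefficient.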

The researchers evaluate SUQL through two experiments: one on HybridQA, a question-answering dataset, and another on real restaurant data from Yelp. The HybridQA experiment uses LLMs and SUQL to achieve 59.3% Exact Match (EM) and 68.3% F1 score. SUQL outperforms existing models on the test set by 8.9% EM and 7.1% F1. In the real restaurant experiments, SUQL achieved turn accuracy of 93.8% and 90.3% for single-turn and conversational queries, respectively, and outperformed linearization-based methods by up to 36.8% and 26.9%, respectively.
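For readers unfamiliar with the two metrics, the sketch below shows how Exact Match and token-level F1 are commonly computed for question answering. It assumes a simple lowercase, whitespace-token normalization; the official HybridQA evaluation script may normalize answers differently, so treat this as an illustration of the formulas rather than a reproduction of the benchmark.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (simplified normalization, an assumption)."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("New York", "new york"))
print(token_f1("new york city", "new york"))
```

Exact Match rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the reported F1 is higher than the reported EM.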

In conclusion, the paper presents SUQL as the first formal query language for hybrid knowledge corpora covering structured and unstructured data. Its innovation lies in the integration of free-text primitives into a precise and concise query framework. In-context learning applied to HybridQA comes within 8.9% of the state of the art, which was trained on 62,000 samples. Unlike previous methods, SUQL supports large databases and free-text corpora. Experiments on Yelp data demonstrate the effectiveness of SUQL, with a 90.3% success rate in satisfying user requests compared to 63.4% for linearization baselines.

All credit for this research goes to the researchers of this project.

Asjad is a consulting intern at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is constantly exploring the applications of machine learning in healthcare.
