Solo: Data Discovery using Natural Language Questions via a Self-Supervised Approach

Solo is a tabular data discovery system that uses self-supervision to automatically assemble training data over a large repository of tables without any human involvement, and then trains and deploys the discovery model. To cope with the distribution shift introduced by the synthetic training data, Solo uses a newly designed table representation and a new relevance model between natural language questions and tables. Solo achieves performance comparable to that of systems trained on human-annotated data.
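The self-supervision idea can be illustrated with a small sketch: synthetic questions are derived from the tables themselves, and each question is paired with the table it came from as a positive training example. This is a minimal illustration only, not Solo's actual generator. The question template, the table layout (a dict with "id", "title", "columns", and "rows"), and the names synthesize_question and build_training_pairs are all assumptions made for this example.

```python
import random
from typing import Any, Dict, List, Tuple

# Hypothetical table layout (an assumption for this sketch):
# {"id": str, "title": str, "columns": [str, ...], "rows": [[cell, ...], ...]}
Table = Dict[str, Any]


def synthesize_question(table: Table, rng: random.Random) -> str:
    """Build one synthetic question from a randomly sampled row.

    Uses a fixed, hypothetical template; requires at least two columns.
    """
    row = rng.choice(table["rows"])
    # Pick two distinct columns: one to ask about, one to condition on.
    ask_idx, cond_idx = rng.sample(range(len(table["columns"])), 2)
    ask_col = table["columns"][ask_idx]
    cond_col = table["columns"][cond_idx]
    return (
        f"What is the {ask_col} where {cond_col} is {row[cond_idx]} "
        f"in {table['title']}?"
    )


def build_training_pairs(
    tables: List[Table], questions_per_table: int = 2, seed: int = 0
) -> List[Tuple[str, str]]:
    """Return (question, table_id) pairs with no human annotation."""
    rng = random.Random(seed)
    pairs = []
    for table in tables:
        for _ in range(questions_per_table):
            pairs.append((synthesize_question(table, rng), table["id"]))
    return pairs


# Toy repository with made-up contents, for illustration only.
example_tables = [
    {
        "id": "t1",
        "title": "city parks",
        "columns": ["park", "ward", "acres"],
        "rows": [["Lincoln Park", 43, 1208], ["Grant Park", 42, 319]],
    },
    {
        "id": "t2",
        "title": "library branches",
        "columns": ["branch", "zip code"],
        "rows": [["Harold Washington", 60605], ["Sulzer", 60625]],
    },
]

pairs = build_training_pairs(example_tables)
for question, table_id in pairs:
    print(table_id, "|", question)
```

Pairs like these could serve as positive examples for training a question-to-table relevance model. Because template-generated questions differ in style from questions people actually ask, a sketch like this also makes the distribution shift mentioned above concrete.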

System overview

[Figure: Solo system overview]

Demo

Find datasets on the Chicago Data Portal

Publication

Solo: Data Discovery using Natural Language Questions via a Self-Supervised Approach

Qiming Wang, Raul Castro Fernandez
International Conference on Management of Data (SIGMOD 2024)
[pdf] [code]