Solo: Data Discovery using Natural Language Questions via a Self-Supervised Approach
Solo is a tabular data discovery system that uses self-supervision to automatically assemble training data over a large repository of tables without any human involvement, and then trains and deploys the system. To cope with the distribution shift introduced by the synthetic training data, Solo uses a new table representation and a new relevance model between natural language questions and tables. Solo achieves performance comparable to that of systems trained on human-annotated data.
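To make the idea of self-supervised training data concrete, here is a minimal sketch of how (question, table) pairs could be generated from the tables themselves, with no human labels. It is an illustration under stated assumptions, not Solo's actual pipeline: the `Table` class, the question template, and the function names are all hypothetical, and a simple string template stands in for whatever question generator the real system uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical minimal table record (not Solo's data model)."""
    table_id: str
    title: str
    columns: list[str]
    rows: list[list[str]]


def synthesize_question(table: Table, rng: random.Random) -> str:
    """Build one synthetic question from a randomly sampled row.

    Assumption: a fixed template; the table must have at least two columns.
    """
    row = rng.choice(table.rows)
    # Pick two distinct columns: one to condition on, one to ask about.
    cond_idx, target_idx = rng.sample(range(len(table.columns)), 2)
    return (
        f"What is the {table.columns[target_idx]} when "
        f"{table.columns[cond_idx]} is {row[cond_idx]} in {table.title}?"
    )


def build_training_pairs(tables: list[Table], per_table: int = 2, seed: int = 0):
    """Return (question, table_id) pairs; the source table is the positive label."""
    rng = random.Random(seed)
    pairs = []
    for table in tables:
        for _ in range(per_table):
            pairs.append((synthesize_question(table, rng), table.table_id))
    return pairs


if __name__ == "__main__":
    tables = [
        Table("t1", "Chicago libraries", ["name", "zip code"],
              [["Austin", "60644"], ["Albany Park", "60625"]]),
        Table("t2", "Chicago parks", ["park", "acres", "ward"],
              [["Lincoln Park", "1188", "43"]]),
    ]
    for question, table_id in build_training_pairs(tables):
        print(table_id, "|", question)
```

Because every question is derived from a known table, the label comes for free. The trade-off is that templated questions differ from what people actually ask, which is the kind of distribution shift the description above refers to.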
System overview
Demo
Find datasets on the Chicago Data Portal
Publication
Solo: Data Discovery using Natural Language Questions via a Self-Supervised Approach
Qiming Wang, Raul Castro Fernandez
International Conference on Management of Data (SIGMOD 2024)
[pdf]
[code]