Automatically assemble training data for table discovery using natural language questions

Imagine you are an IT manager at an organization that owns many tabular datasets. You want to share these datasets across the organization to improve productivity, which means every user must be able to find the datasets they need easily and precisely. One option is to let users ask a natural language (NL) question and have the system return the datasets that answer it, as sketched below.
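The following is a minimal sketch of the envisioned interaction, using a hypothetical API (the names `Table` and `discover` and the keyword-overlap scoring are illustrative assumptions, not Solo's actual interface): a user asks an NL question and receives a ranked list of candidate tables from the repository.

```python
# Hypothetical sketch: NL question in, ranked candidate tables out.
from dataclasses import dataclass

@dataclass
class Table:
    name: str            # e.g., file name of a CSV in the repository
    columns: list[str]   # header row

def discover(question: str, repository: list[Table], top_k: int = 5) -> list[Table]:
    """Rank tables by a toy relevance score: keyword overlap between the
    question and a table's name/columns. A real system would replace this
    with a learned relevance model."""
    q_tokens = set(question.lower().split())
    def score(t: Table) -> int:
        text = (t.name + " " + " ".join(t.columns)).lower()
        return sum(tok in text for tok in q_tokens)
    return sorted(repository, key=score, reverse=True)[:top_k]

# Example usage
repo = [
    Table("employee_salaries.csv", ["employee", "department", "salary", "year"]),
    Table("office_locations.csv", ["office", "city", "country"]),
]
print([t.name for t in discover("Which department has the highest salary?", repo)])
```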

However, to judge the extent to which a table can answer an NL question, existing solutions must learn from human-annotated (question, table) pairs, on the order of 10K pairs, which is very expensive to collect. In addition, this annotation has to be redone for each new repository of tables, e.g., a new domain.

Given any repository of tabular datasets, how can we automatically assemble the training data without human involvement, and build an end-to-end data discovery system that is comparable to systems trained on human-annotated data?
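To make the question concrete, here is a minimal sketch of one generic way training pairs could be assembled without annotators (this is an illustrative assumption, not Solo's actual algorithm): synthesize template-based questions from a table's own column names and cell values, treat that table as the positive example, and sample other tables from the repository as negatives.

```python
# Generic illustration of self-supervised (question, table) pair assembly.
import csv
import random
from pathlib import Path

def synthesize_pairs(csv_paths: list[str], negatives_per_positive: int = 1):
    """Return (question, table_path, label) triples built purely from table content."""
    pairs = []
    for path in csv_paths:
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if len(rows) < 2 or not rows[0]:
            continue
        header, sample_row = rows[0], rows[1]
        # Hypothetical template: combine a column name with a cell value.
        col = random.randrange(len(header))
        question = f"Which records have {header[col]} equal to {sample_row[col]}?"
        pairs.append((question, path, 1))  # positive: the source table
        others = [p for p in csv_paths if p != path]
        for neg in random.sample(others, k=min(negatives_per_positive, len(others))):
            pairs.append((question, neg, 0))  # negative: a random other table
    return pairs

# Example usage over a hypothetical repository directory
repo_files = [str(p) for p in Path("tables/").glob("*.csv")]
training_data = synthesize_pairs(repo_files)
```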

System that Addresses this Use Case

Solo: Data Discovery using Natural Language Questions via a Self-Supervised Approach