Metam: Goal-Oriented Data Discovery

Metam is a goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, Metam leverages properties of i) the data, ii) the utility function, and iii) the solution set size. More specifically, a key insight of Metam is that similar augmentations perform similarly on the downstream task. Metam exploits this insight by clustering augmentations and judiciously choosing candidates from different clusters, which lets it skip evaluating many of them. To cluster augmentations, Metam represents each one with a vector of data profiles, which are task-independent measures of the data and include semantic similarity, correlation, and mutual information, among others. When a combination of data profiles is correlated with the task’s utility, clustering narrows down the number of augmentation candidates.
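The clustering idea above can be illustrated with a minimal sketch. This is not Metam's actual algorithm or API: the function names (`cluster_by_profile`, `pick_augmentation`), the simple leader-clustering scheme, the `radius` parameter, and all profile and utility numbers are assumptions made up for illustration. The sketch only shows how grouping candidates by their profile vectors lets the downstream task be queried once per cluster rather than once per candidate.

```python
from math import dist  # Euclidean distance, Python 3.8+


def cluster_by_profile(profiles, radius):
    """Leader clustering (an assumed, simple stand-in for Metam's clustering).

    Each candidate joins the first cluster whose leader lies within
    `radius` of its profile vector; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of names; the first is the leader
    for name, vec in profiles.items():
        for members in clusters:
            if dist(profiles[members[0]], vec) <= radius:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def pick_augmentation(profiles, utility, radius=0.2):
    """Query the downstream task once per cluster (on its leader only)."""
    clusters = cluster_by_profile(profiles, radius)
    scores = {members[0]: utility(members[0]) for members in clusters}
    best = max(scores, key=scores.get)
    return best, scores[best], len(scores)


# Hypothetical profile vectors, e.g. (correlation, mutual information).
profiles = {
    "income": (0.90, 0.80),
    "median_income": (0.88, 0.82),
    "crime_stats": (0.60, 0.70),
    "taxi_trips": (0.30, 0.50),
    "taxi_pickups": (0.32, 0.48),
}

# Invented downstream utilities; a real utility would train and score a model.
toy_utility = {
    "income": 0.78,
    "median_income": 0.77,
    "crime_stats": 0.74,
    "taxi_trips": 0.72,
    "taxi_pickups": 0.71,
}

calls = []  # records which candidates were actually sent to the task


def utility(name):
    calls.append(name)
    return toy_utility[name]


best, score, n_queries = pick_augmentation(profiles, utility)
print(best, score, n_queries, "queries for", len(profiles), "candidates")
```

In this toy setup the five candidates fall into three clusters, so the task is queried three times instead of five. The saving depends on the assumption stated in the text: candidates with similar profiles must also have similar utility, otherwise evaluating only one representative per cluster can miss the best augmentation.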

[Figure: Metam overview]

The consequences of goal-oriented data discovery are significant. In a use case that predicts “housing prices” in a geographical area, Metam increased accuracy from 0.69 to 0.81. Metam identified some obvious datasets that a social scientist could have found with a discovery system, such as “income of people staying in the neighborhood” and “crime stats”. Crucially, it also identified non-obvious datasets correlated with housing prices, such as “presence of grocery stores” and “number of taxi trips” from those areas. Many sociologists and economists already leverage external data to infer causal relationships between attributes of interest, but they rely on domain knowledge and manual effort to identify those relationships. Goal-oriented data discovery paves the way to identifying new causal relationships from large data repositories automatically.

Publication

Metam: Goal-Oriented Data Discovery

Sainyam Galhotra, Yue Gong, Raul Castro Fernandez
39th IEEE International Conference on Data Engineering (ICDE 2023). Anaheim, CA, USA
[pdf]