Goal-Oriented Data Discovery
Bob, a data analyst, aims to predict house prices across various regions in New York, California, and Illinois. He sourced a house price dataset from Redfin covering these states. Using the dataset's attributes, Bob trained a random forest classifier to predict whether house prices in a given zipcode are low or high. However, his classifier achieved an accuracy of just 69%. To enhance the classifier's performance, he is now looking to incorporate additional attributes into the original house price dataset.
In this example, Bob has a goal-oriented data discovery problem. His goal, or task, is to maximize the accuracy of a classifier. The data discovery process is to identify augmentations that increase the task's utility.
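To make the setup concrete, a minimal sketch of Bob's baseline pipeline could look like the following. The file name, column names, and labeling rule are assumptions for illustration, not Bob's actual schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical Redfin-style house price table; file and column names are illustrative.
houses = pd.read_csv("redfin_house_prices.csv")

# Label a zipcode as "high" (1) if its median sale price exceeds the overall median.
y = (houses["median_sale_price"] > houses["median_sale_price"].median()).astype(int)
X = houses.select_dtypes("number").drop(columns=["median_sale_price"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# With only the original attributes, accuracy is stuck around 0.69 in Bob's case.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```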
Existing Solutions
To solve this problem, one could use a traditional data discovery system to identify which tables join with the input data, and then, separately, identify which joins increase the task's utility. This discover-then-augment approach works when the discovery system returns candidates relevant to the task. Unfortunately, in practice, it is hard to guarantee that the discovery system identifies good augmentations because: i) relevant augmentations depend on the task; and ii) analysts may not know what properties make an augmentation relevant, e.g., which features improve the predictive power of a classifier.
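A sketch of this two-phase baseline is shown below. Here `discovery_system.joinable_tables` and `task_utility` are placeholders standing in for an off-the-shelf discovery system and the downstream task (e.g., retraining Bob's classifier and measuring accuracy); they are not the API of any particular system.

```python
def discover_then_augment(base_df, candidates_from_discovery, task_utility, key="zipcode"):
    """Two-phase baseline: a discovery system proposes joinable tables first,
    then each candidate join is tested against the downstream task's utility."""
    baseline = task_utility(base_df)
    improvements = []
    # Phase 1 (done upstream): `candidates_from_discovery` is whatever a
    # task-agnostic discovery system returned as joinable on `key`.
    for candidate in candidates_from_discovery:
        # Phase 2: materialize the join and re-run the task.
        augmented = base_df.merge(candidate, on=key, how="left")
        gain = task_utility(augmented) - baseline
        if gain > 0:
            improvements.append((candidate, gain))
    # Rank candidates by how much they actually helped the task.
    return sorted(improvements, key=lambda pair: pair[1], reverse=True)
```

The weakness is visible in the structure: phase 1 never sees the task, so nothing guarantees that any returned candidate yields a positive gain in phase 2.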
Goal-oriented data discovery can also be solved by enumerating every subset of candidate augmentations, computing each subset's utility, and choosing the minimal subset whose utility reaches a target threshold $\theta$. With $n$ candidate augmentations, this process may require up to $O(2^n)$ utility queries, so it is computationally expensive and infeasible when $n$ is large.
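A brute-force sketch of this enumeration, under the same hypothetical `task_utility` placeholder as above, makes the exponential cost explicit:

```python
from itertools import combinations

def exhaustive_augmentation_search(base_df, candidates, task_utility, theta, key="zipcode"):
    """Evaluate every subset of candidate augmentations and return the smallest
    subset whose utility reaches theta. With n candidates this issues up to
    2**n utility queries."""
    n = len(candidates)
    for size in range(n + 1):                       # try smaller subsets first
        for subset in combinations(candidates, size):
            augmented = base_df
            for table in subset:
                augmented = augmented.merge(table, on=key, how="left")
            if task_utility(augmented) >= theta:
                return subset                       # minimal subset reaching theta
    return None                                     # no subset meets the target
```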
How do we utilize the downstream task to steer the discovery and augmentation process efficiently?