My research in machine learning is heavily motivated by a focus on efficient data collection in real world problems and applications in business contexts.
The following gives descriptions of some recent projects.
In the Crowdfunding model, many users pledge a small amount of funding towards a common project which is funded if the total amount of the pledges exceeds a reserve price. The harsh reality of crowdfunding is that there are far more loans than lenders and the total budget of pledges available is very limited. Many bad projects don’t stand a chance at being funded, and it is a waste of precious impressions to even show them. Naive strategies that do not prioritize projects likely to reach the given price are doomed to only partially fund many projects, versus a preferred outcome of a moderate number fully funded to the reserve price.
Problem: How do we match projects to users to result in the largest number of fully funded projects, versus the unfavored alternative of many partially funded projects?
We are working closely with the nonprofit microlending site Kiva to find ways ensuring that as many loans as possible are funded.
Firing Bandits: Optimizing Crowdfunding. L. Jain, K. Jamieson. ICML 2018.
We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. Inspired by the multi-armed bandit literature, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of actual positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as actual positives). Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and online A/B/n testing problems.
A Bandit Approach to Multiple Testing with False Discovery Control. K. Jamieson, L. Jain. NIPS 2018.
The goal of ordinal embedding (aka non-metric multidimensional scaling) and metric learning is to embed items as points given a set of constraints in the form of distance comparisons like “item A is closer to item C then item B”. Such similarity judgements often arise from user studies similar to max-diff studies. Though similarity judgements have been a popular non-metric technique for over 40 years, there has been little to no analysis of the number of samples needed to learn a high-fidelity embedding. We provide the first practical sample complexity results, and also manage to show that ordinal embedding data is enough to completely give a geometric characterization of a set of items. The embeddings derived from our techniques can then be used for downstream machine learning tasks, or to determine which features most influence perception about the items. This work was developed in tandem with the Cognitive Psychology/Educational Departments at the University of Wisconsin.
Learning Low-Dimensional Metrics, L. Jain, B. Mason, R. Nowak, NIPS 2017.
Finite Sample Prediction and Recovery Bounds for Ordinal Embedding, L. Jain, K. Jamieson, R. Nowak, NIPS 2016.
We consider the problem of efficiently sorting items according to their means into clusters of pre-specified sizes, which we refer to as coarse ranking. In many big-data applications, finding the total ranking can be infeasible and/or unnecessary, and we may only be interested in the top items, bottom items, or quantiles. For example consider grading exams in a MOOC - it may be much easier to cluster papers by grades then build a total ranking over all of the exams. Our work on this project was motivated by a collaboration with James Evans and Nandana Sengupta at the University of Chicago to determine relative safety scores for neighborhoods.
Adaptive Sampling for Coarse Ranking, Sumeet Katariya, L. Jain, Nandana Sengupta, James Evans, Robert Nowak. AISTATS, 2018.
NEXT is a computational framework and open-source machine learning system that simplifies the deployment and evaluation of active learning algorithms that use human feedback, e.g. from Mechanical Turk. The system is optimized for the real-time computational demands of active learning algorithms and built to scale to handle a crowd of workers any size. The system is for active learning researchers as well as practitioners who want to collect data adaptively. NEXT is currently being used by a variety of groups including American Family Insurance, the University of Wisconsin Psychology group and the Air Force Research Lab.
Each week, the New Yorker magazine runs a cartoon contest where readers are invited to submit a caption for that week's cartoon - thousands are submitted. The NEXT team has teamed up with Bob Mankoff, cartoon editor of the New Yorker, to use crowdsourcing and adaptive sampling techniques to help decide the caption contest winner each week. This is an example of state-of-the-art active learning being implemented and evaluated in the real world using the NEXT system and the principles developed in that paper.
NEXT: A System for Real-World Development, Evaluation, and Application of Active Learning, Kevin Jamieson, L. Jain, Chris Fernandez, Nick Glattard, Robert Nowak, NIPS, 2015.