CINF 17 Knowing when to say "When..."
Farhad Soltanshahi, Michael S. Brusati, and Robert D. Clark, Tripos, Inc., 1699 South Hanley Road, St. Louis, MO 63144

Sampling large data sets efficiently is a computational challenge, but it can also be a philosophical one. Keeping structural diversity within the selected subset high is important, but so is maintaining representativeness of the data set as a whole. As the fraction of the data set selected increases, enhancing diversity becomes increasingly expensive in computational terms but of progressively less value in practical terms. So when does it make sense to stop worrying about diversity and shift over to straight random sampling? Optimizable k-dissimilarity (OptiSim) is a stochastic selection method that is uniquely positioned to address this question, in part because it returns an ordered selection set in which the earlier selections are, on average, measurably more distinctive and more representative than later ones.
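
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of a k-dissimilarity style selection, not the authors' implementation. It assumes that each step draws a random subsample of up to k candidates and keeps the one farthest from everything already selected. The function name `optisim_like_selection`, its parameters, and the user-supplied `distance` callable are all hypothetical, and any exclusion radius or other refinements of the published method are omitted.

```python
import random


def optisim_like_selection(points, n_select, k, distance, seed=0):
    """Return indices of selected items, in the order they were selected.

    Illustrative sketch only: k = 1 reduces to plain random sampling,
    while k >= len(points) reduces to greedy maximum-dissimilarity selection.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not points or n_select < 1:
        return []

    rng = random.Random(seed)
    pool = list(range(len(points)))
    rng.shuffle(pool)

    # Start from a randomly chosen item (assumption of this sketch).
    selected = [pool.pop()]

    # min_dist[i]: distance from item i to its nearest already-selected item.
    min_dist = {i: distance(points[i], points[selected[0]]) for i in pool}

    while pool and len(selected) < n_select:
        # Draw a random subsample of up to k candidates from the pool.
        subsample = rng.sample(pool, min(k, len(pool)))
        # Keep the candidate that is farthest from the current selection.
        best = max(subsample, key=lambda i: min_dist[i])
        selected.append(best)
        pool.remove(best)
        del min_dist[best]
        # Update nearest-selected distances for the remaining candidates.
        for i in pool:
            d = distance(points[i], points[best])
            if d < min_dist[i]:
                min_dist[i] = d

    return selected
```

Because the result is ordered, any prefix of the returned list is itself a smaller selection. Under the assumptions above, one way to explore the stopping question would be to compare runs at different values of k, with k = 1 serving as the random-sampling baseline.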