CINF
17 Knowing when to say "When..."
Farhad Soltanshahi, Michael S. Brusati, and Robert D. Clark, Tripos,
Inc., 1699 South Hanley Road, St. Louis, MO 63144
Sampling large data sets efficiently is a computational challenge, but it can also be
a philosophical one. Keeping structural diversity within the selected subset
high is important, but so is maintaining representativeness of the data set
as a whole. As the fraction of the data set selected increases, enhancing
diversity becomes increasingly expensive in computational terms, but of
progressively less value in practical terms. So when does it make sense to
stop worrying about diversity and shift over to straight random sampling?
Optimizable k-dissimilarity (OptiSim) is a stochastic selection method that
is well suited to addressing this question, in part because it
returns an ordered selection set in which the earlier selections are, on
average, measurably more distinctive and more representative than later
ones.
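
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of a k-dissimilarity style selection, not the authors' implementation. It assumes that each round draws a random subsample of up to k candidates and keeps the one farthest from its nearest prior selection; other details of OptiSim are omitted. The names optisim_like_selection, dist, k, and seed are hypothetical. Under these assumptions, k = 1 reduces to plain random sampling, while a k as large as the pool gives a maximum-dissimilarity pick each round, which is the trade-off the abstract asks about.

```python
import random


def optisim_like_selection(points, n_select, k, dist, seed=0):
    """Return indices of `points` in the order they were selected.

    Simplified sketch (assumption): each round draws up to k random
    candidates and keeps the one whose nearest prior selection is farthest.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    rng.shuffle(remaining)
    selected = [remaining.pop()]  # random starting point

    while remaining and len(selected) < n_select:
        # Draw the subsample of up to k candidates from the shuffled pool.
        subsample = [remaining.pop() for _ in range(min(k, len(remaining)))]

        # Score a candidate by its distance to the closest prior selection.
        def min_dist(i):
            return min(dist(points[i], points[j]) for j in selected)

        best = max(subsample, key=min_dist)
        selected.append(best)

        # Candidates that were not chosen go back into the pool.
        remaining.extend(i for i in subsample if i != best)
        rng.shuffle(remaining)

    return selected


# Usage: 1-D toy data with absolute difference as the dissimilarity.
toy = [0.0, 1.0, 2.0, 10.0]
order = optisim_like_selection(toy, n_select=3, k=2, dist=lambda a, b: abs(a - b))
print(order)
```

Because the result is ordered, truncating it to its first m entries gives a smaller selection without rerunning the procedure.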