David O. Nelson
Evaluating genotype-phenotype relationships in complex data sets
with missing data
****************************************************************

One of the challenges of the post-genomic era will be to apply the
vast quantities of emerging human sequence data to public health
problems. Cheap re-sequencing technologies, coupled with developing
technologies in rapid genotyping, will drive the development of
methods to predict the susceptibility of population strata to outcomes
such as cancer incidence or survival in response to environmental
exposures of interest.

LLNL is developing and analyzing a data set that attempts to enumerate
the variation in DNA repair genes in a collection of human cell lines
and associate each cell line with a phenotype that provides an
integrated, end-to-end measure of reduced repair capacity in response
to ionizing radiation. This data set is being used to develop a simple
scoring system that will associate genotypes with diminished DNA
repair capacity and, in the future, with increased risk for clinical
outcomes of interest.

This data set is characterized by high dimensionality, ordinal
predictors, and multiple outcome measurements. One of its more
interesting features is its incompleteness: due to the vagaries of
current genotyping methods, very few cell lines contain complete
information on all genotypes.

In this talk, we will report on progress in developing algorithms that
search for simple functions of combinations of interesting genotypes
that predict reduction in DNA repair capacity. These algorithms
exploit R on high-performance multiprocessors. They evaluate
potentially interesting combinations of genotypes by resampling
methods such as those developed by Breiman, Fridlyand, and others, and
develop scoring functions by heuristic search methods similar to Logic
Regression.
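
As a rough illustration of the kind of procedure described above, the
following Python sketch scores AND-combinations of genotypes against a
phenotype, skips cell lines whose genotypes are missing for the loci in
a rule, and uses bootstrap resampling to check how stable the score is.
It is not the algorithm reported in the talk: the locus names, the toy
data, the difference-in-means score, and the exhaustive search over
single loci and pairs (a stand-in for a heuristic search) are all
assumptions made for the example.

```python
import random
from itertools import combinations

# Hypothetical toy data: variant-allele counts (0, 1, 2) per locus for six
# cell lines; None marks a genotype that could not be determined.
cells = [
    {"locus_a": 1, "locus_b": 1},
    {"locus_a": 1, "locus_b": None},
    {"locus_a": 0, "locus_b": 2},
    {"locus_a": 2, "locus_b": 1},
    {"locus_a": 0, "locus_b": 0},
    {"locus_a": None, "locus_b": 0},
]
# Hypothetical phenotype: relative repair capacity of each cell line.
phenotype = [0.4, 0.5, 0.9, 0.3, 1.0, 0.95]


def rule_fires(cell, rule):
    """Return True/False if every locus in the rule is observed, else None.

    `rule` is a tuple of (locus, min_variant_count) pairs joined by AND.
    """
    values = [cell.get(locus) for locus, _ in rule]
    if any(v is None for v in values):
        return None  # missing genotype: this cell line cannot be scored
    return all(v >= cutoff for v, (_, cutoff) in zip(values, rule))


def effect(cells, phenotype, rule):
    """Mean phenotype where the rule fires minus mean phenotype elsewhere.

    Cell lines with missing genotypes for the rule are left out. Returns
    None when either group is empty.
    """
    hit, miss = [], []
    for cell, y in zip(cells, phenotype):
        fired = rule_fires(cell, rule)
        if fired is None:
            continue
        (hit if fired else miss).append(y)
    if not hit or not miss:
        return None
    return sum(hit) / len(hit) - sum(miss) / len(miss)


def bootstrap_effect(cells, phenotype, rule, n_boot=200, seed=0):
    """Average the effect over bootstrap resamples of the cell lines."""
    rng = random.Random(seed)
    n = len(cells)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        e = effect([cells[i] for i in idx], [phenotype[i] for i in idx], rule)
        if e is not None:
            estimates.append(e)
    return sum(estimates) / len(estimates) if estimates else None


def best_rule(cells, phenotype, loci, cutoff=1):
    """Score every single locus and AND-pair; return (rule, effect).

    Exhaustive search is only feasible for tiny examples like this one.
    Raises ValueError if no rule can be scored.
    """
    candidates = [((locus, cutoff),) for locus in loci]
    candidates += [((a, cutoff), (b, cutoff)) for a, b in combinations(loci, 2)]
    scored = []
    for rule in candidates:
        e = effect(cells, phenotype, rule)
        if e is not None:
            scored.append((rule, e))
    return max(scored, key=lambda pair: abs(pair[1]))


rule, score = best_rule(cells, phenotype, ["locus_a", "locus_b"])
print(rule, score, bootstrap_effect(cells, phenotype, rule))
```

Dropping incomplete cell lines rule by rule, rather than discarding
every cell line with any missing genotype, keeps more of the data
available for each candidate combination, at the cost of scoring
different rules on different subsets of cell lines.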