The invention provides methods for identifying a "necessary" set of information-rich variables from which a plurality of "sufficient" classifiers may be derived. In the field of biological diagnostics, the invention may be used to provide short lists of genes, referred to as "gene signatures," that may be used to carry out specific classification tasks such as predicting the activity and side effects of a compound in vivo.
Thus, most diagnostic devices are simply two-class classifiers. The classifier can be a function of all or of a subset of the initial variables. The value of that function is calculated for each individual datum. The individual sample is assigned to one or the other class depending on whether the result of the classifier function exceeds a defined threshold.
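The thresholding rule described above can be sketched in a few lines. This is a minimal illustration that assumes the classifier function is a weighted sum of the selected variables; the names `classify`, `weights`, `bias` and `threshold` are hypothetical and are not taken from the text.

```python
from typing import Sequence


def classify(values: Sequence[float],
             weights: Sequence[float],
             bias: float = 0.0,
             threshold: float = 0.0) -> bool:
    """Assign one sample to a class with a linear two-class classifier.

    Assumption: the classifier function is a weighted sum of the variables.
    Returns True when the function value exceeds the threshold.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    # Value of the classifier function for this individual datum.
    score = sum(w * v for w, v in zip(weights, values)) + bias
    # The sample falls in the positive class only if the score exceeds the threshold.
    return score > threshold
```

A variable that is not part of the classifier can be given a weight of zero, which is one way to express a classifier that uses only a subset of the initial variables.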
Desirable attributes of a diagnostic assay include high sensitivity and specificity (measured in terms of low false negative and false positive rates) and overall accuracy. Because diagnostic assays are often used to assign large numbers of samples to given categories, the issues of cost per assay and throughput (number of assays per unit time or per worker hour) are of paramount importance. Usually the development of a diagnostic assay involves the following steps: Over the past 10 years, a variety of techniques have been developed that are capable of measuring a large number of different biological analytes i.
Perhaps the most successful example is the DNA microarray, which may be used to measure the expression levels of thousands or even tens of thousands of genes simultaneously. Based on well-established hybridization rules, the design of the individual probe sequences on a DNA microarray now may be carried out in silico, and without any specific biological question in mind. Although DNA microarrays have been used primarily for pure research applications, this technology currently is being developed as a medical diagnostic device and everyday bioanalytical tool.
Although DNA microarrays are considerably more expensive than conventional diagnostic assays, they do offer two critical advantages. First, they tend to be more sensitive, and therefore more discriminating and accurate in prediction than most current diagnostic techniques. Using a DNA microarray, it is possible to detect a change in a particular gene's expression level earlier, or in response to a milder treatment, than is possible with more classical pathology markers.
Also, it is possible to discern combinations of genes or proteins useful for resolving subtle differences in forms of an otherwise more generic pathology. Second, because of their massively parallel design, DNA microarrays make it possible to answer many different diagnostic questions. In addition, by using different combinations of variables that may be available on an array, it may be possible to confirm the answer to a single classification question in multiple independent ways and thereby increase accuracy.
A key challenge in developing the DNA microarray as a diagnostic tool lies in accurately interpreting the large amount of multivariate data provided by each measurement i. Indeed, commercially available high density DNA microarrays (also referred to as "gene chips" or "biochips") allow one to collect thousands of gene expression measurements using standardized published protocols. However, typically only a very small fraction of these measurements is relevant to a given diagnostic classification question being asked by the user.
For example, only genes out of 10, available on the microarray may be used as the gene signature for a specific question. Thus, current DNA microarrays provide a large amount of information that is not used for answering most typical diagnostic assay questions. Similar data overload problems exist in adapting other highly multiplexed bioassays such as RT-PCR or proteomic mass spectrometry to diagnostic applications.
A recently developed, powerful new application for the DNA microarray is chemogenomic analysis. A comprehensive database of chemogenomic annotations for large numbers of genes in response to large numbers of chemical compounds may be used to design and optimize new pharmaceutical lead compounds based only on a transcriptional and biomolecular profile of the known or merely hypothetical compound.
For example, a small number of rats may be treated with a novel lead compound and then expression profiles measured for different tissues from the compound-treated animals using DNA microarrays.
Construction of a comprehensive chemogenomic database and methods for chemogenomic analysis using microarrays are described in Published U. Systematic "mining" of large chemogenomic datasets has led to the discovery of new relationships between genes. It has also led to new insight into the genes and pathways affected by particular classes of compound treatments. Important tools for discovering these new relationships are specific, short weighted lists of genes that may be used to determine whether certain gene expression changes are related i.
These gene lists, referred to as "gene signatures," provide simple, robust tools for answering classification questions using DNA microarrays. Methods for deriving and using gene signatures to analyze chemogenomic data are disclosed in Published U. The use of gene signatures to answer diagnostic questions is not limited to the DNA hybridization assay context.
The general concept of signatures may be widely applied to any analytical testing situation that may be reduced to a question of whether data are within or outside a specific class. Even with robust gene signatures, however, sometimes data are measured that defy simple classification algorithms. That is, the signature does not clearly place the data in either of the two classes it defines. This may be due to the nature of the data originally used to derive the signature i.
New methods are needed to derive signatures capable of classifying this type of "borderline" data. The availability of improved signatures would greatly increase the usefulness of these signatures as accurate and reliable tools for diagnostic classification. In one preferred embodiment, a single log odds ratio threshold of greater than or equal to 4. In an alternative embodiment of the method, a second threshold may be selected and used to determine the performance of the remaining variables when repeating steps c and d.
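Classifier performance is expressed here as a log odds ratio (LOR) compared against a threshold. The text in this section does not define the LOR, so the sketch below assumes the standard log diagnostic odds ratio computed from a two-class confusion matrix, with a 0.5 continuity correction added to every cell so that empty cells do not cause a division by zero. The function names and the correction value are assumptions for illustration only.

```python
import math


def log_odds_ratio(tp: int, fp: int, tn: int, fn: int,
                   correction: float = 0.5) -> float:
    """Natural log of (TP * TN) / (FP * FN) for a two-class classifier.

    Assumption: this standard definition stands in for the LOR used in the
    text. `correction` is added to each cell to keep the ratio finite.
    """
    numerator = (tp + correction) * (tn + correction)
    denominator = (fp + correction) * (fn + correction)
    return math.log(numerator / denominator)


def meets_threshold(lor: float, threshold: float) -> bool:
    """A classifier counts as valid when its LOR is at or above the threshold."""
    return lor >= threshold
```

Under this definition a classifier that performs no better than chance has an LOR near zero, and larger values indicate fewer false positives and false negatives relative to correct calls.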
In one embodiment, the method may be carried out wherein the multivariate dataset comprises chemogenomics data, and specifically, comprises a dataset from polynucleotide array experiments on compound-treated samples. In another preferred embodiment of the above method, the linear classifiers are sparse; that is, they are composed of short gene lists. In another embodiment, the above method is carried out with a multivariate dataset comprising data from a proteomic or metabolomic experiment.
The present invention also includes a set of necessary variables for answering classification questions made according to the method described above.
Necessary sets of the invention may be quite large and include all or nearly all variables in the full set of variables. The present invention also includes an array, or other diagnostic device, comprising a set of polynucleotides each representing a gene in the necessary set made according to the method described above. In another embodiment, the invention includes a diagnostic reagent set useful in diagnostic assays and diagnostic kits for a specific classification question comprising a set of polynucleotides each representing a gene in the necessary set made according to the above method.
In another embodiment, the invention includes a subset of genes useful for answering a chemogenomic classification question (including those questions disclosed in Table 2) comprising a percentage of genes randomly selected from the necessary set made according to the method described above, wherein the addition of the percentage of genes to the depleted set for the classification question increases the average log odds ratio of the linear classifiers generated by the depleted set.
In some embodiments, the subset may be defined according to the percentage increase in the average LOR performance of the depleted set; in other embodiments, the increase corresponds to a set average LOR threshold.
In one specific embodiment, the subset of genes is useful for answering the monoamine re-uptake SERT inhibitor classification question and the necessary set consists of the genes listed in Table 5. In another embodiment, the invention includes a diagnostic assay comprising a set of secreted proteins encoded by the genes of a necessary set made according to the above-described method e. In another embodiment, the invention provides a method for preparing a reagent set comprising the steps of: This method of preparing a reagent set may further include the steps of: In one embodiment, the reagent set comprises polypeptides represent genes capable of detected secreted proteins.
In another embodiment, the invention provides a set of necessary variables for answering a classification question comprising the variables whose removal from a full multivariate dataset results in a depleted set of variables that are unable to answer the classification question with a performance greater than some selected threshold e. In another embodiment, the invention includes a diagnostic device e. In other embodiments, the chosen threshold is an average LOR greater than or equal to 3.
In an alternative embodiment, the invention provides a diagnostic device comprising a set of secreted proteins encoded by the genes in the necessary set for a specific classification question or a set of reagents capable of detecting the secreted proteins. In one embodiment, the present invention provides a method of identifying non-overlapping sufficient sets of variables useful for answering a classification question comprising: In one embodiment, the method of identifying non-overlapping sufficient sets may be carried out wherein the multivariate dataset comprises chemogenomics data, and specifically, comprises a dataset from polynucleotide array experiments on compound-treated samples.
In another preferred embodiment of the above method, the linear classifiers are reducible to weighted gene lists. In another embodiment, the above method is carried out with a multivariate dataset comprising data from a proteomic experiment. The present invention also provides a method of classifying experimental data comprising: Figure 2 A and B depict results of repeatedly applying the stripping algorithm for four different classification questions used to query a chemogenomic dataset.
Four signatures were chosen. A shows the evolution of the Test Log Odds Ratio as a function of the cycles of stripping. B shows the cumulative number of genes used. Figure 3 depicts results of the analysis of a monoamine reuptake inhibitors SERT signature. This process was repeated 50 times. The table presents the mean and standard deviations of the LOR for each subset size added to the depleted set. Impact was defined as the product of a gene's weight by the log ratio of expression in a given treatment.
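The individual steps of the method are not reproduced in this section, so the following is only a rough sketch of the general stripping idea as it is described elsewhere in the text: derive a classifier, set its variables aside, and repeat on the depleted set until performance falls below the chosen threshold. The callable `fit` is a hypothetical stand-in for whatever procedure derives a sparse classifier and reports its performance; `strip_signatures`, `toy_fit` and the toy scores are likewise assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical interface: given the variables still available, return the
# variables used by the derived classifier and its performance (e.g. an LOR).
Fit = Callable[[FrozenSet[str]], Tuple[FrozenSet[str], float]]


def strip_signatures(variables: Iterable[str], fit: Fit, threshold: float,
                     max_cycles: int = 1000
                     ) -> Tuple[List[FrozenSet[str]], Set[str], Set[str]]:
    """Return (non-overlapping sufficient sets, necessary set, depleted set)."""
    remaining: Set[str] = set(variables)
    sufficient_sets: List[FrozenSet[str]] = []
    for _ in range(max_cycles):
        if not remaining:
            break
        selected, performance = fit(frozenset(remaining))
        selected = frozenset(selected) & remaining
        # Stop once the depleted set can no longer reach the threshold.
        if performance < threshold or not selected:
            break
        sufficient_sets.append(selected)
        remaining -= selected  # strip the variables just used
    # Assumption: the necessary set is the union of all stripped sets.
    necessary: Set[str] = set().union(*sufficient_sets)
    return sufficient_sets, necessary, remaining


# Toy stand-in for classifier training: pick the two most informative
# variables still available and report the sum of their made-up scores.
TOY_SCORES = {"g1": 3.0, "g2": 2.5, "g3": 2.0, "g4": 1.5, "g5": 0.2, "g6": 0.1}


def toy_fit(available: FrozenSet[str]) -> Tuple[FrozenSet[str], float]:
    top = sorted(available, key=lambda g: TOY_SCORES[g], reverse=True)[:2]
    return frozenset(top), sum(TOY_SCORES[g] for g in top)


sets, necessary, depleted = strip_signatures(TOY_SCORES, toy_fit, threshold=3.0)
```

Because each cycle removes the variables it used, the sets returned never share a variable, and whatever is left over is a depleted set that cannot reach the threshold on its own.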
Negative impact values are colored green and positive are colored red. At the extreme right a "total impact" column was added. This column represents the sum of the impact values for a gene across all treatments. Strong positive values are in red, all other values are green.
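The impact and total impact values described for this figure follow directly from the stated definitions: impact is a gene's weight multiplied by its log ratio in a given treatment, and total impact is the sum of a gene's impact values across all treatments. A small sketch, using hypothetical gene and treatment names and made-up numbers:

```python
from typing import Dict, Tuple


def impact_table(weights: Dict[str, float],
                 log_ratios: Dict[str, Dict[str, float]]
                 ) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Compute per-treatment impact and total impact for each signature gene.

    weights:    gene -> weight in the signature
    log_ratios: treatment -> gene -> log ratio of expression
    """
    # Impact = gene weight * log ratio of expression in that treatment.
    impacts = {
        gene: {treatment: weight * ratios[gene]
               for treatment, ratios in log_ratios.items()}
        for gene, weight in weights.items()
    }
    # Total impact = sum of a gene's impact values across all treatments.
    totals = {gene: sum(row.values()) for gene, row in impacts.items()}
    return impacts, totals


# Illustrative values only; none of these numbers come from the figure.
example_weights = {"geneA": 2.0, "geneB": -1.0}
example_log_ratios = {
    "treatment1": {"geneA": 0.5, "geneB": 0.25},
    "treatment2": {"geneA": -0.25, "geneB": 1.0},
}
impacts, totals = impact_table(example_weights, example_log_ratios)
```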
Figure 5 depicts results confirming that compounds are signature hits. The complete table encompasses compounds. The label of each compound is shown next to the compound name. These are not part of the signature generation. The same data is expressed as a rank in the right panel. Figure 7 depicts a plot of GO terms identified at different stripping cycles during the generation of the HMGcoA-statin necessary set.
Overview The present invention provides a method of defining a "necessary" set of variables from which multiple independent classifiers e. Using multiple independent signatures for the same classification question in a single classification experiment e. For example, two non-overlapping gene signatures that classify a specific type of pathway inhibitors may be used to reach a consensus classification for a particular compound that does not score highly with either signature alone.
In addition, the necessary set itself, which may be derived for any classification question according to the methods disclosed herein, represents a source of information rich variables that may be used to prepare diagnostic devices. As shown herein, even a small percentage of genes randomly selected from the necessary set for a specific classification question may be used to "revive" a depleted dataset.
For example, by using a method of sequentially "stripping" away signatures from the full dataset to reveal underlying gene signatures associated with distinct metabolic pathways.
These distinct and independent signatures can provide an alternative signature useful for development of a novel diagnostic test. Thus, the present invention provides tools to develop novel toxicology or pharmacology signatures, or diagnostic assays. Definitions "Multivariate dataset" as used herein, refers to any dataset comprising a plurality of different variables including but not limited to chemogenomic datasets comprising logratios from differential gene expression experiments, such as those carried out on polynucleotide microarrays, or multiple protein binding affinities measured using a protein chip.
Other examples of multivariate data include assemblies of data from a plurality of standard toxicological or pharmacological assays e. For example, variables may include relative or absolute amounts of biological molecules, such as mRNA or proteins, or other biological metabolites. Variables may also include dosing amounts of test compounds.
A "classification question" may be of any type susceptible to yielding a yes or no answer e. A valid classifier is defined as a classifier capable of achieving a performance for its classification task at or above a selected threshold value. Higher or lower threshold values may be selected depending on the specific classification task. A signature may include as few as one variable.