Building Physical Network Models
The Potential of Protein-Protein Data
Generally speaking, interactions reported from traditional small-scale, low-throughput experiments, or supported by multiple experiments, are more reliable than those reported only by high-throughput assays; the DIP dataset can be subdivided into subsets of each kind. We used empirical results from Deane et al. (2002) to estimate the error rate for each protein pair. Deane et al. applied two independent tests, EPR (Expression Profile Reliability) and PVM (Paralogous Verification Method), to the DIP database and its subsets to gauge both the level and the type of errors introduced. The EPR test examines the distribution of Euclidean distances between the expression profiles of protein pairs to estimate the false positive rate of a set of pairs. PVM, on the other hand, checks whether paralogs of a protein pair also bind and gives a corresponding confidence measure.
Each pair (g1, g2) of proteins has four binary labels: whether it appears in the DIP dataset (f1), whether it is reported from multiple sources (f2), whether it is validated in the PVM test (f3), and whether it appears in small-scale experiments (f4). Each label was treated as an independent piece of evidence about whether g1 and g2 interact, which we represent as a binary 0/1 variable b. EPR analysis provides
P(b = 1|f1 = 1) = 0.50,
P(b = 1|f2 = 1) = 0.85
whereas PVM gives
P(f3=1|b=0) = 0.05,
P(f3=1|b=1) = 0.50
The protein pairs where f4 = 1 (i.e., reported in small-scale experiments) were taken as correct interactions so that P(b=1|f4=1) = 1. We relaxed this slightly by using P(b=1|f4 = 1) = 1-ε for small ε > 0.
Deane et al.'s analysis did not report P(b = 1|f1 = 0), P(b = 1|f2 = 0), or P(b = 1|f4 = 0).
We assumed that the absence of a protein pair from a given subset provides no additional information about the interaction, and therefore omitted any evidence from f1, f2 or f4 when its value was 0. More precisely, we assumed that P(b = 0|fi = 0) = P(b = 1|fi = 0) for i = 1, 2, 4.
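To make the relative strength of each label concrete, the values quoted above can be restated as odds. This is plain arithmetic on those numbers and on the stated assumption for absent labels, not a result reported by Deane et al.

```latex
\begin{align*}
\frac{P(b=1 \mid f_1=1)}{P(b=0 \mid f_1=1)} &= \frac{0.50}{0.50} = 1, \\
\frac{P(b=1 \mid f_2=1)}{P(b=0 \mid f_2=1)} &= \frac{0.85}{0.15} \approx 5.67, \\
\frac{P(f_3=1 \mid b=1)}{P(f_3=1 \mid b=0)} &= \frac{0.50}{0.05} = 10, \\
\frac{P(b=1 \mid f_4=1)}{P(b=0 \mid f_4=1)} &= \frac{1-\varepsilon}{\varepsilon}, \\
\frac{P(b=1 \mid f_i=0)}{P(b=0 \mid f_i=0)} &= 1 \quad (i = 1, 2, 4).
\end{align*}
```

The first three lines are posterior odds and a likelihood ratio respectively, so they are not directly comparable until the posterior terms are divided by the prior odds of interaction.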
Let xei be an indicator variable for protein-protein interaction ei and yei its measurements, that is, the available information about that interaction. We specified the potential function Φ(xei; yei) for selecting protein-protein edges from a likelihood ratio, as before. Assuming the labels f1, ..., f4 are conditionally independent given the interaction state b, the likelihood ratio factorizes into one term per label. Using Bayes' law to rewrite the terms for f1, f2 and f4 in the form available from Deane et al.'s analysis, the potential can be expressed entirely in terms of P(b), P(b|f1), P(b|f2), P(b|f4) and P(f3|b), which can be substituted with the empirical values obtained from the EPR and PVM tests.
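The combination described above can be sketched in a few lines of Python. Under conditional independence the likelihood ratio P(f1,...,f4|b=1) / P(f1,...,f4|b=0) is a product of per-label ratios, and Bayes' law gives P(fi|b=1) / P(fi|b=0) = [P(b=1|fi) / P(b=0|fi)] × [P(b=0) / P(b=1)] for the labels whose posteriors are known. The function names, the `prior` argument for P(b=1), and the `eps` argument for ε are illustrative assumptions; the text here does not state the values used for either. Using 1 - P(f3=1|b) when f3 = 0 is also an assumption that follows from treating f3 as binary.

```python
# Empirical values quoted in the text (EPR and PVM tests of Deane et al.).
P_B1_GIVEN_F1 = 0.50  # P(b=1 | f1=1): pair appears in DIP
P_B1_GIVEN_F2 = 0.85  # P(b=1 | f2=1): pair reported by multiple sources
P_F3_GIVEN_B0 = 0.05  # P(f3=1 | b=0): PVM validates a non-interacting pair
P_F3_GIVEN_B1 = 0.50  # P(f3=1 | b=1): PVM validates an interacting pair


def posterior_factor(p_b1_given_f, prior):
    """Return P(f|b=1) / P(f|b=0), recovered from P(b=1|f) and P(b=1)."""
    posterior_odds = p_b1_given_f / (1.0 - p_b1_given_f)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds


def likelihood_ratio(f1, f2, f3, f4, prior, eps):
    """Return P(f1..f4 | b=1) / P(f1..f4 | b=0) under conditional independence.

    prior is P(b=1) and eps is the small relaxation for f4; both are
    hypothetical inputs, since their values are not given in the text.
    """
    ratio = 1.0
    # Labels f1, f2, f4 contribute only when present; absent labels are omitted.
    if f1:
        ratio *= posterior_factor(P_B1_GIVEN_F1, prior)
    if f2:
        ratio *= posterior_factor(P_B1_GIVEN_F2, prior)
    if f4:
        ratio *= posterior_factor(1.0 - eps, prior)
    # f3 is given directly as a likelihood, so no Bayes inversion is needed.
    if f3:
        ratio *= P_F3_GIVEN_B1 / P_F3_GIVEN_B0
    else:
        ratio *= (1.0 - P_F3_GIVEN_B1) / (1.0 - P_F3_GIVEN_B0)
    return ratio


if __name__ == "__main__":
    # Illustrative call: a DIP pair validated by PVM, with an assumed prior.
    print(likelihood_ratio(f1=1, f2=0, f3=1, f4=0, prior=0.5, eps=0.01))
```

With an assumed prior of 0.5 the f1 factor equals 1, so a DIP pair validated by PVM is scored by the PVM ratio alone, about 10. A smaller prior raises the factor contributed by each present label among f1, f2 and f4.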