Ethan Cerami Architectural Overview

This document summarizes architectural considerations for building a data services layer for Cytoscape. It currently summarizes ideas from Gary, Paul and Ethan. All of this is open to debate. Feedback is most welcome. This document is written in XDocs format and will be kept under source control, so that we can keep track of revisions.

We plan to build a new modular library for reading from and writing to multiple biological data sources. Cytoscape plug-in writers can then use this new library to import / export data directly. In conjunction with building the library, we also plan to build several initial plug-ins that illustrate the new functionality.

To ensure modularity of code, the current golden rule of Cytoscape is "no biological semantics in the core." In keeping with this tradition, the data layer golden rule will be: "no Cytoscape semantics in the data layer." In other words, the data layer will be a completely standalone library that can be used by Cytoscape, but may also be used by other projects in the future.

Based on Paul's original ideas, we plan to have a suite of data service classes, and a Data Service Factory. We define the following terms:

We have a few options for building services and the factory. One option is to build a factory with a single method, e.g. getService(). The getService() method receives a data source description object, and returns a matching data service. Some sample code is provided below:
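A minimal sketch of this first option follows. Only getService() is named in this document; DataService, ProteinService, DataSourceDescription, the register() method and all fields are hypothetical names chosen for illustration, not the actual phase 1 classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Marker interface shared by every data service (hypothetical). */
interface DataService {
}

/** A service that returns protein data (hypothetical). */
interface ProteinService extends DataService {
    String getProteinName(String id);
}

/** Describes a data source; a name and a location are assumed here. */
final class DataSourceDescription {
    final String name;
    final String location;

    DataSourceDescription(String name, String location) {
        this.name = name;
        this.location = location;
    }
}

/** Factory with a single lookup method, as in the first option. */
final class DataServiceFactory {
    // Registered services, keyed by data source name.
    private final Map<String, DataService> services = new HashMap<>();

    /** Registers a service for a data source name (hypothetical helper). */
    void register(String sourceName, DataService service) {
        services.put(sourceName, service);
    }

    /** Returns the service matching the description, or fails loudly. */
    DataService getService(DataSourceDescription description) {
        DataService service = services.get(description.name);
        if (service == null) {
            throw new IllegalArgumentException(
                    "No service registered for source: " + description.name);
        }
        return service;
    }
}
```

A client would call `(ProteinService) factory.getService(source)`. The factory itself knows nothing about proteins, so a new kind of service can be registered without changing it.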

Using this option, the client will need to cast to the correct service interface.

The second option is to build a factory with multiple methods, e.g. getProteinService(), getDnaService(), etc. These methods also receive a data source description object and return a matching data service. Some sample code is provided below:
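A minimal sketch of this second option follows. The method names getProteinService() and getDnaService() come from the text above; TypedDataServiceFactory, the service interfaces, DataSourceDescription and the stub return values are hypothetical placeholders for illustration.

```java
/** A service that returns protein data (hypothetical). */
interface ProteinService {
    String getProteinName(String id);
}

/** A service that returns DNA data (hypothetical). */
interface DnaService {
    String getSequenceName(String id);
}

/** Describes a data source; a name and a location are assumed here. */
final class DataSourceDescription {
    final String name;
    final String location;

    DataSourceDescription(String name, String location) {
        this.name = name;
        this.location = location;
    }
}

/** Factory with one method per kind of service, as in the second option. */
final class TypedDataServiceFactory {

    /** Returns a stub protein service bound to the given source. */
    ProteinService getProteinService(DataSourceDescription description) {
        // Placeholder: a real service would query the described source.
        return id -> description.name + ":protein:" + id;
    }

    /** Returns a stub DNA service bound to the given source. */
    DnaService getDnaService(DataSourceDescription description) {
        // Placeholder: a real service would query the described source.
        return id -> description.name + ":dna:" + id;
    }
}
```

Here the client writes `ProteinService proteins = factory.getProteinService(source);` with no cast, but every new kind of service requires a new method on the factory.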

The advantage of the second approach is that the client code is much more specific, and there is no need to rely on client casting of services. However, the advantage of the first approach is that it is much more extensible, and can support dozens of new services via the same getService() method.

Note: For phase 1, we decided to go forward with the first option.

UML diagrams of the core data service packages are provided below.

Figure 1: UML Diagram of the Dataservices Core Package.

Figure 2: UML Diagram of the Dataservices Services Package.

Broadly, we are considering 3 categories of data:
  1. Data about parts (nodes)
  2. Data about relationships/interactions (edges)
  3. Results that take extended time to compute

More details:

  1. Parts:
    Biological sequence (Protein, DNA, RNA), Small molecule (e.g. drugs, metabolites)
    Can be found in many databases e.g. GenBank, SeqHound, SRS, Swiss-Prot/TrEMBL, InterPro, PIR, etc. Each category of part has its own attributes that would be useful to fetch and display e.g. proteins have domains, DNA has promoter regions.
  2. Relationships/Interactions. Interactions between two or more things. Interactions between two things are the typical edge that we deal with. Interactions between more than 2 things are sets of objects that will probably have to be represented as a special node that is a 'set' of parts (maybe called a multinode?). The relationship will be defined by the parts that are in the relationship e.g. a protein-protein interaction can be defined by two proteins, so a constructor that takes as arguments two proteins to form a relationship would return a protein-protein interaction. The logic of the data services layer would understand that protein-protein interaction information can only come from certain databases.
  3. Computed resources. This is anything that takes a significant amount of time to compute and return, such as BLAST. This probably does not include BLAST results that are precomputed and sitting in a database ready to read. The computing process must take "more time than the user is willing to sit around and wait for the return of data after a mouse click", as with QBLAST. We might have to compute this data offline and save it in a cache.
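The idea in item 2, that a relationship is defined by the parts it contains, can be sketched as follows. All class names here (Part, Protein, Dna, Interaction, ProteinProteinInteraction) are hypothetical and only illustrate the constructor-based typing described above; they are not taken from the actual packages.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** A part (node), identified by an id (hypothetical minimal type). */
abstract class Part {
    final String id;

    Part(String id) {
        this.id = id;
    }
}

final class Protein extends Part {
    Protein(String id) {
        super(id);
    }
}

final class Dna extends Part {
    Dna(String id) {
        super(id);
    }
}

/** An interaction among two or more parts. */
class Interaction {
    private final List<Part> participants;

    Interaction(Part... parts) {
        if (parts.length < 2) {
            throw new IllegalArgumentException("An interaction needs at least two parts");
        }
        // Defensive copy so later changes to the caller's array have no effect.
        this.participants = Collections.unmodifiableList(Arrays.asList(parts.clone()));
    }

    List<Part> getParticipants() {
        return participants;
    }

    /** True for the typical two-part edge, false for a set of parts. */
    boolean isBinary() {
        return participants.size() == 2;
    }
}

/** The constructor arguments fix the type: two proteins give a protein-protein interaction. */
final class ProteinProteinInteraction extends Interaction {
    ProteinProteinInteraction(Protein first, Protein second) {
        super(first, second);
    }
}
```

With this shape, `new ProteinProteinInteraction(a, b)` only compiles when both arguments are proteins, while the general `Interaction` constructor covers sets of more than two parts.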

This list is unprioritized and incomplete.

To the extent possible, we want to use existing open source libraries, rather than reinvent the wheel. Currently, we are considering the following:

$Log$
Revision 1.1  2003/05/30 19:13:49  ceramie
Initial commit

Revision 1.8  2003/05/16 21:23:25  cerami
Updated document to reflect reality of phase 1 implementation

Revision 1.7  2003/04/08 21:32:17  cerami
Fixed LI tag

Revision 1.6  2003/04/04 19:50:45  bader
Added more file format examples and fixed web services data source.

Revision 1.5  2003/04/04 19:08:29  cerami
Added  tag

Revision 1.4  2003/04/04 17:03:54  cerami
Fixed errors in well-formedness