This document summarizes architectural considerations for building a data services
layer for Cytoscape. It currently summarizes ideas from Gary, Paul and Ethan. All
of this is open to debate. Feedback is most welcome. This document is written in
XDocs format and will be kept under source control, so that we can keep track of
revisions.
We plan to build a new modular library for reading from and writing to multiple
biological data sources. Cytoscape plug-in writers can then use this new library to
import and export data directly. Alongside the library, we also plan to build several
initial plug-ins that illustrate the new functionality.
To ensure modularity of code, the current golden rule of Cytoscape is "no
biological semantics in the core." In keeping with this tradition, the data
layer golden rule will be: "no Cytoscape semantics in the data layer." In
other words, the data layer will be a completely standalone library that can be
used by Cytoscape, but may also be used by other projects in the future.
Based on Paul's original ideas, we plan to have a suite of data service
classes, and a Data Service Factory. We define the following terms:
Data Service: Any class that is capable of retrieving data from
or writing data to an external data source. Data may be local or
remote, and may use any number of protocols, including SQL, HTTP,
IBM TSpaces, SOAP, etc. However, implementation details are
completely hidden from the client.
Data Source Description: A single class that describes an external data source.
Currently, this class includes four main properties:
Category: Very broadly, this specifies the category of data available
from the data source. For example, valid values might include:
Protein, DNA, RNA, small molecule, or interaction.
Protocol: Specifies the data format or protocol used to retrieve/submit
data. For example, valid values might include: FASTA, GenBank,
SeqHound API, BIND XML, SRS, DIP etc.
Location: Specifies the location of the external data source. This could
be a pointer to a local file, a database connect string or an
absolute URL.
Cache: Specifies several options for local caching of content. Specific
options are still being fleshed out.
For example, you can specify 1) category=protein; 2) protocol=SeqHound API; 3)
location=http://seqhound.mshri.on.ca/; 4) cache=in_memory. If the SeqHound API moves,
you just modify the data location, and everything still works as expected. If we
find a new data source that provides identical protein information, we can just
specify a new protocol, and client code continues to work unchanged.
Data Service Factory: The Data Service Factory receives a Data Source Description
object and returns the matching Data Service object.
Schematically, the Data Service Factory looks like this:

+-------------------------+      +----------------------+      +--------------+
| Data Source Description | ---> | Data Service Factory | ---> | Data Service |
+-------------------------+      +----------------------+      +--------------+
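To make the Data Source Description more concrete, here is a minimal Java sketch of a class holding the four properties defined above. The class shape, the getter names and the withLocation() helper are assumptions made for illustration, not the actual phase 1 API, and the replacement URL is a placeholder.

```java
import java.util.Objects;

/** Hypothetical value class for the four properties listed above. */
final class DataSourceDescription {
    private final String category; // e.g. "protein", "DNA", "interaction"
    private final String protocol; // e.g. "SeqHound API", "FASTA"
    private final String location; // file path, database connect string, or URL
    private final String cache;    // e.g. "in_memory"; real options are still open

    DataSourceDescription(String category, String protocol,
                          String location, String cache) {
        this.category = Objects.requireNonNull(category, "category");
        this.protocol = Objects.requireNonNull(protocol, "protocol");
        this.location = Objects.requireNonNull(location, "location");
        this.cache = Objects.requireNonNull(cache, "cache");
    }

    String getCategory() { return category; }
    String getProtocol() { return protocol; }
    String getLocation() { return location; }
    String getCache() { return cache; }

    /** Returns a copy pointing at a new location; other properties are unchanged. */
    DataSourceDescription withLocation(String newLocation) {
        return new DataSourceDescription(category, protocol, newLocation, cache);
    }
}

class DescriptionDemo {
    public static void main(String[] args) {
        DataSourceDescription seqHound = new DataSourceDescription(
                "protein", "SeqHound API", "http://seqhound.mshri.on.ca/", "in_memory");
        // If the service moves, only the location changes (placeholder URL).
        DataSourceDescription moved = seqHound.withLocation("http://example.org/seqhound/");
        System.out.println(moved.getProtocol() + " at " + moved.getLocation());
    }
}
```

Keeping the description immutable means a client can pass it around freely; moving a data source produces a new description rather than changing one that other code may already hold.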
We have a few options for building services and the factory. One option is to build a
factory with a single method, e.g. getService(). The getService() method receives a
data source description object, and returns a matching data service. Some sample code
is provided below:
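The following is a minimal sketch of the single-method option, not the actual phase 1 code. The registry keyed on category and protocol, the ProteinService interface, the stub service and the identifier "P12345" are all assumptions made for illustration; the description class is repeated in trimmed-down form so the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Trimmed-down description class (hypothetical shape, cache option omitted). */
final class DataSourceDescription {
    final String category;
    final String protocol;
    final String location;

    DataSourceDescription(String category, String protocol, String location) {
        this.category = category;
        this.protocol = protocol;
        this.location = location;
    }
}

/** Common supertype returned by the factory. */
interface DataService { }

/** Hypothetical protein-specific interface that clients cast to. */
interface ProteinService extends DataService {
    String describeProtein(String id);
}

/** Stub standing in for a real SeqHound client; it performs no network I/O. */
final class StubSeqHoundService implements ProteinService {
    private final String location;

    StubSeqHoundService(String location) { this.location = location; }

    @Override
    public String describeProtein(String id) {
        return "protein " + id + " from " + location;
    }
}

final class DataServiceFactory {
    // Maps "category/protocol" to a function that builds the matching service.
    private final Map<String, Function<DataSourceDescription, DataService>> registry =
            new HashMap<>();

    void register(String category, String protocol,
                  Function<DataSourceDescription, DataService> maker) {
        registry.put(category + "/" + protocol, maker);
    }

    /** Single entry point: new service types need no new factory methods. */
    DataService getService(DataSourceDescription description) {
        String key = description.category + "/" + description.protocol;
        Function<DataSourceDescription, DataService> maker = registry.get(key);
        if (maker == null) {
            throw new IllegalArgumentException("No data service registered for " + key);
        }
        return maker.apply(description);
    }
}

class SingleMethodDemo {
    public static void main(String[] args) {
        DataServiceFactory factory = new DataServiceFactory();
        factory.register("protein", "SeqHound API",
                d -> new StubSeqHoundService(d.location));

        DataSourceDescription description = new DataSourceDescription(
                "protein", "SeqHound API", "http://seqhound.mshri.on.ca/");
        // The client must know which interface to cast to.
        ProteinService proteins = (ProteinService) factory.getService(description);
        System.out.println(proteins.describeProtein("P12345"));
    }
}
```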
Using this option, the client will need to cast to the correct service interface.
The second option is to build a factory with multiple methods, e.g.
getProteinService(), getDnaService(), etc. These methods also receive a data source
description object and return a matching data service. Some sample code is provided
below:
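A comparable sketch of the multiple-method option follows. Again, the interface names, the stubbed services and the protocol checks are assumptions for illustration only; the point is that each factory method returns a specific interface, so the client never casts.

```java
/** Trimmed-down description class (hypothetical shape, cache option omitted). */
final class DataSourceDescription {
    final String category;
    final String protocol;
    final String location;

    DataSourceDescription(String category, String protocol, String location) {
        this.category = category;
        this.protocol = protocol;
        this.location = location;
    }
}

/** Hypothetical service interfaces, one per data category. */
interface ProteinService {
    String describeProtein(String id);
}

interface DnaService {
    String describeDna(String id);
}

final class TypedDataServiceFactory {

    /** Returns a protein service; the services here are stubs with no I/O. */
    ProteinService getProteinService(DataSourceDescription d) {
        requireCategory(d, "protein");
        if ("SeqHound API".equals(d.protocol)) {
            return id -> "protein " + id + " from " + d.location;
        }
        throw new IllegalArgumentException("Unsupported protein protocol: " + d.protocol);
    }

    /** Returns a DNA service; adding a new category means adding a new method. */
    DnaService getDnaService(DataSourceDescription d) {
        requireCategory(d, "DNA");
        if ("GenBank".equals(d.protocol)) {
            return id -> "DNA " + id + " from " + d.location;
        }
        throw new IllegalArgumentException("Unsupported DNA protocol: " + d.protocol);
    }

    private static void requireCategory(DataSourceDescription d, String expected) {
        if (!expected.equals(d.category)) {
            throw new IllegalArgumentException(
                    "Expected category " + expected + " but got " + d.category);
        }
    }
}

class TypedMethodsDemo {
    public static void main(String[] args) {
        TypedDataServiceFactory factory = new TypedDataServiceFactory();
        // No cast is needed: the return type is already ProteinService.
        ProteinService proteins = factory.getProteinService(new DataSourceDescription(
                "protein", "SeqHound API", "http://seqhound.mshri.on.ca/"));
        System.out.println(proteins.describeProtein("P12345"));
    }
}
```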
The advantage of the second approach is that the client code is much more specific, and
there is no need to rely on client casting of services. However, the advantage
of the first approach is that it is much more extensible, and
can support dozens of new services via the same getService() method.
Note: For phase 1, we decided to go forward with the first option.
UML diagrams of the core data service packages are provided below.
Figure 1: UML Diagram of the Dataservices Core Package.
Figure 2: UML Diagram of the Dataservices Services Package.
Broadly, we are considering 3 categories of data:
Data about parts (nodes)
Data about relationships/interactions (edges)
Results that take extended time to compute
More details:
Parts:
Biological sequence (Protein, DNA, RNA), Small molecule (e.g. drugs, metabolites)
Can be found in many databases, e.g. GenBank, SeqHound, SRS, Swiss-Prot/TrEMBL, InterPro, PIR, etc. Each category of part
has its own attributes that would be useful to fetch and display, e.g. proteins have domains and DNA has promoter regions.
Relationships/Interactions. Interactions between two or more things. Interactions between two things are the typical edge that we deal with.
Interactions between more than 2 things are sets of objects that will probably have to be represented as a special node that is a 'set' of parts
(maybe called a multinode?). The relationship will be defined by the parts that are in the relationship e.g. a protein-protein interaction can be
defined by two proteins, so a constructor that takes as arguments two proteins to form a relationship would return a protein-protein
interaction. The logic of the data services layer would understand that protein-protein interaction information can only come from certain databases.
Computed resources. This is anything that takes a significant amount of time to compute and return, such as BLAST.
This probably does not include BLAST results that are precomputed and sitting in a database ready to read. The computing process must take
"more time than the user is willing to sit around and wait for the return of data after a mouse click".
E.g. QBLAST. We might have to compute this data offline and save it in a cache.
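The idea that a relationship is defined by its parts can be sketched in Java as follows. All names here (PartType, Part, Interaction, kind()) are hypothetical, and the "multinode" for interactions among more than two parts is modelled simply as a longer participant list; how the data services layer would map an interaction kind to suitable databases is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical categories of parts (nodes). */
enum PartType { PROTEIN, DNA, RNA, SMALL_MOLECULE }

final class Part {
    final PartType type;
    final String id;

    Part(PartType type, String id) {
        this.type = type;
        this.id = id;
    }
}

/** An interaction (edge) whose kind is derived from its participants. */
final class Interaction {
    private final List<Part> participants;

    /** Requires at least two parts; extra parts make this a set-like interaction. */
    Interaction(Part first, Part second, Part... more) {
        List<Part> all = new ArrayList<>();
        all.add(first);
        all.add(second);
        all.addAll(Arrays.asList(more));
        this.participants = Collections.unmodifiableList(all);
    }

    List<Part> getParticipants() { return participants; }

    /** True for the typical two-party edge. */
    boolean isBinary() { return participants.size() == 2; }

    /** Participant types sorted by name and joined, so order of arguments is irrelevant. */
    String kind() {
        List<String> names = new ArrayList<>();
        for (Part p : participants) {
            names.add(p.type.name());
        }
        Collections.sort(names);
        return String.join("-", names);
    }
}

class InteractionDemo {
    public static void main(String[] args) {
        Part a = new Part(PartType.PROTEIN, "A");
        Part b = new Part(PartType.PROTEIN, "B");
        System.out.println(new Interaction(a, b).kind()); // prints PROTEIN-PROTEIN
    }
}
```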
This list is unprioritized and incomplete.
Protein
DNA
RNA
small molecule
molecular interactions between proteins, DNA, RNA and small molecules
BLAST results
Clustal (or other multiple sequence alignment - MSA) results
To the extent possible, we want to use existing open source libraries, rather than
reinvent the wheel. Currently, we are considering the following:
BioJava: provides numerous I/O facilities for
parsing specific biological data formats, such as FASTA and GenBank flat files.
HttpClient: open source HTTP client that provides several features beyond those
provided in the JDK API. In particular, it supports connection timeouts and
multipart form POST, which is useful for submitting records to databases that
accept form submissions and for uploading large files.
Java Caching Service: open source API for caching Java objects. Includes the
ability to create multiple caches within a single application, and the ability to
cache contents within memory and within a local file system.
$Log$
Revision 1.1 2003/05/30 19:13:49 ceramie
Initial commit
Revision 1.8 2003/05/16 21:23:25 cerami
Updated document to reflect reality of phase 1 implementation
Revision 1.7 2003/04/08 21:32:17 cerami
Fixed LI tag
Revision 1.6 2003/04/04 19:50:45 bader
Added more file format examples and fixed web services data source.
Revision 1.5 2003/04/04 19:08:29 cerami
Added tag
Revision 1.4 2003/04/04 17:03:54 cerami
Fixed errors in well-formedness