Protein Descriptor Calculation

General Information

The exponential growth of protein structural and sequence databases in recent years is enabling multifaceted approaches to the long-sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to learn models that effectively relate structure and function. However, transforming all the structural data into meaningful numerical features remains a key issue that requires an efficient and widely available solution.

ProtDCal (an acronym for Protein Descriptors Calculation program) is a new software suite that addresses this need. The program can generate tens of thousands of features from both sequence-based and 3D-structural descriptors, process lists of protein sequences and structures, export calculation configurations as projects, and run calculations in graphical or command-line mode. It is developed in the Java programming language (JDK version 1.7), which provides cross-platform support for any system where a Java Virtual Machine (JVM) is available. The Chemistry Development Kit (CDK) library is used within ProtDCal, mainly for handling protein input data. The program also offers additional functionality, including the computation of empirical models of protein folding free energy and rate constant, as well as protein-to-protein interaction features. Consequently, the software is suitable for tasks that require effective encoding of protein sequences and/or structures, such as protein classification and function prediction. The structure-based features generated by ProtDCal complete its general protein-encoding capability.

Input/Output Files

The program accepts two kinds of input files: structure files (PDB, ENT) and sequence files (FASTA/multi-FASTA). With structure files, the full descriptor generation capability of the program is enabled, whereas FASTA input enables only the sequence-based subset of indices. The program calculates the requested features and creates two tab-delimited files (*_AA.txt and *_Prot.txt). These files contain, respectively, all the residue-level indices and all the group-level descriptors for each input protein.
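As a rough illustration of how the two output files might be consumed downstream, the following Python sketch pairs each *_AA.txt file with its *_Prot.txt sibling and loads a tab-delimited file into a header and a list of rows. The function names (read_tab_delimited, find_output_pairs) are hypothetical helpers and not part of ProtDCal, and the presence of a header row is an assumption, since the exact column layout is not described here.

```python
import csv
from pathlib import Path


def read_tab_delimited(path):
    """Read a tab-delimited text file and return (header, rows).

    Assumption: the first non-blank line is a header row. The actual
    column layout of ProtDCal output is not specified in this document.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        return [], []
    return rows[0], rows[1:]


def find_output_pairs(directory):
    """Map each output prefix to its (*_AA.txt, *_Prot.txt) file pair.

    The second element is None when no matching *_Prot.txt file exists.
    """
    pairs = {}
    for aa_file in sorted(Path(directory).glob("*_AA.txt")):
        prefix = aa_file.name[: -len("_AA.txt")]
        prot_file = aa_file.with_name(prefix + "_Prot.txt")
        pairs[prefix] = (aa_file, prot_file if prot_file.exists() else None)
    return pairs
```

Calling find_output_pairs on the output directory and then read_tab_delimited on each path yields plain Python lists, which can be passed on to whatever data analysis tooling is preferred.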

Descriptor Generation Schema

ProtDCal's feature generation strategy comprises four hierarchical levels: