The exponential growth of protein structural and sequence databases in recent years is enabling multifaceted approaches to address the long-sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to learn models that effectively relate structure and function. However, transforming all the structural data into meaningful numerical features is a key issue that requires an efficient and widely available solution.
ProtDCal (acronym for Protein Descriptors Calculation program) is a new computational
software suite that addresses this need.
This program can generate tens of thousands of features, covering both sequence-based and 3D-structural descriptors,
process lists of protein sequences and structures, export calculation configurations as projects, and run calculations in graphical or
command-line mode.
In addition, it is developed in the Java programming language (JDK version 1.7), which provides cross-platform
support for any system where a Java Virtual Machine (JVM) is available.
The Chemistry Development Kit (CDK) library was employed within ProtDCal, mainly for
the manipulation of protein input data.
Furthermore, it offers additional functionalities, including the computation of empirical models of protein folding free energy, rate constant and protein-to-protein interaction features.
Consequently, this software is suitable for tasks which require effective encoding of protein sequences and/or
structures, such as protein classification and function prediction.
Additionally, the structure-based features generated by ProtDCal complete its general protein encoding capability.
The program accepts two kinds of input files: structure files (PDB, ENT) and sequence files (FASTA/multi-FASTA). With structure files, the full descriptor generation capability of the program is enabled, while FASTA files enable only the sequence-based subset of indices. The program calculates the requested features and creates two tab-delimited files (*_AA.txt and *_Prot.txt). These files contain, respectively, the compendium of all the residue-level indices and the group-level descriptors for each input protein.
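Because both output files are tab-delimited, they can be loaded with standard tooling. The following is a minimal Python sketch, not part of ProtDCal itself. It assumes that the first line of each file is a header row, which the description above does not state, and the function name `read_descriptor_table` and the file names are hypothetical.

```python
import csv
from pathlib import Path


def read_descriptor_table(path):
    """Read a tab-delimited descriptor file into a list of row dictionaries.

    Assumption: the first line is a header row naming the columns.
    The actual column layout of ProtDCal's output is not specified here.
    """
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Hypothetical file names following the *_AA.txt / *_Prot.txt pattern.
    for name in ("example_AA.txt", "example_Prot.txt"):
        if Path(name).exists():
            rows = read_descriptor_table(name)
            print(name, len(rows), "rows")
```

All values are returned as strings; converting the descriptor columns to floats is left to the caller, since the set of columns depends on the configuration of the calculation.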
Descriptor generation schema
ProtDCal's feature generation strategy comprises four hierarchical levels:
Indices: this layer is intended to select the type of indices to encode for each residue. These indices are grouped into three main classes:
Thermodynamic indices: these include all the novel indices designed in our laboratory, based on an empirical model of the main factors involved in the stability of protein structures. These indices are, in turn, divided into two panels: those defined for 3D folded structures and those based on information from the protein sequence. These indices refer to the contribution of the folded and unfolded states of a protein chain.
Topographic indices: these include many of the contact-based descriptors with proven correlation with the protein folding rate constant, e.g. the relative contact order (CO), the total contact distance (TCD), the cliquishness (CLQ), etc. These indices were originally defined as global metrics; however, they were modified to yield a value for each residue of a protein. Each contact of the protein is weighted by a residue property selected in this interface. The weight of a contact is obtained by multiplying the values of the selected property for the two residues that are in contact.
Property-based indices: this group comprises a number of physicochemical and structural properties of each type of residue, such as hydrophobicity, electronic charge index, molar weight, volume, isotropic surface area, etc. In addition to the implemented property-based indices, an option is included by which users can define their own property indices.
Modification operators: these operators modify the value of a selected index for a given residue according to the residues within a vicinity defined by the type of modification operator and its parameter value (e.g. for the autocorrelation operator with parameter k = 2, the neighbourhood of residue i comprises the residues in positions i ± 2). ProtDCal implements five modification operators that can be selected in the Weighting operators menu.
Groups: this level is intended to select one or more groups of residues according to their ID or type. When a group of residues is selected, an array of index values is obtained corresponding to the residues in the group. In addition to the implemented grouping approaches, an option is included by which users can define their own groups of residues.
Aggregation operators: these are used to combine an array of values (from a group of residues) into a single value (descriptor) reflecting the distribution of the index within that group. Some examples of these aggregation operators are the sum, average, variance, kurtosis, geometric mean, information content, etc.
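The four levels above can be illustrated with a short Python sketch. It is not ProtDCal's implementation: the property values in `TOY_PROPERTY` are made-up numbers, the function names are hypothetical, and `neighbour_sum` is a simplified stand-in for a modification operator (it only reuses the neighbourhood i ± k described for the autocorrelation operator, not that operator's actual formula).

```python
import math
from statistics import fmean, pvariance

# Made-up per-residue property values for illustration only
# (assumption: these are NOT the property tables used by ProtDCal).
TOY_PROPERTY = {"A": 2.0, "L": 4.0, "K": -4.0, "G": 0.0, "D": -3.0}


def residue_indices(sequence, table):
    """Level 1 (indices): assign a property-based index to every residue."""
    return [table[aa] for aa in sequence]


def neighbour_sum(values, k):
    """Level 2 (modification, hypothetical operator): add to each residue's
    value the values of the residues at positions i - k and i + k, when
    those positions exist in the chain."""
    n = len(values)
    modified = []
    for i, value in enumerate(values):
        total = value
        for j in (i - k, i + k):
            if 0 <= j < n:
                total += values[j]
        modified.append(total)
    return modified


def select_group(sequence, values, residue_types):
    """Level 3 (groups): keep the values of residues whose type is in the group."""
    return [v for aa, v in zip(sequence, values) if aa in residue_types]


def aggregate(values):
    """Level 4 (aggregation): collapse a group's values into scalar descriptors."""
    return {
        "sum": math.fsum(values),
        "average": fmean(values),
        "variance": pvariance(values),  # population variance
    }


if __name__ == "__main__":
    seq = "ALKGD"
    indices = residue_indices(seq, TOY_PROPERTY)
    modified = neighbour_sum(indices, k=2)
    group = select_group(seq, modified, {"A", "L"})
    print(aggregate(group))
```

Each combination of index, modification operator, group and aggregation operator yields one descriptor, which is consistent with the program being able to produce tens of thousands of features from a modest number of choices at each level.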