“Nature is a tinkerer, not an inventor” [1]. Similar sequences (or sequences with common patterns) are widely reused across kingdoms, not only because they carry out important molecular functions that evolved to best survive the environment, but also because they satisfy the physicochemical restraints required to function as part of a folded protein in water. Having learned so much from the now >170,000 PDB structures and the functions they encode, one enticing application is to extract their common structural elements (say, alpha-helices), each decorated by 20 types of local chemistry (in an order that preserves the structural helicity), and to reassemble them for potential medicinal uses.
A number of efforts have been made to establish fragment-based design of drug leads [2, 3] and the associated databases/software [4]. However, there has not been a systematic approach to assembling structurally resolved protein fragments for the design of therapeutics. Ideally, given a known functional motif, say, a motif that can be recognized by an antibody or a sequence pattern essential to bactericidal activity, is there a search engine that can locate all the matching structural fragments and report relevant analytics in a timely fashion? To address this need and showcase protein-fragment-based therapeutic design, we introduce the Therapeutic Peptide Design dataBase (TP-DB).
In this work, we extracted ~1.7 million (1,676,119) structurally resolved helices and their contacting/interacting helical partners from the Protein Data Bank (PDB). Helical propensity, coordination (contact) number, and an empirical potential derived from a representative set of interacting helices are computed and stored along with the helical sequences. We then established a search engine for patterns that meet specific physicochemical and structural needs rather than the evolutionary criteria required by Pattern Hit Initiated BLAST (PHI-BLAST) [5]. PHI-BLAST requires both a pattern and a template sequence as input. As a result, PHI-BLAST leaves out sequences that match the queried pattern but show no evidence of evolutionary homology to the template, whereas TP-DB does not apply such a homology filter.
Extraction of Helical Peptide Sequences from the Protein Data Bank (PDB)
To obtain amino acid sequences that fold into helices in nature, we processed the PDB [6] files of 130,000+ experimentally determined protein structures and extracted the secondary structure information from the header section of each file. This allowed us to obtain the peptide sequences corresponding to the helices in each protein. In addition, for each helix we documented its contacting neighbors in 3D space, that is, the adjacent helical peptides that lie <4 Å away (measured between heavy atoms).
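The extraction step can be illustrated with a minimal sketch, assuming the legacy fixed-column PDB text format in which HELIX header records give the chain and the first and last residue numbers of each helix. The function names (`parse_helix_records`, `chain_residues`, `helix_sequences`, `in_contact`) are hypothetical and are not taken from the TP-DB code; insertion codes, alternate locations, and non-standard residues are ignored for brevity.

```python
import math

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def parse_helix_records(pdb_text):
    """Return (chain_id, first_resseq, last_resseq) for each HELIX record."""
    helices = []
    for line in pdb_text.splitlines():
        if line.startswith("HELIX"):
            chain = line[19]            # column 20: initial chain ID
            first = int(line[21:25])    # columns 22-25: initial residue number
            last = int(line[33:37])     # columns 34-37: final residue number
            helices.append((chain, first, last))
    return helices

def chain_residues(pdb_text):
    """Map chain ID -> {residue number: one-letter code}, read from CA atoms."""
    residues = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]                          # column 22
            resseq = int(line[22:26])                 # columns 23-26
            code = THREE_TO_ONE.get(line[17:20], "X") # columns 18-20
            residues.setdefault(chain, {})[resseq] = code
    return residues

def helix_sequences(pdb_text):
    """Return (chain_id, first_resseq, sequence) for every annotated helix."""
    residues = chain_residues(pdb_text)
    result = []
    for chain, first, last in parse_helix_records(pdb_text):
        in_chain = residues.get(chain, {})
        # Residues missing from the coordinates are written as "X".
        seq = "".join(in_chain.get(i, "X") for i in range(first, last + 1))
        result.append((chain, first, seq))
    return result

def in_contact(heavy_atoms_a, heavy_atoms_b, cutoff=4.0):
    """True if any heavy-atom pair (x, y, z tuples) is closer than cutoff (Å)."""
    return any(math.dist(a, b) < cutoff
               for a in heavy_atoms_a for b in heavy_atoms_b)
```

Under these assumptions, `helix_sequences` would be run once per PDB file, and `in_contact` would be applied to the heavy-atom coordinates of each pair of helices to record contacting neighbors.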
Creation of TP-DB
The collected peptide sequences are organized into a searchable database by creating indexes that map peptide patterns (such as Y***G**K, which is equivalent to “Y 3 G 2 K”) to the locations where they occur in structurally solved proteins. The location of a given peptide pattern is defined by the PDB ID, the chain ID, and the index of the position where the pattern begins in that chain. For ease of computation and flexibility of expression without loss of accuracy, a given pattern (Y***G**K) is represented by a key (such as "Y 3 G 2 K") that combines letters and numbers. The keys serve as the indexes of the database. For example, the key "Y 3 G 2 K" points to all the locations where a "Y" is found such that the fourth (i.e. 3 + 1) amino acid downstream of it is a "G" and the third (i.e. 2 + 1) amino acid after that "G" is a "K".
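The relationship between a wildcard pattern and its key can be made concrete with a short sketch. The helper names `pattern_to_key` and `find_key` are hypothetical, and the 0-based start positions are an assumption of this example, not necessarily the convention used in TP-DB.

```python
def pattern_to_key(pattern, wildcard="*"):
    """Convert a wildcard pattern such as 'Y***G**K' into a key 'Y 3 G 2 K'."""
    tokens = []
    gap = 0
    for ch in pattern:
        if ch == wildcard:
            gap += 1
        else:
            if tokens:                  # emit the gap only between two anchors
                tokens.append(str(gap))
            tokens.append(ch)
            gap = 0
    return " ".join(tokens)

def find_key(sequence, key):
    """Return all 0-based start positions where the key matches the sequence."""
    parts = key.split()
    anchors = parts[0::2]
    gaps = [int(g) for g in parts[1::2]]
    # Offset of each anchor relative to the first one: a gap of g
    # places the next anchor g + 1 residues downstream.
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g + 1)
    hits = []
    for start in range(len(sequence) - offsets[-1]):
        if all(sequence[start + off] == a for off, a in zip(offsets, anchors)):
            hits.append(start)
    return hits
```

For the key "Y 3 G 2 K" the anchor offsets are 0, 4, and 7, which matches the "fourth amino acid downstream" and "third amino acid after G" reading given above.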
For each peptide p in the TP-DB (Fig. 1A), we scan the sequence of p and generate its keys, each containing three amino acids (henceforth referred to as anchors) at a time (Fig. 1B). The three anchors do not have to be consecutive amino acids in the sequence. By design, we allow zero to four amino acids between the first and second anchors, and between the second and third anchors. We represent each key obtained from the peptide as “CmDnE”, where C, D, and E are the one-letter codes of amino acids and m and n are the numbers of intervening residues. Therefore, a peptide p with sequence ADEKKFWGKYLYEVA has keys that range from A0D0E, D0E0K, …, E0V0A, A0D1K, …, K4Y4A, as shown in Fig. 1B. While scanning the sequence of p for its keys, we record the start position of each key at the same time. The keys and their start positions (values) are indexed to make up the database (Fig. 1C). For a given key-value pair in the database, the value is an array of peptide identifiers together with all the start positions of that key in each of the peptides (Fig. 1C). Building this index of key-value pairs constitutes the creation of the database itself.
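The key generation and indexing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `generate_keys` and `build_index`, the string peptide identifiers, and the 0-based start positions are hypothetical, and a nested in-memory dictionary stands in for whatever storage TP-DB actually uses.

```python
from collections import defaultdict

MAX_GAP = 4  # zero to four residues allowed between consecutive anchors

def generate_keys(sequence, max_gap=MAX_GAP):
    """Yield (key, start) for every three-anchor key 'CmDnE' in the sequence."""
    n = len(sequence)
    for start in range(n):
        for m in range(max_gap + 1):
            second = start + m + 1
            if second >= n:
                break                   # larger gaps would also run off the end
            for k in range(max_gap + 1):
                third = second + k + 1
                if third >= n:
                    break
                key = f"{sequence[start]}{m}{sequence[second]}{k}{sequence[third]}"
                yield key, start

def build_index(peptides):
    """Map key -> {peptide_id: [start positions]} for a dict of id -> sequence."""
    index = defaultdict(lambda: defaultdict(list))
    for peptide_id, sequence in peptides.items():
        for key, start in generate_keys(sequence):
            index[key][peptide_id].append(start)
    # Convert to plain dicts so that lookups of absent keys do not create entries.
    return {key: dict(hits) for key, hits in index.items()}

# Example with the peptide from the text; "pep1" is a made-up identifier.
index = build_index({"pep1": "ADEKKFWGKYLYEVA"})
```

With this layout each residue contributes at most 25 keys (5 choices for each of the two gaps), so the index grows linearly with the total length of the stored sequences.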
Querying the TP-DB
The design of the database and its indexes makes it easy to query the database even when the patterns of interest are not simple. To query the database, the user specifies at least three anchor amino acids and the number of amino acids between the anchors, such that “ADE”, “K----VA”, and “K----Y----A” can be queried using “A 0 D 0 E”, “K 4 V 0 A”, and “K 4 Y 4 A”, respectively (Fig. 1D). For such simple queries (such as “K 4 Y 4 A”), the results are fetched directly from the database index, which resides in the server’s RAM (Fig. 1E). The same index design also allows efficient querying of non-trivial patterns, as discussed next.
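A simple query therefore reduces to validating the user input and performing a single dictionary lookup. The sketch below assumes a compact key form ("K4Y4A") and a hand-written toy index; `normalize_query`, `lookup`, and the identifier "pep1" are hypothetical names used only for illustration.

```python
def normalize_query(query, max_gap=4):
    """Turn a simple query such as 'K 4 Y 4 A' into the compact key 'K4Y4A'."""
    tokens = query.split()
    if len(tokens) != 5:
        raise ValueError("a simple query needs three anchors and two gaps")
    anchors, gaps = tokens[0::2], tokens[1::2]
    for anchor in anchors:
        if len(anchor) != 1 or not anchor.isalpha():
            raise ValueError(f"invalid anchor: {anchor!r}")
    for gap in gaps:
        if not gap.isdigit() or int(gap) > max_gap:
            raise ValueError(f"gap must be an integer from 0 to {max_gap}: {gap!r}")
    return f"{anchors[0]}{int(gaps[0])}{anchors[1]}{int(gaps[1])}{anchors[2]}"

def lookup(index, query):
    """Fetch {peptide_id: [start positions]} for a simple query; {} if absent."""
    return index.get(normalize_query(query), {})

# Toy in-memory index (made-up content) standing in for the index held in RAM.
toy_index = {
    "A0D0E": {"pep1": [0]},
    "K4V0A": {"pep1": [8]},
    "K4Y4A": {"pep1": [4]},
}
hits = lookup(toy_index, "K 4 Y 4 A")
```

Because the lookup is a single hash-table access, the cost of a simple query is independent of the number of stored peptides, apart from the size of the returned hit list.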
We define a non-trivial pattern as a pattern that is not directly a key in the TP-DB but that can be pre-processed into a combination of simple patterns, and subsequently into simple queries that can be looked up directly in the database. Therefore, when a non-trivial pattern is queried, we break it down until we reach components that correspond to keys that could be found in the database, as illustrated with the examples in Fig. 1F. For instance, “A/Y 3 G 2 K 3 H 4 K” is broken down into two sub-queries (matching patterns), “A 3 G 2 K 3 H 4 K” and “Y 3 G 2 K 3 H 4 K”. The five anchors in each of the matching patterns are then treated as a combination of two keys with three anchors each, such that the third anchor of the first key is the same as the first anchor of the second key (Fig. 1F). We then query the database for these keys. A systematic combination of the results from all the keys, taking into account the regions where the keys overlap, makes it possible to construct the results for the non-trivial query.
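The decomposition can be sketched in three steps: expand the alternatives, split each expanded pattern into three-anchor keys that share an anchor, and intersect the hits after shifting each key's start positions back to the start of the full pattern. The function names and the toy index are hypothetical. The handling of patterns with an even number of anchors (a final key that overlaps the previous one by two anchors) is an assumption of this sketch and is not described in the text.

```python
from itertools import product

def expand_alternatives(query):
    """Yield (anchors, gaps) for each choice of alternatives such as 'A/Y'."""
    tokens = query.split()
    choices = [token.split("/") for token in tokens[0::2]]
    gaps = [int(g) for g in tokens[1::2]]
    for combo in product(*choices):
        yield list(combo), gaps

def split_into_keys(anchors, gaps):
    """Return [(key, offset)] where offset locates the key within the pattern."""
    positions = [0]
    for g in gaps:
        positions.append(positions[-1] + g + 1)
    # Consecutive keys share one anchor: windows start at anchors 0, 2, 4, ...
    starts = list(range(0, len(anchors) - 2, 2))
    if (len(anchors) - 3) % 2:
        starts.append(len(anchors) - 3)  # assumption: cover a leftover anchor
    return [
        (f"{anchors[i]}{gaps[i]}{anchors[i + 1]}{gaps[i + 1]}{anchors[i + 2]}",
         positions[i])
        for i in starts
    ]

def query_pattern(index, query):
    """Return sorted (peptide_id, start) pairs matching a non-trivial pattern."""
    results = set()
    for anchors, gaps in expand_alternatives(query):
        per_key = []
        for key, offset in split_into_keys(anchors, gaps):
            # Shift each hit so that it refers to the start of the full pattern.
            per_key.append({(pid, s - offset)
                            for pid, starts in index.get(key, {}).items()
                            for s in starts})
        results |= set.intersection(*per_key)
    return sorted(results)

# Toy index with made-up content: pep1 contains Y...G..K...H....K from position 0.
toy_index = {
    "Y3G2K": {"pep1": [0]},
    "A3G2K": {"pep2": [1]},
    "K3H4K": {"pep1": [7], "pep2": [3]},
}
matches = query_pattern(toy_index, "A/Y 3 G 2 K 3 H 4 K")
```

In this toy example only pep1 satisfies both keys at consistent positions: its Y3G2K hit starts at 0 and its K3H4K hit starts at 7, which is exactly the offset of the third anchor in the full pattern.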
[1] Jacob, F., Evolution and tinkering. Science, 1977. 196(4295): p. 1161-1166.
[2] Jencks, W.P., On the attribution and additivity of binding energies. Proceedings of the National Academy of Sciences, 1981. 78(7): p. 4046-4050.
[3] Erlanson, D.A., Introduction to Fragment-Based Drug Discovery, in Fragment-Based Drug Discovery and X-Ray Crystallography, T.G. Davies and M. Hyvönen, Editors. 2012, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 1-32.
[4] Kutchukian, P.S., D. Lou, and E.I. Shakhnovich, In Silico Fragment-Based Generation of Drug-Like Compounds, in Library Design, Search Methods, and Applications of Fragment-Based Drug Design. 2011, American Chemical Society. p. 151-177.
[5] Zhang, Z., et al., Protein sequence similarity searches using patterns as seeds. Nucleic Acids Research, 1998. 26(17): p. 3986-3990.
[6] Gilliland, G., et al., The Protein Data Bank. Nucleic Acids Research, 2000. 28(1): p. 235-242.