A Comparative Genomics Pipeline for In Silico Characterization and Functional Annotation of Short Hypothetical Proteins

Soumyajit Guha, Shuvam Das, Sayak Ganguli


Hypothetical proteins are proteins whose existence has been predicted but for which experimental evidence of structure, function, or linkage to any known gene is lacking. Genome sequencing has produced numerous predicted open reading frames to which structure or function cannot be readily assigned, and these can make up a significant portion of a genome. In this study, we designed a pipeline for efficient functional annotation of short hypothetical proteins (fewer than 400 amino acids) and compared two case studies, using amino acid sequence information retrieved from two different protein databases. In silico analysis of the likely functions of the hypothetical proteins was performed with computational methods and tools based on sequence similarity, identification of targeting signals, presence of known protein domains, physicochemical characterization, etc. In case study 1, our pipeline annotated 90 of 100 hypothetical proteins, compared with 82 annotated by the evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) database, a gain of 8 percentage points. In case study 2, it annotated 78 of 96 hypothetical proteins, compared with 58 annotated by eggNOG, a gain of about 20.83 percentage points. Some hypothetical proteins also had a high aliphatic index, which indicates higher thermostability in extreme environments. The subcellular localization of cytoplasmic and membrane proteins was predicted with higher accuracy than that of other proteins. Hypothetical proteins can provide insight into unknown protein structures and functions and are an important area for further research.


Annotation, database, hypothetical protein, in silico, protein sequences, subcellular.
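
Three quantities mentioned in the abstract can be illustrated with a minimal Python sketch: the length cutoff for "short" proteins, the aliphatic index, and the annotation gain over eggNOG. This is not the authors' pipeline code. The aliphatic index follows the standard definition of Ikai (1980), with coefficients 2.9 and 3.9 applied to mole percentages, and the function names `is_short`, `aliphatic_index`, and `coverage_gain` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Coefficients from Ikai (1980): relative side-chain volume of Val and of
# Ile/Leu compared with Ala. These are assumed here, not taken from the abstract.
A_VAL = 2.9
B_ILE_LEU = 3.9

# Cutoff stated in the abstract: only proteins shorter than 400 residues.
MAX_LENGTH = 400


def is_short(sequence: str, max_length: int = MAX_LENGTH) -> bool:
    """Return True if the sequence has fewer than max_length residues."""
    return len(sequence) < max_length


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index computed from mole percentages of Ala, Val, Ile, Leu."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)

    def mole_percent(residue: str) -> float:
        return 100.0 * counts[residue] / n

    return (
        mole_percent("A")
        + A_VAL * mole_percent("V")
        + B_ILE_LEU * (mole_percent("I") + mole_percent("L"))
    )


def coverage_gain(annotated_pipeline: int, annotated_reference: int, total: int) -> float:
    """Difference in annotation coverage, in percentage points of the total."""
    return 100.0 * (annotated_pipeline - annotated_reference) / total


if __name__ == "__main__":
    toy = "MAVILKAVLLG"  # made-up toy sequence, not from the study
    print(is_short(toy), round(aliphatic_index(toy), 2))
    print(coverage_gain(90, 82, 100))            # case study 1
    print(round(coverage_gain(78, 58, 96), 2))   # case study 2
```

With the counts reported in the abstract, `coverage_gain` reproduces the stated differences for both case studies, which shows that they are differences in coverage (percentage points of the protein set) rather than relative increases over the eggNOG count.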

DOI: http://dx.doi.org/10.11594/jtls.10.02.06

Copyright (c) 2020 Journal of Tropical Life Science