I General Information

I.1 A short introduction to protein groups in shotgun proteomics

I.2 Why Prophane?

II The Prophane Workflow

II.1 Overviewing the Prophane pipeline

II.2 Details on data submission

II.3 Details on FASTA file submission

II.4 Details on the annotation module

II.5 Details on annotation proofing

II.6 Details on taxonomic analysis

II.7 Details on functional analysis

II.8 Details on quantification

III The output provided by the Prophane pipeline

III.1 Overviewing the output

IV References

I General Information
I.1 A short introduction to protein groups in shotgun proteomics

One of the most frequently used approaches in proteomics, called bottom-up or shotgun proteomics, relies on tandem mass spectrometry of peptides after enzymatic protein digestion and subsequent correlation of the obtained spectral data with amino acid sequences of a given protein database (Eng et al., 1994).
The final protein identifications have to be inferred from the resulting peptide spectrum matches (PSMs). This can be surprisingly difficult because a peptide can be shared by many different proteins, for example as part of a conserved motif (for a review see Nesvizhskii and Aebersold, 2005). The proportion of PSMs that can be assigned to more than one protein is especially high in metaproteomic analyses, where target databases naturally contain many homologous proteins from closely related organisms. To handle this, proteins can be clustered based on their shared PSMs (e.g. as described by Koskinen et al., 2011). Each of these clusters (or protein groups) is represented by a master protein, which is selected based on PSM coverage and probability scores. However, the algorithms involved are highly diverse, so please refer to the manual of the software you are using for more information on its protein grouping.
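The idea of clustering proteins by shared PSMs can be illustrated with a minimal sketch. It assumes a plain mapping from peptide sequences to the accessions they match and simply joins all proteins that are linked, directly or indirectly, by a shared peptide. The function name, the input layout, and the example peptides and accessions are hypothetical. Real search software uses more elaborate algorithms and also selects a master protein, which is not modelled here.

```python
from collections import defaultdict


def group_proteins(psm_map):
    """Cluster proteins that are linked, directly or indirectly, by shared peptides.

    psm_map: dict mapping a peptide sequence to the set of protein accessions
    it matches (hypothetical input layout).
    Returns a sorted list of protein groups, each a sorted list of accessions.
    """
    parent = {}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for proteins in psm_map.values():
        proteins = sorted(proteins)
        for p in proteins:
            parent.setdefault(p, p)
        # All proteins matched by the same peptide end up in one cluster.
        for p in proteins[1:]:
            parent[find(p)] = find(proteins[0])

    clusters = defaultdict(list)
    for p in parent:
        clusters[find(p)].append(p)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: P1 and P3 share no peptide but are linked through P2.
example = {
    "LVNELTEFAK": {"P1", "P2"},
    "AEFVEVTK": {"P2", "P3"},
    "YLYEIAR": {"P4"},
}
groups = group_proteins(example)
```

In this example P1, P2, and P3 form one group and P4 forms a group of its own.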

I.2 Why Prophane?

In metaproteomics, highly complex protein mixtures are analyzed to gain new insights into the taxonomic and functional diversity of more or less defined ecosystems. As described in I.1, protein groups resulting from such analyses can consist of many proteins that share PSMs but originate from numerous taxa and are involved in various functions. Thus, it is difficult or even impossible to choose a single master protein that represents the whole group on both the taxonomic and the functional level. With Prophane we provide a fully automatic (but highly adaptable) workflow that does not rely on master protein information but considers all protein members of each protein group (first described in Schneider et al., 2011). Each group is analyzed for commonalities between its members on both the taxonomic and the functional level. Additionally, Prophane eases data inspection and interpretation by organizing all relevant information and analysis results in intuitive and interactive tables.

II The Prophane Workflow
II.1 Overviewing the Prophane pipeline

[Figure: simplified flow chart of the Prophane workflow]

The flow chart above shows a simplified model of the workflow provided by the Prophane bioinformatics pipeline. You can move the mouse cursor over the different elements to get more information.

II.2 Details on data submission

So far, Prophane accepts protein report files exported by Scaffold or Scaffold Viewer (Proteome Software, http://www.proteomesoftware.com). Such protein reports are essentially tab-delimited text files with the extension xls, which can be opened with Microsoft Excel, OpenOffice Calc, or any text editor. Scaffold’s protein reports are divided into two parts: first, an introductory list of the parameters used for the database search and, second, the data table. Please make sure that the data table contains columns headed by

The data table has to be closed by "END OF FILE" (last data table row). Additionally, a point (.) has to be used as the decimal separator. Prophane accepts protein reports from merged data sets and considers the sample and replicate names given in the respective columns of the data table. If you have further questions about Scaffold’s protein report, please refer to Scaffold’s manual.
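The two formal requirements above (closing "END OF FILE" row, point as decimal separator) can be checked before submission. The following is only a rough sketch under stated assumptions: it expects the data-table rows as tab-delimited strings, the function name is hypothetical, the column names in the usage example are placeholders and not Scaffold's real headers, and the decimal-comma test is a simple heuristic.

```python
import re

# Heuristic: a cell that consists only of digits, one comma and digits
# (optionally signed or followed by %) is taken as a decimal comma.
DECIMAL_COMMA = re.compile(r"-?\d+,\d+%?")


def check_report(lines):
    """Return a list of problems found in the data-table part of a report.

    lines: data-table rows as tab-delimited strings, header row first
    (hypothetical input layout).
    """
    problems = []
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if not rows or rows[-1].split("\t")[0].strip() != "END OF FILE":
        problems.append("data table is not closed by END OF FILE")
    for number, row in enumerate(rows, start=1):
        for cell in row.split("\t"):
            if DECIMAL_COMMA.fullmatch(cell.strip()):
                problems.append(f"row {number}: decimal comma in {cell.strip()!r}")
    return problems


# Placeholder column names; these are not Scaffold's actual column headers.
good = ["Accession\tScore", "P1\t0.95", "END OF FILE"]
bad = ["Accession\tScore", "P1\t0,95"]
```

An empty result list means that neither of the two problems was detected. A value such as 1,234 used as a thousands grouping would also be flagged by this heuristic.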

II.3 Details on FASTA file submission

Prophane accepts FASTA files that meet the official standard. There is no length limit for either header lines or sequence lines. Multiple headers have to be separated by SOH (start of heading; ASCII character 001).

[Figure: tag-dependent accession recognition in a FASTA header]

The figure demonstrates the tag-dependent accession recognition provided by Prophane. Each accession number has to be preceded by an accession type tag (e.g. gi) followed by |, and the tag itself has to be preceded by |, the header start (>), or the start of a further header (SOH). If an accession (defined by accession string and accession type) listed in the submitted protein report is found, Prophane stores the header information and the amino acid sequence. Importantly, if the header line consists of multiple headers, Prophane adds all accessions of those subheaders that do not share any accession with the respective protein group. This is necessary because many software suites performing MS database searches consider header information only up to the first SOH.
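The recognition rule described above can be approximated with a regular expression. This is a sketch of the rule as stated in the text, not Prophane's actual parser: it assumes that a type tag consists of letters only, and the function name and the example header are hypothetical.

```python
import re

SOH = "\x01"  # ASCII character 001, separates multiple headers on one line

# A type tag, then "|", then the accession string. The tag must directly
# follow "|" or the start of a (sub)header.
ACCESSION = re.compile(r"(?:^|\|)([A-Za-z]+)\|([^|\s]+)")


def parse_header(header_line):
    """Split a FASTA header line at SOH and list (type, accession) pairs per subheader."""
    subheaders = header_line.lstrip(">").split(SOH)
    return [ACCESSION.findall(sub) for sub in subheaders]


# Made-up header line with two subheaders.
header = (
    ">gi|123|ref|NP_000001.1| some protein [Homo sapiens]"
    + SOH
    + "gi|456|gb|AAA00001.1| other protein"
)
accessions = parse_header(header)
```

Each subheader yields its own list, so a caller can decide per subheader whether any of its accessions already belongs to a protein group.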

II.4 Details on the annotation module

Prophane tries to retrieve annotation data for every protein accession found in your protein report or in multiple headers (see II.3). Depending on the accession type, different sources are considered:

Taxonomic information can be extracted from the FASTA file if specific prefixes and suffixes are used in the headers. If no taxonomic information is found in any source (e.g. for unknown accession types), Prophane can use BLAST against the NCBI NR protein database to transfer taxonomic information from the best hit. For this purpose, the user can define a threshold (maximal e-value).

II.5 Details on annotation proofing

If annotation data have been retrieved from different sources, Prophane checks whether the taxonomic and sequence information is consistent (functional annotation is not checked since it is not standardized, see II.7). If it is not, or if data are missing, the user is informed and has to select the correct information or submit the missing information manually. If sequence information is missing or inconsistent, it is very important that you provide or select the sequence listed in the target database that was used for spectra correlation (see I.1).

II.6 Details on taxonomic analysis

Since taxonomic annotation is standardized, comparison is straightforward. Prophane considers seven taxonomic levels: superkingdom, phylum, class, order, family, genus, and species. The taxonomic lineage of each protein group is derived by comparing the taxonomic lineages of its protein members. If the taxonomic unit at a given level is shared by all group members, it is assigned to the group. If not, that level and all subordinate levels are reported as "heterogeneous".
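The comparison rule can be sketched in a few lines. The function name and the dictionary layout of a lineage are assumptions made for illustration, and the two example organisms are arbitrary.

```python
LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def group_lineage(member_lineages):
    """Derive a group's lineage from the lineages of its members.

    member_lineages: list of dicts mapping each level in LEVELS to a taxon name
    (hypothetical input layout).
    Returns a dict with the shared taxon per level; the first level at which
    the members disagree and all levels below it are "heterogeneous".
    """
    result = {}
    diverged = False
    for level in LEVELS:
        taxa = {lineage[level] for lineage in member_lineages}
        if not diverged and len(taxa) == 1:
            result[level] = taxa.pop()
        else:
            diverged = True
            result[level] = "heterogeneous"
    return result


# Two members that agree down to family level and differ from genus onwards.
shared = {
    "superkingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
}
members = [
    {**shared, "genus": "Escherichia", "species": "Escherichia coli"},
    {**shared, "genus": "Salmonella", "species": "Salmonella enterica"},
]
lineage = group_lineage(members)
```

For this group, the levels superkingdom to family carry the shared names, while genus and species are reported as "heterogeneous".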

II.7 Details on functional analysis

In contrast to taxonomic annotation (see II.6), functional data are quite diverse and less standardized. To allow the comparison of protein group members on the functional level, Prophane performs functional predictions for each protein using RPSBLAST or HMMER3. Please read the manuals (RPSBLAST: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSBWhat; HMMER3: ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf) to get more information about these algorithms.

Using RPSBLAST, COG and KOG classifications (Tatusov et al., 2003) can be assigned to prokaryotic and eukaryotic proteins, respectively. The user can define a maximal e-value. Prophane automatically uses the taxonomic information of each protein to choose between the COG and the KOG collection.

Using HMMER3, functional predictions are based on hidden Markov model profiles provided by TIGRFAMs (Haft et al., 2013) or Pfam (Punta et al., 2012). In contrast to Pfam, TIGRFAMs covers mainly prokaryotic proteins. The lowest classification levels of TIGRFAMs and Pfam are functions and motifs/domains, respectively. Results can be restricted by a maximal e-value threshold or by the recommended gathering threshold.

Finally, the functional prediction that is shared by all members is assigned to the protein group. If several different functional predictions are common to all group members, the prediction with the lowest overall e-value (calculated by summing up all e-values returned for the respective prediction) is selected. If there is no common functional prediction, the group is reported as functionally "heterogeneous".
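The final selection step can be sketched as follows. The sketch assumes that each group member contributes at most one e-value per prediction; the function name, the input layout, and the identifiers in the example are hypothetical, and ties are broken alphabetically, which the text does not specify.

```python
def group_function(member_predictions):
    """Pick the prediction shared by all members with the lowest summed e-value.

    member_predictions: one dict per group member, mapping a prediction
    identifier (e.g. a COG or Pfam ID) to the e-value returned for that member
    (hypothetical input layout).
    Returns the chosen identifier, or "heterogeneous" if nothing is shared.
    """
    if not member_predictions:
        return "heterogeneous"
    shared = set(member_predictions[0])
    for predictions in member_predictions[1:]:
        shared &= set(predictions)
    if not shared:
        return "heterogeneous"
    # Sorting first makes the result deterministic if two sums are equal.
    return min(sorted(shared), key=lambda pid: sum(p[pid] for p in member_predictions))


# Both predictions are shared; COG0001 has the lower summed e-value.
members = [
    {"COG0001": 1e-30, "COG0002": 1e-10},
    {"COG0001": 1e-20, "COG0002": 1e-40},
]
chosen = group_function(members)
```

Note that summing e-values lets the weakest hit dominate: in the example, COG0002 has the single best e-value but loses because of its weak hit in the first member.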

II.8 Details on quantification

Prophane estimates protein abundance based on spectral counts. The normalized spectral abundance factor (NSAF; introduced by Zybailov et al., 2006) is calculated in a slightly modified form:
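Prophane's modification is not specified in this section. For orientation only, the sketch below computes the NSAF in its commonly cited, unmodified form: the spectral count of each protein is divided by its length, and the result is divided by the sum of these ratios over all proteins. The function name and the input layout are hypothetical, and the code should not be read as Prophane's variant.

```python
def nsaf(spectral_counts, lengths):
    """Compute the unmodified NSAF per protein.

    spectral_counts: dict mapping a protein to its spectral count.
    lengths: dict mapping a protein to its length in amino acids.
    Returns a dict of NSAF values that sum to 1.
    """
    # Spectral abundance factor: spectral count normalized by protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("all spectral counts are zero")
    return {p: value / total for p, value in saf.items()}


# Made-up example: B has three times the counts of A but is three times as long.
values = nsaf({"A": 10, "B": 30}, {"A": 100, "B": 300})
```

In the example both proteins receive the same NSAF, because the higher spectral count of B is offset by its greater length.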

III The output provided by the Prophane pipeline
III.1 Overviewing the output

Prophane provides the user with a single output file that can be opened in any common web browser (it works best in the latest versions of Chrome and Firefox). All images, data, and functionality are embedded in this file to simulate a complete, intuitive result website.
Please note that JavaScript has to be enabled and that the web browser has to support CSS3 (if you are not sure, please update your browser and refer to its manual).
The protein report is divided into different sections:

To save the output, choose "SAVE AS HTML ONLY" in your web browser (see your browser’s manual for help).

IV References


Eng, J., A. McCormack and J. Yates (1994). "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database." Journal of the American Society for Mass Spectrometry 5(11): 976-989.

Haft, D. H., J. D. Selengut, R. A. Richter, D. Harkins, M. K. Basu and E. Beck (2013). "TIGRFAMs and Genome Properties in 2013." Nucleic Acids Research 41(D1): D387-395.

Koskinen, V. R., P. A. Emery, D. M. Creasy and J. S. Cottrell (2011). "Hierarchical clustering of shotgun proteomics data." Molecular & Cellular Proteomics 10(6): M110.003822.

Nesvizhskii, A. I. and R. Aebersold (2005). "Interpretation of shotgun proteomic data: the protein inference problem." Molecular & Cellular Proteomics 4(10): 1419-1440.

Punta, M., P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnhammer, S. R. Eddy, A. Bateman and R. D. Finn (2012). "The Pfam protein families database." Nucleic Acids Research 40(Database issue): D290-301.

Schneider, T., E. Schmid, J. V. de Castro Jr., M. Cardinale, L. Eberl, M. Grube, G. Berg and K. Riedel (2011). "Structure and function of the symbiosis partners of the lung lichen (Lobaria pulmonaria L. Hoffm.) analyzed by metaproteomics." Proteomics 11(13): 2752-2756.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin and D. A. Natale (2003). "The COG database: an updated version includes eukaryotes." BMC Bioinformatics 4: 41.

Zybailov, B., A. L. Mosley, M. E. Sardiu, M. K. Coleman, L. Florens and M. P. Washburn (2006). "Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae." Journal of Proteome Research 5(9): 2339-2347.


© 2013 - 2014