Report files
The report file is the protein identification file of the mass spectrometry analysis and must contain information about protein groups (also known as metaproteins or protein families). Prophane supports various input formats, as listed below. If you plan to use another input format on a regular basis, please contact us (support@prophane.de).
So far, Prophane supports proteomic search result data provided in:
FASTA file
The FASTA file can be plain or gzipped and should contain sequences for all identified proteins. Ensure that the protein accessions or IDs in the FASTA file match those used in the protein identification report. You can use the same FASTA file you used for peptide-spectrum matching. However, an excessively large FASTA file can significantly increase Prophane's runtime. Ideally, the FASTA file should contain all sequences present in the identification file and no more. Avoid sequence accession IDs containing pipes ('|') to ensure consistent data processing by Prophane. Sequence IDs with two or more pipes, such as UniProt protein accessions, are shortened to the accession (e.g. from sp|O70558|SPR2G_MOUSE to O70558).
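The two recommendations above (short accessions, no surplus sequences) can be prepared before upload. The following is a minimal sketch, not Prophane's own code: the function names `short_accession` and `subset_fasta` are hypothetical, and it assumes that for IDs with two or more pipes the accession is the field between the first two pipes, as in the example above.

```python
def short_accession(seq_id: str) -> str:
    """Reduce IDs with two or more pipes to the field between the first two pipes."""
    parts = seq_id.split("|")
    return parts[1] if len(parts) >= 3 else seq_id


def subset_fasta(fasta_text: str, wanted: set) -> str:
    """Keep only FASTA records whose (shortened) ID is in `wanted`."""
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            keep = short_accession(seq_id) in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


fasta = ">sp|O70558|SPR2G_MOUSE some protein\nMSYQ\n>WP_000665196.1\nMKKL\n"
reduced = subset_fasta(fasta, {"O70558"})
```

Here `reduced` contains only the first record; the header line itself is left unchanged.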
Report File Formats
Generic Format
The generic format is a tab-separated table that is designed to be easy to create. The column "protein accessions" contains all protein accessions of one protein group, and the quantification is based on spectral counts per protein group. Example table:
sample category | sample name | protein accessions | spectrum count |
---|---|---|---|
sample A | replicate 1 | WP_003131952.1, NP_268346.1, KZK33282.1 | 1 |
sample A | replicate 1 | WP_000665196.1 | 2 |
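Such a table can be written with any tool that produces tab-separated text. The sketch below uses Python's `csv` module; the column names are taken from the example table, while the helper name `write_generic_report` and the use of ", " to join accessions within a group (as shown in the example) are assumptions.

```python
import csv
import io

COLUMNS = ["sample category", "sample name", "protein accessions", "spectrum count"]


def write_generic_report(rows, handle):
    """Write (category, name, accessions, count) tuples as a tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for category, name, accessions, count in rows:
        # All accessions of one protein group go into a single cell.
        writer.writerow([category, name, ", ".join(accessions), count])


rows = [
    ("sample A", "replicate 1", ["WP_003131952.1", "NP_268346.1", "KZK33282.1"], 1),
    ("sample A", "replicate 1", ["WP_000665196.1"], 2),
]
buffer = io.StringIO()
write_generic_report(rows, buffer)
report_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.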
mzTab 1.0.0 exchange format
Data provided in the mzTab exchange format can be processed by Prophane. Protein group information must be provided in the ambiguity_members column, and the PSM section must be present. To distinguish between different samples, the entries in the spectra_ref column of the PSM section must have the following format: ms_run[1-n]:"SPECTRA_REFERENCE". This format must also be used if only one MS run is analyzed. Quantification is performed by counting the spectra/PSMs assigned to a protein group.
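The spectra_ref requirement can be checked before submitting a file. This is a minimal sketch under stated assumptions: it relies on the mzTab convention that the PSM header line starts with `PSH` and PSM rows start with `PSM`, it only handles a single reference per cell, and it counts PSMs per MS run rather than per protein group. The function name `count_psms_per_run` is hypothetical.

```python
import re
from collections import Counter

# Expected shape of a spectra_ref entry: ms_run[<number>]:<reference>
SPECTRA_REF = re.compile(r"^(ms_run\[\d+\]):.+$")


def count_psms_per_run(mztab_lines):
    """Validate spectra_ref entries and count PSM rows per ms_run."""
    counts, col = Counter(), None
    for line in mztab_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "PSH":
            col = fields.index("spectra_ref")
        elif fields[0] == "PSM":
            if col is None:
                raise ValueError("PSM row found before the PSH header")
            match = SPECTRA_REF.match(fields[col])
            if match is None:
                raise ValueError(f"unexpected spectra_ref: {fields[col]!r}")
            counts[match.group(1)] += 1
    return counts


example = [
    "PSH\tsequence\tPSM_ID\tspectra_ref",
    "PSM\tPEPTIDE\t1\tms_run[1]:index=5",
    "PSM\tPEPTIDEK\t2\tms_run[2]:index=7",
    "PSM\tPEPTIDER\t3\tms_run[1]:index=9",
]
psm_counts = count_psms_per_run(example)
```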
mzIdentML 1.2.0 exchange format
Data provided in the mzIdentML 1.2.0 exchange format containing ProteinAmbiguityGroup elements can be processed by Prophane. mzIdentML 1.2.0 allows similar proteins to be summarized in 'ProteinAmbiguityGroup' sections. These groups are essential for performing metaproteome analysis with Prophane. Since mzIdentML 1.1.0 does not support protein groups, such files cannot be processed by Prophane. Quantification is performed by counting the spectra/PSMs assigned to a protein group.
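Whether a file contains any protein groups can be checked quickly with the standard library. The sketch below simply counts `ProteinAmbiguityGroup` elements regardless of XML namespace; the function name is hypothetical and the embedded document is a stripped-down illustration, not a valid mzIdentML file.

```python
import io
import xml.etree.ElementTree as ET


def count_ambiguity_groups(source) -> int:
    """Count ProteinAmbiguityGroup elements in a file path or binary file object."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}ProteinAmbiguityGroup"; compare the local name.
        if elem.tag.rsplit("}", 1)[-1] == "ProteinAmbiguityGroup":
            count += 1
        elem.clear()  # keep memory use low on large files
    return count


example = b"""<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.2">
  <ProteinDetectionList id="PDL_1">
    <ProteinAmbiguityGroup id="PAG_1"/>
    <ProteinAmbiguityGroup id="PAG_2"/>
  </ProteinDetectionList>
</MzIdentML>"""
n_groups = count_ambiguity_groups(io.BytesIO(example))
```

A result of 0 suggests that the file cannot be used for a protein-group-based analysis.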
MPA (single experiment)
Data provided by Metaproteome Analyzer (MPA) can be used by exporting the metaprotein report (CSV).
MPA (multiple experiments)
Data provided by Metaproteome Analyzer (MPA) can be used by exporting the metaprotein multisample report (CSV).
Proteome Discoverer protein group
Proteome Discoverer protein-group data export:
1. Check for filters set under "Display Filters -> Proteins" and remove the filter for master proteins.
2. Go to the Protein Groups table and request "Show associated tables -> Proteins".
3. Export this record as follows: File -> Export -> To Microsoft Excel.
The abundance of the master protein is used as the quantification value of the protein group. If multiple abundance types are available, they are selected in the following order: 'Abundance: sample' (= raw abundance), 'Abundances (Scaled): sample', 'Abundances (Normalized): sample'. Protein groups whose master protein row contains no abundance value for any sample are ignored. If you use scaled or normalized abundance values, we strongly recommend selecting "Raw value (no normalization)" in the step "Quantification method".
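The selection order of abundance types can be illustrated as follows. This is a sketch only: the header prefixes are derived from the column names listed above, and the function name `select_abundance_columns` is hypothetical.

```python
# Preferred abundance types, highest priority first (raw, scaled, normalized).
PRIORITY = ["Abundance:", "Abundances (Scaled):", "Abundances (Normalized):"]


def select_abundance_columns(headers):
    """Return the columns of the first abundance type present in the export."""
    for prefix in PRIORITY:
        selected = [h for h in headers if h.startswith(prefix)]
        if selected:
            return selected
    return []


headers = [
    "Accession",
    "Abundances (Normalized): F1: Sample",
    "Abundances (Scaled): F1: Sample",
    "Abundances (Scaled): F2: Sample",
]
chosen = select_abundance_columns(headers)
```

In this example no raw abundance column exists, so the scaled columns are chosen over the normalized one.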
Scaffold
Data provided by Scaffold can be used by exporting Scaffold's protein report (TSV/XLS).
Usage of sample groups
Sample groups are defined to obtain mean quantification values across several samples. This allows a comparison of different sample groups, e.g. untreated samples vs. treated samples. Sample group mean quantification values are reported in the "summary.txt" or "lca_summary.mztab" file and displayed in the Krona plots. The quantification values for each sample are reported additionally. If no groups are specified, quantification is carried out for each sample separately.
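The idea of a sample group mean can be sketched in a few lines. The sample and group names below are made up for illustration, and `group_means` is a hypothetical helper, not part of Prophane.

```python
from statistics import mean


def group_means(sample_values, sample_groups):
    """Average the quantification value of one protein group over each sample group."""
    return {
        group: mean(sample_values[sample] for sample in samples)
        for group, samples in sample_groups.items()
    }


# Quantification values of a single protein group, keyed by sample name.
sample_values = {"untreated::rep1": 2.0, "untreated::rep2": 4.0, "treated::rep1": 10.0}
sample_groups = {
    "untreated": ["untreated::rep1", "untreated::rep2"],
    "treated": ["treated::rep1"],
}
means = group_means(sample_values, sample_groups)
```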
Sample name extraction of different report files by Prophane
The following table describes how Prophane extracts sample names from the different protein reports. Use the same sample names in the sample group definition.
input file | sample columns or elements | comment | example | extracted sample name |
---|---|---|---|---|
GENERIC table | sample category, sample name | sample name is generated by fusing the sample category and sample name column with :: as separator | sample1 | replicateA | acc1 | 2 | sample1::replicateA |
mzTab | MTD section of mzTab. All ms_run[number] elements are extracted. | Samples are identified in the MTD section if they match the following pattern: ms_run[number]. | "MTD ms_run[1]-location null" | ms_run[1] |
mzIdentML 1.2.0 | SpectrumIdentificationResult element, spectraData_ref attribute | sample names are extracted from the spectraData_ref attribute of the XML element <SpectrumIdentificationResult> | <SpectrumIdentificationResult spectraData_ref="qExactive2.mgf" spectrumID="index=5708" id="SIR_1769"> | qExactive2.mgf |
MPA's single sample Metaprotein Report | column 'sample' or None | if the column 'sample' is present in the given report, the content of this column is used; otherwise the generic sample name `sample1` is used. | no sample information provided | sample1 |
MPA's Multiprotein Report | column headers starting with EXP:sample_name | sample names are extracted from column headers that start with "EXP:", with the string after the colon (':') being the sample name | ... | Protein Accessions | EXP:S13.mgf | EXP:S20.mgf | EXP:S23.mgf | ... | S13.mgf, S20.mgf, S23.mgf |
Proteome Discoverer's protein group report | e.g. Abundances (Scaled): F1: Sample | Sample names are extracted from the abundance columns in protein group headers that follow the schema "Abundance (x): Fnumber" | ... | Abundances (Scaled): F1: Sample | Abundances (Scaled): F2: Sample | ... | F1, F2 |
Scaffold's protein report | both columns: Biological sample category AND Biological sample name are combined. | values of both columns are fused using :: as separator | ... | sample A | replicate 1 | ... | sample A::replicate 1 |
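Three of the extraction rules in the table can be sketched as follows. These helpers are hypothetical illustrations of the described rules, not Prophane's implementation; in particular, the regular expression for Proteome Discoverer headers is an assumption based on the example headers.

```python
import re


def fused_sample_name(category, name):
    """GENERIC and Scaffold reports: fuse category and name with '::'."""
    return f"{category}::{name}"


def mpa_sample_names(headers):
    """MPA multisample report: the string after 'EXP:' is the sample name."""
    return [h.split(":", 1)[1] for h in headers if h.startswith("EXP:")]


# Matches e.g. "Abundance: F1: Sample" and "Abundances (Scaled): F1: Sample".
PD_HEADER = re.compile(r"^Abundances? ?(\([^)]*\))?: (F\d+)")


def pd_sample_names(headers):
    """Proteome Discoverer: collect the F<number> part of abundance headers."""
    names = []
    for header in headers:
        match = PD_HEADER.match(header)
        if match and match.group(2) not in names:
            names.append(match.group(2))
    return names


generic = fused_sample_name("sample1", "replicateA")
mpa = mpa_sample_names(["Protein Accessions", "EXP:S13.mgf", "EXP:S20.mgf"])
pd = pd_sample_names(["Abundances (Scaled): F1: Sample", "Abundances (Scaled): F2: Sample"])
```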
quant method | description |
---|---|
NSAF (normalized to longest metaprotein sequence) | NSAF are calculated. Normalization is based on the largest sequence length of the respective protein group. |
NSAF (normalized to shortest metaprotein sequence) | NSAF are calculated. Normalization is based on the smallest sequence length of the respective protein group. |
NSAF (normalized to mean metaprotein sequence) | NSAF are calculated. Normalization is based on the averaged sequence length of the respective protein group. |
Raw value (no normalization) | Unprocessed quantification values are shown. No NSAF are calculated. |
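The three NSAF variants differ only in which sequence length represents a protein group. The sketch below uses the common definition of NSAF (spectral count divided by length, normalized by the sum of these ratios over all protein groups of a sample); whether Prophane applies exactly this formula is an assumption, and the function name `nsaf` is hypothetical.

```python
from statistics import mean

# How the representative length of a protein group is chosen.
LENGTH_RULES = {"longest": max, "shortest": min, "mean": mean}


def nsaf(groups, rule="longest"):
    """groups: {group_id: (spectrum_count, [sequence lengths of the members])}."""
    pick = LENGTH_RULES[rule]
    saf = {g: count / pick(lengths) for g, (count, lengths) in groups.items()}
    total = sum(saf.values())
    return {g: value / total for g, value in saf.items()}


groups = {"pg1": (10, [100, 200]), "pg2": (10, [200])}
by_longest = nsaf(groups, "longest")
by_shortest = nsaf(groups, "shortest")
```

With the longest member as reference, both groups receive the same NSAF; with the shortest member, pg1 is weighted twice as high as pg2.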
Functional Annotation with Profile-HMM Databases
Prophane utilizes the HMMER3 algorithm to search functional profile-HMM databases, which include PFAM, TIGRFAM, Resfams, CAZy/dbCAN, and FOAM. You have the option to choose between two search modes:
Functional Annotation with EggNOG Database and emapper
In Prophane, when utilizing the EggNOG database and emapper for functional annotation, the approach is based on precomputed orthology assignments rather than homology. Orthologs, also known as orthologous genes, are genes found in different species that share a common origin from a single gene in the last common ancestor. This method has been demonstrated to provide more accurate predictions compared to homology-based approaches.
The EggNOG database is built upon a carefully curated selection of representative species, encompassing a diverse range of organisms across the tree of life. The orthologs identified from this database are associated with common annotations, including GO terms, protein descriptions, KEGG annotations, PFAMs, and COG categories.
The e-value threshold defines how significant a match has to be in order to be reported. The smaller the e-value, the more significant the match. For additional information about the parameters used for emapper and the EggNOG database, please refer to the EggNOG wiki: https://github.com/eggnogdb/eggnog-mapper/wiki
acc | superkingdom | phylum | genus |
---|---|---|---|
KZK33282.1 | Bacteria | Firmicutes | Lactococcus |
PKC80086.1 | Bacteria | Actinobacteria | Bifidobacterium |
acc | level1 | level2 | level3 |
---|---|---|---|
1000565.METUNv1_03812 | information storage and processing | Translation, ribosomal structure and biogenesis | ATPase |
362663.ECP_0061 | information storage and processing | Replication, recombination and repair | DNA polymerase |
Annotation Databases
Annotation Algorithm Parameters
Database(s) | Search Algorithm | mandatory parameter [default] | optional parameters [default] |
---|---|---|---|
UniProtKB (TrEMBL & Swiss-Prot), NCBI protein nr | diamond blastp | evalue [0.001] | algo [0], band [ ], block-size [2.0], comp-based-stats [1], dbsize [40000000], frameshift [0], freq-sd [ ], gapextend [0], gapopen [0], gapped-xdrop [ ], hit-band [ ], hit-score [ ], id [0.0], id2 [ ], index-mode [0], masking [1], max-hsps [1], rank-ratio [ ], rank-ratio2 [ ], sensitive [ ], shape-mask [ ], matrix ['BLOSUM62'], strand [both], ungapped-score [ ], window [ ], shapes [0], xdrop [ ], query-cover [0], more-sensitive [ ], min-score [20] |
PFAMs, TIGRFAMs, FOAM, dbCAN, Resfams | hmmscan OR hmmsearch | evalue [0.001] | E [10], T [0.0], domE [10], domT [0.0], incE [10], incT [0.0], incdomE [10], incdomT [0.0], cut_ga [ ], cut_nc [ ], cut_tc [ ], max [ ], F1 [0.02], F2 [0.001], F3 [0.00001], nobias [ ], nonull2 [ ], domZ [int], seed [42], Z [int] |
EggNOG | emapper | m: diamond OR hmmer | gapextend [0], gapopen [0], go_evidence ['experimental'], guessdb [131567], hmm_maxhits [1], hmm_maxseqlen [5000], hmm_qcov [0], hmm_score [20.0], query-cover [0], seed_ortholog_score [60.0], subject-cover [0], target_orthologs [one2one], tax_scope [131567], Z [40000000] |
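The parameter names in the table correspond to the long command line options of the respective tools. As an illustration only, the sketch below turns a parameter dictionary into a `diamond blastp` argument list; how Prophane assembles its commands internally is not shown here, and the helper `build_diamond_args` is hypothetical. Parameters listed with an empty default `[ ]` are treated as on/off switches.

```python
def build_diamond_args(params):
    """Translate {option: value} into a diamond blastp argument list.

    True  -> switch without a value (e.g. --sensitive)
    None / False -> option is omitted
    """
    args = ["diamond", "blastp"]
    for name, value in params.items():
        if value is None or value is False:
            continue
        args.append(f"--{name}")
        if value is not True:
            args.append(str(value))
    return args


args = build_diamond_args(
    {"evalue": 0.001, "query-cover": 0, "sensitive": True, "band": None}
)
```

Input, output, and database options are left out on purpose; they would be added in the same way.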
LCA method: LCA per group
The "LCA per group" method determines the LCA based on annotations of proteins within a protein group. It allows users to set a threshold value for each group, ranging from 0 to 1.
If multiple annotations pass the threshold, the annotation with the highest support is chosen. If no annotation passes the threshold, Prophane returns the value "various".
A threshold value of 1 returns an LCA only if all annotations of all protein group members share the same annotation; otherwise the LCA is reported as "various". With a threshold value of 0.51, Prophane returns an LCA if more than half of the annotations are the same.
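The threshold logic can be sketched for a single annotation level as follows. This is an illustration under assumptions, not Prophane's code: support is taken as the fraction of group members carrying an annotation, an annotation passes if its support is at least the threshold, and the function name `lca_per_group` is hypothetical.

```python
from collections import Counter


def lca_per_group(annotations, threshold=1.0):
    """annotations: one annotation per member of a single protein group."""
    if not annotations:
        return "various"
    best, support = Counter(annotations).most_common(1)[0]
    # The best-supported annotation wins only if its share reaches the threshold.
    return best if support / len(annotations) >= threshold else "various"


members = ["Lactococcus", "Lactococcus", "Bifidobacterium"]
strict = lca_per_group(members, threshold=1.0)
majority = lca_per_group(members, threshold=0.51)
```

For a full lineage, the same decision would be repeated for each level (e.g. superkingdom, phylum, genus).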
LCA method: democratic LCA
The "democratic LCA" method selects the annotation that occurs most frequently across all protein groups from each protein group.
Because the annotation that is identified most often across all protein groups is always preferred, this method yields the smallest variance in the results. The two LCA methods can produce different results; compare Picture 1 and 2.
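A minimal sketch of the democratic idea for a single annotation level is given below. It is based on two assumptions that are not confirmed by this documentation: the frequency of an annotation is the number of protein groups containing it, and ties are broken alphabetically. The function name `democratic_lca` is hypothetical.

```python
from collections import Counter


def democratic_lca(groups):
    """groups: {group_id: [annotation of each member]} -> {group_id: annotation}."""
    # Count in how many protein groups each annotation occurs.
    overall = Counter(a for members in groups.values() for a in set(members))
    return {
        group: max(sorted(set(members)), key=lambda a: overall[a])
        for group, members in groups.items()
        if members
    }


groups = {
    "pg1": ["Lactococcus", "Bifidobacterium"],
    "pg2": ["Lactococcus"],
    "pg3": ["Lactococcus", "Streptococcus"],
    "pg4": ["Bifidobacterium"],
}
result = democratic_lca(groups)
```

In this example pg1 is assigned to Lactococcus because it occurs in three groups, while Bifidobacterium occurs in only two.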
Advanced Option: ignore_unclassified
With this advanced option, only proteins for which an annotation was found are considered for LCA determination (i.e. "unclassified" annotations are excluded).
Advanced Option: minimum_number_of_annotations
The "minimum_number_of_annotations" advanced option specifies the minimum number of annotations a lineage must contain to be taken into account by the LCA methods. This is especially useful for handling annotations that are extraneous or less informative.
For instance, vector annotations typically consist of a lineage with only one annotated level. This option allows such less informative annotations to be filtered out, resulting in a more meaningful LCA determination.
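The combined effect of both advanced options can be sketched as a filter applied to the lineages before LCA determination. The representation of lineages as tuples, the treatment of empty strings as missing levels, and the function name `filter_lineages` are assumptions made for illustration.

```python
def filter_lineages(lineages, ignore_unclassified=False, minimum_number_of_annotations=0):
    """Drop lineages that the LCA methods should not take into account."""
    kept = []
    for lineage in lineages:
        # Levels that carry an actual annotation.
        annotated = [level for level in lineage if level and level != "unclassified"]
        if ignore_unclassified and not annotated:
            continue
        if len(annotated) < minimum_number_of_annotations:
            continue
        kept.append(lineage)
    return kept


lineages = [
    ("Bacteria", "Firmicutes", "Lactococcus"),
    ("unclassified", "unclassified", "unclassified"),
    ("vector", "", ""),
]
without_unclassified = filter_lineages(lineages, ignore_unclassified=True)
at_least_two = filter_lineages(lineages, minimum_number_of_annotations=2)
```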
Structure of the Result Folder
Location | Content | File Type |
---|---|---|
summary.txt | Main result file: summary table of all analyses | TSV |
lca_summary.mztab | Main result file: summary table containing protein group information (LCA and Quantification) | mztab |
protein_summary.mztab | Main result file: annotations on protein level | mztab |
plots/plot_of_{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.html | interactive Krona plots | HTML |
pgs/protein_groups.sql | SQL database of protein groups | TSV |
seqs/all.faa | all accessions from the proteomic search result and their respective protein sequences | FASTA |
seqs/missing_taxa.faa | FASTA file containing sequences of taxonomically undefined proteins (those not covered by the optionally provided taxmap input file); in most cases, this file is the same as seqs/all.faa | FASTA |
seqs/ambiguous_sequences.txt | Usually not created; accessions, which could not be found in provided input FASTA. | TXT |
seqs/missing_sequences.txt | Usually not created; accessions, which could not be found in provided input FASTA. | TXT |
seqs/pgs/pg.{n}.faa | sequence information of protein group n | FASTA |
tasks/quant.tsv | quantification information for each taxonomic or functional lowest common ancestor (LCA) | TSV |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.best_hits | best hits extracted from raw annotation result | TXT |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.lca | LCA of each protein group | TSV |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.log | log of annotation command | TXT |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.map | respective accession-annotation linking | TSV |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.quant | LCA of each sample and replicate and their calculated quantification | TSV |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result | raw annotation result | TXT/TSV |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result-cmd.txt | annotation command, as executed in the shell | TXT |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.xml | input for creation of krona plot | XML |
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.yaml | annotation task parameters (full path to db, annotation algorithm, command line parameters, shortname of task, annotation type) | YAML |
tax/taxmap.txt | taxonomic information extracted from user-defined taxmaps | TXT |
tax/missing_taxa_map.txt | accessions without any taxonomic data/prediction | TXT |
tax/ambiguous_taxa.txt | Usually not created; accessions with ambiguous taxonomies extracted from user-defined taxmaps | TXT |
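The file names in the tasks folder follow a fixed pattern, so they can be split into their components when results are processed with scripts. The sketch below does this with a regular expression; the file name used in the example is made up, and `parse_task_file` is a hypothetical helper.

```python
import re

# {annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.{ext}
TASK_FILE = re.compile(
    r"^(?P<annot_type>.+?)_annot_by_(?P<tool>.+?)_on_(?P<db_type>.+)"
    r"\.v(?P<db_version>.+)\.task(?P<taskid>\d+)\.(?P<ext>.+)$"
)


def parse_task_file(name):
    """Return the components of a task file name, or None if it does not match."""
    match = TASK_FILE.match(name)
    return match.groupdict() if match else None


# Hypothetical file name, for illustration only.
info = parse_task_file("tax_annot_by_diamond_blastp_on_ncbi_nr.v2023-01.task0.lca")
```

This makes it easy to collect, for example, all `.lca` files of a result folder and group them by task ID.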
The summary.txt File
The summary.txt file is a tab-separated file that can be easily imported into Excel via the data panel for further analysis. The file begins with a header row, followed by information about protein groups. Protein groups consist of two types of rows: "group" and "member" rows and have a unique protein group number (column "#pg").
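Besides Excel, the file can be read with a few lines of code. The sketch below relies only on the two columns named above ("#pg" and "level"); the third column in the example data is a placeholder and does not reflect the real column set, and `read_summary` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def read_summary(handle):
    """Collect the group row and the member rows of each protein group."""
    groups = defaultdict(lambda: {"group": None, "members": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = groups[row["#pg"]]
        if row["level"] == "group":
            entry["group"] = row
        elif row["level"] == "member":
            entry["members"].append(row)
    return dict(groups)


# Placeholder data; "hypothetical_column" stands in for the real columns.
example = io.StringIO(
    "#pg\tlevel\thypothetical_column\n"
    "0\tgroup\t\n"
    "0\tmember\tacc1\n"
    "0\tmember\tacc2\n"
    "1\tgroup\t\n"
    "1\tmember\tacc3\n"
)
summary = read_summary(example)
```

For a real file, replace the `StringIO` object with `open("summary.txt", newline="")`.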
Group Rows (column "level" = group):
These rows summarize information about the protein group:
Krona Plots
Krona plots are a powerful visualization tool in Prophane, allowing users to explore and interpret the taxonomic or functional abundances of metaproteomic data with ease. Krona plots are automatically generated for each task in Prophane. The size of each field in the Krona plot corresponds to the normalized quantification values, providing a visual representation of the relative abundance. Krona plots support the viewing of results for different sample groups and different samples and provide an interactive zoom function.