Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2026-02-11 17:20:27 +08:00 · 2024-08-13 10:57:13 -04:00 · 2024-08-13 10:57:13 -04:00 · 27605a4003
commit 27605a4003
parent aa95684e83
1 changed files with 40 additions and 23 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -1,24 +1,30 @@
 # Overview and Modularity
-PhyloToL Part 2 is designed to: 1) generate multisequence alignments (**MSAs**) using Guidance version 2.02 (https://taux.evolseq.net/guidance/source), iterating to remove sequences that are not homologous (default sequence score is 0.3); 2) generate gene trees using a 3rd party program (RaxML, IQTree, FastTree); 3) remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options (see BioRxiv); and 4) ortholog selection and generation of species trees from concatenated alignments.
+PhyloToL Part 2 is designed to: 1) generate multisequence alignments (**MSAs**), first filtering in a "Pre-Guidance" step and then using [Guidance](https://taux.evolseq.net/guidance/source) version 2.02, iterating to remove sequences considered non-homologs (default sequence score cutoff of 0.3); 2) generate gene trees using a third-party program (RaxML, IQTree, FastTree); 3) remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options; and 4) select orthologs and generate species trees from concatenated alignments.
-PTL6p2 starts from the “ReadyToGo” files produced by part 1 (or any set of per-taxon sequences with names that match PLT6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by PTL6p1, unaligned amino acid sequences per GF (in which case pre-Guidance filters are not run), aligned sequence files (one per each GF; in which case Guidance does not run), or even trees if a user is running only the contamination loop or concatenation. All details are provided bellow.
+PTL6p2 starts from the “ReadyToGo” files produced by part 1 (or any set of per-taxon fasta files with sequence names that match PTL6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Input and output options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by PTL6p1, unaligned amino acid sequences per GF (in which case pre-Guidance filters are not run), aligned sequence files (one file per GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop or concatenation. Details are provided below.
 # Set Up
 Requirements include: fasta files with controlled names.
 # Databases
-
+For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (**TBD - point to figshare file here -- table of names**). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 (**TBD - point to figshare file here -- R2Gs**). This means that, using this dataset, you can jump right into running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above).
 We provide a diverse database of 1,000 genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity. This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1. This means that using this dataset, you can jump right in to running analyses of any subset of these taxa using any of the OGs in the Hook Database. If you want to add your own samples or use a different set of OGs, you should check out [PhyloToL part 1](https://github.com/Katzlab/PhyloToL-6/wiki/PhyloToL-Part-1).
 # Running PhyloToL Part 2
-Running PhyloToL Part 2 requires at least 4 items in your main directory: 1) A folder named Scripts and containing all scripts from PhyloToL part 2, 2) a folder containing your input, 3) a taxon list and 4) an OG list. As said earlier, PhyloToL part 2 is highly modular and flexible, so before you start, be sure of what point in the process of PTLp2 you wish to start and end. Default script starts with raw data and produces trees. This will also influences your input files. If you want to produce trees, you will keep the default '--end’ parameter set to 'trees' 
+Running PhyloToL Part 2 requires at least 4 items in your main directory: 1) a folder named Scripts containing all scripts from PhyloToL part 2 (**TBD - link to Github?**), 2) a folder containing your input files, 3) a taxon list (**TBD - provide example below?**) and 4) an OG list (**TBD - provide example below?**). Given that PhyloToL part 2 is highly modular and flexible, you will want to be sure at what point in the process of PTLp2 you wish to start and end (**TBD - point to figshare file here -- some figure**). The default script starts with raw data and produces trees using scripts 1 - TBD.
-If you want to produce up to pre-guidance files, you will change the default '--end’ parameter to 'unaligned'
+
-If you want to produce up to guidance files, you will change the default '--end’ parameter to 'aligned'
+The code to run PhyloToL Part 2 in full is
 If you want to start at a point other than raw data, change the default '--start' parameter to 'unaligned', 'aligned', or 'trees'.
 This is the line you could run, with minimum requirements: 
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out
 Modularities
 * If you want to produce trees, keep the default '--end' parameter set to 'trees'.
 * If you want to stop after producing the pre-guidance files, change the default '--end' parameter to 'unaligned'.
 * If you want to stop after producing the Guidance files, change the default '--end' parameter to 'aligned'.
 * If you want to start at a point other than raw data, change the default '--start' parameter to 'unaligned', 'aligned', or 'trees'.
 Below is a list of arguments and their descriptions.
 Argument | Default | Choices | Help
 -- | -- | -- | --
 --start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
@@ -31,20 +37,27 @@ Argument | Default | Choices | Help
 Optional arguments can then be added to the command line, and are described below.
-## Filtering sequences 
+## PreGuidance: Reorganizing files by GF and optional filters (similarity, GC)
-Filtering sequences options are provided within PhyloToL to remove sequences at the "pre-guidance" step, before any intense computer resources are needed. This pre-guidance step of PhyloToL part 2 takes ReadyToGo files (aka "species files") and convert them into "Gene Family files", one file for each one of the GF you chose in the listofOGs.txt, regrouping all sequences from your taxa_list.txt. It allows to optionally choose to remove sequences based on GC composition, similarity or sequences previously identified as "non wanted".
+The pre-guidance step of PhyloToL part 2 takes ReadyToGo files (aka "species files" from your taxa_list.txt) and converts them into "Gene Family files", one file for each of the GFs you chose in the listofOGs.txt.
-### Filtering on GC composition
+* Similarity filter (optional): You can turn on the similarity filter to remove likely ingroup paralogs prior to running Guidance (a CPU-intensive process). The pre-guidance step also lets you optionally remove sequences based on GC composition, similarity, or sequences previously identified as "non wanted". TBD The parameter for this when running pre-guidance is '--og_identifier' and the options are 'OG', 'OG6', 'OGA', 'OGG', with the default being 'OG', which passes all the sequences to Guidance without filtering.
-The filtering by GC content selects only sequences that fall within a specified range (user defined ranges). 
+
-The renaming of each sequence is done using a utility script (GC_identifier.py), previously from running PhyloToL part 2, which renames the sequences with OGG, OG6, and OGA depending on if the sequence GC content falls below or above the user specified GC range. As input for running PhyloToL part 2, user will give the labeled Ready To Go instead of the usual ones.
+* Filtering by GC (optional): The filtering by GC content selects only sequences that fall within a user-defined range.
 The renaming of each sequence is done using a utility script (GC_identifier.py), prior to running PhyloToL part 2, which renames the sequences with OGG, OG6, or OGA depending on whether the sequence GC content falls below (OGA), above (OGG), or within (OG6) the user-specified GC range. As input for running PhyloToL part 2, the user provides the labeled ReadyToGo files instead of the usual ones.
 The parameter for this when running pre-guidance is '--og_identifier' and the options are 'OG', 'OG6', 'OGA', 'OGG', with the default being 'OG', which passes all the sequences to Guidance without filtering.
 Adding these options to the command line will give:
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG > Output1.out
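The labeling rule described above (OGA below the range, OGG above it, OG6 within it) can be illustrated with a minimal Python sketch. This is not the code of GC_identifier.py: the function names `gc_fraction` and `gc_label` are hypothetical, and the sketch assumes overall GC content of a nucleotide sequence, whereas the actual utility may use a different measure (for example GC at silent sites).

```python
def gc_fraction(nucleotides: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = nucleotides.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / counted


def gc_label(nucleotides: str, low: float, high: float) -> str:
    """Label a sequence relative to a user-defined GC range [low, high].

    Hypothetical helper: 'OGA' below the range, 'OGG' above it,
    'OG6' within it (boundaries counted as within).
    """
    gc = gc_fraction(nucleotides)
    if gc < low:
        return "OGA"
    if gc > high:
        return "OGG"
    return "OG6"
```

With labels of this kind in the sequence names, passing for instance `--og_identifier OG6` would keep only the sequences inside the chosen range.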
-### Overlap and similarity filters
+A note on filtering by GC: we do this based on plots that compare GC content at silent sites (GC3s) to effective number of codons as calculated by Wright 1991. 
 TBD - Marie put example figure here with explanation...
 TBD - i think this is already above now?
 ### Similarity filters
 Another option to filter sequences from the ReadyToGo files at the pre-guidance step (thereby reducing the computer resources needed and sequence redundancy) is the similarity filter. Here, the user can choose to remove sequences that are too similar by adding the '--similarity_filter' parameter to the command line, either for the whole dataset or only for a specific set of taxa. All parameters involved in reducing sequences by similarity are summarized in this table:
@@ -59,9 +72,11 @@ Adding these options to the command line will give:
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt > Output1.out
 Other optional parameters:
 ### Blacklist
-As you run PhyloToL, list of sequences removed by Similarity Filter and Guidance (detailed below) are provided. This list of sequences can then be re-used in next PhyloToL runs to remove sequences prior to running the Guidance steps (list of sequences not to include in your dataset). As Guidance can take time, removing these already identified non-homologous sequences can save computer time. For our study, we chose to include only sequences removed by Guidance in our Blacklist, but user can choose (wisely) what fits best for their study and their data. To include this parameter to your PhyloToL run, you will need to add the '--blacklist' flag to the command line as follow:
+The Blacklist is a user-defined set of sequences to be removed from runs. You might choose to list sequences removed by Guidance (to avoid reconsidering these non-homologs in future runs), as this can save computer time. For our study, we chose to include only sequences removed by Guidance in our Blacklist, but users can choose (wisely) what best fits their study and their data. To include this parameter in your PhyloToL run, add the '--blacklist' flag to the command line as follows:
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt > Output1.out
@@ -71,7 +86,7 @@ Argument | Default | Choices | Help
 ## Guidance
-Within PhyloToL, we are using Guidance as a proxy to assess homology in Gene Families. PTL6p2 runs Guidance in an iterative fashion to remove non-homologous sequences defined as those that fall below the sequence score cutoff. (We note that there is some stochasticity here given the iteration of alignments built into the method.) After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iterations. Some options are available to change the default set up for Guidance:
+Within PhyloToL, we use Guidance as a proxy to assess homology within GFs. PTL6p2 runs Guidance in an iterative fashion to remove non-homologous sequences, defined as those that fall below the sequence score cutoff. (We note that there is some stochasticity here given the iteration of alignments built into the method.) After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default setup for Guidance:
 Argument | Default | Choices | Help
 -- | -- | -- | --
@@ -90,6 +105,8 @@ Argument | Default | Choices | Help
 -- | -- | -- | --
 --tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.
 ## Summary for basic launching
 In summary, PhyloToL is a modular and flexible pipeline that can be started and stopped at any point, with options at each point to completely personalize your run. A full PhyloToL run starts with an input folder of ReadyToGo files and ends with an output folder containing PreGuidance files (one file per GF, regrouping sequences from all taxa in your list), NotGapTrimmed files (post-Guidance files with non-homologous sequences removed, but not trimmed by trimAl), Guidance files (post-Guidance, trimmed files), and Trees.
@@ -132,13 +149,13 @@ Argument | Default | Choices | Help
 ## Contamination loop
-The contamination coop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylgoeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user defined file of 'rules,' used to identify the sequences to remove. 
+The contamination loop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules' used to identify the sequences to remove. We first provide an overview of the three modes and then give details on running below.
-Sisters-based contamination removal identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. Subsisters-based removal operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, useful for when multiple samples are contaminated by the same other sample.
+**Sisters-based contamination removal** identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. **Subsisters-based removal** operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, useful for when multiple samples are contaminated by the same other sample. 
-Clade-based contamination removal operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokont clades, we might choose to keep only Opisthokont sequences that fall in a monophyletic clade of 12 or more species of Opisthokont; all other Opisthokont sequences in the tree would be removed.
+**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that includes 20 opisthokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.
-The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (i.e. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to xxxx file), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using ?Guidance with x iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.
+The CL runs iteratively, and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (e.g. a sequence from a ciliate is removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to TBD xxxx file), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using TBD Guidance with TBD iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine the optimal number for their study.
 ### Contamination loop setup
@@ -147,7 +164,7 @@ The CL requires 1) a folder of alignments (not gap trimmed) and 2) a folder of g
 You will also need to create a 'rules' file. The format here varies between the different modes of the CL.
-_describe rules files here_
+TBD _describe rules files here_
 Having well-organized ten-digit-codes for sample identification is vital for running the CL, especially clade grabbing, because it will allow the CL to find all sequences belonging to taxa from a specific taxonomic group. For example, in the ten-digit-codes used in the 1000-ReadyToGo file database, in order to clade grab for Opisthokonta the CL will look for all sequences with ten-digit-codes starting with `Op_`, and to clade grab for Metazoa, all ten-digit-codes starting with `Op_me`, etc.
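To illustrate why the ten-digit-code prefixes matter, here is a small Python sketch of prefix-based selection. It is not the CL's own code: the function `taxa_with_prefix` and the example sequence names are hypothetical, and the sketch assumes that each sequence name begins with its ten-character sample code.

```python
def taxa_with_prefix(seq_names, prefix):
    """Return the set of ten-character sample codes that start with `prefix`.

    Assumes every sequence name begins with its ten-digit-code.
    """
    return {name[:10] for name in seq_names if name.startswith(prefix)}


# Hypothetical sequence names, for illustration only.
names = [
    "Op_me_hsap_seq1",
    "Op_me_hsap_seq2",
    "Op_fu_scer_seq1",
    "Sr_ci_tthe_seq1",
]

opisthokonts = taxa_with_prefix(names, "Op_")
metazoa = taxa_with_prefix(names, "Op_me")
```

Counting the distinct codes returned for a clade is one way to check a criterion such as "12 or more species" before deciding which sequences to keep.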
@@ -173,6 +190,6 @@ To run the CL, use a similar command structure as described for running PhyloToL
 ## Ortholog selection and concatenation with PhyloToL
-
+TBD