Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-29 23:50:25 +08:00 · 2024-08-13 18:09:11 -04:00 · 2024-08-13 18:09:11 -04:00 · a201876293
commit a201876293
parent 0c7beba74c
1 changed files with 32 additions and 84 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@ -32,7 +32,7 @@ For those users interested in eukaryotic phylogeny, we provide a database of 1,0

 # Running PhyloToL Part 2

-Once you have set up your run folder as described in the [Set-up](set-up) section above, you're ready to run PhyloToL part 2. The pipeline is highly modular, and contains five main sections:
+Once you have set up your run folder as described in the Setup section above, you're ready to run PhyloToL part 2. The pipeline is highly modular, and contains five main sections:
 1. pre-Guidance (groups your sequences by GF instead of by sample and applies some basic filters; produces an unaligned amino acid file for each GF)
 2. Guidance (aligns your amino acid sequences and iteratively removes putative non-homologs; produces an aligned amino acid file for each GF)
 3. Tree building (produces a tree file [newick string] for each GF)
@ -63,118 +63,71 @@ Argument | Default | Choices | Description
 --data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
 --output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.

-Optional arguments can then be added to the base command, and will be described bellow.
+Optional arguments can then be added to the base command, and will be described below. In the following is described each stage of PhyloToL, and some key parameters to know for each step.

 ## Pre-Guidance: Reorganizing files by gene family, and optional filters

-The pre-guidance step of PhyloToL part 2 takes ReadyToGo files (aka "species files" from your taxa_list.txt) and convert them into "Gene Family files", one file for each one of the GF you chose in the listofOGs.txt. 
+The pre-guidance step of PhyloToL part 2 takes ReadyToGo files (one file per sample as output by PhyloToL part 1) and reorganized the amino acid sequences into one file per gene family (one file for each one of the GF you chose in the listofOGs.txt, see above). It also applies some optional filters, described below.

-* Similarity filter (optional): You can turn on the similarity filter to remove likely ingroup paralogs prior to running Guidance (a CPU intensive process). It allows to optionally choose to remove sequences based on GC composition, similarity or sequences previously identified as "non wanted". TBD The parameters for this when running pre-guidance is ‘--og_identifier’ and the options are 'OG','OG6','OGA','OGG' with the default being ‘OG’ and passing all the sequences to guidance without filtering. 
-
-
-* Filtering by GC (optional): The filtering by GC content selects only sequences that fall within a specified user defined ranges. 
-The renaming of each sequence is done using a utility script (GC_identifier.py), previous to running PhyloToL part 2, which renames the sequences with OGG, OG6, and OGA depending on if the sequence GC content falls below (OGA) or above (OGG) or within (OG6) the user specified GC range. As input for running PhyloToL part 2, user will give the labeled Ready To Go instead of the usual ones.
-The parameters for this when running pre-guidance is ‘--og_identifier’ and the options are 'OG','OG6','OGA','OGG' with the default being ‘OG’ and passing all the sequences to guidance without filtering.
+### Filtering by GC content
+ 
+Using a utility script (GC_identifier.py), prior to running PhyloToL part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on if the sequence GC content falls below (OGA) or above (OGG) or within (OG6) the user specified GC range. If the user wishes to only include one category of sequences when running PhyloToL part 2, the user should give the relabeled ReadyToGo files. The parameters for this when running pre-guidance is `--og_identifier` and the options are 'OG','OG6','OGA','OGG' with the default being ‘OG’, which passing all the sequences to guidance without filtering.

 Adding these options to the command line will give:
-> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG > Output1.out

-A note on filtering by GC: we do this based on plots that compare GC content at silent sites (GC3s) to effective number of codons as calculated by Wright 1991. 
+`python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG`

-TBD - Marie put example figure here with explanation...
+A note on filtering by GC: we set our bounds by visually inspecting plots that compare GC content at silent sites (GC3 Degen) to effective number of codons (ENc) as calculated by Wright 1991. These graphs can be made using the utility script [CUB.py](https://github.com/Katzlab/PhyloToL-6/blob/main/Utilities/for_fastas/CUB.py). For example for the following graph, we might choose to exclude all points with a GC content less than 50% (so, set the lower bound to be 50% [OGA] and the upper bound to be 100% [OGG], then in part 2 use `--og_identifier OG6`).

-TBD - i think this is already above now?
+<img src='https://github.com/Katzlab/PhyloToL-6/blob/main/Other/JF_example.png' width='30%'>

-### similarity filters
+### Similarity filter

-Another option to filter sequences from the ReadyToGo files at the pre-guidance step (and so reducing computer resources needed, and sequence redundancy) is the similarity filter. Here, user can choose to remove sequences that are too similar by adding the '--similarity_filter' parameter into the command line, on all the dataset or only on a specific set of taxa. All parameters involved in reducing sequences by similarity are summarized in this table:
+Another option to filter sequences from the ReadyToGo files at the pre-guidance step (and so reducing computer resources needed, and sequence redundancy) is the similarity filter. The purpose of the similarity filter is to remove likely in-group paralogs prior to running Guidance (a CPU intensive process). Here, user can choose to remove sequences that are too similar by adding the `--similarity_filter` parameter into the command line, on all the dataset or only on a specific set of taxa (see the table below for a description of the parameters to use).=

 Argument | Default | Choices | Help
 -- | -- | -- | --
 --og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
 --similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
--sim_cutoff | 1 |   | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff.
--sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
+--sim_cutoff | 1 | _float_ | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar (% amino acid identity over 20% of their length) than this cutoff.
+--sim_taxa | None | Path to file | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.

 Adding these options to the command line will give:

-> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt > Output1.out
+`python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt`

+###Other optional parameters:
+#### Blacklist

-Other optional parameters:
-### Blacklist
+The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance to avoid reconsidering these non-homologs in future runs as this can save computer time, if you're going to be using the same ReadyToGo files and GFs in multiple runs of PhyloToL part 2. For our study, we chose to include only sequences removed by Guidance in our blacklist, but users should choose what fits best for their study and their data. To include this parameter in your PhyloToL run, you will need to add the `--blacklist` flag to the command line as follows:

-The Blacklist is a user-defined set of sequences to be removed from runs. You might choose to list of sequences removed by Guidance (to avoid reconsidering these non-homologs in future runs as this can save computer time. For our study, we chose to include only sequences removed by Guidance in our Blacklist, but user can choose (wisely) what fits best for their study and their data. To include this parameter to your PhyloToL run, you will need to add the '--blacklist' flag to the command line as follow:
-
-> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt > Output1.out
+`python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt`

 Argument | Default | Choices | Help
 -- | -- | -- | --
--blacklist |   |   | A text file with a list of sequence names not to consider.
+--blacklist | None | Path to a file | A text file with a list of sequence names not to consider.

 ## Guidance

-Within PhyloToL, we are using Guidance as a proxy to assess homology within GFs. PTL6p2 runs Guidance in an iterative fashion to remove non-homologous sequences defined as those that fall below the sequence score cutoff. (We note that there is some stochasticity here given the iteration of alignments built into the method.) After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iterations. Some options are available to change the default set up for Guidance:
+Within PhyloToL part 2, we use Guidance to assess homology within gene families. PhyloToL part 2 runs Guidance in an iterative fashion to remove non-homologous sequences, defined as those that fall below the sequence score cutoff (we note that there is some stochasticity here). Users should consult the [Guidance 2.02 documentation](https://taux.evolseq.net/guidance/) for details on the significance of these parameters. After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default set up for Guidance:

-Argument | Default | Choices | Help
+Argument | Default | Choices | Description
 -- | -- | -- | --
--guidance_iters | 5 |   | Number of Guidance iterations for sequence removal.
--seq_cutoff | 0.3 |   | During guidance, taxa are removed if their score is below this cutoff.
--col_cutoff | 0.0 |   | During guidance, columns are removed if their score is below this cutoff.
--res_cutoff | 0.0 |   | During guidance, residues are removed if their score is below this cutoff.
--keep_temp | store_true |   | Use this to keep ALL Guidance intermediate files.
--keep_iter / -z | store_true |   | Keep all Guidance iterations (beware this will be very large)
+--guidance_iters | 5 |  _int_ | Number of Guidance iterations for sequence removal.
+--seq_cutoff | 0.3 | _float_  | During guidance, taxa are removed if their score is below this cutoff.
+--col_cutoff | 0.0 |  _float_ | During guidance, columns are removed if their score is below this cutoff.
+--res_cutoff | 0.0 |  _float_ | During guidance, residues are removed if their score is below this cutoff.
+--keep_temp | True | flag | Use this to keep ALL Guidance intermediate files.
+--keep_iter / -z | True | flag  | Keep all Guidance iterations (beware this will be very large)

 ## Gene trees

-After homology assessment and building MSA (Guidance step), PhyloToL trim alignment and build trees. By default, alignments are trimmed at 0.95% with trimal, and trees are built by IqTREE with an LG+G model. All parameters can be altered here by adding the correct flags (see table), including the choice of tool for phylogenetic reconstruction.
-
-Argument | Default | Choices | Help
+After homology assessment and building MSAs (the Guidance step), PhyloToL trims alignments and build trees. By default, alignments are trimmed at 0.95% with TrimAL, and trees by default are built by IqTREE with an LG+G model; users may choose to use a different third-party tool for phylogenetic reconstruction.
+ 
+Argument | Default | Choices | Description
 -- | -- | -- | --
 --tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.

-
-
-## Summary for basic launching
-
-In summary, PhyloToL is a modular and flexible pipeline than can be started and stopped at any points, with options at each point to completely personalize your run. For a full PhyloToL run, it start with an input Folder of ReadyToGo files, and ends with an Output Folder containing PreGuidance files (one file by GF regrouping sequences from all taxa in your list), NotGapTrimmed files (post-guidance files, non-homologous sequences removed, but non trimmed by trimal), Guidance files (post-guidance and trimmed files), and Trees.
-
-Minimum requirements: 
-> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out
-
-List of all parameters included in PhyloToL:
-
-
-Argument | Default | Choices | Help
-- | -- | -- | --
--start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
--end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
--gf_list | None |   | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
--taxon_list | None |   | Path to the file with the taxa (10-digit codes) to include in the output.
--data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
--output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
--force | store_true |   | Overwrite all existing files in the "Output" folder.
--tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.
--blacklist |   |   | A text file with a list of sequence names not to consider.
--og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
--sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
-
-Core parameters - rarely changed from defaults
-
-Argument | Default | Choices | Help
-- | -- | -- | --
--blast_cutoff | 1e-20 |   | Blast e-value cutoff.
--len_cutoff | 10 |   | Amino acid length cutoff for removal of very short sequences after column removal in Guidance.
--similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
--sim_cutoff | 1 |   | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff.
--guidance_iters | 5 |   | Number of Guidance iterations for sequence removal.
--seq_cutoff | 0.3 |   | During guidance, taxa are removed if their score is below this cutoff.
--col_cutoff | 0.0 |   | During guidance, columns are removed if their score is below this cutoff.
--res_cutoff | 0.0 |   | During guidance, residues are removed if their score is below this cutoff.
--keep_temp | store_true |   | Use this to keep ALL Guidance intermediate files.
--keep_iter / -z | store_true |   | Keep all Guidance iterations (beware this will be very large)
-
-
 ## Contamination loop

 The contamination coop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user defined file of 'rules,' used to identify the sequences to remove. We first provide an overview of the three modes and then give details on running below.
@ -183,7 +136,7 @@ The contamination coop (CL) is implemented within PhyloToL to allow the removal

 **Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that include 20 opithokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.

-The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (i.e. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to TBD xxxx file), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using TBD Guidance with TBD iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.
+The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (i.e. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to a file called TBD), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using TBD Guidance with TBD iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.


 ### Contamination loop setup
@ -215,9 +168,4 @@ To run the CL, use a similar command structure as described for running PhyloToL
 | --clade_grabbing_rules  | in clade mode | Path to a file | Clade-grabbing rules file | none |
 | --clade_grabbing_exceptions  | no | Path to a file | List of taxa to _not_ remove for any reason | none |

-
-
-## Orthologs selection and concatenation with PhyloToL
-TBD
-
-
+## Ortholog selection and concatenation