Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-28 21:00:25 +08:00 · 2024-08-14 11:39:45 -04:00 · 2024-08-14 11:39:45 -04:00 · 1ea28fece4
commit 1ea28fece4
parent 6fdb41760e
1 changed files with 7 additions and 4 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -22,10 +22,13 @@ The following are required to run PhyloToL part 1. The dependencies are confirme
 ## Input data and folder structure
-_ACL is going to return here_
+Running PhyloToL Part 2 requires at least four items in your main directory:
 * A folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) containing all scripts from PhyloToL part 2
 * A folder containing your input files
 * A taxon list
 * An OG list
-Running PhyloToL Part 2 requires at least four items in your main directory (see setup section above)
+See below for details.
 1) A folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) containing all scripts from PhyloToL part 2, 2) a folder containing your input files, 3) a taxon list, and 4) an OG list. Because PhyloToL part 2 is highly modular and flexible, decide in advance at which point in the PTLp2 process you wish to start and end (**TBD - point to figshare file here -- some figure). The default script starts with raw data and produces trees using scripts 1 - TBD.
 # Databases
 For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [Datasets S1 and S6](https://figshare.com/account/projects/196552/articles/26540599?file=48494653)). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 ([Dataset S16](https://figshare.com/account/projects/196552/articles/25336129?file=48356116)). With this dataset, you can jump straight into analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above).
@@ -136,7 +139,7 @@ The contamination loop (CL) is implemented within PhyloToL to allow the removal
 **Sisters-based contamination removal** identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. **Subsisters-based removal** operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, which is useful when multiple samples are contaminated by the same other sample.
-**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that include 20 opisthokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.
+**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that includes 20 opisthokonts; all other opisthokont sequences in the tree are removed.
 The CL runs iteratively, and users set the number of times that rules are applied to reconstructed trees with the `--nloops` parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), in each iteration PhyloToL will: identify a list of sequences as contaminants (writing them out to a file called `SequencesRemoved_ContaminationLoop.txt`), generate a fasta file for each gene family excluding the contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect the outputs to determine the optimal number of loops for their study.
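
To make the iterative logic concrete, here is a minimal toy sketch of sisters-based removal applied over several loops. It is not PhyloToL's implementation. Several things are assumptions made for illustration: trees are nested Python tuples, leaf names follow a hypothetical `sample|sequence` format, a sequence is flagged only when every leaf in its sister group comes from a known contaminating sample, and pruning the tree stands in for the realignment and tree-rebuilding steps. The helper names (`find_contaminants`, `prune`, `contamination_loop`) are hypothetical, and stopping early once nothing is flagged is a convenience of the sketch.

```python
# Toy sketch of an iterative, sisters-based contamination loop.
# NOT PhyloToL's code: tree format, leaf naming, and helper names are assumptions.

def leaves(node):
    """Return every leaf name found under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def sample_of(leaf):
    """Extract the sample name from a hypothetical 'sample|sequence' leaf label."""
    return leaf.split("|")[0]

def find_contaminants(tree, rules):
    """Flag leaves whose sister group consists only of known contaminating samples.

    rules maps a sample name to the set of samples known to contaminate it.
    """
    flagged = set()

    def visit(node):
        if isinstance(node, str):
            return
        for i, child in enumerate(node):
            if isinstance(child, str):
                # The sister group is every leaf under the other children of this node.
                sisters = [leaf for j, sib in enumerate(node) if j != i
                           for leaf in leaves(sib)]
                sources = rules.get(sample_of(child), set())
                if sisters and all(sample_of(s) in sources for s in sisters):
                    flagged.add(child)
            else:
                visit(child)

    visit(tree)
    return flagged

def prune(node, removed):
    """Drop removed leaves and collapse nodes left with a single child.

    Stand-in for the real steps of rewriting the fasta file, realigning,
    and rebuilding the tree.
    """
    if isinstance(node, str):
        return None if node in removed else node
    kept = [p for p in (prune(child, removed) for child in node) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def contamination_loop(tree, rules, nloops=5):
    """Apply the rules up to nloops times, recording what each loop removes."""
    removed_per_loop = []
    for _ in range(nloops):
        flagged = find_contaminants(tree, rules)
        if not flagged:
            break  # early stop is a choice made for this sketch
        removed_per_loop.append(sorted(flagged))
        tree = prune(tree, flagged)
    return tree, removed_per_loop

# Ciliate sample CilA is known to be contaminated by its food source, Bact.
rules = {"CilA": {"Bact"}}
tree = (("CilA|s1", ("CilA|s2", "Bact|b1")), "CilB|t1")

final_tree, removed = contamination_loop(tree, rules, nloops=5)
print(removed)
print(final_tree)
```

In this toy tree, `CilA|s1` only becomes sister to the bacterial sequence after `CilA|s2` has been removed in the first loop, which is one reason a single pass may not catch every contaminant.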