Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-29 18:10:24 +08:00 · 2024-08-13 18:52:01 -04:00 · 2024-08-13 18:52:01 -04:00 · 04ee27ef19
commit 04ee27ef19
parent bd3a00c717
1 changed files with 19 additions and 12 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -134,27 +134,32 @@ The contamination loop (CL) is implemented within PhyloToL to allow the removal
 **Sisters-based contamination removal** identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. **Subsisters-based removal** operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, useful for when multiple samples are contaminated by the same other sample. 
-**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that include 20 opithokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.
+**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that includes 20 opisthokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.
-The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (i.e. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to a file called TBD), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using TBD Guidance with TBD iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.
+The CL runs iteratively, and users must set the number of times that rules should be applied to reconstructed trees using the `--nloops` parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), PhyloToL will, for each iteration: identify a list of sequences as contaminants (writing them out to a file called `SequencesRemoved_ContaminationLoop.txt`), generate a fasta file for each gene family excluding contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine the optimal number for their study.
 ### Contamination loop setup
-The CL requires 1) a folder of alignments (not gap trimmed) and 2) a folder of gene trees in order to run, and they should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above). You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
+The CL requires 1) a folder of alignments (ideally not gap-trimmed) and 2) a folder of gene trees. Both should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files for a gene family should begin with the same unique GF identifier. You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
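Before starting a run, it can help to check that every alignment has a tree with the same GF identifier. The following is a minimal sketch and not part of PhyloToL: the function name `pair_by_gf`, the example file names, and the assumption that the GF identifier is the first ten characters of the file name are all hypothetical, so adjust `id_length` to your own naming scheme.

```python
def pair_by_gf(alignment_names, tree_names, id_length=10):
    """Pair alignment and tree file names that share a leading GF identifier.

    Assumption (hypothetical): the GF identifier is the first `id_length`
    characters of each file name.
    """
    alignments = {name[:id_length]: name for name in alignment_names}
    trees = {name[:id_length]: name for name in tree_names}

    # Gene families present in both folders can be processed together.
    shared = sorted(alignments.keys() & trees.keys())
    pairs = {gf: (alignments[gf], trees[gf]) for gf in shared}

    # Gene families present in only one folder need attention.
    unmatched = sorted(alignments.keys() ^ trees.keys())
    return pairs, unmatched


# Hypothetical file names, for illustration only.
pairs, unmatched = pair_by_gf(
    ["OG6_100001.aln.fasta", "OG6_100002.aln.fasta"],
    ["OG6_100001.tree"],
)
print(pairs)
print(unmatched)
```

In practice the two name lists could come from `os.listdir()` on the alignment and tree folders.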
-You will also need to create a 'rules' file. The format here varies between the different modes of the CL.
+You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in [Datasets S4-S8 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
-TBD _describe rules files here_
+In the sisters and subsisters modes, your rules file should include three columns. Each row represents a rule: a sequence from a taxon (identified by a ten-digit code or a shorter prefix in the first column) will be removed if it is sister to a sequence from the taxon in the second column and on a branch that is shorter than X times the average branch length in the tree, where X is the number in the third column. Set the third column to “NA” if you do not want any branch length restriction for the rule. For example, the line
 |Op_ch_Dgra | Sr_di | 0.1|
 |-|-|-|
 indicates that a sequence from the choanoflagellate Op_ch_Dgra should be removed if it is sister to any dinoflagellate (any sequence beginning with the prefix Sr_di) on a branch that is less than one tenth the average branch length in the gene tree.
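The logic of a sisters-mode rule can be sketched in a few lines of Python. This is an illustration only, not PhyloToL's own code: the names `SisterRule`, `parse_sister_rules`, and `rule_applies` are hypothetical, the sequence names are made up, and the caller is assumed to supply the relevant branch length and the tree's average branch length.

```python
import csv
import io
from typing import NamedTuple, Optional


class SisterRule(NamedTuple):
    target: str                  # code or prefix of sequences that may be removed
    contaminant: str             # code or prefix of the sister that triggers removal
    max_ratio: Optional[float]   # cutoff as a multiple of the average branch length; None for "NA"


def parse_sister_rules(text):
    """Parse tab-separated three-column rules (hypothetical helper)."""
    rules = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        target, contaminant, cutoff = (field.strip() for field in row[:3])
        max_ratio = None if cutoff.upper() == "NA" else float(cutoff)
        rules.append(SisterRule(target, contaminant, max_ratio))
    return rules


def rule_applies(rule, seq_name, sister_name, branch_length, mean_branch_length):
    """Return True if seq_name should be removed under this rule."""
    if not seq_name.startswith(rule.target):
        return False
    if not sister_name.startswith(rule.contaminant):
        return False
    if rule.max_ratio is None:
        return True  # "NA": no branch length restriction
    return branch_length < rule.max_ratio * mean_branch_length


rules = parse_sister_rules("Op_ch_Dgra\tSr_di\t0.1\nOp_ch_Dgra\tSr_ci\tNA\n")
# Hypothetical sequence names; 0.02 is below 0.1 * 0.5, so the rule fires.
print(rule_applies(rules[0], "Op_ch_Dgra_seq1", "Sr_di_Smic_seq7", 0.02, 0.5))
```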
 In clade-grabbing mode, each row again represents a rule. This time, there are five columns. The first column gives the target taxonomic group for which you are clade grabbing. Here you can give a ten-digit code, a subset of a code, or even the path to a text file containing a list of multiple codes if they don't all share a precise enough prefix. The second column gives the maximum proportion (or absolute number, if >1) of taxa in a clade that are not in the target group, and the third column gives the minimum number of target taxa that must be in a clade for it to be kept. The fourth column allows you to give a list of 'special' taxa (or just a ten-digit code or a subset of a code), X of which must be present in a clade for it to be selected, where X is the number in the fifth column. For example, the line
 |Sr_ci | 0.1 | 13 | ciliate_genomes.txt | 1|
 |-|-|-|-|-|
 indicates that all ciliate sequences should be removed if they don't fall in a clade with at least 13 ciliate species (unique ten-digit codes beginning with Sr_ci), where no more than 1/10 of the species in the clade are non-ciliates, and containing at least 1 sequence that begins with a prefix listed in the ciliate_genomes.txt file (e.g., if you're more confident in genomic data, you may want to make sure that there's a genome in your clade).
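The criteria in a clade-grabbing rule can be sketched as a single check on the sequence names in a candidate clade. This is a hedged illustration rather than PhyloToL's implementation: `taxon_code` and `clade_passes` are hypothetical names, taxa are assumed to be identified by the first ten characters of each sequence name, and 'special' taxa are counted per unique code.

```python
def taxon_code(seq_name):
    """Return the ten-character taxon code at the start of a sequence name."""
    return seq_name[:10]


def clade_passes(seq_names, target_prefix, max_nontarget, min_target,
                 special_prefixes, min_special):
    """Check one clade against one clade-grabbing rule (hypothetical helper)."""
    taxa = {taxon_code(name) for name in seq_names}
    if not taxa:
        return False
    target = {t for t in taxa if t.startswith(target_prefix)}
    nontarget = taxa - target

    # Column 3: minimum number of target taxa in the clade.
    if len(target) < min_target:
        return False

    # Column 2: a value above 1 is treated as an absolute count,
    # otherwise as a proportion of the taxa in the clade.
    if max_nontarget > 1:
        if len(nontarget) > max_nontarget:
            return False
    elif len(nontarget) / len(taxa) > max_nontarget:
        return False

    # Columns 4 and 5: at least min_special 'special' taxa must be present.
    special = {t for t in taxa if any(t.startswith(p) for p in special_prefixes)}
    return len(special) >= min_special


# Hypothetical clade: 13 ciliate taxa plus one non-ciliate.
ciliates = [f"Sr_ci_T{i:03d}_seq1" for i in range(13)]
clade = ciliates + ["Op_me_Hsap_seq1"]
print(clade_passes(clade, "Sr_ci", 0.1, 13, ["Sr_ci_T000"], 1))
```

Here the single non-ciliate makes up 1 of 14 taxa, which is under the 0.1 limit, so the clade would be kept under the example rule.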
 Having well-organized ten-digit-codes for sample identification is vital for running the CL, especially clade grabbing, because it will allow the CL to find all sequences belonging to taxa from a specific taxonomic group. For example, in the ten-digit-codes used in the 1000-ReadyToGo file database, in order to clade grab for Opisthokonta the CL will look for all sequences with ten-digit-codes starting with `Op_`, and to clade grab for Metazoa, all ten-digit-codes starting with `Op_me`, etc.
 The columns in the rules file are
 | Column | Input options | Description |
 | ------------- | ------------- | ------------- |
 | Target taxa | Sample code prefix or path to file | Clade identifier or list of ten-digit-codes of the target taxon group |
 ### Running
 To run the CL, use a similar command structure as described for running PhyloToL part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. Available parameters are:
@@ -169,3 +174,5 @@ To run the CL, use a similar command structure as described for running PhyloToL
 | --clade_grabbing_exceptions  | no | Path to a file | List of taxa to _not_ remove for any reason | none |
 ## Ortholog selection and concatenation
 TBD