From 04ee27ef198260d9b4c151d04667acbe5e25970e Mon Sep 17 00:00:00 2001
From: Katzlab <katzlab@smith.edu>
Date: Tue, 13 Aug 2024 18:52:01 -0400
Subject: [PATCH] Updated PhyloToL Part 2: MSAs, trees, and contamination loop
 (markdown)

---
 ...-2:-MSAs,-trees,-and-contamination-loop.md | 31 ++++++++++++-------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
index e06e5dc..34dc9bd 100644
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -134,27 +134,32 @@ The contamination coop (CL) is implemented within PhyloToL to allow the removal
 
 **Sisters-based contamination removal** identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. **Subsisters-based removal** operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, useful for when multiple samples are contaminated by the same other sample. 
 
-**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that include 20 opithokonts; all other opisthokont sequences in the tree are removed and their sequences listed in TBD seqremoved.txt.
-
-The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (i.e. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to a file called TBD), generate a fasta file for each gene family without contaminating sequences, reconstruct an alignment using TBD Guidance with TBD iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.
+**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that includes 20 opisthokonts; all other opisthokont sequences in the tree are removed and listed in TBD seqremoved.txt.
 
+The CL runs iteratively, and users set the number of times that rules should be applied to reconstructed trees using the `--nloops` parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), PhyloToL will, for each iteration: identify a list of sequences as contaminants (writing them out to a file called `SequencesRemoved_ContaminationLoop.txt`), generate a fasta file for each gene family excluding contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect the outputs to determine the optimal number for their study.
 
 ### Contamination loop setup
 
-The CL requires 1) a folder of alignments (not gap trimmed) and 2) a folder of gene trees in order to run, and they should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above). You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
+The CL requires 1) a folder of alignments (preferably not gap-trimmed) and 2) a folder of gene trees. Both should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above); most importantly, each alignment file and its corresponding tree file should begin with the same unique GF identifier. You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
 
-You will also need to create a 'rules' file. The format here varies between the different modes of the CL.
+You will also need to create a 'rules' file that defines which sequences should be removed and under what circumstances. The format varies between the different modes of the CL, but in all modes it is a tab-separated file. Examples can be found in [Datasets S4-S8 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
 
-TBD _describe rules files here_
+In the sisters and subsisters modes, your rules file should include three columns. Each row represents a rule: a sequence from a taxon (identified by a ten-digit code or shorter code in the first column) will be removed if it is sister to a sequence from the taxon in the second column and on a branch that is shorter than X times the average branch length in the tree, where X is the number in the third column. Set the third column to “NA” if you do not want any branch length restriction for the rule. For example, the line
+
+|Op_ch_Dgra | Sr_di | 0.1|
+|-|-|-|
+
+indicates that a sequence from the choanoflagellate Op_ch_Dgra should be removed if it is sister to any dinoflagellate (any sequence beginning with the prefix Sr_di) on a branch that is less than one tenth the average branch length in the gene tree.
+
+In clade-grabbing mode, each row again represents a rule. This time, there are five columns. The first column gives the target taxonomic group for which you are clade grabbing. Here you can give a ten-digit code, a subset of a code, or even the path to a text file containing a list of multiple codes if they don't all share a precise enough prefix. The second column gives the maximum proportion (or absolute number, if greater than 1) of taxa in a clade that may fall outside the target group, and the third column gives the minimum number of target taxa that must be in a clade for it to be kept. The fourth column allows you to give a list of 'special' taxa (or just a ten-digit code or a subset of a code), X of which must be present in a clade for it to be selected, where X is the number in the fifth column. For example, the line
+
+|Sr_ci | 0.1 | 13 | ciliate_genomes.txt | 1|
+|-|-|-|-|-|
+
+indicates that all ciliate sequences should be removed if they don't fall in a clade with at least 13 ciliate species (unique ten-digit codes beginning with Sr_ci), where no more than 1/10 of the species in the clade are non-ciliates, and containing at least 1 sequence that begins with a prefix listed in the ciliate_genomes.txt file (e.g., if you're more confident in genomic data, you may want to make sure that there's a genome in your clade).
 
 Having well-organized ten-digit-codes for sample identification is vital for running the CL, especially clade grabbing, because it will allow the CL to find all sequences belonging to taxa from a specific taxonomic group. For example, in the ten-digit-codes used in the 1000-ReadyToGo file database, in order to clade grab for Opisthokonta the CL will look for all sequences with ten-digit-codes starting with `Op_`, and to clade grab for Metazoa, all ten-digit-codes starting with `Op_me`, etc.
 
-The columns in the rules file are
-
-| Column | Input options | Description |
-| ------------- | ------------- | ------------- |
-| Target taxa | Sample code prefix or path to file | Clade identifier or list of ten-digit-codes of the target taxon group |
-
 ### Running
 
 To run the CL, use a similar command structure as described for running PhyloToL part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. Available parameters are:
@@ -169,3 +174,5 @@ To run the CL, use a similar command structure as described for running PhyloToL
 | --clade_grabbing_exceptions  | no | Path to a file | List of taxa to _not_ remove for any reason | none |
 
 ## Ortholog selection and concatenation
+
+TBD