Updated PhyloToL Part 2 (markdown)

Katzlab 2024-08-09 17:34:08 -04:00
parent 38e32534ba
commit 0517d7c1f2

@ -12,4 +12,22 @@ We provide a diverse database of 1,000 genomes and transcriptomes from across th
# Contamination loop
The Contamination Loop is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (= Phylogenetic based contamination removal). 3 modes are available and described in this section: sister, subsister and clade. All modes take a user defined rules file to identify the sequences to remove.
The Contamination Loop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules' that is used to identify the sequences to remove.
Sister-based contamination removal identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. Subsister-based removal operates similarly, but looks at the taxa that are sister to the _parent_ node of sample A's sequence, which is useful when multiple samples are contaminated by the same other sample.
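The sister-based rule can be illustrated with a minimal sketch. This is not the PhyloToL implementation: the nested-tuple tree, the `SAMPLE_seqid` leaf names, the `rules` dictionary, and the function names are all hypothetical, and the sketch assumes a rooted tree and flags a sequence only when every sequence in its sister lineage comes from a listed contaminating sample (one possible reading of the rule).

```python
# Hypothetical representation: a rooted tree as nested tuples,
# with leaves named "SAMPLE_seqid" (not the PhyloToL naming scheme).

def leaves(node):
    """Return all leaf names under a node."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out

def sample_of(leaf):
    """Sample name is assumed to be the text before the first underscore."""
    return leaf.split("_", 1)[0]

def sister_contaminants(tree, rules):
    """Flag leaves whose sister lineage holds only known contaminating samples.

    rules maps a contaminated sample to the set of samples known to contaminate it.
    """
    flagged = set()

    def visit(node):
        if isinstance(node, str):
            return
        for i, child in enumerate(node):
            if isinstance(child, str):
                # Samples found in all sibling subtrees of this leaf
                sisters = {
                    sample_of(tip)
                    for j, sibling in enumerate(node) if j != i
                    for tip in leaves(sibling)
                }
                contaminators = rules.get(sample_of(child), set())
                if sisters and sisters <= contaminators:
                    flagged.add(child)
            visit(child)

    visit(tree)
    return flagged

tree = ((("A_1", "B_1"), "C_1"), ("A_2", "D_1"))
rules = {"A": {"B"}}  # sample B is known to contaminate sample A
flagged = sister_contaminants(tree, rules)
print(flagged)
```

Here `A_1` is flagged because its only sister is a sequence from sample B, while `A_2` is kept because its sister comes from sample D, which the rules do not list.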
Clade-based contamination removal operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokont clades, we might choose to keep only Opisthokont sequences that fall in a monophyletic clade of 12 or more species of Opisthokont; all other Opisthokont sequences in the tree would be removed.
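The 'clade-grabbing' idea can be sketched as follows. This is an illustration under stated assumptions rather than the PhyloToL code: the nested-tuple tree, the `GROUP_species_seqid` leaf names, and the `clade_grab` function are hypothetical, and the sketch assumes a rooted tree and a single criterion (a minimum number of distinct species per monophyletic clade).

```python
# Hypothetical representation: a rooted tree as nested tuples,
# with leaves named "GROUP_species_seqid" (an assumed naming scheme).

def leaves(node):
    """Return all leaf names under a node."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out

def group_of(leaf):
    return leaf.split("_")[0]

def species_of(leaf):
    return leaf.split("_")[1]

def clade_grab(tree, group, min_species):
    """Keep target-group sequences only if they sit in a monophyletic clade
    of the target group containing at least min_species distinct species.

    Returns (kept, removed) as two sets of leaf names.
    """
    kept = set()

    def visit(node):
        tips = leaves(node)
        if all(group_of(t) == group for t in tips):
            # Largest clade at this position made up only of the target group
            if len({species_of(t) for t in tips}) >= min_species:
                kept.update(tips)
            return
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    targets = {t for t in leaves(tree) if group_of(t) == group}
    return kept, targets - kept

tree = ((("Op_Hsap_1", "Op_Mmus_1"), "Op_Dmel_1"), ("Am_Ddis_1", "Op_Scer_1"))
kept, removed = clade_grab(tree, group="Op", min_species=3)
print(sorted(kept), sorted(removed))
```

With a threshold of three species, the three-species `Op` clade is kept, and the lone `Op_Scer_1` sequence, which groups with a non-`Op` sequence, is marked for removal. Sequences from other groups are left untouched.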
## Setup
To run, the CL requires 1) a folder of alignments (not gap-trimmed) and 2) a folder of gene trees. Both should be formatted as they are output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above). You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
You will also need to create a 'rules' file. The format here varies between the different modes of the CL.
## Running