Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

Katzlab 2024-08-13 10:33:19 -04:00
parent dc3137df91
commit 9d31a6da25

@ -1,5 +1,5 @@
# Overview and Modularity
PhyloToL Part 2 is designed to: 1) generate multiple sequence alignments (**MSAs**) using Guidance version 2.02 (https://taux.evolseq.net/guidance/source), iterating to remove sequences that are not homologous (default sequence score cutoff is 0.3); 2) generate gene trees using a third-party program (RAxML, IQ-TREE, or FastTree); 3) remove contaminating sequences with the "Contamination Loop" (**CL**), applying user-defined rules for both the sister/subsister and clade-grabbing options (see bioRxiv); and 4) select orthologs and generate species trees from concatenated alignments.
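The iterative removal in step 1 can be pictured as a simple loop: score every sequence, drop those below the default sequence score of 0.3, and repeat until nothing more is removed. The sketch below is only an illustration of that idea, not PhyloToL's or Guidance's actual code. The names `iterative_filter`, `toy_score`, and `max_iter` are hypothetical, and `toy_score` is a stand-in for a real Guidance run (it scores by relative length purely so the example is runnable).

```python
from typing import Callable, Dict

# Default sequence score from the text above; lower-scoring sequences are dropped.
SEQ_SCORE_CUTOFF = 0.3

Seqs = Dict[str, str]
Scores = Dict[str, float]


def iterative_filter(
    seqs: Seqs,
    score_fn: Callable[[Seqs], Scores],
    cutoff: float = SEQ_SCORE_CUTOFF,
    max_iter: int = 5,  # hypothetical cap on the number of rounds
) -> Seqs:
    """Re-score and filter until no sequence falls below the cutoff."""
    current = dict(seqs)
    for _ in range(max_iter):
        if not current:
            break  # nothing left to score
        scores = score_fn(current)
        kept = {name: s for name, s in current.items() if scores[name] >= cutoff}
        if len(kept) == len(current):
            break  # converged: this round removed nothing
        current = kept
    return current


def toy_score(seqs: Seqs) -> Scores:
    """Stand-in scorer (NOT Guidance): length relative to the longest sequence."""
    longest = max(len(s) for s in seqs.values())
    return {name: len(s) / longest for name, s in seqs.items()}


example = {"taxonA": "MKTAYIAKQR", "taxonB": "MKTAYIAKQ", "taxonC": "MK"}
filtered = iterative_filter(example, toy_score)
print(sorted(filtered))  # taxonC scores 0.2 and is removed
```

In a real run the scoring step would be replaced by a call to Guidance and a parse of its per-sequence scores; the loop structure is the only part this sketch is meant to convey.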
PTL6p2 starts from the “ReadyToGo” files produced by Part 1 (or any set of per-taxon sequences with names that match PTL6 criteria) and generates multiple sequence alignments and trees. The pipeline is modular and can be started, paused, and resumed at multiple points. Input and output options for this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by PTL6p1, unaligned amino acid sequences per gene family (GF), in which case pre-Guidance filters are not run, aligned sequence files (one per GF), in which case Guidance does not run, or even trees if a user is running only the contamination loop or concatenation. All details are provided below.
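One way to read the input options above is as a mapping from input type to the earliest stage that still needs to run. The sketch below encodes that reading. The stage names, the `InputKind` enum, and `available_stages` are hypothetical labels chosen for illustration and are not PhyloToL's actual command-line options. It also assumes that aligned input skips the pre-Guidance filters as well as Guidance, which the text implies but does not state.

```python
from enum import Enum
from typing import List

# Hypothetical stage labels, in pipeline order.
STAGES = [
    "preguidance_filters",
    "guidance",
    "trees",
    "contamination_loop",
    "concatenation",
]


class InputKind(Enum):
    READY_TO_GO = "readytogo"   # folder of per-taxon ReadyToGo files from Part 1
    UNALIGNED = "unaligned"     # unaligned amino acid sequences, one file per GF
    ALIGNED = "aligned"         # aligned sequences, one file per GF
    TREES = "trees"             # gene trees for the contamination loop or concatenation


# Earliest stage that applies to each input type, following the text above.
FIRST_STAGE = {
    InputKind.READY_TO_GO: "preguidance_filters",
    InputKind.UNALIGNED: "guidance",        # pre-Guidance filters are not run
    InputKind.ALIGNED: "trees",             # Guidance does not run
    InputKind.TREES: "contamination_loop",  # only CL and/or concatenation
}


def available_stages(kind: InputKind) -> List[str]:
    """Return the stages that can still run for a given input type."""
    start = STAGES.index(FIRST_STAGE[kind])
    return STAGES[start:]


for kind in InputKind:
    print(kind.value, "->", available_stages(kind))
```

Later stages remain optional in this picture: a user who supplies aligned files may stop after tree building, and one who supplies trees may run only the contamination loop or only concatenation.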