Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-29 06:10:30 +08:00 · 2024-08-20 11:20:43 -04:00 · f40a8198f6
commit f40a8198f6
parent aafa36f3b9
1 changed files with 1 additions and 1 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -185,7 +185,7 @@ To run the CL, use a similar command structure as described for running PhyloToL
 ## Ortholog selection and concatenation
-PhyloToL includes an optional step, which can be run after the tree-building stage (or by using `--start trees` and passing to the `--data` argument a folder of trees and corresponding alignments), to select orthologs (one sequence at most per tax from each GF) and build a concatenated alignment. PhyloToL first identifies for each taxon the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for metazoa); alternatively, a user can select orthologs for only a target group of taxon using the `--concat_target_taxa` argument by inputting a file with a list of ten digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that's the sequence chosen for concatenation; otherwise, then a score is given to each sequence equal to its length times is k-mer coverage for transcriptomic data, and just the sequence length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, then the space is filled with gaps in the concatenated alignments. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" in which can be found all the selected orthologs for each GF, aligned and unaligned. 
+PhyloToL includes an optional step, which can be run after the tree-building stage (or by using `--start trees` and passing to the `--data` argument a folder of trees and corresponding alignments or unaligned sequence files), to select orthologs (at most one sequence per taxon from each GF) and build a concatenated alignment. PhyloToL first identifies, for each taxon, the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for metazoa); alternatively, a user can select orthologs for only a target group of taxa using the `--concat_target_taxa` argument by inputting a file with a list of ten-digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that sequence is chosen for concatenation; otherwise, each sequence is given a score equal to its length times its k-mer coverage for transcriptomic data, or just its length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, the space is filled with gaps in the concatenated alignment. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" containing all the selected orthologs for each GF, aligned and unaligned.
 To run this step, add the `--concatenate` flag to your PhyloToL command. Parameters are:
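The scoring and gap-filling rules described in the changed paragraph can be illustrated with a minimal Python sketch. This is not PhyloToL's actual code: the `Candidate` class, the function names, and the taxon codes are all hypothetical, and the sketch assumes the largest monophyletic clade has already been identified, so that `candidates` holds only the sequences from one taxon inside that clade. It also assumes that "length" counts residues and not alignment gaps.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    """One sequence from a taxon inside the chosen clade (hypothetical structure)."""
    name: str
    aligned_seq: str
    kmer_coverage: Optional[float] = None  # None stands for genomic data


def score(c: Candidate) -> float:
    # Assumption of this sketch: length counts residues, not gap characters.
    length = len(c.aligned_seq.replace("-", ""))
    if c.kmer_coverage is None:
        return float(length)           # genomic: length only
    return length * c.kmer_coverage    # transcriptomic: length x k-mer coverage


def pick_ortholog(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the single candidate, or the highest-scoring one if there are several."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates, key=score)


def concatenate(
    selected: Dict[str, Dict[str, Candidate]],  # GF -> taxon -> chosen sequence
    gf_widths: Dict[str, int],                  # GF -> alignment width
    taxa: List[str],
) -> Dict[str, str]:
    """Join per-GF alignments per taxon, filling missing GFs with gaps."""
    rows = {}
    for taxon in taxa:
        parts = []
        for gf, width in gf_widths.items():
            cand = selected.get(gf, {}).get(taxon)
            parts.append(cand.aligned_seq if cand else "-" * width)
        rows[taxon] = "".join(parts)
    return rows
```

For example, with two transcriptomic candidates `Candidate("seq1", "MKT-LLA", 3.0)` and `Candidate("seq2", "MKTALLA", 2.0)`, the scores are 6 × 3.0 = 18 and 7 × 2.0 = 14, so `pick_ortholog` returns `seq1` even though `seq2` is longer. A taxon with no sequence in a GF of width 4 contributes `----` for that GF in `concatenate`.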