Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2026-02-12 18:30:25 +08:00 · 2024-08-13 10:07:58 -04:00 · 7c4fd903cf
commit 7c4fd903cf
parent 18353f9a63
1 changed files with 32 additions and 4 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -19,7 +19,15 @@ If you want to start at a different point other than raw data, you will change t
 This is the line you could run, with minimum requirements: 
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out

-__provides the table with list of options flags parameters here__
+Argument | Default | Choices | Help
+-- | -- | -- | --
+--start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
+--end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
+--gf_list | None |   | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
+--taxon_list | None |   | Path to the file with the taxa (10-digit codes) to include in the output.
+--data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
+--output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
+
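The core arguments in the table above can be mirrored in a small `argparse` sketch. This is only an illustration built from the table's defaults and choices, not the actual source of `phylotol.py`; the function names `build_parser` and `check_args` are hypothetical, and the check simply encodes the note that `--gf_list` is only required when starting from raw data.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the table above (not the real phylotol.py)."""
    parser = argparse.ArgumentParser(prog="phylotol.py")
    parser.add_argument("--start", default="raw",
                        choices=["raw", "unaligned", "aligned", "trees"],
                        help="Stage at which to start running PhyloToL.")
    parser.add_argument("--end", default="trees",
                        choices=["unaligned", "aligned", "trees"],
                        help="Stage until which to run PhyloToL.")
    parser.add_argument("--gf_list", default=None,
                        help="Path to the file with the GFs of interest.")
    parser.add_argument("--taxon_list", default=None,
                        help="Path to the file with the taxa to include.")
    parser.add_argument("--data", help="Path to the input dataset.")
    parser.add_argument("--output", default="./",
                        help="Directory where the output folder is created.")
    return parser


def check_args(args):
    """Assumed validation: --gf_list is needed only when starting from raw data."""
    if args.start == "raw" and args.gf_list is None:
        raise ValueError("--gf_list is required when --start is 'raw'")


# Same arguments as the minimal command line shown above.
args = build_parser().parse_args([
    "--start", "raw", "--end", "trees",
    "--gf_list", "listofOGs.txt",
    "--taxon_list", "taxon_list.txt",
    "--data", "Input_folder",
    "--output", "Output_folder",
])
check_args(args)
print(args.start, args.end, args.output)
```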

 Optional arguments can then be added to the command line; they are described below.

@@ -40,7 +48,12 @@ Adding these options to the command line will give:

 Another option for filtering sequences from the ReadyToGo files at the pre-Guidance step (reducing both the computing resources needed and sequence redundancy) is the similarity filter. Here, the user can remove sequences that are too similar by adding the '--similarity_filter' parameter to the command line, either for the whole dataset or only for a specific set of taxa. All parameters involved in reducing sequences by similarity are summarized in this table:

-__table here__
+Argument | Default | Choices | Help
+-- | -- | -- | --
+--og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
+--similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
+--sim_cutoff | 1 |   | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff.
+--sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.

 Adding these options to the command line will give:

@@ -52,17 +65,30 @@ As you run PhyloToL, list of sequences removed by Similarity Filter and Guidance

 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt > Output1.out

+Argument | Default | Choices | Help
+-- | -- | -- | --
+--blacklist |   |   | A text file with a list of sequence names not to consider.
+
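As a rough illustration of what a blacklist does, the sketch below reads a list of sequence names and drops them from a set of sequences. It assumes the blacklist holds one sequence name per line, which the table does not state; the helper names `load_blacklist` and `drop_blacklisted` are hypothetical and are not taken from the PhyloToL scripts.

```python
from pathlib import Path


def load_blacklist(path):
    """Read sequence names, assuming one name per line; blank lines are skipped."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def drop_blacklisted(sequences, blacklist):
    """Return a copy of a name -> sequence mapping without blacklisted names."""
    return {name: seq for name, seq in sequences.items() if name not in blacklist}
```

For example, with a blacklist containing `seq_B`, calling `drop_blacklisted({"seq_A": "MKV", "seq_B": "MKL"}, {"seq_B"})` keeps only `seq_A`.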
 ## Guidance

 Within PhyloToL, we use Guidance as a proxy to assess homology in gene families. PTL6p2 runs Guidance iteratively to remove non-homologous sequences, defined as those that fall below the sequence score cutoff. (We note that there is some stochasticity here given the iteration of alignments built into the method.) After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their scores, and MSAs are rebuilt after each iteration. Some options are available to change the default setup for Guidance:

-__table guidance options__
+Argument | Default | Choices | Help
+-- | -- | -- | --
+--guidance_iters | 5 |   | Number of Guidance iterations for sequence removal.
+--seq_cutoff | 0.3 |   | During guidance, taxa are removed if their score is below this cutoff.
+--col_cutoff | 0.0 |   | During guidance, columns are removed if their score is below this cutoff.
+--res_cutoff | 0.0 |   | During guidance, residues are removed if their score is below this cutoff.
+--keep_temp | store_true |   | Use this to keep ALL Guidance intermediate files.
+--keep_iter / -z | store_true |   | Keep all Guidance iterations (beware this will be very large)
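
The iterative removal described in this section can be sketched as a simple loop. This is a minimal illustration, not the PhyloToL implementation: `score_fn` is a hypothetical stand-in for a real Guidance run that returns one score per sequence, and stopping early once no sequence falls below the cutoff is an assumption.

```python
def iterative_filter(names, score_fn, seq_cutoff=0.3, guidance_iters=5):
    """Repeatedly score sequences and drop those below seq_cutoff.

    score_fn(names) -> {name: score} stands in for rebuilding the MSA
    and rescoring it with Guidance (hypothetical interface).
    """
    kept = list(names)
    removed = []  # (name, score) pairs, in order of removal
    for _ in range(guidance_iters):
        scores = score_fn(kept)
        failing = [name for name in kept if scores[name] < seq_cutoff]
        if not failing:
            break  # assumed stopping rule: nothing left to remove
        removed.extend((name, scores[name]) for name in failing)
        kept = [name for name in kept if scores[name] >= seq_cutoff]
    return kept, removed


# Toy scores in place of a real Guidance run.
toy_scores = {"seq_A": 0.9, "seq_B": 0.1, "seq_C": 0.5}
kept, removed = iterative_filter(
    ["seq_A", "seq_B", "seq_C"],
    lambda current: {name: toy_scores[name] for name in current},
)
print(kept, removed)
```

Keeping the removed names together with their scores mirrors the output files mentioned above, which list each removed sequence with its score.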

 ## Gene trees

 After homology assessment and MSA building (Guidance step), PhyloToL trims the alignments and builds trees. By default, alignments are trimmed at 0.95% with trimAl, and trees are built by IQ-TREE with an LG+G model. All parameters can be altered here by adding the correct flags (see table), including the choice of tool for phylogenetic reconstruction.

-__table parameters here__
+Argument | Default | Choices | Help
+-- | -- | -- | --
+--tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.

 ## Summary for basic launching

@@ -87,6 +113,8 @@ Argument | Default | Choices | Help
 --blacklist |   |   | A text file with a list of sequence names not to consider.
 --og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
 --sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
+
+Core parameters - rarely changed from defaults
 --blast_cutoff | 1e-20 |   | Blast e-value cutoff.
 --len_cutoff | 10 |   | Amino acid length cutoff for removal of very short sequences after column removal in Guidance.
 --similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.