Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

MCLeleu 2024-08-13 10:02:05 -04:00
parent 1b9e337b94
commit 18353f9a63

@@ -74,6 +74,31 @@ Minimum requirements:
List of all parameters included in PhyloToL:
Argument | Default | Choices | Help
-- | -- | -- | --
--start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
--end | trees | unaligned, aligned, trees | Stage at which to stop running PhyloToL. Options are "unaligned" (up to but not including Guidance), "aligned" (up to but not including RAxML), and "trees" (runs through RAxML).
--gf_list | None |   | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
--taxon_list | None |   | Path to the file with the taxa (10-digit codes) to include in the output.
--data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
--output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
--force | store_true |   | Overwrite all existing files in the "Output" folder.
--tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.
--blacklist |   |   | A text file with a list of sequence names not to consider.
--og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
--sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
--blast_cutoff | 1e-20 |   | Blast e-value cutoff.
--len_cutoff | 10 |   | Amino acid length cutoff for removal of very short sequences after column removal in Guidance.
--similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
--sim_cutoff | 1 |   | Sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff.
--guidance_iters | 5 |   | Number of Guidance iterations for sequence removal.
--seq_cutoff | 0.3 |   | During Guidance, taxa are removed if their score is below this cutoff.
--col_cutoff | 0.0 |   | During Guidance, columns are removed if their score is below this cutoff.
--res_cutoff | 0.0 |   | During Guidance, residues are removed if their score is below this cutoff.
--keep_temp | store_true |   | Use this to keep ALL Guidance intermediate files.
--keep_iter / -z | store_true |   | Keep all Guidance iterations (beware: this will be very large).
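
To make the table easier to read, here is a minimal `argparse` sketch of how a subset of these options could be declared and parsed. It is illustrative only and is not PhyloToL's actual source: the function name `build_parser`, the `type=` conversions, and the example folder name `my_alignments` are assumptions; the flag names, defaults, and choices are copied from the table.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a few rows of the parameter table."""
    parser = argparse.ArgumentParser(
        description="Sketch of a subset of the PhyloToL options"
    )
    # Stage options: defaults and choices as listed in the table.
    parser.add_argument("--start", default="raw",
                        choices=["raw", "unaligned", "aligned", "trees"])
    parser.add_argument("--end", default="trees",
                        choices=["unaligned", "aligned", "trees"])
    parser.add_argument("--data")
    parser.add_argument("--tree_method", default="iqtree",
                        choices=["iqtree", "raxml", "all"])
    # Numeric cutoffs; the type conversions are an assumption.
    parser.add_argument("--blast_cutoff", type=float, default=1e-20)
    parser.add_argument("--guidance_iters", type=int, default=5)
    parser.add_argument("--seq_cutoff", type=float, default=0.3)
    # "store_true" rows are switches: off unless the flag is given.
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--keep_iter", "-z", action="store_true")
    return parser


# Example: start from aligned data and keep every Guidance iteration.
args = build_parser().parse_args(
    ["--start", "aligned", "--data", "my_alignments", "-z"]
)
print(args.start, args.end, args.tree_method, args.keep_iter)
# prints: aligned trees iqtree True
```

Options listed with `store_true` take no value, so passing `--force` or `-z` alone is enough to enable them.
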
## Contamination loop
The contamination loop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules', which is used to identify the sequences to remove.
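
When the contamination loop starts from trees, the `--data` description above requires a fasta file for every tree, with identical file names apart from the extension. The sketch below shows one way to check this before a run. It is not part of PhyloToL: the helper name `unmatched_trees` and the `.tre` and `.fasta` extensions are assumptions, so adjust them to your own files.

```python
from pathlib import Path


def unmatched_trees(data_dir, tree_ext=".tre", fasta_ext=".fasta"):
    """Return the names of tree files that have no same-named fasta file.

    The extensions are hypothetical defaults, not values taken from PhyloToL.
    """
    data_dir = Path(data_dir)
    # File names without their final extension, for every fasta file.
    fasta_stems = {p.stem for p in data_dir.glob(f"*{fasta_ext}")}
    # A tree is unmatched when no fasta file shares its base name.
    return sorted(
        p.name for p in data_dir.glob(f"*{tree_ext}")
        if p.stem not in fasta_stems
    )
```

An empty list means every tree has a matching fasta file. The check covers file names only; it does not verify that the sequence names in each fasta file match the tips of the tree.
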