Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

MCLeleu 2024-08-13 10:02:05 -04:00
parent 1b9e337b94
commit 18353f9a63

@ -74,6 +74,31 @@ Minimum requirements:
List of all parameters included in PhyloToL:
Argument | Default | Choices | Help
-- | -- | -- | --
--start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
--end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including Guidance), "aligned" (up to but not including RAxML), and "trees", which will run through RAxML.
--gf_list | None |   | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
--taxon_list | None |   | Path to the file with the taxa (10-digit codes) to include in the output.
--data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
--output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
--force | store_true |   | Overwrite all existing files in the "Output" folder.
--tree_method | iqtree | iqtree, raxml, all | Program to use for tree-building.
--blacklist |   |   | A text file with a list of sequence names not to consider.
--og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
--sim_taxa | None |   | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
--blast_cutoff | 1e-20 |   | Blast e-value cutoff.
--len_cutoff | 10 |   | Amino acid length cutoff for removal of very short sequences after column removal in Guidance.
--similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
--sim_cutoff | 1 |   | Sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff.
--guidance_iters | 5 |   | Number of Guidance iterations for sequence removal.
--seq_cutoff | 0.3 |   | During Guidance, taxa are removed if their score is below this cutoff.
--col_cutoff | 0.0 |   | During Guidance, columns are removed if their score is below this cutoff.
--res_cutoff | 0.0 |   | During Guidance, residues are removed if their score is below this cutoff.
--keep_temp | store_true |   | Use this to keep ALL Guidance intermediate files.
--keep_iter / -z | store_true |   | Keep all Guidance iterations (beware: this will be very large).
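
To make the table easier to read, here is a minimal sketch of how a few of these arguments could be declared with Python's `argparse`. It is an illustration only, not the actual PhyloToL source: the `build_parser` function name and the choice of which options to include are assumptions, and the numeric types (`float`, `int`) are inferred from the defaults listed above. It mainly shows that options listed with a default of `store_true` are boolean flags that are `False` unless passed on the command line.

```python
import argparse


def build_parser():
    """Hypothetical parser covering a subset of the options in the table."""
    parser = argparse.ArgumentParser(
        description="Illustrative subset of the PhyloToL options (not the real parser)."
    )
    # Stage selection: values outside `choices` are rejected by argparse.
    parser.add_argument("--start", default="raw",
                        choices=["raw", "unaligned", "aligned", "trees"])
    parser.add_argument("--end", default="trees",
                        choices=["unaligned", "aligned", "trees"])
    parser.add_argument("--tree_method", default="iqtree",
                        choices=["iqtree", "raxml", "all"])
    # Numeric cutoffs; the types are assumed from the listed defaults.
    parser.add_argument("--blast_cutoff", type=float, default=1e-20)
    parser.add_argument("--guidance_iters", type=int, default=5)
    parser.add_argument("--seq_cutoff", type=float, default=0.3)
    # "store_true" options are flags: False by default, True when present.
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--keep_iter", "-z", action="store_true")
    return parser


# Parse an example argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--start", "unaligned", "--end", "aligned", "--force"])
print(args.start, args.end, args.tree_method, args.force, args.keep_iter)
```

In this sketch, omitted options fall back to the defaults from the table (for example `--tree_method` stays `iqtree`), while `--force` becomes `True` only because it was passed.
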
## Contamination loop
The contamination loop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of "rules" used to identify the sequences to remove.