Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

Auden Cote-L'Heureux 2025-01-10 10:59:02 -05:00
parent 2b35b2edcc
commit 95979d4a74

@@ -59,14 +59,14 @@ You will also need to give PhyloToL part 2 a list of all of the sample identifie
 Below is a list of basic PhyloToL part 2 parameters:
-Argument | Default | Choices | Description
+Argument | Default value | Options | Description
 -- | -- | -- | --
 --start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
 --end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
---gf_list | None |   | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
---taxon_list | None |   | Path to the file with the taxa (10-digit codes) to include in the output.
---data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
---output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
+--gf_list | No default | Any valid path | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
+--taxon_list | No default | Any valid path | Path to the file with the taxa (10-digit codes) to include in the output.
+--data | No default | Any valid path | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
+--output | Current directory | Any valid path | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
 Optional arguments can then be added to the base command, and will be described below. In the following is described each stage of PhyloToL, and some key parameters to know for each step.
@@ -93,9 +93,9 @@ Another option to filter sequences from the ReadyToGo files at the pre-guidance
 Argument | Default | Choices | Help
 -- | -- | -- | --
 --og_identifier | OG | OG, OG6, OGA, OGG | Program to use for selecting sequences by GC width.
---similarity_filter | store_true |   | Run the similarity filter in pre-Guidance.
---sim_cutoff | 1 | _float_ | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar (% amino acid identity over 20% of their length) than this cutoff.
---sim_taxa | None | Path to file | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
+--similarity_filter | flag (true/false) | include or exclude the argument | Run the similarity filter in pre-Guidance.
+--sim_cutoff | 1 | Any number between zero and one | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar (% amino acid identity over 20% of their length) than this cutoff.
+--sim_taxa | No default | Any valid path | Path to the file with the taxa (10-digit codes) to apply the similarity filter on.
 Adding these options to the command line will give:
@@ -110,7 +110,7 @@ The blacklist is a user-defined set of sequences to be removed from runs. You mi
 Argument | Default | Choices | Help
 -- | -- | -- | --
---blacklist | None | Path to a file | A text file with a list of sequence names not to consider.
+--blacklist | No default | Any valid path | Path to a text file with a list of sequence names not to consider.
 ## Guidance
@@ -118,12 +118,12 @@ Within PhyloToL part 2, we use Guidance to assess homology within gene families.
 Argument | Default | Choices | Description
 -- | -- | -- | --
---guidance_iters | 5 | _int_ | Number of Guidance iterations for sequence removal.
---seq_cutoff | 0.3 | _float_ | During guidance, taxa are removed if their score is below this cutoff.
---col_cutoff | 0.0 | _float_ | During guidance, columns are removed if their score is below this cutoff.
---res_cutoff | 0.0 | _float_ | During guidance, residues are removed if their score is below this cutoff.
---keep_temp | True | flag | Use this to keep ALL Guidance intermediate files.
---keep_iter / -z | True | flag | Keep all Guidance iterations (beware this will be very large)
+--guidance_iters | 5 | Any positive integer | Number of Guidance iterations for sequence removal.
+--seq_cutoff | 0.3 | Any number between 0 and 1 | During guidance, taxa are removed if their score is below this cutoff.
+--col_cutoff | 0.0 | Any number between 0 and 1 | During guidance, columns are removed if their score is below this cutoff.
+--res_cutoff | 0.0 | Any number between 0 and 1 | During guidance, residues are removed if their score is below this cutoff.
+--keep_temp | False | include or exclude the argument | Use this to keep ALL Guidance intermediate files.
+--keep_iter / -z | False | include or exclude the argument | Keep all Guidance iterations (beware this will be very large)
 ## Gene trees
@@ -177,11 +177,11 @@ To run the CL, use a similar command structure as described for running PhyloToL
 | Parameter | Required | Options | Description | Default |
 | ------------- | ------------- | ------------- | ------------- | ------------- |
 | --contamination_loop | yes | seq, clade | The mode in which to run the CL | none |
-| --nloops | no | _int_ | Number of iterations | 5 |
-| --sister_rules | in sisters mode | Path to a file | Sisters rules file | none |
-| --subsister_rules | in subsisters mode | Path to a file | Subsisters rules file | none |
-| --clade_grabbing_rules | in clade mode | Path to a file | Clade-grabbing rules file | none |
-| --clade_grabbing_exceptions | no | Path to a file | List of taxa to _not_ remove for any reason | none |
+| --nloops | no | any positive integer | Number of iterations | 5 |
+| --sister_rules | only in sisters mode | Any valid path | Path to a text file containing 'sisters rules' | none |
+| --subsister_rules | only in subsisters mode | Any valid path | Path to a text file containing 'subsisters rules' | none |
+| --clade_grabbing_rules | only in clade mode | Any valid path | Path to a text file containing 'clade-grabbing rules' | none |
+| --clade_grabbing_exceptions | no | Any valid path | List of taxa to _not_ remove for any reason | none |
 ## Ortholog selection and concatenation