Updated EukPhylo QuickStart (markdown)

Adri K. Grow 2025-02-09 01:44:51 -05:00
parent 527e4e2b8b
commit 76b75b01b3

@ -130,10 +130,10 @@ For additional input parameter options, see table below or run: `python phylotol
|:---|:-------|:---|:---|
|`--start`|`raw`, `unaligned`, `aligned`, `trees`|Stage at which to start running PhyloToL|`raw`|
|`--end`|`unaligned`, `aligned`, `trees`|Stage at which to stop running PhyloToL. Options are `unaligned` (runs up to but not including guidance), `aligned` (runs up to but not including RAxML), and `trees` (runs through RAxML)|`trees`|
|`--gf_list`|Any valid path|Path to the file with the GFs of interest. Only required if starting from the raw dataset|None|
|`--taxon_list`|Any valid path|Path to the file with the taxa (10-digit codes) to include in the output|None|
|`--data`|Any valid path|Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees **AND** a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)|None|
|`--output`|Any valid path|Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts|`../`|
|`--gf_list`|Valid path|Path to the file with the GFs of interest. Only required if starting from the raw dataset|None|
|`--taxon_list`|Valid path|Path to the file with the taxa (10-digit codes) to include in the output|None|
|`--data`|Valid path|Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees **AND** a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)|None|
|`--output`|Valid path|Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts|`../`|
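
The `--data` requirement for starting the contamination loop from trees (one fasta file per tree, with identical names apart from the extension) can be checked before a run. The following is a minimal sketch, not part of EukPhylo: the function name `unpaired_trees` and both extension sets are assumptions, so adjust them to match your own files. It only checks file pairing, not whether tip names match sequence names.

```python
from pathlib import Path

# Assumed extensions (not taken from the EukPhylo docs); edit as needed.
TREE_EXTS = {".tre", ".tree", ".treefile", ".nwk"}
FASTA_EXTS = {".fasta", ".fas", ".fa", ".faa"}


def unpaired_trees(data_dir):
    """Return the names of tree files that have no fasta file with the same stem."""
    files = [p for p in Path(data_dir).iterdir() if p.is_file()]
    # Stems (file names without the last extension) of all fasta files.
    fasta_stems = {p.stem for p in files if p.suffix.lower() in FASTA_EXTS}
    return sorted(
        p.name
        for p in files
        if p.suffix.lower() in TREE_EXTS and p.stem not in fasta_stems
    )
```

An empty list means every tree in the folder has a matching fasta file.
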
### Modularity
Below are several optional ways to parameterize EukPhylo Part 2
@ -163,16 +163,16 @@ Below are several optional ways to parameterize EukPhylo Part 2
|`--sim_taxa`|no|A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)|NA|
**For removing known poor-quality or contaminant sequences (user informed):**
|Parameter|Description|
|:---|:---|
|`--blacklist`|str; A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
|Parameter|Options|Description|
|:---|:---|:---|
|`--blacklist`|str|A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
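
To illustrate what removing blacklisted sequences amounts to, here is a minimal sketch. It is not EukPhylo's implementation: the function name `drop_blacklisted` is hypothetical, and it assumes the blacklist holds one sequence ID per line and that each ID equals the full fasta header text after `>`.

```python
def drop_blacklisted(fasta_text, blacklist_text):
    """Return fasta_text without the records whose ID appears in blacklist_text."""
    # Assumption: one sequence ID per line; blank lines are ignored.
    blocked = {line.strip() for line in blacklist_text.splitlines() if line.strip()}
    kept, keep = [], True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header line decides whether this whole record is kept.
            keep = line[1:].strip() not in blocked
        if keep:
            kept.append(line)
    return "\n".join(kept)
```
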
**For removing sequences based on GC composition:**
*Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script [here](https://github.com/Katzlab/EukPhylo/tree/main/Utilities/for_fastas) on GitHub*
|Parameter|Options|Description|Default|
|:---|:---|:---|:---|
|`--og_identifier`|`OG`,`OG6`,`OGA`,`OGG`|Select sequences by GC width|`OG`|
|`--og_identifier`|`OG`, `OG6`, `OGA`, `OGG`|Select sequences by GC width|`OG`|
## Contamination Removal
Contamination removal within EukPhylo (also called the Contamination Loop, or CL) removes sequences based either on Sisters/Subsisters identification or on Clade diversity. An exemplar run is available on [Figshare](https://figshare.com/articles/dataset/Examplar_runs_PhyloToL_and_CLoop/26662018).
@ -199,15 +199,15 @@ Options:
| Parameter | Required | Options | Description | Default |
| ------------- | ------------- | ------------- | ------------- | ------------- |
|`--contamination_loop`|yes|seq, clade|The mode in which to run the CL|NA|
|`--contamination_loop`|yes|`seq`, `clade`|The mode in which to run the CL|NA|
|`--nloops`|no|positive int|Number of iterations|`5`|
|`--sister_rules`|only in sisters mode|Any valid path|Path to a text file containing sisters rules|NA|
|`--subsister_rules`|only in subsisters mode|Any valid path|Path to a text file containing subsisters rules|NA|
|`--clade_grabbing_rules`|only in clade mode|Any valid path|Path to a text file containing clade-grabbing rules|NA|
|`--clade_grabbing_exceptions`|no|Any valid path|List of taxa to _not_ remove for any reason|NA|
|`--sister_rules`|only in sisters mode|Valid path|Path to a text file containing sisters rules|NA|
|`--subsister_rules`|only in subsisters mode|Valid path|Path to a text file containing subsisters rules|NA|
|`--clade_grabbing_rules`|only in clade mode|Valid path|Path to a text file containing clade-grabbing rules|NA|
|`--clade_grabbing_exceptions`|no|Valid path|List of taxa to _not_ remove for any reason|NA|
|`--cl_tree_method`|no|`iqtree`, `raxml`, `fasttree`, `iqtree_fast`|Tree-building method to use in each contamination loop iteration|`fasttree`|
|`--cl_alignment_method`|no|`mafft_only`, `guidance`|Alignment method to use in each contamination loop iteration|`mafft_only`|
|`--cl_exclude_taxa`|no|Any valid path|Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop|NA|
|`--cl_exclude_taxa`|no|Valid path|Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop|NA|
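
The constraints in the table above can be checked before launching a run. The sketch below is an illustration only, not EukPhylo code: the function name `check_cl_args` and the dict-of-flags input are assumptions, while the allowed values and the default of `5` for `--nloops` are copied from the table.

```python
# Allowed values, copied from the options table.
CL_CHOICES = {
    "--contamination_loop": {"seq", "clade"},
    "--cl_tree_method": {"iqtree", "raxml", "fasttree", "iqtree_fast"},
    "--cl_alignment_method": {"mafft_only", "guidance"},
}


def check_cl_args(args):
    """Return a list of problems found in a dict mapping flag -> value."""
    problems = []
    if "--contamination_loop" not in args:
        problems.append("--contamination_loop is required")
    for flag, allowed in CL_CHOICES.items():
        if flag in args and args[flag] not in allowed:
            problems.append(f"{flag} must be one of {sorted(allowed)}")
    # --nloops is optional; the table lists 5 as its default.
    nloops = args.get("--nloops", 5)
    if not isinstance(nloops, int) or nloops < 1:
        problems.append("--nloops must be a positive integer")
    # The table marks --clade_grabbing_rules as needed only in clade mode.
    if args.get("--contamination_loop") == "clade" and "--clade_grabbing_rules" not in args:
        problems.append("--clade_grabbing_rules is required in clade mode")
    return problems
```

An empty list means the supplied flags are consistent with the table; it does not confirm that the referenced files exist.
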
## Concatenation