Updated EukPhylo QuickStart (markdown)

Adri K. Grow 2025-02-09 01:41:45 -05:00
parent ee9bb48a45
commit 527e4e2b8b

@ -141,7 +141,7 @@ Below are several optional ways to parameterize EukPhylo Part 2
**General:**
|Parameter|Options|Description|Default|
|:---|:---|:---|:---|
|`--force`||Overwrites all existing files in the `Output` folder|
|`--force`||Overwrites all existing files in the `Output` folder|NA|
|`--tree_method`|`iqtree`, `iqtree_fast`, `raxml`, `fasttree`|Change tree building software|`iqtree`|
**For BLAST and GUIDANCE:**
@ -156,26 +156,26 @@ Below are several optional ways to parameterize EukPhylo Part 2
|`--guidance_threads`|int|Number of threads to allocate to Guidance|`20`|
**For reducing number of similar sequences:**
|Parameter|Required|Options|Help|
|Parameter|Required|Description|Default|
|:---|:---|:---|:---|
|`--similarity_filter`|yes|action = store_true|Run the similarity filter in pre-Guidance|
|`--sim_cutoff`|yes|default = 1, type = float|Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff|
|`--sim_taxa`|no|default = None|A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)|
|`--similarity_filter`|yes|Run the similarity filter in pre-Guidance|NA|
|`--sim_cutoff`|yes|float; sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff|`1`|
|`--sim_taxa`|no|A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)|NA|
**For removing known poor-quality or contaminant sequences (user informed):**
|Parameter|Description|
|:---|:---|
|`--blacklist`|type = str; A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
|`--blacklist`|str; A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
**For removing sequences based on GC composition:**
*Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script [here](https://github.com/Katzlab/EukPhylo/tree/main/Utilities/for_fastas) on GitHub*
|Parameter|Options|Help|
|:---|:---|:---|
|`--og_identifier`|default = `OG`, choices = `OG`,`OG6`,`OGA`,`OGG`|Select sequences by GC width|
|Parameter|Options|Description|Default|
|:---|:---|:---|:---|
|`--og_identifier`|`OG`,`OG6`,`OGA`,`OGG`|Select sequences by GC width|`OG`|
## Contamination Removal
Contamination removal within EukPhylo (also called Contamination Loop) allows for sequence removal based on Sisters/Subsisters identification or based on Clades diversity. An examplar run is available in [Figshare](https://figshare.com/articles/dataset/Examplar_runs_PhyloToL_and_CLoop/26662018)
Contamination removal within EukPhylo (also called the Contamination Loop, or CL) removes sequences based either on Sisters/Subsisters identification or on Clades diversity. An exemplar run is available on [Figshare](https://figshare.com/articles/dataset/Examplar_runs_PhyloToL_and_CLoop/26662018).
### Set up:
* An input folder (called for example Input), with both
@ -187,11 +187,11 @@ Contamination removal within EukPhylo (also called Contamination Loop) allows fo
* the Scripts Folder
### Running:
Basic running of the Contamination loop, with the sister mode:
Basic running of the Contamination Loop, with the sister mode:
`python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop seq --sister_rules sister_rules_file.txt > log.out`
Basic running of the Contamination loop, with the clade mode:
Basic running of the Contamination Loop, with the clade mode:
`python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop clade --clade_grabbing_rules_file clade_grabbing_rules.txt > log.out`
@ -199,15 +199,15 @@ Options:
| Parameter | Required | Options | Description | Default |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| --contamination_loop | yes | seq, clade | The mode in which to run the CL | none |
| --nloops | no | any positive integer | Number of iterations | `5` |
| --sister_rules | only in sisters mode | Any valid path | Path to a text file containing sisters rules | none |
| --subsister_rules | only in subsisters mode | Any valid path | Path to a text file containing subsisters rules | none |
| --clade_grabbing_rules | only in clade mode | Any valid path | Path to a text file containing clade-grabbing rules | none |
| --clade_grabbing_exceptions | no | Any valid path | List of taxa to _not_ remove for any reason | none |
| --cl_tree_method | no | `iqtree`, `raxml`, `fasttree`, `iqtree_fast` | Tree-building method to use in each contamination loop iteration. | fasttree |
| --cl_alignment_method | no | `mafft_only`, `guidance` | Alignment method to use in each contamination loop iteration. | `mafft_only`|
| --cl_exclude_taxa | no | Any valid path | Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop. | none |
|`--contamination_loop`|yes|seq, clade|The mode in which to run the CL|NA|
|`--nloops`|no|positive int|Number of iterations|`5`|
|`--sister_rules`|only in sisters mode|Any valid path|Path to a text file containing sisters rules|NA|
|`--subsister_rules`|only in subsisters mode|Any valid path|Path to a text file containing subsisters rules|NA|
|`--clade_grabbing_rules`|only in clade mode|Any valid path|Path to a text file containing clade-grabbing rules|NA|
|`--clade_grabbing_exceptions`|no|Any valid path|List of taxa to _not_ remove for any reason|NA|
|`--cl_tree_method`|no|`iqtree`, `raxml`, `fasttree`, `iqtree_fast`|Tree-building method to use in each contamination loop iteration|`fasttree`|
|`--cl_alignment_method`|no|`mafft_only`, `guidance`|Alignment method to use in each contamination loop iteration|`mafft_only`|
|`--cl_exclude_taxa`|no|Any valid path|Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop|NA|
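
The options above can be combined with either basic command. As a rough illustration, the sketch below assembles a Contamination Loop command as an argument list and prints it. The helper `build_cl_command` is hypothetical and not part of EukPhylo. The flag names are copied from the commands and table above. Note that the clade-mode command uses `--clade_grabbing_rules_file` while the table lists `--clade_grabbing_rules`; the sketch follows the command, so check which spelling your installed version accepts.

```python
import shlex

# Hypothetical helper (not part of EukPhylo): builds the argument list for a
# Contamination Loop run using only flags that appear in this QuickStart.
CL_TREE_METHODS = {"iqtree", "raxml", "fasttree", "iqtree_fast"}

# Rules flag per mode, as written in the two basic commands above.
# Assumption: the clade flag spelling follows the command, not the table.
RULES_FLAG = {
    "seq": "--sister_rules",
    "clade": "--clade_grabbing_rules_file",
}


def build_cl_command(mode, rules_file, data="Input", output="Output",
                     nloops=None, cl_tree_method=None):
    """Return the eukphylo.py command for one Contamination Loop run."""
    if mode not in RULES_FLAG:
        raise ValueError(f"mode must be one of {sorted(RULES_FLAG)}")
    cmd = [
        "python3", "Scripts/eukphylo.py",
        "--start", "trees", "--end", "trees",
        "--data", data, "--output", output,
        "--contamination_loop", mode,
        RULES_FLAG[mode], rules_file,
    ]
    if nloops is not None:
        if nloops < 1:
            raise ValueError("nloops must be a positive integer")
        cmd += ["--nloops", str(nloops)]
    if cl_tree_method is not None:
        if cl_tree_method not in CL_TREE_METHODS:
            raise ValueError(f"cl_tree_method must be one of {sorted(CL_TREE_METHODS)}")
        cmd += ["--cl_tree_method", cl_tree_method]
    return cmd


cmd = build_cl_command("seq", "sister_rules_file.txt", nloops=3,
                       cl_tree_method="fasttree")
print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run` with `stdout` pointed at an open file would mirror the `> log.out` redirection used in the commands above.
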
## Concatenation