From 76b75b01b3b839c0ca89be8f99ea1292bca65575 Mon Sep 17 00:00:00 2001
From: "Adri K. Grow" <42044618+adriannagrow@users.noreply.github.com>
Date: Sun, 9 Feb 2025 01:44:51 -0500
Subject: [PATCH] Updated EukPhylo QuickStart (markdown)

---
 EukPhylo-QuickStart.md | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/EukPhylo-QuickStart.md b/EukPhylo-QuickStart.md
index 2675334..931aeb6 100644
--- a/EukPhylo-QuickStart.md
+++ b/EukPhylo-QuickStart.md
@@ -130,10 +130,10 @@ For additional input parameter options, see table below or run: `python phylotol
 |:---|:-------|:---|:---|
 |`--start`|`raw`, `unaligned`, `aligned`, `trees`|Stage at which to start running PhyloToL|`raw`|
 |`--end`|`unaligned`, `aligned`, `trees`|Stage until which to run PhyloToL. Options are `unaligned` (which will run up to but not including guidance), `aligned` (which will run up to but not including RAxML), and `trees` which will run through RAxML')|`trees`|
-|`--gf_list`|Any valid path|Path to the file with the GFs of interest. Only required if starting from the raw dataset|None|
-|`--taxon_list`|Any valid path|Path to the file with the taxa (10-digit codes) to include in the output|None|
-|`--data`|Any valid path|Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees **AND** a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)|None|
-|`--output`|Any valid path|Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts|`../`|
+|`--gf_list`|Valid path|Path to the file with the GFs of interest. Only required if starting from the raw dataset|None|
+|`--taxon_list`|Valid path|Path to the file with the taxa (10-digit codes) to include in the output|None|
+|`--data`|Valid path|Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees **AND** a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)|None|
+|`--output`|Valid path|Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts|`../`|
 
 ### Modularity
 Below are several optional ways to parameterize EukPhylo Part 2
@@ -163,16 +163,16 @@ Below are several optional ways to parameterize EukPhylo Part 2
 |`--sim_taxa`|no|A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)|NA|
 
 **For removing known poor-quality or contaminant sequences (user informed):**
-|Parameter|Description|
-|:---|:---|
-|`--blacklist`|str; A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
+|Parameter|Options|Description|
+|:---|:---|:---|
+|`--blacklist`|str|A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
 
 **For removing sequences based on GC composition:**
 *Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script [here](https://github.com/Katzlab/EukPhylo/tree/main/Utilities/for_fastas) on GitHub*
 
 |Parameter|Options|Description|Default|
 |:---|:---|:---|:---|
-|`--og_identifier`|`OG`,`OG6`,`OGA`,`OGG`|Select sequences by GC width|`OG`
+|`--og_identifier`|`OG`, `OG6`, `OGA`, `OGG`|Select sequences by GC width|`OG`
 
 ## Contamination Removal
 Contamination removal within EukPhylo (also called Contamination Loop or CL) allows for sequence removal based on Sisters/Subsisters identification or based on Clades diversity. An examplar run is available in [Figshare](https://figshare.com/articles/dataset/Examplar_runs_PhyloToL_and_CLoop/26662018)
@@ -199,15 +199,15 @@ Options:
 
 | Parameter | Required | Options | Description | Default |
 | ------------- | ------------- | ------------- | ------------- | ------------- |
-|`--contamination_loop`|yes|seq, clade|The mode in which to run the CL|NA|
+|`--contamination_loop`|yes|`seq`, `clade`|The mode in which to run the CL|NA|
 |`--nloops`|no|positive int|Number of iterations|`5`|
-|`--sister_rules`|only in sisters mode|Any valid path|Path to a text file containing sisters rules|NA|
-|`--subsister_rules`|only in subsisters mode|Any valid path|Path to a text file containing subsisters rules|NA|
-|`--clade_grabbing_rules`|only in clade mode|Any valid path|Path to a text file containing clade-grabbing rules|NA|
-|`--clade_grabbing_exceptions`|no|Any valid path|List of taxa to _not_ remove for any reason|NA|
+|`--sister_rules`|only in sisters mode|Valid path|Path to a text file containing sisters rules|NA|
+|`--subsister_rules`|only in subsisters mode|Valid path|Path to a text file containing subsisters rules|NA|
+|`--clade_grabbing_rules`|only in clade mode|Valid path|Path to a text file containing clade-grabbing rules|NA|
+|`--clade_grabbing_exceptions`|no|Valid path|List of taxa to _not_ remove for any reason|NA|
 |`--cl_tree_method`|no|`iqtree`, `raxml`, `fasttree`, `iqtree_fast`|Tree-building method to use in each contamination loop iteration|`fasttree`|
 |`--cl_alignment_method`|no|`mafft_only`, `guidance`|Alignment method to use in each contamination loop iteration|`mafft_only`|
-|`--cl_exclude_taxa`|no|Any valid path|Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop|NA|
+|`--cl_exclude_taxa`|no|Valid path|Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop|NA|
 
 ## Concatenation