Updated EukPhylo QuickStart (markdown)

2025-12-28 00:40:26 +08:00 · 2025-02-05 14:32:47 -05:00 · 2025-02-05 14:32:47 -05:00 · fd915bf93c
commit fd915bf93c
parent 4553eec336
1 changed files with 15 additions and 19 deletions
--- a/EukPhylo-QuickStart.md
+++ b/EukPhylo-QuickStart.md
@@ -1,16 +1,17 @@
 # General Steps

-EukPhylo pipeline is composed of two parts, that can be run individually: Part 1 can be run only once, to assign gene families; Part 2 builds MSA, Trees, implement contamination removal and Concatenation.
-It's preferable to run Part 2 with the outputs of Part 1, but this is not required as long as the users files are in the same format (one file per species, with sequences name start by 10 digit code and ends with OGx_xxxxxx. See extended version of the wiki for details)
+The EukPhylo pipeline is composed of two parts that can be run independently: Part 1 only needs to be run once, to assign gene families; Part 2 builds MSAs and trees, and implements contamination removal and concatenation. It is preferable to run Part 2 using the outputs of Part 1 as input, but this is not required as long as the input files are in the same format (one FASTA file per species, with sequence IDs starting with a 10-digit taxon identifier and ending in a gene family identifier with the format OGx_xxxxxx; see the extended version of the wiki for details).
+
+## Contents

 1. Install EukPhylo
 2. Run EukPhylo part 1
   a. with the Hook database or with custom database
-   b. with Assembled Transcripts or Assembled Genomes
-   c. modularity of options
+   b. with assembled transcripts or genomic CDS
+   c. modularity
 3. Run EukPhylo part 2
   a. basic running
-   b. modularity of options
+   b. modularity
   c. contamination removal
   d. choosing orthologs and concatenation
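
The input format described in the overview (one FASTA file per species, sequence IDs starting with a 10-digit taxon identifier and ending in `OGx_xxxxxx`) can be checked before running Part 2. The following is a minimal sketch, not part of EukPhylo itself: the function names are hypothetical, the taxon identifier is assumed to be simply the first 10 characters of the ID, and each `x` in `OGx_xxxxxx` is assumed to be a single word character. See the extended wiki for the authoritative naming rules.

```python
import re

# Assumed pattern (not taken from the EukPhylo code): 10 non-whitespace
# characters for the taxon identifier, anything in between, then
# "OG" + one character + "_" + six characters at the very end of the ID.
SEQ_ID_PATTERN = re.compile(r"^(?P<taxon>\S{10})\S*(?P<og>OG\w_\w{6})$")


def check_seq_id(seq_id):
    """Return (taxon_code, og_id) if seq_id fits the assumed format, else None."""
    match = SEQ_ID_PATTERN.match(seq_id)
    if match is None:
        return None
    return match.group("taxon"), match.group("og")


def bad_ids_in_fasta(lines):
    """Collect header IDs (first whitespace-delimited token) that fail the check."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if check_seq_id(seq_id) is None:
                bad.append(seq_id)
    return bad
```

For example, `bad_ids_in_fasta(open("some_taxon.fasta"))` would list the headers in a (hypothetical) file that need renaming before the file can be used as Part 2 input.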

@@ -31,9 +32,9 @@ Dependencies & third party tools, along with the versions that we use at the Kat
 * tqdm


-# EukPhylo part 1 = Assigning Gene families
+# EukPhylo part 1: Assigning gene families

-EukPhylo part 1 runs CDS (genome) or assembled transcripts (transcriptome) through several scripts in order (5 for CDS, 7 for assembled transcripts) to remove bacterial contamination and produce ReadyToGo files. These scripts are run through a ‘wrapper’ script.
+EukPhylo part 1 runs CDS (genomes) or assembled transcripts (transcriptomes) through several scripts in order (5 for CDS, 7 for assembled transcripts) to remove bacterial contamination and produce ReadyToGo files. These scripts are run through a ‘wrapper’ script.

 ## Transcriptomes:
 ### Set Up:
@@ -67,9 +68,8 @@ Code parameters:
 1. **ReadyToGo files = AA, NTD**
 2. **Summary and statistics of sequences**

-### Modularity of options and replacing the Hook database
-EukPhylo part 1 for transcriptomes is composed of 7 scripts. User can choose to start or stop at each step by changing the -1 and/or -2 options.
-If a user chooses to use their own gene family database, they need to replace the Hook.fasta file in the Databases folder and build a diamond version of their own file; with adjusting the naming system of sequences as the only requirement.
+### Modularity, and replacing the Hook database
+EukPhylo part 1 for transcriptomes is composed of 7 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options. If a user chooses to use their own gene family database, they need to replace the Hook.fasta file in the Databases folder and build a DIAMOND version of their own file; the only requirement is that the sequence names follow the expected naming system.

 ## Genomes:
 ### Set Up:
@@ -102,13 +102,11 @@ Code parameters:
 1. **ReadyToGo files = AA, NTD**
 2. **Summary and statistics of sequences**

-### Modularity of options and replacing the Hook database
+### Modularity, and replacing the Hook database
 EukPhylo part 1 for genomes is composed of 5 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options.
-If a user choose to use their own gene families database, they need to replace the Hook.fasta file in the Database folder, and build a diamond version of their own file; with adjusting the naming system of sequences as the only requirement.
-
-
-# EukPhylo part 2 = MSAs, Trees, Contamination Removal, and Concatenation
+If a user chooses to use their own gene family database, they need to replace the Hook fasta file in the Database folder and build a DIAMOND version of their own file; the only requirement is that the sequence names follow the expected naming system.

+# EukPhylo part 2: MSAs, trees, contamination removal, and concatenation

 ## MSAs and Trees
 ### Set up:
@@ -120,7 +118,6 @@ In a main project directory:
 * Create a folder called `R2Gs` that contains the AA ReadyToGo fasta files for all taxa (from `taxa.txt`)
 * Create a list of OGs for tree building called `OG_list.txt`

-
 ### Running:

 Basic running for building MSAs and Trees:
@@ -138,8 +135,7 @@ For additional input parameter options, see table below or run: `python phylotol
 |`--data`|Any valid path|Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees **AND** a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)|None|
 |`--output`|Any valid path|Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts|`../`|

-
-### Modularity of options
+### Modularity
 Below are several optional ways to parameterize EukPhylo Part 2.

 **General:**
@@ -166,7 +162,7 @@ Below are several optional ways to parameterize EukPhylo Part 2
 |`--sim_cutoff`|yes|default = 1, type = float|Sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff|
 |`--sim_taxa`|no|default = None|A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)|

-**For removing poor sequences (user informed):**
+**For removing known poor-quality or contaminant sequences (user informed):**
 |Parameter|Description|
 |:---|:---|
 |`--blacklist`|type = str; A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)|
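
To illustrate what a user-informed removal list does conceptually, here is a minimal sketch of dropping listed sequence IDs from a FASTA file. This is not EukPhylo's implementation: the function names are hypothetical, and the list file is assumed to hold one sequence ID per line, matched against the first whitespace-delimited token of each FASTA header.

```python
def read_blacklist(lines):
    """Return the set of sequence IDs listed one per line (blank lines ignored)."""
    return {line.strip() for line in lines if line.strip()}


def filter_fasta(lines, blacklist):
    """Return the FASTA lines whose records are not named in the blacklist."""
    kept = []
    keep_record = True
    for line in lines:
        if line.startswith(">"):
            # A header starts a new record; decide once whether to keep it.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep_record = seq_id not in blacklist
        if keep_record:
            kept.append(line)
    return kept
```

Used on open file handles, for example `filter_fasta(open("taxon.fasta"), read_blacklist(open("to_remove.txt")))`, this returns the lines of every record not named in the list, with multi-line sequences handled record by record.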