Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-30 01:40:25 +08:00 · 2024-08-13 15:23:16 -04:00 · 2024-08-13 15:23:16 -04:00 · 0c7beba74c
commit 0c7beba74c
parent 825f53ae99
1 changed files with 43 additions and 34 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -1,50 +1,60 @@
 # Overview and Modularity
-PhyloToL Part 2 is designed to: 1) generate multisequence alignments (**MSAs**) firsts filtering in a "Pre-Guidance" step and then using [Guidance](https://taux.evolseq.net/guidance/source) version 2.02 , iterating to remove sequences considered non-homologs (based on sequence score is 0.3); 2) generate gene trees using a 3rd party program (RaxML, IQTree, FastTree); 3) remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options; and 4) ortholog selection and generation of species trees from concatenated alignments.
+PhyloToL Part 2 is designed to:
 1. Generate multisequence alignments (**MSAs**), first filtering in a "Pre-Guidance" step and then using [Guidance](https://taux.evolseq.net/guidance/source) version 2.0.2, iterating to remove sequences considered non-homologs (based on a sequence score cutoff of 0.3)
 2. Generate gene trees using a third-party program (RAxML, IQ-Tree, FastTree)
 3. Remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options
 4. Select orthologs and construct concatenated alignments for building species trees
-PTL6p2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match PLT6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by PTL6p1, unaligned amino acid sequences per GF (in which case pre-Guidance filters are not run), aligned sequence files (one file per each GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop or concatenation. Details are provided bellow.
+PhyloToL part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of per-taxon fasta files whose sequence names match PhyloToL 6 criteria) and generates multisequence alignments and trees. The pipeline is modular and can be started, paused, and resumed at multiple points. Input and output options for this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided below.
 # Set Up
-Requirements include: fasta files with controlled names.
+
 ## Dependencies
 The following are required to run PhyloToL part 2. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
 * Python 3, and the following libraries:
    * [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75)
    * [ETE3](http://etetoolkit.org/) (version 3.1.2)
    * [tqdm](https://tqdm.github.io/) (version 4.66.4; unlikely to matter)
 * [Guidance](https://taux.evolseq.net/guidance/) (version 2.0.2)
 * [TrimAL](https://vicfero.github.io/trimal/index.html) (version 2.rev0)
 * [MAFFT](https://mafft.cbrc.jp/alignment/software/) (version 7.49)
 * [IQ-Tree](http://www.iqtree.org/) (version 2.1.2) and/or [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/) (version 8.2.12) and/or [FastTree](https://microbesonline.org/fasttree/) (version 2.1)
 ## Input data and folder structure
 _ACL is going to return here_
 Running PhyloToL Part 2 requires at least four items in your main directory (see setup section above):
 1) a folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) containing all scripts from PhyloToL part 2, 2) a folder containing your input files, 3) a taxon list, and 4) an OG list. Given that PhyloToL part 2 is highly modular and flexible, you will want to be sure at which point in the part 2 process you wish to start and end (**TBD - point to figshare file here -- some figure**). The default script starts with raw data and produces trees using scripts 1 - TBD. 
 # Databases
 For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (**TBD - point to figshare file here -- table of names**). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 (**TBD - point to figshare file here -- R2Gs**). This means that, using this dataset, you can jump right into running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above). 
 # Running PhyloToL Part 2
-Running PhyloToL Part 2 requires at least 4 items in your main directory: 1) A folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) and containing all scripts from PhyloToL part 2, 2) a folder containing your input files, 3) a taxon list and 4) an OG list. Given that PhyloToL part 2 is highly modular and flexible, you will want to be sure of what point to the process of PTLp2 you wish to start and end (**TBD - point to figshare file here -- some figure). The default script starts with raw data and produces trees using scripts 1 - TBD. 
+Once you have set up your run folder as described in the [Set-up](set-up) section above, you're ready to run PhyloToL part 2. The pipeline is highly modular, and contains five main sections:
 1. pre-Guidance (groups your sequences by GF instead of by sample and applies some basic filters; produces an unaligned amino acid file for each GF)
 2. Guidance (aligns your amino acid sequences and iteratively removes putative non-homologs; produces an aligned amino acid file for each GF)
 3. Tree building (produces a tree file [newick string] for each GF)
 4. Contamination loop (see below)
 5. Concatenation (see below)
 You can start and stop running PhyloToL part 2 at any of these sections, and the input each section expects is different, as explained below. You run part 2 using a Python command; if you're starting from scratch (ReadyToGo files as output by part 1) and would like to run all the way through tree-building (the most basic way of running the pipeline), use the following command:
-The code to run PhyloToL Part 2 in full is
+`python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder`
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out
-Example of listofOGs.txt
+The `--start` and `--end` parameters tell PhyloToL what to expect in terms of input, and when to stop running the pipeline.
-OG6_111062
+* If you want to produce trees, keep the default `--end` parameter set to 'trees' 
-OG6_105827
+* If you want to run through the pre-Guidance step, set `--end` to 'unaligned'
-OG6_107533
+* If you want to run through the Guidance step, set `--end` to 'aligned'
-OG6_116087
+* If you want to start at a point other than raw data, change the default `--start` parameter to 'unaligned' (input a fasta file of unaligned amino acid sequences for each GF), 'aligned' (input a fasta file of aligned amino acid sequences for each GF), or 'trees' (input a newick string file for each GF). 
 OG6_105227
-Example of taxon_list.txt
+The `--data` parameter is where you point PhyloToL to your input files. If starting from ReadyToGo files (`--start raw`), this should be the path to a folder containing amino acid ReadyToGo files as output by part 1. If starting with Guidance (`--start unaligned`), this should be the path to a folder of unaligned amino acid files, and so on for the other start points.
 Am_tu_He65
 EE_cr_Gthe
 Pl_gr_Vaul
 Ba_pa_Mlot
 Op_me_Hsap
 Sr_st_Fves
 Sr_ci_Sx06
 Op_fu_Gzea
 Op_me_Lcha
 You will also need to give PhyloToL part 2 a list of all of the sample identifiers (taxon_list.txt) and gene family identifiers (listofOGs.txt) you want to include in your analysis; these text files should have no header and should just contain the list of identifiers, with one identifier per row.
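As a rough illustration of the list format described above (no header, one identifier per row), the following sketch reads such a file and flags repeated identifiers. The helper names `read_id_list` and `find_duplicates` are hypothetical and are not part of PhyloToL; the identifiers are taken from the example lists on this page.

```python
from pathlib import Path
import tempfile


def read_id_list(path):
    """Return identifiers from a headerless, one-identifier-per-row text file."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:  # ignore blank rows
            ids.append(line)
    return ids


def find_duplicates(ids):
    """Return identifiers that occur more than once, in first-repeat order."""
    seen, dupes = set(), []
    for identifier in ids:
        if identifier in seen and identifier not in dupes:
            dupes.append(identifier)
        seen.add(identifier)
    return dupes


# Write a small taxon list to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    taxon_file = Path(tmp) / "taxon_list.txt"
    taxon_file.write_text("Am_tu_He65\nEE_cr_Gthe\nPl_gr_Vaul\n")
    taxa = read_id_list(taxon_file)

print(taxa)
print(find_duplicates(taxa))
```

Checking for duplicates is only a convenience here; whether PhyloToL itself tolerates repeated identifiers is not stated on this page.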
-Modularities
+Below is a list of basic PhyloToL part 2 parameters:
 * If you want to produce trees, keep the default `--end` parameter set to 'trees' 
 * If you want to produce up to pre-Guidance files, change the default `--end` parameter to 'unaligned'
 * If you want to produce up to Guidance files, change the default `--end` parameter to 'aligned'
 * If you want to start at a point other than raw data, change the default `--start` parameter to 'unaligned', 'aligned', or 'trees'. 
-
+Argument | Default | Choices | Description
 Below is a list of arguments and their descriptions.
 Argument | Default | Choices | Help
 -- | -- | -- | --
 --start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
 --end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
@@ -53,10 +63,9 @@ Argument | Default | Choices | Help
 --data |   |   | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
 --output | ./ |   | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
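The arguments in the table can be assembled programmatically, for example when launching several runs. The sketch below only builds the argument list and checks `--start` and `--end` against the choices listed in the table; `build_command` is a hypothetical helper, not part of PhyloToL, and the file and folder names are the placeholders used in the command on this page.

```python
import shlex

# Choices copied from the argument table on this page.
START_CHOICES = ("raw", "unaligned", "aligned", "trees")
END_CHOICES = ("unaligned", "aligned", "trees")


def build_command(data, gf_list, taxon_list, output="./", start="raw", end="trees"):
    """Build the phylotol.py argument list (hypothetical helper)."""
    if start not in START_CHOICES:
        raise ValueError(f"--start must be one of {START_CHOICES}, got {start!r}")
    if end not in END_CHOICES:
        raise ValueError(f"--end must be one of {END_CHOICES}, got {end!r}")
    return [
        "python", "Scripts/phylotol.py",
        "--start", start,
        "--end", end,
        "--gf_list", gf_list,
        "--taxon_list", taxon_list,
        "--data", data,
        "--output", output,
    ]


cmd = build_command("Input_folder", "listofOGs.txt", "taxon_list.txt",
                    output="Output_folder")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the main directory; validating the stage names first simply avoids launching a run with a mistyped value.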
 Optional arguments can then be added to the base command, and are described below.
-Optional arguments can then be added to the command line, and will be described bellow.
+## Pre-Guidance: Reorganizing files by gene family, and optional filters
 ## PreGuidance: Reorganizing files by GF and optional filters (similarity, GC)
 The pre-Guidance step of PhyloToL part 2 takes ReadyToGo files (aka "species files", one per sample in your taxon_list.txt) and converts them into "Gene Family files", one file for each of the GFs you listed in listofOGs.txt.
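The regrouping idea can be sketched as follows. This is not the PhyloToL implementation: it assumes that each sequence header contains its gene family identifier in the form `OG6_` followed by six digits (as in the example OG list), and the headers, sequences, and helper names (`parse_fasta`, `group_by_gf`) are invented for illustration.

```python
import re
from collections import defaultdict

# Assumed identifier format, based on the example OG list (e.g. OG6_111062).
GF_PATTERN = re.compile(r"OG6_\d{6}")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_gf(fasta_texts, wanted_gfs):
    """Collect sequences from per-sample FASTA texts into per-GF lists."""
    wanted = set(wanted_gfs)
    grouped = defaultdict(list)
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            match = GF_PATTERN.search(header)
            if match and match.group() in wanted:
                grouped[match.group()].append((header, seq))
    return dict(grouped)


# Two invented per-sample files; only OG6_111062 is requested.
species_a = ">Am_tu_He65_seq1_OG6_111062\nMKV\nLLA\n>Am_tu_He65_seq2_OG6_105827\nMTT\n"
species_b = ">Op_me_Hsap_seq1_OG6_111062\nMKI\n"
by_gf = group_by_gf([species_a, species_b], ["OG6_111062"])
print(by_gf)
```

Each key of `by_gf` would correspond to one Gene Family file; the optional similarity and GC filters mentioned in the heading are not modeled here.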