Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2026-02-12 20:10:25 +08:00 · 2024-08-13 09:52:25 -04:00 · 2024-08-13 09:52:25 -04:00 · 1b9e337b94
commit 1b9e337b94
parent 68dcf12ff7
1 changed files with 34 additions and 9 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -15,23 +15,28 @@ We provide a diverse database of 1,000 genomes and transcriptomes from across th
 Running PhyloToL Part 2 requires at least 4 items in your main directory: 1) a folder named Scripts containing all scripts from PhyloToL part 2, 2) a folder containing your input, 3) a taxon list, and 4) an OG list. As noted earlier, PhyloToL part 2 is highly modular and flexible, so before you start, decide at which point in the PTLp2 process you wish to start and end. By default, the script starts with raw data and produces trees. This choice also influences your input files. If you want to produce trees, keep the default '--end' parameter set to 'trees'.
 If you want to stop at the pre-guidance files, change the default '--end' parameter to 'unaligned'.
 If you want to stop at the guidance files, change the default '--end' parameter to 'aligned'.
-If you want to start at a different point other than raw data, you will change the default '--start’ parameter to 'unaligned', 'aligned', or 'trees'. This is the line you could run, with minimum requirements: 
+If you want to start at a point other than raw data, change the default '--start' parameter to 'unaligned', 'aligned', or 'trees'.
 This is the line you could run, with minimum requirements: 
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out
 __provides the table with list of options flags parameters here__
 Optional arguments can then be added to the command line, and are described below.
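
As a minimal sketch of how optional arguments attach to the minimal command above, the helper below assembles the argument list in Python. The function name `build_phylotol_command` and the `extra_args` parameter are hypothetical conveniences, not part of PhyloToL; the flags and their allowed values are the ones listed in this page.

```python
import shlex

# Allowed values as described in this page.
VALID_START = ("raw", "unaligned", "aligned", "trees")
VALID_END = ("unaligned", "aligned", "trees")


def build_phylotol_command(gf_list, taxon_list, data, output,
                           start="raw", end="trees", extra_args=()):
    """Assemble the argument list for a PhyloToL part 2 run (hypothetical helper)."""
    if start not in VALID_START:
        raise ValueError(f"--start must be one of {VALID_START}")
    if end not in VALID_END:
        raise ValueError(f"--end must be one of {VALID_END}")
    cmd = [
        "python", "Scripts/phylotol.py",
        "--start", start,
        "--end", end,
        "--gf_list", gf_list,
        "--taxon_list", taxon_list,
        "--data", data,
        "--output", output,
    ]
    # Optional flags are simply appended after the required ones.
    cmd.extend(extra_args)
    return cmd


cmd = build_phylotol_command(
    "listofOGs.txt", "taxon_list.txt", "Input_folder", "Output_folder",
    extra_args=["--blacklist", "Blacklist.txt"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, stdout=...)` with an open file handle to reproduce the `> Output1.out` redirection.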
-## Filtering on GC composition
+## Filtering sequences 
-The filtering by GC content is done during pre-guidance and it selects only sequences that fall within a specified range (user defined ranges). 
+Sequence-filtering options are provided within PhyloToL to remove sequences at the "pre-guidance" step, before any intensive computing resources are needed. This pre-guidance step of PhyloToL part 2 takes ReadyToGo files (aka "species files") and converts them into "Gene Family files", one file for each GF you chose in listofOGs.txt, regrouping all sequences from the taxa in your taxon_list.txt. It optionally lets you remove sequences based on GC composition, similarity, or a list of sequences previously identified as unwanted.
-The renaming of each sequence is done using a utility script (GC_identifier.py), previously from running PhyloToL part 2, which renames the sequences with OGG, OG6, and OGA depending on if the sequence GC content falls below or above the user specified GC range. As input for running PhyloToL part 2, user will give the labeled Ready To Go instead of the Usual ones.
+
 ### Filtering on GC composition
 Filtering by GC content selects only sequences that fall within a user-defined range.
 The renaming of each sequence is done with a utility script (GC_identifier.py), run before PhyloToL part 2, which relabels the sequences with OGG, OG6, or OGA depending on whether the sequence GC content falls below or above the user-specified GC range. As input for PhyloToL part 2, the user then provides the labeled ReadyToGo files instead of the usual ones.
 The parameter for this when running pre-guidance is '--og_identifier'; the options are 'OG', 'OG6', 'OGA', and 'OGG'. The default, 'OG', passes all sequences to Guidance without filtering.
 Adding these options to the command line will give:
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG > Output1.out
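
To make the GC labeling idea concrete, here is a small sketch that computes GC content and assigns one of three labels relative to a user-defined range. It is not the code of GC_identifier.py. In particular, the mapping used here (OGA below the range, OG6 within it, OGG above it) is an assumption made for illustration, as is the assumption that the sequences are nucleotide sequences; check the utility script for the actual behavior.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc_label(seq, low, high,
             below_label="OGA", within_label="OG6", above_label="OGG"):
    """Label a sequence by where its GC content falls relative to [low, high].

    The default label mapping is an ASSUMPTION for illustration only.
    """
    gc = gc_fraction(seq)
    if gc < low:
        return below_label
    if gc > high:
        return above_label
    return within_label


for name, seq in [("seq1", "ATATATAT"), ("seq2", "ATGCATGC"), ("seq3", "GGCCGGCC")]:
    print(name, round(gc_fraction(seq), 2), gc_label(seq, low=0.3, high=0.6))
```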
-## Overlap and similarity filters
+### Overlap and similarity filters
 Another option for filtering sequences from the ReadyToGo files at the pre-guidance step (thereby reducing both the computing resources needed and sequence redundancy) is the similarity filter. Here, the user can choose to remove sequences that are too similar by adding the '--similarity_filter' parameter to the command line, either across the whole dataset or only for a specific set of taxa. All parameters involved in reducing sequences by similarity are summarized in this table:
@@ -39,16 +44,36 @@ __table here__
 Adding these options to the command line will give:
-> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt > Output1.out
+> python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt > Output1.out
-## Blacklist
+### Blacklist
 As you run PhyloToL, lists of the sequences removed by the similarity filter and by Guidance (detailed below) are provided. These lists can be reused in subsequent PhyloToL runs to remove sequences before running the Guidance steps (i.e. as a list of sequences not to include in your dataset). As Guidance can take time, removing these already-identified non-homologous sequences can save computing time. For our study, we chose to include only sequences removed by Guidance in our blacklist, but users can choose (wisely) what best fits their study and their data. To use this option in your PhyloToL run, add the '--blacklist' flag to the command line as follows:
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt > Output1.out
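
Conceptually, applying a blacklist amounts to dropping named sequences from each file before alignment. The sketch below illustrates this on FASTA text. It assumes the blacklist holds one full sequence name per line, which is a guess about the file format rather than something stated here, and `parse_fasta` and `drop_blacklisted` are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_blacklisted(records, blacklist_text):
    """Remove records whose name appears in the blacklist (ASSUMED: one name per line)."""
    blacklist = {ln.strip() for ln in blacklist_text.splitlines() if ln.strip()}
    return [(name, seq) for name, seq in records if name not in blacklist]


fasta = ">taxonA_seq1\nMKT\nLLV\n>taxonB_seq7\nMKS\n>taxonC_seq2\nMRT\n"
kept = drop_blacklisted(parse_fasta(fasta), "taxonB_seq7\n")
print([name for name, _ in kept])
```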
 ## Guidance
 Within PhyloToL, we use Guidance as a proxy to assess homology in Gene Families. PTL6p2 runs Guidance iteratively to remove non-homologous sequences, defined as those that fall below the sequence score cutoff. (We note that there is some stochasticity here given the iteration of alignments built into the method.) After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default setup for Guidance:
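
The iterate-score-remove pattern described above can be sketched as follows. This is not how PhyloToL invokes Guidance: `score_fn` is a stand-in for "align the remaining sequences and return a per-sequence score", and `max_iterations` is a hypothetical cap. Only the 0.3 cutoff comes from the text.

```python
def iterative_score_filter(sequences, score_fn, cutoff=0.3, max_iterations=5):
    """Repeatedly rescore the remaining sequences and drop those scoring below cutoff.

    score_fn(dict name->seq) -> dict name->score is a placeholder for an
    alignment-and-scoring step; max_iterations is a hypothetical safety cap.
    """
    kept = dict(sequences)
    removed = []  # (iteration, name, score) for every dropped sequence
    for iteration in range(1, max_iterations + 1):
        if not kept:
            break
        scores = score_fn(kept)  # scores are recomputed on the reduced set
        low = [(name, s) for name, s in scores.items() if s < cutoff]
        if not low:
            break  # nothing left below the cutoff
        for name, s in low:
            removed.append((iteration, name, s))
            del kept[name]
    return kept, removed


def toy_score_fn(seqs):
    """Toy score for the demo only: length relative to the longest sequence."""
    longest = max(len(s) for s in seqs.values())
    return {name: len(s) / longest for name, s in seqs.items()}


seqs = {"a": "MKTLLVAGGA", "b": "MKTLLVAGG", "c": "MK"}
kept, removed = iterative_score_filter(seqs, toy_score_fn)
print(sorted(kept), removed)
```

Recomputing scores after each removal matters because a score depends on the other sequences in the alignment, so removing one sequence can change the scores of the rest.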
 __table guidance options__
 ## Gene trees
 After homology assessment and MSA building (the Guidance step), PhyloToL trims the alignments and builds trees. By default, alignments are trimmed at 0.95% with trimAl, and trees are built by IQ-TREE with an LG+G model. All of these parameters can be changed by adding the appropriate flags (see table), including the choice of tool for phylogenetic reconstruction.
 __table parameters here__
 ## Summary for basic launching
 In summary, PhyloToL is a modular and flexible pipeline that can be started and stopped at any point, with options at each point to fully customize your run. A full PhyloToL run starts with an input folder of ReadyToGo files and ends with an output folder containing PreGuidance files (one file per GF, regrouping sequences from all taxa in your list), NotGapTrimmed files (post-guidance files with non-homologous sequences removed, but not trimmed by trimAl), Guidance files (post-guidance, trimmed files), and Trees.
 Minimum requirements: 
 > python Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder  > Output1.out
 List of all parameters included in PhyloToL:
 ## Contamination loop
 The contamination loop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules' used to identify the sequences to remove.
@@ -60,7 +85,7 @@ Clade-based contamination removal operates differently. In this mode, the CL sea
 The CL runs iteratively, and users must set the number of times that rules should be applied to reconstructed trees. Starting with a set of trees and a list of rules (e.g. a sequence from a ciliate to be removed if it falls sister to a known food source), PhyloToL will: identify a list of sequences as contaminants (writing them out to xxxx file), generate a fasta file for each gene family without the contaminating sequences, reconstruct an alignment using ?Guidance with x iterations?, and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect the outputs to determine the optimal number for their study.
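
The loop structure described above (flag contaminants, drop them, rebuild, repeat) can be sketched as below. Everything apart from the default of 5 loops is a toy stand-in. Here a "tree" is just a dict mapping each sequence to its sister, `toy_rebuild_tree` pairs consecutive sequences instead of realigning and inferring a tree, the rule is a hypothetical `(target_prefix, sister_prefix)` pair rather than PhyloToL's rules-file format, and stopping early when nothing is flagged is a choice of this sketch.

```python
def contamination_loop(sequences, tree, find_contaminants, rebuild_tree, n_loops=5):
    """Apply the rules to the current tree, drop flagged sequences, rebuild, repeat."""
    current = dict(sequences)
    removed_per_loop = []
    for _ in range(n_loops):
        flagged = find_contaminants(tree)
        removed_per_loop.append(flagged)
        if not flagged:
            break  # early stop is an assumption of this sketch
        current = {n: s for n, s in current.items() if n not in flagged}
        tree = rebuild_tree(current)  # stand-in for realignment + tree inference
    return current, removed_per_loop


def toy_rebuild_tree(seqs):
    """Toy 'tree': pair consecutive sequences as sisters (illustration only)."""
    names = list(seqs)
    sisters = {}
    for i in range(0, len(names) - 1, 2):
        a, b = names[i], names[i + 1]
        sisters[a], sisters[b] = b, a
    return sisters


def toy_find_contaminants(tree, rule):
    """Flag sequences with the target prefix whose sister has the contaminant prefix."""
    target_prefix, sister_prefix = rule
    return [n for n, sister in tree.items()
            if n.startswith(target_prefix) and sister.startswith(sister_prefix)]


seqs = {"Cil_a": "MKT", "Food_a": "MKS", "Cil_b": "MRT", "Plant_a": "MKV"}
rule = ("Cil", "Food")  # hypothetical rule: ciliate sequence sister to a food source
final, removed = contamination_loop(
    seqs, toy_rebuild_tree(seqs),
    lambda tree: toy_find_contaminants(tree, rule),
    toy_rebuild_tree,
)
print(list(final), removed)
```

In this toy run the second contaminant only becomes sister to the food source after the first one is removed and the tree is rebuilt, which is why applying the rules more than once can matter.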
-## Contamination loop setup
+### Contamination loop setup
 The CL requires 1) a folder of alignments (not gap trimmed) and 2) a folder of gene trees in order to run, and they should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above). You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
@@ -76,7 +101,7 @@ The columns in the rules file are
 | ------------- | ------------- | ------------- |
 | Target taxa | Sample code prefix or path to file | Clade identifier or list of ten-digit-codes of the target taxon group |
-## Running
+### Running
 To run the CL, use a similar command structure as described for running PhyloToL part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. Available parameters are: