Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

MCLeleu 2024-08-12 23:28:04 +02:00
parent c56f9a9cdf
commit 8b09b0de95

@ -1,5 +1,7 @@
# Overview and Modularity
PhyloToL Part 2 is designed to: 1) generate multiple sequence alignments (**MSAs**) using Guidance (ref/github link ******), iterating to remove sequences that are not homologous (default sequence score is ****); 2) generate gene trees using a third-party program (RAxML, IQ-TREE, FastTree); 3) remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options (see bioRxiv); and 4) select orthologs and generate species trees from concatenated alignments.
PTL6p2 starts from the “ReadyToGo” files produced by part 1 (or any set of per-taxon sequences with names that match PTL6 criteria) and generates multiple sequence alignments and trees. The pipeline is modular, and can be started, paused, and resumed at multiple points. Input and output options for this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by PTL6p1, unaligned amino acid sequences per GF (in which case pre-Guidance filters are not run), aligned sequence files (one per GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop or concatenation. All details are provided below.
# Set Up
Requirements include: fasta files with controlled names.
@ -10,6 +12,12 @@ We provide a diverse database of 1,000 genomes and transcriptomes from across th
# Running PhyloToL Part 2
Running PhyloToL Part 2 requires at least 4 items in your main directory: 1) a folder named Scripts containing all scripts from PhyloToL part 2, 2) a folder containing your input, 3) a taxon list, and 4) an OG list. As noted above, PhyloToL part 2 is highly modular and flexible, so before you start, decide at which points in PTLp2 you wish to start and end. By default, the script starts with raw data and produces trees. Your choice of start and end points also determines which input files you need. If you want to produce trees, keep the default '--end' parameter set to 'trees'.
If you want to stop at the pre-Guidance files, change the '--end' parameter to 'unaligned'.
If you want to stop at the Guidance files, change the '--end' parameter to 'aligned'.
If you want to start at a point other than raw data, change the default '--start' parameter to 'unaligned', 'aligned', or 'trees'. With these choices made, a minimal command looks like this:
> python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder > Output1.out
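Before launching a run, it can help to confirm that the four required items are in place and that the chosen start and end points are compatible. The following is a minimal sketch, not part of PhyloToL itself: the helper names `missing_items` and `build_command` are hypothetical, the file and folder names are the ones used in the example command above, and the rule that the start point must not come after the end point is an assumption about how the stages are ordered.

```python
from pathlib import Path

# Stage names taken from the text above; treating them as an ordered
# sequence (raw -> unaligned -> aligned -> trees) is an assumption.
STAGES = ["raw", "unaligned", "aligned", "trees"]


def missing_items(main_dir, data, gf_list, taxon_list):
    """Return the names of required items that are absent from main_dir."""
    main = Path(main_dir)
    required = {
        "Scripts": main / "Scripts",   # folder with the PhyloToL part 2 scripts
        data: main / data,             # folder containing the input
        gf_list: main / gf_list,       # OG list
        taxon_list: main / taxon_list, # taxon list
    }
    return [name for name, path in required.items() if not path.exists()]


def build_command(start, end, gf_list, taxon_list, data, output):
    """Assemble the argument list for the minimal command shown above."""
    if start not in STAGES:
        raise ValueError(f"--start must be one of {STAGES}")
    if end not in STAGES[1:]:
        raise ValueError(f"--end must be one of {STAGES[1:]}")
    # Assumed sanity check: a run cannot end before it starts.
    if STAGES.index(start) > STAGES.index(end):
        raise ValueError("--start must not come after --end")
    return [
        "python", "Scripts/phylotol.py",
        "--start", start, "--end", end,
        "--gf_list", gf_list, "--taxon_list", taxon_list,
        "--data", data, "--output", output,
    ]


if __name__ == "__main__":
    absent = missing_items(".", "Input_folder", "listofOGs.txt", "taxon_list.txt")
    if absent:
        print("Missing from main directory:", ", ".join(absent))
    cmd = build_command("raw", "trees", "listofOGs.txt", "taxon_list.txt",
                        "Input_folder", "Output_folder")
    print(" ".join(cmd))
```

The sketch only prints the command rather than running it, so the output redirection (`> Output1.out`) is left to the shell.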
## Overlap and similarity filters
## Guidance