Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

MCLeleu 2024-08-13 08:11:43 -04:00
parent 7dce3919df
commit 0ca462c1cb

@ -15,25 +15,40 @@ We provide a diverse database of 1,000 genomes and transcriptomes from across th
Running PhyloToL Part 2 requires at least four items in your main directory: 1) a folder named Scripts containing all scripts from PhyloToL part 2, 2) a folder containing your input, 3) a taxon list, and 4) an OG list. As noted earlier, PhyloToL part 2 is highly modular and flexible, so before you start, decide at which point in the PTLp2 process you wish to start and end. By default, the script starts with raw data and produces trees. This choice also influences your input files. If you want to produce trees, keep the default '--end' parameter set to 'trees'.
If you want to stop at the pre-guidance files, change the '--end' parameter to 'unaligned'.
If you want to stop at the guidance files, change the '--end' parameter to 'aligned'.
If you want to start at a point other than raw data, change the default '--start' parameter to 'unaligned', 'aligned', or 'trees'. This is the line you could run, with the minimum requirements:
> python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder > Output1.out
__provides the table with list of options flags parameters here__
Optional arguments can then be added to the command line; they are described below.
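Before launching, it can help to confirm that the four required items are in place. The following is a minimal sketch, not part of PhyloToL: the function name `missing_inputs` is hypothetical, and the file and folder names are simply those used in the example command above.

```python
from pathlib import Path


def missing_inputs(main_dir, data_dir, taxon_list, gf_list):
    """Return labels of the required items that are absent from main_dir.

    Hypothetical helper for illustration only; it is not a PhyloToL script.
    """
    root = Path(main_dir)
    required = {
        "Scripts folder": root / "Scripts",
        "input folder": root / data_dir,
        "taxon list": root / taxon_list,
        "OG list": root / gf_list,
    }
    # Keep the label of every path that does not exist on disk.
    return [label for label, path in required.items() if not path.exists()]


if __name__ == "__main__":
    # Names taken from the example command line; adjust to your own setup.
    problems = missing_inputs(".", "Input_folder", "taxon_list.txt", "listofOGs.txt")
    if problems:
        print("Missing before launch:", ", ".join(problems))
    else:
        print("All four required items are present.")
```

Run from the main directory, this only reports what is missing; it does not check the contents of the lists or of the input folder.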
## Filtering on GC composition
GC-content filtering is done during pre-guidance and keeps only sequences whose GC content falls within a user-defined range.
Each sequence is renamed using a utility script (GC_identifier.py), run before PhyloToL part 2, which relabels sequences with OGG, OG6, or OGA depending on whether their GC content falls below or above the user-specified GC range. As input for PhyloToL part 2, the user then provides the labeled ReadyToGo files instead of the usual ones.
The parameter for this when running pre-guidance is --og_identifier, and the options are 'OG', 'OG6', 'OGA', and 'OGG'. The default, 'OG', passes all sequences to guidance without filtering.
Adding these options to the command line will give:
> python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG > Output1.out
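To make the idea of a GC range concrete, here is a minimal sketch of how a sequence could be classified against a user-defined range. It is not the code of GC_identifier.py: the function names are hypothetical, the sketch uses GC content over the whole sequence (the utility script may measure it differently), and it deliberately returns neutral labels because the mapping to OGG, OG6, and OGA is not spelled out here.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases get 0.0 rather than raising an error.
    return gc / total if total else 0.0


def gc_class(seq, low, high):
    """Classify a sequence as 'below', 'within', or 'above' the [low, high] range.

    The labels are placeholders, not the OGG/OG6/OGA identifiers.
    """
    frac = gc_fraction(seq)
    if frac < low:
        return "below"
    if frac > high:
        return "above"
    return "within"
```

For example, `gc_class(seq, 0.30, 0.60)` would place a sequence relative to a 30-60% GC range; the bounds here are arbitrary illustration values.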
## Overlap and similarity filters
Another way to filter sequences from the ReadyToGo files at the pre-guidance step (thereby reducing both the computing resources needed and sequence redundancy) is the similarity filter. Here, the user can remove sequences that are too similar by adding the '--similarity_filter' parameter to the command line, either across the whole dataset or only for a specific set of taxa. All parameters involved in reducing sequences by similarity are summarized in this table:
__table here__
Adding these options to the command line will give:
> python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt > Output1.out
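The logic of a similarity cutoff restricted to a set of taxa can be sketched as follows. This is an illustration under stated assumptions, not PhyloToL's implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure PhyloToL actually uses, sequences are compared only within the same taxon (an assumption), and the record layout and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not PhyloToL's own measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_similar(records, cutoff=0.99, taxa=None):
    """Greedily drop records too similar to an already-kept record.

    records: list of (taxon, name, sequence) tuples (hypothetical layout).
    taxa: optional set of taxa to filter; records from other taxa are always kept.
    Assumption: sequences are only compared within the same taxon.
    """
    kept = []
    for taxon, name, seq in records:
        subject_to_filter = taxa is None or taxon in taxa
        if subject_to_filter and any(
            kept_taxon == taxon and similarity(seq, kept_seq) >= cutoff
            for kept_taxon, _, kept_seq in kept
        ):
            continue  # too similar to a sequence we already kept
        kept.append((taxon, name, seq))
    return kept
```

Passing `taxa=None` mirrors filtering the whole dataset, while passing a set of taxon names mirrors restricting the filter to the taxa listed in a file such as sim_taxa_list.txt.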
## Blacklist
## Guidance
## Gene trees
## Summary for basic launching
## Contamination loop
The contamination loop (CL) is implemented within PhyloToL to remove contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules' used to identify the sequences to remove.
@ -76,7 +91,7 @@ To run the CL, use a similar command structure as described for running PhyloToL
## Orthologs selection and concatenation with PhyloToL