mirror of
http://43.156.76.180:8026/YuuMJ/EukPhylo.git
synced 2025-12-28 03:50:25 +08:00
Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)
parent
fca697bfc2
commit
0f112bb71e
@@ -1,18 +1,18 @@
|
||||
# Overview and Modularity
|
||||
PhyloToL Part 2 is designed to:
|
||||
EukPhylo Part 2 is designed to:
|
||||
1. Generate multisequence alignments (**MSAs**) by first filtering in a "Pre-Guidance" step and then running [Guidance](https://taux.evolseq.net/guidance/source) version 2.02, iterating to remove sequences considered non-homologs (those scoring below the default sequence score cutoff of 0.3)
|
||||
2. Generate gene trees using a third-party program (RAxML, IQ-TREE, or FastTree)
|
||||
3. Remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options
|
||||
4. Select orthologs and construct concatenated alignments for building species trees.
|
||||
|
||||
PhyloToL part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match PhyloToL 6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per each GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided bellow.
|
||||
EukPhylo part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of per-taxon fasta files whose sequence names match EukPhylo 6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused, and resumed at multiple points. Input and output options for this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided below.
|
||||
|
||||
Example output files for PhyloToL part 2 and the contamination loop can be found [here](https://doi.org/10.6084/m9.figshare.26662018.v1).
|
||||
Example output files for EukPhylo part 2 and the contamination loop can be found [here](https://doi.org/10.6084/m9.figshare.26662018.v1).
|
||||
|
||||
# Set Up
|
||||
|
||||
## Dependencies
|
||||
The following are required to run PhyloToL part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
|
||||
The following are required to run EukPhylo part 2. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
|
||||
* Python 3, and the following libraries:
|
||||
* [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75)
|
||||
* [ETE3](http://etetoolkit.org/) (version 3.1.2)
|
||||
@@ -24,8 +24,8 @@ The following are required to run PhyloToL part 1. The dependencies are confirme
|
||||
|
||||
## Input data and folder structure
|
||||
|
||||
Running PhyloToL Part 2 requires at least four items in your main directory:
|
||||
* A folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) and containing all scripts from PhyloToL part 2
|
||||
Running EukPhylo Part 2 requires at least four items in your main directory:
|
||||
* A folder named [Scripts](https://github.com/Katzlab/EukPhylo/tree/main/PTL2/Scripts) containing all scripts from EukPhylo part 2
|
||||
* A folder containing your input files
|
||||
* A taxon list
|
||||
* An OG list
|
||||
@@ -33,60 +33,60 @@ Running PhyloToL Part 2 requires at least four items in your main directory:
|
||||
See below for details.
|
||||
|
||||
# Databases
|
||||
For those users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [dataset S1 and S6](https://figshare.com/account/projects/196552/articles/26540599?file=48494653)). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 ([Dataset S16](https://figshare.com/account/projects/196552/articles/25336129?file=48356116)). This means that using this dataset, you can jump right in to running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above).
|
||||
For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [dataset S1 and S6](https://figshare.com/account/projects/196552/articles/26540599?file=48494653)). This database is in the form of "ReadyToGo" files, the output of EukPhylo part 1 ([Dataset S16](https://figshare.com/account/projects/196552/articles/25336129?file=48356116)). With this dataset, you can jump right into running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above).
|
||||
|
||||
# Running PhyloToL Part 2
|
||||
# Running EukPhylo Part 2
|
||||
|
||||
You can find exemplar Run for PhyloToL in [figshare](https://figshare.com/account/projects/196552/articles/26662018?file=48495760)
|
||||
You can find an example run of EukPhylo on [figshare](https://figshare.com/account/projects/196552/articles/26662018?file=48495760).
|
||||
|
||||
Once you have set up your run folder as described in the Setup section above, you're ready to run PhyloToL part 2. The pipeline is highly modular, and contains five main sections:
|
||||
Once you have set up your run folder as described in the Setup section above, you're ready to run EukPhylo part 2. The pipeline is highly modular, and contains five main sections:
|
||||
1. pre-Guidance (groups your sequences by GF instead of by sample and applies some basic filters; produces an unaligned amino acid file for each GF)
|
||||
2. Guidance (aligns your amino acid sequences and iteratively removes putative non-homologs; produces an aligned amino acid file for each GF)
|
||||
3. Tree building (produces a tree file [newick string] for each GF)
|
||||
4. Contamination loop (see below)
|
||||
5. Concatenation (see below)
|
||||
You can start and stop running PhyloToL part 2 at any of these sections, and the input to each of these sections is going to be different. This will be explained below. You're going to run part 2 using a Python command; if you're starting from scratch (ReadyToGo files as output by part 1) and would like to run all the way through tree-building (the most basic way of running the pipeline), you'll use the following command
|
||||
You can start and stop running EukPhylo part 2 at any of these sections, and each section expects different input, as explained below. You run part 2 with a Python command; if you're starting from scratch (ReadyToGo files as output by part 1) and would like to run all the way through tree-building (the most basic way of running the pipeline), use the following command:
|
||||
|
||||
`python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder`
|
||||
`python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder`
|
||||
|
||||
The `--start` and `--end` parameters tell PhyloToL what to expect in terms of input, and when to stop running the pipeline.
|
||||
The `--start` and `--end` parameters tell EukPhylo what to expect in terms of input, and when to stop running the pipeline.
|
||||
* If you want to produce trees, keep the default `--end` parameter set to 'trees'
|
||||
* If you want to run through the pre-Guidance step, set `--end` to 'unaligned'
|
||||
* If you want to run through the Guidance step, set `--end` to 'aligned'
|
||||
* If you want to start at a point other than raw data, change the default `--start` parameter to 'unaligned' (input a fasta file of unaligned amino acid sequences for each GF), 'aligned' (input a fasta file of aligned amino acid sequences for each GF), or 'trees' (input a newick string file for each GF).
|
||||
|
||||
The `--data` parameter is where you point PhyloToL to your input file. If starting from ReadyToGo files (`--start raw`), this should be the path to a folder containing amino acid ReadyToGo files as output by part 1. If starting with Guidance, this should be a path to a folder of unaligned amino acid files (`--start unaligned`), etc.
|
||||
The `--data` parameter is where you point EukPhylo to your input data. If starting from ReadyToGo files (`--start raw`), this should be the path to a folder containing amino acid ReadyToGo files as output by part 1. If starting with Guidance (`--start unaligned`), this should be the path to a folder of unaligned amino acid files, and so on.
|
||||
|
||||
You will also need to give PhyloToL part 2 a list of all of the sample identifiers (taxon_list.txt) and gene family identifiers (listofOGs.txt) you want to include in your analysis; these text files should have no header and should just contain the list of identifiers, with one identifier per row.
|
||||
You will also need to give EukPhylo part 2 a list of all of the sample identifiers (taxon_list.txt) and gene family identifiers (listofOGs.txt) you want to include in your analysis; these text files should have no header and should just contain the list of identifiers, with one identifier per row.
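The list-file format described above (no header, one identifier per row) can be checked before a run with a few lines of Python. This is only a sketch and not part of the pipeline: the function names and the example identifiers are hypothetical, and the length check assumes that every sample identifier is a ten-character code.

```python
def parse_identifier_list(text):
    """Return the identifiers in a list file: one per row, no header."""
    # Blank rows and surrounding whitespace are ignored in this sketch.
    return [line.strip() for line in text.splitlines() if line.strip()]


def malformed_taxon_codes(codes, expected_length=10):
    """Return the codes whose length differs from the ten-character sample code."""
    return [code for code in codes if len(code) != expected_length]


# Hypothetical file contents, for illustration only.
taxon_text = "Sr_rh_Aaaa\nOp_me_Bbbb\n\nOp_me_Cc\n"
taxa = parse_identifier_list(taxon_text)
bad = malformed_taxon_codes(taxa)
```

Here `bad` collects any identifier that is not ten characters long, which is usually a sign of a typo or a stray header row in the list file.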
|
||||
|
||||
Below is a list of basic PhyloToL part 2 parameters:
|
||||
Below is a list of basic EukPhylo part 2 parameters:
|
||||
|
||||
Argument | Default value | Options | Description
|
||||
-- | -- | -- | --
|
||||
--start | raw | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL.
|
||||
--end | trees | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
|
||||
--start | raw | raw, unaligned, aligned, trees | Stage at which to start running EukPhylo.
|
||||
--end | trees | unaligned, aligned, trees | Stage until which to run EukPhylo. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
|
||||
--gf_list | No default | Any valid path | Path to the file with the GFs of interest. Only required if starting from the raw dataset.
|
||||
--taxon_list | No default | Any valid path | Path to the file with the taxa (10-digit codes) to include in the output.
|
||||
--data | No default | Any valid path | Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
|
||||
--output | Current directory | Any valid path | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.
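For scripted or batch runs, the basic parameters in the table above can be assembled into a command programmatically. The sketch below uses only the flags and option values listed in the table; the helper name `build_command` and its validation logic are assumptions made for illustration, not part of the pipeline.

```python
VALID_START = ("raw", "unaligned", "aligned", "trees")
VALID_END = ("unaligned", "aligned", "trees")


def build_command(data, taxon_list, gf_list=None, start="raw", end="trees",
                  output=None, script="Scripts/eukphylo.py"):
    """Build the argument list for a part 2 run (hypothetical helper)."""
    if start not in VALID_START:
        raise ValueError(f"--start must be one of {VALID_START}")
    if end not in VALID_END:
        raise ValueError(f"--end must be one of {VALID_END}")
    # The table states that --gf_list is only required when starting from raw data.
    if start == "raw" and gf_list is None:
        raise ValueError("--gf_list is required when --start is 'raw'")

    cmd = ["python", script, "--start", start, "--end", end]
    if gf_list is not None:
        cmd += ["--gf_list", gf_list]
    cmd += ["--taxon_list", taxon_list, "--data", data]
    if output is not None:
        cmd += ["--output", output]
    return cmd


base = build_command("Input_folder", "taxon_list.txt",
                     gf_list="listofOGs.txt", output="Output_folder")
```

The resulting list can be passed to `subprocess.run(base, check=True)` from the main run directory, which avoids shell quoting problems when paths contain spaces.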
|
||||
|
||||
Optional arguments can then be added to the base command, and will be described below. In the following is described each stage of PhyloToL, and some key parameters to know for each step.
|
||||
Optional arguments can then be added to the base command, and are described below. The following sections describe each stage of EukPhylo and the key parameters for each step.
|
||||
|
||||
## Pre-Guidance: Reorganizing files by gene family, and optional filters
|
||||
|
||||
The pre-guidance step of PhyloToL part 2 takes ReadyToGo files (one file per sample as output by PhyloToL part 1) and reorganized the amino acid sequences into one file per gene family (one file for each one of the GF you chose in the listofOGs.txt, see above). It also applies some optional filters, described below.
|
||||
The pre-guidance step of EukPhylo part 2 takes ReadyToGo files (one file per sample as output by EukPhylo part 1) and reorganizes the amino acid sequences into one file per gene family (one file for each GF listed in listofOGs.txt, see above). It also applies some optional filters, described below.
|
||||
|
||||
### Filtering by GC content
|
||||
|
||||
Using a [utility script](https://github.com/Katzlab/PhyloToL-6/tree/main/Utilities) (GC_identifier.py), prior to running PhyloToL part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on if the sequence GC content falls below (OGA) or above (OGG) or within (OG6) the user specified GC range. If the user wishes to only include one category of sequences when running PhyloToL part 2, the user should give the relabeled ReadyToGo files. The parameters for this when running pre-guidance is `--og_identifier` and the options are 'OG','OG6','OGA','OGG' with the default being ‘OG’, which passing all the sequences to guidance without filtering.
|
||||
Using a [utility script](https://github.com/Katzlab/EukPhylo/tree/main/Utilities) (GC_identifier.py), prior to running EukPhylo part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on whether the sequence GC content falls below (OGA), above (OGG), or within (OG6) the user-specified GC range. If the user wishes to include only one category of sequences when running EukPhylo part 2, the user should provide the relabeled ReadyToGo files. The parameter for this when running pre-guidance is `--og_identifier`, and the options are 'OG', 'OG6', 'OGA', and 'OGG', with the default being 'OG', which passes all the sequences to Guidance without filtering.
|
||||
|
||||
Adding these options to the command line will give:
|
||||
|
||||
`python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG`
|
||||
`python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG`
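The relabeling logic described in this section can be illustrated with a small sketch. It is not the code of GC_identifier.py: the function names and the example sequence name are hypothetical, and it assumes that GC content is computed over the whole nucleotide sequence and that bounds are given as fractions between 0 and 1.

```python
def gc_fraction(nucleotides):
    """Fraction of G and C among the unambiguous bases of a nucleotide sequence."""
    counted = [base for base in nucleotides.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T")
    return sum(base in "GC" for base in counted) / len(counted)


def relabel_by_gc(seq_name, nucleotides, lower, upper):
    """Swap the 'OG6' prefix of the GF identifier for OGA, OGG, or OG6."""
    gc = gc_fraction(nucleotides)
    if gc < lower:
        tag = "OGA"   # below the user-defined range
    elif gc > upper:
        tag = "OGG"   # above the user-defined range
    else:
        tag = "OG6"   # within the range
    # The GF identifier sits at the end of the name, prefixed by "OG6_".
    prefix, sep, gf_number = seq_name.rpartition("OG6_")
    if not sep:
        raise ValueError("sequence name has no 'OG6_' gene family identifier")
    return f"{prefix}{tag}_{gf_number}"


# Hypothetical sequence name and sequences, for illustration only.
low_gc = relabel_by_gc("Sr_rh_Aaaa_Contig1_OG6_100206", "ATATATGC", 0.5, 1.0)
in_range = relabel_by_gc("Sr_rh_Aaaa_Contig2_OG6_100206", "GGCCGGAT", 0.5, 1.0)
```

With a lower bound of 0.5, the first sequence (25% GC) is relabeled `OGA` and the second (75% GC) keeps `OG6`, so running part 2 with `--og_identifier OG6` would retain only the second.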
|
||||
|
||||
A note on filtering by GC: we set our bounds by visually inspecting plots that compare GC content at silent sites (GC3 Degen) to effective number of codons (ENc) as calculated by Wright 1991. These graphs can be made using the utility script [CUB.py](https://github.com/Katzlab/PhyloToL-6/blob/main/Utilities/for_fastas/CUB.py). For example for the following graph, we might choose to exclude all points with a GC content less than 50% (so, set the lower bound to be 50% [OGA] and the upper bound to be 100% [OGG], then in part 2 use `--og_identifier OG6`).
|
||||
A note on filtering by GC: we set our bounds by visually inspecting plots that compare GC content at silent sites (GC3 Degen) to effective number of codons (ENc) as calculated by Wright 1991. These graphs can be made using the utility script [CUB.py](https://github.com/Katzlab/EukPhylo/blob/main/Utilities/for_fastas/CUB.py). For example, for the following graph, we might choose to exclude all points with a GC content less than 50% (so, set the lower bound to be 50% [OGA] and the upper bound to be 100% [OGG], then in part 2 use `--og_identifier OG6`).
|
||||
|
||||
<img src='https://github.com/Katzlab/PhyloToL-6/blob/main/Other/JF_example.png' width='30%'>
|
||||
<img src='https://github.com/Katzlab/EukPhylo/blob/main/Other/JF_example.png' width='30%'>
|
||||
|
||||
### Similarity filter
|
||||
|
||||
@@ -101,14 +101,14 @@ Argument | Default | Choices | Help
|
||||
|
||||
Adding these options to the command line will give:
|
||||
|
||||
`python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt`
|
||||
`python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt`
|
||||
|
||||
### Other optional parameters:
|
||||
#### Blacklist
|
||||
|
||||
The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance to avoid reconsidering these non-homologs in future runs as this can save computer time, if you're going to be using the same ReadyToGo files and GFs in multiple runs of PhyloToL part 2. For our study, we chose to include only sequences removed by Guidance in our [blacklist](https://figshare.com/account/projects/196552/articles/26539618?file=48354880), but users should choose what fits best for their study and their data. To include this parameter in your PhyloToL run, you will need to add the `--blacklist` flag to the command line as follows:
|
||||
The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance so that these non-homologs are not reconsidered in future runs; this can save compute time if you're going to use the same ReadyToGo files and GFs in multiple runs of EukPhylo part 2. For our study, we chose to include only sequences removed by Guidance in our [blacklist](https://figshare.com/account/projects/196552/articles/26539618?file=48354880), but users should choose what fits best for their study and their data. To include this parameter in your EukPhylo run, add the `--blacklist` flag to the command line as follows:
|
||||
|
||||
`python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt`
|
||||
`python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt`
|
||||
|
||||
Argument | Default | Choices | Help
|
||||
-- | -- | -- | --
|
||||
@@ -116,7 +116,7 @@ Argument | Default | Choices | Help
|
||||
|
||||
## Guidance
|
||||
|
||||
Within PhyloToL part 2, we use Guidance to assess homology within gene families. PhyloToL part 2 runs Guidance in an iterative fashion to remove non-homologous sequences, defined as those that fall below the sequence score cutoff (we note that there is some stochasticity here). Users should consult the [Guidance 2.02 documentation](https://taux.evolseq.net/guidance/) for details on the significance of these parameters. After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default set up for Guidance:
|
||||
Within EukPhylo part 2, we use Guidance to assess homology within gene families. EukPhylo part 2 runs Guidance iteratively to remove non-homologous sequences, defined as those that fall below the sequence score cutoff (we note that there is some stochasticity here). Users should consult the [Guidance 2.02 documentation](https://taux.evolseq.net/guidance/) for details on the significance of these parameters. After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default setup for Guidance:
|
||||
|
||||
Argument | Default | Choices | Description
|
||||
-- | -- | -- | --
|
||||
@@ -129,7 +129,7 @@ Argument | Default | Choices | Description
|
||||
|
||||
## Gene trees
|
||||
|
||||
After homology assessment and building MSAs (the Guidance step), PhyloToL trims alignments and build trees. By default, alignments are trimmed at 0.95% with TrimAL, and trees by default are built by IqTREE with an LG+G model; users may choose to use a different third-party tool for phylogenetic reconstruction.
|
||||
After homology assessment and building MSAs (the Guidance step), EukPhylo trims alignments and builds trees. By default, alignments are trimmed at 0.95% with trimAl, and trees are built with IQ-TREE using an LG+G model; users may choose to use a different third-party tool for phylogenetic reconstruction.
|
||||
|
||||
Argument | Default | Choices | Description
|
||||
-- | -- | -- | --
|
||||
@@ -137,19 +137,19 @@ Argument | Default | Choices | Description
|
||||
|
||||
## Contamination loop
|
||||
|
||||
The contamination coop (CL) is implemented within PhyloToL to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user defined file of 'rules,' used to identify the sequences to remove. We first provide an overview of the three modes and then give details on running below.
|
||||
The contamination loop (CL) is implemented within EukPhylo to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user-defined file of 'rules' used to identify the sequences to remove. We first provide an overview of the three modes and then give details on running below.
|
||||
|
||||
**Sisters-based contamination removal** identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. **Subsisters-based removal** operates similarly, but looks at the taxa that are sister to sample A's _parent_ node, useful for when multiple samples are contaminated by the same other sample.
|
||||
|
||||
**Clade-based contamination removal** operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that includes 20 opisthokonts; all other opisthokont sequences in the tree are removed.
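To make the sister-based rule concrete, here is a minimal sketch of how a sister check could work on a toy tree. It is not the pipeline's implementation: the tree is modeled as nested tuples rather than a newick file, the function names, rule format, and sequence names are hypothetical, and a sequence is flagged only when every sequence on its sister branch comes from the named contaminant.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def flag_sister_contaminants(tree, rules):
    """Flag sequences whose sisters all come from a known contaminant.

    rules is an iterable of (contaminated_prefix, contaminant_prefix) pairs.
    """
    flagged = set()

    def visit(node):
        if isinstance(node, str):
            return
        for i, child in enumerate(node):
            if isinstance(child, str):
                # Gather the leaves of every sibling branch of this tip.
                sisters = [name for j, sib in enumerate(node) if j != i
                           for name in leaves(sib)]
                for target, contaminant in rules:
                    if (child.startswith(target) and sisters
                            and all(s.startswith(contaminant) for s in sisters)):
                        flagged.add(child)
            visit(child)

    visit(tree)
    return flagged


# Toy tree and rule with hypothetical sample codes, for illustration only.
toy_tree = ((("Sr_rh_Aaaa_seq1", "Ba_xx_Bbbb_seq1"), "Ba_xx_Bbbb_seq2"),
            ("Sr_rh_Aaaa_seq2", "Sr_rh_Cccc_seq1"))
toy_rules = [("Sr_rh_Aaaa", "Ba_xx")]
removed = flag_sister_contaminants(toy_tree, toy_rules)
```

In this toy case only `Sr_rh_Aaaa_seq1` is flagged, because it sits sister to a sequence from the contaminating sample, while `Sr_rh_Aaaa_seq2` sits sister to another rhizarian and is kept.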
|
||||
|
||||
The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees using the `--nloops` parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), PhyloToL will for each iteration: identify a list of sequences as contaminants (writing them out to a file called `SequencesRemoved_ContaminationLoop.txt`), generate a fasta file for each gene family excluding contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.
|
||||
The CL runs iteratively, and users set the number of times that rules should be applied to reconstructed trees using the `--nloops` parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), EukPhylo will for each iteration: identify a list of sequences as contaminants (writing them out to a file called `SequencesRemoved_ContaminationLoop.txt`), generate a fasta file for each gene family excluding contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine the optimal number for their study.
|
||||
|
||||
### Contamination loop setup
|
||||
|
||||
You can find an example run of the contamination loop on [figshare](https://figshare.com/account/projects/196552/articles/26662018?file=48495760).
|
||||
|
||||
The CL requires 1) a folder of alignments (it is most correct to not give gap trimmed alignments here) and 2) a folder of gene trees, and they should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files should begin with the same unique GF identifer. You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.
|
||||
The CL requires 1) a folder of alignments (preferably not gap-trimmed) and 2) a folder of gene trees, and they should be formatted in the same way as output by the preceding steps of EukPhylo part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files should begin with the same unique GF identifier. You can also give it data _not_ output by EukPhylo, but you will need to match the folder, file, and sequence name formats.
|
||||
|
||||
You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in [Datasets S8-S12 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
|
||||
|
||||
@@ -174,7 +174,7 @@ Having well-organized ten-digit-codes for sample identification is vital for run
|
||||
|
||||
### Running
|
||||
|
||||
To run the CL, use a similar command structure as described for running PhyloToL part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. Available parameters are:
|
||||
To run the CL, use a similar command structure as described for running EukPhylo part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. Available parameters are:
|
||||
|
||||
| Parameter | Required | Options | Description | Default |
|
||||
| ------------- | ------------- | ------------- | ------------- | ------------- |
|
||||
@@ -187,13 +187,13 @@ To run the CL, use a similar command structure as described for running PhyloToL
|
||||
|
||||
## Ortholog selection and concatenation
|
||||
|
||||
PhyloToL includes an optional step, which can be run after the tree-building stage (or by using `--start trees` and passing to the `--data` argument a folder of trees and corresponding aligned or unaligned sequences files), to select orthologs (one sequence at most per taxon from each GF) and build a concatenated alignment. PhyloToL first identifies for each taxon the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for metazoa); alternatively, a user can select orthologs for only a target group of taxon using the `--concat_target_taxa` argument by inputting a file with a list of ten digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that's the sequence chosen for concatenation; otherwise, then a score is given to each sequence equal to its length times is k-mer coverage for transcriptomic data, and just the sequence length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, then the space is filled with gaps in the concatenated alignments. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" in which can be found all the selected orthologs for each GF, aligned and unaligned.
|
||||
EukPhylo includes an optional step, which can be run after the tree-building stage (or by using `--start trees` and passing to the `--data` argument a folder of trees and corresponding aligned or unaligned sequence files), to select orthologs (at most one sequence per taxon from each GF) and build a concatenated alignment. EukPhylo first identifies for each taxon the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for metazoa); alternatively, a user can select orthologs for only a target group of taxa using the `--concat_target_taxa` argument by inputting a file with a list of ten-digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that sequence is chosen for concatenation; otherwise, each sequence is given a score equal to its length times its k-mer coverage for transcriptomic data, or just its length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, then the space is filled with gaps in the concatenated alignment. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" containing all the selected orthologs for each GF, aligned and unaligned.
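The scoring and gap-filling rules described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names and data structures are hypothetical, k-mer coverage is passed in as a number (or `None` for genomic data), and the candidates are assumed to have already been restricted to the largest clade.

```python
def ortholog_score(length, coverage=None):
    """Length times k-mer coverage for transcriptomic data, length alone for genomic."""
    return length if coverage is None else length * coverage


def pick_ortholog(candidates):
    """Pick the highest-scoring sequence among (name, length, coverage) tuples."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: ortholog_score(c[1], c[2]))
    return best[0]


def concatenate(alignments, taxa):
    """Join per-GF alignments per taxon, filling missing GFs with gaps.

    alignments maps a GF name to a dict of {taxon: aligned sequence}.
    """
    rows = {taxon: [] for taxon in taxa}
    for gf in sorted(alignments):
        per_taxon = alignments[gf]
        if not per_taxon:
            continue  # nothing aligned for this GF
        width = len(next(iter(per_taxon.values())))
        for taxon in taxa:
            rows[taxon].append(per_taxon.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Hypothetical inputs, for illustration only.
chosen = pick_ortholog([("seqA", 300, 2.0), ("seqB", 400, 1.0)])
matrix = concatenate(
    {"OG6_100001": {"TaxonA": "MK-L", "TaxonB": "MKAL"},
     "OG6_100002": {"TaxonA": "GG"}},
    ["TaxonA", "TaxonB"],
)
```

Here `seqA` wins because 300 × 2.0 exceeds 400 × 1.0, and `TaxonB` receives two gap characters for the gene family it lacks, so all rows of the concatenated alignment have the same length.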
|
||||
|
||||
To run this step, add the `--concatenate` flag to your PhyloToL command. If you don't include this flag, concatenation will by default not be run. If you want to run concatenation alone (i.e., you already have trees and alignments) then you'll have set the start parameter to `--start trees` and set up your input data in the style of PhyloToL's "Output" folder. Namely, create a folder called "Output" in the same directory that contains the "Scripts" folder. Inside the "Output" folder, create a folder called "Trees" and a folder called "Guidance." Put your input trees in the Trees folder and the input aligned or unaligned sequences files in the Guidance folder. Each file in the Trees folder should have a corresponding file in the Guidance folder; the names of these files should match up until the last period (i.e., until the file extension). For example, for gene family OG6_100206 you might have a tree file called OG6_100206.PostCL.tre and an alignment file called OG6_100206.PostCL.fasta. Here, the "OG6_100206.PostCL" must match. Below is an example command:
|
||||
To run this step, add the `--concatenate` flag to your EukPhylo command. If you don't include this flag, concatenation is not run. If you want to run concatenation alone (i.e., you already have trees and alignments), you'll have to set the start parameter to `--start trees` and set up your input data in the style of EukPhylo's "Output" folder. Namely, create a folder called "Output" in the same directory that contains the "Scripts" folder. Inside the "Output" folder, create a folder called "Trees" and a folder called "Guidance." Put your input trees in the Trees folder and the input aligned or unaligned sequence files in the Guidance folder. Each file in the Trees folder should have a corresponding file in the Guidance folder; the names of these files should match up until the last period (i.e., until the file extension). For example, for gene family OG6_100206 you might have a tree file called OG6_100206.PostCL.tre and an alignment file called OG6_100206.PostCL.fasta. Here, the "OG6_100206.PostCL" must match. Below is an example command:
|
||||
|
||||
`python phylotol.py --start trees --concatenate --concat_target_taxa Sr_rh --data Output`
|
||||
`python eukphylo.py --start trees --concatenate --concat_target_taxa Sr_rh --data Output`
|
||||
|
||||
In this case, the user is starting with an already built set of trees and alignments (in their Output folder) and would only like to concatenate sequences from Rhizaria (Sr_rh). If concatenating as part of an end-to-end PhyloToL run, just change your `--start` parameter accordingly and include other parameters as defined above.
|
||||
In this case, the user is starting with an already built set of trees and alignments (in their Output folder) and would only like to concatenate sequences from Rhizaria (Sr_rh). If concatenating as part of an end-to-end EukPhylo run, just change your `--start` parameter accordingly and include other parameters as defined above.
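Before running concatenation alone, it can help to confirm that every tree has a matching sequence file, following the naming rule given above (names must match up to the last period). The sketch below is a hypothetical pre-flight check rather than part of the pipeline; it works on lists of file names so that it can be fed from any directory listing.

```python
from pathlib import PurePath


def unmatched_files(tree_files, sequence_files):
    """Return (tree stems without sequences, sequence stems without trees)."""
    # PurePath.stem drops only the final extension, e.g.
    # "OG6_100206.PostCL.tre" -> "OG6_100206.PostCL".
    tree_stems = {PurePath(name).stem for name in tree_files}
    seq_stems = {PurePath(name).stem for name in sequence_files}
    return sorted(tree_stems - seq_stems), sorted(seq_stems - tree_stems)


# Hypothetical folder contents, for illustration only.
missing_seqs, missing_trees = unmatched_files(
    ["OG6_100206.PostCL.tre", "OG6_100207.PostCL.tre"],
    ["OG6_100206.PostCL.fasta"],
)
```

In practice the two lists could come from `os.listdir("Output/Trees")` and `os.listdir("Output/Guidance")`; any stem reported in either result points to a file without a partner.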
|
||||
|
||||
|
||||
|
||||