mirror of
http://43.156.76.180:8026/YuuMJ/EukPhylo.git
synced 2025-12-28 03:10:26 +08:00
Updated EukPhylo Part 2: MSAs, trees, and contamination loop (markdown)
parent 00494529b1
commit c19c2d2b4e
@ -7,7 +7,7 @@ EukPhylo Part 2 is designed to:
EukPhylo part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match EukPhylo 6 criteria) and generates multiple sequence alignments and trees. The pipeline is modular and can be started, paused, and resumed at multiple points. The input and output options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided below.
-Example output files for EukPhylo part 2 and the contamination loop can be found [here](https://doi.org/10.6084/m9.figshare.26662018.v1).
+Example output files for EukPhylo part 2 and the contamination loop can be found [here](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552).
# Installation
@ -38,11 +38,11 @@ Running EukPhylo Part 2 requires at least four items in your main directory:
See below for details.
# Databases
-For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [Datasets S1 and S6](https://figshare.com/account/projects/196552/articles/26540599?file=48494653)). This database is in the form of "ReadyToGo" files, the output of EukPhylo part 1 ([Dataset S16](https://figshare.com/account/projects/196552/articles/25336129?file=48356116)). With this dataset, you can start running analyses of any subset of these taxa right away, using any of the OGs in the Hook Database (see Part 1 above).
+For users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [Datasets S1 and S6](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552)). This database is in the form of "ReadyToGo" files, the output of EukPhylo part 1 ([Dataset S16](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552)). With this dataset, you can start running analyses of any subset of these taxa right away, using any of the OGs in the Hook Database (see Part 1 above).
# Running EukPhylo Part 2
-You can find an example run for EukPhylo on [figshare](https://figshare.com/account/projects/196552/articles/26662018?file=48495760).
+You can find an example run for EukPhylo on [figshare](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552).
Once you have set up your run folder as described in the Setup section above, you're ready to run EukPhylo part 2. The pipeline is highly modular, and contains five main sections:
1. pre-Guidance (groups your sequences by GF instead of by sample and applies some basic filters; produces an unaligned amino acid file for each GF)
@ -155,11 +155,11 @@ The CL runs iteratively and users must set the number of times that rules should
### Contamination loop setup
-You can find an example run for the Contamination Loop on [figshare](https://figshare.com/account/projects/196552/articles/26662018?file=48495760).
+You can find an example run for the Contamination Loop on [figshare](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552).
The CL requires 1) a folder of alignments (preferably not gap-trimmed) and 2) a folder of gene trees. Both should be formatted in the same way as the output of the preceding steps of EukPhylo part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files for a given GF should begin with the same unique GF identifier. You can also give the CL data _not_ output by EukPhylo, but you will need to match the folder, file, and sequence name formats.
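
Before starting the CL, it can be worth checking that every alignment has a matching tree. The following is a minimal sketch of such a check and is not part of EukPhylo. The regular expression for the GF identifier, the function names, and the example file names are all assumptions made for illustration; adjust the pattern to whatever identifiers your files actually begin with.

```python
import re
from pathlib import Path

# Hypothetical pattern for a GF identifier at the start of a file name
# (e.g. "OG6_100001"); this is an assumption, not a EukPhylo definition.
GF_PATTERN = re.compile(r"^OG\d+_\d+")


def gf_ids(folder, pattern=GF_PATTERN):
    """Map each GF identifier found in `folder` to the file names that begin with it."""
    found = {}
    for path in sorted(Path(folder).iterdir()):
        match = pattern.match(path.name)
        if path.is_file() and match:
            found.setdefault(match.group(0), []).append(path.name)
    return found


def check_pairing(alignment_dir, tree_dir, pattern=GF_PATTERN):
    """Return the GF identifiers that appear in only one of the two folders."""
    alignments = gf_ids(alignment_dir, pattern)
    trees = gf_ids(tree_dir, pattern)
    return {
        "missing_tree": sorted(set(alignments) - set(trees)),
        "missing_alignment": sorted(set(trees) - set(alignments)),
    }
```

Calling `check_pairing("alignments", "trees")` on your two input folders (the folder names here are placeholders) returns two lists; if both are empty, every identifier matched by the pattern is present in both folders. Files whose names do not match the pattern are ignored, so an overly strict pattern will hide problems rather than report them.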
-You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format varies between the different modes of the CL, but every rules file is tab-separated. Examples can be found in [Datasets S8-S12 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
+You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format varies between the different modes of the CL, but every rules file is tab-separated. Examples can be found in [Datasets S8-S12 on this project's Figshare](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552).
##### Sisters/subsisters mode