Updated EukPhylo Part 2: MSAs, trees, and contamination loop (markdown)

Katzlab 2025-02-13 15:32:15 -05:00
parent d5a90de215
commit ae78b04c9c

@@ -5,14 +5,14 @@ EukPhylo Part 2 is designed to:
 3. Remove contaminating sequences with the "Contamination Loop" (**CL**) and user-defined rules for both sister/subsister and clade grabbing options
 4) Select orthologs and construct concatenated alignments for building species trees.
-EukPhylo part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match EukPhylo 6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per each GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided bellow.
+EukPhylo part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match EukPhylo 6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per each GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided below.
 Example output files for EukPhylo part 2 and the contamination loop can be found [here](https://doi.org/10.6084/m9.figshare.26662018.v1).
-# Install (only need to do once)
+# Installation
 ## Dependencies
-The following are required to run EukPhylo part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
+The following are required to run EukPhylo part 1, and only needs to be done once. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
 * Python 3, and the following libraries:
 * [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75)
 * [ETE3](http://etetoolkit.org/) (version 3.1.2.)
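
As a quick sanity check on the Python dependencies listed above, a minimal sketch along these lines could report which versions are installed. It assumes the libraries were installed under the PyPI distribution names `biopython` and `ete3`; the function name `report_versions` and the `TESTED` table are hypothetical and not part of EukPhylo.

```python
from importlib import metadata  # standard library, Python 3.8+

# Versions the wiki lists as confirmed to work; other versions may work too.
# The keys are assumed PyPI distribution names, not EukPhylo identifiers.
TESTED = {"biopython": "1.75", "ete3": "3.1.2"}


def report_versions(dists=TESTED):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in dists:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in report_versions().items():
        tested = TESTED[name]
        if installed is None:
            print(f"{name}: not installed (tested with {tested})")
        elif installed == tested:
            print(f"{name}: {installed} (matches tested version)")
        else:
            print(f"{name}: {installed} (tested with {tested}; may still work)")
```

A version mismatch is reported rather than treated as an error, since the text says other versions may work as well.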