Note: The EukPhylo pipeline is currently being dockerised for easier installation and use. This is still in progress so please do not install. More information about the dockerfile can be found here - Docker branch
Dockerfile
Users will need to have Docker installed and running. The Dockerfile for part 2 can be built and run with:
# Build the container
docker build -f Dockerfile.txt . --tag eukphylo:1
# Current command is:
docker run -it \
--mount type=bind,src=$(pwd)/input_data,dst=/Input_data \
--mount type=bind,src=$(pwd)/output_data,dst=/Output_data \
eukphylo:1
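A bind mount created with `--mount type=bind` fails if its source folder does not exist, so it can help to create `input_data` and `output_data` before starting the container. The following is a minimal sketch of that preparation step; the function name `docker_run_command` is hypothetical and not part of EukPhylo.

```python
from pathlib import Path

def docker_run_command(workdir=".", image="eukphylo:1"):
    """Create the two bind-mount source folders if needed and return the docker argv."""
    root = Path(workdir).resolve()
    cmd = ["docker", "run", "-it"]
    # Source folder names and container paths are taken from the command above.
    for source_name, target in [("input_data", "/Input_data"),
                                ("output_data", "/Output_data")]:
        source = root / source_name
        # --mount type=bind refuses to start if the source path is missing.
        source.mkdir(parents=True, exist_ok=True)
        cmd += ["--mount", f"type=bind,src={source},dst={target}"]
    return cmd + [image]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the image has been built.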
After development, GitHub CI/CD workflows can be added to automatically build and release the Docker image for the end user.
General Steps
The EukPhylo pipeline is composed of two parts that can be run independently. Part 1 assigns gene families and only needs to be run once; Part 2 builds MSAs and trees, and implements contamination removal and concatenation. It is preferable to run Part 2 on the outputs of Part 1, but this is not required as long as the input files are in the same format: one fasta file per species, with sequence IDs starting with a 10-digit taxon identifier and ending in a gene family identifier with the format OGx_xxxxxx. See the extended version of the wiki for details.
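As a rough illustration of this naming convention, the sketch below splits a sequence ID into its taxon code and gene family. It assumes that the taxon identifier is simply the first ten characters of the ID and that each `x` in `OGx_xxxxxx` is a letter or digit. The function name and the example ID are hypothetical, not taken from EukPhylo.

```python
import re

# Assumption: each "x" in OGx_xxxxxx is a single letter or digit.
OG_SUFFIX = re.compile(r"OG[A-Za-z0-9]_[A-Za-z0-9]{6}$")

def split_sequence_id(seq_id):
    """Return (taxon_code, gene_family), or None if the ID does not fit the format."""
    match = OG_SUFFIX.search(seq_id)
    # The gene family suffix must come after a full ten-character taxon code.
    if match is None or match.start() < 10:
        return None
    return seq_id[:10], match.group(0)

if __name__ == "__main__":
    # Hypothetical sequence ID, for illustration only.
    print(split_sequence_id("Sr_ci_Cunc_Contig_1_OG6_100001"))
```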
Contents
- Install EukPhylo
- Run EukPhylo part 1
  - with the Hook database or with a custom database
  - with assembled transcripts or genomic CDS
  - modularity
- Run EukPhylo part 2
  - basic running
  - modularity
  - contamination removal
  - choosing orthologs and concatenation
Installing EukPhylo
Scripts can be used as downloaded from GitHub and should work on any platform. Dependencies and third-party tools, along with the versions that we use at the Katz lab:
- TrimAl (1.2)
- Guidance (2.2)
- Diamond (0.9.30, compiled with GCC 8.3.0)
- MAFFT (7.475)
- IQ-Tree (2.1.12)
- RAxML (8.2.12)
- BLAST+ (2.9.0)
- Vsearch (2.21.1, compiled with GCC 10.3.0)
- Python (3.9.6, libraries can be installed with Pip)
  - ETE3 (pip install ete3)
  - BioPython (1.79-foss-2021b)
  - tqdm
EukPhylo part 1: Assigning Gene families
EukPhylo part 1 runs CDS (genomes) or assembled transcripts (transcriptomes) through several scripts in order (5 for CDS, 7 for assembled transcripts) to remove bacterial contamination and produce ReadyToGo files. These scripts are run through a ‘wrapper’ script.
Transcriptomes:
Set Up:
- A folder called “AssembledTranscripts” with your assembled transcript fasta files
- A folder called “Databases” with the three sub folders:
  - db_BvsE (how we ID likely-bacterial sequences)
  - db_StopFreq (for stop codon assignment)
  - db_OG
    - Hook *.dmnd file (current version: Hook-1.0.dmnd)
    - Hook *.fasta file (current version: Hook-1.0.fasta)
- A folder called “Scripts” with scripts from here on GitHub
- An empty folder for the output named as indicated in the run command for flag -o
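Before launching the wrapper, the folder layout listed above can be checked with a few lines of Python. This is only a sketch: it assumes that all folders sit in one project directory and that the output folder is called `Output_Folder`. The helper name `missing_folders` is hypothetical.

```python
from pathlib import Path

# Folder names taken from the set-up list above.
REQUIRED_DIRS = [
    "AssembledTranscripts",
    "Databases/db_BvsE",
    "Databases/db_StopFreq",
    "Databases/db_OG",
    "Scripts",
]

def missing_folders(project_dir, output_name="Output_Folder"):
    """Return the expected folders that are absent from project_dir."""
    root = Path(project_dir)
    expected = REQUIRED_DIRS + [output_name]
    return [name for name in expected if not (root / name).is_dir()]

if __name__ == "__main__":
    for name in missing_folders("."):
        print(f"Missing folder: {name}")
```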
Running:
python wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts -o Output_Folder --genetic_code Universal -d Databases > log.txt
Code parameters:
| Parameter | Description |
|---|---|
| -1 | Start/first script to run. |
| -2 | End/last script to run. |
| --assembled_transcripts | Path to folder with Assembled transcripts in fasta format. |
| -o | Path to output folder. |
| --genetic_code | Specified genetic code, name of .txt file with Genetic codes; optional. |
| -d | Path to Databases folder. |
| > log.txt | If added to the end of the command, it will output a log file with progress, warning, or error messages. |
| -x | Run cross-plate contamination (XPC). Only available for transcriptomes. |
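When Part 1 is launched from another script, the same command can be assembled as an argument list. The sketch below uses only the flags documented above; it assumes that `-x` is a switch that takes no value, and the function names are hypothetical.

```python
import subprocess

def build_part1_command(first, last, transcripts, output, databases,
                        genetic_code=None, xpc=False):
    """Assemble the wrapper.py argument list from the documented parameters."""
    cmd = ["python", "wrapper.py", "-1", str(first), "-2", str(last),
           "--assembled_transcripts", transcripts, "-o", output, "-d", databases]
    if genetic_code is not None:   # --genetic_code is optional
        cmd += ["--genetic_code", genetic_code]
    if xpc:                        # assumed to be a flag without a value
        cmd.append("-x")
    return cmd

def run_part1(cmd, log_path="log.txt"):
    """Run the command and send its standard output to a log file, like `> log.txt`."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=False).returncode

if __name__ == "__main__":
    command = build_part1_command(1, 7, "AssembledTranscripts", "Output_Folder",
                                  "Databases", genetic_code="Universal")
    print(" ".join(command))
```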
Output:
- ReadyToGo files = AA, NTD
- Summary and statistics of sequences
Modularity, and replacing the Hook database
EukPhylo part 1 for transcriptomes is composed of 7 scripts. Users can choose where to start and stop by changing the -1 and/or -2 options. To use their own gene family database, users need to replace the Hook fasta file in the Databases folder and build a Diamond database from their own file; the only requirement is adjusting the naming system of the sequences.
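Building the Diamond version of a custom fasta file is done with Diamond's `makedb` subcommand. The sketch below only assembles that command; the file names are hypothetical, and placing the result in `Databases/db_OG` is an assumption based on where the Hook files are listed.

```python
def diamond_makedb_command(fasta_path, db_prefix):
    """Argument list for `diamond makedb`, which writes <db_prefix>.dmnd."""
    return ["diamond", "makedb", "--in", fasta_path, "--db", db_prefix]

if __name__ == "__main__":
    # Hypothetical file names for a custom gene family database.
    cmd = diamond_makedb_command("Databases/db_OG/MyFamilies.fasta",
                                 "Databases/db_OG/MyFamilies")
    print(" ".join(cmd))
```

Running the printed command requires Diamond to be installed and on the PATH.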
Genomes:
Set Up:
- A folder called “CDS” with your CDS fasta files
- A folder called “Databases” with the three sub folders:
  - db_BvsE (how we ID likely-bacterial sequences)
  - db_StopFreq (for stop codon assignment)
  - db_OG
    - Hook *.dmnd file (current version: Hook-1.0.dmnd)
    - Hook *.fasta file (current version: Hook-1.0.fasta)
- A folder called “Scripts” with the 10 scripts from here on GitHub. To run locally, pull out all scripts into main folder
Running:
python wrapper.py -1 1 -2 5 --cds CDS -o Output_Folder --genetic_code Gcodes.txt -d Databases > log.txt
Code parameters:
| Parameter | Description |
|---|---|
| -1 | Start/first script to run. |
| -2 | End/last script to run. |
| --cds | Path to folder with CDS files in fasta format. |
| -o | Path to output folder. |
| --genetic_code | Specified genetic code, name of .txt file with Genetic codes; optional. |
| -d | Path to Databases folder. |
| > log.txt | If added to the end of the command, it will output a log file with progress, warning, or error messages. |
Output:
- ReadyToGo files = AA, NTD
- Summary and statistics of sequences
Modularity, and replacing the Hook database
EukPhylo part 1 for genomes is composed of 5 scripts. Users can choose where to start and stop by changing the -1 and/or -2 options. To use their own gene family database, users need to replace the Hook fasta file in the Databases folder and build a Diamond database from their own file; the only requirement is adjusting the naming system of the sequences.
EukPhylo part 2: MSAs, trees, contamination removal, and concatenation
MSAs and Trees
Set up:
In a main project directory:
- Create a Scripts folder containing the 8 scripts from GitHub here
  - In addition to the scripts, also add the trimal-trimAl and guidance.v2.02 folders, as downloaded from here and here. Smith College HPC (Grid) users see here
  - IMPORTANT NOTE: Please make sure to correct the paths in the guidance.py script with the full path of the location of your trimal-trimAl and guidance.v2.02 as exemplified in the script.
- Create an empty output folder (e.g. Output) for results (i.e. guidance and tree outputs)
- Create a list of ten-digit codes for your target and outgroup taxa (e.g. taxa.txt)
- Create a folder (e.g. R2Gs) that contains the AA ReadyToGo fasta files for all taxa (from taxa.txt)
- Create a list of OGs (e.g. OG_list.txt) for tree building
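A starting point for the taxon and OG lists can be derived from the ReadyToGo files themselves. The sketch below scans fasta headers and collects the first ten characters of each ID as the taxon code and the trailing `OGx_xxxxxx` as the gene family. It assumes files ending in `.fasta` and one entry per line in the list files; neither detail is confirmed here, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: each "x" in OGx_xxxxxx is a single letter or digit.
OG_SUFFIX = re.compile(r"OG[A-Za-z0-9]_[A-Za-z0-9]{6}$")

def collect_taxa_and_ogs(r2g_dir, pattern="*.fasta"):
    """Scan fasta headers and return (sorted taxon codes, sorted OG names)."""
    taxa, ogs = set(), set()
    for fasta in sorted(Path(r2g_dir).glob(pattern)):
        with open(fasta) as handle:
            for line in handle:
                fields = line[1:].split() if line.startswith(">") else []
                if not fields:
                    continue                      # sequence line or empty header
                seq_id = fields[0]
                taxa.add(seq_id[:10])             # assumed: code = first ten characters
                match = OG_SUFFIX.search(seq_id)
                if match:
                    ogs.add(match.group(0))
    return sorted(taxa), sorted(ogs)

def write_list(path, items):
    """Write one item per line (assumed list-file layout)."""
    Path(path).write_text("".join(f"{item}\n" for item in items))
```

For example, `taxa, ogs = collect_taxa_and_ogs("R2Gs")` followed by `write_list("taxa.txt", taxa)` and `write_list("OG_list.txt", ogs)` gives complete lists that can then be trimmed to the taxa and OGs of interest.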
Running:
Basic running for building MSAs and Trees:
python3 Scripts/phylotol.py --start raw --end trees --gf_list OG_list.txt --taxon_list taxa.txt --data R2Gs --output Output > Output.out
For additional input parameter options, see table below or run: python phylotol.py --help
| Flag | Options | Description | Default |
|---|---|---|---|
| --start | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL | raw |
| --end | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are unaligned (which will run up to but not including guidance), aligned (which will run up to but not including RAxML), and trees (which will run through RAxML) | trees |
| --gf_list | Valid path | Path to the file with the GFs of interest. Only required if starting from the raw dataset | None |
| --taxon_list | Valid path | Path to the file with the taxa (10-digit codes) to include in the output | None |
| --data | Valid path | Path to the input dataset. The format of this varies depending on your --start parameter. If you are running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names) | None |
| --output | Valid path | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts | ../ |
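The `--start` and `--end` values describe positions along one ordered sequence of stages. The sketch below checks a pair of values before a run; it assumes that the start stage may not come after the end stage, which is an inference from the table rather than documented behaviour. The function name is hypothetical.

```python
# Stage names in pipeline order, as listed for --start.
STAGES = ["raw", "unaligned", "aligned", "trees"]

def check_stages(start="raw", end="trees"):
    """Validate a --start/--end pair and return the stages from start to end."""
    if start not in STAGES:
        raise ValueError(f"unknown --start value: {start}")
    if end not in STAGES[1:]:          # raw is not listed as an --end option
        raise ValueError(f"unknown --end value: {end}")
    first, last = STAGES.index(start), STAGES.index(end)
    if first > last:                   # assumed constraint, not stated in the table
        raise ValueError("--start comes after --end")
    return STAGES[first:last + 1]

if __name__ == "__main__":
    print(check_stages("raw", "trees"))
```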
Modularity
Below are several optional ways to parameterize EukPhylo Part 2
General:
| Parameter | Options | Description | Default |
|---|---|---|---|
| --force | | Overwrites all existing files in the Output folder | NA |
| --tree_method | iqtree, iqtree_fast, raxml, fasttree | Change tree building software | iqtree_fast |
For BLAST and GUIDANCE:
| Parameter | Options | Description | Default |
|---|---|---|---|
| --blast_cutoff | float | Blast e-value cutoff | 1e-20 |
| --len_cutoff | int | Amino acid length cutoff for removal of very short sequences after column removal in Guidance | 10 |
| --guidance_iters | int | Number of Guidance iterations for sequence removal | 5 |
| --seq_cutoff | float | During guidance, taxa are removed if their score is below this cutoff | 0.3 |
| --col_cutoff | float | During guidance, columns are removed if their score is below this cutoff | 0.0 |
| --res_cutoff | float | During guidance, residues are removed if their score is below this cutoff | 0.0 |
| --guidance_threads | int | Number of threads to allocate to Guidance | 20 |
For reducing number of similar sequences:
| Parameter | Required | Description | Default |
|---|---|---|---|
| --similarity_filter | yes | Run the similarity filter in pre-Guidance | NA |
| --sim_cutoff | yes | (float) Sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff | |
| --sim_taxa | no | A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt) | NA |
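To make the effect of the cutoff concrete, the sketch below applies the same idea to a small set of sequences: within each taxon and OG, a sequence is dropped when it is more similar than the cutoff to one already kept. This is not EukPhylo's implementation. The similarity measure is a stand-in (`difflib.SequenceMatcher`), keeping the longest sequence is an assumption, and the IDs are assumed to start with a ten-character taxon code and end with a ten-character OG name.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; not an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def filter_similar(records, sim_cutoff):
    """records maps sequence ID -> sequence. Returns the sorted IDs to keep."""
    groups = defaultdict(list)
    for seq_id in records:
        # Group by (taxon code, OG), using the assumed ID layout.
        groups[(seq_id[:10], seq_id[-10:])].append(seq_id)
    kept = []
    for ids in groups.values():
        group_kept = []
        # Longest first, so the longest member of a near-identical set survives.
        for seq_id in sorted(ids, key=lambda i: (-len(records[i]), i)):
            if all(similarity(records[seq_id], records[k]) <= sim_cutoff
                   for k in group_kept):
                group_kept.append(seq_id)
        kept.extend(group_kept)
    return sorted(kept)
```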
For removing known poor-quality or contaminant sequences (user informed):
| Parameter | Options | Description |
|---|---|---|
| --blacklist | str | A file listing sequence IDs to remove from analysis (e.g. to_remove.txt) |
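The effect of a blacklist can be previewed outside the pipeline with a small filter. This sketch assumes one sequence ID per line in the list file and that an ID matches the first word of a fasta header; both are assumptions, and the function names are hypothetical.

```python
def read_id_list(path):
    """Read one sequence ID per line, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def drop_blacklisted(fasta_in, fasta_out, blacklist):
    """Copy fasta_in to fasta_out without blacklisted records; return how many were removed."""
    removed = 0
    keep = True
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                # A record is dropped when the first word of its header is listed.
                keep = not (fields and fields[0] in blacklist)
                removed += not keep
            if keep:
                dst.write(line)
    return removed
```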
For removing sequences based on GC composition:
Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script here on GitHub
| Parameter | Options | Description | Default |
|---|---|---|---|
| --og_identifier | OG, OG6, OGA, OGG | Select sequences by GC width | OG |
Contamination Removal
Contamination removal within EukPhylo (also called the Contamination Loop or CL) removes sequences based either on Sisters/Subsisters identification or on Clades diversity. An exemplar run is available on Figshare.
Set up:
- An input folder (called for example Input), with both
  - the tree files
  - the fasta files matching the trees
  - both sets of files need to be named exactly the same except for the extension (e.g. File1.fasta, File1.tree)
- An empty output folder (called for example Output)
- A txt file containing the rules for contamination removal
- The Scripts folder
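Because every tree needs a fasta file with the same name, it can be worth checking the input folder before starting the loop. The sketch below reports file names that lack a partner. The `.fasta` and `.tree` extensions come from the example above and may differ in practice, so they are parameters; the function name is hypothetical.

```python
from pathlib import Path

def unpaired_files(input_dir, fasta_ext=".fasta", tree_ext=".tree"):
    """Return (names with a tree but no fasta, names with a fasta but no tree)."""
    names = [p.name for p in Path(input_dir).iterdir() if p.is_file()]
    # Strip the extension to get the part of the name that must be identical.
    fasta_stems = {n[:-len(fasta_ext)] for n in names if n.endswith(fasta_ext)}
    tree_stems = {n[:-len(tree_ext)] for n in names if n.endswith(tree_ext)}
    return sorted(tree_stems - fasta_stems), sorted(fasta_stems - tree_stems)

if __name__ == "__main__":
    if Path("Input").is_dir():
        no_fasta, no_tree = unpaired_files("Input")
        print("Trees without fasta:", no_fasta)
        print("Fasta files without tree:", no_tree)
```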
Running:
Basic running of the Contamination Loop, with the sister mode:
python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop seq --sister_rules sister_rules_file.txt > log.out
Basic running of the Contamination Loop, with the clade mode:
python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop clade --clade_grabbing_rules_file clade_grabbing_rules.txt > log.out
Options:
| Parameter | Required | Options | Description | Default |
|---|---|---|---|---|
| --contamination_loop | yes | seq, clade | The mode in which to run the CL | NA |
| --nloops | no | positive int | Number of iterations | 5 |
| --sister_rules | only in sisters mode | Valid path | Path to a text file containing sisters rules | NA |
| --subsister_rules | only in subsisters mode | Valid path | Path to a text file containing subsisters rules | NA |
| --clade_grabbing_rules | only in clade mode | Valid path | Path to a text file containing clade-grabbing rules | NA |
| --clade_grabbing_exceptions | no | Valid path | List of taxa to not remove for any reason | NA |
| --cl_tree_method | no | iqtree, raxml, fasttree, iqtree_fast | Tree-building method to use in each contamination loop iteration | iqtree_fast |
| --cl_alignment_method | no | mafft_only, guidance | Alignment method to use in each contamination loop iteration | mafft_only |
| --cl_exclude_taxa | no | Valid path | Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop | NA |
Concatenation
EukPhylo includes an option to choose orthologs and produce a concatenated alignment.
Set up:
- A folder called Output containing all outputs from the main pipeline (with Guidance, Trees, Pre-Guidance, NotGapTrimmed folders)
- the Scripts folder
- a list of taxa to concatenate
Running:
Basic running of the concatenate mode:
python eukphylo.py --start trees --concatenate --concat_target_taxa taxa_file.txt --data Output
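For readers unfamiliar with concatenation, the sketch below shows the general idea on in-memory data: per-gene-family alignments are joined end to end for each target taxon, and a taxon missing from a gene family is padded with gaps. It assumes one ortholog has already been chosen per taxon and gene family, and it illustrates the general technique rather than EukPhylo's own code. The function name is hypothetical.

```python
def concatenate(alignments, taxa, gap="-"):
    """Join alignments (a list of dicts, taxon -> aligned sequence) for the given taxa."""
    supermatrix = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an alignment must share one length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene family are filled with gap characters.
            supermatrix[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}

if __name__ == "__main__":
    # Toy alignments with hypothetical taxon names.
    genes = [{"TaxonA": "MK-", "TaxonB": "MKT"}, {"TaxonA": "GG"}]
    print(concatenate(genes, ["TaxonA", "TaxonB"]))
```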