
Note: The EukPhylo pipeline is currently being dockerised for easier installation and use. This work is still in progress, so please do not install the Docker version yet. More information about the Dockerfile can be found here - Docker branch

Dockerfile

Users will need to have Docker installed and running. The Dockerfile for part 2 can be built and run with:

# Build the container
docker build -f Dockerfile.txt . --tag eukphylo:1

# Current command is:
docker run -it \
    --mount type=bind,src=$(pwd)/input_data,dst=/Input_data \
    --mount type=bind,src=$(pwd)/output_data,dst=/Output_data \
    eukphylo:1

After development, GitHub CI/CD workflows can be added to automatically build and release the Docker image for the end user.

General Steps

The EukPhylo pipeline is composed of two parts that can be run individually. Part 1 only needs to be run once, to assign gene families. Part 2 builds MSAs and trees, and implements contamination removal and concatenation. It is preferable to run Part 2 using the outputs of Part 1 as input, but this is not required as long as the input files are in the same format: one fasta file per species, with sequence IDs starting with a 10-digit taxon identifier and ending in a gene family identifier with the format OGx_xxxxxx. See the extended version of the wiki for details.
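
If Part 2 is run on files that did not come from Part 1, it can help to check the sequence IDs first. The sketch below is not part of EukPhylo. It assumes that the taxon identifier is the first ten non-whitespace characters of the ID and that the gene family suffix is `OG`, one character, an underscore, and six digits. The example IDs are hypothetical.

```python
import re

# Assumed layout: ten-character taxon code, any middle part, then OGx_xxxxxx.
SEQ_ID_PATTERN = re.compile(r"^\S{10}\S*OG\w_\d{6}$")


def check_fasta_ids(lines):
    """Return the sequence IDs that do not match the assumed layout."""
    bad_ids = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        parts = line[1:].split()
        seq_id = parts[0] if parts else ""
        if not SEQ_ID_PATTERN.match(seq_id):
            bad_ids.append(seq_id)
    return bad_ids


# Hypothetical records: the first ID fits the layout, the second does not.
example = [">Sr_ci_Cunc_Contig_1_OG6_100001", "MKVLA", ">contig_2", "MLLQA"]
print(check_fasta_ids(example))
```

To check a real file, pass the open file handle instead of the example list.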

Install EukPhylo
Run EukPhylo part 1
- a. with the Hook database or with a custom database
- b. with assembled transcripts or genomic CDS
- c. modularity
Run EukPhylo part 2
- a. basic running
- b. modularity
- c. contamination removal
- d. choosing orthologs and concatenation

Installing EukPhylo

Scripts can be used as downloaded from GitHub and should work on any platform. Dependencies and third-party tools, along with the versions that we use at the Katz lab:

TrimAl (1.2)
Guidance (2.2)
Diamond (0.9.30, compiled with GCC 8.3.0)
MAFFT (7.475)
IQ-Tree (2.1.12)
RAxML (8.2.12)
BLAST+ (2.9.0)
Vsearch (2.21.1, compiled with GCC 10.3.0)
Python (3.9.6, libraries can be installed with Pip)
ETE3 (pip install ete3)
BioPython (1.79-foss-2021b)
tqdm

EukPhylo part 1: Assigning Gene families

EukPhylo part 1 runs CDS (genomes) or assembled transcripts (transcriptomes) through several scripts in order (5 for CDS, 7 for assembled transcripts) to remove bacterial contamination and produce ReadyToGo files. These scripts are run through a ‘wrapper’ script.

Transcriptomes:

Set Up:

A folder called “AssembledTranscripts” with your assembled transcript fasta files
A folder called “Databases” with the three sub folders:
- db_BvsE (how we ID likely-bacterial sequences)
- db_StopFreq (for stop codon assignment)
- db_OG
  - Hook *.dmnd file ([Current version Hook-1.0.dmnd])
  - Hook *.fasta file ([Current version Hook-1.0.fasta])
A folder called “Scripts” with scripts from here on GitHub
An empty folder for the output named as indicated in the run command for flag -o
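
The layout above can be checked before launching the wrapper. This is a minimal sketch, not a EukPhylo script; the function name `missing_part1_inputs` is made up for illustration, and the check only looks for the folders and file extensions named in the list above.

```python
from pathlib import Path

# Folders named in the set-up list above, relative to the project directory.
REQUIRED_DIRS = [
    "AssembledTranscripts",
    "Databases/db_BvsE",
    "Databases/db_StopFreq",
    "Databases/db_OG",
    "Scripts",
]


def missing_part1_inputs(project_dir):
    """Return the expected folders or database files that are absent."""
    root = Path(project_dir)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    og_dir = root / "Databases" / "db_OG"
    if og_dir.is_dir():
        # db_OG should hold a Diamond (.dmnd) and a fasta version of the database.
        for ext in ("dmnd", "fasta"):
            if not any(og_dir.glob(f"*.{ext}")):
                missing.append(f"Databases/db_OG/*.{ext}")
    return missing
```

Calling `missing_part1_inputs(".")` from the project directory returns an empty list when everything is in place.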

Running:

python wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts -o Output_Folder --genetic_code Universal -d Databases > log.txt

Code parameters:

Parameter	Description
`-1`	Start/first script to run.
`-2`	End/last script to run.
`--assembled_transcripts`	Path to folder with Assembled transcripts in fasta format.
`-o`	Path to output folder.
`--genetic_code`	Specified genetic code, name of .txt file with Genetic codes; optional.
`-d`	Path to Databases folder.
`> log.txt`	If added to the end of the command, it will output a log file with progress, warning, or error messages.
`-x`	Run cross-plate contamination (XPC). Only available for transcriptomes.

Output:

ReadyToGo files = AA, NTD
Summary and statistics of sequences

Modularity, and replacing the Hook database

EukPhylo part 1 for transcriptomes is composed of 7 scripts. Users can choose where to start and stop by changing the -1 and/or -2 options. Users who want to use their own gene family database need to replace the Hook fasta file in the Databases folder and build a Diamond version of their own file; the only other requirement is adjusting the naming system of the sequences.
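
Building the Diamond version of a custom database can be done with `diamond makedb`. The sketch below wraps that call in Python; the file names in the usage note are hypothetical, and it assumes `diamond` is on the PATH.

```python
import subprocess


def diamond_makedb_command(fasta_path, db_prefix):
    """Build the argument list for `diamond makedb`."""
    # Diamond writes the database to <db_prefix>.dmnd
    return ["diamond", "makedb", "--in", str(fasta_path), "--db", str(db_prefix)]


def build_diamond_db(fasta_path, db_prefix):
    """Run diamond makedb and raise an error if it fails."""
    subprocess.run(diamond_makedb_command(fasta_path, db_prefix), check=True)
```

For example, `build_diamond_db("Databases/db_OG/MyFamilies.fasta", "Databases/db_OG/MyFamilies")` would produce `MyFamilies.dmnd` next to the fasta file.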

Genomes:

Set Up:

A folder called “CDS” with your CDS fasta files
A folder called “Databases” with the three sub folders:
- db_BvsE (how we ID likely-bacterial sequences)
- db_StopFreq (for stop codon assignment)
- db_OG
  - Hook *.dmnd file ([Current version Hook-1.0.dmnd])
  - Hook *.fasta file ([Current version Hook-1.0.fasta])
A folder called “Scripts” with the 10 scripts from here on GitHub. To run locally, move all scripts into the main folder

Running:

python wrapper.py -1 1 -2 5 --cds CDS -o Output_Folder --genetic_code Gcodes.txt -d Databases > log.txt

Code parameters:

Parameter	Description
`-1`	Start/first script to run.
`-2`	End/last script to run.
`--cds`	Path to folder with CDS files in fasta format.
`-o`	Path to output folder.
`--genetic_code`	Specified genetic code, name of .txt file with Genetic codes; optional.
`-d`	Path to Databases folder.
`> log.txt`	If added to the end of the command, it will output a log file with progress, warning, or error messages.

Output:

ReadyToGo files = AA, NTD
Summary and statistics of sequences

Modularity, and replacing the Hook database

EukPhylo part 1 for genomes is composed of 5 scripts. Users can choose where to start and stop by changing the -1 and/or -2 options. Users who want to use their own gene family database need to replace the Hook fasta file in the Databases folder and build a Diamond version of their own file; the only other requirement is adjusting the naming system of the sequences.
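
When running only some of the scripts, it is easy to pass a range that does not exist for the chosen input type. The helper below is a sketch, not part of EukPhylo: it assembles the `wrapper.py` command from the flags documented above and rejects ranges outside 1-7 (assembled transcripts) or 1-5 (CDS). The function name `part1_command` is made up for illustration.

```python
# Last script number for each input type, as stated in this README.
LAST_SCRIPT = {"assembled_transcripts": 7, "cds": 5}


def part1_command(mode, input_dir, output_dir, databases,
                  first=1, last=None, genetic_code=None):
    """Return the wrapper.py command as a list, after checking the script range."""
    if mode not in LAST_SCRIPT:
        raise ValueError(f"mode must be one of {sorted(LAST_SCRIPT)}")
    max_script = LAST_SCRIPT[mode]
    last = max_script if last is None else last
    if not 1 <= first <= last <= max_script:
        raise ValueError(f"scripts for {mode} must satisfy 1 <= first <= last <= {max_script}")
    cmd = ["python", "wrapper.py", "-1", str(first), "-2", str(last),
           f"--{mode}", str(input_dir), "-o", str(output_dir), "-d", str(databases)]
    if genetic_code:
        cmd += ["--genetic_code", genetic_code]  # optional flag
    return cmd


print(" ".join(part1_command("cds", "CDS", "Output_Folder", "Databases", first=3)))
```

The returned list can be passed to `subprocess.run`, redirecting standard output to a log file if desired.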

EukPhylo part 2: MSAs, trees, contamination removal, and concatenation

MSAs and Trees

Set up:

In a main project directory:

Create a Scripts folder containing the 8 scripts from GitHub here
- In addition to the scripts, also add the trimal-trimAl and guidance.v2.02 folders, as downloaded from here and here. Smith College HPC (Grid) users see here
- IMPORTANT NOTE: Please make sure to update the paths in the guidance.py script to the full paths of your trimal-trimAl and guidance.v2.02 folders, as exemplified in the script.
Create an empty output folder (e.g. Output) for results (i.e. guidance and tree outputs)
Create a list of ten-digit codes for your target and outgroup taxa (e.g. taxa.txt)
Create a folder (e.g. R2Gs) that contains the AA ReadyToGo fasta files for all taxa (from taxa.txt)
Create a list of OGs (e.g. OG_list.txt) for tree building
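
The taxon list can be drafted from the ReadyToGo files themselves. This sketch is not a EukPhylo script. It assumes that the taxon code is the first ten characters of each sequence ID and that `taxa.txt` holds one code per line; check both assumptions against your data, and edit the resulting list to keep only target and outgroup taxa.

```python
from pathlib import Path


def taxon_codes(fasta_dir, code_length=10):
    """Collect the leading taxon code of every sequence ID in a folder of fasta files."""
    codes = set()
    for fasta in sorted(Path(fasta_dir).iterdir()):
        if not fasta.is_file():
            continue
        with fasta.open() as handle:
            for line in handle:
                # Header lines start with ">"; skip IDs shorter than the code.
                if line.startswith(">") and len(line.strip()) > code_length:
                    codes.add(line[1:1 + code_length])
    return sorted(codes)


def write_taxon_list(fasta_dir, out_path="taxa.txt"):
    """Write one taxon code per line (assumed list format) and return the codes."""
    codes = taxon_codes(fasta_dir)
    Path(out_path).write_text("\n".join(codes) + "\n")
    return codes
```

For example, `write_taxon_list("R2Gs")` writes `taxa.txt` in the current directory.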

Running:

Basic running for building MSAs and Trees:

python3 Scripts/phylotol.py --start raw --end trees --gf_list OG_list.txt --taxon_list taxa.txt --data R2Gs --output Output > Output.out

For additional input parameter options, see table below or run: python phylotol.py --help

Flag	Options	Description	Default
`--start`	`raw`, `unaligned`, `aligned`, `trees`	Stage at which to start running PhyloToL	`raw`
`--end`	`unaligned`, `aligned`, `trees`	Stage until which to run PhyloToL. Options are `unaligned` (which will run up to but not including guidance), `aligned` (which will run up to but not including RAxML), and `trees` (which will run through RAxML)	`trees`
`--gf_list`	Valid path	Path to the file with the GFs of interest. Only required if starting from the raw dataset	None
`--taxon_list`	Valid path	Path to the file with the taxa (10-digit codes) to include in the output	None
`--data`	Valid path	Path to the input dataset. The format of this varies depending on your `--start` parameter. If you are running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names)	None
`--output`	Valid path	Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts	`../`

Modularity

Below are several optional ways to parameterize EukPhylo Part 2

General:

Parameter	Options	Description	Default
`--force`		Overwrites all existing files in the `Output` folder	NA
`--tree_method`	`iqtree`, `iqtree_fast`, `raxml`, `fasttree`	Change tree building software	`iqtree_fast`

For BLAST and GUIDANCE:

Parameter	Options	Description	Default
`--blast_cutoff`	float	Blast e-value cutoff	`1e-20`
`--len_cutoff`	int	Amino acid length cutoff for removal of very short sequences after column removal in Guidance	`10`
`--guidance_iters`	int	Number of Guidance iterations for sequence removal	`5`
`--seq_cutoff`	float	During guidance, taxa are removed if their score is below this cutoff	`0.3`
`--col_cutoff`	float	During guidance, columns are removed if their score is below this cutoff	`0.0`
`--res_cutoff`	float	During guidance, residues are removed if their score is below this cutoff	`0.0`
`--guidance_threads`	int	Number of threads to allocate to Guidance	`20`

For reducing number of similar sequences:

Parameter	Required	Description	Default
`--similarity_filter`	yes	Run the similarity filter in pre-Guidance	NA
`--sim_cutoff`	yes	Float. Sequences from the same taxon that are assigned to the same OG are removed if they are more similar than this cutoff	NA
`--sim_taxa`	no	A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt)	NA

For removing known poor-quality or contaminant sequences (user informed):

Parameter	Options	Description
`--blacklist`	str	A file listing sequence IDs to remove from analysis (e.g. to_remove.txt)
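
To preview what a blacklist would remove, the same filtering can be applied to a single fasta file. This is an illustrative sketch, not the pipeline's implementation; it assumes the blacklist file lists one sequence ID per line and that IDs are matched exactly.

```python
def read_blacklist(path):
    """Read sequence IDs from a text file, one ID per line (assumed format)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, blacklist_ids):
    """Yield fasta lines, skipping every record whose ID is blacklisted."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            # The ID is the header text up to the first whitespace.
            keep = not parts or parts[0] not in blacklist_ids
        if keep:
            yield line
```

For example, `list(filter_fasta(open("taxon.fasta"), read_blacklist("to_remove.txt")))` returns the records that would be kept; `taxon.fasta` is a hypothetical file name.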

For removing sequences based on GC composition:

Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script here on GitHub

Parameter	Options	Description	Default
`--og_identifier`	`OG`, `OG6`, `OGA`, `OGG`	Select sequences by GC width	`OG`

Contamination Removal

Contamination removal within EukPhylo (also called the Contamination Loop, or CL) allows for sequence removal based on sister/subsister identification or on clade diversity. An exemplar run is available in Figshare

Set up:

An input folder (called for example Input), with both
- the treefiles
- the fasta files matching the trees
- both sets of files need to be named exactly the same except for the extension (e.g. File1.fasta, File1.tree)
An empty output folder (called for example Output)
a txt file containing the rules for contamination removal
the Scripts Folder
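
Because each tree needs a fasta file with the same name, it is worth checking the pairing before starting the loop. A minimal sketch, assuming the `.tree` and `.fasta` extensions from the example above (adjust them if your files use other extensions); the function name is made up for illustration.

```python
from pathlib import Path


def unpaired_files(input_dir, tree_ext=".tree", fasta_ext=".fasta"):
    """Return (trees without a fasta file, fasta files without a tree) as name stems."""
    root = Path(input_dir)
    trees = {p.stem for p in root.glob(f"*{tree_ext}")}
    fastas = {p.stem for p in root.glob(f"*{fasta_ext}")}
    return sorted(trees - fastas), sorted(fastas - trees)
```

Both lists are empty when every tree in the input folder has a matching fasta file and vice versa.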

Running:

Basic running of the Contamination Loop, with the sister mode:

python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop seq --sister_rules sister_rules_file.txt > log.out

Basic running of the Contamination Loop, with the clade mode:

python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop clade --clade_grabbing_rules_file clade_grabbing_rules.txt > log.out

Options:

Parameter	Required	Options	Description	Default
`--contamination_loop`	yes	`seq`, `clade`	The mode in which to run the CL	NA
`--nloops`	no	positive int	Number of iterations	`5`
`--sister_rules`	only in sisters mode	Valid path	Path to a text file containing sisters rules	NA
`--subsister_rules`	only in subsisters mode	Valid path	Path to a text file containing subsisters rules	NA
`--clade_grabbing_rules`	only in clade mode	Valid path	Path to a text file containing clade-grabbing rules	NA
`--clade_grabbing_exceptions`	no	Valid path	List of taxa to not remove for any reason	NA
`--cl_tree_method`	no	`iqtree`, `raxml`, `fasttree`, `iqtree_fast`	Tree-building method to use in each contamination loop iteration	`iqtree_fast`
`--cl_alignment_method`	no	`mafft_only`, `guidance`	Alignment method to use in each contamination loop iteration	`mafft_only`
`--cl_exclude_taxa`	no	Valid path	Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop	NA

Concatenation

EukPhylo includes an option to choose orthologs and produce a concatenated alignment.

Set up:

A folder called Output containing all outputs from the main pipeline (with Guidance, Trees, Pre-Guidance, NotGapTrimmed folders)
the Scripts folder
a list of taxa to concatenate

Running:

Basic running of the concatenate mode:

python eukphylo.py --start trees --concatenate --concat_target_taxa taxa_file.txt --data Output
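
For readers unfamiliar with concatenation, the sketch below shows the general idea: per-gene-family alignments are joined end to end for each taxon, and a taxon missing from a gene family is padded with gaps. It illustrates the technique only and is not EukPhylo's implementation; in particular, it does not cover how orthologs are chosen. The input structure (one dict per gene family, mapping taxon to aligned sequence) is an assumption made for the example.

```python
def concatenate_alignments(alignments, taxa, gap="-"):
    """Join alignments per taxon, padding missing taxa with gap characters.

    alignments: list of dicts, one per gene family, mapping taxon -> aligned sequence.
    taxa: taxa to include in the concatenated alignment.
    """
    pieces = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon absent from this gene family contributes a block of gaps.
            pieces[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}


# Two toy gene families; taxon "B" is missing from the second one.
toy = [{"A": "MK-", "B": "MKL"}, {"A": "QQ"}]
print(concatenate_alignments(toy, ["A", "B"]))
```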