EukPhylo QuickStart
Adri K. Grow edited this page 2025-10-28 13:27:12 -04:00

Note: The EukPhylo pipeline is currently being dockerised for easier installation and use. This work is still in progress, so please do not install the Docker version yet. More information about the Dockerfile can be found on the Docker branch.

Dockerfile

Users need Docker installed and running. The Dockerfile for Part 2 can be built and run with:

# Build the container
docker build -f Dockerfile.txt . --tag eukphylo:1

# Current command is:
docker run -it \
    --mount type=bind,src=$(pwd)/input_data,dst=/Input_data \
    --mount type=bind,src=$(pwd)/output_data,dst=/Output_data \
    eukphylo:1

Once development is complete, GitHub CI/CD workflows can be added to automatically build and release the Docker image for the end user.

General Steps

The EukPhylo pipeline is composed of two parts that can be run individually. Part 1 assigns gene families and needs to be run only once. Part 2 builds MSAs and trees, and implements contamination removal and concatenation. It is preferable to run Part 2 on the outputs of Part 1, but this is not required as long as the input files are in the same format: one fasta file per species, with sequence IDs starting with a 10-digit taxon identifier and ending in a gene family identifier of the form OGx_xxxxxx. See the extended version of the wiki for details.
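
The input format described above can be checked before starting Part 2. The following is a minimal sketch, not part of EukPhylo: it assumes the taxon identifier occupies the first ten characters of each sequence ID and that the OGx_xxxxxx suffix is "OG", one letter or digit, an underscore, and six digits. The function name `check_ids` is hypothetical.

```python
import re

# Assumed ID layout: a 10-character taxon code, any middle text, then a gene
# family tag shaped like OGx_xxxxxx ("OG", one letter or digit, "_", six digits).
OG_SUFFIX = re.compile(r"OG[A-Za-z0-9]_\d{6}$")
TAXON_LEN = 10


def check_ids(fasta_lines):
    """Return the header IDs that do not fit the assumed naming scheme."""
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        parts = line[1:].split()
        seq_id = parts[0] if parts else ""
        # The ID must be longer than the taxon code and end in an OG tag.
        if len(seq_id) <= TAXON_LEN or not OG_SUFFIX.search(seq_id):
            bad.append(seq_id)
    return bad
```

Passing the lines of one ReadyToGo-style fasta file returns the IDs that would need renaming; an empty list means every header fits the assumed pattern.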

Contents

  1. Install EukPhylo
  2. Run EukPhylo part 1
     a. with the Hook database or with custom database
     b. with assembled transcripts or genomic CDS
     c. modularity
  3. Run EukPhylo part 2
     a. basic running
     b. modularity
     c. contamination removal
     d. choosing orthologs and concatenation

Installing EukPhylo

Scripts can be used as downloaded from GitHub and should work on any platform. Dependencies and third-party tools, along with the versions that we use at the Katz lab:

  • TrimAl (1.2)
  • Guidance (2.2)
  • Diamond (0.9.30, compiled with GCC 8.3.0)
  • MAFFT (7.475)
  • IQ-Tree (2.1.12)
  • RAxML (8.2.12)
  • BLAST+ (2.9.0)
  • Vsearch (2.21.1, compiled with GCC 10.3.0)
  • Python (3.9.6, libraries can be installed with Pip)
  • ETE3 (pip install ete3)
  • BioPython (1.79-foss-2021b)
  • tqdm
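
A quick way to see which of the third-party tools above are reachable is to look them up on the PATH. This is a sketch under assumptions: the executable names in `TOOLS` are guesses (installations differ, for example iqtree versus iqtree2, or raxmlHPC variants), and the helper `find_tools` is hypothetical rather than part of EukPhylo.

```python
import shutil

# Executable names are assumptions; adjust them to match the local install.
TOOLS = ["trimal", "diamond", "mafft", "iqtree2", "raxmlHPC", "blastp", "vsearch"]


def find_tools(names=TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}
```

Printing the returned dictionary shows at a glance which tools still need to be installed or added to the PATH. Version numbers are not checked here.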

EukPhylo part 1: Assigning Gene families

EukPhylo part 1 runs CDS (genomes) or assembled transcripts (transcriptomes) through several scripts in order (5 for CDS, 7 for assembled transcripts) to remove bacterial contamination and produce ReadyToGo files. These scripts are run through a wrapper script.

Transcriptomes:

Set Up:

  • A folder called “AssembledTranscripts” with your assembled transcript fasta files
  • A folder called “Databases” with the three sub folders:
    • db_BvsE (how we ID likely-bacterial sequences)
    • db_StopFreq (for stop codon assignment)
    • db_OG
      • Hook *.dmnd file ([Current version Hook-1.0.dmnd])
      • Hook *.fasta file ([Current version Hook-1.0.fasta])
  • A folder called “Scripts” with scripts from here on GitHub
  • An empty folder for the output named as indicated in the run command for flag -o
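
The folder layout listed above can be verified before launching the wrapper. The sketch below only checks that the named folders exist under a project directory; `missing_folders` is a hypothetical helper, and it does not inspect the contents of the database folders.

```python
from pathlib import Path

# Folder names taken from the set-up list above (transcriptome run).
REQUIRED = [
    "AssembledTranscripts",
    "Databases/db_BvsE",
    "Databases/db_StopFreq",
    "Databases/db_OG",
    "Scripts",
]


def missing_folders(project_dir, required=REQUIRED):
    """Return the required sub-folders that are absent from project_dir."""
    root = Path(project_dir)
    return [rel for rel in required if not (root / rel).is_dir()]
```

An empty list means the layout matches the set-up list. For a genome run, the same function can be called with "CDS" in place of "AssembledTranscripts".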

Running:

python wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts -o Output_Folder --genetic_code Universal -d Databases > log.txt

Code parameters:

| Parameter | Description |
|---|---|
| -1 | Start/first script to run. |
| -2 | End/last script to run. |
| --assembled_transcripts | Path to folder with Assembled transcripts in fasta format. |
| -o | Path to output folder. |
| --genetic_code | Specified genetic code, name of .txt file with Genetic codes; optional. |
| -d | Path to Databases folder. |
| > log.txt | If added to the end of the command, it will output a log file with progress, warning, or error messages. |
| -x | Run cross-plate contamination (XPC). Only available for transcriptomes. |
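
When Part 1 is launched from another script (for example to loop over several datasets), the command above can be assembled as an argument list. This is a sketch that uses only the flags documented in the table; the helper names `part1_command` and `run_part1` are hypothetical, and the working directory is assumed to contain wrapper.py.

```python
import subprocess


def part1_command(first, last, transcripts, out_dir, databases, genetic_code=None):
    """Build the wrapper.py argument list for a transcriptome run."""
    cmd = [
        "python", "wrapper.py",
        "-1", str(first),
        "-2", str(last),
        "--assembled_transcripts", transcripts,
        "-o", out_dir,
        "-d", databases,
    ]
    if genetic_code is not None:  # --genetic_code is optional
        cmd += ["--genetic_code", genetic_code]
    return cmd


def run_part1(cmd, log_path="log.txt"):
    """Run the command, sending standard output to a log file like '> log.txt'."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=False).returncode
```

Calling `part1_command(1, 7, "AssembledTranscripts", "Output_Folder", "Databases", "Universal")` reproduces the arguments of the command shown under Running.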

Output:

  1. ReadyToGo files = AA, NTD
  2. Summary and statistics of sequences

Modularity, and replacing the Hook database

EukPhylo part 1 for transcriptomes is composed of 7 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options. Users who want to use their own gene family database need to replace the Hook fasta file in the Databases folder and build a Diamond version of their own file. The only requirement is adjusting the sequence names to follow the expected naming system.
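
Building the Diamond version of a custom gene family fasta file can be done with Diamond's `makedb` command. The sketch below assembles that command; the file name MyFamilies.fasta in the usage note and the helper `makedb_command` are hypothetical, and placing the result in db_OG follows the folder layout described in the set-up list.

```python
import subprocess
from pathlib import Path


def makedb_command(fasta_path):
    """Build a 'diamond makedb' command that writes <name>.dmnd next to the fasta."""
    fasta = Path(fasta_path)
    # Diamond appends .dmnd to the database name given with -d.
    return ["diamond", "makedb", "--in", str(fasta), "-d", str(fasta.with_suffix(""))]


def build_database(fasta_path):
    """Run diamond makedb; raises CalledProcessError if Diamond reports a failure."""
    subprocess.run(makedb_command(fasta_path), check=True)
```

For example, `build_database("Databases/db_OG/MyFamilies.fasta")` would be expected to produce Databases/db_OG/MyFamilies.dmnd, provided Diamond is installed and on the PATH.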

Genomes:

Set Up:

  • A folder called “CDS” with your CDS fasta files
  • A folder called “Databases” with the three sub folders:
    • db_BvsE (how we ID likely-bacterial sequences)
    • db_StopFreq (for stop codon assignment)
    • db_OG
      • Hook *.dmnd file ([Current version Hook-1.0.dmnd])
      • Hook *.fasta file ([Current version Hook-1.0.fasta])
  • A folder called “Scripts” with the 10 scripts from here on GitHub. To run locally, move all scripts into the main folder.

Running:

python wrapper.py -1 1 -2 5 --cds CDS -o Output_Folder --genetic_code Gcodes.txt -d Databases > log.txt

Code parameters:

| Parameter | Description |
|---|---|
| -1 | Start/first script to run. |
| -2 | End/last script to run. |
| --cds | Path to folder with CDS files in fasta format. |
| -o | Path to output folder. |
| --genetic_code | Specified genetic code, name of .txt file with Genetic codes; optional. |
| -d | Path to Databases folder. |
| > log.txt | If added to the end of the command, it will output a log file with progress, warning, or error messages. |

Output:

  1. ReadyToGo files = AA, NTD
  2. Summary and statistics of sequences

Modularity, and replacing the Hook database

EukPhylo part 1 for genomes is composed of 5 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options. Users who want to use their own gene family database need to replace the Hook fasta file in the Databases folder and build a Diamond version of their own file. The only requirement is adjusting the sequence names to follow the expected naming system.

EukPhylo part 2: MSAs, trees, contamination removal, and concatenation

MSAs and Trees

Set up:

In a main project directory:

  • Create a Scripts folder containing the 8 scripts from GitHub here
    • In addition to the scripts, also add the trimal-trimAl and guidance.v2.02 folders, as downloaded from here and here. Smith College HPC (Grid) users see here
    • IMPORTANT NOTE: Please make sure to correct the paths in the guidance.py script with the full path of the location of your trimal-trimAl and guidance.v2.02 as exemplified in the script.
  • Create an empty output folder (e.g. Output) for results (i.e. guidance and tree outputs)
  • Create a list of ten-digit codes for your target and outgroup taxa (e.g. taxa.txt)
  • Create a folder (e.g. R2Gs) that contains the AA ReadyToGo fasta files for all taxa (from taxa.txt)
  • Create a list of OGs (e.g. OG_list.txt) for tree building
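
Before running Part 2 it can help to confirm that every taxon in the taxon list has a ReadyToGo file. The sketch below assumes each ReadyToGo file name begins with the ten-character taxon code, which is an assumption about file naming rather than something stated above; `taxa_without_files` is a hypothetical helper.

```python
from pathlib import Path


def taxa_without_files(taxa_file, r2g_dir):
    """Return taxon codes from taxa_file with no matching file in r2g_dir."""
    text = Path(taxa_file).read_text()
    codes = [line.strip() for line in text.splitlines() if line.strip()]
    names = [p.name for p in Path(r2g_dir).iterdir() if p.is_file()]
    # Assumption: a ReadyToGo file name starts with its taxon code.
    return [code for code in codes if not any(n.startswith(code) for n in names)]
```

With the example names from the list above, `taxa_without_files("taxa.txt", "R2Gs")` returns the codes that still lack an input file.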

Running:

Basic running for building MSAs and Trees:

python3 Scripts/phylotol.py --start raw --end trees --gf_list OG_list.txt --taxon_list taxa.txt --data R2Gs --output Output > Output.out

For additional input parameter options, see table below or run: python phylotol.py --help

| Flag | Options | Description | Default |
|---|---|---|---|
| --start | raw, unaligned, aligned, trees | Stage at which to start running PhyloToL | raw |
| --end | unaligned, aligned, trees | Stage until which to run PhyloToL. Options are unaligned (runs up to but not including Guidance), aligned (runs up to but not including RAxML), and trees (runs through RAxML) | trees |
| --gf_list | Valid path | Path to the file with the GFs of interest. Only required if starting from the raw dataset | None |
| --taxon_list | Valid path | Path to the file with the taxa (10-digit codes) to include in the output | None |
| --data | Valid path | Path to the input dataset. The format of this varies depending on your --start parameter. If you are running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names) | None |
| --output | Valid path | Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts | ../ |
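
The --start and --end options name stages in a fixed order, so a combination only makes sense when the start stage does not come after the end stage. The sketch below encodes that ordering from the table; it assumes a start stage may equal the end stage (as in the contamination loop commands, which use trees for both), and `valid_stage_range` is a hypothetical helper, not the pipeline's own argument check.

```python
# Stage order as listed for --start in the table above.
STAGES = ["raw", "unaligned", "aligned", "trees"]


def valid_stage_range(start, end):
    """True if start and end are known stages and start is not after end."""
    # "raw" is a valid start stage but is not listed as an --end option.
    if start not in STAGES or end not in STAGES[1:]:
        return False
    return STAGES.index(start) <= STAGES.index(end)
```

For instance, raw to trees is accepted, while trees to aligned is rejected because it runs backwards.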

Modularity

Below are several optional ways to parameterize EukPhylo Part 2

General:

| Parameter | Options | Description | Default |
|---|---|---|---|
| --force | | Overwrites all existing files in the Output folder | NA |
| --tree_method | iqtree, iqtree_fast, raxml, fasttree | Change tree building software | iqtree_fast |

For BLAST and GUIDANCE:

| Parameter | Options | Description | Default |
|---|---|---|---|
| --blast_cutoff | float | Blast e-value cutoff | 1e-20 |
| --len_cutoff | int | Amino acid length cutoff for removal of very short sequences after column removal in Guidance | 10 |
| --guidance_iters | int | Number of Guidance iterations for sequence removal | 5 |
| --seq_cutoff | float | During guidance, taxa are removed if their score is below this cutoff | 0.3 |
| --col_cutoff | float | During guidance, columns are removed if their score is below this cutoff | 0.0 |
| --res_cutoff | float | During guidance, residues are removed if their score is below this cutoff | 0.0 |
| --guidance_threads | int | Number of threads to allocate to Guidance | 20 |

For reducing number of similar sequences:

| Parameter | Required | Options | Description | Default |
|---|---|---|---|---|
| --similarity_filter | yes | | Run the similarity filter in pre-Guidance | NA |
| --sim_cutoff | yes | float | Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff | |
| --sim_taxa | no | | A file listing taxa (10-digit codes) to apply the similarity filter on (e.g. sim_taxa.txt) | NA |

For removing known poor-quality or contaminant sequences (user informed):

| Parameter | Options | Description |
|---|---|---|
| --blacklist | str | A file listing sequence IDs to remove from analysis (e.g. to_remove.txt) |
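
Conceptually, a blacklist drops every fasta record whose ID appears in the list. The sketch below illustrates that idea on a list of fasta lines; it is not the pipeline's implementation, and it assumes the sequence ID is the first whitespace-delimited token of each header. The name `drop_blacklisted` is hypothetical.

```python
def drop_blacklisted(fasta_lines, blacklist_ids):
    """Return fasta lines with all records whose ID is blacklisted removed."""
    blocked = set(blacklist_ids)
    kept = []
    keep = True  # lines before the first header are passed through
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            # Keep the record unless its ID is in the blacklist.
            keep = not parts or parts[0] not in blocked
        if keep:
            kept.append(line)
    return kept
```

Multi-line sequences are handled because the keep/drop decision made at a header applies to all lines up to the next header.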

For removing sequences based on GC composition:

Note: you must first identify sequences with OGA, OGG, OG6 using the GC_identifier.py script here on GitHub

| Parameter | Options | Description | Default |
|---|---|---|---|
| --og_identifier | OG, OG6, OGA, OGG | Select sequences by GC width | OG |

Contamination Removal

Contamination removal within EukPhylo (also called the Contamination Loop, or CL) allows for sequence removal based on sister/subsister identification or on clade diversity. An example run is available on Figshare.

Set up:

  • An input folder (called for example Input), with both
    • the treefiles
    • the fasta files matching the trees
    • both sets of files must be named identically except for the extension (e.g. File1.fasta, File1.tree)
  • An empty output folder (called for example Output)
  • a txt file containing the rules for contamination removal
  • the Scripts Folder
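
Because each tree needs a fasta file with the same name, mismatched pairs are worth catching before starting the Contamination Loop. The sketch below compares file stems in the input folder; the extensions default to those in the example above (.fasta and .tree) and may need changing for other tree file extensions. `unpaired_files` is a hypothetical helper.

```python
from pathlib import Path


def unpaired_files(input_dir, fasta_ext=".fasta", tree_ext=".tree"):
    """Return (trees lacking a fasta, fastas lacking a tree) as sorted stem lists."""
    files = [p for p in Path(input_dir).iterdir() if p.is_file()]
    fastas = {p.stem for p in files if p.suffix == fasta_ext}
    trees = {p.stem for p in files if p.suffix == tree_ext}
    return sorted(trees - fastas), sorted(fastas - trees)
```

Two empty lists mean every tree has a matching fasta file and vice versa. This checks file names only, not whether the sequence names match the tip names.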

Running:

Basic running of the Contamination Loop, with the sister mode:

python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop seq --sister_rules sister_rules_file.txt > log.out

Basic running of the Contamination Loop, with the clade mode:

python3 Scripts/eukphylo.py --start trees --end trees --data Input --output Output --contamination_loop clade --clade_grabbing_rules_file clade_grabbing_rules.txt > log.out

Options:

| Parameter | Required | Options | Description | Default |
|---|---|---|---|---|
| --contamination_loop | yes | seq, clade | The mode in which to run the CL | NA |
| --nloops | no | positive int | Number of iterations | 5 |
| --sister_rules | only in sisters mode | Valid path | Path to a text file containing sisters rules | NA |
| --subsister_rules | only in subsisters mode | Valid path | Path to a text file containing subsisters rules | NA |
| --clade_grabbing_rules | only in clade mode | Valid path | Path to a text file containing clade-grabbing rules | NA |
| --clade_grabbing_exceptions | no | Valid path | List of taxa to not remove for any reason | NA |
| --cl_tree_method | no | iqtree, raxml, fasttree, iqtree_fast | Tree-building method to use in each contamination loop iteration | iqtree_fast |
| --cl_alignment_method | no | mafft_only, guidance | Alignment method to use in each contamination loop iteration | mafft_only |
| --cl_exclude_taxa | no | Valid path | Path to a file containing taxon names present in input MSA/tree files but which should be removed in the first iteration of the contamination loop | NA |

Concatenation

EukPhylo includes an option to choose orthologs and produce a concatenated alignment.

Set up:

  • A folder called Output containing all outputs from the main pipeline (with Guidance, Trees, Pre-Guidance, NotGapTrimmed folders)
  • the Scripts folder
  • a list of taxa to concatenate

Running:

Basic running of the concatenate mode:

python eukphylo.py --start trees --concatenate --concat_target_taxa taxa_file.txt --data Output
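
As a conceptual illustration of what a concatenated alignment is, the sketch below joins per-gene alignments taxon by taxon and pads missing taxa with gaps. It is a generic supermatrix sketch, not EukPhylo's implementation, and it does not cover the ortholog-choice step; the function `concatenate` and its input layout are hypothetical.

```python
def concatenate(alignments, taxa):
    """Join alignments (dicts of taxon -> aligned sequence) into one per-taxon string."""
    supermatrix = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are filled with gap characters.
            supermatrix[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}
```

Every taxon ends up with a sequence of the same total length, which is what downstream tree-building software expects from a concatenated matrix.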