EukPhylo

mirror of http://43.156.76.180:8026/YuuMJ/EukPhylo.git synced 2025-12-27 01:10:25 +08:00

Table of Contents

Dockerfile
EukPhylo Part 1 – Gene family assignment
EukPhylo Part 2 – MSAs, gene trees, and contamination loop
Other

EukPhylo version 1.0 is designed to be a modular, accessible pipeline that includes sophisticated data curation methods, including a newly designed method of phylogeny-informed contamination removal, homology estimation for gene families (GFs), and generation of multisequence alignments (MSAs) and gene trees. More details can be found in the manuscript files (pending publication).

The core EukPhylo pipeline comprises two main components, which we refer to as EukPhylo parts 1 and 2. EukPhylo part 1 takes input sequences from a whole genome or transcriptome assembly, applies several curation steps, and provides initial homology assessment against a customizable database of reference sequences to assign GFs. Part 1 outputs a FASTA file of curated nucleotide and amino acid sequences with gene families assigned, as well as a dataset of descriptive statistics (e.g. length, coverage, and composition) for each input sample. EukPhylo part 2 is highly modular; for a given selection of taxa and GFs, it stringently assesses homology by iterating the tool Guidance (Penn et al., 2010; Sela et al., 2015), which outputs an MSA for each gene family. From these MSAs it builds gene trees, and it then applies an innovative workflow for tree topology-based contamination removal.

We also provide a suite of utility scripts for describing data output by EukPhylo (e.g. basic sequence statistics such as composition, coverage, and length) as well as performing some of EukPhylo's more complex operations (e.g. "clade-grabbing") in a stand-alone form.

Note that EukPhylo used to go by the name "PhyloToL," hence some residual instances where we refer to "PTL1" and "PTL2" instead of "EukPhylo part 1" and "EukPhylo part 2" respectively (for instance, in the GitHub folder structure).

Note: The EukPhylo pipeline is currently being dockerized for easier installation and use. This work is still in progress, so please do not install the Docker version yet. More information about the Dockerfile can be found on the Docker branch.

Dockerfile

We are in the process of containerizing EukPhylo with Docker, and we have started with Part 2. This work is ongoing. Users will need to have Docker installed and running. The Dockerfile for Part 2 can be built and run with:

# Build the container
docker build -f Dockerfile.txt . --tag eukphylo


# Current command is:
docker run -it \
    --mount type=bind,src=$(pwd)/databases,dst=/Databases \
    --mount type=bind,src=$(pwd)/input_data,dst=/Input_data \
    --mount type=bind,src=$(pwd)/output_data,dst=/Output_data \
    eukphylo
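
With `--mount type=bind`, Docker reports an error if the `src` path does not exist on the host; it does not create the path for you. A minimal sketch for preparing the three host folders before running the command above is shown here. The folder names are taken from the `src=` paths in that command; the `prepare_mounts` helper is a hypothetical name used only for illustration and is not part of EukPhylo.

```shell
#!/bin/sh
# Sketch only: create the host folders that the docker run command bind-mounts.
# prepare_mounts is an illustrative helper, not part of EukPhylo.
set -eu

prepare_mounts() {
    base_dir=$1
    # Folder names mirror the src= paths used in the docker run command.
    for name in databases input_data output_data; do
        mkdir -p "$base_dir/$name"
    done
}

prepare_mounts "$(pwd)"

# List the folders to confirm that they now exist.
ls -d "$(pwd)/databases" "$(pwd)/input_data" "$(pwd)/output_data"
```

Run this from the folder in which you intend to run `docker run`, then place your databases and input files in the corresponding folders.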

The following example runs the container with an OG list, a taxon list, and a folder of R2Gs as input. It also requires an output folder.

# Build the container 
# Note: This needs to only be done once.
docker build -f Dockerfile.txt . --tag eukphylo

⚠️ Do not change the "dst=" paths; change only the "src=" paths so that they point to your own files and folders.

docker run -it \
--mount type=bind,src=/Users/gani/phylotol_ms/Docker/PT2/OG_list.txt,dst=/EukPhylo/PTL2listofOGs.txt \
--mount type=bind,src=/Users/gani/phylotol_ms/Docker/PT2/taxon_list.txt,dst=/EukPhylo/PTL2taxon_list.txt \
--mount type=bind,src=/Users/gani/phylotol_ms/Docker/PT2/R2G,dst=/Input_data \
--mount type=bind,src=/Users/gani/phylotol_ms/Docker/PT2/Output_data,dst=/Output_data \
eukphylo
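
Because each `src=` path must already exist on the host, it can help to check the inputs before starting the container. The sketch below assumes that the four inputs sit together in one folder under the same names as in the example above (`OG_list.txt`, `taxon_list.txt`, `R2G`, `Output_data`). The `WORK_DIR` variable, its default value, and the `check_inputs` helper are hypothetical and serve only to illustrate the idea.

```shell
#!/bin/sh
# Sketch only: verify that the inputs for the docker run example exist.
# WORK_DIR and check_inputs are illustrative names, not part of EukPhylo.
set -eu

# Hypothetical host folder holding the inputs; adjust to your own layout.
WORK_DIR=${WORK_DIR:-$(pwd)/PT2}

# Report every missing input; return non-zero if any is absent.
check_inputs() {
    dir=$1
    missing=0
    for f in OG_list.txt taxon_list.txt; do
        [ -f "$dir/$f" ] || { echo "missing file: $dir/$f" >&2; missing=1; }
    done
    for d in R2G Output_data; do
        [ -d "$dir/$d" ] || { echo "missing directory: $dir/$d" >&2; missing=1; }
    done
    return "$missing"
}

if check_inputs "$WORK_DIR"; then
    echo "All inputs found under $WORK_DIR"
else
    echo "Fix the paths listed above before calling docker run" >&2
fi
```

Once the check passes, the same folder can be substituted into each `src=` path of the `docker run` command, leaving the `dst=` paths unchanged.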

After development, GitHub CI/CD workflows can be added to automatically build and release the Docker image for end users.
