EukPhylo Part 1: GF assignment
Katzlab edited this page 2025-06-02 16:39:25 -04:00

Overview and modularity

EukPhylo part 1 assigns gene families (GFs) to assembled transcripts or genomic CDS, and includes a number of quality filters and other curation steps. For transcriptomic data, quality filters include removing sequences <200 bp, identifying and sequestering putative ribosomal RNA sequences, and labeling sequences as either likely eukaryotic (_E) or prokaryotic (_P). Initial gene family assignments for both transcripts and genome CDS are done through Diamond analysis against either the EukPhylo Hook database (>15,000 gene families found across diverse eukaryotes), or a user-defined database of genes of interest. Renamed nucleotide and amino acid sequences are stored in 'ready to go' (R2G) files, and a set of statistics are generated per sequence and per taxon. Optional analyses for transcriptomes include "cross plate contamination (XPC))", which seeks to remove contamination by index switching, and exploration of alternative genetic code (of particular importance for lineages like ciliates). Additional details are outline in Figure S2.

Setup

Running part 1 requires a set of scripts, a small number of databases, and input data in the form of either assembled transcripts or genome CDS. Below is a description of everything you need in order to start running part 1 on transcriptomic or genomic samples.

Dependencies

The following dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.

Input data and folder structure

An important aspect of EukPhylo is that it controls taxon, gene family, and sequence names very carefully. EukPhylo part 1 processes input data one sample at a time, and we require that each input sample be given a ten digit code in the format Op_me_Hsap. We generally use the first two digits to represent a major clade (Opisthokonta), the second two digits to represent a minor clade (Metazoa), and the last four to represent genus and species (Homo sapiens). More information on codes in Dataset 2 on figshare.

Transcriptomes

To run EukPhylo part 1 with transcriptomic data, you need to first assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename your sequence IDs in the fasta file of assembled transcripts input to the pipeline. Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the format.

Op_me_Hsap_assembledTranscripts.fasta

where the first ten digits represent a variable sample identifier (see above). Each sequence in the fasta file must be named in the format

>NODE_40535_length_253_cov_2.87
GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA AAATGGGTACAAAAT...

where the numbers after NODE_ are a unique transcript identifier, and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the Scripts and Databases folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this

At this point, you are ready to run the code! See the Processing transcriptomes section below for next steps.

Genomes

EukPhylo part 1 for genomes takes as input genomic CDS, such as are available to download for many genome assemblies on GenBank. Similarly to the transcriptome setup above, each input file must be named in the format

Op_me_Hsap_GenBankCDS.fasta

with the first ten digits representing a unique sample identifier. Each sequence in the CDS fasta file should be formatted as downloaded from GenBank:

>lcl|NC_000001.11_cds_NP_001005484.2_1 [gene=OR4F5] [db_xref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501] [protein=olfactory receptor 4F5] [protein_id=NP_001005484.2] [location=join(65565..65573,69037..70008)] [gbkey=CDS]

ATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTCTGA...

And all of the CDS fasta files should be in a folder alongside the Scripts and Databases folders, as described above for transcriptomes.

Databases

EukPhylo part 1 requires several reference databases used at various steps in the pipeline. Using the folder structure shown above, the Databases folder should look like this:

Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix _rRNAseqs.fasta). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberrant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).

You can download these databases from the EukPhylo Figshare page. You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database here. Convert it to a Diamond database, being mindful of Diamond version, and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).

The Hook database

Users can either use the EukPhylo Hook database or a set of gene families of interest (e.g. targeting a specific function or taxon). The EukPhylo Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts and are similarity-searched for GF assignment. The EukPhylo Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the hook as described below.

Swapping out the Hook Database

Replacing the EukPhylo Hook Database with a user-defined set of gene families is straightforward:

  • You need two files, a fasta file with your target sequences, and this same file reformatted as a diamond database (diamond makedb --in <path to fasta file> -d <name of output>.dmnd)
  • The hook then is put in the db_OG database folder prior to running Part 1
  • Be aware that you must give your fasta file and your Diamond database the same exact prefix, or else your run will fail at the end. For example, the Katzlab Hook-6.6 files are named: Hook-6.6.fasta, Hook-6.6.dmnd

Running EukPhylo part 1

Processing transcriptomes

Running EPp1 relies on a variety of scripts as described here:

Running EukPhylo Part 1 on transcriptomes requires three items in your main directory:

  1. A folder named Scripts containing all Scripts for EukPhylo part 1
  2. A folder containing your assembled transcripts (as described above)
  3. The Databases folder described above

EukPhylo part 1 starts with your assembled transcripts and produces ReadyToGo files (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and summary statistics (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run EukPhylo part 1 is with one of the following commands:

On a grid python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --genetic_code Gcode.txt --databases Databases > log.txt

On a local computer: First navigate to within the Scripts folder and then run:

python3 wrapper.py --first_script 1 --last_script 7 --assembled_transcripts [full_path]/EukPhylo/PTL1/Transcriptomes/AssembledTranscripts --genetic_code universal --databases [full_path]/EukPhylo/PTL1/Transcriptomes/Databases --output [full_path]/EukPhylo/PTL1/Transcriptomes/ > log.txt

In this case, the file Gcode.txt is a text file designating genetic code assignments for each taxon. The file should contain two tab-separated columns; the first column gives a ten-digit sample identifier, and the second column the genetic code assignment to be used in translation (script 5). The genetic code options are: universal, blepharisma, chilodonella, condylostoma, euplotes, peritrich, vorticella, ciliate, mesodinium, taa, tag, tga, and none. If you are not working with ciliates, you should probably choose "universal" for each taxon, or just use the argument --genetic_code universal instead of creating a text file.

Other available parameters are:

Parameter Input type Options Description
--first_script integer 1, 2, 3, 4, 5, 6 First script to run
--last_script integer 1, 2, 3, 4, 5, 6, 7 Last script to run
--assembled_transcripts string Path to a folder of assembled transcripts, assembled by rnaSPAdes. Each assembled transcript file name should start with a unique 10 digit code, and end in "_assembledTranscripts.fasta", E.g. Op_me_hsap_assembledTranscripts.fasta
--databases string Path to databases folder The folder should contain all 3 databases
--output string Path for the output files An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder
--xplate_contam flag (true/false) include or exclude the argument Run cross-plate contamination removal (includes all files)
--genetic_code string A .txt or .tsv with two tab-separated columns, the first with the ten-digit codes and the second column with the corresponding genetics codes If all of your taxa use the same genetic code, you may enter it here. Alternatively, if you need to use a variety of genetic codes but know which codes to use, you may fill give here the path to a file.
--conspecific_names string A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names This is used to determine which sequences to remove (only between "species") in index switching (cross-plate contamination) assessment.
--minlen integer any positive integer Minimum transcript length
--maxlen integer any positive integer Maximum transcript length
--seq_count integer any positive integer Minimum number of sequences after assigning GFs

Index Switching (Cross plate contamination)

As you run EukPhylo part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell EukPhylo which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifier and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.

To run this step, you will need to add the '--xplate_contam' flag to the command line as follows:

python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt

Processing genomes

EukPhylo part 1 uses a different but similar set of scripts to process input genomic CDSs when compared to assembled transcripts:

The setup here is the same as described above, and running PTL6p1 on genomic CDSs is similar to transcriptomes, except there is no option to remove contamination as a result of index hopping and no need to identify genetic codes to use in translation (start/stop codon positions are already known), so the process is simpler. We recommend running scripts 1 through 5 in a single run, as follows:

python Scripts/wrapper.py --first_script 1 --last_script 5 --cds CDS --genetic_code Gcode.txt --databases Databases > log.txt

The parameter options are:

Parameter Type Options Description
--first_script integer 1, 2, 3, 4 First script to run
--last_script integer 2, 3, 4, 5 First script to run
--cds string Path to a folder of nucleotide CDS Each file name should start with a unique 10 digit code, and end in "_GenBankCDS.fasta", E.g. Op_me_hsap_GenBankCDS.fasta
--output string Path for the output files An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder
--genetic_code string Path to a file, Universal If all of your taxa use the same genetic code, you may enter it here
--databases string Path to databases folder The folder should contain all 3 databases

Output

When you run EukPhylo part 1 on transcriptomic data it will generate a folder called "Output" in the same folder as your "Scripts" folder. This Output folder should have the following structure (if it doesn't, something likely went wrong, and users will want to inspect input files and paths to determine errors):

The "ReadyToGo" folder contains the final cleaned output of EukPhylo part 1. The ReadyToGo_AA folder contains translated sequences, and these are the files (one per input sample) that should be input to EukPhylo part 2. The ReadyToGo_NTD folder contains the same sequences, untranslated, and ReadyToGo_TSV contains the a summary of the best Diamond hit in the Hook (or other GF reference) database for each sequence, which determines OG assignment. The PerSequenceStatSummaries and PerTaxonSummary files are also final products, giving basic sequence descriptions for all taxa in spreadsheet form.

EukPhylo part 1 also provides intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the Intermediate/TranslatedTranscriptomes folder. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file DiamondOG/allOGresults.tsv. This is useful in trying to assess alternative gene family assignments. See the headers of each EukPhylo part 1 scripts for a description of the individual intermediate outputs.

The output of PhyloToL part 1 when run on genomic data is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no TranslatedTranscriptomes folder (key intermediate files are given directly in the Intermediate folder). Again, see the headers of each EukPhylo part 1 scripts for a more detailed description of the individual intermediate outputs.