Updated EukPhylo Part 1: GF assignment (markdown)

2026-02-11 19:00:25 +08:00 · 2025-01-17 10:11:11 -05:00 · 2025-01-17 10:11:11 -05:00 · fca697bfc2
commit fca697bfc2
parent d75cc3f9ea
1 changed files with 28 additions and 28 deletions
--- a/EukPhylo-Part-1:-GF-assignment.md
+++ b/EukPhylo-Part-1:-GF-assignment.md
@ -1,10 +1,10 @@
 ## Overview and modularity

-PhyloToL part 1 (PTLp1) assigns gene families (GFs) to assembled transcripts or genomic CDS, and includes a number of quality filters and other curation steps. For transcriptomic data, quality filters include removing sequences <200 bp, identifying and sequestering putative ribosomal RNA sequences, and labeling sequences as either likely eukaryotic (_E) or prokaryotic (_P). Initial gene family assignments for both transcripts and genome CDS are done through Diamond analysis against either the PhyloToL Hook database (>15,000 gene families found across diverse eukaryotes), or a user-defined database of genes of interest. Renamed nucleotide and amino acid sequences are stored in 'ready to go' (R2G) files, and a set of statistics are generated per sequence and per taxon. Optional analyses for transcriptomes include "cross plate contamination (XPC))", which seeks to remove contamination by index switching, and exploration of alternative genetic code (of particular importance for lineages like ciliates). Additional details are outline in Figure S2.
+EukPhylo part 1 assigns gene families (GFs) to assembled transcripts or genomic CDS, and includes a number of quality filters and other curation steps. For transcriptomic data, quality filters include removing sequences <200 bp, identifying and sequestering putative ribosomal RNA sequences, and labeling sequences as either likely eukaryotic (_E) or prokaryotic (_P). Initial gene family assignments for both transcripts and genome CDS are done through Diamond analysis against either the EukPhylo Hook database (>15,000 gene families found across diverse eukaryotes), or a user-defined database of genes of interest. Renamed nucleotide and amino acid sequences are stored in 'ready to go' (R2G) files, and a set of statistics are generated per sequence and per taxon. Optional analyses for transcriptomes include "cross plate contamination (XPC))", which seeks to remove contamination by index switching, and exploration of alternative genetic code (of particular importance for lineages like ciliates). Additional details are outline in Figure S2.

 ## Setup

-Running PTLp1 requires a set of scripts, a small number of databases, and input data in the form of either assembled transcripts or genome CDS. Below is a description of everything you need in order to start running PTLp1 on transcriptomic or genomic samples.
+Running part 1 requires a set of scripts, a small number of databases, and input data in the form of either assembled transcripts or genome CDS. Below is a description of everything you need in order to start running PTLp1 on transcriptomic or genomic samples.

 ### Dependencies

@ -17,11 +17,11 @@ The following dependencies are confirmed to work using the version numbers in pa

 ### Input data and folder structure

-An important aspect of PhyloToL is that it controls taxon, gene family, and sequence names very carefully. PhyloToL part 1 processes input data one sample at a time, and we require that each input sample be given a ten digit code in the format Op_me_Hsap. We generally use the first two digits to represent a major clade (Opisthokonta), the second two digits to represent a minor clade (Metazoa), and the last four to represent genus and species (_Homo sapiens_). More information on codes in Dataset 2 on [figshare](https://figshare.com/account/home#/projects/196552).
+An important aspect of EukPhylo is that it controls taxon, gene family, and sequence names very carefully. EukPhylo part 1 processes input data one sample at a time, and we require that each input sample be given a ten digit code in the format Op_me_Hsap. We generally use the first two digits to represent a major clade (Opisthokonta), the second two digits to represent a minor clade (Metazoa), and the last four to represent genus and species (_Homo sapiens_). More information on codes in Dataset 2 on [figshare](https://figshare.com/account/home#/projects/196552).

 #### Transcriptomes

-To run PhyloToL part 1 with transcriptomic data, you need to first assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename your sequence IDs in the fasta file of assembled transcripts input to the pipeline. Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the format. 
+To run EukPhylo part 1 with transcriptomic data, you need to first assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename your sequence IDs in the fasta file of assembled transcripts input to the pipeline. Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the format. 

 > Op_me_Hsap_assembledTranscripts.fasta

@ -32,15 +32,15 @@ GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
 GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
 AAATGGGTACAAAAT...

-where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this
+where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this

-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_trans_foldersetup.png" width="30%">
+<img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_trans_foldersetup.png" width="30%">

 At this point, you are ready to run the code! See the [Processing transcriptomes](processing-transcriptomes) section below for next steps.

 #### Genomes

-PhyloToL part 1 for genomes takes as input genomic CDS, such as are available to download for many genome assemblies on GenBank. Similarly to the transcriptome setup above, each input file must be named in the format 
+EukPhylo part 1 for genomes takes as input genomic CDS, such as are available to download for many genome assemblies on GenBank. Similarly to the transcriptome setup above, each input file must be named in the format 

 >Op_me_Hsap_GenBankCDS.fasta

@ -50,48 +50,48 @@ with the first ten digits representing a unique sample identifier. Each sequence

 > ATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTCTGA...

-And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as described above for transcriptomes.
+And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Genomes/Databases) folders, as described above for transcriptomes.


 ## Databases

-PhyloToL part 1 requires several reference databases used at various steps in the pipeline. Using the folder structure shown above, the Databases folder should look like this:
+EukPhylo part 1 requires several reference databases used at various steps in the pipeline. Using the folder structure shown above, the Databases folder should look like this:

-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">
+<img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">

 Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix `_rRNAseqs.fasta`). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).

-You can download these databases from the [PhyloToL Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).
+You can download these databases from the [EukPhylo Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).

 ### The Hook database
-Users can either use the PhyloToL Hook database or a set of gene families of interest (e.g. targeting a specific function or taxon). The PhyloToL Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts and are similarity-searched for GF assignment. The PhyloToL Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the hook as described below.
+Users can either use the EukPhylo Hook database or a set of gene families of interest (e.g. targeting a specific function or taxon). The EukPhylo Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts and are similarity-searched for GF assignment. The EukPhylo Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the hook as described below.

 ### Swapping out the Hook Database
-Replacing the PhyloToL 6 Hook Database with a user-defined set of gene families is straightforward:
+Replacing the EukPhylo Hook Database with a user-defined set of gene families is straightforward:
 * You need two files, a fasta file with your target sequences, and this same file reformatted as a diamond database (`diamond makedb --in <path to fasta file> -d <name of output>.dmnd`)
 * The hook then is put in the db_OG database folder prior to running Part 1
 * Be aware that you must give your fasta file and your Diamond database the same exact prefix, or else your run will fail at the end. For example, the Katzlab Hook-6.6 files are named: `Hook-6.6.fasta`, `Hook-6.6.dmnd`

-# Running PhyloToL part 1
+# Running EukPhylo part 1

 ## Processing transcriptomes

 Running PTL6p1 relies on a variety of scripts as described here:
-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%">
+<img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%">

-Running PhyloToL Part 1 on transcriptomes requires three items in your main directory:
-1. A folder named Scripts containing all **[Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts)** for PhyloToL part 1
-2. A folder containing your **assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData)** (as described above)
+Running EukPhylo Part 1 on transcriptomes requires three items in your main directory:
+1. A folder named Scripts containing all **[Scripts](https://github.com/Katzlab/EukPhylo/tree/main/PTL1/Transcriptomes/Scripts)** for EukPhylo part 1
+2. A folder containing your **assembled [transcripts](https://github.com/Katzlab/EukPhylo/tree/main/PTL1/Transcriptomes/TestData)** (as described above)
 3. The **Databases** folder described above

-PhyloToL part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run PhyloToL part 1 is with one of the following command:
+EukPhylo part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run EukPhylo part 1 is with one of the following command:

 On a grid
 `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --genetic_code Gcode.txt --databases Databases > log.txt`

 On a local computer: First navigate to within the Scripts folder and then run:

-`python3 wrapper.py --first_script 1 --last_script 7 --assembled_transcripts [full_path]/PhyloToL-6/PTL1/Transcriptomes/AssembledTranscripts --genetic_code universal --databases [full_path]/PhyloToL-6/PTL1/Transcriptomes/Databases --output [full_path]/PhyloToL-6/PTL1/Transcriptomes/ > log.txt`
+`python3 wrapper.py --first_script 1 --last_script 7 --assembled_transcripts [full_path]/EukPhylo/PTL1/Transcriptomes/AssembledTranscripts --genetic_code universal --databases [full_path]/EukPhylo/PTL1/Transcriptomes/Databases --output [full_path]/EukPhylo/PTL1/Transcriptomes/ > log.txt`

 In this case, the file `Gcode.txt` is a text file designating genetic code assignments for each taxon. The file should contain two tab-separated columns; the first column gives a ten-digit sample identifier, and the second column the genetic code assignment to be used in translation (script 5). The genetic code options are: universal, blepharisma, chilodonella, condylostoma, euplotes, peritrich, vorticella, ciliate, mesodinium, taa, tag, tga, and none. If you are not working with ciliates, you should probably choose "universal" for each taxon, or just use the argument `--genetic_code universal` instead of creating a text file.

@ -111,16 +111,16 @@ Other available parameters are:
 | --seq_count  |integer|any positive integer| Minimum number of sequences after assigning GFs | 

 ### Index Switching (Cross plate contamination)
-As you run PhyloToL part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell PhyloToL which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifer and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.
+As you run EukPhylo part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell EukPhylo which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifer and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.

 To run this step, you will need to add the '--xplate_contam' flag to the command line as follows:

 `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt`

 ## Processing genomes
-PhyloToL part 1 uses a different but similar set of scripts to process input genomic CDSs when compared to assembled transcripts:
+EukPhylo part 1 uses a different but similar set of scripts to process input genomic CDSs when compared to assembled transcripts:
 
-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">
+<img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">

 The setup here is the same as described above, and running PTL6p1 on genomic CDSs is similar to transcriptomes, except there is no option to remove contamination as a result of index hopping and no need to identify genetic codes to use in translation (start/stop codon positions are already known), so the process is simpler. We recommend running scripts 1 through 5 in a single run, as follows:

@ -139,15 +139,15 @@ The parameter options are:

 ## Output

-When you run PhyloToL part 1 on **transcriptomic data** it will generate a folder called "Output" in the same folder as your "Scripts" folder. This Output folder should have the following structure (if it doesn't, something likely went wrong, and users will want to inspect input files and paths to determine errors):
+When you run EukPhylo part 1 on **transcriptomic data** it will generate a folder called "Output" in the same folder as your "Scripts" folder. This Output folder should have the following structure (if it doesn't, something likely went wrong, and users will want to inspect input files and paths to determine errors):

-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL_trans_output_1.png" width="30%">
+<img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL_trans_output_1.png" width="30%">

-The "ReadyToGo" folder contains the final cleaned output of PhyloToL part 1. The ReadyToGo_AA folder contains translated sequences, and these are the files (one per input sample) that should be input to PhyloToL part 2. The ReadyToGo_NTD folder contains the same sequences, untranslated, and ReadyToGo_TSV contains the a summary of the best Diamond hit in the Hook (or other GF reference) database for each sequence, which determines OG assignment. The PerSequenceStatSummaries and PerTaxonSummary files are also final products, giving basic sequence descriptions for all taxa in spreadsheet form.
+The "ReadyToGo" folder contains the final cleaned output of EukPhylo part 1. The ReadyToGo_AA folder contains translated sequences, and these are the files (one per input sample) that should be input to EukPhylo part 2. The ReadyToGo_NTD folder contains the same sequences, untranslated, and ReadyToGo_TSV contains the a summary of the best Diamond hit in the Hook (or other GF reference) database for each sequence, which determines OG assignment. The PerSequenceStatSummaries and PerTaxonSummary files are also final products, giving basic sequence descriptions for all taxa in spreadsheet form.

-PhyloToL part 1 also provides intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the `Intermediate/TranslatedTranscriptomes` folder. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file `DiamondOG/allOGresults.tsv`. This is useful in trying to assess alternative gene family assignments. See the headers of each PhyloToL part 1 scripts for a description of the individual intermediate outputs.
+EukPhylo part 1 also provides intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the `Intermediate/TranslatedTranscriptomes` folder. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file `DiamondOG/allOGresults.tsv`. This is useful in trying to assess alternative gene family assignments. See the headers of each EukPhylo part 1 scripts for a description of the individual intermediate outputs.

-The output of PhylToL part 1 when run on **genomic data** is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no `TranslatedTranscriptomes` folder (key intermediate files are given directly in the `Intermediate` folder). Again, see the headers of each PhyloToL part 1 scripts for a more detailed description of the individual intermediate outputs.
+The output of PhylToL part 1 when run on **genomic data** is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no `TranslatedTranscriptomes` folder (key intermediate files are given directly in the `Intermediate` folder). Again, see the headers of each EukPhylo part 1 scripts for a more detailed description of the individual intermediate outputs.