Updated EukPhylo Part 1: GF assignment (markdown)

2026-02-11 10:40:24 +08:00 · 2025-04-29 11:10:06 +02:00 · 2025-04-29 11:10:06 +02:00 · 00494529b1
commit 00494529b1
parent a8e7a1d336
1 changed files with 3 additions and 3 deletions
--- a/EukPhylo-Part-1:-GF-assignment.md
+++ b/EukPhylo-Part-1:-GF-assignment.md
@@ -17,7 +17,7 @@ The following dependencies are confirmed to work using the version numbers in pa

 ### Input data and folder structure

-An important aspect of EukPhylo is that it controls taxon, gene family, and sequence names very carefully. EukPhylo part 1 processes input data one sample at a time, and we require that each input sample be given a ten-character code in the format Op_me_Hsap. We generally use the first two characters to represent a major clade (Opisthokonta), the next two characters to represent a minor clade (Metazoa), and the last four to represent genus and species (_Homo sapiens_). More information on codes is available in Dataset 2 on [figshare](https://figshare.com/account/home#/projects/196552).
+An important aspect of EukPhylo is that it controls taxon, gene family, and sequence names very carefully. EukPhylo part 1 processes input data one sample at a time, and we require that each input sample be given a ten-character code in the format Op_me_Hsap. We generally use the first two characters to represent a major clade (Opisthokonta), the next two characters to represent a minor clade (Metazoa), and the last four to represent genus and species (_Homo sapiens_). More information on codes is available in Dataset 2 on [figshare](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552).

 #### Transcriptomes

@@ -32,7 +32,7 @@ GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
 GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
 AAATGGGTACAAAAT...

-where the numbers after NODE_ are a unique transcript identifier, and the following numbers represent the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this:
+where the numbers after NODE_ are a unique transcript identifier, and the following numbers represent the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this:

 <img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_trans_foldersetup.png" width="30%">

@@ -61,7 +61,7 @@ EukPhylo part 1 requires several reference databases used at various steps in th

 Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix `_rRNAseqs.fasta`). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e., for studies of ciliates and other lineages with aberrant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH a fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).

-You can download these databases from the [EukPhylo Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).
+You can download these databases from the [EukPhylo Figshare page](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://figshare.com/projects/EukPhylo_Supplemental_Files/196552). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).

 ### The Hook database
 Users can either use the EukPhylo Hook Database or a set of gene families of interest (e.g. targeting a specific function or taxon). The EukPhylo Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts are similarity-searched for GF assignment. The EukPhylo Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook Database (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the Hook Database as described below.
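
The naming rules described in this page (the ten-character sample code and the paired Hook Database files in db_OG) can be checked before a run. The following is only a minimal sketch and not part of EukPhylo: the function names `is_valid_taxon_code` and `hook_pairs` are hypothetical, and the accepted character set for the code is an assumption inferred from the example `Op_me_Hsap`.

```python
import re
from pathlib import Path

# Assumption: a sample code is three alphanumeric fields of width 2, 2 and 4
# joined by underscores, as in the example Op_me_Hsap. EukPhylo itself may
# accept a different character set.
TAXON_CODE = re.compile(r"[A-Za-z0-9]{2}_[A-Za-z0-9]{2}_[A-Za-z0-9]{4}")


def is_valid_taxon_code(code: str) -> bool:
    """Return True if the whole string matches the assumed ten-character layout."""
    return TAXON_CODE.fullmatch(code) is not None


def hook_pairs(db_og: Path) -> list[str]:
    """Return base names present as both <name>.fasta and <name>.dmnd in db_og."""
    fasta = {p.stem for p in db_og.glob("*.fasta")}
    dmnd = {p.stem for p in db_og.glob("*.dmnd")}
    return sorted(fasta & dmnd)
```

An empty list from `hook_pairs(Path("Databases/db_OG"))` would suggest that the fasta file and the Diamond database are missing or do not share a name; the path shown here is illustrative and depends on where the Databases folder was placed.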