From 5c84f9e84daf685bdc41b05dd9709152528df850 Mon Sep 17 00:00:00 2001 From: Katzlab Date: Wed, 14 Aug 2024 04:39:41 -0400 Subject: [PATCH] Updated PhyloToL Part 1: GF assignment (markdown) --- PhyloToL-Part-1:-GF-assignment.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/PhyloToL-Part-1:-GF-assignment.md b/PhyloToL-Part-1:-GF-assignment.md index 0680055..ed52c51 100644 --- a/PhyloToL-Part-1:-GF-assignment.md +++ b/PhyloToL-Part-1:-GF-assignment.md @@ -50,7 +50,7 @@ with the first ten digits representing a unique sample identifier. Each sequence > ATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTCTGA... -And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as above. +And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as described above for transcriptomes. ## Databases @@ -59,7 +59,7 @@ PhyloToL part 1 requires several reference databases used at various steps in th -Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and bacterial (micout.dmnd) sequences, used for identification of putative bacterial contamination. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA. The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment. The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension. +Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file called **TBD**). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. **TBD** Hook.fasta, Hook,dmd). You can download these databases from the [PhyloToL Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).