From 5c84f9e84daf685bdc41b05dd9709152528df850 Mon Sep 17 00:00:00 2001
From: Katzlab <katzlab@smith.edu>
Date: Wed, 14 Aug 2024 04:39:41 -0400
Subject: [PATCH] Updated PhyloToL Part 1: GF assignment (markdown)

---
 PhyloToL-Part-1:-GF-assignment.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/PhyloToL-Part-1:-GF-assignment.md b/PhyloToL-Part-1:-GF-assignment.md
index 0680055..ed52c51 100644
--- a/PhyloToL-Part-1:-GF-assignment.md
+++ b/PhyloToL-Part-1:-GF-assignment.md
@@ -50,7 +50,7 @@ with the first ten digits representing a unique sample identifier. Each sequence
 
 > ATGAAGAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTCTGA...
 
-And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as above.
+And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as described above for transcriptomes.
 
 
 ## Databases
@@ -59,7 +59,7 @@ PhyloToL part 1 requires several reference databases used at various steps in th
 
 <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">
 
-Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and bacterial (micout.dmnd) sequences, used for identification of putative bacterial contamination. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA. The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment. The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension.
+Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (sequences are ultimately labeled _P for putative prokaryotic or _E for likely eukaryotic). These are only preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file called **TBD**). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e., for studies of ciliates and other lineages with aberrant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH a fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. **TBD** Hook.fasta, Hook.dmnd).
 
 You can download these databases from the [PhyloToL Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).