Updated PhyloToL Part 1: GF assignment (markdown)

Katzlab 2024-08-14 11:40:45 -04:00
parent 1ea28fece4
commit 9af12dac99

@ -59,7 +59,7 @@ PhyloToL part 1 requires several reference databases used at various steps in th
<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">
Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file called **TBD**). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).
Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix `_rRNAseqs.fasta`). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).
You can download these databases from the [PhyloToL Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).