Updated PhyloToL Part 1: GF assignment (markdown)

Katzlab 2024-08-13 14:49:32 -04:00
parent 1962150462
commit bc2cedb913

@ -9,6 +9,7 @@ Below is a description of everything you need in order to start running PhyloToL
### Dependencies ### Dependencies
The following are required to run PhyloToL part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well. The following are required to run PhyloToL part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
* Python 3
* [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75) * [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75)
* [DIAMOND](https://github.com/bbuchfink/diamond) (version 0.9.30) * [DIAMOND](https://github.com/bbuchfink/diamond) (version 0.9.30)
* [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html) (version 2.9.0) * [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html) (version 2.9.0)
@ -31,7 +32,7 @@ GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
AAATGGGTACAAAAT... AAATGGGTACAAAAT...
where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Databases) folders from this repository, and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this
<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_trans_foldersetup.png" width="30%"> <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_trans_foldersetup.png" width="30%">
@ -53,29 +54,43 @@ TTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATGTACTTCCTGCTAGCCAACCTC
And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as above. And all of the CDS fasta files should be in a folder alongside the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Genomes/Databases) folders, as above.
## The Hook Database ## Databases
PhyloToL part 1 requires several reference databases used at various steps in the pipeline. Using the folder structure shown above, the Databases folder should look like this:
<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">
Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and bacterial (micout.dmnd) sequences, used for identification of putative bacterial contamination. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA. The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment. The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension.
You can download these databases from the [PhyloToL Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).
### The Hook database
Users can either use the PhyloToL Hook database or a set of gene families of interest (e.g. targeting a specific function or taxon). The PhyloToL Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts and are similarity-searched for GF assignment. The PhyloToL Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the hook as described below. Users can either use the PhyloToL Hook database or a set of gene families of interest (e.g. targeting a specific function or taxon). The PhyloToL Hook Database is composed of 1,453,081 sequences across 15,414 GFs, and serves as a reference database against which assembled transcripts and are similarity-searched for GF assignment. The PhyloToL Hook Database captures a broad diversity of eukaryotic gene families and was built using sequence data from OrthoMCL version 6.13, which we sampled to select for OGs that are present across the eukaryotic tree and/or present in under-sampled lineages of eukaryotes (Fig. S1, Figure 2). To add value for users, we also include functional annotations for each OG in the Hook (Dataset S11; see methods in SI Appendix). Alternatively, users can replace the hook as described below.
### Swapping out the Hook Database
### Swapping out the hook Replacing the PhyloToL 6 Hook Database with a user-defined set of gene families is straightforward:
Replacing the PhyloToL Hook DB with a user-defined set of gene families is straightforward:
* You need two files, a fasta file with your target sequences, and this same file reformatted as a diamond database (`diamond makedb --in <path to fasta file> -d <name of output>.dmnd`) * You need two files, a fasta file with your target sequences, and this same file reformatted as a diamond database (`diamond makedb --in <path to fasta file> -d <name of output>.dmnd`)
* The hook then is put in the db_OG database folder prior to running Part 1 * The hook then is put in the db_OG database folder prior to running Part 1
* Be aware that you must give your fasta file and your diamond file the same exact prefix, or else your run will fail at the end. For example, the Katzlab Hook-6.6 files are named: `Hook-6.6.fasta`, `Hook-6.6.dmnd` * Be aware that you must give your fasta file and your Diamond database the same exact prefix, or else your run will fail at the end. For example, the Katzlab Hook-6.6 files are named: `Hook-6.6.fasta`, `Hook-6.6.dmnd`
# Running PhyloToL part 1 # Running PhyloToL part 1
## Processing transcriptomes ## Processing transcriptomes
Role of each script
<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%"> <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%">
Running PhyloToL Part 1 on transcriptomes requires at least 3 items in your main directory: 1) A folder named Scripts and containing all **[Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts)** from PhyloToL part 1 github, 2) a folder containing your **assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData)** (as described above), and 3) a folder containing the **Databases** with three subfolders(db_BvsE (how we ID likely-bacterial sequences), db_StopFreq (for stop codon assignment), and db_OG (The hook database as described above)). Default script starts with your **assembled transcripts** and produces **ReadyToGo files** (nucleotide and amino acid sequences) of each taxa, and **summary information** of the sequences processed for those taxa. Running PhyloToL Part 1 on transcriptomes requires three items in your main directory:
* To run the PhyloToL part 1 for processing transcriptomes, run: 1. A folder named Scripts containing all **[Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts)** for PhyloToL part 1
`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases > log.txt` 2. A folder containing your **assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData)** (as described above)
3. The **Databases** folder described above
Available parameters are: PhyloToL part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (nucleotide coding regions and amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching) is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run PhyloToL part 1 is with the following command:
`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --genetic_code Gcode.txt --databases Databases > log.txt`
In this case, the file `Gcode.txt` is a text file designating genetic code assignments for each taxon. The file should contain two tab-separated columns; the first column gives a ten-digit sample identifier, and the second column the genetic code assignment to be used in translation (script 5). The genetic code options are: universal, blepharisma, chilodonella, condylostoma, euplotes, peritrich, vorticella, ciliate, mesodinium, taa, tag, tga, and none. If you are not working with ciliates, you should probably choose "universal" for each taxon, or just use the argument `--genetic_code universal` instead of creating a text file.
Other available parameters are:
| Parameter | Type| Options| Description| | Parameter | Type| Options| Description|
| ----------- | ----------------- |----------- | ----------------- | | ----------- | ----------------- |----------- | ----------------- |
| --first_script |int |1, 2, 3, 4, 5, 6 | First script to run | | --first_script |int |1, 2, 3, 4, 5, 6 | First script to run |
@ -88,26 +103,16 @@ Available parameters are:
|--conspecific_names |str| A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in index switching (cross-plate contamination) assessment. | |--conspecific_names |str| A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in index switching (cross-plate contamination) assessment. |
| --minlen |int| -| Minimum transcript length | | --minlen |int| -| Minimum transcript length |
| --maxlen |int|-| Maximum transcript length | | --maxlen |int|-| Maximum transcript length |
| --seq_count |int|-| minimum number of sequences after assigning OGs | | --seq_count |int|-| Minimum number of sequences after assigning GFs |
### Index Switching (Cross plate contamination) ### Index Switching (Cross plate contamination)
As you run PhyloToL part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by (**LAK and ACL on XPC removal process with conspecific file**). To include this parameter to your PhyloToL part 1 run, you will need to add the '--xplate_contam --conspecific_names Conspecific.txt' flag to the command line as follow: As you run PhyloToL part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell PhyloToL which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifer and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be consepecific.
To run this step, you will need to add the '--xplate_contam' flag to the command line as follows:
`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt` `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt`
Example of a Conspecific.txt file
| Taxa | identifying name |
| ----- |------ |
| EE_uc_Me04 | Metatranscriptome |
| EE_uc_Me05 | Metatranscriptome |
| EE_uc_Me06 | Metatranscriptome |
| EE_uc_Me07 | Metatranscriptome |
## Processing genomes ## Processing genomes
Role of each script
<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%"> <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">
Running PhyloToL Part 1 on genomes requires at least 3 items in your main directory: 1) A folder named Scripts and containing all **[Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts)** from PhyloToL part 1 github, 2) a folder containing your **[CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData)** (as described above), and 3) a folder containing the **Databases** with three subfolders(db_BvsE (how we ID likely-bacterial sequences), db_StopFreq (for stop codon assignment), and db_OG (The hook database as described above)). Default script starts with your **CDS** and produces **ReadyToGo files** (nucleotide and amino acid sequences) of each taxa, and **summary information** of the sequences processed for those taxa. Running PhyloToL Part 1 on genomes requires at least 3 items in your main directory: 1) A folder named Scripts and containing all **[Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts)** from PhyloToL part 1 github, 2) a folder containing your **[CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData)** (as described above), and 3) a folder containing the **Databases** with three subfolders(db_BvsE (how we ID likely-bacterial sequences), db_StopFreq (for stop codon assignment), and db_OG (The hook database as described above)). Default script starts with your **CDS** and produces **ReadyToGo files** (nucleotide and amino acid sequences) of each taxa, and **summary information** of the sequences processed for those taxa.