diff --git a/PhyloToL-Part-1:-GF-assignment.md b/PhyloToL-Part-1:-GF-assignment.md
index c5695b1..41ba2b3 100644
--- a/PhyloToL-Part-1:-GF-assignment.md
+++ b/PhyloToL-Part-1:-GF-assignment.md
@@ -1,14 +1,14 @@
 ## Overview and modularity
-PhyloToL part 1 assigns gene families (GFs) to assembled transcripts or genomic CDS, and includes a number of quality filters and other curation steps. For transcriptomic data, quality filters include removing sequences <200 bp, identifying and sequestering putative ribosomal RNA sequences, and labeling sequences as either likely eukaryotic (_E) or prokaryotic (_P). Initial gene family assignments for both transcripts and genome CDS are done through Diamond analysis against either the PhyloToL Hook database (>15,000 gene families found across diverse eukaryotes), or a user-defined database of genes of interest. Renamed nucleotide and amino acid sequences are stored in 'ready to go' (R2G) files, and a set of statistics are generated per sequence and per taxon. Optional analyses for transcriptomes include "cross plate contamination (XPC))", which seeks to remove contamination by index switching, and exploration of alternative genetic code (of particular importance for lineages like ciliates). Additional details are outline in Figure S2.
+PhyloToL part 1 (PTLp1) assigns gene families (GFs) to assembled transcripts or genomic CDS, and includes a number of quality filters and other curation steps. For transcriptomic data, quality filters include removing sequences <200 bp, identifying and sequestering putative ribosomal RNA sequences, and labeling sequences as either likely eukaryotic (_E) or prokaryotic (_P). Initial gene family assignments for both transcripts and genome CDS are done through Diamond analysis against either the PhyloToL Hook database (>15,000 gene families found across diverse eukaryotes), or a user-defined database of genes of interest.
Renamed nucleotide and amino acid sequences are stored in 'ready to go' (R2G) files, and a set of statistics is generated per sequence and per taxon. Optional analyses for transcriptomes include "cross plate contamination" (XPC), which seeks to remove contamination caused by index switching, and exploration of alternative genetic codes (of particular importance for lineages like ciliates). Additional details are outlined in Figure S2.
 ## Setup
-Below is a description of everything you need in order to start running PhyloToL part 1 on transcriptomic or genomic samples!
+Running PTLp1 requires a set of scripts, a small number of databases, and input data in the form of either assembled transcripts or genome CDS. Below is a description of everything you need in order to start running PTLp1 on transcriptomic or genomic samples.
 ### Dependencies
-The following are required to run PhyloToL part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
+The following dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
 * Python 3
 * [Biopython](https://biopython.org/docs/latest/index.html) (version 1.75)
 * [DIAMOND](https://github.com/bbuchfink/diamond) (version 0.9.30)
@@ -17,11 +17,11 @@ The following are required to run PhyloToL part 1. The dependencies are confirme
 ### Input data and folder structure
-An important aspect of PhyloToL is that it controls taxon, gene family, and sequence names very carefully. PhyloToL part 1 processes input data one sample at a time, and we require that each input sample be given a ten digit code in the format Op_me_Hsap. We generally use the first two digits to represent a major clade (Opisthokonta), the second two digits to represent a minor clade (Metazoa), and the last four to represent genus and species (Homo sapiens).
+An important aspect of PhyloToL is that it controls taxon, gene family, and sequence names very carefully. PhyloToL part 1 processes input data one sample at a time, and we require that each input sample be given a ten-character code in the format Op_me_Hsap. We generally use the first two characters to represent a major clade (Opisthokonta), the next two characters to represent a minor clade (Metazoa), and the last four to represent genus and species (_Homo sapiens_). More information on codes is available in Dataset 2 on [figshare](https://figshare.com/account/home#/projects/196552).
 #### Transcriptomes
-To run PhyloToL part 1 with transcriptomic data, you need to first assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename your sequence IDs in the fasta file of assembled transcripts input to the pipeline. Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the format
+To run PhyloToL part 1 with transcriptomic data, you need to first assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename your sequence IDs in the fasta file of assembled transcripts input to the pipeline. Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the following format:
 > Op_me_Hsap_assembledTranscripts.fasta
@@ -32,7 +32,7 @@ GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
 GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
 AAATGGGTACAAAAT...
-where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively.
All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this
+where the number after NODE_ is a unique transcript identifier, and the following numbers represent the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this