From d6ab88cf1392e5fc2ac87d38441119f93c9ee3f7 Mon Sep 17 00:00:00 2001
From: Katzlab
Date: Fri, 9 Aug 2024 16:24:39 -0400
Subject: [PATCH] Updated PhyloToL Part 1 (markdown)

---
 PhyloToL-Part-1.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/PhyloToL-Part-1.md b/PhyloToL-Part-1.md
index 839944e..b895c04 100644
--- a/PhyloToL-Part-1.md
+++ b/PhyloToL-Part-1.md
@@ -1,7 +1,11 @@
 ## Overview and modularity

+PhyloToL part 1 is primarily intended to assign gene families to assembled transcripts or genomic CDS, but also contains a number of quality filters and other curation steps. _More description here_
+
 ## Setup

+Below is a description of everything you need in order to start running PhyloToL part 1 on transcriptomic or genomic samples!
+
 ### Dependencies

 The following are required to run PhyloToL part 1. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.
@@ -12,13 +16,31 @@ The following are required to run PhyloToL part 1. The dependencies are confirme

 ### Input data and folder structure

+An important aspect of PhyloToL is that it controls taxon, gene family, and sequence names very carefully. PhyloToL part 1 processes input data one sample at a time, and we require that each input sample be given a ten-character code in the format Op_me_Hsap. We generally use the first two letters to represent a major clade (Opisthokonta), the next two letters to represent a minor clade (Metazoa), and the last four to represent genus and species (Homo sapiens).
+
 #### Transcriptomes

-
+To run PhyloToL part 1 with transcriptomic data, you first need to assemble your transcripts. We use rnaSpades, so our scripts are designed to read sequence IDs as formatted by rnaSpades. You can use your assembler of choice, but you'll need to rename the sequence IDs in the fasta file of assembled transcripts that you input to the pipeline.
Each input fasta file of assembled transcripts ("transcripts.fasta" as output by rnaSpades) must be renamed in the format
+> Op_me_Hsap_assembledTranscripts.fasta
+
+where the first ten characters represent a variable sample identifier (see above). Each sequence in the fasta file must be named in the format
+
+>\>NODE_40535_length_253_cov_2.87\
+GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
+GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
+AAATGGGTACAAAAT...
+
+where the number after NODE_ is a unique transcript identifier, and the following numbers represent the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise here and throughout). Next, download the [Scripts](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://github.com/Katzlab/PhyloToL-6/blob/main/PTL1/Transcriptomes/Databases) folders from this repository, and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of PhyloToL part 1 will be located, looking something like this
+
+
+
+At this point, you are ready to run the code! See the [Processing transcriptomes](processing-transcriptomes) section below for next steps.

 #### Genomes

+PhyloToL part 1 for genomes takes as input genomic CDS, such as are available to download for many genome assemblies on GenBank.
+
 ## The Hook Database