Updated PhyloToL Part 1: GF assignment (markdown)

2025-12-28 17:10:26 +08:00 · 2024-08-14 04:50:45 -04:00 · 9a58a5e14d
commit 9a58a5e14d
parent 6da0e66ae2
1 changed files with 8 additions and 5 deletions
--- a/PhyloToL-Part-1:-GF-assignment.md
+++ b/PhyloToL-Part-1:-GF-assignment.md
@@ -76,6 +76,7 @@ Replacing the PhyloToL 6 Hook Database with a user-defined set of gene families
 ## Processing transcriptomes
 Running PTL6p1 relies on a variety of scripts as described here:
 <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%">
 Running PhyloToL Part 1 on transcriptomes requires three items in your main directory:
@@ -83,7 +84,7 @@ Running PhyloToL Part 1 on transcriptomes requires three items in your main dire
 2. A folder containing your **assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData)** (as described above)
 3. The **Databases** folder described above
-PhyloToL part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (nucleotide coding regions and amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching) is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run PhyloToL part 1 is with the following command:
+PhyloToL part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run PhyloToL part 1 is with the following command:
 `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --genetic_code Gcode.txt --databases Databases > log.txt`
@@ -105,16 +106,18 @@ Other available parameters are:
 | --seq_count  |int|-| Minimum number of sequences after assigning GFs | 
 ### Index Switching (Cross plate contamination)
-As you run PhyloToL part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell PhyloToL which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifer and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be consepecific.
+As you run PhyloToL part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell PhyloToL which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifier and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.
 To run this step, you will need to add the '--xplate_contam' flag to the command line as follows:
 `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt`
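The filtering rule described for index switching can be illustrated with a minimal Python sketch. This is not PhyloToL's own code: the function names, the in-memory cluster representation, and the sample and group identifiers are all hypothetical, and the clustering itself (done by Vsearch in the pipeline) is assumed to have already happened. The sketch only shows the two ideas from the text: a two-column, tab-separated conspecific file, and removal of sequences below one-tenth of the top coverage in a cluster unless they are conspecific with the highest-covered sequence.

```python
import csv
import io


def read_conspecific(handle):
    """Map each sample identifier (column 1) to its group (column 2)."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        groups[row[0].strip()] = row[1].strip()
    return groups


def flag_index_switching(cluster, groups, ratio=0.1):
    """Return the sequence names in one cluster that the rule would remove.

    cluster: list of (sample_id, seq_name, kmer_coverage) tuples
             (a hypothetical representation of one 99%-identity cluster).
    """
    top_sample, _, top_cov = max(cluster, key=lambda rec: rec[2])
    top_group = groups.get(top_sample)
    removed = []
    for sample, name, cov in cluster:
        # Same sample, or same group as the highest-covered sequence.
        conspecific = sample == top_sample or (
            top_group is not None and groups.get(sample) == top_group
        )
        if cov < ratio * top_cov and not conspecific:
            removed.append(name)
    return removed


# Hypothetical contents of a file passed to --conspecific_names.
example_names = "Sr_ci_Aaaa\tGenusA\nSr_ci_Bbbb\tGenusA\nAm_di_Cccc\tGenusB\n"
groups = read_conspecific(io.StringIO(example_names))

cluster = [
    ("Sr_ci_Aaaa", "contig_1", 250.0),  # highest coverage in the cluster
    ("Sr_ci_Bbbb", "contig_7", 4.0),    # low coverage, but same group: kept
    ("Am_di_Cccc", "contig_3", 12.0),   # low coverage, other group: removed
]
print(flag_index_switching(cluster, groups))
```

In this toy cluster the threshold is 25 (one-tenth of 250), so only `contig_3` is flagged; `contig_7` falls below the threshold but is retained because its sample shares a group with the highest-covered sequence.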
 ## Processing genomes
 PhyloToL part 1 uses a different but similar set of scripts to process input genomic CDSs when compared to assembled transcripts:
 <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">
-PhyloToL part 1 uses a different but similar set of scripts to process input genomic CDSs (as opposed to assembled transcripts). The setup here is the same as described above, and running PhyloToL part 1 on genomic CDSs is similar to as described above for transcriptomes, except there is no option to remove contamination as a result of index hopping and no need to identify genetic codes to use in translation (start/stop codon positions are already known), so the process is simpler. We recommend always running scripts 1 through 5 in a single run, as follows:
+The setup here is the same as described above, and running PTL6p1 on genomic CDSs is similar to transcriptomes, except there is no option to remove contamination as a result of index hopping and no need to identify genetic codes to use in translation (start/stop codon positions are already known), so the process is simpler. We recommend running scripts 1 through 5 in a single run, as follows:
 `python Scripts/wrapper.py --first_script 1 --last_script 5 --cds CDS --genetic_code Gcode.txt --databases Databases > log.txt`
@@ -131,13 +134,13 @@ The parameter options are:
 ## Output
-When you run PhyloToL part 1 on **transcriptomic data** it will generate a folder called "Output" in the same folder as your "Scripts" folder. This Output folder should have the following structure (if it doesn't, something likely went wrong):
+When you run PhyloToL part 1 on **transcriptomic data** it will generate a folder called "Output" in the same folder as your "Scripts" folder. This Output folder should have the following structure (if it doesn't, something likely went wrong, and users will want to inspect input files and paths to determine errors):
 <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL_trans_output_1.png" width="30%">
 The "ReadyToGo" folder contains the final cleaned output of PhyloToL part 1. The ReadyToGo_AA folder contains translated sequences, and these are the files (one per input sample) that should be input to PhyloToL part 2. The ReadyToGo_NTD folder contains the same sequences, untranslated, and ReadyToGo_TSV contains a summary of the best Diamond hit in the Hook (or other GF reference) database for each sequence, which determines OG assignment. The PerSequenceStatSummaries and PerTaxonSummary files are also final products, giving basic sequence descriptions for all taxa in spreadsheet form.
-PhyloToL part 1 also provides all intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the `Intermediate/TranslatedTranscriptomes` folder, in which all intermediate files for each taxon are stored. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file `DiamondOG/allOGresults.tsv`. This could be useful in trying to assess alternative gene family assignments. See the headers of each PhyloToL part 1 scripts for a description of the individual intermediate outputs.
+PhyloToL part 1 also provides intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the `Intermediate/TranslatedTranscriptomes` folder. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file `DiamondOG/allOGresults.tsv`. This is useful in trying to assess alternative gene family assignments. See the header of each PhyloToL part 1 script for a description of the individual intermediate outputs.
 The output of PhyloToL part 1 when run on **genomic data** is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no `TranslatedTranscriptomes` folder (key intermediate files are given directly in the `Intermediate` folder). Again, see the header of each PhyloToL part 1 script for a more detailed description of the individual intermediate outputs.
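To assess alternative gene family assignments from the unfiltered Diamond hits, something like the following Python sketch may help. It assumes that `allOGresults.tsv` uses Diamond's default 12-column tabular layout (query, subject, percent identity, ..., e-value, bit score); the page does not state the column layout, so check the file before relying on this. The function name and the example rows (sequence and gene family names) are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def hits_by_query(handle):
    """Group hits per query sequence, best bit score first.

    Assumes Diamond's default tabular columns: column 1 is the query,
    column 2 the subject, column 11 the e-value, column 12 the bit score.
    """
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip lines that do not match the assumed layout
        hits[row[0]].append((row[1], float(row[10]), float(row[11])))
    for query in hits:
        hits[query].sort(key=lambda hit: hit[2], reverse=True)
    return dict(hits)


# Made-up rows standing in for DiamondOG/allOGresults.tsv.
example = (
    "seq_1\tOG_A\t62.0\t200\t70\t2\t1\t600\t5\t204\t1e-40\t150.0\n"
    "seq_1\tOG_B\t80.5\t210\t40\t1\t1\t630\t3\t212\t1e-90\t310.0\n"
    "seq_2\tOG_C\t55.1\t120\t50\t3\t10\t370\t8\t127\t1e-20\t95.5\n"
)
grouped = hits_by_query(io.StringIO(example))
for query, query_hits in grouped.items():
    best, *alternatives = query_hits
    print(query, "best:", best[0], "alternatives:", [h[0] for h in alternatives])
```

With a real file, replace the `io.StringIO(example)` handle with `open(path, newline="")`, where `path` points at the `allOGresults.tsv` file for the taxon of interest.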