Updated PhyloToL Part 1: GF assignment (markdown)

2025-12-29 06:20:25 +08:00 · 2024-08-12 15:49:58 -04:00 · 2024-08-12 15:49:58 -04:00 · d36cc2f156
commit d36cc2f156
parent f78bc26282
1 changed files with 42 additions and 41 deletions
--- a/PhyloToL-Part-1:-GF-assignment.md
+++ b/PhyloToL-Part-1:-GF-assignment.md
@@ -71,6 +71,47 @@ Replacing the PhyloToL Hook DB with a user-defined set of gene families is strai
 Role of each script
 <img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Transcriptomes_scripts.png" width="100%">

+
+To run PhyloToL part 1 for processing transcriptomes, use:
+`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases > log.txt`
+
+Available parameters are:
+| Parameter | Type| Options| Description|
+| ----------- | ----------------- |----------- | ----------------- |
+| --first_script |int |1, 2, 3, 4, 5, 6 | First script to run | 
+| --last_script  |int|1, 2, 3, 4, 5, 6, 7 | Last script to run | 
+| --assembled_transcripts  |str|Path to a folder of transcripts assembled by rnaSPAdes | Each assembled transcript file name should start with a unique 10-digit code and end in "_assembledTranscripts.fasta", e.g. Op_me_hsap_assembledTranscripts.fasta | 
+| --databases| str| Path to databases folder | The folder should contain all 3 databases|
+| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
+|--xplate_contam |-|- | Run cross-plate contamination removal (includes all files) | 
+| --genetic_code  |str|A .txt or .tsv file with two tab-separated columns, the first with the ten-digit codes and the second with the corresponding genetic codes| If all of your taxa use the same genetic code, you may enter it here. Alternatively, if your taxa use a variety of genetic codes and you know which ones, you may give the path to a file here. | 
+|--conspecific_names  |str| A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in cross-plate contamination assessment. | 
+| --minlen |int| -| Minimum transcript length | 
+| --maxlen  |int|-| Maximum transcript length | 
+| --seq_count  |int|-| Minimum number of sequences after assigning OGs | 
+
+
+To run PhyloToL part 1 for both processing transcriptomes and removing sequences that resulted from index switching (cross-plate contamination), use: 
+`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt`
+
+
+### Processing genomes
+Role of each script
+<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">
+* **Main inputs** : A folder containing the [CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts).
+* **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
+
+
+| Parameter | Type| Options| Description|
+| ----------- | ----------------- |----------- | ----------------- |
+| --first_script| int |  1, 2, 3, 4 | First script to run |
+| --last_script | int | 2, 3, 4, 5 | Last script to run|
+| --cds| str|Path to a folder of nucleotide CDS| Each file name should start with a unique 10 digit code, and end in "_GenBankCDS.fasta", E.g. Op_me_hsap_GenBankCDS.fasta|
+| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
+| --genetic_code| str| Path to a file, Universal | If all of your taxa use the same genetic code, you may enter it here|
+| --databases| str| Path to databases folder | The folder should contain all 3 databases|
+
+
 * **Main inputs** : A folder containing the assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts).
 * **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
 * *Optional inputs : Gcodes.txt and Conspecific.txt
@@ -91,44 +132,4 @@ Role of each script
 | ----------- | ----------------- |
 | EE_uc_Me03  | Metatranscriptome | 
 | EE_uc_Me04  | Metatranscriptome | 
-| EE_uc_Me05  | Metatranscriptome | 
-
-To process transcriptomes, run:
-
-`python Scripts/wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Universal -d Databases > log.txt`
-
-| Parameter | Type| Options| Description|
-| ----------- | ----------------- |----------- | ----------------- |
-| --first_script |int |1, 2, 3, 4, 5, 6 | First script to run | 
-| --last_script  |int|1, 2, 3, 4, 5, 6, 7 | Last script to run | 
-| --assembled_transcripts  |str|Path to a folder of assembled transcripts, assembled by rnaSPAdes. | Each assembled transcript file name should start with a unique 10 digit code, and end in "_assembledTranscripts.fasta", E.g. Op_me_hsap_assembledTranscripts.fasta | 
-| --databases| str| Path to databases folder | The folder should contain all 3 databases|
-| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
-|--xplate_contam |str|x | Run cross-plate contamination removal (includes all files) | 
-| --genetic_code  |str|A .txt or .tsv with two tab-separated columns, the first with the ten-digit codes and the second column with the corresponding genetics codes| If all of your taxa use the same genetic code, you may enter it here. Alternatively, if you need to use a variety of genetic codes but know which codes to use, you may fill give here the path to a file.  | 
-|--conspecific_names  |str| A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in cross-plate contamination assessment. | 
-| --minlen |int| -| Minimum transcript length | 
-| --maxlen  |int|-| Maximum transcript length | 
-| --seq_count  |int|-| minimum number of sequences after assigning OGs | 
-
- 
-
-* \>log.txt = if added to the end of the command, it will output a log file with progress, warning, or error messages
-* *For running with cross plate contamination removal, add `-x -n Conspecific.txt` to the line of code.
-
-### Processing genomes
-Role of each script
-<img src="https://github.com/Katzlab/PhyloToL-6/blob/main/Other/PTL1_Processing_Genomes_scripts.png" width="100%">
-* **Main inputs** : A folder containing the [CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts).
-* **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
-
-
-| Parameter | Type| Options| Description|
-| ----------- | ----------------- |----------- | ----------------- |
-| --first_script| int |  1, 2, 3, 4 | First script to run |
-| --last_script | int | 2, 3, 4, 5 | First script to run|
-| --cds| str|Path to a folder of nucleotide CDS| Each file name should start with a unique 10 digit code, and end in "_GenBankCDS.fasta", E.g. Op_me_hsap_GenBankCDS.fasta|
-| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
-| --genetic_code| str| Path to a file, Universal | If all of your taxa use the same genetic code, you may enter it here|
-| --databases| str| Path to databases folder | The folder should contain all 3 databases|
-
+| EE_uc_Me05  | Metatranscriptome |
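
The input conventions in the transcriptome parameter table (file names that start with a unique ten-character code and end in `_assembledTranscripts.fasta`, plus a two-column tab-separated genetic code file) can be checked before launching `wrapper.py`. The following is a minimal sketch under stated assumptions, not part of PhyloToL: the helper names `taxon_code`, `check_folder`, and `read_gcode_table` are hypothetical, and the sketch assumes the code is simply the first ten characters of the file name, made up of letters, digits, and underscores.

```python
import csv
import re
from pathlib import Path

# Hypothetical pre-flight helpers; none of these names exist in PhyloToL.
SUFFIX = "_assembledTranscripts.fasta"
# Assumption: the ten-character code uses only letters, digits, and underscores,
# as in the documented example Op_me_hsap.
CODE_RE = re.compile(r"^[A-Za-z0-9_]{10}$")


def taxon_code(filename):
    """Return the leading ten-character code, or None if the name is malformed."""
    if not filename.endswith(SUFFIX):
        return None
    if len(filename) < 10 + len(SUFFIX):
        return None
    code = filename[:10]
    return code if CODE_RE.match(code) else None


def check_folder(folder):
    """Map each code to its file name; raise on a bad name or a repeated code."""
    seen = {}
    for path in sorted(Path(folder).glob("*.fasta")):
        code = taxon_code(path.name)
        if code is None:
            raise ValueError(f"unexpected file name: {path.name}")
        if code in seen:
            raise ValueError(f"code {code} is used by more than one file")
        seen[code] = path.name
    return seen


def read_gcode_table(path):
    """Read a two-column, tab-separated genetic code file into a dict."""
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        return {row[0]: row[1] for row in rows if len(row) >= 2}
```

Running `check_folder("AssembledTranscripts")` and comparing its keys with those returned by `read_gcode_table("Gcode.txt")` would show whether every transcriptome has a genetic code entry before the pipeline starts.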