From d36cc2f156ba81e78b2f9c0cbf345e790a0deaa9 Mon Sep 17 00:00:00 2001
From: Godwin Ani
Date: Mon, 12 Aug 2024 15:49:58 -0400
Subject: [PATCH] Updated PhyloToL Part 1: GF assignment (markdown)

---
 PhyloToL-Part-1:-GF-assignment.md | 83 ++++++++++++++++---------------
 1 file changed, 42 insertions(+), 41 deletions(-)

diff --git a/PhyloToL-Part-1:-GF-assignment.md b/PhyloToL-Part-1:-GF-assignment.md
index ccae041..efc7020 100644
--- a/PhyloToL-Part-1:-GF-assignment.md
+++ b/PhyloToL-Part-1:-GF-assignment.md
@@ -71,6 +71,47 @@ Replacing the PhyloToL Hook DB with a user-defined set of gene families is strai
 
 Role of each script
 
+
+To run PhyloToL part 1 for processing transcriptomes, run:
+`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases > log.txt`
+
+Available parameters are:
+| Parameter | Type| Options| Description|
+| ----------- | ----------------- |----------- | ----------------- |
+| --first_script |int |1, 2, 3, 4, 5, 6 | First script to run |
+| --last_script |int|1, 2, 3, 4, 5, 6, 7 | Last script to run |
+| --assembled_transcripts |str|Path to a folder of assembled transcripts, assembled by rnaSPAdes. | Each assembled transcript file name should start with a unique 10-digit code and end in "_assembledTranscripts.fasta", e.g. Op_me_hsap_assembledTranscripts.fasta |
+| --databases| str| Path to databases folder | The folder should contain all 3 databases|
+| --output|str|Path for the output files | An "Output" folder will be created in this directory to contain all output files. By default this folder will be created in the parent directory of the Scripts folder |
+|--xplate_contam |-|- | Run cross-plate contamination removal (includes all files) |
+| --genetic_code |str|A .txt or .tsv with two tab-separated columns, the first with the 10-digit codes and the second with the corresponding genetic codes| If all of your taxa use the same genetic code, you may enter it here. Alternatively, if you need to use a variety of genetic codes but know which codes to use, you may give the path to a file here. |
+|--conspecific_names |str| A .txt or .tsv file with two tab-separated columns; the first should have 10-digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in cross-plate contamination assessment. |
+| --minlen |int| -| Minimum transcript length |
+| --maxlen |int|-| Maximum transcript length |
+| --seq_count |int|-| Minimum number of sequences after assigning OGs |
+
+
+To run PhyloToL part 1 for both processing transcriptomes and removing sequences that resulted from index switching (cross-plate contamination), run:
+`python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Gcode.txt --databases Databases --xplate_contam --conspecific_names Conspecific.txt > log.txt`
+
+
+### Processing genomes
+Role of each script
+
+* **Main inputs** : A folder containing the [CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts).
+* **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
+
+
+| Parameter | Type| Options| Description|
+| ----------- | ----------------- |----------- | ----------------- |
+| --first_script| int | 1, 2, 3, 4 | First script to run |
+| --last_script | int | 2, 3, 4, 5 | Last script to run|
+| --cds| str|Path to a folder of nucleotide CDS| Each file name should start with a unique 10-digit code and end in "_GenBankCDS.fasta", e.g. Op_me_hsap_GenBankCDS.fasta|
+| --output|str|Path for the output files | An "Output" folder will be created in this directory to contain all output files. By default this folder will be created in the parent directory of the Scripts folder |
+| --genetic_code| str| Path to a file, Universal | If all of your taxa use the same genetic code, you may enter it here|
+| --databases| str| Path to databases folder | The folder should contain all 3 databases|
+
+
 * **Main inputs** : A folder containing the assembled [transcripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts).
 * **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
 * *Optional inputs : Gcodes.txt and Conspecific.txt
@@ -91,44 +132,4 @@ Role of each script
 | ----------- | ----------------- |
 | EE_uc_Me03 | Metatranscriptome |
 | EE_uc_Me04 | Metatranscriptome |
-| EE_uc_Me05 | Metatranscriptome |
-
-To process transcriptomes, run:
-
-`python Scripts/wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts --output . --genetic_code Universal -d Databases > log.txt`
-
-| Parameter | Type| Options| Description|
-| ----------- | ----------------- |----------- | ----------------- |
-| --first_script |int |1, 2, 3, 4, 5, 6 | First script to run |
-| --last_script |int|1, 2, 3, 4, 5, 6, 7 | Last script to run |
-| --assembled_transcripts |str|Path to a folder of assembled transcripts, assembled by rnaSPAdes. | Each assembled transcript file name should start with a unique 10 digit code, and end in "_assembledTranscripts.fasta", E.g. Op_me_hsap_assembledTranscripts.fasta |
-| --databases| str| Path to databases folder | The folder should contain all 3 databases|
-| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
-|--xplate_contam |str|x | Run cross-plate contamination removal (includes all files) |
-| --genetic_code |str|A .txt or .tsv with two tab-separated columns, the first with the ten-digit codes and the second column with the corresponding genetics codes| If all of your taxa use the same genetic code, you may enter it here. Alternatively, if you need to use a variety of genetic codes but know which codes to use, you may fill give here the path to a file. |
-|--conspecific_names |str| A .txt or .tsv file with two tab-separated columns; the first should have 10 digit codes, the second species or other identifying names|This is used to determine which sequences to remove (only between "species") in cross-plate contamination assessment. |
-| --minlen |int| -| Minimum transcript length |
-| --maxlen |int|-| Maximum transcript length |
-| --seq_count |int|-| minimum number of sequences after assigning OGs |
-
-
-
-* \>log.txt = if added to the end of the command, it will output a log file with progress, warning, or error messages
-* *For running with cross plate contamination removal, add `-x -n Conspecific.txt` to the line of code.
-
-### Processing genomes
-Role of each script
-
-* **Main inputs** : A folder containing the [CDS](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/TestData), a folder containing the Databases, and a folder containing the [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts).
-* **Outputs** : ReadyToGo files (AA and NTD) and taxon summary.
-
-
-| Parameter | Type| Options| Description|
-| ----------- | ----------------- |----------- | ----------------- |
-| --first_script| int | 1, 2, 3, 4 | First script to run |
-| --last_script | int | 2, 3, 4, 5 | First script to run|
-| --cds| str|Path to a folder of nucleotide CDS| Each file name should start with a unique 10 digit code, and end in "_GenBankCDS.fasta", E.g. Op_me_hsap_GenBankCDS.fasta|
-| --output|str|Path for the output files | An "Output" folder will be created at this directory to contain all output files. By default this folder will be created at the parent directory of the Scripts folder |
-| --genetic_code| str| Path to a file, Universal | If all of your taxa use the same genetic code, you may enter it here|
-| --databases| str| Path to databases folder | The folder should contain all 3 databases|
-
+| EE_uc_Me05 | Metatranscriptome |
\ No newline at end of file
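
The parameter tables in this patch ask for input files whose names start with a unique 10-character code and end in a fixed suffix, and for a two-column, tab-separated genetic code file. The sketch below shows one way to check a folder listing against that rule before calling `wrapper.py`. It is not part of PhyloToL: the helper names `find_bad_names` and `gcode_table` are hypothetical, the code layout in `CODE_RE` is only inferred from the examples `Op_me_hsap` and `EE_uc_Me03`, and using `Universal` as a value in the second column is an assumption.

```python
import re

# Assumed layout of the 10-character taxon code, inferred from the examples
# "Op_me_hsap" and "EE_uc_Me03"; the PhyloToL scripts may accept other forms.
CODE_RE = re.compile(r"^[A-Za-z]{2}_[A-Za-z]{2}_[A-Za-z0-9]{4}$")


def find_bad_names(filenames, suffix="_assembledTranscripts.fasta"):
    """Return (filename, reason) pairs for names that break the convention."""
    problems = []
    seen = set()
    for name in filenames:
        code = name[:10]  # the code is taken to be the first 10 characters
        if not name.endswith(suffix):
            problems.append((name, "wrong suffix"))
        elif not CODE_RE.match(code):
            problems.append((name, "bad 10-character code"))
        elif code in seen:
            problems.append((name, "duplicate code"))
        else:
            seen.add(code)
    return problems


def gcode_table(filenames, genetic_code="Universal"):
    """Build the text of a two-column, tab-separated genetic code file.

    Every taxon gets the same genetic code here (an assumption made for
    illustration); edit the second column by hand where taxa differ.
    """
    codes = sorted({name[:10] for name in filenames})
    return "".join(f"{code}\t{genetic_code}\n" for code in codes)


if __name__ == "__main__":
    names = [
        "Op_me_hsap_assembledTranscripts.fasta",
        "EE_uc_Me03_assembledTranscripts.fasta",
        "sample1.fasta",
    ]
    for name, reason in find_bad_names(names):
        print(f"{name}: {reason}")
    print(gcode_table(names[:2]), end="")
```

For genome inputs, the same check can be run with `suffix="_GenBankCDS.fasta"`.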