mirror of
http://43.156.76.180:8026/YuuMJ/EukPhylo.git
synced 2025-12-28 04:40:27 +08:00
Fixed grammatical/spelling issues
parent
6bf4cbbb9c
commit
c275086f89
@@ -32,7 +32,7 @@ GTACAATATGCCTTCTTACAGTGATGAAGCTCTAACAGAAGAAAAGGTTGGATGAAAATG
 GCATTATATGGTACGATTGCTGGTTTTGTTGCAGGTACAATCTTTGGATGGAAATTTAGA
 AAATGGGTACAAAAT...
 
-where the numbers after NODE_ are a unique transcript identifier, the and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this
+where the numbers after NODE_ are a unique transcript identifier, and the following numbers representing the length and k-mer coverage, respectively. All assembled transcript files should be put into a folder called "AssembledTranscripts" (folder names are important and must be precise, including capitalization and spacing, here and throughout). Next, download the [Scripts](https://github.com/Katzlab/EukPhylo/blob/main/PTL1/Transcriptomes/Scripts) and [Databases](https://doi.org/10.6084/m9.figshare.26597368) folders (see below), and put these in the same folder as the AssembledTranscripts folder. This location is also where the Output folder containing the output of EukPhylo part 1 will be located, looking something like this
 
 <img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_trans_foldersetup.png" width="30%">
 
@@ -59,7 +59,7 @@ EukPhylo part 1 requires several reference databases used at various steps in th
 
 <img src="https://github.com/Katzlab/EukPhylo/blob/main/Other/PTL1_databases_folder_structure.png" width="30%">
 
-Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix `_rRNAseqs.fasta`). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).
+Inside the db_BvsE folder are two Diamond-formatted reference databases of diverse eukaryotic (eukout.dmnd) and prokaryotic (micout.dmnd) sequences, used for identification of putative contamination (ultimately labeled _P for putative prokaryotic, vs _E for likely eukaryotic). These are just preliminary assignments that help users interpret data on trees, and should be treated as such. The folder also contains a BLAST+ formatted database of rRNA sequences, used for removal of putative rRNA (putative rDNAs are sequestered in a separate file with the suffix `_rRNAseqs.fasta`). The db_StopFreq folder contains one Diamond-formatted reference database of diverse eukaryotic protein sequences, used for identifying putative reading frames in the calculation of in-frame stop codon frequencies for genetic code assignment (i.e for studies of ciliates and other lineages with aberrant codes). The db_OG folder contains the Hook Database, which MUST be provided as BOTH and fasta file and a Diamond-formatted database, and these files should have the same name up to the extension (e.g. Hook-6.6.fasta, Hook-6.6.dmnd).
 
 You can download these databases from the [EukPhylo Figshare page](https://doi.org/10.6084/m9.figshare.26597368). You will have to add the Hook Database to the db_OG folder manually; you can find the Hook Database [here](https://doi.org/10.6084/m9.figshare.26539753.v1). Convert it to a Diamond database and proceed. Alternatively, you can create your own reference database for gene family assignment (described below).
 
@@ -84,7 +84,7 @@ Running EukPhylo Part 1 on transcriptomes requires three items in your main dire
 2. A folder containing your **assembled [transcripts](https://github.com/Katzlab/EukPhylo/tree/main/PTL1/Transcriptomes/TestData)** (as described above)
 3. The **Databases** folder described above
 
-EukPhylo part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run EukPhylo part 1 is with one of the following command:
+EukPhylo part 1 starts with your **assembled transcripts** and produces **ReadyToGo files** (R2G; nucleotide coding regions and inferred amino acid sequences with gene families assigned) for each input sample, and **summary statistics** (e.g. composition, length, coverage) for each sequence processed, as well as aggregated across all sequences for each taxon. This part of the pipeline includes seven scripts which must be run in order. Script 1b (removal of contamination from index switching, 'XPC') is optional (see below), and users may choose to stop after script 4 if they are unsure of correct genetic code assignment. Otherwise, users are recommended to run their transcripts through scripts 1 to 7 in a single run. The simplest way to run EukPhylo part 1 is with one of the following commands:
 
 On a grid
 `python Scripts/wrapper.py --first_script 1 --last_script 7 --assembled_transcripts AssembledTranscripts --genetic_code Gcode.txt --databases Databases > log.txt`
@@ -111,7 +111,7 @@ Other available parameters are:
 | --seq_count |integer|any positive integer| Minimum number of sequences after assigning GFs |
 
 ### Index Switching (Cross plate contamination)
-As you run EukPhylo part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell EukPhylo which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifer and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.
+As you run EukPhylo part 1 on transcriptomes, you might want to remove sequences from your assembled transcripts that are a result of index switching. This is done by clustering all of your input assembled transcripts with Vsearch at a nucleotide identity of 99%. Sequences with less than one-tenth the k-mer coverage of the highest covered sequence in the cluster are removed, as long as both sequences are not 'conspecific' (usually, this means from the same species or genus). You can tell EukPhylo which of your taxa are conspecific by inputting a text file to the --conspecific_names argument with two tab-separated columns; the first column should be a ten-digit sample identifier and the second column a group (e.g., species, genus) identifier; samples with the same group identifier are taken to be conspecific.
 
 To run this step, you will need to add the '--xplate_contam' flag to the command line as follows:
 
@@ -147,7 +147,7 @@ The "ReadyToGo" folder contains the final cleaned output of EukPhylo part 1. The
 
 EukPhylo part 1 also provides intermediate files used in producing the above finalized outputs. Of greatest interest to most users here are likely to be the files in the `Intermediate/TranslatedTranscriptomes` folder. Most importantly, users can find a record of all Diamond hits against the Hook Database (not filtered to keep only the best hits) in the file `DiamondOG/allOGresults.tsv`. This is useful in trying to assess alternative gene family assignments. See the headers of each EukPhylo part 1 scripts for a description of the individual intermediate outputs.
 
-The output of PhylToL part 1 when run on **genomic data** is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no `TranslatedTranscriptomes` folder (key intermediate files are given directly in the `Intermediate` folder). Again, see the headers of each EukPhylo part 1 scripts for a more detailed description of the individual intermediate outputs.
+The output of PhyloToL part 1 when run on **genomic data** is very similar, if a bit simpler. The key files (ReadyToGo and all Diamond hits against the Hook) are located in the same places, except there is no `TranslatedTranscriptomes` folder (key intermediate files are given directly in the `Intermediate` folder). Again, see the headers of each EukPhylo part 1 scripts for a more detailed description of the individual intermediate outputs.
 
 
 
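Note on the Index Switching paragraph touched by this commit: the filtering rule it describes (within a 99% identity cluster, drop sequences with less than one-tenth the k-mer coverage of the top sequence unless they are conspecific with it) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in the EukPhylo scripts. The function names, the SPAdes-style header pattern `NODE_<id>_length_<len>_cov_<coverage>`, the representation of a cluster as `(sample_id, header)` pairs, and the choice to treat a sample missing from the conspecific file as its own group are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed header layout (SPAdes-style): NODE_<id>_length_<len>_cov_<coverage>
COV_RE = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def kmer_coverage(header: str) -> float:
    """Extract the k-mer coverage value from an assumed SPAdes-style header."""
    match = COV_RE.search(header)
    if match is None:
        raise ValueError(f"no coverage field found in header: {header!r}")
    return float(match.group(1))


def read_conspecific(lines: Iterable[str]) -> Dict[str, str]:
    """Parse two tab-separated columns: sample identifier, group identifier."""
    groups: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected two tab-separated columns: {line!r}")
        groups[fields[0]] = fields[1]
    return groups


def flag_index_switching(
    cluster: List[Tuple[str, str]],
    groups: Dict[str, str],
    ratio: float = 0.1,
) -> List[str]:
    """Return headers in one cluster that fall under the coverage threshold.

    cluster: (sample_id, header) pairs already grouped at 99% identity.
    A sequence is flagged when its coverage is below ratio * top coverage
    and its sample is not in the same group as the top-coverage sample.
    """
    top_sample, top_header = max(cluster, key=lambda m: kmer_coverage(m[1]))
    top_cov = kmer_coverage(top_header)
    # Assumption: a sample absent from the file forms its own group.
    top_group = groups.get(top_sample, top_sample)

    flagged = []
    for sample, header in cluster:
        conspecific = groups.get(sample, sample) == top_group
        if not conspecific and kmer_coverage(header) < ratio * top_cov:
            flagged.append(header)
    return flagged
```

With a conspecific table mapping `Sr_ci_Aaaa` and `Sr_ci_Aaab` to the same group, a low-coverage sequence from `Sr_ci_Aaab` is kept when `Sr_ci_Aaaa` holds the top-coverage sequence of the cluster, while an equally low-coverage sequence from an unrelated sample is flagged.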