Updated QuickStart (markdown)

2025-12-29 02:21:56 +08:00 · 2025-01-24 16:23:42 +01:00 · 2025-01-24 16:23:42 +01:00 · 6d0f4d86a6
commit 6d0f4d86a6
parent 94c7bad9e5
1 changed files with 64 additions and 19 deletions
--- a/QuickStart-EukPhylo.md
+++ b/QuickStart-EukPhylo.md
@@ -1,3 +1,19 @@
 # General Steps
 EukPhylo is composed of two parts that can be run independently: Part 1 needs to be run only once, to assign gene families; Part 2 builds MSAs and trees, performs contamination removal, and concatenates alignments.
 It is preferable to run Part 2 on the outputs of Part 1, but this is not required as long as the user's files are in the same format: one file per species, with each sequence name starting with the 10-digit code and ending with OGx_xxxxxx. See the extended version of the wiki for details.
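
The naming requirement above can be checked before starting Part 2. The following is a minimal sketch, not part of EukPhylo: the function names are hypothetical, and the pattern assumes that the code is the first 10 non-whitespace characters of the name and that the suffix is `OG`, one alphanumeric character, an underscore, and six digits. Adjust the pattern if your names differ.

```python
import re
from typing import List

# Assumption: the taxon code is the first 10 non-whitespace characters, and the
# gene family suffix is "OG", one alphanumeric character, "_", and six digits.
NAME_PATTERN = re.compile(r"\S{10}\S*OG[0-9A-Za-z]_\d{6}")


def is_valid_name(name: str) -> bool:
    """Return True if a sequence name matches the assumed naming format."""
    return NAME_PATTERN.fullmatch(name) is not None


def invalid_names(fasta_text: str) -> List[str]:
    """Collect FASTA header names (without '>') that do not match the format."""
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if not is_valid_name(name):
                bad.append(name)
    return bad
```

Running `invalid_names` on the contents of each per-species file returns the headers that would need renaming; an empty list means every header matches the assumed format.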
 1. Install EukPhylo
 2. Run EukPhylo part 1
   a. with the Hook database or with custom database
   b. with Assembled Transcripts or Assembled Genomes
   c. modularity of options
 3. Run EukPhylo part 2
   a. basic running
   b. modularity of options
   c. contamination removal
   d. choosing orthologs and concatenation
 # Installing EukPhylo
 Scripts can be used as downloaded from [GitHub](https://github.com/Katzlab/EukPhylo) and should work on any platform.
 Dependencies & third party tools, along with the versions that we use at the Katz lab
@@ -26,12 +42,12 @@ EukPhylo part 1 runs CDS or assembled transcripts through several scripts in ord
 * * db_BvsE (how we ID likely-bacterial sequences)
 * * db_StopFreq (for stop codon assignment)
 * * db_OG
-* * * Hook *.dmnd file ([Current version Hook-6.6.dmnd](https://drive.google.com/open?id=1ywYLZXzcTERDFCysz5vPbI9u6WRxz5r0&usp=drive_copy))
+* * * Hook *.dmnd file ([Current version Hook-6.6.dmnd])
-* * * Hook *.fasta file ([Current version Hook-6.6.fasta](https://drive.google.com/open?id=1AN4_SmZUYFH6_xh2qOhyNUlFZ_NT9_-D&usp=drive_copy)) 
+* * * Hook *.fasta file ([Current version Hook-6.6.fasta]) 
 * A folder called “Scripts” filled with scripts from [here](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Transcriptomes/Scripts) on Github
 ### Running:
-python wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts -o . --genetic_code Universal -d Databases > log.txt
+python wrapper.py -1 1 -2 7 --assembled_transcripts AssembledTranscripts -o Output_Folder --genetic_code Universal -d Databases > log.txt
 Details of each possible option:
 * -1 = start script
@@ -46,6 +62,11 @@ Here add detail of each option possible:
 * ReadyToGo = AA, NTD
 * Sequences summary
 ### Modularity of options and replacing the Hook database
 EukPhylo part 1 for transcriptomes is composed of 7 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options.
 If users choose to use their own gene family database, they need to replace the Hook.fasta file in the Database folder and build a DIAMOND version of their own file; the only requirement is that the sequence names follow the same naming system.
 To enable the XPC option, which is only available for transcriptomes, users need to add -x to their command line.
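
As an illustration of how the -1, -2 and -x options combine, the sketch below assembles the transcriptome command from its parts. It is only a convenience wrapper and not part of EukPhylo: the function and parameter names are hypothetical, and it assumes that -2 takes the number of the last script to run, as the example command above suggests.

```python
import shlex


def part1_command(start, stop, transcripts_dir, output_dir, databases_dir,
                  genetic_code="Universal", xpc=False):
    """Assemble the wrapper.py argument list for a transcriptome run."""
    cmd = [
        "python", "wrapper.py",
        "-1", str(start),   # first script to run
        "-2", str(stop),    # assumed: last script to run
        "--assembled_transcripts", transcripts_dir,
        "-o", output_dir,
        "--genetic_code", genetic_code,
        "-d", databases_dir,
    ]
    if xpc:
        cmd.append("-x")    # XPC option, transcriptomes only
    return cmd


# Full run (scripts 1 to 7), printed as a shell command line.
cmd = part1_command(1, 7, "AssembledTranscripts", "Output_Folder", "Databases")
print(shlex.join(cmd))
```

Changing `start` and `stop` reruns only part of the pipeline, and `xpc=True` appends `-x`. The resulting list can be passed to `subprocess.run(cmd)` from the folder that contains wrapper.py.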
 ## Genomes:
 ### Set Up:
@@ -59,7 +80,7 @@ Here add detail of each option possible:
 * A folder called “Scripts” filled with the 10 scripts from [here](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL1/Genomes/Scripts) on Github. To run locally, pull out all scripts into main folder 
 ### Running:
-python wrapper.py -1 1 -2 5 --cds CDS -o . --genetic_code Gcodes.txt -d Databases > log.txt
+python wrapper.py -1 1 -2 5 --cds CDS -o Output_Folder --genetic_code Gcodes.txt -d Databases > log.txt
 Details of each possible option:
 * -1 = start script
@@ -74,11 +95,16 @@ Here add detail of each options possible:
 ReadyToGo = AA, NTD
 Sequences summary
 ### Modularity of options and replacing the Hook database
 EukPhylo part 1 for genomes is composed of 5 scripts. Users can choose to start or stop at any step by changing the -1 and/or -2 options.
 If users choose to use their own gene family database, they need to replace the Hook.fasta file in the Database folder and build a DIAMOND version of their own file; the only requirement is that the sequence names follow the same naming system.
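
Building the DIAMOND version of a custom gene family file can be done with `diamond makedb`. The sketch below only assembles that command; the helper name and the example file path are hypothetical, and where the resulting file must be placed should be checked against your own Databases folder.

```python
from pathlib import Path


def makedb_command(fasta_path):
    """Return the `diamond makedb` argument list for a custom gene family FASTA."""
    fasta = Path(fasta_path)
    # DIAMOND appends ".dmnd" to the database name, so drop the FASTA extension.
    db_prefix = fasta.with_suffix("")
    return ["diamond", "makedb", "--in", fasta.as_posix(), "--db", db_prefix.as_posix()]


# Hypothetical custom database file; replace with your own path.
print(" ".join(makedb_command("Databases/db_OG/MyFamilies.fasta")))
```

Passing the returned list to `subprocess.run` would write `MyFamilies.dmnd` next to the FASTA file, provided DIAMOND is installed and on the PATH.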
 # EukPhylo part 2 = MSA - Trees - Contamination Removal - Concatenation
 ## Set Up:
-* A folder callled “Scripts” containing all the [scripts from Github](https://github.com/Katzlab/EukPhylo/tree/main/PTL2/Scripts)
+* A folder called “Scripts” containing all the [scripts from Github](https://github.com/Katzlab/EukPhylo/tree/main/PTL2/Scripts)
 * Inside the Scripts folder, you also need to add the trimal-trimAl and guidance.v2.02 folders, as downloaded from [here](http://trimal.cgenomics.org/downloads) and [here](https://github.com/anzaika/guidance)
 * An empty output folder named as you wish for all output files (which will include trees and guidance files when done running), for example: Output_folder
 * A folder called “OutgroupR2Gs” containing the amino acid (AA) ReadyToGo fasta files for your target and outgroup taxa listed in your taxon_list.txt
@@ -87,8 +113,10 @@ Sequences summary
 ## Running
 ### Basic running for building MSAs and Trees
 python3 Scripts/phylotol.py --start raw --end trees --gf_list  listofOGs.txt --taxon_list taxon_list.txt --data OutgroupR2Gs --output Output_folder  > Output1.out
 For information on each of the possible input parameters, read below and run “python phylotol.py --help”
@@ -99,17 +127,34 @@ For information on each of the possible input parameters, read below and run “
 * '--taxon_list', default = None, help = 'Path to the file with the taxa (10-digit codes) to include in the output.')
 * '--data', help = 'Path to the input dataset. The format of this varies depending on your --start parameter. If you are running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with the sequence names matching exactly the tip names).')
 * '--output', default = '../', help = 'Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.')
 ### Modularity of options
 Here is a list of options and parameters that users can add to the basic command line:
 * '--force', action = 'store_true', help = 'Overwrite all existing files in the "Output" folder.')
-* '--tree_method', default = 'iqtree', choices = {'iqtree', 'raxml', 'all'}, help = 'Program to use for tree-building')
+* Changing tree building program 
-* '--blacklist', type = str, help = 'A text file with a list of sequence names not to consider')
+* * '--tree_method', default = 'iqtree', choices = {'iqtree', 'raxml', 'all'}, help = 'Program to use for tree-building')
-* '--og_identifier', default = 'OG', choices = {'OG','OG6','OGA','OGG'}, help = 'Program to use for selecting seq by GC width')
+
-* '--sim_taxa', default = None, help = 'Path to the file with the taxa (10-digit codes) to apply the similarity filter on.')
+* Changing parameters for blast and Guidance
-* '--blast_cutoff', default = 1e-20, type = float, help = 'Blast e-value cutoff')
+* * '--blast_cutoff', default = 1e-20, type = float, help = 'Blast e-value cutoff')
-* '--len_cutoff', default = 10, type = int, help = 'Amino acid length cutoff for removal of very short sequences after column removal in Guidance.')
+* * '--len_cutoff', default = 10, type = int, help = 'Amino acid length cutoff for removal of very short sequences after column removal in Guidance.')
-* '--similarity_filter', action = 'store_true', help = 'Run the similarity filter in pre-Guidance')
+* * '--guidance_iters', default = 5, type = int, help = 'Number of Guidance iterations for sequence removal')
-* '--sim_cutoff', default = 1, type = float, help = 'Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff')
+* * '--seq_cutoff', default = 0.3, type = float, help = 'During guidance, taxa are removed if their score is below this cutoff')
-* '--guidance_iters', default = 5, type = int, help = 'Number of Guidance iterations for sequence removal')
+* * '--col_cutoff', default = 0.0, type = float, help = 'During guidance, columns are removed if their score is below this cutoff')
-* '--seq_cutoff', default = 0.3, type = float, help = 'During guidance, taxa are removed if their score is below this cutoff')
+* * '--res_cutoff', default = 0.0, type = float, help = 'During guidance, residues are removed if their score is below this cutoff')
-* '--col_cutoff', default = 0.0, type = float, help = 'During guidance, columns are removed if their score is below this cutoff')
+* * '--guidance_threads', default = 20, type = int, help = 'Number of threads to allocate to Guidance')
-* '--res_cutoff', default = 0.0, type = float, help = 'During guidance, residues are removed if their score is below this cutoff')
+
-* '--guidance_threads', default = 20, type = int, help = 'Number of threads to allocate to Guidance')
+* Enabling the reduction of too similar sequences
 * * Necessary options:
 * * * '--similarity_filter', action = 'store_true', help = 'Run the similarity filter in pre-Guidance')
 * * * '--sim_cutoff', default = 1, type = float, help = 'Sequences from the same taxa that are assigned to the same OG are removed if they are more similar than this cutoff')
 * * Optional: apply the filter to only some taxa
 * * * '--sim_taxa', default = None, help = 'Path to the file with the taxa (10-digit codes) to apply the similarity filter on.')
 * Removing sequences at the start of EukPhylo part 2 (a user-supplied list of sequences to remove)
 * * '--blacklist', type = str, help = 'A text file with a list of sequence names not to consider')
 * Removing sequences based on GC composition
 * * This first requires tagging the sequences as OGA, OGG, or OG6 with the utility script OGidentifiers.py
 * * '--og_identifier', default = 'OG', choices = {'OG','OG6','OGA','OGG'}, help = 'Program to use for selecting seq by GC width')
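
To show how the optional parameters attach to the basic Part 2 command, here is a minimal sketch that adds the similarity filter and a blacklist. The helper and its parameter names are hypothetical, the file names and the 0.98 cutoff are placeholders, and only flags listed above are used.

```python
def part2_command(gf_list, taxon_list, data_dir, output_dir,
                  start="raw", end="trees",
                  sim_cutoff=None, sim_taxa=None, blacklist=None):
    """Assemble the phylotol.py argument list with optional filters."""
    cmd = [
        "python3", "Scripts/phylotol.py",
        "--start", start, "--end", end,
        "--gf_list", gf_list,
        "--taxon_list", taxon_list,
        "--data", data_dir,
        "--output", output_dir,
    ]
    if sim_cutoff is not None:
        # --similarity_filter and --sim_cutoff are listed together as necessary options.
        cmd += ["--similarity_filter", "--sim_cutoff", str(sim_cutoff)]
        if sim_taxa:
            cmd += ["--sim_taxa", sim_taxa]
    elif sim_taxa:
        raise ValueError("sim_taxa only makes sense together with sim_cutoff")
    if blacklist:
        cmd += ["--blacklist", blacklist]
    return cmd


# Placeholder values: similarity cutoff of 0.98 applied to the taxa in sim_taxa.txt.
cmd = part2_command("listofOGs.txt", "taxon_list.txt", "OutgroupR2Gs", "Output_folder",
                    sim_cutoff=0.98, sim_taxa="sim_taxa.txt", blacklist="blacklist.txt")
print(" ".join(cmd))
```

Leaving `sim_cutoff`, `sim_taxa` and `blacklist` unset reproduces the basic command for building MSAs and trees.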