Updated PhyloToL Part 2: MSAs, trees, and contamination loop (markdown)

2025-12-29 01:00:24 +08:00 · 2024-08-14 09:55:46 -04:00 · 2024-08-14 09:55:46 -04:00 · 8ce5eab1f1
commit 8ce5eab1f1
parent 3b141e0380
1 changed file with 4 additions and 4 deletions
--- a/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
+++ b/PhyloToL-Part-2:-MSAs,-trees,-and-contamination-loop.md
@@ -28,7 +28,7 @@ Running PhyloToL Part 2 requires at least four items in your main directory (see
 A folder named [Scripts](https://github.com/Katzlab/PhyloToL-6/tree/main/PTL2/Scripts) containing all scripts from PhyloToL part 2, 2) a folder containing your input files, 3) a taxon list and 4) an OG list. Given that PhyloToL part 2 is highly modular and flexible, you will want to be sure at which point in the PTLp2 process you wish to start and end (**TBD - point to figshare file here -- some figure). The default script starts with raw data and produces trees using scripts 1 - TBD. 

 # Databases
-For those users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (**TBD - point to figshare file here -- table of names). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 (**TBD - point to figshare file here -- R2Gs). This means that using this dataset, you can jump right in to running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above). 
+For those users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see [Datasets S1 and S6](https://figshare.com/account/projects/196552/articles/26540599?file=48494653)). This database is in the form of "ReadyToGo" files, the output of PhyloToL part 1 ([Dataset S16](https://figshare.com/account/projects/196552/articles/25336129?file=48356116)). This means that using this dataset, you can jump right in to running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above). 

 # Running PhyloToL Part 2

@@ -71,7 +71,7 @@ The pre-guidance step of PhyloToL part 2 takes ReadyToGo files (one file per sam

 ### Filtering by GC content
 
-Using a utility script (GC_identifier.py), prior to running PhyloToL part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on whether the sequence GC content falls below (OGA), above (OGG), or within (OG6) the user-specified GC range. If the user wishes to include only one category of sequences when running PhyloToL part 2, the user should provide the relabeled ReadyToGo files. The parameter for this when running pre-guidance is `--og_identifier`, and the options are 'OG', 'OG6', 'OGA', and 'OGG'; the default is 'OG', which passes all sequences to Guidance without filtering.
+Using a [utility script](https://github.com/Katzlab/PhyloToL-6/tree/main/Utilities) (GC_identifier.py), prior to running PhyloToL part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on whether the sequence GC content falls below (OGA), above (OGG), or within (OG6) the user-specified GC range. If the user wishes to include only one category of sequences when running PhyloToL part 2, the user should provide the relabeled ReadyToGo files. The parameter for this when running pre-guidance is `--og_identifier`, and the options are 'OG', 'OG6', 'OGA', and 'OGG'; the default is 'OG', which passes all sequences to Guidance without filtering.
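
The relabeling rule described above can be illustrated with a minimal Python sketch. This is not the code of GC_identifier.py; the function names, the example sequence identifier, and the GC thresholds are hypothetical, and the sketch assumes nucleotide sequences whose identifiers end in a ten-character gene family identifier such as `OG6_100001`.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G/C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def relabel_by_gc(seq_id: str, seq: str, low: float, high: float) -> str:
    """Swap the 'OG6' prefix of the trailing gene family identifier.

    Assumption: the last ten characters of seq_id look like 'OG6_100001'.
    Identifiers that do not match are returned unchanged.
    """
    head, gf = seq_id[:-10], seq_id[-10:]
    if not gf.startswith("OG6_"):
        return seq_id
    gc = gc_fraction(seq)
    if gc < low:
        tag = "OGA"   # GC content below the user-defined range
    elif gc > high:
        tag = "OGG"   # GC content above the user-defined range
    else:
        tag = "OG6"   # GC content within the range
    return head + tag + gf[3:]


# Hypothetical identifier and thresholds, for illustration only.
name = "Sr_ci_Aaaa_Contig1_OG6_100001"
print(relabel_by_gc(name, "ATATATATAT", low=0.3, high=0.6))
print(relabel_by_gc(name, "GCGCGCGCGC", low=0.3, high=0.6))
```

Under these assumptions, a run restricted to one category would then use the relabeled files together with the matching `--og_identifier` value.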

 Adding these options to the command line will give:

@@ -99,7 +99,7 @@ Adding these options to the command line will give:
 ### Other optional parameters:
 #### Blacklist

-The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance to avoid reconsidering these non-homologs in future runs, which can save compute time if you are going to use the same ReadyToGo files and GFs in multiple runs of PhyloToL part 2. For our study, we chose to include only sequences removed by Guidance in our blacklist, but users should choose what fits best for their study and their data. To include this parameter in your PhyloToL run, you will need to add the `--blacklist` flag to the command line as follows:
+The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance to avoid reconsidering these non-homologs in future runs, which can save compute time if you are going to use the same ReadyToGo files and GFs in multiple runs of PhyloToL part 2. For our study, we chose to include only sequences removed by Guidance in our [blacklist](https://figshare.com/account/projects/196552/articles/26539618?file=48354880), but users should choose what fits best for their study and their data. To include this parameter in your PhyloToL run, you will need to add the `--blacklist` flag to the command line as follows:

 `python Scripts/phylotol.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt`
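
To make the effect of a blacklist concrete, here is a minimal Python sketch of dropping blacklisted sequences from FASTA text. It does not reproduce how PhyloToL reads `Blacklist.txt`; the one-name-per-line file format and both function names are assumptions made for illustration.

```python
def parse_blacklist(text: str) -> set[str]:
    """Assumed format: one sequence name per line; blank lines are ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_blacklisted(fasta_text: str, blacklist: set[str]) -> str:
    """Return FASTA text without records whose name is in the blacklist."""
    kept = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = name not in blacklist
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical sequence names, for illustration only.
fasta = ">seqA\nACGT\n>seqB\nGGCC\n>seqC\nTT\n"
print(drop_blacklisted(fasta, parse_blacklist("seqB\n")))
```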

@@ -142,7 +142,7 @@ The CL runs iteratively and users must set the number of times that rules should

 The CL requires 1) a folder of alignments (ideally not gap-trimmed alignments) and 2) a folder of gene trees, and they should be formatted in the same way as output by the preceding steps of PhyloToL part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files should begin with the same unique GF identifier. You can also give it data _not_ output by PhyloToL, but you will need to match the folder, file, and sequence name formats.

-You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in [Datasets S4-S8 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
+You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in [Datasets S8-S12 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
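
As a rough illustration of the three-column format used by the sisters- and subsisters-modes (described in the next paragraph), the following Python sketch parses tab-separated rules and checks one rule against one sequence. It is not the contamination loop's own code: the class, the function names, the taxon codes, and the prefix matching of taxon codes against sequence names are all assumptions.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    target: str               # taxon code whose sequences may be removed
    sister: str               # taxon code of the sister sequence
    factor: Optional[float]   # X from column three; None when it is "NA"


def parse_rules(text: str) -> list[Rule]:
    """Parse tab-separated rules with three columns per row."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        target, sister, factor = line.split("\t")
        factor = factor.strip()
        rules.append(Rule(target, sister, None if factor == "NA" else float(factor)))
    return rules


def rule_applies(rule: Rule, seq_name: str, sister_name: str,
                 branch_length: float, mean_branch_length: float) -> bool:
    """Assumed matching: a taxon code matches as a prefix of the sequence name."""
    if not seq_name.startswith(rule.target):
        return False
    if not sister_name.startswith(rule.sister):
        return False
    if rule.factor is None:
        return True  # "NA": no branch-length restriction
    return branch_length < rule.factor * mean_branch_length


# Hypothetical taxon codes and branch lengths, for illustration only.
rules = parse_rules("Sr_ci_Aaaa\tBa\t0.5\nSr_ci_Aaaa\tAm\tNA\n")
print(rule_applies(rules[0], "Sr_ci_Aaaa_Contig1_OG6_100001",
                   "Ba_pa_Bbbb_Contig7_OG6_100001", 0.1, 0.4))
```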

 In the sisters- and subsisters-modes, your rules file should include three columns. Each row represents a rule, for which a sequence from a taxon (identified by a ten-digit code or shorter code in the first column) will be removed if it is sister to a sequence from the taxon in the second column and on a branch that is shorter than X times the average branch length in the tree, where X is the number in the third column. Set the third column to “NA” if you do not want any branch length restriction for the rule. For example, the line