Updated EukPhylo Part 2: MSAs, trees, and contamination loop (markdown)

Auden Cote-L'Heureux 2025-03-28 15:49:54 -04:00
parent a3fe134577
commit 936b984cbe

@ -158,6 +158,8 @@ The CL requires 1) a folder of alignments (it is most correct to not give gap tr
You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in [Datasets S8-S12 on this project's Figshare](https://doi.org/10.6084/m9.figshare.26540599).
##### Sisters/subsisters mode
In the sisters and subsisters modes, your rules file should include three columns. Each row represents a rule: a sequence from a taxon (identified by a ten-digit code or a shorter code in the first column) will be removed if it is sister to a sequence from the taxon in the second column and sits on a branch shorter than X times the average branch length in the tree, where X is the number in the third column. Set the third column to “NA” if you do not want a branch-length restriction for the rule. For example, the line
|Op_ch_Dgra | Sr_di | 0.1|
@ -168,6 +170,8 @@ indicates that the a sequence from the choanoflagellate Op_ch_Dgra should be rem
|Sr_ci_Fsal | Pl_gr |
|-|-|
##### Clade-grabbing mode
In clade-grabbing mode, each row again represents a rule, this time with five columns. The first column gives the target taxonomic group for which you are clade grabbing; here you can give a ten-digit code, a subset of a code, or the path to a text file listing multiple codes if they don't all share a precise enough prefix. The second column gives the minimum proportion (or absolute number, if >1) of taxa in that clade that are not in the target group, and the third column gives the minimum number of target taxa that must be in a clade for it to be kept. The fourth column allows you to give a list of 'special' taxa (or just a ten-digit code or a subset of a code), X of which must be present in a clade for it to be selected, where X is the number in the fifth column. For example, the line
|Sr_ci | 0.1 | 13 | ciliate_genomes.txt | 1|
@ -179,17 +183,43 @@ Having well-organized ten-digit-codes for sample identification is vital for run
### Running
To run the CL, use a command structure similar to the one described for running EukPhylo part 2 above, and add the `--contamination_loop` parameter to activate the contamination loop and specify a mode and the path to a rules file. The two modes require different inputs and cannot be run at the same time; to use both, run one mode, then give the output files of that run as input to the other. In either mode, you'll want to use the following two arguments:
| Parameter | Required | Options | Description | Default |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| --contamination_loop | yes | seq, clade | The mode in which to run the CL | none |
| --nloops | no | any positive integer | Number of iterations | 5 |
You can also change the tree-building method using the `--cl_tree_method` argument (`iqtree`, `iqtree_fast`, or `fasttree`). You can choose whether to run Guidance between each iteration, or just MAFFT, by using the `--cl_alignment_method` argument (`mafft_only` or `guidance`). Here are some mode-specific arguments:
#### Sisters/subsisters mode
Below are the parameters for sisters/subsisters mode. You must give a rules file to `--sister_rules`, `--subsister_rules`, or both.
| Parameter | Required | Options | Description | Default |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| --sister_rules | only if filtering for sisters | Any valid path | Path to a text file containing 'sisters rules' | none |
| --subsister_rules | only if filtering for subsisters | Any valid path | Path to a text file containing 'subsisters rules' | none |
Here's an example command for filtering by sisters relationships using the contamination loop:
`python eukphylo.py --start trees --end trees --data <path to folder of input data> --output <path to output folder> --contamination_loop seq --sister_rules sister_rules.txt`
In this case, the file `sister_rules.txt` should contain the rules for sisters-based removal. The format of this file is described above.
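To make the three-column sisters rule format concrete, here is a minimal Python sketch of how such a file could be read and how a single rule could be checked. It is not EukPhylo's own code: the names `SisterRule`, `parse_sister_rules`, and `should_remove` are hypothetical, the sequence names in the demo are made up, and matching taxa by name prefix is an assumption based on the rules accepting "a ten-digit code or a shorter code".

```python
import csv
import io
from typing import NamedTuple, Optional


class SisterRule(NamedTuple):
    target: str                 # code (or prefix) of the taxon whose sequences may be removed
    sister: str                 # code (or prefix) of the taxon it must be sister to
    max_ratio: Optional[float]  # X in "shorter than X times the average"; None when "NA"


def parse_sister_rules(handle):
    """Read tab-separated, three-column rules from an open text handle."""
    rules = []
    for row in csv.reader(handle, delimiter="\t"):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"expected three columns, got: {row!r}")
        target, sister, ratio = fields[:3]
        max_ratio = None if ratio.upper() == "NA" else float(ratio)
        rules.append(SisterRule(target, sister, max_ratio))
    return rules


def should_remove(rule, seq_name, sister_name, branch_length, mean_branch_length):
    """Apply one rule to one sequence and its sister (prefix matching is an assumption)."""
    if not seq_name.startswith(rule.target):
        return False
    if not sister_name.startswith(rule.sister):
        return False
    if rule.max_ratio is None:
        return True  # "NA": no branch-length restriction
    return branch_length < rule.max_ratio * mean_branch_length


# Demo with an in-memory file; the second rule's "NA" is illustrative.
demo = io.StringIO("Op_ch_Dgra\tSr_di\t0.1\nSr_ci_Fsal\tPl_gr\tNA\n")
rules = parse_sister_rules(demo)
print(should_remove(rules[0], "Op_ch_Dgra_seq1", "Sr_di_Xxxx_seq9", 0.02, 0.5))
```

A check like this can be handy for validating a rules file before starting a long run, since a malformed row is reported immediately rather than partway through the loop.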
#### Clade-grabbing mode
| Parameter | Required | Options | Description | Default |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| --clade_grabbing_rules | yes | Any valid path | Path to a text file containing 'clade-grabbing rules' | none |
| --clade_grabbing_exceptions | no | Any valid path | List of taxa to _not_ remove for any reason | none |
Here's an example command for clade grabbing using the contamination loop:
`python eukphylo.py --start trees --end trees --data <path to folder of input data> --output <path to output folder> --contamination_loop clade --clade_grabbing_rules clade_rules.txt`
In this case, the file `clade_rules.txt` should contain the rules for clade-based removal. The format of this file is described above.
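As a rough illustration of the five-column clade-grabbing format, the sketch below parses one rule line into named fields. It is only a sketch under stated assumptions, not the parser EukPhylo uses: `CladeRule` and `parse_clade_rules` are hypothetical names, and treating a second-column value of 1 or less as a proportion (and anything larger as an absolute count) is one reading of the format description.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CladeRule:
    target: str                 # column 1: code, code prefix, or path to a list of codes
    nontarget_threshold: float  # column 2: proportion, or absolute number if > 1
    min_target_taxa: int        # column 3: target taxa required in a clade
    special_taxa: str           # column 4: code, code prefix, or path to a list of codes
    min_special: int            # column 5: how many 'special' taxa must be present

    @property
    def threshold_is_proportion(self):
        # Assumption: values up to 1 are proportions, larger values are counts.
        return self.nontarget_threshold <= 1


def parse_clade_rules(handle):
    """Read tab-separated, five-column rules from an open text handle."""
    rules = []
    for row in csv.reader(handle, delimiter="\t"):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) < 5:
            raise ValueError(f"expected five columns, got: {row!r}")
        target, threshold, min_target, special, min_special = fields[:5]
        rules.append(
            CladeRule(target, float(threshold), int(min_target), special, int(min_special))
        )
    return rules


# Demo using the example rule from the format description.
demo = io.StringIO("Sr_ci\t0.1\t13\tciliate_genomes.txt\t1\n")
for rule in parse_clade_rules(demo):
    print(rule)
```

Converting the numeric columns up front means a typo such as a non-numeric third column fails with a clear error before any trees are processed.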
## Ortholog selection and concatenation
EukPhylo includes an optional step, which can be run after the tree-building stage (or by using `--start trees` and passing to the `--data` argument a folder of trees and corresponding aligned or unaligned sequence files), to select orthologs (at most one sequence per taxon from each GF) and build a concatenated alignment. EukPhylo first identifies, for each taxon, the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for Metazoa); alternatively, a user can select orthologs for only a target group of taxa using the `--concat_target_taxa` argument, giving it a file with a list of ten-digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that sequence is chosen for concatenation; otherwise, each sequence is given a score equal to its length times its k-mer coverage for transcriptomic data, or just its length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, the space is filled with gaps in the concatenated alignment. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" containing all the selected orthologs for each GF, aligned and unaligned.
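The tie-breaking score described above can be sketched in a few lines of Python. This is an illustration rather than EukPhylo's implementation: `Candidate`, `score`, and `pick_ortholog` are hypothetical names, using `None` coverage to mark genomic data is a convention chosen for this example, and measuring length without gap characters is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    name: str
    sequence: str
    kmer_coverage: Optional[float] = None  # None marks genomic data in this sketch


def score(candidate):
    """Length times k-mer coverage (transcriptomic) or just length (genomic)."""
    # Assumption: length is counted without alignment gap characters.
    length = len(candidate.sequence.replace("-", ""))
    if candidate.kmer_coverage is None:
        return float(length)
    return length * candidate.kmer_coverage


def pick_ortholog(candidates: Sequence[Candidate]):
    """Return the highest-scoring candidate from the chosen clade, or None if empty."""
    if not candidates:
        return None  # the GF is absent for this taxon; its slot would be gap-filled
    return max(candidates, key=score)


# Demo: two transcriptomic sequences from the same taxon in the largest clade.
in_clade = [
    Candidate("seq_a", "ACGTACGT", kmer_coverage=2.0),
    Candidate("seq_b", "ACGTACGTACGT", kmer_coverage=1.0),
]
print(pick_ortholog(in_clade).name)
```

Here the shorter sequence wins because its higher coverage outweighs the length difference; with genomic data (no coverage), the longer sequence would be chosen instead.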