Utilities
Adri K. Grow edited this page 2025-08-19 17:00:31 -04:00

Utilities summary

EukPhylo includes a set of stand-alone utility scripts that extend the analyses possible with or without the core EukPhylo pipeline. We divide these scripts into five main categories: assembly and fasta tools, sequence composition analysis, MSA tools, gene tree description, and stand-alone clade grabbing.

  • Assembly and fasta tools cover tasks such as downloading sequences from GenBank, clustering sequences, calculating statistics on assemblies, and estimating the most widely shared gene families (OGs) for use in EukPhylo part 2.
  • Sequence composition analysis calculates statistics for coding domains (e.g. composition, effective number of codons), plots the outputs, and enables users to rename sequences in "ready to go" files based on GC content at silent sites.
  • MSA tools include an assessment of gaps, a wrapper for Guidance analyses, and a tool to count taxa across gene families (useful for deciding which trees to run after part 1).
  • Gene tree description utilities allow users to modify trees (i.e. to rename and color tips) and to assess clade sizes and levels of contamination.
  • Stand-alone clade grabbing allows users to select sequences from robust clades based on user-defined rules.
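To make the "GC content at silent sites" idea from the sequence composition bullet concrete, here is a minimal Python sketch. It is not the EukPhylo implementation: it approximates silent sites by every third codon position (GC3), assumes the input is an in-frame coding sequence, and the function names `gc3` and `gc3_label` and the label format are hypothetical.

```python
def gc3(cds: str) -> float:
    """Fraction of G/C among third codon positions of an in-frame CDS.

    Simplification (assumption): every third position is treated as a
    silent site; a stricter analysis would use only degenerate codons.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    third = [seq[i] for i in range(2, usable, 3)]
    called = [base for base in third if base in "ACGT"]  # skip N and other ambiguity codes
    if not called:
        return float("nan")
    return sum(base in "GC" for base in called) / len(called)


def gc3_label(seq_id: str, cds: str) -> str:
    """Hypothetical relabelling scheme: append the GC3 percentage to the ID."""
    return f"{seq_id}_GC3-{gc3(cds) * 100:.1f}"
```

For example, `gc3("ATGGCCAAA")` inspects the third positions G, C, and A, so two of the three sites are G or C.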

The utilities are written in Python, apart from the two R plotting scripts (PlotComps.r and Plotcomps_SppName.R), and contain headers that provide information on usage. A summary of the utilities, divided by category, is given here:

| Category | Script name | Intent | Output |
|---|---|---|---|
| Assembly and fasta tools | Assess_transcriptomes.py | Calculates the length, GC content, and coverage of assembled files | Spreadsheet containing the length, coverage, and GC content of each transcript |
| | Cluster.py | Clusters sequences in a fasta file | Clustered fasta files |
| | GetTaxonomy.py | Collects the taxonomic classification of organisms from NCBI | Spreadsheet with NCBI taxonomy |
| | GetUniqueTaxa.py | Gets the unique taxa from a taxonomic classification | Spreadsheet with unique taxa |
| | Plot_transcriptomes.py | Plots the length, coverage, and GC distribution of transcriptomes | Plots of transcript distributions |
| | QuerySRA.py | Downloads assemblies from NCBI | Assemblies, IDs, and GCA or SRR codes |
| | ReadMapping.py | Maps a group of trimmed reads to a reference | SAM/BAM files |
| | SeqLenToCsv.py | Calculates the length of DNA sequences in fasta files | Spreadsheet containing the length of all sequences |
| | SharedOGs.py | Summarizes gene family presence in fasta files | Spreadsheet with the gene families |
| Sequence composition analysis | CUB.py | Summarizes the nucleotide composition of fasta files | Fasta file and several spreadsheets summarizing the nucleotide composition |
| | GC_identifier.py | Renames sequence IDs by GC composition | Fasta files with relabeled sequence IDs |
| | PlotComps.r | Produces GC3 width plots | GC3 width plots |
| | Plotcomps_SppName.R | Produces GC3 width plots with the species name and number of sequences added to each plot | GC3 width plots |
| MSA tools | BacktranslateAlignment.py | Produces a new nucleotide alignment from an amino acid alignment | Aligned nucleotide file |
| | CountTaxonOccurence.py | Counts the occurrences of each taxon in each gene family of a post-Guidance file | Spreadsheet with counts of taxa |
| | friendlessness.py | Describes internal insertion regions that are unique or nearly unique to a sequence | Spreadsheet with statistics for each sequence |
| | Gappiness.py | Produces statistics on the terminal and internal gaps of an alignment | Spreadsheet with paralog statistics |
| | GuidanceWrapper.py | Guidance wrapper that can be used in place of the EukPhylo pipeline | Alignment files processed with Guidance |
| Gene tree description | CladeSizes.py | Describes clade sizes for different taxonomic groups | Spreadsheet describing clade sizes |
| | ColorByClade.py | Visualizes the placement of taxa by taxonomic group in trees | Colored trees |
| | ContaminationBySisters.py | Summarizes the taxonomic distribution of sister sequences for each taxon in a tree | Two spreadsheets summarizing relationships among tree tips |
| | RenameTips.py | Renames the tip labels of trees to include metadata such as location and date | Renamed trees |
| Stand-alone clade grabbing | CladeGrabbing.py | Selects clades of interest from trees using taxonomic specifications | Phylogenetic trees |
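As an illustration of the distinction between terminal and internal gaps mentioned for Gappiness.py, the following Python sketch computes both for a single aligned sequence. It is an independent example written under stated assumptions, not code taken from the utility: the function name `gap_stats`, the returned keys, and the use of `-` as the gap character are all hypothetical choices.

```python
def gap_stats(aligned: str, gap: str = "-") -> dict:
    """Count terminal and internal gaps in one aligned sequence.

    Terminal gaps are the runs of gap characters before the first residue
    and after the last one; internal gaps are gap characters in between.
    """
    length = len(aligned)
    without_leading = aligned.lstrip(gap)
    leading = length - len(without_leading)
    core = without_leading.rstrip(gap)        # first residue through last residue
    trailing = len(without_leading) - len(core)
    internal = core.count(gap)
    return {
        "length": length,
        "terminal_gaps": leading + trailing,
        "internal_gaps": internal,
        # Share of the residue-spanning region that is gap; 0.0 for all-gap rows
        "internal_gap_fraction": internal / len(core) if core else 0.0,
    }
```

Applied to `"--AC-GT---"`, the two leading and three trailing gap characters count as terminal, and the single gap between `AC` and `GT` counts as internal.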

Using the Utilities (drafting in progress)
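Because each utility carries its usage information in its header, one way to check how to run a script is to print that header before executing anything. The sketch below is a generic helper, not part of EukPhylo; it assumes the header is either a leading block of `#` comment lines or a module docstring, and the function name `script_header` is hypothetical.

```python
import ast
from pathlib import Path


def script_header(path) -> str:
    """Return the leading comment block of a Python script, or else its module docstring."""
    source = Path(path).read_text(encoding="utf-8")
    comment_lines = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#!"):
            continue                          # skip the shebang line
        if stripped.startswith("#"):
            comment_lines.append(stripped.lstrip("#").strip())
        elif stripped == "" and not comment_lines:
            continue                          # blank lines before the header
        else:
            break                             # first non-comment line ends the header
    if comment_lines:
        return "\n".join(comment_lines)
    try:
        doc = ast.get_docstring(ast.parse(source))
    except SyntaxError:
        doc = None
    return doc or ""


if __name__ == "__main__":
    import sys

    # Print the header of every script path given on the command line.
    for script in sys.argv[1:]:
        print(f"== {script} ==")
        print(script_header(script))
```

Saved under a name of your choosing, the helper can be pointed at one or more utility scripts on the command line; it only reads the files and never runs them.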