We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence.  Previous sequence-sampling methods, such as minimizers, syncmers, and minimally-overlapping words, were developed by heuristic intuition, and are not optimal.
We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly-evolving sequences.  It it likely near-optimal for a wide range of alignment-based and alignment-free analyses.  For real biological DNA, it increases specificity by avoiding simple repeats.  Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once), and polar sets (which guarantee to sample a sequence at most once).  This helps us understand how to do rapid sequence analysis as accurately as possible.
Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where data sets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem.  In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data.  Thus, we present SlowMoMan (SLOWMOtions onMANifolds), a web application which allows the user to draw a 1-dimensional path onto the 2-dimensional embedding. Then, by back projecting the manifold to the original, high-dimensional space, we sort the original features such that thosemost discriminative along the manifold are ranked highly.  We show a number of pertinent use cases for our tools, including trajectory inference, spatial transcriptomics, and automatic cell classification.
Glioma is the most common type of brain tumors in adults.  Among gliomas, glioblastoma remains incurable with a 5-year survival rate of 5.1%.  The unresponsiveness to treatment is mainly due to the high level of intratumor heterogeneity and the poor understanding of its molecular pathogenesis.  Thus, the development of biomarkers for proper stratification of glioma patients is of utmost importance for a deeper understanding of complex genetics of gliomas and novel targeted therapeutics.  Recent studies have revealed clinical relevance of adenosine to inosine (A-to-I) RNA editing in human cancers.  However, the prognostic and regulatory roles of A-to-I RNA editing in glioma remain unclear.  By analyzing editing signatures of two independent glioma cohorts, we found that editing signatures predicted patient survival in a sex-dependent manner; hyper-editing was associated with poor survival in females but better survival in males.  Moreover, a number of editing events were associated with glioma progression and affected mRNA abundance of their host genes, including genes associated with inflammatory response (EIF2AK2) and fatty acid oxidation (acyl-CoA oxidase 1).  Our study indicates that RNA editing could serve as a promising prognostic biomarker and a potential therapeutic strategy against glioma and highlights the importance of developing sexual dimorphic treatments.
Motivation: With the advance in single-cell RNA sequencing (scRNA-seq) technology, deriving inherent biological system information from expression profiles at a single-cell resolution has become possible.  It has been known that network modeling by estimating the associations between genes could better reveal dynamic changes in biological systems.  However, accurately constructing a single-cell network (SCN) to capture the network architecture of each cell and further explore cell-to-cell heterogeneity remains challenging.
Results: We introduce SINUM, a method for constructing the SIngle-cell Network Using Mutual information, which estimates mutual information between any two genes from scRNA-seq data to determine whether they are dependent or independent in a specific cell.  Experiments on various scRNA-seq datasets validated the accuracy and robustness of SINUM in cell type identification, superior to the state-of-the-art SCN inference method.  The SINUM SCNs exhibit high overlap with the human interactome and possess the scale-free property.  Additionally, SINUM presents a view of biological systems at the network level to detect cell-type marker genes/gene pairs and investigate time-dependent changes in gene associations during embryo development.
Availability and implementation: Codes for SINUM are freely available at https://github.com/SysMednet/SINUM. This research used the publicly available datasets published by other researchers as cited in the manuscript.
Background: Different T cells can be distinguished via their unique T cell receptor (TCR) genes that undergo V(D)J recombination.  The domination of an abundant TCR sequence often suggests expansion of the corresponding T cells.  Thus, recombined TCR sequences can serve as a biomarker of immune related diseases.  A few recent works have shown the biomarker potential of non-recombined TCR sequences, i.e., those composed of only J gene segments.  However, the generation mechanisms are little studied.  Such a study should be at the single cell level because recombination status may play a role and the status varies between T cells.  RNA-Seq data of T cell lines are great resource for studying non-recombined TCR sequences.  However, currently no tool is available for extracting both recombined and non-recombined TCR sequences from RNA-Seq data.
Results: To develop a computational pipeline for extracting all TCRB reads, we applied two spliced alignment tools, HISAT2 and TRIg, and compared their performance.  We found HISAT2 was more comprehensive in spliced alignments while only TRIg could identify reads containing some non-reference nucleotides added during VDJ recombination.  Both HISAT2 and TRIg identified false TCRB reads, which could be filtered by comparing their alignments.  On RNA-Seq data of 17 cell lines of T cell lymphoma, our pipeline found TCRB reads in 0.01-0.6% of the data.  We found that recombination status was related to the expression of TCRB gene.  The coverage profiles of the TCRB reads revealed that the non-recombined TCRB J2-2P~J2-3 segments were relatively more abundant in T cells.  When the top abundant VJ clone involved J2-1 or a J1 gene, especially when the top clone was non-productive, relatively more J2-2P~J2-3 segments were observed.
Conclusion: Our pipeline identified TCRB reads in the RNA-Seq data accurately and comprehensively.  Analysis of the TCRB reads provides insights into biogenesis of non-recombined TCRB sequences.  Specifically, the J2-2P~J2-3 segments were likely by-products of non-desired splicing of VJ clones involving J2-1.  For T cells which top VJ clone involved a J1 gene, J2-2P~J2-3 segments may come from transcription from the D2 promoter.
Retroviruses are important contributors to disease and evolution in vertebrates.  Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV).  Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions.
Some primate ERVs appear to encode an overlooked protein.  This protein has significant homology to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus.  MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now.  The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4.  We found homologs of MC132 in ERVs of apes, monkeys, and bushbaby, but not tarsiers, lemurs or non-primates.  This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses.
Tandem repeats, such as CATCATCATCAT, are abundant in natural DNA.  They tend to mutate and evolve rapidly by expansion and contraction.  These mutations often cause disease, and are probably a major source of variation in evolution.  A few previous studies have surveyed tandem repeats in eukaryote genomes, and found curious patterns such as 4-mer repeats being abundant in vertebrates, or 3-mer repeats being infrequent in mammals and birds.  However, repeat identification depends on criteria such as minimum length threshold, and these surveys arguably used non-equivalent thresholds for different repeat unit sizes.
Here, we survey tandem repeats with 3 different methods (TANTAN, ULTRA, and TRF), considering both exact and inexact repeats.  We confirm that 3-mer repeats are relatively infrequent in therian mammals.  More generally, in chordates 3-mer repeats often have anomalous frequencies: high in some genomes and low in others.  We test and reject several hypotheses: that repeat abundances are driven by evolutionary forces in genic regions, or by transposable elements, or by palindromic repeats.  In conclusion, tandem repeat surveys are sensitive to repeat definition criteria, but 3-mer repeats often have anomalous abundance in chordate genomes, which seems to be a genome-wide phenomenon.
In contrast to RNA-seq analysis, which has various standard methods, no standard methods for identifying differentially methylated cytosines (DMCs) exist.  To identify DMCs, we tested principal component analysis and tensor decomposition-based unsupervised feature extraction with optimized standard deviation whose effectiveness toward differentially expressed gene (DEG) identification was recently recognized.  The proposed methods can outperform certain conventional methods, including those that must assume beta-binomial distribution for methylation that the proposed methods do not have to assume, especially when applied to methylation profiles measured using high throughput sequencing.  DMCs identified by the proposed method are also significantly overlapped with various functional sites, including known differentially methylated regions, enhancers, and DNase I hypersensitive sites.  This suggests that the proposed method is a promising candidate for standard methods for identifying DMCs.
Novel protein-coding genes were considered to be born by re-organization of pre-existing genes, such as gene duplication and gene fusion.  However, recent progress of genome research revealed that more protein-coding genes than expected were born de novo, that is, gene origination by accumulating mutations in non-genic DNA sequences.  Nonetheless, the in-depth process (scenario) for de novo origination is not well understood.  We have conceived bioinformatics analysis for sketching a scenario for de novo origination of protein-coding genes.  For each de novo protein-coding gene, we firstly identified an edge of a given phylogenetic tree where the gene was born based on parsimonious principle.  Then, from a multiple sequence alignment of the de novo gene and its orthologous regions, we constructed ancestral DNA sequences of the gene corresponding to both ends of the edge.  We finally revealed statistical features observed in evolution between the two ancestral sequences.  In the analysis of the Saccharomyces cerevisiae lineage, we have successfully sketched a putative scenario for de novo origination of protein-coding genes.  (1) In the beginning was GC-rich genome regions.  (2) Neutral mutations were accumulated in the regions.  (3) ORFs were extended/combined, and then (4) translation signature (Kozak consensus sequence) was recruited.  To the best of our knowledge, this is the first report outlining a scenario of de novo origination of protein-coding genes.
When building a model for predicting a phenotype of interest based on the omics data, data imbalance, resulting from either unbalanced covariates or unbalanced groups, can lead to inaccurate training of the model.  To handle the covariate imbalance between groups in the dataset, a matching algorithm, such as propensity score matching (PSM), can be used to reduce the bias in the unbalanced covariates on the phenotype of interest.  Although PSM reduces bias of the effect of a covariate, many samples can be filtered out.  In addition to PSM, oversampling methods, such as borderline synthetic minority oversampling technique (borderline-SMOTE), can be used to reduce the bias from class imbalance when building prediction models.  However, newly created borderline-SMOTE samples could be uninformative if they are not close to the borderline.
To manage the limitations of PSM and borderline-SMOTE, we propose a hybrid sampling method, PSM-SMOTE, that handles both covariate and class imbalance in microbiome data.  Logistic regression (LR), random forest (RF), and support vector machine (SVM) models were used to evaluate PSM-SMOTE using three different microbiome datasets.  The area under the receiver operating characteristic curve (AUC) was used to compare PSM-SMOTE with other existing methods.  When compared to those using PSM, RF models using PSM-SMOTE performed the best with the highest AUC resulting an increase in AUC.
Conventional differential gene expression analysis pipelines for non-model organisms require computationally expensive transcriptome assembly.  We recently proposed an alternative strategy of directly aligning RNA-seq reads to a protein database, and demonstrated drastic improvements in speed, memory usage, and accuracy in identifying differentially expressed genes.  Here we report a further speed-up by replacing DNA-protein alignment by quasi-mapping, making our pipeline >1000x faster than assembly-based approach, yet more accurate.  We also compare quasi-mapping to other mapping techniques and show it is faster, but at the cost of sensitivity.
Two important issues in imaging neurons were how to represent the neuron activity of a calcium imaging video in an image and how to predict a set of neurons with a high precision rate and high fluorescence level changes.  In this paper, a new pipeline was proposed to predict neuron cells using a calcium imaging video.  Rather than analyzing the whole video, we generate the max-min intensity image as a representative image from an input video.  Not only for visualization, the image can be used to predict neuron cells by applying Cellpose, a two-dimensional cell detection tool, directly.  In the result, we demonstrated the max-min intensity image showing clear cell shapes with fluorescence change levels compared with other reference images.  Then, six calcium imaging videos with consensus annotations were used for the comparison.  The precision rates for our method were mostly higher than those by two other popular tools though the total numbers of predicted neurons by our method were less.  Moreover, the fluorescence change levels for the predicted neurons by our method were significantly higher than those of un-predicted annotated ones.  Thus, our method can satisfy the neuroscientists seeking a predictor producing an output preferring the quality over quantity.
Motivation: Nucleus segmentation represents the initial step for histopathological image analysis pipelines, and it remains a challenge in many quantitative analysis methods in terms of accuracy and speed.  Recently, deep learning nucleus segmentation methods have demonstrated to outperform previous intensity- or pattern-based methods.  However, the heavy computation of deep learning provides impression of lagging response in real time and hampered the adoptability of these models in routine research.
Results: We developed and implemented NuKit, a deep learning platform which accelerates nucleus segmentation and provides prompt results to the users.  NuKit platform consists of two deep learning models coupled with an interactive graphical user interface (GUI) to provide fast and automatic nucleus segmentation “on the fly”.  The two deep learning models are the whole image segmentation model and the click segmentation model.  Both deep learning models provide complementary tasks in nucleus segmentation in the NuKit platform.  The whole image segmentation model performs whole image nucleus whereas the click segmentation model supplements the nucleus segmentation with user-driven input to edits the segmented nuclei.  We trained the NuKit whole image segmentation model on a large public training data set and tested its performance in seven independent public image data sets.  The whole image segmentation model achieves average DICE = 0.814, average IoU = 0.689 and Nucleus Detected Ratio 1.052.  The outputs of Nukit could be exported into different file formats, as well as provides seamless integration with other image analysis tools such as QuPath.  NuKit can be executed on Windows, Mac and Linux using regular personal computer.  The deep learning inference time of NuKit is around 1.5 seconds.
Availability: The platform is available at http://tanlab.org/NuKit/.
Background: Microbiome studies in the recent years have established the association between the composition of gut microbiota and various diseases.  Since 16S ribosomal RNA sequencing may suffer from problems such as lower taxonomic resolution or limited sensitivity, more and more studies embraced whole-metagenome approach, which has the potential of sequencing everything in the target microbiome, to conduct microbial association analysis.  However, species profiling, which is the most popular analysis technique for whole-metagenome analysis, cannot detect uncultivated species.  Since uncultivated species may also be indispensable in the gut environments, it is crucial to identify those uncultivated species and evaluate their importance in discerning disease from healthy samples.
Results:After conducting de novo co-assembly and genome binning procedures on two colorectal cancer (CRC) cohorts, in which one of them was from the Asian population while the other was from the Caucasian population, we identified that the Asian and Caucasian cohorts shared a significant amount of microbial species in their microbiota.  In addition we found that low abundance genomes may be more important in classifying disease and healthy metagenomes.  By sorting the genomes based on their random forest importance scores in differentiating disease and healthy samples and cumulatively evaluating the subset of genomes in predicting CRC status, we identified dozens of “important” genomes for each of the cohorts that were able to predict CRC with very high accuracy (0.90 and 0.98 AUROC for the Asian and Caucasian cohorts respectively).  Uncultivated species were also identified among the selected genomes, highlighting the importance of extracting genomes of the uncultivated species in order to build better disease prediction models and evaluate the roles of the uncultivated species in the disease formation or progression.  Finally we found that the “important” species for both cohorts did not overlap with each other, hinting that the species highly associated with CRC disease may be different between the Eastern and Western populations.
The Encyclopedia of DNA Elements (ENCODE) Consortium has generated thousands of genomic datasets to annotate non-coding regions of the genome. For the research community to easily use and interpret these datasets, the ENCODE Consortium has integrated multiple datasets to create a collection of genomic annotations termed the ENCODE Encyclopedia. In particular, the ENCODE Project has performed thousands of transcription factor (TF) ChIP-seq experiments to profile the binding of nearly 900 distinct TFs genome-wide. We have built two web-based tools, SCREEN (https://screen.encodeproject.org) and Factorbook (https://factorbook.org), for exploring the rich ENCODE data and annotations of non-coding regulatory elements.
In this tutorial, I will first introduce the ENCODE Encyclopedia, focusing primarily on the database and visualization tool SCREEN: Search Candidate cis-Regulatory Elements by ENCODE. SCREEN is an online tool that enables users to explore candidate cis-Regulatory Elements (cCREs) across hundreds of cell and tissue types and filter regions by various facets.
For the second part of the tutorial, I will introduce Factorbook, a TF-centric hub of data and integrative analysis, including the binding patterns and motifs of more than 800 human TFs combined with orthogonal data, such as histone ChIP-seq, DNA methylation, GWAS, or RNA-seq data, providing powerful insights into the regulatory patterns of the genome.
Background: Sequence logos can effectively visualize position specific base preferences evident in a collection of binding sites of some transcription factor.  But those preferences usually fall far short of fully explaining binding specificity.
Interestingly, some transcription factors bind sites of potentially methylated DNA.  For example, MYC binds CpG sites.  This may increase binding specificity as such sites are 1) highly under-represented in the genome, and 2) offer additional, tissue specific information in the form of hypo- or hyper-methylation.  Fortunately, bisulfite sequencing data suitable to investigate this possibility is readily available.
Method: We developed MethylSeqLogo, an extension of sequence logos which adds DNA methylation information to sequence logos.  MethylSeqLogo includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of aligned binding sites.  Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) appropriate to properly assess the expected background dimer frequency and level of methylation.
When designing MethylSeqLogo, we took care to preserve the usual sequence logo meaning of heights; in which the relative height of nucleotides within a column represents their proportion in the binding sites, while the absolute height of each column represents information (relative entropy) and the height of all columns added together represents total information.
Results: We present several figures illustrating the utility of using MethylSeqLogo to summarize data from CpG binding transcription factors.  The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral.
We also compare MethylSeqLogo with two previously reported ways to create methylation aware sequence logos.
Conclusions: Our freely available software enables users to explore large-scale bisulfite and ChIP sequencing data sets - and in the process obtain publication quality figures.
Alternative splicing is a crucial mechanism of post-transcriptional modification responsible for the transcriptome plasticity and proteome diversity of a metazoan cell.  Although many splicing regulations around the exon/intron regions have been discovered, the relationship between promoter-bound transcription factors and the downstream alternative splicing remains largely unexplored.  In this study, we present computational approaches to decipher the regulation relationship connecting the promoter-bound transcription factor binding sites (TFBSs) and the splicing patterns.  We curated a fine data set, including DNase I hypersensitive sites sequencing and transcriptome in fifteen human tissues from ENCODE.  Specifically, we proposed different representations of TF binding context and splicing patterns to tackle the associations between the promoter and downstream splicing events.  Our results demonstrated that the convolutional neural network (CNN) models learned from the TF binding changes in the promoter to predict the splicing pattern changes.  Furthermore, through an in silico perturbation-based analysis of the CNN models, we identified several TFs that considerably reduced the model performance of splicing prediction.  In conclusion, our finding highlights the potential role of promoter-bound TFBSs in influencing the regulation of downstream splicing patterns and provides insights for discovering alternative splicing regulations.
Motivation: The synthesis of proteins with novel desired properties is challenging but sought after by the industry and academia. The dominating approach is based on trial-and-error-inducing point mutations, assisted by structural information or predictive models built with paired data that are difficult to collect.  This study proposes a sequence-based unpaired sample of novel protein inventor (SUNI) to build ThermalProGAN for generating thermally stable proteins based on sequence information.
Results: The ThermalProGAN can strongly mutate the input sequence with a median number of 32 residues.  A known normal protein,1RG0, was used to generate a thermally stable form by mutating 51 residues.  After superimposing the two structures high similarity is shown, indicating that the basic function would be conserved.  84 molecular dynamics simulation results of 1RG0 and the Covid-19 vaccine candidates with a total simulation time of 840 ns indicate that the thermal stability increased.
Conclusion: This proof of concept demonstrated that the transfer of a desired protein property from one set of proteins is feasible.
Availability and implementation: The source code of ThermalProGAN can be freely accessed at https://github.com/markliou/ThermalProGAN/ with an MIT license.  The website is https://thermalprogan.markliou.tw:433.
Supplementary information: Supplementary data are available on Github.
Methods for network-based missing protein prediction may use either p-values or probabilities, which may render them difficult for direct cross-comparisons.  Approaches such as the Bayes Factor upper Bound (BFB) for p-value conversions may not make correct assumptions.  Here, using a well-established case study on renal cancer proteomics, we demonstrate how to compare methods based on false discovery rate (FDR) estimations, which does not make the same naïve assumptions as BFB conversions; and we introduce a powerful approach which we colloquially call “home ground testing”.  Both approaches perform better than BFB conversions.  Thus, we recommend that methods be compared by standardization to a common performance benchmark such as a global FDR.  And where not possible, then to consider reciprocal “home ground testing”.
Resolving the composition of tumor-infiltrating leukocytes is essential for expanding the cancer immunotherapy strategy, which has witnessed dramatic success in some clinical trials but remained elusive and limited in its application.  In this study, we developed a two-step streamed workflow to manage the complex bioinformatic processes involved in immune cell composition analysis.  We developed a dockerized toolkit (DOCexpress_fastqc, https://hub.docker.com/r/lsbnb/docexpress_fastqc) to perform gene expression profiling from RNA sequencing raw reads by integrating the hisat2-stringtie pipeline and our scripts with Galaxy/Docker images.  Then the output of DOCexpress_fastqc fits the input format of mySORT web, a web application that employs the deconvolution algorithm to determine the immune content of 21 cell subclasses.  The usage of mySORT was also demonstrated using a pseudo-bulk pool through single-cell datasets.  Additionally, the consistency between the estimated values and the ground-truth immune-cell composition from the single-cell datasets confirmed the exceptional performance of mySORT.  The mySORT demo website and Docker image can be accessed for free at https://mysort.iis.sinica.edu.tw and https://hub.docker.com/r/lsbnb/mysort_2022.
The interphase chromatin structure is extremely complex, precise and dynamic.  Experimental methods can only show the frequency of interaction of the various parts of the chromatin.  Therefore, it is extremely important to develop theoretical methods to predict the chromatin structure.  In this publication, we describe the necessary factors for the effective modeling of the chromatin structure in Drosophila melanogaster.  We also compared Monte Carlo with Molecular Dynamic methods.  We showed that incorporating black, non-reactive chromatin is necessary for successfully prediction of chromatin structure, while the loop extrusion model with a long range attraction potential or Lennard-Jones (with local attraction force) as well as using Hi-C data as input are not essential for the basic structure reconstruction.  We also proposed a new way to calculate the similarity of the properties of contact maps including the calculation of local similarity.