ISCB-Asia/SCCG 2012, BGI Special Session
Stephen Kwok-Wing Tsui
With increasing computational power, the availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF-TFBS binding sequence patterns from TRANSFAC in the most explicit and interpretable form. The framework is based on association rule mining with the Apriori algorithm. Re-categorizing the patterns with respect to varying TF amino acids reveals statistically significant (P values ≤ 0.005) subtypes that lead to varying TFBS patterns, without using TF family or domain annotations.
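As a rough illustration of association rule mining with Apriori, the sketch below finds frequent itemsets level by level and then derives rules from them. It is not the authors' pipeline: the function names `apriori` and `derive_rules`, the `TF:`/`TFBS:` item labels, the thresholds, and the toy records are all assumptions made for this example, and the co-occurrence counts are invented rather than taken from TRANSFAC.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in at
    least `min_support` transactions (hypothetical helper)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs anywhere.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                current.add(union)
    return frequent


def derive_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting `min_confidence`."""
    found = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[lhs] is always present.
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, confidence))
    return found


# Toy records: each is one hypothetical TF-TFBS observation.
# The k-mers are borrowed from the text; the counts are made up.
records = [
    {"TF:PKVVIL", "TFBS:CACGTG"},
    {"TF:PKVVIL", "TFBS:CACGTG"},
    {"TF:PKVEIL", "TFBS:CAGCTG"},
    {"TF:PKVEIL", "TFBS:CAGCTG"},
    {"TF:PKVVIL", "TFBS:CAGCTG"},
]

frequent = apriori(records, min_support=2)
rules = derive_rules(frequent, min_confidence=0.9)
for lhs, rhs, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), round(confidence, 2))
```

Support and confidence are the only filters used here; the significance testing behind the P value threshold mentioned above is not reproduced in this sketch.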
The resulting subtypes have various biological meanings. Analyzing the subtypes reveals conserved residues that are critical for maintaining TF-TFBS binding. In-depth analysis of the subtype pair PKVVIL-CACGTG versus PKVEIL-CAGCTG shows that the V/E variation is indicative of the distinction between the Myc and MRF families. Further independent verification from the literature, Protein Data Bank / homology modeling, and ChIP-seq data provides strong evidence that the discovered patterns reflect real TF-TFBS binding across different TFs and TFBSs, and that they are informative and promising for further biological findings.