Graduate Student Projects

I advise graduate students, mostly Master's students. I provide them with a thesis template to help them get started.

DNA NGS Data Processing

One thread in our lab is implementing software to analyze NGS DNA sequence data. In particular, some students have looked into extending a program called EAGLE, written some years ago by Tony Kuo (I was also involved in designing it and in writing the paper: BMC Medical Genomics, doi.org/10.1186/s12920-018-0342-1).

EAGLE Based Variant Caller

Using the GIAB variant calling benchmark, we got as far as some anecdotal evidence that our updated version of EAGLE can find variants that are missing from the benchmark (but still unconfirmed).

I think we (or someone) should add de novo variant calling functionality to EAGLE. In principle it is easy: simply align reads to the genome (i.e. do read mapping) and consider any substitutions or indels found in high scoring read-genome alignments (not necessarily the highest scoring ones) as candidate variants.
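
As a rough illustration of the candidate-collection step (a minimal sketch, not EAGLE code): assume the read mapper hands us each alignment as two gapped rows of equal length, with '-' marking a gap. The names collectCandidates and struct Candidate are hypothetical, and filtering alignments by score is left out.

```c
#include <stddef.h>

/* Kind of difference observed between a read and the reference. */
enum VariantKind { SUBSTITUTION, INSERTION, DELETION };

struct Candidate {
  enum VariantKind kind;
  long refPos;    /* 0-based reference coordinate; for an insertion, the
                     coordinate of the next reference base */
  char refBase;   /* reference base, or '-' for an insertion */
  char readBase;  /* read base, or '-' for a deletion */
};

/* Scan one gapped read-genome alignment column by column and record every
   mismatch or gap column as a candidate variant.
   refAln and readAln are the two rows of the alignment; refStart is the
   reference coordinate of the first aligned reference base.
   Returns the number of candidates written to out (at most maxOut). */
size_t collectCandidates(const char *refAln, const char *readAln, long refStart,
                         struct Candidate *out, size_t maxOut)
{
  size_t n = 0;
  long refPos = refStart;
  for (size_t i = 0; refAln[i] != '\0' && readAln[i] != '\0'; i++) {
    char r = refAln[i], q = readAln[i];
    if (r != q && n < maxOut) {
      enum VariantKind kind = (r == '-') ? INSERTION
                            : (q == '-') ? DELETION
                            : SUBSTITUTION;
      out[n++] = (struct Candidate){ kind, refPos, r, q };
    }
    if (r != '-') refPos++;  /* only reference bases advance the coordinate */
  }
  return n;
}
```

For example, calling collectCandidates("ACGT-ACGT", "ACTTGA-GT", 100, out, 8) yields three candidates: a G to T substitution at position 102, an inserted G before position 104, and a deleted C at position 105. This sketch reports each gap column separately; a real caller would probably merge adjacent gap columns into one indel and pool candidates across many reads.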

One could do this by instrumenting the Eagle code, or by coding from scratch. I would recommend using C or possibly Rust (I'm curious to learn more about Rust).

EAGLE Based Methylation+Variant Caller

The idea is to apply the EAGLE probabilistic model to simultaneously perform methylation calling and variant calling from bisulfite sequencing data. Some students have explored this to some degree, but the development has not gone very far yet. In order to make progress we need to develop (or find) a bisulfite sequencing simulation program which can do exactly what we want (the one we have used so far does not). Developing a good simulator could potentially be a project in itself; the ideal result would be that simulated datasets cannot be easily discriminated from real bisulfite data.

Your Own Idea

I encourage students to propose their own project ideas. So far the best examples of this are 蔡孟勳 (Alex), who did a nice project related to CRISPR (an important topic which I knew little about), and 王士杰, who made a prototype Japanese ⟺ Chinese automatic translation program.

Language Related Ideas

Japanese Moe dict

Japanese and Chinese are a hobby of mine. One practical idea I have is to make a resource based on Moe dict (萌典) but including Japanese translations. One source for automatically matching Chinese, Japanese and English words is Wikipedia, since many pages exist in cross-linked, matching versions, for example: 微積分學, 微分積分学, and Calculus. [Update 2023: a student is currently working on this project.]

A Better Unicode

Unicode is wonderful. A billion times better than boring ASCII. But actually it's a mess. It is inconsistent: for example, subscripted a and e (ₐₑ) are in Unicode, but subscripts for b, c and d are not. It contains all sorts of exotic characters and symbols, but fails to provide very basic code points which millions of people would use. For example, only two code points are provided for the character "徑", but it needs three, because mainland China and Japan each use their own simplified version of this character. Unicode "merges" the Japanese and Chinese simplified characters, but what if I want to quote Chinese in a Japanese text or vice versa? (Personally I wish neither China nor Japan had "simplified" Chinese characters, but it is hard to see how I could undo that.) One might argue that while a "better Unicode" is in principle a good idea, in practice Unicode is too well entrenched to be displaced. Maybe so, but it could still be a nice academic research project.

Software Projects

Since I am in a computer science department, I think it would be nice if some student research projects contributed to open source community software. I will be teaching a course on Emacs and have some ideas for projects that would provide useful elisp libraries.

Another (highly challenging) idea would be to figure out a way to speed up the run time of my favorite document processor, XeTeX, for example by using checkpointing to accelerate recompiling.
One clue: XeLaTeX seems to support a LaTeX precompiling mechanism with a command like:

xelatex -ini -jobname="CommonIncludes" "&xelatex" mylatexformat.ltx CommonIncludes.tex
But this gives an error because the mechanism does not support modern OpenType fonts, and although I have seen a workaround, I was not able to confirm that it works.

Or how about an intelligent file renamer? Often I download files with meaningless names like btac354.pdf or n059940111401.pdf; those names might mean something to the site I downloaded them from, but they mean nothing to me - so each time I spend a minute of my life renaming them manually. It would be nice to have a tool which could heuristically derive an appropriate name for such files based on their content.

Better support for substrings, or more generally "ranges" of arrays, in C. The C programming language is rather low level, but is still widely used. The standard way to represent a string in C is a "C-string", which is a chunk of memory ending in zero (null, '\0'). That representation makes it very inconvenient to represent substrings of a larger string -- it forces the programmer to allocate their own memory, copy the substring into it, and then add a zero at the end. A much more convenient representation would be a range expressed as a pair of pointers, char* beg, char* end. So to print the substring one could do

for( char* c = beg; c < end; c++ )  putchar( *c );
Unfortunately most standard C functions, especially printf and friends do not support ranges. (The Standard Template Library of C++ does, but C++ has other drawbacks). It would be nice to design a GNU C extension of printf to support ranges. Something like
struct charRange{
  char* beg;
  char* end;
};

char text[] = "abcdefgh";
struct charRange substringVar =  {.beg= text+3, .end= text+6};

my_printf( "the substring you want is '%rc'", substringVar );
// will output "the substring you want is 'def'"
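
Until such an extension exists, standard printf can already get part of the way there: the precision field of the %s conversion, written as "%.*s", limits how many characters are printed, so a range can be passed as a length plus a start pointer. The sketch below shows this with snprintf; formatRange is a hypothetical helper name, not an existing library function.

```c
#include <stdio.h>

struct charRange {
  char *beg;  /* first character of the range */
  char *end;  /* one past the last character of the range */
};

/* Format a range into buf using the standard "%.*s" conversion, whose
   precision argument caps the number of characters taken from the string.
   Returns what snprintf returns: the length of the full output,
   not counting the terminating zero. */
int formatRange(char *buf, size_t bufSize, struct charRange r)
{
  return snprintf(buf, bufSize, "the substring you want is '%.*s'",
                  (int)(r.end - r.beg), r.beg);
}
```

With text = "abcdefgh" and the range {text+3, text+6}, buf ends up holding "the substring you want is 'def'". The drawbacks are that every range costs two arguments and a cast to int, and that printing still stops early at a zero byte inside the range, so a dedicated range conversion would still be more convenient.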

GPU software development

The power of GPU computing gives impressive results when the task, implementation and hardware architecture all match. Unfortunately I don't know much about GPU programming, but I think it is an important direction and could be a good fit for a Master's student who wants to acquire a programming skill not many have. In 2022, one student (盧宥霖) had some success porting an NGS application (EAGLE again) to GPU. That could be further explored. Also, I have an interest in Bayesian model selection, which sometimes requires computationally intensive Monte Carlo methods to numerically approximate integrals. Some work has been done on that (by other groups), but probably much more can be done.
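
To make the Monte Carlo part concrete, here is a toy sketch in plain serial C (not GPU code, and not taken from any lab project). It estimates the marginal likelihood of coin-flip data under a uniform prior on the bias theta by averaging the likelihood over random draws of theta. The coin-flip model, the function names and the simple random number generator are all my own assumptions for illustration.

```c
#include <stdint.h>

/* Small deterministic generator (64-bit LCG) so the sketch is reproducible;
   a real study would use a better-tested generator. Returns a value in [0,1). */
static double nextUniform(uint64_t *state)
{
  *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
  return (double)(*state >> 11) / 9007199254740992.0;  /* top 53 bits / 2^53 */
}

/* Likelihood of one particular sequence of flips containing `heads` heads
   out of `flips` flips, for a coin with bias theta. */
static double likelihood(double theta, int heads, int flips)
{
  double p = 1.0;
  for (int i = 0; i < heads; i++) p *= theta;
  for (int i = heads; i < flips; i++) p *= 1.0 - theta;
  return p;
}

/* Monte Carlo estimate of the marginal likelihood under a uniform prior:
   the average of the likelihood over theta values drawn from the prior. */
double marginalLikelihood(int heads, int flips, long samples, uint64_t seed)
{
  uint64_t state = seed;
  double sum = 0.0;
  for (long i = 0; i < samples; i++)
    sum += likelihood(nextUniform(&state), heads, flips);
  return sum / (double)samples;
}
```

For 7 heads in 10 flips the exact value of this integral is 1/1320, so the estimate can be checked directly. Each sample is computed independently of the others, which is the property that makes this kind of computation a natural candidate for a GPU; realistic models have many parameters and need far more samples than this toy case.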