Genome Informatics (基因體資訊) 2022. Homework assignment Oct 2022.

Download human genome From Human Genome Resources at NCBI, Download the file: GRCh38_latest_genomic.fna.gz. You can confirm you got the right one with:

% cksum GRCh38_latest_genomic.fna.gz 
1339154463 972898531 GRCh38_latest_genomic.fna.gz

Extract entry NC_000006.12 Homo sapiens chromosome 6, GRCh38.p14 Primary Assembly.

Extract bases in the half open interval [100,000–1,100,000), counting from zero. The segment starting with ttggtaccattccttctgaaactatt and ending in AAGCCAAGGCCATAGCACTATAATTGCAGT. Denote this as S.

For simplicity map all letters to lower case; i.e. A-->a, C-->c, etc.

Write a program (in C or whatever) implementing (plain) Markov models of order 0,1 and 2; for generating DNA sequences. Fit the parameters of these three models to S and compute the log base 2 probability of each of them generating S (given that they generate a sequence of that length).

Write a program (in C or whatever) implementing a hidden Markov model to generate sequences.
Because this is an exercise for learning, please implement the models yourself (do not use any software packages for Markov or hidden Markov models).
Compute the log base 2 probability of your model generating S (given that it generates some sequence that length).

Implement the Baum-Welch Expectation Maximization (EM) algorithm to learn the parameters. See if your hidden Markov model can be give a higher probability than the plain Markov models.
Note that because EM is vulnerable to local minimum in this kind of application, in order to learn a good model you probability will find it necessary to try many sets of initial values and/or use some hand tuning method to start with reasonable parameters.
To keep the models general, models are not allowed to emit more than 3 bases in one step (in particular the model cannot simply emit S in one step!).

After designing your model and fitting its parameters to S; compute the probabilities that they would generate:

The 100Mb section of chromosome 7, entry NC_000007.14 Homo sapiens chromosome 7, GRCh38.p14 Primary Assembly,
from position [100,000–1,100,000); starting with aGAAGGAAAACGGGAAACTTCACAATTagt and ending in TGAGCTATGGAACCTGGGTATGGACAACTG
A 100Mb section of another chromosome (state the section used)

Report

Write a short report summarizing your results. Submit it together with your program. Report can be in plain text or markdown. Please include running time and memory usage in your report. On linux you will find /usr/bin/time useful for this.

Write your names in the project readme. Please just give us your source files, no need to check in data files or executables (我不要你的.exe檔).