Methyl-C data from Lister et al.: http://genomebrowser.wustl.edu/twang_stuff/Cydney/

General CpG content and MRE fragment files: http://hgwdev.cse.ucsc.edu/~tingwang/Costello/remc.html

----- Notes from Ting ------

(1) Yes, the data are from the Lister paper. I have non-CG data as well, but that would be strand-specific. For simplicity and for display purposes I only put the CpG data there. If you need the non-CG data I can send it to you. However, most non-CG methylation happens in regions that are already highly methylated at CpGs, so my feeling is that you don't lose anything by working only on the CpG data.

(2) How the score is defined: at any given site, you have a total C count and an mC count (mC will remain C; total C is C+T). Usually people use mC/Total_C to indicate the methylation level. I used mC/Total_C*1000 - 500. The sole reason is display: unmethylated regions get a negative score and are drawn differently. Scores therefore run from -500 (no methylated reads) to +500 (all reads methylated). In my scoring system, 0 means there was an equal amount of mC and C at that position, i.e. the site is half methylated. If a CpG didn't have a read, it didn't receive a score. That is why the H1 file has 26M lines instead of 28M.

(3) I think you can start by calculating an average for each window. You don't need to normalize by CpG density with this data. Depending on the window size, you might want to throw out regions with very few CpGs, just to avoid small-sample bias. You may also try to distinguish regions with sub-structure: for example, a window whose first half is methylated and whose second half is unmethylated has the same average as a window that is partially methylated throughout. This type of data almost always asks for a variable window size. I bet you can get most of what you need by simply computing an average; then you can go after more subtle things with a more complex model of the score. The other thing to keep in mind is that each CpG has a different total C; in other words, the confidence interval of the measurement is different for each CpG.
This information is lost when the counts are converted to scores. If you are interested in regional calls, I've done this before: collect all reads in the region and compute mC/Total_C for the entire region (not for each individual CpG). I think I can pull that kind of data for you as well, if you are interested.
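
A minimal Python sketch of the arithmetic described in (2) and (3), under stated assumptions. It is not Ting's code. It assumes the sites for one chromosome are already in memory as (position, score) pairs and that fixed-size windows are good enough for a first pass. The function names, the `min_cpgs` cutoff and the example numbers are all hypothetical. The notes do not describe the file layout, so parsing is left out.

```python
from collections import defaultdict


def score_to_fraction(score):
    """Invert score = mC/Total_C*1000 - 500 to recover mC/Total_C."""
    return (score + 500.0) / 1000.0


def window_averages(sites, window_size, min_cpgs=1):
    """Average methylation fraction per fixed-size window.

    sites: iterable of (position, score) pairs from one chromosome.
    Windows with fewer than min_cpgs scored CpGs are dropped
    (hypothetical cutoff to limit small-sample bias).
    Returns {window_start: mean fraction}.
    """
    buckets = defaultdict(list)
    for pos, score in sites:
        start = (pos // window_size) * window_size
        buckets[start].append(score_to_fraction(score))
    return {
        start: sum(fracs) / len(fracs)
        for start, fracs in sorted(buckets.items())
        if len(fracs) >= min_cpgs
    }


def pooled_fraction(counts):
    """Regional call: sum of mC over sum of Total_C across the region.

    counts: iterable of (mC, total_C) read counts, one pair per CpG.
    Returns None if the region has no reads.
    """
    counts = list(counts)
    mc = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return mc / total if total else None


# Made-up example: two methylated sites, two unmethylated sites,
# and one lone half-methylated site in the next window.
sites = [(100, 500), (400, 500), (600, -500), (900, -500), (1200, 0)]
averages = window_averages(sites, window_size=1000, min_cpgs=2)

# Made-up read counts: a deeply covered CpG and a shallow one.
counts = [(9, 10), (1, 2)]
per_site_mean = sum(m / t for m, t in counts) / len(counts)
pooled = pooled_fraction(counts)
```

In the example, the first window averages to 0.5 even though it is half fully methylated and half unmethylated, which is the sub-structure caveat from (3). The second window is dropped because it has only one CpG. The pooled value (10/12) differs from the mean of the per-site ratios (0.7) because pooling weights each CpG by its read depth, which the scores alone cannot do.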