Get DNA sequences using R and BSgenome

My typical approach for getting a particular DNA sequence is to use the UCSC genome browser. Recently I needed to screen over 40 SNPs for restriction enzyme cut sites and didn’t want to go through all the clicks on the UCSC browser to get the flanking sequences. The Galaxy tool could certainly make quick work of this task, but lately I’ve been trying to do all of my work in R. Here is an R solution utilizing the BSgenome package from Bioconductor:

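#Install the BSgenome package from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("BSgenome")

#Load the BSgenome package
library(BSgenome)

#Download and install the genome of your choice
available.genomes()
#ce10 is an assumed example; use the assembly your SNP coordinates refer to
BiocManager::install("BSgenome.Celegans.UCSC.ce10")

#Load the Celegans data package, sequences are put into memory as needed
library(BSgenome.Celegans.UCSC.ce10)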

#Create data.frame of SNPs
snps <- data.frame(chrom=c("chrI","chrII","chrIV"), start=c(8534533,7338823,
  6938443), end=c(8534534,7338824,6938444))
#   chrom   start     end
# 1  chrI 8534533 8534534
# 2 chrII 7338823 7338824
# 3 chrIV 6938443 6938444

#Get flanking sequence where SNP is in the middle of 20 bp
snp.flanks <- getSeq(Celegans, snps$chrom, start=(snps$end-9),
  end=(snps$end+10))
names(snp.flanks) <- paste(snps$chrom, snps$end, sep = "_")
writeXStringSet(snp.flanks, "Celegans.snp.flanks.fasta")
#Header lines of the resulting FASTA file (sequence lines not shown)
#   V1
# 1 >chrI_8534534
# 3 >chrII_7338824
# 5 >chrIV_6938444
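
The window arithmetic and the restriction-site screen that motivated this post can be sketched in base R, without Bioconductor. This is only a minimal illustration under stated assumptions: `flank_window()` and `find_sites()` are hypothetical helper names, `toy_seq` is a made-up 20 bp string rather than real C. elegans sequence, and GAATTC (the EcoRI recognition site) is just one example enzyme.

```r
# Compute a window of `width` bp around each SNP position.
# With an even width the SNP cannot sit exactly in the centre, so it is
# placed at position width/2 of the window (for 20 bp: pos-9 to pos+10,
# the same offsets used in the getSeq() call above).
flank_window <- function(pos, width = 20) {
  left <- width %/% 2 - 1
  data.frame(start = pos - left, end = pos - left + width - 1)
}

# Return the 1-based start positions of a recognition site in a sequence.
# gregexpr() reports non-overlapping matches and -1 when there is none.
find_sites <- function(seq, site) {
  hits <- gregexpr(site, seq, fixed = TRUE)[[1]]
  as.integer(hits[hits > 0])
}

# Windows for the three SNP positions from the data.frame above
flank_window(c(8534534, 7338824, 6938444))

# Toy example: a made-up 20 bp flank (NOT real genomic sequence)
toy_seq <- "ACGTGAATTCAGGCTTAGCA"
find_sites(toy_seq, "GAATTC")  # EcoRI site starts at position 5
```

Because GAATTC is palindromic, scanning one strand is enough. For a non-palindromic site you would also need to search for its reverse complement. On real data, the strings returned by `as.character(snp.flanks)` could be passed to `find_sites()` in the same way.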

Dallas R Users Group Baseball Data Dive

This past Saturday I led a data dive workshop for the Dallas R Users Group using Lahman’s baseball statistics. After providing a brief introduction to the Lahman R package and showing how to load the data and make some basic plots, I had the ~20 people in attendance begin working on the following questions:

Visualize how the game of baseball has changed over the years.
Visualize a meaningful statistic on the US map.

Is winning the World Series becoming less predictable?
Your friend Peter Daisy likes to bet on baseball games. He asserts that the best predictor of Division Winners is ERA. Is he right? If not, what is the single best predictor of Division Winners?

The consultant. Nolan Ryan and Ron Washington just called and asked for your expert advice. They are going to focus on improving three statistics this next season. What should they be and why?
The agent. You found an athlete who wants to apply his talents to the game of baseball. He is right-handed, 5 feet 8 inches tall, and weighs 165 lbs. Which position makes the most sense for him to start learning and why?
The general manager. MLB has allowed you and Mark Cuban to form an American League expansion team. Mark wants you to choose the three starting outfield players. You can have any current player you want, but Mark says you can’t spend more than 15 million combined. He expects you to balance offensive and defensive performance with these players. Which players do you pick and why?
The parent. Your son is a pitcher and wants to play baseball at the best college for getting into the big leagues. Which college should he attend and why?

The idea wasn’t to complete all of the questions, but to choose one or two of interest. Most of the participants were new to R and focused on visualizing how baseball has changed over the years. Some of the more experienced R users took on the agent and general manager questions. Since the questions were somewhat open-ended, it was fun to see the different approaches and R packages people used.
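
As one possible starting point for the Division Winners question, here is a minimal sketch that compares candidate statistics by the AIC of a one-variable logistic regression. The helper name `rank_predictors()`, the use of AIC as the yardstick, and the particular candidate columns (`ERA`, `R`, `RA`, `HR`) are assumptions made for illustration, not results or methods from the workshop. It relies on the `Teams` table from the Lahman package, where `DivWin` is coded "Y" or "N".

```r
# Rank candidate predictors of a "Y"/"N" outcome by the AIC of a
# one-variable logistic regression (lower AIC = better fit).
rank_predictors <- function(teams, predictors, outcome = "DivWin") {
  won <- as.integer(teams[[outcome]] == "Y")
  aic <- vapply(predictors, function(p) {
    fit <- glm(won ~ teams[[p]], family = binomial)
    AIC(fit)
  }, numeric(1))
  sort(aic)
}

# Only runs if the Lahman package is installed
if (requireNamespace("Lahman", quietly = TRUE)) {
  # Keep seasons where a division winner is recorded
  teams <- subset(Lahman::Teams, !is.na(DivWin))
  print(rank_predictors(teams, c("ERA", "R", "RA", "HR")))
}
```

A lower AIC only says that one statistic fits division titles better than another in a single-variable model. It does not account for era effects or for correlations between the statistics, so treat the ranking as a first look rather than an answer.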

Feel free to reply with your answers to any of these questions!

When feedback is worthless

I just finished teaching another semester of Anatomy and Physiology, which means students will soon be evaluating my course. Well, about 20% of my students will be evaluating my course. Why so little feedback?

About a year ago the University decided to switch from in-class paper evaluations to an online system. Online is always better, right? Wrong.

While going to an online system may decrease the work of a few administrators, it renders the whole evaluation worthless because so few students choose to respond to the optional online survey. The students who do respond likely either really liked or hated you. Talk about biased results! I wonder how much money is spent on this now useless form of feedback.

Take home message: don’t waste resources collecting data that is worthless upon arrival.