counting reads per gene with genomic ranges

3/25/2014

files provided with this post:

aligned.bam (file type: bam, file size: 5552 kb)
chr4.gtf.gz (file type: gz, file size: 46 kb)
aligned.htseq.counts (file type: counts, file size: 0 kb)

in a previous post, we counted the number of aligned sequences that overlapped with a list of genes using htseq. here, we will show how to do the same thing using the genomic ranges bioconductor package.

to get started, first download the aligned sequence reads and the genomic annotation set provided on this blog post. the data is a subset of the data found in the pasilla bioconductor package.

 wget http://www.weebly.com/uploads/2/6/8/5/26850053/aligned.bam 
wget http://www.weebly.com/uploads/2/6/8/5/26850053/chr4.gtf.gz

the aligned sequence reads are stored in the binary sequence alignment/map (bam) format, and the genomic annotation set is in the gtf format.
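
for orientation, a gtf file is plain text with nine tab-separated columns, where column 3 is the feature type and column 9 holds the attributes. the sketch below is only an illustration: it builds a three-line mock gtf (the coordinates, ids, and gene names are made up, not taken from chr4.gtf.gz) and prints the gene name of every exon record, which mirrors the exon filter applied later in R. the function name exon_genes is hypothetical.

```shell
#!/bin/sh
# build a tiny mock gtf; every value below is hypothetical
mock_gtf=$(mktemp)
printf 'chr4\tmock\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "geneA";\n'  > "$mock_gtf"
printf 'chr4\tmock\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; gene_name "geneA";\n'  >> "$mock_gtf"
printf 'chr4\tmock\texon\t500\t650\t.\t-\t.\tgene_id "G2"; gene_name "geneB";\n' >> "$mock_gtf"

# exon_genes FILE: print the gene_name attribute of each exon line
exon_genes() {
    awk -F '\t' '$3 == "exon" {
        if (match($9, /gene_name "[^"]+"/)) {
            # skip the 11 characters of the key plus opening quote, drop the closing quote
            print substr($9, RSTART + 11, RLENGTH - 12)
        }
    }' "$1"
}

exon_genes "$mock_gtf"
```

to try this on the real annotation, decompress it first (for example with gunzip -c chr4.gtf.gz > chr4.gtf) and pass the resulting file to the function.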

next, open up R and install the genomic ranges package:

 source("http://bioconductor.org/biocLite.R") 
biocLite("GenomicRanges", dependencies=TRUE)

we will also need rsamtools, rtracklayer, and genomic alignments.

 source("http://bioconductor.org/biocLite.R") 
biocLite(c("Rsamtools", "rtracklayer", "GenomicAlignments"), dependencies=TRUE)

just like htseq, genomic ranges includes three overlap resolution modes that dictate how aligned reads that overlap more than one genomic feature are treated. the three overlap resolution modes are { union, intersection-strict, intersection-nonempty }. union only counts reads that overlap any portion of exactly one feature (i.e., reads that overlap multiple features are discarded); intersection-strict only counts reads that fall completely within exactly one feature (e.g., if a feature has coordinates 1-10, and the read overlaps coordinates 6-11, the read is not counted for that feature, since the read extends beyond the coordinates of the feature); and, lastly, intersection-nonempty only counts reads that fall within a unique disjoint region of a feature (in the case of partially overlapping features, if a read overlaps with at least one base that is unique to one of the features, the read is counted for that feature). in the summarizeOverlaps function used below, these modes are named "Union", "IntersectionStrict", and "IntersectionNotEmpty". a figure depicting these overlap resolution modes is available here. for a full list of genomic ranges counting options, consult the vignette.
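
to make the first two modes concrete, here is a minimal shell sketch. it is not how htseq or genomic ranges are implemented; it only applies the union and intersection-strict rules to one read at a time. the two features, their coordinates, and the function name assign_read are hypothetical, and reads that match no feature or several features are both reported as "discarded".

```shell
#!/bin/sh
# toy feature table: name, start, end (1-based, inclusive); all values are made up
features='geneA 1 10
geneB 8 20'

# assign_read MODE START END
# MODE is "union" or "strict"; prints the single matching feature, or "discarded"
assign_read() {
    printf '%s\n' "$features" | awk -v mode="$1" -v rs="$2" -v re="$3" '
        {
            overlaps = (rs <= $3 && re >= $2)   # read shares at least one base with the feature
            within   = (rs >= $2 && re <= $3)   # read lies entirely inside the feature
            if ((mode == "union" && overlaps) || (mode == "strict" && within)) {
                hits++
                name = $1
            }
        }
        END { if (hits == 1) print name; else print "discarded" }'
}

assign_read union 2 5      # overlaps geneA only
assign_read union 6 11     # overlaps both geneA and geneB
assign_read strict 6 11    # not fully contained in either feature
assign_read strict 12 15   # fully contained in geneB only
```

the third call reproduces the example from the text: a read spanning 6-11 is not counted for a feature spanning 1-10 in strict mode.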

we are now ready to use genomic ranges to generate a list of counts per gene.

 library(GenomicRanges) 
library(GenomicAlignments) 
library(Rsamtools) 
library(rtracklayer) 
gtf <- import("chr4.gtf.gz",                            # import gtf file 
        asRangedData=FALSE)                             # return GRanges object instead of a RangedData obj 
idx <- mcols(gtf)$type == "exon"                        # find lines in gtf that are exons only 
exons <- gtf[idx]                                       # create table composed of only exons 
genes <- split(exons, mcols(exons)$gene_name)           # split by gene name 
params <- ScanBamParam(                                 # set bam file params 
        flag=scanBamFlag(isUnmappedQuery=FALSE),        # only consider mapped reads 
        tag="NH")                                       # include NH tag, which reports the number of alignments per read 
bam <- readGAlignments("aligned.bam",                   # read bam file 
        param=params)                                   # include params 
unique_hits <- bam[mcols(bam)$NH == 1]                  # remove multimapping reads 
counts <- summarizeOverlaps(                            # summarize overlaps function 
        features=genes,                                 # count reads per gene 
        reads=unique_hits,                              # the data to be counted 
        mode="Union",                                   # use the union mode 
        ignore.strand=TRUE,                             # data is not stranded 
        SingleEnd=TRUE,                                 # data is single-end 
        param=params) 
count_table <- assays(counts, withDimnames=TRUE)$counts # create table of counts 
write.table(count_table,                                # write the count table to disk 
        file="aligned.granges.counts",                  # save as file name 'aligned.granges.counts' 
        sep = "\t",                                     # output as tab delimited 
        row.names=TRUE,                                 # include the row (gene) names in output 
        col.names=FALSE,                                # don't write column names 
        quote=FALSE)                                    # don't place double quotes around factor or characters

the output is a table in the same format as htseq. examine the first five lines of the output file from within R by calling head through the system command:

 system("head -n 5 aligned.granges.counts") 
# Actbeta       0 
# Ank   6750 
# Arf102F       0 
# Asator        205 
# ATPsyn-beta   0

to see how the counting from htseq compares to genomic ranges, we can make a bland-altman plot (also known as a 'mean-difference plot').

 htseq_file <- "http://www.weebly.com/uploads/2/6/8/5/26850053/aligned.htseq.counts" 
htseq_counts <- head(read.table(file=htseq_file,                # read htseq count file 
        sep='\t',                                               # the file is tab delimited 
        header=FALSE,                                           # the file has no header 
        row.names=1,                                            # the first column is the row (gene) names 
        col.names=c("gene", "count")),                          # provide column names 
        -5)                                                     # remove the last 5 lines, which are htseq special counters 
granges_counts <- read.table(file="aligned.granges.counts",     # read genomic ranges count file 
        sep='\t', 
        header=FALSE, 
        row.names=1, 
        col.names=c("gene", "count")) 
htseq_sorted <- htseq_counts[order(row.names(htseq_counts)),]           # sort the htseq count file by row (gene) name 
granges_sorted <- granges_counts[order(row.names(granges_counts)),]     # sort the genomic ranges count file by row (gene) name 
 
md.plot <- function(x,y, xlab="mean",                   # create a bland-altman function with defaults 
        ylab="difference",main="bland-altman plot") { 
        mean <- (x+y)/2                                 # mean between x and y 
        difference <- x-y                               # difference between x and y 
        plot(mean, difference,                          # plot x=mean, y=difference 
                xlab=xlab, 
                ylab=ylab, 
                main=main,                              # the plot title 
                pch=20)                                 # point type as solid 
        abline(h=0)                                     # draw a horizontal line at y=0 
} 
 
md.plot(htseq_sorted, granges_sorted,                           # plot the mean-difference between the htseq and genomic ranges counts 
        xlab="mean of htseq and granges counts",                # x-axis label 
        ylab="difference between htseq and granges counts")     # y-axis label

a bland-altman plot shows the degree of agreement between two measurements by plotting their difference against their mean. as can be seen, there is no difference between the two methods: the counts per gene as determined by htseq are the same as the counts per gene as determined by genomic ranges.
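
the same agreement can also be checked without a plot. the sketch below is one possible way, assuming two tab-delimited count files with the gene name in the first column and the count in the second. the function name mean_diff and the mock file contents are hypothetical. it joins the two files on gene name and prints the mean and the difference for each gene; lines that appear in only one file (such as the htseq special counters) are dropped by the join.

```shell
#!/bin/sh
# mean_diff FILE_A FILE_B: print gene, mean, and difference for genes present in both files
mean_diff() {
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    LC_ALL=C sort -k1,1 "$1" > "$sorted_a"      # join needs both inputs sorted on the key
    LC_ALL=C sort -k1,1 "$2" > "$sorted_b"
    LC_ALL=C join "$sorted_a" "$sorted_b" |
        awk '{ printf "%s\t%s\t%s\n", $1, ($2 + $3) / 2, $2 - $3 }'
    rm -f "$sorted_a" "$sorted_b"
}

# mock count tables; the values are for illustration only
a=$(mktemp)
b=$(mktemp)
printf 'Ank\t6750\nAsator\t205\nActbeta\t0\n' > "$a"
printf 'Actbeta\t0\nAnk\t6750\nAsator\t205\n' > "$b"

mean_diff "$a" "$b"
# number of genes whose counts disagree between the two files
mean_diff "$a" "$b" | awk '$3 != 0' | wc -l
```

with the real files, the two arguments would be the htseq count file and aligned.granges.counts; a final count of zero corresponds to every point sitting on the horizontal line of the plot.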

creating xkcd-styled plots in r

3/20/2014

xkcd is a popular webcomic created by randall munroe. here, we will show how to create xkcd-styled r plots using the xkcd package, which provides a set of ggplot2 functions for plotting data in an xkcd style.

note: if R is not installed on your system, you can download and install a precompiled binary distribution here. to get started, load up r and then install the xkcd package:

 install.packages("xkcd", dependencies=T)

once the package has been installed, you can load the package by typing:

 library(xkcd)

next, we need to install two additional fonts.

to install the fonts on linux:

 library(sysfonts) 
system("mkdir -p ~/.fonts") 
download.file("http://simonsoftware.se/other/xkcd.ttf", dest="~/.fonts/xkcd.ttf", mode="wb") 
download.file("http://dl.dropbox.com/u/12305244/Humor-Sans.ttf", dest="~/.fonts/Humor-Sans.ttf", mode="wb") 
font.paths("~/.fonts") 
font.add("xkcd", regular = "xkcd.ttf") 
font.add("Humor Sans", regular = "Humor-Sans.ttf")

to install the fonts on mac:

 library(sysfonts) 
download.file("http://simonsoftware.se/other/xkcd.ttf", dest="~/Library/Fonts/xkcd.ttf", mode="wb") 
download.file("http://dl.dropbox.com/u/12305244/Humor-Sans.ttf", dest="~/Library/Fonts/Humor-Sans.ttf", mode="wb") 
font.add("xkcd", regular = "xkcd.ttf") 
font.add("Humor Sans", regular = "Humor-Sans.ttf")

close and restart R.

creating xkcd-styled scatterplots
we will use the mtcars dataset, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

 attach(mtcars) 
head(mtcars) 
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb 
#Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 
#Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 
#Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 
#Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 
#Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 
#Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

to create a basic scatterplot using ggplot:

 library(ggplot2) 
p <- ggplot(data=mtcars, aes(x=wt, y=mpg)) +  
    geom_point(shape=1)                                                 # use hollow circles 
print(p)

to create an xkcd-stylized scatterplot:

 library(xkcd) 
xrange <- range(mtcars$wt) 
yrange <- range(mtcars$mpg) 
 
p1 <- ggplot(data=mtcars, aes(x=wt, y=mpg)) +  
        geom_point(shape=20) +                              # use solid circles 
        xkcdaxis(xrange,yrange) +                           # plot the xkcd-styled axis 
        xlab("weight in thousands of pounds") +             # label the x-axis 
        ylab("miles per gallon")                            # label the y-axis 
print(p1) 
 
p2 <- ggplot(data=mtcars, aes(x=wt, y=mpg)) +  
        geom_point(shape=20) + 
        xkcdaxis(xrange,yrange) + 
        geom_smooth(method=lm,                              # add linear regression line 
                color="black",                              # color the line black 
                se=FALSE) +                                 # turn off shaded confidence region 
        xlab("weight in thousands of pounds") + 
        ylab("miles per gallon") 
print(p2) 
 
p3 <- ggplot(data=mtcars, aes(x=wt, y=mpg)) +  
        geom_point(shape=20,  
                aes(color=as.character(vs))) +              # color by engine type (v-shaped or straight) 
        xkcdaxis(xrange,yrange) + 
        geom_smooth(method=lm, 
                color="black", 
                se=FALSE) + 
        xlab("weight in thousands of pounds") + 
        ylab("miles per gallon") + 
        theme(legend.position="top",                        # move legend to top 
                legend.title=element_blank()) +             # remove legend title 
        scale_colour_manual(values = c("black", "red"),     # set legend colors 
                labels=c("v-engine", "straight engine"))    # change legend labels 
print(p3)

creating xkcd-styled bar and line graphs
to create a basic bar or line graph using the mtcars dataset:

 library(ggplot2) 
attach(mtcars) 
counts <- table(gear)                                                           # count the number of cars per gear 
df <- as.data.frame.table(counts,                                       # convert the count table to a dataframe 
        responseName = "freq") 
               
df1 <- as.data.frame.table(table(vs, gear),                     # create a dataframe of car by gears and engine type 
        responseName = "freq") 
         
# basic bar graph 
p1 <- ggplot(data=df, aes(x=gear, y=freq)) + 
        geom_bar(stat="identity") 
print(p1) 
 
# basic 2-variable bar graph 
p2 <- ggplot(data=df1, aes(x=gear, y=freq, fill=vs)) +  
        geom_bar(stat="identity")  
print(p2) 
 
# basic line graph 
p3 <- ggplot(data=df, aes(x=gear, y=freq, group=1)) +  
        geom_line()  
print(p3)

to create an xkcd-stylized bar or line plot:

 # bar graph 
df$xmin <- as.numeric(df$gear) - 0.1        # where each bar should start on the x-axis 
df$xmax <- as.numeric(df$gear) + 0.1        # where each bar should end on the x-axis 
df$ymin <- 1                                # where each bar should start on the y-axis 
df$ymax <- df$freq                          # where each bar should end on the y-axis 
xrange <- range(min(df$xmin) - 0.1,         # specify the range of the x-axis 
        max(df$xmax) + 0.1) 
yrange <- range(min(df$ymin),               # specify the range of the y-axis 
        max(df$ymax) + 1) 
mapping <- aes(xmin=xmin,ymin=ymin,xmax=xmax,ymax=ymax) 
 
p1 <- ggplot(data=df, aes(x=gear, y=freq)) + 
        xkcdrect(mapping,df) +              # xkcd function to plot the bar shapes 
        xkcdaxis(xrange,yrange) +  
        xlab("number of gears") +  
        ylab("frequency") +  
        scale_x_discrete(labels=c(as.character(df$gear))) 
print(p1) 
 
df$xmin <- as.numeric(df$gear) - 0.4        # make the bars wider 
df$xmax <- as.numeric(df$gear) + 0.4        # make the bars wider 
df$ymin <- 1 
df$ymax <- df$freq 
xrange <- range(min(df$xmin) - 0.1,  
        max(df$xmax) + 0.1) 
yrange <- range(min(df$ymin),  
        max(df$ymax) + 1) 
mapping <- aes(xmin=xmin,ymin=ymin,xmax=xmax,ymax=ymax) 
p2 <- ggplot(data=df, aes(x=gear, y=freq)) +  
        xkcdrect(mapping,df) + 
        xkcdaxis(xrange,yrange) +  
        xlab("number of gears") +  
        ylab("frequency") +  
        scale_x_discrete(labels=c(as.character(df$gear))) 
print(p2) 
 
# 2-variable bar graph 
vs0 <- subset(df1, vs=="0")                                                     # subset the df1 dataframe to include only vs=0  
vs1 <- subset(df1, vs=="1")                                                     # subset the df1 dataframe to include only vs=1  
vs0$xmin <- as.numeric(vs0$gear) - 0.4 
vs0$xmax <- as.numeric(vs0$gear) + 0.4 
vs0$ymin <- 0 
vs0$ymax <- 0 
vs0[vs0$vs=="0", ]$ymin <- 1 
vs0[vs0$vs=="0", ]$ymax <- vs0[vs0$vs=="0", ]$freq 
vs1$xmin <- as.numeric(vs1$gear) - 0.4 
vs1$xmax <- as.numeric(vs1$gear) + 0.4 
vs1$ymin <- 0 
vs1$ymax <- 0 
vs1[vs1$vs=="1", ]$ymin <- vs0[vs0$vs=="0", ]$freq 
vs1[vs1$vs=="1", ]$ymax <-  vs1[vs1$vs=="1", ]$freq + vs0[vs0$vs=="0", ]$freq 
xrange <- range(min(rbind(vs0, vs1)$xmin) - 0.1,  
        max(rbind(vs0, vs1)$xmax) + 0.1) 
yrange <- range(min(rbind(vs0, vs1)$ymin),  
        max(rbind(vs0, vs1)$ymax) + 1) 
mapping <- aes(xmin=xmin,ymin=ymin,xmax=xmax,ymax=ymax) 
p3 <- ggplot(data=vs0, aes(x=gear, y=freq)) +  
        xkcdrect(mapping,vs0, size=1.8) +       # the size controls the distance jitter 
        xkcdaxis(xrange,yrange) +               # and therefore the separation between the v0 and v1 bars 
        xlab("number of gears") +  
        ylab("frequency") +  
        geom_line(aes(0, 0, color="v-engine")) +  
        scale_x_discrete(labels=c(as.character(vs1$gear))) + 
        theme(legend.position="top",                                    # move legend to top 
                legend.title=element_blank())                           # remove legend title 
p3 <- p3 + xkcdrect(mapping,vs1,fill="#EA8689") +  
        geom_line(aes(0, 0, color="straight-engine")) + 
    scale_color_manual(values=c("v-engine"="grey20", "straight-engine"="#EA8689")) 
print(p3) 
 
# line graph 
xrange <- range(1:length(df$gear)) 
yrange <- range(df$freq) 
p4 <- ggplot(data=df, aes(x=gear, y=freq, group=1)) +  
        geom_line() +  
        xkcdaxis(xrange,yrange) + 
        xlab("number of gears") +  
        ylab("frequency")  
print(p4)

creating xkcd-styled pie plots
to create a basic pie plot using a mock dataset:

 df = data.frame(count=c(25, 75), 
        category=c("A", "B")) 
 
# basic pie chart 
p1 <- ggplot(df, aes(x = factor(1), fill = category, weight=count)) + 
        geom_bar(width = 1) +  
        coord_polar(theta="y") +  
        scale_x_discrete("") +                      # remove the x-axis label 
        theme(axis.ticks = element_blank(),         # remove tick marks 
                axis.text.y = element_blank())      # remove y axis marks 
print(p1) 
 
# donut chart 
df$fraction = df$count / sum(df$count)      # create fraction column 
df = df[order(df$fraction), ]               # sort dataframe by fraction 
df$ymax = cumsum(df$fraction)               # set end for each fraction 
df$ymin = c(0, head(df$ymax, n=-1))         # set start for each fraction 
p2 <- ggplot(df, aes(fill=category, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) + 
        geom_rect() + 
        coord_polar(theta="y") + 
        xlim(c(0, 4)) + 
        theme(panel.grid=element_blank()) +     # remove grid from plot 
        theme(axis.ticks=element_blank()) 
print(p2)

to create an xkcd-stylized pie plot:

 # xkcd pie chart 
p1 <- ggplot(df, aes(x = factor(1), fill = category, weight=count)) + 
        geom_bar(width = 1, colour="grey30") +  
        coord_polar(theta="y") + 
        scale_x_discrete("") + 
        theme_xkcd() +                              # use the xkcd theme 
        theme(axis.ticks = element_blank(),  
                axis.text.y = element_blank()) + 
        theme(axis.text = element_text(family = "Humor Sans")) + 
        scale_fill_manual(values=c("white", "black"))  
print(p1) 
 
# xkcd donut chart 
df$fraction = df$count / sum(df$count) 
df = df[order(df$fraction), ] 
df$ymax = cumsum(df$fraction) 
df$ymin = c(0, head(df$ymax, n=-1)) 
p2 <- ggplot(df, aes(fill=category, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) + 
        geom_rect(colour="grey30") + 
        coord_polar(theta="y") + 
        xlim(c(0, 4)) + 
        theme_xkcd() + 
        theme(panel.grid=element_blank(), 
                axis.ticks=element_blank()) + 
        scale_fill_manual(values=c("white", "black")) + 
        theme(axis.text = element_text(family = "Humor Sans")) 
print(p2)

creating xkcd-styled histograms and density plots
to create a basic histogram using ggplot:

  
bmi <- rnorm(n=1000, mean=24.2, sd=2.2)         # simulate 1000 body mass index values 
histinfo <- hist(bmi, plot=FALSE)               # compute histogram bins without plotting 
 
# basic frequency histogram 
p1 <- ggplot(as.data.frame(bmi), aes(x=bmi)) +  
        geom_histogram(breaks=c(seq(15, 31))) 
print(p1) 
 
# with normal curve 
p2 <- ggplot(as.data.frame(bmi), aes(x=bmi)) +  
        geom_histogram(breaks=c(seq(15, 31))) + 
        stat_function(fun=function(x, mean, sd, n){ n * dnorm(x = x, mean = mean, sd = sd) },  
                args = list(mean = mean(bmi),           # scale the normal density by n (the bins are 1 unit wide) 
                        sd = sd(bmi),  
                        n = length(bmi))) 
print(p2) 
 
# density plot 
p3 <- ggplot(as.data.frame(bmi), aes(x=bmi)) +  
        geom_density() 
print(p3)

although there is no histogram function in the xkcd package, we can (kind of) create one like so:

  
# histogram 
data <- data.frame(freq=histinfo$counts) 
data$xmin <- head(histinfo$breaks, n=-1)        # left edge of each bin 
data$xmax <- tail(histinfo$breaks, n=-1)        # right edge of each bin 
data$ymin <- 0 
data$ymax <- data$freq 
xrange <- range(min(data$xmin) - 0.1, max(data$xmax) + 0.1) 
yrange <- range(min(data$ymin), max(data$ymax) ) 
mapping <- aes(xmin=xmin,ymin=ymin,xmax=xmax,ymax=ymax) 
p1 <- ggplot() +  
        xkcdrect(mapping,data,fill="forestgreen") +  
                xkcdaxis(xrange,yrange) +  
                xlab("body mass index") +  
                ylab("frequency") 
print(p1) 
 
# with normal curve 
data <- data.frame(freq=histinfo$counts) 
data$xmin <- head(histinfo$breaks, n=-1)        # left edge of each bin 
data$xmax <- tail(histinfo$breaks, n=-1)        # right edge of each bin 
data$ymin <- 0 
data$ymax <- data$freq 
xrange <- range(min(data$xmin) - 0.1, max(data$xmax) + 0.1) 
yrange <- range(min(data$ymin), max(data$ymax) ) 
mapping <- aes(xmin=xmin,ymin=ymin,xmax=xmax,ymax=ymax) 
xfit<-seq(min(bmi),max(bmi),length=length(bmi))  
yfit<-dnorm(xfit,mean=mean(bmi),sd=sd(bmi))  
yfit <- yfit*diff(histinfo$mids[1:2])*length(bmi)       # scale the density to counts (bin width times n) 
normfit <- data.frame(x = c(xfit), y = c(yfit)) 
p2 <- ggplot() +  
        xkcdrect(mapping,data,fill="forestgreen") +  
                xkcdaxis(xrange,yrange) +  
                xlab("body mass index") +  
                ylab("frequency") + 
                geom_point(data = normfit,  
                        aes(x=x, y=y)) 
print(p2) 
 
# density plot 
d <- density(bmi)  
data <- data.frame(x=d$x, y=d$y)                # x positions and estimated density 
xrange <- range(data$x)  
yrange <- range(data$y)  
p3 <- ggplot() +  
        geom_line(data = data,  
                aes(x=x, y=y)) + 
        xkcdaxis(xrange,yrange) + 
        xlab("body mass index") +  
        ylab("density") 
print(p3)

draw a man!
last but not least~!:

 datascaled <- data.frame(x=c(-3,3),y=c(-30,30)) 
xrange <- range(datascaled$x) 
yrange <- range(datascaled$y) 
ratioxy <- diff(xrange) / diff(yrange) 
 
mapping <- aes(x=x, 
        y=y, 
        scale=scale, 
        ratioxy=ratioxy, 
        angleofspine = angleofspine, 
        anglerighthumerus = anglerighthumerus, 
        anglelefthumerus = anglelefthumerus, 
        anglerightradius = anglerightradius, 
        angleleftradius = angleleftradius, 
        anglerightleg =  anglerightleg, 
        angleleftleg = angleleftleg, 
        angleofneck = angleofneck) 
 
dataman <- data.frame( x= c(0), y=c(0),         # x,y position of center of head 
        scale = c(20),                          # size of man in units of Y axis 
        ratioxy = ratioxy,                      # ratio x to y of graph 
        angleofspine = -1,                      # angle of spine 
        anglerighthumerus = 0,                  # angle of right humerus 
        anglelefthumerus = 5,                   # angle of left humerus 
        anglerightradius = 0,                   # angle of right radius 
        angleleftradius = -0.1,                 # angle of left radius 
        angleleftleg = 6,                       # angle of left leg 
        anglerightleg = 3,                      # angle of right leg 
        angleofneck = 5)                        # angle of neck 
 
p <- ggplot(data=datascaled, aes(x=x,y=y)) +  
        geom_point(color="white") + 
        xkcdman(mapping,dataman) +  
        theme_xkcd() + 
        annotate("text", x=2, y = 0, label = "I'm super cool.", family="xkcd") + 
        xlab("") + ylab("") 
print(p) 
 
# to add eyes (because why not) 
eyes <- data.frame(x=c(0, 0.5),y=c(0.8, 0.8)) 
p <- p + geom_point(data=eyes, aes(x=x, y=y), color="black")  
print(p) 
 
# and now a mouth 
mouth <- data.frame(x=c(0.2, 0.3),y=c(-5, -5)) 
p <- p + geom_line(data=mouth, aes(x=x, y=y), color="black") 
print(p)

mass renaming files on linux

3/13/2014

an important and common task in bioinformatics is renaming files. renaming a single file in linux is simple to accomplish, but renaming multiple files is often more arduous -- but it doesn't have to be. here, we will review four methods for renaming files, each one more powerful than the preceding one.

suppose, for example, that we have a set of files from the illumina platform in the fastq format. the latest version of illumina's software, casava v1.8.2, converts per-cycle basecall files into fastq files with the following naming scheme:

 <sample name>_<barcode sequence>_L<lane[0-9]{3}>_R<read number>_<set number[0-9]{3}>.fastq.gz

so, for example, the following is a valid illumina fastq file name:

 SA1_ATCACG_L002_R1_001.fastq.gz
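
as a rough check, the naming scheme can be written as a regular expression. the sketch below makes several simplifying assumptions that are not part of the illumina scheme itself: the sample name contains no underscore, the barcode consists only of the letters A, C, G, T, and N, the read number is 1 or 2, and the lane and set fields are any three digits. the function name is_casava_name is hypothetical.

```shell
#!/bin/sh
# is_casava_name NAME: succeed if NAME matches the simplified naming pattern
is_casava_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[^_]+_[ACGTN]+_L[0-9]{3}_R[12]_[0-9]{3}\.fastq\.gz$'
}

for name in SA1_ATCACG_L002_R1_001.fastq.gz SA1_ATCACG_L2_R1_001.fastq.gz; do
    if is_casava_name "$name"; then
        echo "$name: matches"
    else
        echo "$name: does not match"
    fi
done
```

the second name fails because its lane field is not padded to three digits.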

note that a single illumina fastq file is, by default, divided into a set of files, each of which contains no more than 4 million reads. the different files are distinguished by a 0-padded 3-digit set number ([0-9]{3}). accordingly, a single sample may have numerous fastq files belonging to it, like so:

 sample(SA1) = { X : SA1_ATCACG_L002_R1_001.fastq.gz, SA1_ATCACG_L002_R1_002.fastq.gz, 
SA1_ATCACG_L002_R1_003.fastq.gz, SA1_ATCACG_L002_R1_004.fastq.gz, 
SA1_ATCACG_L002_R1_005.fastq.gz, SA1_ATCACG_L002_R1_006.fastq.gz, 
SA1_ATCACG_L002_R1_007.fastq.gz, SA1_ATCACG_L002_R1_008.fastq.gz, 
SA1_ATCACG_L002_R1_009.fastq.gz, SA1_ATCACG_L002_R1_010.fastq.gz, 
SA1_ATCACG_L002_R1_011.fastq.gz, SA1_ATCACG_L002_R1_012.fastq.gz, 
SA1_ATCACG_L002_R1_013.fastq.gz, SA1_ATCACG_L002_R1_014.fastq.gz, 
SA1_ATCACG_L002_R1_015.fastq.gz, SA1_ATCACG_L002_R1_016.fastq.gz, 
SA1_ATCACG_L002_R1_017.fastq.gz, SA1_ATCACG_L002_R1_018.fastq.gz, 
SA1_ATCACG_L002_R1_019.fastq.gz, SA1_ATCACG_L002_R1_020.fastq.gz, 
SA1_ATCACG_L002_R1_021.fastq.gz, SA1_ATCACG_L002_R1_022.fastq.gz, 
SA1_ATCACG_L002_R1_023.fastq.gz, SA1_ATCACG_L002_R1_024.fastq.gz, 
SA1_ATCACG_L002_R1_025.fastq.gz, SA1_ATCACG_L002_R1_026.fastq.gz, 
SA1_ATCACG_L002_R1_027.fastq.gz, SA1_ATCACG_L002_R1_028.fastq.gz, 
SA1_ATCACG_L002_R1_029.fastq.gz, SA1_ATCACG_L002_R1_030.fastq.gz }

now, further suppose that we would like to replace every instance of 'SA1' with 'NA10831' (say, because when we submitted the samples for sequencing, we provided sample names in code to mask the true identity of the sample, for privacy reasons).

before we delve into the four methods of renaming files: if you would like to try these methods on the above hypothetical set of fastq files, you can create mock files with the 'touch' command by executing the following code in a bash shell:

 mkdir renaming_files ; cd renaming_files 
for l in $(seq 1 9) ; do touch SA1_ATCACG_L002_R1_00$l.fastq.gz ; done 
for l in $(seq 10 30) ; do touch SA1_ATCACG_L002_R1_0$l.fastq.gz ; done

don't worry -- the files will be empty!
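
as an aside, the two loops above can be folded into one by letting printf do the zero padding. this is only a sketch of an alternative; the function name make_mock_fastq and its two parameters are hypothetical, and it writes into a temporary directory rather than into renaming_files.

```shell
#!/bin/sh
# make_mock_fastq PREFIX COUNT: create COUNT empty files with 0-padded set numbers
make_mock_fastq() {
    i=1
    while [ "$i" -le "$2" ]; do
        # %03d pads the set number to three digits (001, 002, ..., 030)
        touch "$(printf '%s_ATCACG_L002_R1_%03d.fastq.gz' "$1" "$i")"
        i=$((i + 1))
    done
}

demo_dir=$(mktemp -d)                       # scratch directory for the mock files
( cd "$demo_dir" && make_mock_fastq SA1 30 )
ls "$demo_dir" | wc -l                      # number of files created
```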

method 1: use the 'mv' command
the 'mv' command can be used to rename files.

 syntax: mv [options] oldname newname

to rename the first item in our sample set, we would do the following:

 mv SA1_ATCACG_L002_R1_001.fastq.gz NA10831_ATCACG_L002_R1_001.fastq.gz

in order to rename all of the files in our sample set, we would have to type the 'mv' command 30 (!) times:

 mv SA1_ATCACG_L002_R1_001.fastq.gz NA10831_ATCACG_L002_R1_001.fastq.gz 
mv SA1_ATCACG_L002_R1_002.fastq.gz NA10831_ATCACG_L002_R1_002.fastq.gz 
mv SA1_ATCACG_L002_R1_003.fastq.gz NA10831_ATCACG_L002_R1_003.fastq.gz 
mv SA1_ATCACG_L002_R1_004.fastq.gz NA10831_ATCACG_L002_R1_004.fastq.gz 
mv SA1_ATCACG_L002_R1_005.fastq.gz NA10831_ATCACG_L002_R1_005.fastq.gz 
mv SA1_ATCACG_L002_R1_006.fastq.gz NA10831_ATCACG_L002_R1_006.fastq.gz 
mv SA1_ATCACG_L002_R1_007.fastq.gz NA10831_ATCACG_L002_R1_007.fastq.gz 
mv SA1_ATCACG_L002_R1_008.fastq.gz NA10831_ATCACG_L002_R1_008.fastq.gz 
mv SA1_ATCACG_L002_R1_009.fastq.gz NA10831_ATCACG_L002_R1_009.fastq.gz 
mv SA1_ATCACG_L002_R1_010.fastq.gz NA10831_ATCACG_L002_R1_010.fastq.gz 
mv SA1_ATCACG_L002_R1_011.fastq.gz NA10831_ATCACG_L002_R1_011.fastq.gz 
mv SA1_ATCACG_L002_R1_012.fastq.gz NA10831_ATCACG_L002_R1_012.fastq.gz 
mv SA1_ATCACG_L002_R1_013.fastq.gz NA10831_ATCACG_L002_R1_013.fastq.gz 
mv SA1_ATCACG_L002_R1_014.fastq.gz NA10831_ATCACG_L002_R1_014.fastq.gz 
mv SA1_ATCACG_L002_R1_015.fastq.gz NA10831_ATCACG_L002_R1_015.fastq.gz 
mv SA1_ATCACG_L002_R1_016.fastq.gz NA10831_ATCACG_L002_R1_016.fastq.gz 
mv SA1_ATCACG_L002_R1_017.fastq.gz NA10831_ATCACG_L002_R1_017.fastq.gz 
mv SA1_ATCACG_L002_R1_018.fastq.gz NA10831_ATCACG_L002_R1_018.fastq.gz 
mv SA1_ATCACG_L002_R1_019.fastq.gz NA10831_ATCACG_L002_R1_019.fastq.gz 
mv SA1_ATCACG_L002_R1_020.fastq.gz NA10831_ATCACG_L002_R1_020.fastq.gz 
mv SA1_ATCACG_L002_R1_021.fastq.gz NA10831_ATCACG_L002_R1_021.fastq.gz 
mv SA1_ATCACG_L002_R1_022.fastq.gz NA10831_ATCACG_L002_R1_022.fastq.gz 
mv SA1_ATCACG_L002_R1_023.fastq.gz NA10831_ATCACG_L002_R1_023.fastq.gz 
mv SA1_ATCACG_L002_R1_024.fastq.gz NA10831_ATCACG_L002_R1_024.fastq.gz 
mv SA1_ATCACG_L002_R1_025.fastq.gz NA10831_ATCACG_L002_R1_025.fastq.gz 
mv SA1_ATCACG_L002_R1_026.fastq.gz NA10831_ATCACG_L002_R1_026.fastq.gz 
mv SA1_ATCACG_L002_R1_027.fastq.gz NA10831_ATCACG_L002_R1_027.fastq.gz 
mv SA1_ATCACG_L002_R1_028.fastq.gz NA10831_ATCACG_L002_R1_028.fastq.gz 
mv SA1_ATCACG_L002_R1_029.fastq.gz NA10831_ATCACG_L002_R1_029.fastq.gz 
mv SA1_ATCACG_L002_R1_030.fastq.gz NA10831_ATCACG_L002_R1_030.fastq.gz

not only is this method tedious, but it is also more error-prone than alternative methods.

method 2: write a bash script
we can wrap the 'mv' command inside of a bash script:

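 for file in *.fastq.gz ; do 
        # keep everything after the first underscore-delimited field 
        suffix=$(echo "$file" | cut -d "_" -f2-) 
        # quoting protects file names that contain spaces 
        mv "$file" "NA10831_$suffix" 
done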

the above script first finds every file ending in the '.fastq.gz' extension, and then uses a for loop to iterate through each item. it extracts the name of the file without the 'SA1' prefix by using the 'cut' command, and stores that result in a variable named 'suffix'. it finishes by using the 'mv' command to rename the file, prepending the new name 'NA10831' to the stored suffix. note two things: 1) there is more than one bash script that would have accomplished this same task, and 2) this script splits the file name by underscore ("_") delimiters, which means that had the sample name 'SA1' contained an underscore, this particular script would not have worked properly. although this method is much less tedious than the previous method, there are easier methods yet.
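
as one example of a different script for the same task, here is a minimal sketch that uses bash parameter expansion instead of 'cut'. because it strips the old sample name as a literal prefix, it does not depend on where the underscores fall. the variable names 'old' and 'new' and the scratch directory are assumptions made for this illustration:

```bash
#!/usr/bin/env bash
# assumption: run in a throwaway directory with three mock files
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for l in 001 002 003 ; do touch "SA1_ATCACG_L002_R1_$l.fastq.gz" ; done

old="SA1"
new="NA10831"
for file in "$old"_*.fastq.gz ; do
        # ${file#"$old"} removes the leading 'SA1' and keeps the rest of the name
        mv -- "$file" "$new${file#"$old"}"
done

ls
```

setting old="SA_1" (with matching file names) would work the same way, since nothing here splits on the underscore.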

method 3: the rename command
the 'rename' command is a c program capable of renaming a set of files with a single command. 'rename' is installed in redhat distributions; ubuntu distributions of linux come with an alternative version of 'rename'. if your version of linux does not have this 'rename' command, you can download it and compile it from source from the util-linux package.

 syntax: rename pattern replacement file...

this 'rename' command replaces the first occurrence of pattern with replacement in each of the given file names. to use the 'rename' command to rename all of the files in our sample set:

 rename SA1 NA10831 *.fastq.gz

which replaces the first occurrence of 'SA1' with 'NA10831' in the names of all files ending in '.fastq.gz'. this command is limited, however, in what it can do.
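
before renaming a large set of files, it can help to preview the result. the sketch below mimics the "first occurrence only" behaviour in plain bash and prints the planned renames without moving anything. 'preview_rename' is a hypothetical helper written for this post, not part of either 'rename' tool, and the second file name is made up to show that a repeated 'SA1' is left alone:

```bash
#!/usr/bin/env bash
# preview only: prints "old -> new" for each name, renames nothing
# assumption: 'preview_rename' is a hypothetical helper, not a real command
preview_rename() {
        local pattern=$1 replacement=$2 file
        shift 2
        for file in "$@" ; do
                # ${file/pattern/replacement} replaces only the first match;
                # quoting the pattern makes bash treat it as literal text
                printf '%s -> %s\n' "$file" "${file/"$pattern"/$replacement}"
        done
}

preview_rename SA1 NA10831 \
        SA1_ATCACG_L002_R1_001.fastq.gz \
        SA1_SA1_L002_R1_002.fastq.gz
```

if the preview looks right, the real 'rename' command can then be run on the same set of files.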

method 4: an alternative rename command
there is an alternative 'rename' command, which is a perl script, and can be seen as an extended version of the command from method 3; it's more powerful, among other reasons, in that it can rename multiple files using perl regular expressions.

to get started, download the perl script and make it executable:

 wget http://plasmasturm.org/code/rename/rename 
chmod +x rename

 syntax: rename [switches|transforms] file... 
syntax: rename [perlexpr] file...

since this script is feature rich, there is more than one way to accomplish the same task. one method of using this command on our sample set of files is with a regular expression:

 /path/to/rename 's/SA1/NA10831/' *.fastq.gz

which, like the 'rename' from method 3, replaces just the first occurrence of 'SA1' with 'NA10831' in the names of all files ending in '.fastq.gz'. a different way of accomplishing the same thing would have been to use the command's built-in switches:

 /path/to/rename --subst SA1 NA10831 *.fastq.gz

this method uses the '--subst' switch, which reduces this rename command to the syntax of method 3. method 4 is extremely powerful (removing extensions, for example, is as easy as using the '--remove-extension' switch), and there are a variety of use cases where this script can be handy. the reader is encouraged to thoroughly read through the 'rename' manual. more use cases will be posted on this blog as they arise.


counting reads per gene with htseq

3/11/2014


aligned.bam
File Size:	5412 kb
File Type:	bam

Download File

chr4.gtf.gz
File Size:	46 kb
File Type:	gz

Download File

htseq is a python package that provides an infrastructure to process and analyze high-throughput sequencing data. here, we will use htseq to count the number of aligned sequence reads that overlap with a list of genomic features.

to get started, first download and install htseq (prerequisites = { python > 2.4 and < 3.0, pysam, numpy, matplotlib }):

 wget --no-check-certificate https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1.tar.gz 
tar -xzf HTSeq-0.6.1.tar.gz  
cd HTSeq-0.6.1 
python setup.py install --user

next, leave the htseq install directory, and download the aligned sequence reads and the genomic annotation set provided on this blog post. the data is a subset of the data found in the pasilla bioconductor package.

 wget http://www.weebly.com/uploads/2/6/8/5/26850053/aligned.bam 
wget http://www.weebly.com/uploads/2/6/8/5/26850053/chr4.gtf.gz

the aligned sequence reads are stored in a binary sequencing alignment/map format, and the genomic annotation set is in the gtf format.

before running htseq, it is important to note that it includes three overlap resolution modes that dictate how aligned reads that overlap more than one genomic feature are treated. the three overlap resolution modes are { union, intersection-strict, intersection-nonempty }. see figure 1 for an illustration of the effect of these three modes on counting. for a full list of htseq counting options, consult the man page.

now we are ready to use htseq to generate a list of counts per genomic feature. in this example, we will generate a list of counts per gene.

 python -m HTSeq.scripts.count --format=bam --minaqual=0 --stranded=no --type=exon --idattr=gene_name --mode=union aligned.bam chr4.gtf.gz

the command tells htseq to only consider exons when counting, and to sum up all the counts by the gene name. the output is a tab-delimited table of read counts for each gene listed in the genomic annotation set, followed by a series of special counters that tally the reads left uncounted for various reasons: { no_feature, ambiguous, too_low_aQual, not_aligned, alignment_not_unique }.

by default, htseq writes the count table to stdout, which is normally displayed in the terminal that ran the command. to redirect the output to a file called 'aligned.htseq.counts', use bash's redirection operator:

 python -m HTSeq.scripts.count --format=bam --minaqual=0 --stranded=no --type=exon --idattr=gene_name --mode=union aligned.bam chr4.gtf.gz > aligned.htseq.counts

using the head command, you can examine the first five lines of the output file:

 head -n 5 aligned.htseq.counts 
# ATPsyn-beta   0 
# Actbeta       0 
# Ank   6750 
# Arf102F       0 
# Asator        205
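
since the count table is plain tab-delimited text, it can be summarized with standard shell tools. the sketch below is a rough illustration on a mock table: the five gene rows are copied from the 'head' output above, while the two special-counter rows, their values, and the '__' prefix on their names are assumptions made here (the pattern accepts the names with or without that prefix):

```bash
#!/usr/bin/env bash
# mock counts file; the last two rows and their values are made up
counts=$(mktemp)
printf '%s\t%s\n' \
        ATPsyn-beta 0 \
        Actbeta 0 \
        Ank 6750 \
        Arf102F 0 \
        Asator 205 \
        __no_feature 10 \
        __ambiguous 2 > "$counts"

# rows matching this pattern are special counters, not genes
special='^(__)?(no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)$'

# sum the per-gene counts, skipping the special counters
assigned=$(awk -F '\t' -v pat="$special" '$1 !~ pat { sum += $2 } END { print sum + 0 }' "$counts")
echo "reads assigned to genes: $assigned"

# list genes with at least one read, highest count first
expressed=$(awk -F '\t' -v pat="$special" '$1 !~ pat && $2 > 0' "$counts" | sort -t $'\t' -k2,2nr)
echo "$expressed"
```

to run this on real output, point 'counts' at 'aligned.htseq.counts' instead of the mock file.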

figure 1: the three overlap modes of htseq. *image from genomic-ranges bioconductor package: http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html
