summarizeOverlaps-methods {GenomicAlignments} | R Documentation |
Perform overlap queries between reads and genomic features
Description
summarizeOverlaps
extends findOverlaps
by providing
options to resolve reads that overlap multiple features.
Usage
## S4 method for signature 'GRanges,GAlignments'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## S4 method for signature 'GRangesList,GAlignments'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## S4 method for signature 'GRanges,GRanges'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## S4 method for signature 'GRangesList,GRanges'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## S4 method for signature 'GRanges,GAlignmentPairs'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## S4 method for signature 'GRangesList,GAlignmentPairs'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## mode funtions
Union(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
IntersectionStrict(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
IntersectionNotEmpty(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
## S4 method for signature 'GRanges,BamFile'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
## S4 method for signature 'BamViews,missing'
summarizeOverlaps(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
Arguments
features |
A GRanges or a GRangesList object of genomic regions of interest. When a GRanges is supplied, each row is considered a feature. When a GRangesList is supplied, each higher list-level is considered a feature. This distinction is important when defining overlaps. When |
reads |
A GRanges, GRangesList
GAlignments, GAlignmentsList,
GAlignmentPairs, BamViews or
BamFileList object that represents the data to be
counted by
|
mode |
The pre-defined options are designed after the counting modes available in the HTSeq package by Simon Anders (see references).
|
ignore.strand |
A logical indicating if strand should be considered when matching. |
inter.feature |
(Default TRUE) A logical indicating if the counting There are 6 possible combinations of the |
preprocess.reads |
A function applied to the reads before counting. The first argument
should be The distinction between a user-defined 'mode' and user-defined 'preprocess.reads' function is that in the first case the user defines how to count; in the second case the reads are preprocessed before counting with a pre-defined mode. See examples. |
... |
Additional arguments passed to functions or methods called
from within A |
singleEnd |
(Default TRUE) A logical indicating if reads are single or
paired-end. In Bioconductor > 2.12 it is not necessary to sort
paired-end BAM files by |
fragments |
(Default FALSE) A logical; applied to paired-end data only.
When When The term ‘mated pairs’ refers to records paired with the algorithm
described on the |
param |
An optional See |
Details
summarizeOverlaps
offers counting modes to resolve reads that overlap multiple features. Themode
argument defines a set of rules to resolve the read to a single feature such that each read is counted a maximum of once. New to GenomicRanges >= 1.13.9 is theinter.feature
argument which allows reads to be counted for each feature they overlap. Wheninter.feature=TRUE
the counting modes are aware of feature overlap; reads that overlap multiple features are dropped and not counted. Wheninter.feature=FALSE
multiple feature overlap is ignored and reads are counted once for each feature they map to. This essentially reduces modes ‘Union’ and ‘IntersectionStrict’ tocountOverlaps
withtype="any"
, andtype="within"
, respectively. ‘IntersectionNotEmpty’ is not reduced to a derivative ofcountOverlaps
because the shared regions are removed before counting.The
BamViews
,BamFile
andBamFileList
methods summarize overlaps across one or several files. The latter usesbplapply
; control parallel evaluation using theregister
interface in the BiocParallel package.- features :
-
A ‘feature’ can be any portion of a genomic region such as a gene, transcript, exon etc. When the
features
argument is a GRanges the rows define the features. The result will be the same length as the GRanges. Whenfeatures
is a GRangesList the highest list-level defines the features and the result will be the same length as the GRangesList.When
inter.feature=TRUE
, each countmode
attempts to assign a read that overlaps multiple features to a single feature. If there are ranges that should be considered together (e.g., exons by transcript or cds regions by gene) the GRangesList would be appropriate. If there is no grouping in the data then a GRanges would be appropriate. - paired-end reads :
-
Paired-end reads are counted as a single hit if one or both parts of the pair are overlapped. Paired-end records can be counted in a GAlignmentPairs container or BAM file.
Counting pairs in BAM files:
The
singleEnd
argument should be FALSE.When
reads
are supplied as a BamFile or BamFileList, theasMates
argument to the BamFile should be TRUE.When
fragments
is FALSE, aGAlignmentPairs
object is used in counting (pairs only).When
fragments
is TRUE, aGAlignmentsList
object is used in counting (pairs, singletons, unmapped mates, etc.)
Value
A RangedSummarizedExperiment object. The
assays
slot holds the counts, rowRanges
holds the annotation
from features
.
When reads
is a BamFile
or BamFileList
colData
is an empty DataFrame with a single row named ‘counts’. If
count.mapped.reads=TRUE
, colData
holds the output of
countBam
in 3 columns named ‘records’ (total records),
‘nucleotides’ and ‘mapped’ (mapped records).
When features
is a BamViews
colData
includes
2 columns named bamSamples
and bamIndices
.
In all other cases, colData
has columns of ‘object’
(class of reads) and ‘records’ (length of reads
).
Author(s)
Valerie Obenchain
References
HTSeq : http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
htseq-count : http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
See Also
The DESeq2, DEXSeq and edgeR packages.
The RangedSummarizedExperiment class defined in the SummarizedExperiment package.
The GAlignments and GAlignmentPairs classes.
The BamFileList and BamViews classes in the Rsamtools package.
The readGAlignments and readGAlignmentPairs functions.
Examples
reads <- GAlignments(
names = c("a","b","c","d","e","f","g"),
seqnames = Rle(c(rep(c("chr1", "chr2"), 3), "chr1")),
pos = as.integer(c(1400, 2700, 3400, 7100, 4000, 3100, 5200)),
cigar = c("500M", "100M", "300M", "500M", "300M",
"50M200N50M", "50M150N50M"),
strand = strand(rep("+", 7)))
gr <- GRanges(
seqnames = c(rep("chr1", 7), rep("chr2", 4)), strand = "+",
ranges = IRanges(c(1000, 3000, 3600, 4000, 4000, 5000, 5400,
2000, 3000, 7000, 7500),
width = c(500, 500, 300, 500, 900, 500, 500,
900, 500, 600, 300),
names=c("A", "B", "C1", "C2", "D1", "D2", "E", "F",
"G", "H1", "H2")))
groups <- factor(c(1,2,3,3,4,4,5,6,7,8,8))
grl <- splitAsList(gr, groups)
names(grl) <- LETTERS[seq_along(grl)]
## ---------------------------------------------------------------------
## Counting modes.
## ---------------------------------------------------------------------
## First count with a GRanges as the 'features'. 'Union' is the
## most conservative counting mode followed by 'IntersectionStrict'
## then 'IntersectionNotEmpty'.
counts1 <-
data.frame(union=assays(summarizeOverlaps(gr, reads))$counts,
intStrict=assays(summarizeOverlaps(gr, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(gr, reads,
mode="IntersectionNotEmpty"))$counts)
colSums(counts1)
## Split the 'features' into a GRangesList and count again.
counts2 <-
data.frame(union=assays(summarizeOverlaps(grl, reads))$counts,
intStrict=assays(summarizeOverlaps(grl, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(grl, reads,
mode="IntersectionNotEmpty"))$counts)
colSums(counts2)
## The GRangesList ('grl' object) has 8 features whereas the GRanges
## ('gr' object) has 11. The affect on counting can be seen by looking
## at feature 'H' with mode 'Union'. In the GRanges this feature is
## represented by ranges 'H1' and 'H2',
gr[c("H1", "H2")]
## and by list element 'H' in the GRangesList,
grl["H"]
## Read "d" hits both 'H1' and 'H2'. This is considered a multi-hit when
## using a GRanges (each range is a separate feature) so the read was
## dropped and not counted.
counts1[c("H1", "H2"), ]
## When using a GRangesList, each list element is considered a feature.
## The read hits multiple ranges within list element 'H' but only one
## list element. This is not considered a multi-hit so the read is counted.
counts2["H", ]
## ---------------------------------------------------------------------
## Counting multi-hit reads.
## ---------------------------------------------------------------------
## The goal of the counting modes is to provide a set of rules that
## resolve reads hitting multiple features so each read is counted
## a maximum of once. However, sometimes it may be desirable to count
## a read for each feature it overlaps. This can be accomplished by
## setting 'inter.feature' to FALSE.
## When 'inter.feature=FALSE', modes 'Union' and 'IntersectionStrict'
## essentially reduce to countOverlaps() with type="any" and
## type="within", respectively.
## When 'inter.feature=TRUE' only features "A", "F" and "G" have counts.
se1 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=TRUE)
assays(se1)$counts
## When 'inter.feature=FALSE' all 11 features have a count. There are
## 7 total reads so one or more reads were counted more than once.
se2 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=FALSE)
assays(se2)$counts
## ---------------------------------------------------------------------
## Counting BAM files.
## ---------------------------------------------------------------------
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
## (i) Single-end :
## Large files can be iterated over in chunks by setting a
## 'yieldSize' on the BamFile.
bf_s <- BamFile(untreated1_chr4(), yieldSize=50000)
se_s <- summarizeOverlaps(exbygene, bf_s, singleEnd=TRUE)
table(assays(se_s)$counts > 0)
## When a character (file name) is provided as 'reads' instead
## of a BamFile object summarizeOverlaps() will create a BamFile
## and set a reasonable default 'yieldSize'.
## (ii) Paired-end :
## A paired-end file may contain singletons, reads with unmapped
## pairs or reads with more than two fragments. When 'fragments=FALSE'
## only reads paired by the algorithm are included in the counting.
nofrag <- summarizeOverlaps(exbygene, untreated3_chr4(),
singleEnd=FALSE, fragments=FALSE)
table(assays(nofrag)$counts > 0)
## When 'fragments=TRUE' all singletons, reads with unmapped pairs
## and other fragments will be included in the counting.
bf <- BamFile(untreated3_chr4(), asMates=TRUE)
frag <- summarizeOverlaps(exbygene, bf, singleEnd=FALSE, fragments=TRUE)
table(assays(frag)$counts > 0)
## As expected, using 'fragments=TRUE' results in a larger number
## of total counts because singletons, unmapped pairs etc. are
## included in the counting.
## Total reads in the file:
countBam(untreated3_chr4())
## Reads counted with 'fragments=FALSE':
sum(assays(nofrag)$counts)
## Reads counted with 'fragments=TRUE':
sum(assays(frag)$counts)
## ---------------------------------------------------------------------
## Use ouput of summarizeOverlaps() for differential expression analysis
## with DESeq2 or edgeR.
## ---------------------------------------------------------------------
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
recursive=TRUE, pattern="*bam$", full=TRUE)
names(fls) <- basename(fls)
bf <- BamFileList(fls, index=character(), yieldSize=1000)
genes <- GRanges(
seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600,
4000, 7500, 5000, 5400),
width=c(rep(500, 3), 600, 900, 500, 300, 900,
300, 500, 500)))
se <- summarizeOverlaps(genes, bf)
## When the reads are BAM files, the 'colData' contains summary
## information from a call to countBam().
colData(se)
## Start differential expression analysis with the DESeq2 or edgeR
## package:
library(DESeq2)
deseq <- DESeqDataSet(se, design= ~ 1)
library(edgeR)
edger <- DGEList(assays(se)$counts, group=rownames(colData(se)))
## ---------------------------------------------------------------------
## Filter records by map quality before counting.
## (user-supplied 'mode' function)
## ---------------------------------------------------------------------
## The 'mode' argument can take a custom count function whose
## arguments are the same as those in the current counting modes
## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
## In this example records are filtered by map quality before counting.
mapq_filter <- function(features, reads, ignore.strand, inter.feature)
{
require(GenomicAlignments) # needed for parallel evaluation
Union(features, reads[mcols(reads)$mapq >= 20],
ignore.strand, inter.feature)
}
genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
param <- ScanBamParam(what="mapq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
assays(se)$counts
## The count function can be completely custom (i.e., not use the
## pre-defined count functions at all). Requirements are that
## the input arguments match the pre-defined modes and the output
## is a vector of counts the same length as 'features'.
my_count <- function(features, reads, ignore.strand, inter.feature) {
## perform filtering, or subsetting etc.
require(GenomicAlignments) # needed for parallel evaluation
countOverlaps(features, reads)
}
## ---------------------------------------------------------------------
## Preprocessing reads before counting with a standard count mode.
## (user-supplied 'preprocess.reads' function)
## ---------------------------------------------------------------------
## The 'preprocess.reads' argument takes a function that is
## applied to the reads before counting with a pre-defined mode.
ResizeReads <- function(reads, width=1, fix="start", ...) {
reads <- as(reads, "GRanges")
stopifnot(all(strand(reads) != "*"))
resize(reads, width=width, fix=fix, ...)
}
## By default ResizeReads() counts reads that overlap on the 5' end:
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads)
## Count reads that overlap on the 3' end by passing new values
## for 'width' and 'fix':
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads,
width=1, fix="end")
## ---------------------------------------------------------------------
## summarizeOverlaps() with BamViews.
## ---------------------------------------------------------------------
## bamSamples and bamPaths metadata are included in the colData.
## bamExperiment metadata is put into the metadata slot.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
samp <- DataFrame(info="test", row.names="ex1")
view <- BamViews(fl, bamSamples=samp, bamRanges=rngs)
se <- summarizeOverlaps(view, mode=Union, ignore.strand=TRUE)
colData(se)
metadata(se)