GEO数据库目前收录了4348个数据集记录,包含人的数据1772个,小鼠的数据1642个,大鼠的数据360个,其中属于组织样品有1183个,细胞品系有857个。下面,小编就跟大家详细讲解如何利用GEO表达谱数据进行差异表达分析,期待您的评论沟通和转发哦~
一GEO数据下载
打开NCBI官网,选择GEO DataSets,这里我们随便搜一个转录组的数据,如上图所示,由于前面几个GSE所提供的表达量文件不规范,这里我们选择登录号为GSE132287,点击进入。 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132287
下载文件打开后,如图所示:
二数据准备
In [1]:
gene <- read.table('GSE132287_Gene-count-table.xls',header = T, row.names = 1, sep = 't', check.names = FALSE)
head(gene,10)
Out[1]:
gene_name gene_type chr start end strand length MDA-A1 MDA-A2 MDA-A3 MDA-G1 MDA-G2 MDA-G3
ENSG00000000003.14_2 TSPAN6 protein_coding chrX 99882106 99894988 - 4535 2704 3694 2946 3204 2435 3498
ENSG00000000005.5_2 TNMD protein_coding chrX 99839799 99854882 + 1610 0 0 0 0 0 0
ENSG00000000419.12_2 DPM1 protein_coding chr20 49551404 49575092 - 1207 4045 5254 4031 4740 3872 4867
ENSG00000000457.13_2 SCYL3 protein_coding chr1 169818772 169863408 - 6883 1352 1849 1431 1546 1291 1752
ENSG00000000460.16_4 C1orf112 protein_coding chr1 169631245 169823221 + 5967 3490 4828 4071 3885 3376 4586
ENSG00000000938.12_2 FGR protein_coding chr1 27938575 27961788 - 3474 1 0 0 1 0 3
ENSG00000000971.15_2 CFH protein_coding chr1 196621008 196716634 + 8145 1620 2346 2001 2148 1863 2507
ENSG00000001036.13_2 FUCA2 protein_coding chr6 143815948 143832827 - 2793 9471 12928 10236 11808 10084 13539
ENSG00000001084.10_3 GCLC protein_coding chr6 53362139 53481768 - 8463 2461 3384 2572 2761 2392 3046
ENSG00000001167.14_2 NFYA protein_coding chr6 41040684 41067715 + 3811 4549 4770 3828 4146 4264 4997
In [2]:
info= read.table('GSE132287_sample_group.txt',header = F,col.names = c('sample','group'))
info
Out[2]:
sample group
MDA-A1 MDA-A
MDA-A2 MDA-A
MDA-A3 MDA-A
MDA-G1 MDA-G
MDA-G2 MDA-G
MDA-G3 MDA-G
In [3]:
df = gene[as.character(info$sample)]
head(df)
Out[3]:
MDA-A1 MDA-A2 MDA-A3 MDA-G1 MDA-G2 MDA-G3
ENSG00000000003.14_2 2704 3694 2946 3204 2435 3498
ENSG00000000005.5_2 0 0 0 0 0 0
ENSG00000000419.12_2 4045 5254 4031 4740 3872 4867
ENSG00000000457.13_2 1352 1849 1431 1546 1291 1752
ENSG00000000460.16_4 3490 4828 4071 3885 3376 458
6ENSG00000000938.12_2 1 0 0 1 0 3
In [4]:
coldata <- data.frame(group = info$group )
coldata
Out[4]:
group
MDA-A
MDA-A
MDA-A
MDA-G
MDA-G
MDA-G
三Deseq数据分析
In [5]:
install.packages('BiocManager')
BiocManager::install('DESeq2')
In [6]:
library(DESeq2)
In [7]:
dds <- DESeqDataSetFromMatrix(df, DataFrame(coldata), design= ~ group )
dds <- DESeq(dds,betaPrior=FALSE)
dds
Out[7]:
class: DESeqDataSet
dim: 60461 6
metadata(1): version
assays(4): counts mu H cooks
rownames(60461): ENSG00000000003.14_2 ENSG00000000005.5_2 ...ENSG00000284747.1_1 ENSG00000284748.1_1
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(6): MDA-A1 MDA-A2 ... MDA-G2 MDA-G3
colData names(2): group sizeFactor
In [8]:
df <- as.data.frame(counts(dds, normalized=TRUE))
In [9]:
df['MDA-A_mean'] = apply(
df[as.character(info[info$group=='MDA-A',1])],1,mean)
df['MDA-G_mean'] = apply(
df[as.character(info[info$group=='MDA-G',1])],1,mean)
df['gene'] = rownames(df)head(df)
MDA-A1 MDA-A2 MDA-A3 MDA-G1 MDA-G2 MDA-G3 MDA-A_mean MDA-G_mean gene
ENSG00000000003.14_2 2994.378835 3147.518 3196.921 3101.6740411 2819.888 2992.716062 3112.9393704 2971.425929
ENSG00000000003.14_2ENSG00000000005.5_2 0.000000 0.000 0.000 0.0000000 0.000 0.000000 0.0000000 0.000000
ENSG00000000005.5_2ENSG00000000419.12_2 4479.386978 4476.735 4374.335 4588.6188998 4484.027 4163.964858 4443.4855603 4412.203499
ENSG00000000419.12_2ENSG00000000457.13_2 1497.189418 1575.463 1552.883 1496.6254893 1495.062 1498.924683 1541.8452966 1496.870591
ENSG00000000457.13_2ENSG00000000460.16_4 3864.786292 4113.757 4417.742 3760.9249843 3909.627 3923.555134 4132.0948079 3864.702246
ENSG00000000460.16_4ENSG00000000938.12_2 1.107389 0.000 0.000 0.9680631 0.000 2.566652 0.3691295 1.178238 ENSG00000000938.12_2
res <- results(dds, contrast = c('group', 'MDA-A', 'MDA-G'), pAdjustMethod = 'fdr', alpha = 0.05)
res = as.data.frame(res)
head(res)
baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000000003.14_2 3042.1826497 0.06681719 0.07222184 0.9251661 0.3548795 0.7074537
ENSG00000000005.5_2 0.0000000 NA NA NA NA NAE
NSG00000000419.12_2 4427.8445298 0.01055749 0.06762388 0.1561207 0.8759379 0.9691177
ENSG00000000457.13_2 1519.3579439 0.04308650 0.07852322 0.5487103 0.5832043 0.8550993
ENSG00000000460.16_4 3998.3985272 0.09654288 0.07227280 1.3358121 0.1816107 0.5297536
ENSG00000000938.12_2 0.7736839 -1.73212025 2.76415576 -0.6266363 0.5308977 NA
In [11]:
res['type']='Not DEG'
res[which(res$log2FoldChange >= 1 & res$pvalue < 0.05),'type'] <- 'Up'
res[which(res$log2FoldChange <= 1 & res$pvalue < 0.05),'type'] <- 'Down'
res['gene'] = rownames(res)head(res)
baseMean log2FoldChange lfcSE stat pvalue padj type gene
ENSG00000000003.14_2 3042.1826497 0.06681719 0.07222184 0.9251661 0.3548795 0.7074537 Not DEG E
NSG00000000003.14_2ENSG00000000005.5_2 0.0000000 NA NA NA NA NA Not DEG ENSG00000000005.5_2
ENSG00000000419.12_2 4427.8445298 0.01055749 0.06762388 0.1561207 0.8759379 0.9691177 Not DEG
ENSG00000000419.12_2ENSG00000000457.13_2 1519.3579439 0.04308650 0.07852322 0.5487103 0.5832043 0.8550993 Not DEG
ENSG00000000457.13_2ENSG00000000460.16_4 3998.3985272 0.09654288 0.07227280 1.3358121 0.1816107 0.5297536 Not DEG
ENSG00000000460.16_4ENSG00000000938.12_2 0.7736839 -1.73212025 2.76415576 -0.6266363 0.5308977 NA Not DEG
ENSG00000000938.12_2
合并数据
In [12]:
result_merge = merge(df,res,by = 'gene')
result_merge = result_merge[order(result_merge$pvalue),]
head(result_merge)
gene MDA-A1 MDA-A2 MDA-A3 MDA-G1 MDA-G2 MDA-G3 MDA-A_mean MDA-G_mean baseMean log2FoldChange lfcSE stat pvalue padj type1875
ENSG00000090339.8_2 4852.57694 4787.73773 5220.7701 27261.624 26204.689 25266.976 4953.69492 26244.430 15599.062 -2.405646 0.06409772 -37.53091 0.000000e+00 0.000000e+00 Down11138
ENSG00000163739.4_2 193.79301 177.22895 210.5237 3221.714 3519.359 3184.359 193.84854 3308.477 1751.163 -4.097951 0.10170786 -40.29139 0.000000e+00 0.000000e+00 Down11703
ENSG00000165795.23_3 17.71822 28.11805 41.2366 45895.870 41106.667 43346.472 29.02429 43449.669 21739.347 -10.548113 0.16963433 -62.18148 0.000000e+00 0.000000e+00 Down12624
ENSG00000169429.10_2 2099.60883 2407.92789 2647.8235 28439.757 28773.277 27150.899 2385.12008 28121.311 15253.215 -3.559289 0.07909170 -45.00205 0.000000e+00 0.000000e+00 Down10535
ENSG00000160710.15_3 85045.23143 85230.08188 83116.6997 21604.263 22847.460 22927.901 84464.00435 22459.875 53461.939 1.911000 0.05663594 33.74183 1.409012e-249 4.237464e-246 Up11364
ENSG00000164400.5_2 759.66860 834.16893 959.2934 6727.070 5916.553 6392.674 851.04366 6345.432 3598.238 -2.898827 0.08922127 -32.49032 1.460968e-231 3.661428e-228 Down
In [13]:
write.csv(result_merge,file = 'Differential_Expression_Genes_Summary.csv', quote=F,row.names =F)
往期相关链接:
【绘图进阶】之交互式可删减分组和显示样品名的PCA 图(三);
3分钟学会CHIP-seq类实验测序数据可视化 —IGV的使用手册;
10分钟搞定多样性数据提交,最快半天内获取登录号,史上最全的多样性原始数据提交教程;
【WGS服务升级】人工智能软件SpliceAI助力解读罕见和未确诊疾病中的非编码突变;
20分钟搞定GEO上传,史上最简单、最详细的GEO数据上传攻略;
如果您对本文案介绍的方法或代码有疑问,
请扫码添加QQ群沟通
【本群将为大家提供】
分享生信分析方案
提供数据素材及分析软件支持
定期开展生信分析线上讲座
QQ号:1040471849