Warm tip: This article is reproduced from serverfault.com, please click

r-过滤具有特定条件的所有列的行

(r - Filter in rows of all columns with specific conditions)

发布于 2020-11-27 23:38:21

我仍在学习R，我有这个数据集，它有5列，第一列是tracking_id，接下来的四列具有四个组的值。

首先，我要过滤具有等于或大于1的值的行，然后要基于对最后三列（“ CD44hi_CD69low_rep”，“ CD44hi_CD69hi_CD103low_rep”，“ CD44hi_CD69hi_CD103hi_rep”）进行比较的8倍或4来比较行与列（“ CD44low_rep”）相比折叠低。

输出应具有5列，其值等于或大于1，这比第二列高8倍或比最后三列低4倍。

我应该得到这样的东西：

为了过滤等于或大于1的行，我尝试了以下操作：

df1 %>% select_if(is.numeric) %>%  filter_all(all_vars(. >= 1))

然后要过滤高8倍或少4倍，我尝试过（由于@akrun）：

nm1 <- c("CD44hi_CD69low_rep",  "CD44hi_CD69hi_CD103low_rep", 
         "CD44hi_CD69hi_CD103hi_rep")
i1 <- (rowSums(df1[nm1]  >= (df1$CD44low_rep * 8)) == 3) &
     (rowSums(df1[nm1]  <= (df1$CD44low_rep * 4)) == 3)

但是，我没有任何输入。

我正在按照以下步骤操作：

Analysis and graphic display of RNA-Seq data. A total of 9,085 genes for which
the maximum fragments per kilobase of exon per million mapped reads value in all
samples was ≥1.0 were subjected to further analyses. A principal component analysis
was performed using R (https://www.r-project.org/). Clustering was performed using
APCluster (an R Package for Affinity Propagation Clustering). The transcriptional
signatures of CD44hiCD69lo, CD44hiCD69hiCD103lo and CD44hiCD69hiCD103hi CD4+
T cells were defined with genes for which the expression was eightfold higher or
fourfold lower than that in CD44loCD69lo CD4+ T cells.
For the visualization of the co-regulation network, the 500 genes in the CD44hi
CD4+ T cell groups that showed the greatest variation compared with the naive
(CD44loCD69lo) CD4+ T cell group were subjected to further analyses. The first-
neighbor genes were determined using the following two criteria: (1) a correlation
of >0.8; and (2) a ratio of norm of 0.8–1.25. The network graph of 483 genes was
visualized using Cytoscape (http://www.cytoscape.org/).

我感兴趣的ID是：

values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b', 
                                                'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4', 
                                                'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca', 
                                                'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1', 
                                                'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna', 
                                                'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
                                                'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13', 
                                                'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj', 
                                                'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2', 
                                                'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8', 
                                                'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5', 
                                                'Ramp1', 'Sdc4')

在应用@RonakShah（谢谢！）之后，我得到的是21而不是57：

library(dplyr)
df09 <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv')

filtertrial <- df09 %>% 
  #Keep rows where all the values are greater than 1
  filter(across(where(is.numeric), ~. >= 1)) %>%
  #Rows where any value is higher than 8 times CD44low_rep
  #Or 4 times less than CD44low_rep
  filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep, 
         ~. >= CD44low_rep*8 | . <= CD44low_rep/4)))

values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b', 
                                                'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4', 
                                                'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca', 
                                                'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1', 
                                                'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna', 
                                                'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
                                                'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13', 
                                                'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj', 
                                                'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2', 
                                                'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8', 
                                                'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5', 
                                                'Ramp1', 'Sdc4')

#Make sure the sorting won't change by using match function and reverse it to get the right order as 
#shown in the original plot.
dfgll <- filtertrial %>% slice(match(rev(values), tracking_id))


dfgll

如何实现呢？

Questioner

user432797

Viewed

Original

English

library(dplyr) df <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv') df %>% #Keep rows where all the values are greater than 1 filter(across(where(is.numeric), ~. > 1)) %>% #Rows where any value is higher than 8 times CD44low_rep #Or 4 times less than CD44low_rep filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep, ~. > CD44low_rep*8 | . < CD44low_rep/4))) -> result head(result) # tracking_id CD44low_rep CD44hi_CD69low_rep CD44hi_CD69hi_CD103low_rep #1 42624 1.91 6.68 17.50 #2 A930005H10Rik 9.41 4.63 1.48 #3 Actn1 42.01 21.77 1.71 #4 Adora2a 1.31 7.05 15.51 #5 Ahnak 12.09 152.43 362.87 #6 Als2cl 11.17 1.98 1.01 # CD44hi_CD69hi_CD103hi_rep #1 22.51 #2 1.55 #3 1.22 #4 13.31 #5 299.07 #6 1.26

user432797 2020-11-28 16:47:36

谢谢@RonakShah，我删除了另一个问题，但是由于某种原因输出似乎不正确，在比较了57个选定的ID之后，我只得到了21行而不是57行，我什至将>更改为> =，也将<到<=，但仍然只有21行，有输入吗？

Ronak Shah 2020-11-29 00:28:19

对于您共享的数据，我得到197行，并运行与上面相同的代码。您如何获得21行？

user432797 2020-11-29 01:22:24

我的确得到197，但是当我过滤我感兴趣的ID时，我只会得到21而不是57！我用在代码之后使用的代码更新了问题。@罗纳尔·沙（RonalShah）

Ronak Shah 2020-11-29 01:34:57

是的，因此只有21个ID满足您提到的条件。请再次检查您的条件。您的第一个条件是仅保留所有值均大于1的行Esm1。最终输出中缺少的ID之一是。如果您检查它的值，df09 %>% filter(tracking_id == 'Esm1')您会发现它CD44low_rep是0.1740859正确的。因此，由于它小于1，因此代码将其从列表中删除。我敢肯定，在最终输出中不存在其余ID的情况也是如此。

user432797 2020-11-29 22:03:05

谢谢@RonakShah，我同意，我不知道他们为什么要在准则中删除小于1的值，这可能意味着在bash而不是R中进行数据预处理。

r-过滤具有特定条件的所有列的行

(r - Filter in rows of all columns with specific conditions)

热门github