I'm still learning R, I have this dataset, it has 5 columns, first column is tracking_id, the next four columns have values of four groups.
First, I want to filter rows that have values equal or larger than 1, then I want to filter rows based on comparison of the last three columns ("CD44hi_CD69low_rep","CD44hi_CD69hi_CD103low_rep","CD44hi_CD69hi_CD103hi_rep") that are 8 folds higher or 4 folds lower compared to column ("CD44low_rep").
The output should have 5 columns, with values equal or larger than 1 that are 8 fold higher or 4 fold less of the last three column compared to second column.
I should get something like this:
To filter rows equal or larger than 1, I tried this:
df1 %>% select_if(is.numeric) %>% filter_all(all_vars(. >= 1))
Then to filter 8 folds high or 4 fold less, I tried (thanks to @akrun):
nm1 <- c("CD44hi_CD69low_rep", "CD44hi_CD69hi_CD103low_rep",
"CD44hi_CD69hi_CD103hi_rep")
i1 <- (rowSums(df1[nm1] >= (df1$CD44low_rep * 8)) == 3) &
(rowSums(df1[nm1] <= (df1$CD44low_rep * 4)) == 3)
However, I'm getting no input.
I'm following these steps:
Analysis and graphic display of RNA-Seq data. A total of 9,085 genes for which
the maximum fragments per kilobase of exon per million mapped reads value in all
samples was ≥1.0 were subjected to further analyses. A principal component analysis
was performed using R (https://www.r-project.org/). Clustering was performed using
APCluster (an R Package for Affinity Propagation Clustering). The transcriptional
signatures of CD44hiCD69lo, CD44hiCD69hiCD103lo and CD44hiCD69hiCD103hi CD4+
T cells were defined with genes for which the expression was eightfold higher or
fourfold lower than that in CD44loCD69lo CD4+ T cells.
For the visualization of the co-regulation network, the 500 genes in the CD44hi
CD4+ T cell groups that showed the greatest variation compared with the naive
(CD44loCD69lo) CD4+ T cell group were subjected to further analyses. The first-
neighbor genes were determined using the following two criteria: (1) a correlation
of >0.8; and (2) a ratio of norm of 0.8–1.25. The network graph of 483 genes was
visualized using Cytoscape (http://www.cytoscape.org/).
The IDs that I'm interested in are:
values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b',
'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4',
'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca',
'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1',
'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna',
'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13',
'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj',
'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2',
'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8',
'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5',
'Ramp1', 'Sdc4')
After applying @RonakShah (thank you!), I get only 21 instead of 57:
library(dplyr)
df09 <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv')
filtertrial <- df09 %>%
#Keep rows where all the values are greater than 1
filter(across(where(is.numeric), ~. >= 1)) %>%
#Rows where any value is higher than 8 times CD44low_rep
#Or 4 times less than CD44low_rep
filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep,
~. >= CD44low_rep*8 | . <= CD44low_rep/4)))
values <- c('S100a10', 'Esm1', 'Itgb1', 'Anxa2', 'Hist1h1b',
'Il2rb', 'Lgals1', 'Mki67', 'Rora', 'S100a4',
'S100a6', 'Adam8', 'Areg', 'Bcl2l1', 'Calca',
'Capg', 'Ccr2', 'Cd44', 'Csda', 'Ehd1',
'Id2', 'Il10', 'Il1rl1', 'Il2ra', 'Lmna',
'Maf', 'Penk', 'Podnl1', 'Tiam1', 'Vim',
'Ern1', 'Furin', 'Ifng', 'Igfbp7', 'Il13',
'Il4', 'Il5', 'Nrp1', 'Ptprs', 'Rbpj',
'Spry1', 'Tnfsf11', 'Vdr', 'Xcl1', 'Bmpr2',
'Csf1', 'Dst', 'Foxp3', 'Itgav', 'Itgb8',
'Lamc1', 'Myo1e', 'Pmaip1', 'Prdm1', 'Ptpn5',
'Ramp1', 'Sdc4')
#Make sure the sorting won't change by using match function and reverse it to get the right order as
#shown in the original plot.
dfgll <- filtertrial %>% slice(match(rev(values), tracking_id))
dfgll
How to achieve this?
You can try the following :
library(dplyr)
df <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/dfpilot.csv')
df %>%
#Keep rows where all the values are greater than 1
filter(across(where(is.numeric), ~. > 1)) %>%
#Rows where any value is higher than 8 times CD44low_rep
#Or 4 times less than CD44low_rep
filter(Reduce(`|`, across(CD44hi_CD69low_rep:CD44hi_CD69hi_CD103hi_rep,
~. > CD44low_rep*8 | . < CD44low_rep/4))) -> result
head(result)
# tracking_id CD44low_rep CD44hi_CD69low_rep CD44hi_CD69hi_CD103low_rep
#1 42624 1.91 6.68 17.50
#2 A930005H10Rik 9.41 4.63 1.48
#3 Actn1 42.01 21.77 1.71
#4 Adora2a 1.31 7.05 15.51
#5 Ahnak 12.09 152.43 362.87
#6 Als2cl 11.17 1.98 1.01
# CD44hi_CD69hi_CD103hi_rep
#1 22.51
#2 1.55
#3 1.22
#4 13.31
#5 299.07
#6 1.26
Thank you @RonakShah, I deleted the other question, but the output seems not right for some reason, I get only 21 rows output at the end instead of 57 after I compared 57 selected IDs, I even changed > to >=, also < to <=, but still, I got only 21 rows, any input?
I get 197 rows for the data you have shared running the same code as above. How do you get 21 rows?
I do get 197 but when I filter the ids that I'm interested in I get only 21 instead of 57! I updated the question with the code that I used after your code. @RonalShah
Yes, so only 21 id's satisfy the criteria that you have mentioned. Please review your criteria again. Your first criteria is keep only those rows where all values is greater than 1. One of the id which is absent in final output is
Esm1
. If you check it's valuesdf09 %>% filter(tracking_id == 'Esm1')
you'll see thatCD44low_rep
is0.1740859
for it. Hence, the code removes it from the list since it is less than 1. I am sure that is the similar case with rest of the id's which is not present in the final output.Thank you @RonakShah, I agree, I don't know why they put in the criteria to remove values less than 1, probably they meant something in data preprocessing in bash, not R.