Any one able to assist me with a problem using R to create a sum of previous events in a specific time period please? I apologise if I do not follow protocol this is my first question here.
I have a df of IDs and event dates. In the genuine df the events are date times but to keep things simpler I have used dates here.
I am trying to create a variable which is a tally of the number of previous events within the last 2 years for that ID number (or 750 days as Im not too concerned about leap years but that would be nice to factor in).
There is an additional issue in that some IDs will have more than one event on the same date (and time in the genuine df). I do not want to remove these as in the real df there is an additional variable which is not necessarily the same. However, in the sum I want to count events happening on the same date as one event i.e. if an ID had only 2 events but they were on the same day the result would be 0, or there may be 3 rows of dates within the previous 2 years for an ID - but as two are the same date the result is 2. I have created a outcome vector to give an example of what I have after ID 7 has examples of this.
If there were 3 previous events all on the same day the result sum would be 1 and any subsequent events in 2 years
ID <- c(10,1,11,2,2,13,4,5,6,6,13,7,7,7,8,8,9,9,9,10,1,11,2,11,12,9,13,14,7,15,7)
event.date<-c('2018-09-09','2016-06-02','2018-08-20', '2018-11-03', '2018-07-10', '2017-03-08', '2018-06-16', '2017-05-20', '2016-04-02', '2016-07-27', '2018-07-15', '2018-06-15', '2018-06-15', '2018-01-16', '2017-10-07', '2016-08-17','2018-08-01','2017-01-22','2016-08-05', '2018-08-13', '2016-11-28', '2018-11-24','2016-06-01', '2018-03-26', '2017-02-04', '2017-12-01', '2016-05-16', '2017-11-25', '2018-04-01', '2017-09-21', '2018-04-01')
df<-data.frame(ID,event.date)
df<-df%>%
arrange(ID,event.date)
The resulting column should look something like this.
event.count <- c(0,1,0,0,1,0,0,0,1,0,1,1,2,2,0,1,0,1,2,3,0,1,0,1,2,0,0,1,1,0,0)
df$event.count<-event.count
I have tried various if else and use of lag() but cannot get what I am after
thank you.
Here is a solution with data.table
.
To subtract 2 years from the event.date
, you can use lubridate
and subtract years(2)
.
After grouping by
both ID
and event.date
, you can subset all dates that fall in between 2 years ago and the date (incbounds
of between
will exclude the upper and lower bounds).
Using uniqueN
will prevent duplicate dates from being counted multiple times.
library(data.table)
library(lubridate)
df$event.date <- as.Date(df$event.date)
setDT(df)[, new.event.count := uniqueN(df$event.date[df$ID == ID][
between(df$event.date[df$ID == ID],
event.date - years(2),
event.date,
incbounds = FALSE)]),
by = c("ID", "event.date")][]
Output
ID event.date event.count new.event.count
1: 1 2016-06-02 0 0
2: 1 2016-11-28 1 1
3: 2 2016-06-01 0 0
4: 2 2018-07-10 0 0
5: 2 2018-11-03 1 1
6: 4 2018-06-16 0 0
7: 5 2017-05-20 0 0
8: 6 2016-04-02 0 0
9: 6 2016-07-27 1 1
10: 7 2018-01-16 0 0
11: 7 2018-04-01 1 1
12: 7 2018-04-01 1 1
13: 7 2018-06-15 2 2
14: 7 2018-06-15 2 2
15: 8 2016-08-17 0 0
16: 8 2017-10-07 1 1
17: 9 2016-08-05 0 0
18: 9 2017-01-22 1 1
19: 9 2017-12-01 2 2
20: 9 2018-08-01 3 3
21: 10 2018-08-13 0 0
22: 10 2018-09-09 1 1
23: 11 2018-03-26 0 0
24: 11 2018-08-20 1 1
25: 11 2018-11-24 2 2
26: 12 2017-02-04 0 0
27: 13 2016-05-16 0 0
28: 13 2017-03-08 1 1
29: 13 2018-07-15 1 1
30: 14 2017-11-25 0 0
31: 15 2017-09-21 0 0
ID event.date event.count new.event.count