Task: Get the total number of unique cumulative customers by each decline reason and by each day.
Input data sample:
+---------+--------------+------------+------+
| Cust_Id | Decline_Dt | Reason | Days |
+---------+--------------+------------+------+
| A | 08-09-2020 | Reason_1 | 0 |
| A | 08-09-2020 | Reason_1 | 1 |
| A | 08-09-2020 | Reason_1 | 2 |
| A | 08-09-2020 | Reason_1 | 4 |
| B | 08-09-2020 | Reason_1 | 0 |
| B | 08-09-2020 | Reason_1 | 2 |
| B | 08-09-2020 | Reason_1 | 3 |
| C | 08-09-2020 | Reason_1 | 1 |
+---------+--------------+------------+------+
1) Decline_dt - The date on which the payment was declined. (Ignore it for this task)
2) Days - Indicates the # of days after the payment decline happened, the customer interacted with IVR channel.
3) Reason - Indicates the payment decline reason
--Expected Output:
+---------------+-----------+---------------+----------------------------+
| Reason | Days | Unique_mtns | total_cumulative_customers |
+---------------+-----------+---------------+----------------------------+
| Reason_1 | 0 | 2 | 2 |
| Reason_1 | 1 | 2 | 3 |
| Reason_1 | 2 | 2 | 3 |
| Reason_1 | 3 | 1 | 3 |
| Reason_1 | 4 | 1 | 3 |
+------------------------------------------------------------------------+
My Hive query:
select a.Reason
, a.days
-- , count(distinct a.cust_id) as unique_mtns
, count(distinct a.cust_id) over (partition by Reason
order by a.days rows between unbounded preceding and current row)
as total_cumulative_customers
from table as a
group by a.reason
, a.days
Output (Incorrect):
+---------------+-----------+----------------------------+
| Reason | Days | total_cumulative_customers |
+---------------+-----------+----------------------------+
| Reason_1 | 0 | 2 |
| Reason_1 | 1 | 2 |
| Reason_1 | 2 | 2 |
| Reason_1 | 3 | 1 |
| Reason_1 | 4 | 1 |
+--------------------------------------------------------+
Ideally, I would expect the window function to be executed without group by. However, I get an error without group by. When I use group by, I don't get the cumulative customers.
If I follow you correctly, you can use a subquery to compute the first day per customer/reason tuple, and then do conditional aggregation:
select reason, days,
count(distinct cust_id) as unique_mtns,
sum(sum(case when days = min_days then 1 else 0 end))
over(partition by reason order by days) as total_cumulative_customers
from (
select reason, cust_id,
min(days) over(partition by reason, cust_id) as min_days
from mytable
) t
group by reason, days
Some columns are unavailable in outer query. Also
min_days
is not present in group by, so there should be anothersum
or it should throw an error (maybe in Hive it works differently?). This can be fixed by marking first day in inner query withlag(0, 1, 1) over(partition by reason, cust_id order by days asc)
Thank you @GMB! Your solution works fine after the edit made by you by adding another sum. Thank you @astentx, by adding another sum, I got the query running.