任务:按每个拒绝原因和每天获取唯一的累积客户总数。
Input data sample:
+---------+--------------+------------+------+
| Cust_Id | Decline_Dt | Reason | Days |
+---------+--------------+------------+------+
| A | 08-09-2020 | Reason_1 | 0 |
| A | 08-09-2020 | Reason_1 | 1 |
| A | 08-09-2020 | Reason_1 | 2 |
| A | 08-09-2020 | Reason_1 | 4 |
| B | 08-09-2020 | Reason_1 | 0 |
| B | 08-09-2020 | Reason_1 | 2 |
| B | 08-09-2020 | Reason_1 | 3 |
| C | 08-09-2020 | Reason_1 | 1 |
+---------+--------------+------------+------+
1) Decline_dt - The date on which the payment was declined. (Ignore it for this task)
2) Days - Indicates the # of days after the payment decline happened, the customer interacted with IVR channel.
3) Reason - Indicates the payment decline reason
--Expected Output:
+---------------+-----------+---------------+----------------------------+
| Reason | Days | Unique_mtns | total_cumulative_customers |
+---------------+-----------+---------------+----------------------------+
| Reason_1 | 0 | 2 | 2 |
| Reason_1 | 1 | 2 | 3 |
| Reason_1 | 2 | 2 | 3 |
| Reason_1 | 3 | 1 | 3 |
| Reason_1 | 4 | 1 | 3 |
+------------------------------------------------------------------------+
我的Hive查询:
select a.Reason
, a.days
-- , count(distinct a.cust_id) as unique_mtns
, count(distinct a.cust_id) over (partition by Reason
order by a.days rows between unbounded preceding and current row)
as total_cumulative_customers
from table as a
group by a.reason
, a.days
输出(不正确):
+---------------+-----------+----------------------------+
| Reason | Days | total_cumulative_customers |
+---------------+-----------+----------------------------+
| Reason_1 | 0 | 2 |
| Reason_1 | 1 | 2 |
| Reason_1 | 2 | 2 |
| Reason_1 | 3 | 1 |
| Reason_1 | 4 | 1 |
+--------------------------------------------------------+
理想情况下,我希望不使用group by来执行window函数。但是,如果没有分组依据,则会出现错误。使用分组依据时,我没有获得累积的客户。
如果我正确地遵循了你的说明,则可以使用子查询来计算每个客户/原因元组的第一天,然后进行条件汇总:
select reason, days,
count(distinct cust_id) as unique_mtns,
sum(sum(case when days = min_days then 1 else 0 end))
over(partition by reason order by days) as total_cumulative_customers
from (
select reason, cust_id,
min(days) over(partition by reason, cust_id) as min_days
from mytable
) t
group by reason, days
某些列在外部查询中不可用。另外
min_days
是不是由组在场,所以应该有另一个sum
或它应该抛出一个错误(也许在蜂巢它的工作方式不同?)。可以通过在内部查询中标记第一天来解决此问题lag(0, 1, 1) over(partition by reason, cust_id order by days asc)
谢谢@GMB!在您添加另外一个总和进行编辑后,您的解决方案可以正常工作。谢谢@astentx,通过添加另一个总和,我运行了查询。