I'm trying to perform a null backfill, similar to Panda's dataframe bfill, in BigQuery. Reading the docs, the last_value
function seems to be a good choice. However, this leaves some null
spots until it finds the first value (quite reasonable, given the name of the function). How can I backfill those null
? Or I just have to drop them?
This is a sample query:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;
Actual output:
time sn_6 f0_
1 null null
2 1 1
3 null 1
4 null 1
5 null 1
6 0 0
7 null 0
8 null 0
Desired output:
time sn_6 f0_
1 null 1 <---Back fill all the gaps!
2 1 1
3 null 1
4 null 1
5 null 1
6 0 0
7 null 0
8 null 0
The real data has a timestamp
column followed by 6 float
columns and there are null values everywhere.
If the intention is to make the missing "backfill" to be a "forward-fill", you can use first_value
function to look forward to locate the first non-null value, as:
select table_path.*,
coalesce(
last_value(sn_6 ignore nulls) over (order by time),
first_value(sn_6 ignore nulls) over (order by time RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;
This worked. Thank you very much :). I feel that I am lacking a lot of bigquery sql dialect. Is there any course, tutorial or something you could recommend me?
Google Cloud has official course on Coursera coursera.org/professional-certificates/gcp-data-engineering and bunch of other courses