Thursday, November 10, 2011

Densify your data!

Oracle SQL for analysis and reporting ROCKS!


Data stored in traditional OLTP systems isn't always easily usable in analytic reporting systems - the data is almost always geared towards operational reporting systems. Oracle SQL for analysis and reporting gives you options to take your operational data and convert it into meaningful information using analytic functions.


One of the frequent requirements in a reporting system is to fill in the gaps i.e. create data where it doesn't exist, but if it did, it would make the data more consistent and the reporting would be much easier. In BI speak, this is known as data densification - the process of converting sparse data into dense form.  


This is easier to explain with an example. Consider the following data set which is a result of the SQL query


select salesperson, period, item, amount
from items_sold
where salesperson = 'S1'
  and period between P1 and P3;


Figure 1

Sales Person
Period
Item
Amount
S1
P1
W1
100
S1
P1
W2
200
S1
P1
W4
400
S1
P1
W4
200
S1
P2
W3
200
S1
P2
W5
100
S1
P2
W1
20
S1
P3
W5
400
S1
P3
W2
200



The output sought is:

Figure 2
Sales Person
Period
Item
Total
S1
P1
W1
100
S1
P1
W2
200
S1
P1
W3
0
S1
P1
W4
600
S1
P1
W5
0
S1
P2
W1
20
S1
P2
W2
0
S1
P2
W3
200
S1
P2
W4
0
S1
P2
W5
100
S1
P3
W1
0
S1
P3
W2
200
S1
P3
W3
0
S1
P3
W4
0
S1
P3
W5
400



i.e. "Show me everything sales person S1 sold in periods P1, P2 and P3 even if he sold 0 amount of an item".


Step 1: Group the data by period and item to get the total amount 

select salesperson, period, item, sum(amount)
from items_sold
where salesperson = 'S1'
  and period between P1 and P3
group by salesperson, period, item;


Step 2: Densify the data - fill in the gaps using Partitioned Outer Joins


Assuming that the period values are stored in the periods  table -


select q1.salesperson, q2.period, q1.item, nvl(q1.amt,0)
from
(

select salesperson, period, item, sum(amount) amt
from items_sold
where salesperson = 'S1'
  and period between P1 and P3
group by salesperson, period, item

) q1
PARTITION BY (q1.salesperson, q1.item)
RIGHT OUTER JOIN
(
select period from periods
where period between P1 and P3 ) q2
on ( q1.period = q2.period )


So what did I just do? Basically, I took the query from step 1, created an independent sub set of rows using the PARTITION clause - effectively the set of all items S1 sold between P1 and P3


Figure 3

S1
W1
S1
W2
S1
W3
S1
W4
S1
W5


I outer join this subset with a static set of periods (using the inline view q2). Run the new query and you get the desired output as shown in Figure 2. 

Densify away!

No comments:

Post a Comment