sql - Calculate variance of frequencies when dataset does not contain entries of frequency zero

January 15, 2013

I have a dataset that has 3 fields: id, feature, and frequency. I want to find out, for a group of given ids, which feature has the largest spread of frequencies. The result I want is that if I split the group of ids into 2 sub-groups, using the median value of the frequency for that feature, I get 2 groups that are different from each other, yet of equal size.

My first thought was to calculate the variance of the frequencies for each feature, and use the feature whose variance is highest.

Given a database table that looks like this:

id | feature | frequency
---+---------+----------
 0 | 0       | 1
 0 | 1       | 1
 0 | 2       | 0
 1 | 0       | 2
 1 | 1       | 2
 1 | 2       | 0
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12

Feature 0 has frequencies of 1, 2, 3, 4, 5.
Feature 1 has frequencies of 1, 2, 3, 9, 10.
Feature 2 has frequencies of 0, 0, 4, 10, 12.

We can see that feature 2 has the biggest spread, and splitting on 4 would make a nice point to split it into 2 groups (0, 0, and 4 in one group, and 10 and 12 in the other group).

I can calculate this with the following SQL query:

select feature, variance(frequency) as f
from dataset
where id in (<list of ids>)
group by feature
order by f desc
limit 1;

This works fine, but it has one flaw. My dataset is sparse (most entries have a frequency of zero), and it is too expensive for me (both in terms of space and in terms of the time it takes to insert the entries) to store the 0-frequency items in the database. Therefore my actual tables look like this:

id | feature | frequency
---+---------+----------
 0 | 0       | 1
 0 | 1       | 1
 1 | 0       | 2
 1 | 1       | 2
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12

The above SQL query will not give the correct results now, because it needs to consider the 0-frequency entries to calculate the correct variance value. My SQL skills aren't good enough to figure out a (performant) query that can get around this limitation...
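To make the size of the discrepancy concrete, here is a small sketch in Python. It takes the feature 2 rows from the two tables above and assumes that `variance` means the population variance (as in MySQL; some databases return the sample variance instead). The variable names are illustrative only.

```python
from statistics import pvariance

# Frequencies of feature 2 for ids 0..4, read from the two tables above.
dense = [0, 0, 8, 10, 12]   # first table: zero rows are stored
sparse = [8, 10, 12]        # second table: zero rows are omitted

# Population variance (assumption: this is what variance() returns).
var_dense = pvariance(dense)     # computed over all 5 ids
var_sparse = pvariance(sparse)   # computed over the 3 stored rows only

print(var_dense, var_sparse)
```

The two values differ by almost a factor of ten (25.6 against roughly 2.67), so on the sparse table feature 2 would no longer be ranked as the feature with the largest spread.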

My next thought was to calculate the maximum entropy instead, but that suffers from the fact that it does not take the actual frequency values (and the "frequency"/counts of the times the same frequency value occurs in the same dataset) into account, only the number of distinct values. Unless I'm misunderstanding the entropy formula.

So my questions are:

Is there a way of doing this in SQL?
If not, is there a way of "adjusting" the calculated variance to account for the number of 0 entries? (Assume I know how many 0 entries were omitted.)
If yes, is there a way of doing this in a single SQL query like the one above? (Again, assume I know beforehand how many 0 entries were omitted.)
If neither is possible, is there a way of using entropy and adjusting it for the actual values?
Is there any other measure (e.g. kurtosis?) that I should consider? Is there one that can be adjusted for the missing 0 entries?
Or are there any other suggestions or alternative solutions?
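On the "adjusting" questions, one possible approach (a sketch, not taken from the original thread) uses the fact that zero entries add nothing to either the sum or the sum of squares of the frequencies, so only the divisor has to change. With n ids in the group, the population variance of a feature is sum(f*f)/n - (sum(f)/n)^2, where the sums run over the stored rows only. The example below runs such a query against an in-memory SQLite database loaded with the sparse table. The function name `best_feature` is hypothetical, the list of ids is assumed to contain no duplicates, and population variance is assumed.

```python
import sqlite3

# Sparse rows (id, feature, frequency) from the second table; zero rows are absent.
ROWS = [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 2), (1, 1, 2),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table dataset (id integer, feature integer, frequency integer)")
conn.executemany("insert into dataset values (?, ?, ?)", ROWS)


def best_feature(conn, ids):
    """Return (feature, variance) for the largest zero-adjusted population variance."""
    n = len(ids)  # every id in the group counts, even if its row is missing
    id_marks = ", ".join("?" for _ in ids)
    sql = f"""
        select feature,
               sum(frequency * frequency) * 1.0 / ?
               - (sum(frequency) * 1.0 / ?) * (sum(frequency) * 1.0 / ?) as f
        from dataset
        where id in ({id_marks})
        group by feature
        order by f desc
        limit 1
    """
    # Placeholders are bound in order of appearance: three divisors, then the ids.
    return conn.execute(sql, [n, n, n, *ids]).fetchone()


feature, variance = best_feature(conn, [0, 1, 2, 3, 4])
print(feature, variance)
```

On this data the query selects feature 2 with a variance of about 25.6, the same value the dense table gives. A feature with no stored rows at all for the group never appears in the result, which is harmless here because its variance would be zero. If the sample variance is wanted instead, the divisors would need to be adapted accordingly.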

With respect to filling in the gaps in your table, you can use a "helper" temp table with the valid list of features and union in the missing zero-frequency values by way of a cross join. The "how" depends on which database language you are using. For example, suppose you have a table named "helper" with 3 rows (one for each of the 3 different features). Something like this might work:

select id, feature, frequency
from have
union
select b.id
     , a.feature
     , 0 as frequency
from helper a
cross join have b
where not exists (
    select 1
    from have b1
    where b1.id = b.id
      and b1.feature = a.feature
    )

Here is a SQLFiddle.
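The gap-filling query can also be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module, with the table names `have` and `helper` from the answer and the sparse rows from the question. The column layout of `helper` (a single `feature` column) and the trailing `order by` are assumptions added for the example.

```python
import sqlite3

# Sparse rows (id, feature, frequency) from the question; zero rows are absent.
ROWS = [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 2), (1, 1, 2),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table have (id integer, feature integer, frequency integer)")
conn.executemany("insert into have values (?, ?, ?)", ROWS)

# Helper table: one row per valid feature (assumed single-column layout).
conn.execute("create table helper (feature integer)")
conn.executemany("insert into helper values (?)", [(0,), (1,), (2,)])

FILL_GAPS = """
    select id, feature, frequency
    from have
    union
    select b.id, a.feature, 0 as frequency
    from helper a
    cross join have b
    where not exists (
        select 1
        from have b1
        where b1.id = b.id
          and b1.feature = a.feature
    )
    order by id, feature
"""

# union (rather than union all) removes the duplicate zero rows that the
# cross join produces when an id has several stored rows.
filled = conn.execute(FILL_GAPS).fetchall()
for row in filled:
    print(row)
```

With 5 ids and 3 features the result has 15 rows, including the two zero-frequency rows for ids 0 and 1 on feature 2 that the sparse table leaves out. Note that an id with no stored rows at all would still be missing, since the ids are taken from `have` itself.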
