python - Pandas Shaping Data for Covariance


I need to conduct a simple covariance analysis on time-series data. The raw data comes in this shape:

week_end_date              title_short          sales
2012-02-25 00:00:00.000000 "bob" (ebk)          1
                           "bob" (ebk)          1
2012-03-31 00:00:00.000000 "bob" (ebk)          1
                           "bob" (ebk)          1
2012-03-03 00:00:00.000000 "sally" (ebk)        1
2012-03-10 00:00:00.000000 "sally" (ebk)        1
2012-03-17 00:00:00.000000 "sally" (ebk)        1
                           "sally" (ebk)        1
2012-04-07 00:00:00.000000 "sally" (ebk)        1

As you can see, there are duplicates. Unless I'm missing something, I need the data to become a set of vectors, one per title, so I can use numpy.cov.

Question:

How do I find the duplicates on date and name and aggregate them with a sum? I've been trying to use a pandas groupby on week_end_date and title_short, but the result comes out indexed in a way I don't understand.

Edit: to be more specific, when I try df.groupby(["week_end_date", "title_short"]), I get this:

>>> df.ix[0:3]
week_end_date               title_short
2012-02-04 00:00:00.000000  'salem's lot (ebk)    <pandas.core.indexing._NDFrameIndexer object a...
                            'tis season! (ebk)    <pandas.core.indexing._NDFrameIndexer object a...
                            (not asked) (ebk)     <pandas.core.indexing._NDFrameIndexer object a...
dtype: object

And trying to select df.ix[1,] gets this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/series.py", line 613, in __getitem__
    return self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 1630, in get_value
    loc = self.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2285, in get_loc
    result = slice(*self.slice_locs(key, key))
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2226, in slice_locs
    start_slice = self._partial_tup_index(start, side='left')
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2250, in _partial_tup_index
    raise Exception('Level type mismatch: %s' % lab)
Exception: Level type mismatch: 3

I'm not entirely sure what's going on in your case, but here's where I'd start. First, read in the data (which looks fixed-width to me), forward-filling the blank dates:

>>> df = pd.read_fwf("weekend.dat", widths=(26, 20, 9), parse_dates=[0])
>>> df = df.fillna(method="ffill")
>>> df
        week_end_date    title_short  sales
0 2012-02-25 00:00:00    "bob" (ebk)      1
1 2012-02-25 00:00:00    "bob" (ebk)      1
2 2012-03-31 00:00:00    "bob" (ebk)      1
3 2012-03-31 00:00:00    "bob" (ebk)      1
4 2012-03-03 00:00:00  "sally" (ebk)      1
5 2012-03-10 00:00:00  "sally" (ebk)      1
6 2012-03-17 00:00:00  "sally" (ebk)      1
7 2012-03-17 00:00:00  "sally" (ebk)      1
8 2012-04-07 00:00:00  "sally" (ebk)      1

Then aggregate the duplicates:

>>> g = df.groupby(["week_end_date", "title_short"]).sum().reset_index()
>>> g
        week_end_date    title_short  sales
0 2012-02-25 00:00:00    "bob" (ebk)      2
1 2012-03-03 00:00:00  "sally" (ebk)      1
2 2012-03-10 00:00:00  "sally" (ebk)      1
3 2012-03-17 00:00:00  "sally" (ebk)      2
4 2012-03-31 00:00:00    "bob" (ebk)      2
5 2012-04-07 00:00:00  "sally" (ebk)      1
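As an aside, this may be related to the df.ix[1,] error in your edit: if you keep the MultiIndex (i.e. skip reset_index()), label-based selection needs keys that match the index levels, and positional selection goes through iloc. A rough sketch, where the particular date and title values are just for illustration:

import pandas as pd

agg = df.groupby(["week_end_date", "title_short"]).sum()

# Positional access works regardless of the index types.
first_row = agg.iloc[0]

# Label access wants keys matching the two levels: a Timestamp for the date
# level and the full title string for the second level; a bare integer key
# against a datetime level is the kind of thing that raises a type mismatch.
bob_feb = agg.loc[(pd.Timestamp("2012-02-25"), '"bob" (ebk)')]

# Or take the cross-section for one week by the first level alone.
that_week = agg.loc[pd.Timestamp("2012-02-25")]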

And then do whatever covariance work you need (note that cov is a Series/DataFrame/groupby method too, so you shouldn't need to call np.cov specifically).
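For instance, one way to get the "one vector per title" shape and a covariance matrix from there; just a sketch, and treating weeks where a title has no row as zero sales (the fillna(0)) is an assumption you may not want:

# One column of weekly sales per title, one row per week.
wide = g.pivot(index="week_end_date", columns="title_short", values="sales")

# Titles with no row for a given week come out as NaN; filling with 0 assumes
# "no row" means "no sales" -- drop this line if that's not right for your data.
wide = wide.fillna(0)

# Pairwise covariance between the title columns (same idea as np.cov).
print(wide.cov())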

