python - Pandas Shaping Data for Covariance
I need to conduct a simple covariance analysis on a time series. The raw data comes in a shape like this:
week_end_date              title_short         sales
2012-02-25 00:00:00.000000 "bob" (ebk)         1
                           "bob" (ebk)         1
2012-03-31 00:00:00.000000 "bob" (ebk)         1
                           "bob" (ebk)         1
2012-03-03 00:00:00.000000 "sally" (ebk)       1
2012-03-10 00:00:00.000000 "sally" (ebk)       1
2012-03-17 00:00:00.000000 "sally" (ebk)       1
                           "sally" (ebk)       1
2012-04-07 00:00:00.000000 "sally" (ebk)       1
As you can see, there are duplicates. Unless I'm missing something, I need the data to become a set of vectors, one for each title, so that I can use numpy.cov.
Question:
How do I find the duplicates in date and name, and aggregate them as a sum? I've been trying to use pandas groupby on week_end_date and title_short, but the result comes out indexed in a way I don't understand.
Edit: To be specific, when I try df.groupby(["week_end_date", "title_short"]), I get this:
>df.ix[0:3]
week_end_date               title_short
2012-02-04 00:00:00.000000  'salem's lot (ebk)    <pandas.core.indexing._NDFrameIndexer object a...
                            'tis season! (ebk)    <pandas.core.indexing._NDFrameIndexer object a...
                            (not asked) (ebk)     <pandas.core.indexing._NDFrameIndexer object a...
dtype: object
And trying to select with df.ix[1,] gets this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/series.py", line 613, in __getitem__
    return self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 1630, in get_value
    loc = self.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2285, in get_loc
    result = slice(*self.slice_locs(key, key))
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2226, in slice_locs
    start_slice = self._partial_tup_index(start, side='left')
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2250, in _partial_tup_index
    raise Exception('Level type mismatch: %s' % lab)
Exception: Level type mismatch: 3
I'm not entirely sure what's going on, but here's how I'd start. First, read in the data (which looks fixed-width to me):
>>> import pandas as pd
>>> df = pd.read_fwf("weekend.dat", widths=(26, 20, 9), parse_dates=[0])
>>> df = df.fillna(method="ffill")
>>> df
        week_end_date    title_short  sales
0 2012-02-25 00:00:00    "bob" (ebk)      1
1 2012-02-25 00:00:00    "bob" (ebk)      1
2 2012-03-31 00:00:00    "bob" (ebk)      1
3 2012-03-31 00:00:00    "bob" (ebk)      1
4 2012-03-03 00:00:00  "sally" (ebk)      1
5 2012-03-10 00:00:00  "sally" (ebk)      1
6 2012-03-17 00:00:00  "sally" (ebk)      1
7 2012-03-17 00:00:00  "sally" (ebk)      1
8 2012-04-07 00:00:00  "sally" (ebk)      1
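If you want to try the read step without the file, here is a minimal sketch. It makes a few assumptions of mine: the rows are a hypothetical stand-in for "weekend.dat" built in memory, the column names are passed by hand, and Series.ffill is used for the forward fill because recent pandas releases deprecate the method argument of fillna.

```python
import io
import pandas as pd

# Hypothetical rows standing in for "weekend.dat"; an empty date means
# "same week as the row above", as in the question's sample.
rows = [
    ("2012-02-25 00:00:00.000000", '"bob" (ebk)', 1),
    ("", '"bob" (ebk)', 1),
    ("2012-03-17 00:00:00.000000", '"sally" (ebk)', 1),
    ("", '"sally" (ebk)', 1),
]
# Lay each row out as fixed-width fields of 26, 20 and 9 characters.
text = "\n".join(f"{d:<26} {t:<19}{s:>9}" for d, t, s in rows)

df = pd.read_fwf(
    io.StringIO(text),
    widths=[26, 20, 9],
    header=None,
    names=["week_end_date", "title_short", "sales"],
)
# Fill the blank dates from the row above, then convert to datetimes.
df["week_end_date"] = pd.to_datetime(df["week_end_date"].ffill())
print(df)
```

Forward-filling only the date column keeps a genuinely missing title or sales value from being silently copied down.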
Then aggregate the dups:
>>> g = df.groupby(["week_end_date", "title_short"]).sum().reset_index()
>>> g
        week_end_date    title_short  sales
0 2012-02-25 00:00:00    "bob" (ebk)      2
1 2012-03-03 00:00:00  "sally" (ebk)      1
2 2012-03-10 00:00:00  "sally" (ebk)      1
3 2012-03-17 00:00:00  "sally" (ebk)      2
4 2012-03-31 00:00:00    "bob" (ebk)      2
5 2012-04-07 00:00:00  "sally" (ebk)      1
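The puzzling output in the question most likely comes from the fact that groupby on its own only returns a lazy GroupBy object; nothing is combined until an aggregation such as sum is applied. Here is a self-contained sketch of the same step, assuming the sample rows are typed in by hand and using as_index=False as an alternative to reset_index:

```python
import pandas as pd

# The question's sample rows, typed in by hand with the dates filled down.
df = pd.DataFrame({
    "week_end_date": pd.to_datetime([
        "2012-02-25", "2012-02-25", "2012-03-31", "2012-03-31",
        "2012-03-03", "2012-03-10", "2012-03-17", "2012-03-17",
        "2012-04-07",
    ]),
    "title_short": ['"bob" (ebk)'] * 4 + ['"sally" (ebk)'] * 5,
    "sales": [1] * 9,
})

# Sum sales per (week, title); as_index=False keeps the keys as ordinary
# columns instead of moving them into a MultiIndex.
g = df.groupby(["week_end_date", "title_short"], as_index=False)["sales"].sum()
print(g)
```

Without as_index=False (or reset_index), the two keys become a MultiIndex, which is why plain integer lookups on the result can fail.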
And then do whatever cov stuff you need (note that cov is a Series/DataFrame/groupby method too, so you shouldn't need to call np.cov specifically).
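To get one vector per title, as the question asks, one option is to pivot the data so that each title becomes a column and then call DataFrame.cov. This is only a sketch under my own assumptions: the sample rows are typed in by hand, and a week with no row for a title is treated as zero sales, which may or may not suit your data.

```python
import pandas as pd

# The question's sample rows, typed in by hand with the dates filled down.
df = pd.DataFrame({
    "week_end_date": pd.to_datetime([
        "2012-02-25", "2012-02-25", "2012-03-31", "2012-03-31",
        "2012-03-03", "2012-03-10", "2012-03-17", "2012-03-17",
        "2012-04-07",
    ]),
    "title_short": ['"bob" (ebk)'] * 4 + ['"sally" (ebk)'] * 5,
    "sales": [1] * 9,
})

# One row per week, one column per title. Duplicates are summed, and a
# missing (week, title) pair is ASSUMED to mean zero sales.
wide = df.pivot_table(
    index="week_end_date",
    columns="title_short",
    values="sales",
    aggfunc="sum",
    fill_value=0,
)

# Pairwise sample covariance between the title columns.
cov = wide.cov()
print(cov)
```

If you would rather leave missing weeks as NaN, drop fill_value; cov then uses only the weeks where both titles have a value.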