Pulling links and scraping those pages in Python
First I scrape the links page for the team's 2012 past results. This gets the links I want:

    import re
    import urllib2
    from bs4 import BeautifulSoup

    url = 'http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)
    boxscores = soup.findAll('a', href=re.compile('boxscore'))
Now I want to scrape every boxscore page. I have made code that scrapes a single boxscore, but I don't know how to get at the linked pages.
Edit:
I guess this way is better since it strips out the html tags, but I still need to know how to open the links.
    for link in soup.find_all('a', href=re.compile('boxscore')):
        print(link.get('href'))
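Something like this is roughly what I have in mind for opening them, though I'm only guessing that joining each href onto the site root is the right way to build the full URLs:

    import re
    import urllib2
    from urlparse import urljoin
    from bs4 import BeautifulSoup

    base = 'http://www.covers.com'
    url = base + '/pageloader/pageloader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # resolve each boxscore href against the site root and fetch that page
    for link in soup.find_all('a', href=re.compile('boxscore')):
        full_url = urljoin(base, link.get('href'))
        html = urllib2.urlopen(full_url).read()  # raw html string for one boxscore page
        print full_url, len(html)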
Edit 2: This is how I scrape the data I want from the first kind of link (a single boxscore page):
    url = 'http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/results/2012/boxscore841602.html'
    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)

    def _unpack(row, kind='td'):
        return [val.text for val in row.findAll(kind)]

    tables = soup('table')
    linescore = tables[1]
    linescore_rows = linescore.findAll('tr')

    roadteamq1 = float(_unpack(linescore_rows[1])[1])
    roadteamq2 = float(_unpack(linescore_rows[1])[2])
    roadteamq3 = float(_unpack(linescore_rows[1])[3])
    roadteamq4 = float(_unpack(linescore_rows[1])[4])

    print roadteamq1, roadteamq2, roadteamq3, roadteamq4
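I was also thinking maybe I should turn that into a function that takes the raw html of a boxscore page, so I could call it on each page I download later. This is just a sketch, and parse_linescore is a name I made up:

    from bs4 import BeautifulSoup

    def _unpack(row, kind='td'):
        return [val.text for val in row.findAll(kind)]

    def parse_linescore(html):
        # html is the raw source of one boxscore page, already downloaded
        soup = BeautifulSoup(html)
        tables = soup('table')
        linescore_rows = tables[1].findAll('tr')
        road_row = _unpack(linescore_rows[1])
        # columns 1-4 of the road team's row are the quarter scores
        return [float(x) for x in road_row[1:5]]

Then I could presumably just do parse_linescore(pages[0]) once I have the pages list.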
However, when I try this:
    url = 'http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)

    tables = pages[0]('table')
    linescore = tables[1]
    linescore_rows = linescore.findAll('tr')
    roadteamq1 = float(_unpack(linescore_rows[1])[1])
    roadteamq2 = float(_unpack(linescore_rows[1])[2])
    roadteamq3 = float(_unpack(linescore_rows[1])[3])
    roadteamq4 = float(_unpack(linescore_rows[1])[4])
I get this error:

    tables = pages[0]('table')
    TypeError: 'str' object is not callable
But

    print pages[0]

spits out all of the html of the first link as normal, so that part isn't what's confusing me. To summarize: I can get the links, but I still can't scrape them.
I want something that pulls the pages of the found links into an array, so the first page is pages[0], the second is pages[1], and so on, like this:
    boxscores = soup.findAll('a', href=re.compile('boxscore'))
    basepath = "http://www.covers.com"
    pages = []
    for a in boxscores:
        pages.append(urllib2.urlopen(basepath + a['href']).read())
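My best guess is that the error happens because each entry in pages is just a plain string of html, so I'd have to run it back through BeautifulSoup before indexing tables out of it. Something like this is what I have in mind (untested):

    for page_html in pages:
        # parse the raw html string first, then pull the tables out of the soup
        soup = BeautifulSoup(page_html)
        tables = soup('table')
        linescore_rows = tables[1].findAll('tr')
        roadteamq1 = float(_unpack(linescore_rows[1])[1])
        roadteamq2 = float(_unpack(linescore_rows[1])[2])
        roadteamq3 = float(_unpack(linescore_rows[1])[3])
        roadteamq4 = float(_unpack(linescore_rows[1])[4])
        print roadteamq1, roadteamq2, roadteamq3, roadteamq4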