Pulling links and scraping those pages in Python


I want to scrape the links from this page:

http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html 

This gets the links I want:

    import re
    import urllib2
    from bs4 import BeautifulSoup

    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)
    boxscores = soup.findAll('a', href=re.compile('boxscore'))

I want to scrape every boxscore page. I have already written code that scrapes a single boxscore, but I don't know how to get at the pages behind these links.

Edit:

I guess this way is better, since it strips out the HTML tags. I still need to know how to open the links.

    for link in soup.find_all('a', href=re.compile('boxscore')):
        print(link.get('href'))

Edit 2: This is how I scrape some of the data from the first link on the page.

    url = 'http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/results/2012/boxscore841602.html'
    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)

    def _unpack(row, kind='td'):
        # Return the text of every cell of the given kind in a table row.
        return [val.text for val in row.findAll(kind)]

    tables = soup('table')
    linescore = tables[1]
    linescore_rows = linescore.findAll('tr')
    roadteamq1 = float(_unpack(linescore_rows[1])[1])
    roadteamq2 = float(_unpack(linescore_rows[1])[2])
    roadteamq3 = float(_unpack(linescore_rows[1])[3])
    roadteamq4 = float(_unpack(linescore_rows[1])[4])

    print roadteamq1, roadteamq2, roadteamq3, roadteamq4
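The four near-identical lines can be collapsed with a slice. This is only a minimal sketch: it assumes `_unpack` returns one string per cell, with the team name first and the four quarter scores right after it. `quarter_scores` is a hypothetical helper, and the row values are made up for illustration.

```python
def quarter_scores(cells):
    """Return the four quarter scores from one linescore row.

    `cells` is the list of cell texts for the row, for example the result
    of _unpack(linescore_rows[1]). Index 0 is assumed to hold the team name.
    """
    return [float(text) for text in cells[1:5]]


# Hypothetical row contents, for illustration only.
sample_row = ['Road Team', '20', '18', '25', '22', '85']
q1, q2, q3, q4 = quarter_scores(sample_row)
print(q1, q2, q3, q4)
```

If the real table has a different column layout, the slice bounds would need to change accordingly.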

However, when I try this:

    url = 'http://www.covers.com/pageloader/pageloader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
    boxurl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(boxurl)

    tables = pages[0]('table')  # this is the line that raises the TypeError
    linescore = tables[1]
    linescore_rows = linescore.findAll('tr')
    roadteamq1 = float(_unpack(linescore_rows[1])[1])
    roadteamq2 = float(_unpack(linescore_rows[1])[2])
    roadteamq3 = float(_unpack(linescore_rows[1])[3])
    roadteamq4 = float(_unpack(linescore_rows[1])[4])

I get an error:

    tables = pages[0]('table')
    TypeError: 'str' object is not callable

print pages[0] 

That spits out all of the HTML of the first link, as normal. Hopefully that's not too confusing. To summarize: I can get the links, but I still can't scrape them.
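The TypeError can be reproduced without any network access, which may make the cause easier to see. In this sketch, `page` is a made-up stand-in for one element of `pages`: `.read()` returns the raw HTML as a string, and a string cannot be called like a function.

```python
# Stand-in for pages[0]: raw HTML text, not a parsed BeautifulSoup object.
page = "<html><body><table></table></body></html>"

try:
    tables = page('table')  # same shape as pages[0]('table')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Only a parsed object supports the `soup('table')` shorthand, so each string would first have to go through `BeautifulSoup(...)`, the same way `soup = BeautifulSoup(boxurl)` is built in the single-page example.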

Something like this pulls the pages of all the found links into an array, so the first page is pages[0], the second is pages[1], and so on:

    boxscores = soup.findAll('a', href=re.compile('boxscore'))
    basepath = "http://www.covers.com"
    pages = []
    for a in boxscores:
        # Each entry is the raw HTML of one boxscore page, as a string.
        pages.append(urllib2.urlopen(basepath + a['href']).read())
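A slightly more defensive variant of the loop, offered as a sketch rather than a drop-in replacement. It is written for Python 3, where `urllib2.urlopen` became `urllib.request.urlopen`, and it uses `urljoin` instead of string concatenation so that hrefs which are already absolute are left alone. `fetch_pages`, `download`, and the `fetch` parameter are hypothetical names introduced here so the loop can be tried without hitting the site, and the sample href is assumed from the boxscore URL shown in the question.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_PATH = "http://www.covers.com"


def download(url):
    """Fetch one URL and return the raw response body (bytes)."""
    with urlopen(url) as response:
        return response.read()


def fetch_pages(hrefs, fetch=download, base=BASE_PATH):
    """Return the fetched content for each href, in the same order."""
    pages = []
    for href in hrefs:
        full_url = urljoin(base, href)  # relative hrefs get the base prepended
        pages.append(fetch(full_url))
    return pages


# Try it offline with a fake fetcher that just echoes the URL it was given.
# The href is an assumed example, not taken from the live page.
hrefs = ["/pageloader/pageloader.aspx?page=/data/wnba/results/2012/boxscore841602.html"]
pages = fetch_pages(hrefs, fetch=lambda url: url)
print(pages[0])
```

With the real `download`, each element of `pages` is still raw HTML, so it has to be passed to `BeautifulSoup(...)` before `('table')` can be called on it.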
