Update: How to parse HTML with Python / BeautifulSoup


First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to do that from the HTML), email, phone, location data if possible, any names, any phone numbers, and the tag line of the HTML site if it exists.


Updated code #2:

    import os, csv, re
    from bs4 import BeautifulSoup

    topdir = 'c:\\projects\\training\\html'
    output = csv.writer(open("scrape.csv", "wb+"))
    output.writerow(["headline", "name", "email", "phone", "location", "url"])
    all_contacts = []

    for root, dirs, files in os.walk(topdir):
        for f in files:
            if f.lower().endswith((".html", ".htm")):
                soup = BeautifulSoup(f)

                def mailto_link(soup):
                    if soup.name != 'a':
                        return None
                    for key, value in soup.attrs:
                        if key == 'href':
                            m = re.search('mailto:(.*)', value)
                        if m:
                            all_contacts.append(m)
                        return m.group(1)
                    return None

                for ul in soup.findAll('ul'):
                    contact = []
                    for li in soup.findAll('li'):
                        s = li.find('span')
                        if not (s and s.string):
                            continue
                        if s.string == 'email:':
                            a = li.find(mailto_link)
                            if a:
                                contact['email'] = mailto_link(a)
                        elif s.string == 'website:':
                            a = li.find('a')
                            if a:
                                contact['website'] = a['href']
                        elif s.string == 'phone:':
                            contact['phone'] = unicode(s.nextSibling).strip()
                    all_contacts.append(contact)
                    output.writerow([all_contacts])

    print "finished"

This output doesn't return anything other than the row headers. What am I missing here? It should at least be returning some info from the HTML file, which is this page: http://bendoeslife.tumblr.com/about

There are (at least) two problems here.

First, f is the filename, not the file contents, nor the soup made from those contents. So, f.find('h2') is going to look for 'h2' within the filename, which isn't very useful.
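A minimal sketch of getting from the filename to the contents, using only the standard library. The helper name iter_html_files is hypothetical, and reading the files as UTF-8 is an assumption about how the pages were saved. The text it yields is what you would hand to BeautifulSoup instead of f.

```python
import io
import os


def iter_html_files(topdir):
    """Yield (path, contents) for every .html/.htm file under topdir."""
    for root, dirs, files in os.walk(topdir):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                # os.walk yields bare file names, so join them with the
                # directory they were found in before opening.
                path = os.path.join(root, name)
                # Assumption: the saved pages are UTF-8; undecodable bytes
                # are replaced rather than raising an error.
                with io.open(path, encoding="utf-8", errors="replace") as fh:
                    yield path, fh.read()
```

Note that the path returned here is also the closest thing to a "page URL" you can get for an offline file unless the page itself records where it came from.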

Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of the number. For example:

    >>> s = 'A string with an h2 in it'
    >>> i = s.find('h2')
    >>> str(i)
    '17'

So, your code is doing this:

    >>> f = 'c:\\python\\training\\offline\\somehtml.html'
    >>> headline = f.find('h2')
    >>> str(headline)
    '-1'

You want to call methods on the soup object, rather than on f. BeautifulSoup.find returns a "sub-tree" of the soup, which is what you want to stringify here.
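To illustrate the difference, here is a small sketch that calls find on a soup rather than on a string. The HTML snippet is a made-up stand-in for the contents of one of your files, and passing "html.parser" selects the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of one saved page.
html = "<html><body><h2>About me</h2><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

headline = soup.find("h2")   # a Tag (a sub-tree of the soup), not an index
missing = soup.find("h3")    # None, because nothing matches

headline_markup = str(headline)       # the tag with its markup
headline_text = headline.get_text()   # only the text inside the tag
```

Because find returns None when nothing matches, it is worth checking the result before calling anything on it.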

However, it's impossible to test this without your sample input, so I can't promise that's the only problem in your code.

Meanwhile, when you get stuck like this, you should try printing out the intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.


Just replacing f with soup in the find calls, fixing the indentation error, and running against your sample file http://bendoeslife.tumblr.com/about now works.

It doesn't do anything useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None, and the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With 3 different parsers, I get <p>about</p>, or <html><body><p>about</p></body></html>, and <html><body></body></html>.

You need to understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "email", inside an <li> element with an id of "email". So, you need to write a find that locates it based on one of those criteria, or on something else it matches.
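As a rough sketch of such a find: the markup below is invented to match the description above (an <li> with id "email" holding an <a> with title "email"), so the real page may differ in its details. The helper name find_email and the example address are hypothetical too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the description above; not the real page.
html = """
<ul>
  <li id="email"><a title="email" href="mailto:someone@example.com">Email me</a></li>
  <li id="twitter"><a title="twitter" href="https://example.com/someone">Twitter</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def find_email(soup):
    """Return the address from the mailto link, or None if there is none."""
    li = soup.find("li", id="email")         # criterion 1: the <li> id
    a = li.find("a") if li else None
    if a is None:
        a = soup.find("a", title="email")    # criterion 2: the <a> title
    if a is None:
        return None
    href = a.get("href", "")
    if not href.startswith("mailto:"):
        return None
    return href[len("mailto:"):]


email = find_email(soup)
```

The same pattern (find the container by an attribute, then read the link or text inside it) should carry over to the other fields once you know how each one is marked up.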

