Update: How to parse HTML with Python/BeautifulSoup
First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to do that from the HTML), email, phone, location data if possible, any names, any phone numbers, and the tag line for the HTML site if it exists.
Updated #2 code:
    import os, csv, re
    from bs4 import BeautifulSoup

    topdir = 'c:\\projects\\training\\html'
    output = csv.writer(open("scrape.csv", "wb+"))
    output.writerow(["headline", "name", "email", "phone", "location", "url"])
    all_contacts = []

    for root, dirs, files in os.walk(topdir):
        for f in files:
            if f.lower().endswith((".html", ".htm")):
                soup = BeautifulSoup(f)

                def mailto_link(soup):
                    if soup.name != 'a':
                        return None
                    for key, value in soup.attrs:
                        if key == 'href':
                            m = re.search('mailto:(.*)', value)
                            if m:
                                all_contacts.append(m)
                                return m.group(1)
                    return None

                for ul in soup.findAll('ul'):
                    contact = []
                    for li in soup.findAll('li'):
                        s = li.find('span')
                        if not (s and s.string):
                            continue
                        if s.string == 'email:':
                            a = li.find(mailto_link)
                            if a:
                                contact['email'] = mailto_link(a)
                        elif s.string == 'website:':
                            a = li.find('a')
                            if a:
                                contact['website'] = a['href']
                        elif s.string == 'phone:':
                            contact['phone'] = unicode(s.nextSibling).strip()
                    all_contacts.append(contact)
                output.writerow([all_contacts])
    print "finished"

This output doesn't return anything other than the row headers. What am I missing here? It should at least be returning some info from the HTML file, which is this page: http://bendoeslife.tumblr.com/about
There are (at least) two problems here.
First, f is the filename, not the file contents or the soup made from those contents. So, f.find('h2') is going to find 'h2' within the filename, which isn't very useful.
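A minimal sketch of getting from the filename to the contents, assuming Python 3 and UTF-8 files. The helper name iter_html_files is hypothetical and not part of the original code; it joins root and f into a full path and reads the file, so there is real markup to hand to the parser.

```python
import os

def iter_html_files(topdir):
    """Yield (path, contents) for every .html/.htm file under topdir.

    iter_html_files is a hypothetical helper name used for illustration.
    """
    for root, dirs, files in os.walk(topdir):
        for f in files:
            if f.lower().endswith((".html", ".htm")):
                # f is only the bare filename; join it with root to get a usable path.
                path = os.path.join(root, f)
                # errors="replace" is an assumption so an odd encoding doesn't abort the walk.
                with open(path, encoding="utf-8", errors="replace") as fh:
                    yield path, fh.read()
```

Each contents string, rather than f, is what would then be passed to BeautifulSoup.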
Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of a number. For example:

    >>> s = 'A string with an h2 in it'
    >>> i = s.find('h2')
    >>> str(i)
    '17'

So, your code is doing something like this:

    >>> f = 'c:\\python\\training\\offline\\somehtml.html'
    >>> headline = f.find('h2')
    >>> str(headline)
    '-1'

You want to call the methods on the soup object, rather than on f. BeautifulSoup.find returns a "sub-tree" of the soup, which is exactly what you want to stringify here.
However, it's impossible to test this without your sample input, so I can't promise that's the only problem in your code.
Meanwhile, when you get stuck with something like this, you should try printing out intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.
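For instance, a quick sketch of that kind of debugging, assuming Python 3 with the bs4 package and its built-in html.parser backend. The sample markup and the names wrong and right are made up for illustration; the point is only to see what str.find and the soup's find each hand back.

```python
from bs4 import BeautifulSoup

f = "somehtml.html"  # a filename, as produced by the loop over files
html = "<html><body><h2>About Ben</h2></body></html>"  # made-up file contents
soup = BeautifulSoup(html, "html.parser")

wrong = f.find("h2")     # str.find: an integer index, -1 when not found
right = soup.find("h2")  # BeautifulSoup find: a Tag, or None when not found

# Print each intermediate value with its type to see where things go wrong.
for name, value in [("f", f), ("wrong", wrong), ("right", right)]:
    print(name, type(value).__name__, repr(value))

# Guard against None before extracting text from the tag.
headline = right.get_text(strip=True) if right is not None else ""
```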
Just replacing f with soup in the find calls, and fixing the indentation error, running against your sample file http://bendoeslife.tumblr.com/about now works.
It doesn't do anything very useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None. And the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With three different parsers, I get <p>about</p>, or <html><body><p>about</p></body></html>, and <html><body></body></html>…
You need to understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "email", inside an <li> element with an id of "email". So, you need to write a find that locates it based on one of those criteria, or on something else it matches.
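A rough sketch of such a find, assuming Python 3 with bs4 and the built-in html.parser backend. The markup below is made up to mirror the structure just described (it is not copied from the real page), the address is a placeholder, and find_email is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup mirroring the described structure; the address is a placeholder.
html = '<ul><li id="email"><a title="email" href="mailto:someone@example.com">Email</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

def find_email(soup):
    """Return the address from the first <a title="email"> mailto link, or None."""
    a = soup.find("a", title="email", href=re.compile(r"^mailto:"))
    if a is None:
        return None
    # Strip the "mailto:" scheme to leave just the address.
    return a["href"][len("mailto:"):]

email = find_email(soup)
li = soup.find("li", id="email")  # alternative criterion: the enclosing <li>
```

Either criterion would do here; matching on the title attribute of the link is just one choice, and a real page may need a different one.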