Update: How to parse HTML with Python/BeautifulSoup
First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to get that from the HTML), email, phone, location data if possible, names, phone numbers, and the tag line for the HTML site if it exists.
Updated #2 code:
import os, csv, re
from bs4 import BeautifulSoup

topdir = 'c:\\projects\\training\\html'
output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

            def mailto_link(soup):
                if soup.name != 'a':
                    return None
                for key, value in soup.attrs:
                    if key == 'href':
                        m = re.search('mailto:(.*)', value)
                        if m:
                            all_contacts.append(m)
                            return m.group(1)
                return None

            for ul in soup.findAll('ul'):
                contact = []
                for li in soup.findAll('li'):
                    s = li.find('span')
                    if not (s and s.string):
                        continue
                    if s.string == 'email:':
                        a = li.find(mailto_link)
                        if a:
                            contact['email'] = mailto_link(a)
                    elif s.string == 'website:':
                        a = li.find('a')
                        if a:
                            contact['website'] = a['href']
                    elif s.string == 'phone:':
                        contact['phone'] = unicode(s.nextSibling).strip()
                all_contacts.append(contact)
                output.writerow([all_contacts])
print "finished"
This output doesn't return anything other than the row headers. What am I missing here? It should at least be returning some info from the HTML file, which is this page: http://bendoeslife.tumblr.com/about
There are (at least) two problems here.

First, f is the filename, not the file contents, or the soup made from those contents. So, f.find('h2') is going to find 'h2' within the filename, which isn't very useful.

Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of the number. For example:
>>> s = 'a string with an h2 in it'
>>> i = s.find('h2')
>>> str(i)
'17'
So, your code is doing this:
>>> f = 'c:\\python\\training\\offline\\somehtml.html'
>>> headline = f.find('h2')
>>> str(headline)
'-1'
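To make the first point concrete, here is a minimal sketch (Python 3, standard library only) of the difference between the bare name that os.walk hands you and the contents of that file. The helper name iter_html_files and the sample file about.html are made up for illustration; they are not part of the original script.

```python
import os
import tempfile


def iter_html_files(topdir):
    """Yield (path, contents) for every .html/.htm file under topdir."""
    for root, dirs, files in os.walk(topdir):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                # os.walk yields bare file names; join with root to get a usable path.
                path = os.path.join(root, name)
                with open(path, encoding="utf-8") as fh:
                    yield path, fh.read()


# Hypothetical sample data, created only so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as topdir:
    with open(os.path.join(topdir, "about.html"), "w", encoding="utf-8") as fh:
        fh.write("<html><body><h2>About</h2></body></html>")

    results = list(iter_html_files(topdir))
    name_index = "about.html".find("h2")       # searching the file name
    contents_index = results[0][1].find("h2")  # searching the file contents
    print(name_index, contents_index)
```

Searching the name gives -1 because "h2" never appears in it, while searching the contents gives a real position. Either way it is still only an index, which is the second problem.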
You want to call the methods on the soup object, rather than on f. BeautifulSoup.find returns a "sub-tree" of the soup, which is exactly what you want to stringify here.
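As a rough illustration of what that looks like, the sketch below builds a soup from a string and calls find on it. The HTML is a made-up sample rather than your real file, and passing "html.parser" explicitly is my own choice so the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page contents; in the real script this would be read from the file.
html = "<html><body><h2>About Ben</h2><p>Some text</p></body></html>"

# Naming the parser explicitly keeps results consistent across machines.
soup = BeautifulSoup(html, "html.parser")

headline = soup.find("h2")      # a Tag (sub-tree) or None, never an index
headline_markup = str(headline)        # the tag together with its markup
headline_text = headline.get_text()    # just the text inside the tag
print(headline_markup)
print(headline_text)
```

Whether you want the markup or only the text depends on what should end up in the CSV; get_text is probably closer to what a "headline" column needs.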
However, it's impossible to test this without your sample input, so I can't promise that's the only problem in your code.
Meanwhile, when you get stuck like this, you should try printing out the intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.
After just replacing f with soup in the find calls and fixing the indentation error, running against your sample file http://bendoeslife.tumblr.com/about now works.
It doesn't do anything useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None. And the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With three different parsers, I get <p>about</p>, or <html><body><p>about</p></body></html>, and <html><body></body></html>…
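Since find returns None when nothing matches, it helps to handle that case explicitly instead of stringifying whatever comes back. Here is one possible sketch; the helper name tag_text is hypothetical, and the sample markup is only a stand-in for a page without an h2.

```python
from bs4 import BeautifulSoup


def tag_text(soup, name):
    """Return the stripped text of the first <name> tag, or '' if there is none."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else ""


# Hypothetical page with no h2, mirroring the situation described above.
soup = BeautifulSoup("<html><body><p>about</p></body></html>", "html.parser")

headline = tag_text(soup, "h2")   # no h2 in the page, so this is ''
paragraph = tag_text(soup, "p")   # the p tag exists, so this is its text
print(repr(headline), repr(paragraph))
```

An empty string in the CSV is at least honest about the field being missing, whereas str(None) would write the word "None" into the row.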
You need to understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "email", which is in an <li> element with an id of "email". So, you need to write a find to locate it based on one of those criteria, or something else it matches.