java - Get the nodes in an HTML document that contain a word

August 15, 2012

I want to write a script that checks a document for keywords and identifies the HTML document nodes in which they are contained (possibly assigning each one a unique identifier).

I am not a professional programmer and do not know the strengths of low-level languages and things plo.., so I'm afraid of doing something bad or unsupported.

How can I isolate the desired nodes?

My experience is with JS and PHP, and PHP only for simple things. Also, I do not want to rely on JS's ability to work with nodes. My thoughts:

Turn the HTML into a string.
Verify that the words exist on the page.
If a word exists on the page: for each node in the body element, find its first and last positions (for example, by scanning the opening tags character by character we know the position where each tag is opened, and can therefore calculate where it is closed, and so on for all nodes).

We know the position of the word (e.g. 192 to 199) and check which of the ranges we obtained it falls in (in this case, the ranges are the nodes of the HTML document).
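The position-range idea described above can be sketched with Python's standard-library `html.parser`. This is only a rough illustration under some assumptions: the names `NodeSpanParser` and `nodes_containing` are made up for this example, elements that are never closed (such as a `<p>` without `</p>`) are simply dropped, and the keyword is searched in the raw string, so a match inside a tag name or attribute is reported too.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeSpanParser(HTMLParser):
    """Record (tag, start, end) character offsets for each closed element."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Offset of the first character of every line, so that the
        # (line, column) pair from getpos() can become a string index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # finished nodes: (tag, start_offset, end_offset)
        self.feed(source)
        self.close()

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        start = self._offset()
        if tag in VOID_TAGS:
            end = start + len(self.get_starttag_text())
            self.spans.append((tag, start, end))
        else:
            self.open_tags.append((tag, start))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(t == tag for t, _ in self.open_tags):
            return  # stray closing tag: ignore it
        end = self.source.index(">", self._offset()) + 1
        while True:
            open_tag, start = self.open_tags.pop()
            if open_tag == tag:
                self.spans.append((tag, start, end))
                break
            # Anything popped here was opened but never closed; it is dropped.


def nodes_containing(source, word):
    """Return the innermost (tag, start, end) span around each occurrence."""
    spans = NodeSpanParser(source).spans
    hits = []
    pos = source.find(word)
    while pos != -1:
        word_end = pos + len(word)
        enclosing = [s for s in spans if s[1] <= pos and word_end <= s[2]]
        if enclosing:
            # The innermost node is the enclosing one with the shortest span.
            hits.append(min(enclosing, key=lambda s: s[2] - s[1]))
        pos = source.find(word, pos + 1)
    return hits


html_doc = '<html><body><div id="a"><p>Hello world</p></div><p>Bye</p></body></html>'
print(nodes_containing(html_doc, "world"))  # [('p', 24, 42)]
```

Here `html_doc[24:42]` is `<p>Hello world</p>`, so the range itself can serve as a kind of identifier for the node.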

I need ideas from experienced programmers. The programming language does not matter (except web-oriented ones); every opinion is important to me. Perhaps there are libraries that solve such problems. I hope you understand me. English is not my native language.

I recommend Beautiful Soup for this kind of thing. It is a Python library that allows you to parse XML/HTML documents quickly. It should be quite easy to get it running and extract the text of each div element, I would have thought. Using Python's built-in string manipulation tools, I'm sure searching for particular words would be simple.
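As a rough sketch of what that suggestion might look like, assuming the `bs4` package is installed: the function name `tag_nodes_with_word` and the `data-keyword-id` attribute are invented for this example and are not part of Beautiful Soup. The code walks every text node, checks it for the word with ordinary string methods, and marks the parent element with a unique identifier.

```python
from bs4 import BeautifulSoup


def tag_nodes_with_word(html, word):
    """Return (soup, matches): the innermost elements whose own text contains word."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    seen = set()  # id() of parents already recorded, to avoid duplicates
    # find_all(string=True) yields every text node in document order.
    for text in soup.find_all(string=True):
        parent = text.parent
        if word.lower() in text.lower() and id(parent) not in seen:
            seen.add(id(parent))
            # "data-keyword-id" is a made-up attribute used as the identifier.
            parent["data-keyword-id"] = str(len(matches))
            matches.append(parent)
    return soup, matches


soup, matches = tag_nodes_with_word(
    "<div><p>Hello world</p><p>Bye</p></div>", "world"
)
for node in matches:
    print(node.name, node["data-keyword-id"])
```

Because the search runs on text nodes rather than on the raw markup, a word that only appears in a tag name or an attribute value is not matched.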
