java - Get the nodes in an HTML document that contain a word
I want to write a script that checks a document for keywords and reports which nodes of the HTML document they are contained in (possibly assigning each a unique identifier).
I am not a professional programmer and do not know the strengths of low-level languages and the like, so I am afraid of doing something badly or in an unsupported way.
How can I isolate the desired nodes?
My experience is with JS and PHP, and with PHP only for simple things. Also, I do not want to take the option of working with the nodes in JS. My thoughts:
- Make a string of the HTML.
- Verify that the words exist on the page.
- If a word exists on the page: for each node in the body element, find its first and last positions (for example, scan character by character for each opening tag to learn its position, and from that calculate the position where the tag was opened and the position where it was closed, and so on for all nodes).
We then know the position of the word (e.g. 192 to 199) and can check which of the ranges we collected it falls in (in this case, the ranges are the nodes of the HTML document).
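The position-range idea above can be sketched with Python's standard-library `html.parser`, which reports where each tag starts so the scanning does not have to be done by hand. This is only a rough illustration under some assumptions: the names `NodeRangeParser` and `nodes_containing` are made up for this sketch, the numeric node id is simply the order in which opening tags appear, and the search is a plain substring search, so a keyword that also occurs inside a tag name or attribute would be matched too.

```python
from html.parser import HTMLParser


class NodeRangeParser(HTMLParser):
    """Records (node_id, tag, start, end) character ranges for closed elements."""

    def parse(self, html):
        self.html = html
        # Absolute offset at which each line starts, so that the
        # (line, column) pair from getpos() can be turned into one index.
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.stack = []    # currently open elements: (node_id, tag, start)
        self.ranges = []   # finished elements: (node_id, tag, start, end)
        self.next_id = 0
        self.feed(html)
        self.close()
        return self.ranges

    def _offset(self):
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.stack.append((self.next_id, tag, self._offset()))
        self.next_id += 1

    def handle_endtag(self, tag):
        close_start = self._offset()
        # Find the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] == tag:
                node_id, _, start = self.stack[i]
                end = self.html.find(">", close_start) + 1
                self.ranges.append((node_id, tag, start, end))
                # Drop it together with any unclosed children (e.g. <br>).
                del self.stack[i:]
                break


def nodes_containing(html, word):
    """Return the innermost element range around each occurrence of word."""
    ranges = NodeRangeParser().parse(html)
    hits = []
    pos = html.find(word)
    while pos != -1:
        end = pos + len(word)
        enclosing = [r for r in ranges if r[2] <= pos and end <= r[3]]
        if enclosing:
            # The smallest enclosing range is the innermost node.
            hits.append(min(enclosing, key=lambda r: r[3] - r[2]))
        pos = html.find(word, end)
    return hits
```

Calling `nodes_containing(html, "world")` returns one tuple per occurrence, and `html[start:end]` gives back the markup of the matching node. Elements that are never closed are skipped in this sketch.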
I need ideas from experienced programmers. The programming language does not matter (except web-oriented ones); every opinion is important to me. There may be libraries that solve such problems. I hope you understand me; English is not my native language.
I recommend Beautiful Soup for this kind of thing. It is a Python library that allows you to parse XML/HTML documents quickly. You could quite quickly get something running that extracts the text from each div element, as you have thought. Using Python's built-in string manipulation tools, I'm sure searching for particular words would be simple.
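As a rough illustration of that suggestion, the sketch below assumes the third-party `beautifulsoup4` package is installed and uses it with Python's built-in `html.parser` backend. The function name `find_nodes_with_word` and the `data-match-id` attribute are invented for this example; they only show one possible way to tag each matching node with a unique identifier.

```python
import re

from bs4 import BeautifulSoup


def find_nodes_with_word(html, word):
    """Return the parsed soup and the tags whose own text contains word."""
    soup = BeautifulSoup(html, "html.parser")
    # Whole-word, case-insensitive match; re.escape keeps the keyword literal.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    matches = []
    # find_all(string=...) yields the text nodes that match the pattern;
    # .parent is the element that directly contains each of them.
    for n, text_node in enumerate(soup.find_all(string=pattern)):
        tag = text_node.parent
        tag["data-match-id"] = f"match-{n}"  # hypothetical identifier scheme
        matches.append(tag)
    return soup, matches
```

Because the parser works on text nodes, a keyword that only appears in a tag name or attribute is not matched, which avoids the main pitfall of searching the raw HTML string. Printing `str(soup)` afterwards shows the document with the added identifiers.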