java - Get the nodes in an HTML document that contain a word


I want to write a script that checks a document for keywords and determines which nodes of the HTML document contain them (possibly assigning each one a unique identifier).

I am not a professional programmer and do not know the strengths of low-level languages and the like, so I'm afraid of doing something badly or in an unsupported way.

How can I isolate the desired nodes?

My experience is with JS and PHP, and with PHP only for simple things. Also, I do not want to use JS to work with the nodes. My thoughts:

  • Turn the HTML into a string.
  • Verify that the words exist on the page.
  • If a word exists on the page: for each node in the body element, find its first and last positions (for example, by scanning every character for opening tags we know each position, and can therefore calculate where a tag was opened and where it was closed, and so on for all nodes).

We know the position of the word (e.g. 192 to 199) and check which of the ranges we obtained it falls into (in this case, the ranges are the nodes of the HTML document).
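The position-range idea above can be sketched with Python's standard `html.parser` module. This is only a rough illustration under some assumptions: the names `NodeRangeParser` and `nodes_containing` are made up for this example, the numeric node ids are simply the order in which tags open, and a node's range runs from the start of its opening tag to the start of its closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they get no range here.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeRangeParser(HTMLParser):
    """Records (node_id, tag, start_offset, end_offset) for each element."""

    def __init__(self, html):
        super().__init__()
        # Offsets at which each line starts, to turn (line, column) into an index.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.open_nodes = []  # stack of (node_id, tag, start_offset)
        self.ranges = []      # closed nodes: (node_id, tag, start, end)
        self.next_id = 0
        self.feed(html)
        self.close()

    def _offset(self):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.open_nodes.append((self.next_id, tag, self._offset()))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.open_nodes) - 1, -1, -1):
            if self.open_nodes[i][1] == tag:
                end = self._offset()
                for node_id, name, start in reversed(self.open_nodes[i:]):
                    self.ranges.append((node_id, name, start, end))
                del self.open_nodes[i:]
                break


def nodes_containing(html, word):
    """Return (word_position, node_id, tag) for the innermost node around each hit."""
    parser = NodeRangeParser(html)
    hits = []
    pos = html.find(word)
    while pos != -1:
        enclosing = [r for r in parser.ranges if r[2] <= pos < r[3]]
        if enclosing:
            node_id, tag, _, _ = max(enclosing, key=lambda r: r[2])
            hits.append((pos, node_id, tag))
        pos = html.find(word, pos + 1)
    return hits


if __name__ == "__main__":
    page = '<body>\n<div id="a"><p>red apple</p></div>\n<p>green pear</p>\n</body>'
    print(nodes_containing(page, "apple"))
```

One caveat of searching the raw string: a keyword that also appears in a tag name or attribute value is reported as a hit too, which is one reason a parser that separates text from markup tends to be more robust.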

I need ideas from experienced programmers. The programming language does not matter (except web-oriented ones); every opinion is important to me. Perhaps there are libraries that solve such problems. I hope you understand me; English is not my native language.

I would recommend Beautiful Soup for this kind of thing. It is a Python library that lets you parse XML/HTML documents quickly. It should be fairly quick to get something running that extracts the text of each div element, as you had in mind. Using Python's built-in string manipulation tools, I'm sure searching for particular words will be simple.
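As a minimal sketch of that suggestion, assuming the `beautifulsoup4` package is installed: the function name `tag_nodes_with_keyword` and the `data-match-id` attribute are hypothetical choices for this example, not part of the library. It finds every text node that contains the keyword and gives the enclosing element a unique identifier.

```python
import re

from bs4 import BeautifulSoup


def tag_nodes_with_keyword(html, keyword):
    """Mark each element whose own text contains the keyword.

    Returns the parsed document and a list of (match_id, tag_name, text).
    """
    soup = BeautifulSoup(html, "html.parser")
    # Whole-word, case-insensitive match; adjust to taste.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)

    matches = []
    counter = 0
    for text_node in soup.find_all(string=pattern):
        element = text_node.parent
        # "data-match-id" is an attribute name invented for this sketch.
        if not element.has_attr("data-match-id"):
            element["data-match-id"] = f"match-{counter}"
            counter += 1
        matches.append((element["data-match-id"], element.name, text_node.strip()))
    return soup, matches


if __name__ == "__main__":
    page = "<div><p>Red apple</p><p>pear</p><span>An APPLE a day</span></div>"
    soup, found = tag_nodes_with_keyword(page, "apple")
    for match_id, tag_name, text in found:
        print(match_id, tag_name, text)
```

Because the search runs over text nodes rather than the raw markup, keywords that happen to appear in tag names or attributes are not matched, and no manual position arithmetic is needed.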

