java - How to import and consolidate similar (but not equal) text from several different sources
I am importing a set of data from several files (Excel files) on a daily basis. The files hold records with no identifiers. The data needs to be stored in a relational database (Oracle).
The problem is that the text may differ in each source, and because there's no unique identifier I need to somehow base the comparison on the text values.
Let's take an example of information from different sources:
Source A: dark knight
Source B: batman dark knight
Source C: dark knight 2008
Source D: dark knight rises
If the database holds an item with item_name "the dark knight", then when I import the lines from sources A, B and C I'd like to get a "full match", but not for D, because that's a different movie.
Things to know:
- The process does not have to be 100% automated. If there's no match, there will be user interaction to match manually or create a new record.
- Although there is user interaction, I want to keep it to a minimum (especially after the user has manually matched an item).
How would I go about solving this without inflating the database with tons of synonyms for each item?
Update 05/21/2013
I have found this: http://matpalm.com/resemblance/
It makes use of the Jaccard coefficient, although I'm not sure it's the best fit for my case because of the complexity: matching takes m x n comparisons, where m is the number of imported records and n is the total number of database records, which is tens of thousands long.
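To make the Jaccard idea and the m x n concern concrete, here is a minimal sketch, assuming each name is compared as a set of lowercase word tokens. The class `TitleMatcher` and its methods are hypothetical names made up for illustration and are not taken from the linked page. The inverted index is one common way to avoid scoring every imported record against every database record. It is an assumption about what might help here, not a measured solution.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only.
class TitleMatcher {

    // Lowercase the text and split it into a set of word tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard coefficient: |A intersect B| / |A union B| over the two token sets.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // treat two empty names as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    // Inverted index: token -> positions of the database names that contain it.
    static Map<String, Set<Integer>> buildIndex(List<String> dbNames) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < dbNames.size(); i++) {
            for (String t : tokens(dbNames.get(i))) {
                index.computeIfAbsent(t, k -> new TreeSet<>()).add(i);
            }
        }
        return index;
    }

    // Only database names sharing at least one token with the imported text
    // are returned, so the rest never need to be scored.
    static Set<Integer> candidates(String imported, Map<String, Set<Integer>> index) {
        Set<Integer> ids = new TreeSet<>();
        for (String t : tokens(imported)) {
            ids.addAll(index.getOrDefault(t, Collections.<Integer>emptySet()));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> db = Arrays.asList("the dark knight", "inception");
        Map<String, Set<Integer>> index = buildIndex(db);
        List<String> imported = Arrays.asList("dark knight", "batman dark knight",
                "dark knight 2008", "dark knight rises");
        for (String line : imported) {
            for (int id : candidates(line, index)) {
                System.out.printf("%s vs %s: %.3f%n", line, db.get(id),
                        jaccard(line, db.get(id)));
            }
        }
    }
}
```

With the example data, source A scores 2/3 against "the dark knight", while B, C and D all score 2/4 = 0.5. So plain token-level Jaccard cannot by itself tell D (a different movie) apart from B and C. A threshold would still leave those cases to the manual matching step, or the tokens would need weighting. The index only reduces how many pairs get scored; very common tokens would still produce large candidate sets.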