java - How to import and consolidate similar (but not equal) text from several different sources


I am importing a set of data from several files (Excel files) on a daily basis. The files hold records with no identifiers. The data needs to be stored in a relational database (Oracle).

The problem is that the text may differ from source to source, and because there is no unique identifier I need to somehow base the comparison on the text values.

Let's take an example of the information from different sources:

  • Source A: dark knight
  • Source B: batman dark knight
  • Source C: dark knight 2008
  • Source D: dark knight rises

If the database holds an item with item_name "the dark knight", then when I import the lines from sources A, B and C I should get a "full match", but not for D, because that's a different movie.

Things to know:

  • The process is not 100% automated: if there is no match, there is user interaction to match the item manually or create a new record.
  • Although there is user interaction, I want to keep it to a minimum (especially after the user has manually matched an item).

How would I go about solving this without inflating the database with tons of synonyms for each item?

Update 05/21/2013

I have found this: http://matpalm.com/resemblance/

It makes use of the Jaccard coefficient, although I'm not sure it's the best fit for my case because of the complexity: matching takes M x N comparisons, where M is the number of imported records and N is the total number of database records, which is in the tens of thousands.
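To make the Jaccard idea concrete, here is a minimal sketch that scores the example titles against the stored item_name. It assumes the coefficient is computed over lower-cased word tokens, which is only one possible choice (character n-grams are another), and the class and method names (`JaccardTitles`, `tokens`, `jaccard`) are made up for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class JaccardTitles {

    // Lower-case the text and split it into a set of word tokens.
    // Anything that is not a letter or a digit is treated as a separator.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard coefficient: |A intersect B| / |A union B|.
    // Two empty token sets are scored as 0 here (an arbitrary convention).
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String stored = "the dark knight"; // item_name already in the database
        String[] imported = {
            "dark knight", "batman dark knight", "dark knight 2008", "dark knight rises"
        };
        // One comparison per (imported, stored) pair; a full import repeats
        // this for every database record, hence the M x N cost.
        for (String s : imported) {
            System.out.printf(Locale.ROOT, "%-20s %.2f%n", s, jaccard(s, stored));
        }
    }
}
```

With word tokens, source A shares 2 of 3 distinct words with the stored name (2/3), while sources B, C and D each share 2 of 4 (1/2). So under this particular tokenization a single threshold would not separate D from B and C, and the score would probably have to be combined with something else (or with the manual matching step) to get the behaviour described above.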

