Hi. I have two files: in one I need to delete the duplicate numbers in a column, and in the other I need to delete the words that contain an apostrophe (').

I am building a thesaurus and I need exactly one synonym per word. The problem is that the word list contains many odd words with an apostrophe, which of course cannot have synonyms, and the synonym list has more than one synonym per word (sometimes up to 5), while I need only one.

In the "words" file, I have two columns:

"word id" and "word", where "word id" is the number of the word and "word" is the word itself.

In the "synonyms" file, I have three columns:

"synonym id", "word id" and "synonym", where "synonym id" is the number of the synonym (1, 2, 3, 4, 5, etc.), "word id" is the word number from the other file (the "words" file) and "synonym" is the synonym of that word.

Now, my goal is to combine these words and synonyms, removing the words with an apostrophe (') from the first file and the superfluous synonyms from the second (synonyms) file.

The thesaurus I want to build has around 85000 words (or at least 60000-70000).

I need this arrangement because the thesaurus will be used by a PHP script, and if there are any discrepancies between the words and the synonyms, it won't work correctly.
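To make "discrepancies" concrete, here is a rough Python sketch of the check I would like the cleaned files to pass. It assumes the rows have already been read into tuples with integer ids; the function name `find_discrepancies` and the sample rows are made up for illustration and are not taken from my real files.

```python
from collections import Counter


def find_discrepancies(words, synonyms):
    """Compare the two tables by word id.

    words:    iterable of (word_id, word) pairs
    synonyms: iterable of (synonym_id, word_id, synonym) triples
    Returns three sorted lists of word ids:
    words with no synonym, words with several synonyms,
    and ids used in the synonyms table that match no word.
    """
    word_ids = {word_id for word_id, _word in words}
    counts = Counter(word_id for _syn_id, word_id, _syn in synonyms)

    no_synonym = sorted(word_ids - set(counts))
    several = sorted(w for w, n in counts.items() if n > 1 and w in word_ids)
    orphaned = sorted(set(counts) - word_ids)
    return no_synonym, several, orphaned


# Made-up sample rows, only to show the shape of the data.
sample_words = [(19, "find"), (20, "example"), (21, "sample")]
sample_synonyms = [(1, 19, "locate"), (2, 19, "discover"), (3, 22, "stray")]

no_synonym, several, orphaned = find_discrepancies(sample_words, sample_synonyms)
print(no_synonym, several, orphaned)
```

If all three lists come back empty, every word has exactly one synonym and every synonym points at an existing word, which is the state I am trying to reach.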

That's why I have to clean the superfluous words with apostrophes and the superfluous synonyms out of those files. (As I said, I need only one synonym per word, while right now the first file contains both the word and the word with an apostrophe, and the second file contains up to five synonyms per word.)

How can this be done automatically?

Can the repeated numbers in the "word id" column of the synonyms file be deleted, along with their corresponding synonyms in the "synonym" column of the same file?

Let's say:

We have the word "find", which has id "19", and it has 5 synonyms in the synonyms file.

There the word id is again "19", but it is repeated five times, once for each synonym. Here is what we get:

Column 1 (word id)    Column 2 (synonym)
19                    locate
19                    come across
19                    discover
19                    uncover
19                    reveal

I want to automatically eliminate the 4 superfluous synonyms, so that only one remains.

That is:

Column 1    Column 2
19          locate
20          -------
21          -------
etc...
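In code terms, what I am after would look roughly like the Python sketch below: walk through the rows in order and keep only the first synonym seen for each word id. It assumes the rows are already split into (word id, synonym) pairs; the function name `keep_first_synonym` and the row for id 20 are made up for illustration.

```python
def keep_first_synonym(rows):
    """Keep only the first (word_id, synonym) row for each word id."""
    seen = set()
    kept = []
    for word_id, synonym in rows:
        if word_id in seen:
            continue  # a synonym for this word id was already kept
        seen.add(word_id)
        kept.append((word_id, synonym))
    return kept


rows = [
    (19, "locate"),
    (19, "come across"),
    (19, "discover"),
    (19, "uncover"),
    (19, "reveal"),
    (20, "placeholder"),  # made-up row standing in for the next word
]

result = keep_first_synonym(rows)
print(result)  # [(19, 'locate'), (20, 'placeholder')]
```

Which of the five synonyms survives depends only on the order of the rows in the file, which is fine for my purposes.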

Here are the actual files:

http://www.mypicx.com/06122009/Pictures/


Keep in mind that I also have to remove the words with apostrophes from the "words" file, in order to make the thesaurus work in the script.
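For the apostrophe part, I imagine something like this Python sketch. It assumes the words are already loaded as (word id, word) pairs and that both the plain apostrophe (') and the curly one (’) should count; the function name `drop_apostrophe_words` and the sample words are made up for illustration.

```python
# Characters treated as an apostrophe (assumption: plain and curly forms).
APOSTROPHES = ("'", "\u2019")


def drop_apostrophe_words(words):
    """Return only the (word_id, word) rows whose word has no apostrophe."""
    return [
        (word_id, word)
        for word_id, word in words
        if not any(mark in word for mark in APOSTROPHES)
    ]


# Made-up sample rows, only to show the shape of the data.
sample = [(19, "find"), (20, "find's"), (21, "locate"), (22, "finder\u2019s")]

cleaned = drop_apostrophe_words(sample)
print(cleaned)  # [(19, 'find'), (21, 'locate')]
```

Any word id dropped here would presumably also have to be dropped from the synonyms file, otherwise the two files would no longer match.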


What would be the best way to do this?