Concordance of text

An answer to this question on Stack Overflow.

Question

I have been reading the Linux cookbook to get the hang of it. I am fairly new to Linux.

I came across a topic called concordance of text. I understand what it is, but I am not able to work out a sequence of commands using tr, sort and uniq (that's what the cookbook says to use) that would generate the concordance.

Can someone tell me how to create a basic concordance, i.e. just sort and display the word frequency for each unique word?

The idea presented in the cookbook is to use tr to translate all spaces into newline characters, so that each word goes onto its own line; the output is then sorted and passed to uniq with the -c flag to count the unique terms.

I am not able to figure out the correct parameters, though. Can someone walk through the commands and explain what each parameter does?

I have googled for this, but I am not able to find a clearly defined answer to my problem.

Any help is much appreciated!

Answer

There are many ways to do this, but here is my solution. It uses different commands than the ones you mention, but, through the use of sed and a final sort, it may produce more desirable output.

find . -type f -print0 | xargs -0 cat | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g' | sort | uniq -c | sort -n

find . -type f -print0 will recursively search all the folders and files from your current directory downwards. -type f will return only files. -print0 will terminate each file name with the special \0 character, so that file names containing spaces don't confuse the next command in the pipe.

xargs takes input and turns it into arguments for a command, in this case cat. cat will print the contents of all files given to it as arguments. The -0 tells xargs that its input is delimited by the special \0 character, not by spaces.
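As a small, self-contained demonstration (the directory name demo_dir and the file are made up for this example), -print0 together with -0 keeps a file name containing a space intact:

```shell
# Create a file whose name contains a space.
mkdir -p demo_dir
printf 'hello world\n' > 'demo_dir/my file.txt'

# Without -print0/-0, "my file.txt" would be split into two arguments;
# with them, cat receives the full name and prints the file's contents.
find demo_dir -type f -print0 | xargs -0 cat
# prints: hello world
```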

sed is a pattern-matching stream editor. The first sed command uses the substitute command (s) to replace every character matching the [[:punct:]] pattern with nothing, i.e. it deletes all punctuation. The g flag applies this to every match on each line, not just the first.
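For example, stripping punctuation from a single line (the sample sentence is made up for illustration):

```shell
# [[:punct:]] matches characters such as , ! ' and . — all are deleted.
printf "Hello, world! It's a test.\n" | sed 's/[[:punct:]]//g'
# prints: Hello world Its a test
```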

The second sed command turns each run of one or more whitespace characters (\s+) into a newline (\n), again for every match on the line (g). Note that \s, \n in the replacement, and the -r (extended regular expressions) flag are GNU sed extensions.
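A quick illustration (this assumes GNU sed, since \s, \n in the replacement, and -r are GNU extensions):

```shell
# Runs of spaces and tabs each become a single newline,
# so every word ends up on its own line.
printf 'one  two\tthree\n' | sed -r 's/\s+/\n/g'
# prints:
# one
# two
# three
```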

sort organizes the words alphabetically.

uniq -c collapses adjacent duplicate lines into one, prefixing each line with a count of how many times it appeared.

sort -n sorts this output numerically, yielding a list of words ordered by frequency.
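Putting the last three stages together on a made-up word list shows the effect:

```shell
# sort groups identical words, uniq -c counts each group,
# and sort -n orders the counts from least to most frequent.
printf 'dog\ncat\ncat\nthe\nthe\nthe\n' | sort | uniq -c | sort -n
# prints:
#       1 dog
#       2 cat
#       3 the
```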

sed and xargs are very powerful commands, especially when used in conjunction. But, as another poster has noted, find also has almost unbridled power. tr is useful, but more specialized than sed.
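For completeness, here is a rough sketch of the tr-based approach the cookbook hints at (the sample text is made up; tr -s squeezes runs of spaces so empty lines don't end up in the counts, and the final sort -rn puts the most frequent words first):

```shell
# Translate spaces to newlines so each word is on its own line,
# then sort, count adjacent duplicates, and order by frequency.
printf 'the cat and the dog\nthe cat\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# prints:
#       3 the
#       2 cat
#       1 dog
#       1 and
```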