Lab - File Content, Regular Expressions

Mundofik · December 4, 2024, 6:51pm

Hello,

Some of the solutions provided in the exercises are not correct:

For this exercise (n. 10)

The solution provided is egrep -o '\b[A-Z][a-z]{2, }\b' /etc/nsswitch.conf > /home/bob/filtered1

because it says "Filter out the lines that contain any word that starts with a capital letter and are then followed by exactly two lowercase letters".

So why are there:
\b - what does that mean?
{2, } is not correct, because that means minimum 2 and max unlimited (or at least 2). But the exercise says EXACTLY 2 lowercase. Shouldn’t that be {2} ?

Thanks

john_doe · December 5, 2024, 8:51am

-o does not “filter out the lines”; it prints only the matching substrings, not the lines that contain them.
\b is a “word boundary” (see e.g. Regular expression - Wikipedia)
Regarding {2} - yes, you’re probably right (almost, I think)
I believe \b[A-Z][a-z]{2}\b would be even more appropriate (at least for Latin / ASCII alphabet) if we are to extract only complete, 3-character words.
(but \b[[:upper:]][[:lower:]]{2}\b if you asked me about any other alphabets with diacritics).
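
To make the difference between {2,} and {2} concrete, here is a minimal sketch. It assumes GNU grep (where \b is supported as an extension) and uses grep -E in place of egrep; the sample text is made up for illustration and is not the real /etc/nsswitch.conf.

```shell
# Hypothetical sample input; the real exercise reads /etc/nsswitch.conf.
sample=$(printf '%s\n' '# Name Service Switch configuration file' 'The Foo entry')

# {2,} means "two or more" lowercase letters, so longer words match as well.
at_least_two=$(printf '%s\n' "$sample" | LC_ALL=C grep -Eo '\b[A-Z][a-z]{2,}\b')

# {2} means "exactly two", so only three-letter capitalized words match.
exactly_two=$(printf '%s\n' "$sample" | LC_ALL=C grep -Eo '\b[A-Z][a-z]{2}\b')

printf 'at least two:\n%s\n' "$at_least_two"
printf 'exactly two:\n%s\n' "$exactly_two"
```

With this sample, the first pattern should pick up Name, Service, Switch, The and Foo, while the second should pick up only The and Foo. LC_ALL=C is set only so that [A-Z] and [a-z] behave as plain ASCII ranges regardless of locale.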

Mundofik · December 5, 2024, 9:10am

So it seems that using -o + \b is like using grep -w?
It matches just a word, not a line.

Anyway, the solution does not seem to be correct, as the exercise asks to filter out the lines, not just the words (which is what the provided solution does instead).

john_doe · December 5, 2024, 11:09am

Well, not exactly. -w matches only whole words, but it still displays the entire line containing the matching word, whereas -o displays only the matching substring (the “word” itself), not the entire line containing it.
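
A small sketch of that difference, again assuming GNU grep and a made-up two-line sample rather than the real /etc/nsswitch.conf:

```shell
# Hypothetical sample input, invented for illustration only.
sample=$(printf '%s\n' 'hosts: Dns files' 'passwd: files systemd')

# \b without -o: the whole line containing a match is printed.
whole_line=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '\b[A-Z][a-z]{2}\b')

# -w: the match must be a whole word, but the whole line is still printed.
with_w=$(printf '%s\n' "$sample" | LC_ALL=C grep -Ew '[A-Z][a-z]{2}')

# -o: only the matched substring is printed, not the line around it.
only_match=$(printf '%s\n' "$sample" | LC_ALL=C grep -Eo '\b[A-Z][a-z]{2}\b')

printf 'without -o: %s\n' "$whole_line"
printf 'with -w:    %s\n' "$with_w"
printf 'with -o:    %s\n' "$only_match"
```

On this sample the first two commands should both print the line "hosts: Dns files", and the -o variant should print just "Dns". So if the exercise really wants the matching lines, dropping -o (or using -w without -o) would seem closer to the wording.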