
Using Grep and Regular Expressions to Find Text Patterns in Linux. Regular Expressions: Validating Email Addresses

The Bash shell is one of the primary tools for interaction between the user and the operating system. Through the shell, the user can manipulate files and directories in the machine's file system, process their contents, and execute other programs, using the terminal's keyboard as the input device and the terminal's alphanumeric screen as the output device.

Bash was developed by Brian Fox for the GNU Project as a free software replacement for the Bourne shell. It was first released in 1989 and became widespread as the default login shell for most Linux distributions and for Apple's macOS (formerly OS X). A version is also available for Windows 10, and Bash is the default user shell in Solaris 11.

Bash is a command processor that traditionally runs in a text terminal, where the user types commands that cause actions. Bash can also read and execute commands from a file, called a shell script. Like other Unix shells, it supports filename matching with wildcards, pipes, here documents, command substitution, and control structures for testing conditions. The keywords, syntax and other key features of the language are largely borrowed from sh, csh and ksh. Bash is a POSIX-compliant shell, but with some extensions. The name of the shell is an abbreviation for Bourne-Again SHell.

Brian Fox began coding Bash on January 10, 1988, after Richard Stallman became dissatisfied with the lack of progress in developing a free shell that could run existing scripts. Fox released Bash as a beta version on June 8, 1989, and remained the main developer of the project until some time between mid-1992 and mid-1994, when he was laid off from the FSF and replaced by Chet Ramey.

Since then, Bash has become the most popular shell among Linux users, serving as the default interactive shell in various distributions of the operating system, as well as in Apple's macOS. Bash has also been ported to Microsoft Windows through Cygwin, to DOS through the DJGPP project, and to Android through various terminal emulation applications.

In September 2014, a significant security flaw called Shellshock was disclosed. It had been present in Bash since version 1.03, released in August 1989, and it led to a number of attacks over the Internet. The bug was considered serious because it could be exploited to execute arbitrary code. Patches became available soon after the flaws were identified, but not all computers were updated.

Shell syntax features

Bash's command syntax is a superset of the Bourne shell's and includes brace expansion, command-line completion, basic debugging, and signal handling with trap, among other features. Bash executes the vast majority of Bourne shell scripts unchanged, except for scripts that rely on behavior Bash interprets differently or that attempt to run a system command whose name clashes with a Bash builtin. Like other GNU tools, Bash reports errors concisely and sets an exit status, so that output and error streams can go to their traditional destinations.

If a user presses the Tab key at the command line, Bash automatically completes partially typed program names, file names, and variable names. The completion system is very flexible and customizable, and is often packaged with functions that complete arguments and file names for specific programs and tasks. The Bash syntax has a number of extensions that are missing from the Bourne shell.

Bash can perform integer calculations ("arithmetic evaluation") using the ((...)) command and the $((...)) expansion syntax. Its syntax also simplifies I/O redirection. For example, it can redirect standard output (stdout) and standard error (stderr) at the same time with the &> operator. This is easier to type than the Bourne shell equivalent "command > file 2>&1".
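The two features just described can be sketched as follows. This is a minimal illustration, not taken from the original article: the variable names and the temporary file are assumptions chosen for the example.

```bash
#!/usr/bin/env bash
# Arithmetic evaluation: $((...)) expands to the result of an integer expression.
width=6
height=7
area=$((width * height))
echo "area: $area"

# ((...)) is a command: it succeeds (exit status 0) when the result is non-zero.
if ((area > 40)); then
    size="large"
else
    size="small"
fi
echo "size: $size"

# &> sends both stdout and stderr to the same file,
# equivalent to the longer form: command > file 2>&1
log=$(mktemp)
{ echo "to stdout"; echo "to stderr" >&2; } &> "$log"
captured=$(cat "$log")
rm -f "$log"
echo "$captured"
```

Both lines end up in the same file because &> duplicates the stdout redirection onto stderr.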

Bash supports process substitution with the <(command) and >(command) syntax, which substitutes the output of (or input to) a command wherever a file name is traditionally used. When the "function" keyword is used, Bash function declarations are not compatible with Bourne and POSIX scripts (the Korn shell has the same problem when using "function"). Bash also accepts the same function declaration syntax as those shells, however, and is POSIX-compliant in that respect.
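A short sketch of process substitution and of the two function declaration styles may help here. The function names are hypothetical, and the example assumes a system where Bash can provide /dev/fd or named pipes for process substitution.

```bash
#!/usr/bin/env bash
# <(command) expands to a file name whose contents are the command's output,
# so two command outputs can be compared without temporary files.
differences=$(diff <(printf 'a\nb\nc\n') <(printf 'a\nB\nc\n'))
echo "$differences"

# Both declaration styles define a function in Bash;
# only the name() form is portable to POSIX shells.
function greet_keyword { echo "hello from the function keyword"; }
greet_posix() { echo "hello from the POSIX form"; }
greet_keyword
greet_posix
```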

Because of these and other differences, Bash scripts rarely run under the Bourne or Korn interpreters unless they were deliberately written with that compatibility in mind, which is something to consider when planning to work with Bash. Bash supports indexed arrays and, since version 4.0, associative arrays, which allow multidimensional arrays to be emulated in a way similar to AWK. Bash 4.x was not integrated into new versions of macOS because of license restrictions.
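As an illustration of associative arrays, here is a minimal sketch. It assumes Bash 4.0 or later; the array name and its contents are made up for the example.

```bash
#!/usr/bin/env bash
# Associative arrays require Bash 4.0 or later and must be declared with -A.
declare -A port
port[ssh]=22
port[http]=80
port[https]=443

echo "ssh uses port ${port[ssh]}"
echo "number of entries: ${#port[@]}"

# Keys are listed with ${!array[@]}; their order is not guaranteed, so sort them.
sorted_keys=$(printf '%s\n' "${!port[@]}" | sort | tr '\n' ' ')
echo "keys: $sorted_keys"
```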

The shell has two command execution modes: batch (sequential) and concurrent. Commands in batch mode are separated by ";". For example:

  • command1; command2

In this example, command2 is executed once command1 has completed. You can also run command1 in the background by appending the & symbol to it. The process then executes in the background, immediately returning control to the shell and allowing the user to go on entering commands.

To run commands 1 and 2 at the same time, they must be executed in the shell as follows:

  • command1 & command2.

In this case, command1 is executed in the background because of the & symbol, immediately returning control to the shell, which executes command2 in the foreground. A process running in the foreground can be stopped, and control returned to the shell, by typing Ctrl+Z. A list of all processes, both background and stopped, can be obtained by running "jobs".

The state of a process can be changed using various commands. The "fg" command brings a process to the foreground, and the "bg" command resumes a stopped process in the background. "bg" and "fg" can take a job ID as their first argument to indicate which process to act on. Without it, they use the default process, indicated by the plus sign in the "jobs" output. The "kill" command can be used to terminate a process prematurely by sending it a signal. The job ID must be specified after a percent sign:

  • kill -s SIGKILL %1 or kill -9 %1
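The background and kill steps above can be sketched in a script. This is an illustrative example only: it addresses the process by the process ID stored in $! rather than by job ID, because job IDs such as %1 are mainly a convenience at the interactive prompt.

```bash
#!/usr/bin/env bash
# Start a long-running command in the background; $! holds its process ID.
sleep 30 &
pid=$!

# "jobs" lists the background jobs started by this shell.
jobs

# Send SIGKILL to the background process. At an interactive prompt the same
# job could be addressed by its job ID instead: kill -s SIGKILL %1
kill -s SIGKILL "$pid"

# For a process killed by a signal, wait returns 128 + the signal number
# (SIGKILL is signal 9).
wait "$pid" 2>/dev/null
status=$?
echo "exit status: $status"
```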

Bash supplies "conditional execution" command separators, which make the execution of a command contingent on the exit code set by the preceding command. An external command called "bashbug" reports shell bugs. When invoked, it opens the user's default editor with a pre-filled form. The form is then mailed to the Bash maintainers, or optionally to other email addresses.
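The conditional separators in question are && and ||. A minimal sketch follows; the temporary directory and the variable names are assumptions made for the example.

```bash
#!/usr/bin/env bash
# && runs the next command only if the previous one succeeded (exit status 0).
workdir=$(mktemp -d) && touch "$workdir/created"

# || runs the next command only if the previous one failed (non-zero status).
grep -q "needle" "$workdir/created" || result="not found"
echo "$result"

# The two can be chained to report the outcome of a single test.
[ -e "$workdir/created" ] && state="file exists" || state="file missing"
echo "$state"

rm -r "$workdir"
```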

When Bash starts, it executes the commands in various dot files. These files resemble shell scripts, but shell scripts need execute permission and an interpreter directive, for example:

  • #!/bin/bash

Initialization files used by Bash do not require either. File execution order:

  1. When started as an interactive login shell, Bash reads and executes /etc/profile, if present.
  2. This file often calls /etc/bash.bashrc.
  3. After reading that file, it looks for ~/.bash_profile, ~/.bash_login and ~/.profile, in that order, reading and executing the first one that exists and is readable.
  4. When a login shell exits, it reads and executes ~/.bash_logout, if present.
  5. When started as an interactive shell that is not a login shell, it reads and executes /etc/bash.bashrc, and then ~/.bashrc.
  6. This can be disabled with the "--norc" option.
  7. The "--rcfile file" option forces Bash to read and execute commands from the given file instead.
  8. Parts of this sequence derive from the Bourne shell and csh startup sequences. They allow limited sharing of startup files with the Bourne shell and provide certain startup features familiar to csh users.

Calling Bash with the --posix option or specifying set -o posix in a script causes Bash to conform very closely to the POSIX 1003.2 standard. Shell scripts intended for portability should at least take into account the Bourne shell that Bash is intended to replace. Bash has certain features that the traditional Bourne shell lacks. These include:

  1. Some extended invocation options.
  2. Command substitution using $() notation. This feature is part of the POSIX 1003.2 standard.
  3. Brace expansion.
  4. Some operations with arrays, and associative arrays.
  5. The extended test construct with double brackets.
  6. The double-parentheses arithmetic-evaluation construct, usable in "if".
  7. Some string manipulation operations.
  8. Process substitution.
  9. A regular expression matching operator.
  10. Bash-specific builtins and coprocesses.
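The double-bracket test and the regular expression matching operator (=~) from the list above can be combined to validate email addresses. The sketch below is only an illustration: the pattern is a deliberately simple assumption that does not cover everything the mail standards allow, and is_valid_email is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# A deliberately simple pattern (an assumption for illustration): it accepts
# common address shapes and does not try to cover every valid address.
email_pattern='^([[:alnum:]._%+-]+)@([[:alnum:].-]+\.[[:alpha:]]{2,})$'

# is_valid_email is a hypothetical helper name.
is_valid_email() {
    # The pattern variable is left unquoted on the right of =~ so that it is
    # treated as an extended regular expression rather than a literal string.
    [[ $1 =~ $email_pattern ]]
}

for address in "user@example.com" "not-an-email" "a.b+c@mail.example.org"; do
    if is_valid_email "$address"; then
        # BASH_REMATCH[1] and [2] hold the parenthesized groups of the last match.
        echo "valid:   $address (local part: ${BASH_REMATCH[1]}, domain: ${BASH_REMATCH[2]})"
    else
        echo "invalid: $address"
    fi
done
```

Keeping the pattern in a variable avoids quoting problems, since a quoted right-hand side of =~ is matched literally.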

Bash uses the "readline" library to provide keyboard shortcuts and command-line editing, with Emacs key bindings by default. Vi bindings can be enabled by running "set -o vi".

Brace expansion, also called alternation, is a feature copied from the C shell. It generates a set of alternative combinations. The generated results do not need to exist as files. The results of each expanded string are not sorted; left-to-right order is preserved. Users should not use brace expansion in portable shell scripts, because the Bourne shell does not produce the same output.

When brace expansion is combined with wildcards, the braces are expanded first and then the resulting wildcards are substituted as usual. In addition to alternation, brace expansion can be used for sequential ranges between two integers or characters separated by double dots. Newer versions of Bash allow a third integer to specify an increment.
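A few illustrative brace expansions are sketched below. The words and file names are made up for the example, and the increment form assumes Bash 4.0 or later.

```bash
#!/usr/bin/env bash
# Alternation: every comma-separated alternative is combined with the
# surrounding text, in left-to-right order (the results are not sorted).
alternatives=$(echo a{p,c,d,b}e)
echo "$alternatives"

# The generated names need not exist as files.
files=$(echo report.{txt,bak})
echo "$files"

# Sequence ranges between two integers or two characters.
digits=$(echo {1..5})
letters=$(echo {a..e})
echo "$digits / $letters"

# A third integer sets the increment (Bash 4.0 and later).
stepped=$(echo {1..10..3})
echo "$stepped"
```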

When brace expansion is combined with variable expansion, the variable expansion is performed after the brace expansion, which in some cases may require the use of the "eval" builtin, thus:

  • $ start=1; end=10
  • $ echo {$start..$end} # does not expand because of the evaluation order; prints {1..10}
  • $ eval echo {$start..$end} # variable expansion occurs, then the resulting string is evaluated; prints 1 2 3 4 5 6 7 8 9 10

Syntactic aspects of the Bash language

Shell scripts must be stored in a plain ASCII text file created with a text editor, that is, a program that does not introduce additional characters or sequences to format the text. For example, editors suitable for creating shell scripts are the vi or Emacs programs available on UNIX/Linux, or programs such as Notepad, TextEdit and UltraEdit on Microsoft Windows.

A good practice is to put the sequence "#!/bin/bash" on the first line of every Bash script. It gives the absolute path of the interpreter in the file system of the machine on which you want to run the script. This way, you can run the script directly from the command line without passing its file name as an argument to the "bash" command.

The interpreter that the operating system will use to read and execute the script's instructions is named on the first line of the script itself, immediately after the "#!" character sequence. The interpreter executable is assumed here to be located in the /bin directory, but on different systems it may be installed in other directories, for example:

  • "/usr/bin", "/usr/local/bin".

In general, the "#" character introduces a comment in the script source. Everything on the line after the "#" character is ignored by the command interpreter. It is therefore often used to describe how a script works or to explain the effect of specific commands. As when entering commands interactively, each instruction in a script may be written on a separate line or split over multiple lines, ending each line except the last with a "\" character. Several instructions can be placed on the same line, separated by ";".

Program instructions may be indented to make the source code more readable, but attention should be paid to the use of spaces. The Bash interpreter is pickier than many other interpreters or compilers: in some cases it does not allow arbitrary spaces between the elements that make up a statement, and in other cases a space is required for the statement to be interpreted correctly.

There are no symbols to delimit a block of instructions inside a control structure, for example the block to be repeated in a loop. Instead, there are language keywords that mark the beginning and end of a block. These keywords vary depending on the instruction used to control program flow. In Bash syntax, certain characters take on a special meaning: if they appear in a character string or as a command argument, they perform a very precise function.
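The whitespace rules and block keywords described above can be sketched as follows. The variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Assignment: no spaces are allowed around "=".
count=3          # correct
# count = 3      # wrong: Bash would try to run a command named "count"

# Test brackets: spaces are required after "[" and before "]",
# and around the comparison operator. The block is delimited by the
# keywords then / else / fi rather than by braces.
if [ "$count" -eq 3 ]; then
    verdict="three"
else
    verdict="other"
fi
echo "$verdict"

# A loop body is delimited by do / done. Indentation is free-form:
# this body could be indented by any amount without changing its meaning.
total=0
for n in 1 2 3; do
        total=$((total + n))
done
echo "$total"
```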

Simplifying slightly, we can say that a shell is a program that interactively repeats the same operation over and over. It waits for a command as input, checks that the command is syntactically correct, executes it, and then goes back to waiting for the next command. This process ends when the shell receives a signal indicating that the session is over and that no more commands will be sent to it. At this point, the shell program exits, returning its allocated memory and other machine resources to the operating system.

The shell is launched automatically by the operating system when the user logs in. It can also be started by the user with a command issued in an already open shell, or through graphical utilities on a system with a graphical user interface. For example, on an Apple Macintosh computer running Mac OS X, you can use a command shell by running the Terminal utility, located in the Utilities folder inside the Applications folder.

On a Linux workstation with a graphical desktop environment such as GNOME or KDE, you can open a command shell by selecting the Terminal program from the Applications → Accessories menu. After opening the command shell, we can view the name of the shell we are using by running the following command:

  • $ echo $SHELL
  • /bin/bash

If the default shell is not Bash, you can check if it is present on the system in one of the directories listed in the PATH environment variable using the "which" command, and execute it using the "bash" command:

  • $ echo $SHELL
  • /bin/tcsh
  • $ which bash
  • /bin/bash
  • $ bash
  • bash-2.03$

The shell thus operates interactively, receiving each individual command together with the parameters specified on the command line, and executing it. The output is displayed in the same terminal window. Each command passed to the shell is completed by pressing the Enter key. You can issue multiple commands on one line, separating them from each other with a ";". It is also possible to split a command across two or more lines, ending every line except the last with a "\" character.

Typically, in programming languages, single and double quotes are used to delimit strings, and the use of one or the other depends on the syntax adopted in a particular language. In scripting languages, single quotes, double quotes and backticks each have a different meaning, and Bash is no exception.

Single quotes are used to delimit character strings. The interpreter does not look inside the string and simply uses the sequence of characters between the quotes as it is. Thus, characters that would otherwise take on a special meaning can also be part of the string. The only character that cannot be used in a single-quoted string is the single quote itself. To define a string that contains one, you must delimit it with double quotes.

Double quotes are also used to delimit strings, but in this case the interpreter performs what is called "interpolation" and resolves the value of any variables referenced in the string. In practice, if a string enclosed in double quotes contains a reference to a variable, the variable's name is replaced with its value. To print characters that would otherwise be given a special meaning, such as double quotes or dollar signs, you must prefix each of them with a backslash character "\". To print a backslash character in a double-quoted string, you need to write two backslashes.

Backticks have the most distinctive behavior, one that is typical of scripting languages and absent from mainstream high-level programming languages. Backticks delimit a string that Bash interprets as a command to be executed, replacing the string with whatever the command writes to standard output.
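The three kinds of quoting can be compared side by side. The following is a minimal sketch with made-up variable names.

```bash
#!/usr/bin/env bash
name="world"

# Single quotes: the contents are taken literally, no expansion happens.
single='hello $name'

# Double quotes: variables are interpolated; \$ \" and \\ give literal characters.
double="hello $name"
escaped="price: \$5, quote: \", backslash: \\"

# Backticks: the enclosed command is executed and replaced by its standard output.
# $(...) is the equivalent modern form and is easier to nest.
from_backticks=`echo generated`
from_dollar_paren=$(echo generated)

printf '%s\n' "$single" "$double" "$escaped" "$from_backticks" "$from_dollar_paren"
```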

Several commands can be entered on one line, and a single command can be continued on the next line:

  • $ pwd; echo $SHELL; hostname
  • /home/marco
  • /bin/bash
  • aquilante
  • $ echo \
  • > $SHELL
  • /bin/bash

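You can also have the shell process a sequence of commands stored in an ASCII text file. Suppose you prepare a file called "script.sh" in your home directory. Its contents could be as follows (the Italian string "Oggi e' il" means "Today is the"):

  • echo -n "Oggi e' il "
  • date +%d/%m/%Y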

Run this very simple script by specifying the file name on the command line from which the shell is called:

  • $ bash script.sh
  • Oggi e' il 10/6/2011

The shell can also accept a sequence of commands to be executed through a pipe, which redirects the output of another command to Bash's standard input:

  • $ cat script.sh | bash
  • Oggi e' il 10/6/2011

You can also begin the script with a line marked "#!" that gives the absolute path of the interpreter to be used. Once the file has been made executable, the script can be run directly, without explicitly running Bash and passing the script to it as input:

  • $ cat script.sh
  • #!/bin/bash
  • echo -n "Oggi e' il "
  • date +%d/%m/%Y
  • $ chmod 755 script.sh
  • $ ls -l script.sh
  • -rwxr-xr-x 1 marco users 49 18 Apr 23:58 script.sh
  • $ ./script.sh
  • Oggi e' il 10/6/2011

The last command in the previous example, which directly executes the script stored in the file "script.sh" in the current directory, prefixes the file name with the relative path "./". You must specify the path to the directory containing the executable script because, for security reasons, the current directory is often not in the list of directories that the shell searches for external executable commands. That list of directories is stored in the PATH environment variable.

Benefits of an Operating System with Bash

Bash is an efficient language for shell scripting. It gives users an easy way to automate their work if they are already familiar with using the shell interactively. A developer who programs systems needs to know how the shell works.

Compared with learning a configuration or automation system based on YAML or JSON, shell scripts are much more versatile. Bash scripts are also simpler to start with, because the shell that runs them is there by default.

Bash is a simple language, which lets developers focus on the other complexities of the system. Bash works very well for shell scripting. Almost everything else either uses a command shell or implements its own shell, copying the good parts from it. Additionally, there are good tools that make working with the shell much easier.

With Bash, developers can get an interactive, web-based Linux command-line experience that is not tied to a particular time or place. Using it requires no special setup or effort: users can reach an authenticated workstation for managing Azure resources and environments with one click, whether they are in the Azure mobile app, the Azure Portal, or the Azure documentation.

Unlike a traditional command-line environment, there is no need to choose and install tools before getting started, which saves time and effort. Command-line tools for text editing, builds, containers and source control are available in Bash, and tool authentication is secure and easy using CLI 2.0.

We looked at examples of Bash regular expressions. Good luck learning!

In today's article I want to touch on the huge topic of regular expressions. I think everyone knows that the topic of regexes (as regular expressions are called in slang) is too vast for the scope of one post.

Let me start by saying that there are several types of regular expressions:

1. Traditional regular expressions (also known as basic regular expressions, BRE)

  • The syntax of these expressions is defined as obsolete, but it is nevertheless still widespread and used by many UNIX utilities
  • Basic regular expressions include the following metacharacters (more on their meanings below):
    • . [ ] [^ ] ^ $ *
    • \{ \} - original form of { } (in extended syntax)
    • \( \) - original form of ( ) (in extended syntax)
    • \n, where n is a number from 1 to 9
  • Features of using these metacharacters:
    • An asterisk must follow an expression that matches a single character. Example: [xyz]*.
    • The expression \(block\)* should be considered invalid. In some implementations it matches zero or more repetitions of the string block. In others it matches the string block*.
    • Within a character class, special character meanings are largely ignored. Special cases:
    • To add a ^ character to a set, it must not be placed first.
    • To add a - character to a set, it must be placed first or last. For example:
      • a DNS name pattern, which may include letters, digits, minus and dot: [-0-9a-zA-Z.] ;
      • any character except minus and digits: [^-0-9] .
    • To add a [ or ] character to a set, it must be placed first. For example:
      • [][ab] matches ], [, a or b.

2. Extended regular expressions (ERE)

  • The syntax of these expressions is similar to the syntax of basic expressions, with the following exceptions:
    • Backslashes are no longer used for the { } and ( ) metacharacters.
    • A backslash before a metacharacter cancels its special meaning.
    • The theoretically irregular construct \n was dropped.
    • The metacharacters + , ? , | were added.

3. Perl-compatible regular expressions (PCRE)

  • These have a richer and at the same time more predictable syntax than even POSIX ERE, so they are often used by applications.

A regular expression defines a search pattern. The pattern consists of search rules, which are made up of characters and metacharacters.

Search rules are determined by the following operations:

Alternation |

The pipe (|) separates the allowed alternatives; it acts as a logical OR. For example, "gray|grey" matches gray or grey.

Grouping ( )

Parentheses are used to define the scope and precedence of operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.

Quantification {} ? * +

A quantifier after a character or group determines how many times the preceding expression may occur.

{m,n} general form: from m to n repetitions inclusive.

{m,} general form: m or more repetitions.

{,n} general form: no more than n repetitions.

{n} exactly n repetitions.

The question mark means 0 or 1 times, the same as {0,1}. For example, "colou?r" matches both color and colour.

The star means 0, 1 or any number of times ({0,}). For example, "go*gle" matches ggle, gogle, google, etc.

The plus means at least 1 time ({1,}). For example, "go+gle" matches gogle, google, etc. (but not ggle).

The exact syntax of these regular expressions is implementation dependent (for example, in basic regular expressions the characters { and } are escaped with a backslash).
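These operators can be tried out with grep in extended mode. In the sketch below the word list is made up for the illustration, and -x is used so that a pattern must match a whole line.

```bash
#!/usr/bin/env bash
# Sample words, one per line (made up for this illustration).
words=$(printf '%s\n' gray grey griy color colour ggle gogle google)

# Alternation and grouping describe the same set here.
alt=$(echo "$words" | grep -E -x 'gray|grey' | tr '\n' ' ')
grp=$(echo "$words" | grep -E -x 'gr(a|e)y' | tr '\n' ' ')

# ? : zero or one "u";  * : zero or more "o";  + : one or more "o".
optional=$(echo "$words" | grep -E -x 'colou?r' | tr '\n' ' ')
star=$(echo "$words" | grep -E -x 'go*gle' | tr '\n' ' ')
plus=$(echo "$words" | grep -E -x 'go+gle' | tr '\n' ' ')

# {m,n} : between m and n repetitions of the preceding item.
interval=$(echo "$words" | grep -E -x 'go{2,3}gle' | tr '\n' ' ')

printf '%s\n' "$alt" "$grp" "$optional" "$star" "$plus" "$interval"
```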

Metacharacters, in simple terms, are symbols that do not stand for themselves: the symbol . (dot), for instance, is not a dot but any single character. Please familiarize yourself with the metacharacters and their meanings:

. Matches any single character.
[ ] Matches any single character from those enclosed in the brackets. The "-" character is interpreted literally only if it comes immediately after the opening bracket or before the closing bracket: [abc-] or [-abc]. Otherwise, it denotes a range of characters. For example, [abc] matches "a", "b" or "c", and [a-z] matches the lowercase letters of the Latin alphabet. These notations can be combined: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z. To match the characters "[" or "]", it is enough for the closing bracket to be the first character after the opening one: [][ab] matches "]", "[", "a" or "b".
[^ ] Matches a single character from among those that are not in the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any character except the lowercase letters of the Latin alphabet.
^ Matches the beginning of the text (or the beginning of any line if the mode is line-by-line).
$ Matches the end of the text (or the end of any line if the mode is line-by-line).
\( \) or ( ) Declares a "marked subexpression" (a grouped expression) that can be used later (see the next item, \n). A "marked subexpression" is also a "block". Unlike other operators, this one (in traditional syntax) requires a backslash; in extended and Perl syntax, the \ character is not needed.
\n Where n is a number from 1 to 9; matches the n-th marked subexpression (for example, in \(abcd\)\1 the characters abcd are marked as subexpression one). This construct is theoretically irregular and was not adopted in the extended regular expression syntax.
*
  • A star after an expression matching a single character matches zero or more copies of this (preceding) expression. For example, "[xyz]*" matches the empty string, "x", "y", "zx", "zyx", etc.
  • \n*, where n is a number from 1 to 9, matches zero or more occurrences of the n-th marked subexpression. For example, as a whole-string match, "\(a.\)c\1*" matches "abcab" and "abcabab", but not "abcac".

An expression enclosed in "\(" and "\)" followed by a "*" should be considered illegal. In some cases, it matches zero or more occurrences of the string that was enclosed in parentheses. In others, it matches the expression enclosed in parentheses followed by a literal "*" character.

\{x,y\} Matches the last (preceding) block occurring at least x and no more than y times. For example, "a\{3,5\}" matches "aaa", "aaaa" or "aaaaa". Unlike other operators, this one (in traditional syntax) requires a backslash.
.* Denotes any number of any characters between two parts of a regular expression.

Metacharacters help us express various kinds of matches. But how can we represent a metacharacter as an ordinary character, for example the symbol [ (square bracket) with the meaning of a square bracket? Simply:

  • precede (escape) the metacharacter (. * + \ ? { }) with a backslash. For example \. or \[

To simplify the definition of some character sets, they have been combined into so-called character classes and categories. POSIX has standardized the declaration of certain character classes and categories, as shown in the following table:

POSIX class   Equivalent         Meaning
[:upper:]     [A-Z]              uppercase characters
[:lower:]     [a-z]              lowercase characters
[:alpha:]     [A-Za-z]           upper and lower case characters
[:alnum:]     [A-Za-z0-9]        digits, upper and lower case characters
[:digit:]     [0-9]              digits
[:xdigit:]    [0-9A-Fa-f]        hexadecimal digits
[:punct:]     [.,!?:…]           punctuation marks
[:blank:]     [ \t]              space and TAB
[:space:]     [ \t\n\r\f\v]      whitespace characters
[:cntrl:]                        control characters
[:graph:]     [^ \t\n\r\f\v]     printable characters
[:print:]     [^\t\n\r\f\v]      printable characters and the space character
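The classes in the table can be tried out with grep. The sketch below uses made-up sample lines. Note that a class name is only valid inside a bracket expression, which is why the brackets are doubled.

```bash
#!/usr/bin/env bash
# Sample lines (made up for this illustration).
lines=$(printf '%s\n' 'abc' 'ABC' 'abc123' '42' 'a b' 'end.')

# A class name is only valid inside a bracket expression, hence the double brackets.
only_digits=$(echo "$lines" | grep -x '[[:digit:]]*' | tr '\n' ' ')
has_upper=$(echo "$lines" | grep '[[:upper:]]' | tr '\n' ' ')
has_blank=$(echo "$lines" | grep '[[:blank:]]' | tr '\n' ' ')
has_punct=$(echo "$lines" | grep '[[:punct:]]' | tr '\n' ' ')

# Classes can be negated like any other bracket expression.
no_alpha=$(echo "$lines" | grep -x '[^[:alpha:]]*' | tr '\n' ' ')

printf '%s\n' "$only_digits" "$has_upper" "$has_blank" "$has_punct" "$no_alpha"
```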

Regular expressions also have the following property:

Regex greediness

I will try to describe it as clearly as possible. Let's say we want to find all HTML tags in some text. Narrowing the problem down, we want to find the values contained between < and >, along with the brackets themselves. But we know that tags have different lengths and that there are at least 50 different tags. Listing them all, enclosed in metacharacters, would take too long. But we know that we have the expression .* (dot asterisk), which stands for any number of any characters in the line. Using this expression, as <.*>, we will try to find all values between < and > in a line of HTML that begins with <p>So, How to create RAID level 10/50 on an LSI MegaRAID controller (also relevant for: Intel SRCU42x, Intel SRCS16): and ends with a closing tag. As a result, the ENTIRE line will match this expression. Why? Because the regex is GREEDY and tries to capture as many characters as possible between < and >, so the entire line, starting with <p>So... and ending with ...>, falls under this rule!

I hope this example makes it clear what greediness is. To get rid of it, you can take one of the following approaches:

  • take into account the characters that do not belong to the desired pattern (for example: <[^>]*> for the case above)
  • get rid of greediness by marking the quantifier as non-greedy (available in Perl-compatible syntax):
    • *? - "non-greedy" ("lazy") equivalent of *
    • +? - "non-greedy" ("lazy") equivalent of +
    • {n,}? - "non-greedy" ("lazy") equivalent of {n,}
    • .*? - "non-greedy" ("lazy") equivalent of .*
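The difference between the greedy match and the negated-set workaround can be seen with grep. In this sketch the HTML line is made up, and the -o option (print only the matched text) is assumed to be available, as it is in GNU and BSD grep.

```bash
#!/usr/bin/env bash
# A made-up line of HTML for illustration.
html='<p>So, <b>bold</b> text</p>'

# Greedy: .* grabs as much as possible, so a single match spans the whole line.
greedy=$(echo "$html" | grep -o '<.*>')
echo "$greedy"

# Excluding ">" from the repeated set stops each match at the first closing bracket.
tags=$(echo "$html" | grep -o '<[^>]*>' | tr '\n' ' ')
echo "$tags"

# With a Perl-compatible engine, the lazy form <.*?> finds the same tags,
# for example grep -oP '<.*?>' where the grep build supports -P.
```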

To all of the above I would like to add the extended regular expression syntax:

Extended regular expressions in POSIX are similar to the traditional Unix syntax, but with some additional metacharacters:

The plus indicates that the previous character or group may be repeated one or more times. Unlike the asterisk, at least one repetition is required.

The question mark makes the previous character or group optional. In other words, in the matching line it may be absent or present exactly once.

The vertical bar separates regular expression alternatives. One bar specifies two alternatives, but there can be more of them: just use more vertical bars. It is important to remember that this operator takes in as much of the expression as possible on either side. For this reason, the alternation operator is most often used inside parentheses.

The use of backslashes has also been abolished: \{…\} becomes {…} and \(…\) becomes (…).

To conclude the post, I will give some examples of using regex:

$ cat text1
1 apple
2 pear
3 banana
$ grep p text1
1 apple
2 pear
$ grep pea text1
2 pear
$ grep "p*" text1
1 apple
2 pear
3 banana
$ grep "pp*" text1
1 apple
2 pear
$ grep "x" text1
$ grep "x*" text1
1 apple
2 pear
3 banana
$ cat text1 | grep "l\|n"
1 apple
3 banana
$ echo -e "find an\n* here" | grep "\*"
* here
$ grep "pp\+" text1 # lines containing one p followed by 1 or more p
1 apple
$ grep "pl\?e" text1 # pe with an optional l in between
1 apple
2 pear
$ grep "p.*r" text1 # p followed, somewhere later on the line, by r
2 pear
$ grep "a.." text1 # lines with a followed by at least 2 characters
1 apple
3 banana
$ grep "\(an\)\+" text1 # search for one or more repetitions of an
3 banana
$ grep "an\(an\)\+" text1 # search for two or more repetitions of an
3 banana
$ grep "[3p]" text1 # search for lines containing 3 or p
1 apple
2 pear
3 banana
$ echo -e "find an\n* here\nsomewhere." | grep "[.*]"
* here
somewhere.
$ # search for the characters 3 through 7
$ echo -e "123\n456\n789\n0" | grep "[3-7]"
123
456
789
$ # search for a digit with no letters n or r between it and the end of the line
$ grep "[[:digit:]][^nr]*$" text1
1 apple
$ sed -e "/\(a.*a\)\|\(p.*p\)/s/a/A/g" text1 # replace a with A on all lines where a follows a or p follows p
1 Apple
2 pear
3 bAnAnA
$ sed -e "/^[^lmnXYZ]*$/s/ear/each/g" text1 # replace ear with each on lines containing none of l, m, n, X, Y, Z
1 apple
2 peach
3 banana
$ # replace the last word of each sentence with LAST WORD
$ echo "First. A phrase. This is a sentence." | \
> sed -e "s/ [^ ]*\./ LAST WORD./g"
First. A LAST WORD. This is a LAST WORD.

Original: Linux Fundamentals
Author: Paul Cobbaut
Published date: October 16, 2014
Translation: A. Panin
Translation date: December 17, 2014

Chapter 19. Regular Expressions

Regular expression engines are a very powerful tool on Linux systems. Regular expressions can be used with many programs, such as bash, vi, rename, grep, sed and others.

This chapter provides a basic understanding of regular expressions.

Regular expression syntax versions

There are three different versions of regular expression syntax:

BRE: Basic Regular Expressions
ERE: Extended Regular Expressions
PCRE: Perl Regular Expressions

Depending on the tool used, one or more of the mentioned syntaxes may be used.

For example, the grep tool supports the -E option to force the use of extended regular expression (ERE) syntax when parsing a regular expression, the -G option to force basic regular expression (BRE) syntax, and the -P option to use Perl regular expression (PCRE) syntax.

Note that grep also supports the -F option, which treats the pattern as a fixed string instead of interpreting it as a regular expression.

The sed tool also supports options that allow you to choose the regular expression syntax.

Always read the manual pages of the tools you are using!
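
The effect of these options can be seen by running one pattern under each syntax. The following is a minimal sketch; the sample lines are hypothetical and are not taken from the files used in this chapter.

```shell
# Hypothetical sample data: three names plus one line containing a literal "i|a".
sample=$(printf '%s\n' 'Tania' 'Laura' 'Floor' 'i|a')

# BRE (-G, the default): a bare | is an ordinary character.
bre=$(printf '%s\n' "$sample" | grep -G 'i|a')

# ERE (-E): | means alternation, so every line containing i or a matches.
ere=$(printf '%s\n' "$sample" | grep -E 'i|a')

# Fixed string (-F): nothing in the pattern is treated as a metacharacter.
fixed=$(printf '%s\n' "$sample" | grep -F 'i|a')

printf 'BRE:\n%s\nERE:\n%s\nfixed:\n%s\n' "$bre" "$ere" "$fixed"
```

Only the ERE run treats the pipe as "or"; the other two look for the three literal characters i|a.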

grep utility

Printing strings matching a pattern

The grep utility is a popular tool on Linux systems designed to find strings that match a specific pattern. Below are examples of the simplest regular expressions that can be used when working with it.

This is the contents of the test file used in the examples. This file contains three lines (or three newlines).

paul@rhel65:~$ cat names
Tania
Laura
Valentina

When searching for a single character, only those lines that contain the specified character will be returned.

paul@rhel65:~$ grep u names
Laura
paul@rhel65:~$ grep e names
Valentina
paul@rhel65:~$ grep i names
Tania
Valentina

The comparison to the pattern used in this example is straightforward; if a given character occurs in a string, grep will print that string.

Combining characters

To find combinations of characters in strings, regular expression characters must be combined in a similar way.

This example shows that the regular expression ia matches the string Tania but not the string Valentina, while the regular expression in matches the string Valentina but not the string Tania.

Note that we use the -E option to grep to force our regular expression to be interpreted using extended regular expression (ERE) syntax.

With basic regular expression (BRE) syntax, the pipe character (|) has to be escaped for it to be interpreted as a logical "OR".

paul@debian7:~$ grep -G "i|a" list
paul@debian7:~$ grep -G "i\|a" list
Tania
Laura

One or more matches

The * character matches zero, one or more occurrences of the previous character, and the + character matches one or more occurrences of the previous character.

paul@debian7:~$ cat list2
ll
lol
lool
loool
paul@debian7:~$ grep -E "o*" list2
ll
lol
lool
loool
paul@debian7:~$ grep -E "o+" list2
lol
lool
loool

Match at the end of the line

In the following examples we will use this file:

paul@debian7:~$ cat names
Tania
Laura
Valentina
Fleur
Floor

The two examples below show a technique for using the dollar sign to match the end of a string.

paul@debian7:~$ grep a$ names
Tania
Laura
Valentina
paul@debian7:~$ grep r$ names
Fleur
Floor

Match at the beginning of the line

The caret character (^) allows you to search for a match at the beginning (or the first characters) of a string.

These examples use the file discussed above.

paul@debian7:~$ grep ^Val names
Valentina
paul@debian7:~$ grep ^F names
Fleur
Floor

The dollar and caret characters used in regular expressions are called anchors.

Splitting words

Sometimes it is easier to combine a simple regular expression with grep's options than to create a more complex regular expression. These options were discussed previously:
  • grep -i
  • grep -v
  • grep -w
  • grep -A5
  • grep -B5
  • grep -C5

Preventing the shell from expanding a regular expression

The dollar sign is a special character for both regular expressions and the shell (think of shell variables and embedded shells). It is therefore recommended that you always quote regular expressions, because quoting prevents the shell from expanding the expression.

paul@debian7:~$ grep "r$" names
Fleur
Floor
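
What the shell does to a pattern before grep receives it can be checked directly. This is a small sketch under assumed data: the variable names and the sample names are illustrative only.

```shell
# Hypothetical sample data.
names=$(printf '%s\n' 'Tania' 'Fleur' 'Floor')

# Single quotes: the shell hands r$ to grep untouched.
quoted=$(printf '%s\n' "$names" | grep 'r$')

# printf shows what a command would really receive as its argument.
x=oops
single=$(printf '%s' 'a$x')   # stays a$x
double=$(printf '%s' "a$x")   # the shell substitutes the variable first

printf '%s\n%s\n%s\n' "$quoted" "$single" "$double"
```

Single quotes are the safest default for patterns; double quotes are only needed when a shell variable is meant to be expanded inside the pattern.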

rename utility

Implementations of the rename utility

In the Debian Linux distribution, /usr/bin/rename is a link to the /usr/bin/prename script, which is installed from the perl package.

paul@pi ~ $ dpkg -S $(readlink -f $(which rename))
perl: /usr/bin/prename

Distributions based on the Red Hat distribution do not create a similar symbolic link to point to the described script (unless, of course, a symbolic link is created to a manually installed script), so this section will not describe the implementation of the rename utility from Red Hat distribution.

Discussions about the rename utility on the Internet are often confusing, because solutions that work fine on the Debian distribution (and Ubuntu, xubuntu, Mint, ...) cannot be used on the Red Hat distribution (and CentOS, Fedora, ...).

The perl package

The rename command is actually implemented as a script that uses the regular expressions of the Perl programming language. A complete guide to them is available through perldoc perlrequick once the perl-doc package is installed.

67121 files and directories installed.)
Perl-doc is unpacked (from .../perl-doc_5.14.2-21+rpi2_all.deb) ...
Adding "diversion of /usr/bin/perldoc to /usr/bin/perldoc.stub by perl-doc"
Processing triggers for man-db ...
Configuring perl-doc package (5.14.2-21+rpi2) ...
root@pi:~# perldoc perlrequick

Well known syntax

The most common use of the rename utility is to find files with names that match a specific pattern in the form of a string, and replace that string with another string.

Typically this action is described using the regular expression s/string/another string/, as shown in the example:

paul@pi ~ $ ls
abc       allfiles.TXT  bllfiles.TXT  Scratch   tennis2.TXT
abc.conf  backup        cllfiles.TXT  temp.TXT  tennis.TXT
paul@pi ~ $ rename "s/TXT/text/" *
paul@pi ~ $ ls
abc       allfiles.text  bllfiles.text  Scratch    tennis2.text
abc.conf  backup         cllfiles.text  temp.text  tennis.text

Here's another example that uses the well-known rename utility syntax to change the extensions of the same files again:

paul@pi ~ $ ls
abc       allfiles.text  bllfiles.text  Scratch    tennis2.text
abc.conf  backup         cllfiles.text  temp.text  tennis.text
paul@pi ~ $ rename "s/text/txt/" *.text
paul@pi ~ $ ls
abc       allfiles.txt  bllfiles.txt  Scratch   tennis2.txt
abc.conf  backup        cllfiles.txt  temp.txt  tennis.txt

The reason these two examples work is that the strings we use occur exclusively in file extensions. Remember that file extensions do not matter when using the bash shell.

The following example demonstrates the problem that you may encounter when using this syntax.

paul@pi ~ $ touch atxt.txt
paul@pi ~ $ rename "s/txt/problem/" atxt.txt
paul@pi ~ $ ls
abc       allfiles.txt  backup        cllfiles.txt  temp.txt     tennis.txt
abc.conf  aproblem.txt  bllfiles.txt  Scratch       tennis2.txt

When executing the command in question, only the first occurrence of the searched string is replaced.

Global replacement

The syntax used in the previous example can be described as follows: s/regex/string to replace/ . This description is simple and obvious, since all you have to do is place a regular expression between the first two slashes and a replacement string between the last two slashes.

Now the syntax we use can be described as s/regex/string to replace/g, where the leading s indicates a substitute operation, and the g modifier indicates that the replacement should be performed globally, on every match.

Note that in this example, the -n option was used to display information about the operation being performed (instead of performing the operation itself, which is to directly rename the file).
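
The difference between the two forms can be previewed without renaming anything. The sketch below is an assumption-laden stand-in: it uses sed, which accepts the same s/regex/replacement/ form for this simple case, rather than the rename utility itself, and it reuses the atxt.txt name from the earlier example.

```shell
# Without g: only the first match in the name is replaced.
first_only=$(printf '%s\n' 'atxt.txt' | sed 's/txt/problem/')

# With g: every match in the name is replaced.
global=$(printf '%s\n' 'atxt.txt' | sed 's/txt/problem/g')

printf '%s\n%s\n' "$first_only" "$global"
```

The first result keeps the .txt extension, while the global form also rewrites the extension, so g should be added only when every occurrence is really meant to change.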

Case insensitive replacement

Another modifier that may be useful is the i modifier. The example below shows a case-insensitive technique for replacing a string with another string.

paul@debian7:~/files$ ls
file1.text  file2.TEXT  file3.txt
paul@debian7:~/files$ rename "s/.text/.txt/i" *
paul@debian7:~/files$ ls
file1.txt  file2.txt  file3.txt

Changing extensions

The Linux command line interface has no understanding of file extensions similar to those used in the MS-DOS operating system, but many users and GUI applications use them.

This section provides an example of using the rename utility to change file extensions only. The example uses the dollar sign to indicate that the starting point for replacement is the end of the file name.

paul@pi ~ $ ls *.txt
allfiles.txt  bllfiles.txt  cllfiles.txt  really.txt.txt  temp.txt  tennis.txt
paul@pi ~ $ rename "s/.txt$/.TXT/" *.txt
paul@pi ~ $ ls *.TXT
allfiles.TXT  bllfiles.TXT  cllfiles.TXT  really.txt.TXT  temp.TXT  tennis.TXT

Note that the dollar sign within the regular expression denotes the end of the file name. Without the dollar sign, the command would replace the first .txt in really.txt.txt and produce really.TXT.txt, leaving the actual extension unchanged.

sed utility

Stream editor

The stream editor, or sed for short, uses regular expressions to modify a data stream.

Although sed is designed for stream processing, it can also edit a file in place when given the -i option. In this example, the sed utility is used to replace a string in a file.

paul@debian7:~/files$ echo Monday > today
paul@debian7:~/files$ cat today
Monday
paul@debian7:~/files$ sed -i "s/Mon/Tues/" today
paul@debian7:~/files$ cat today
Tuesday

The ampersand character can be used to refer to the searched (and found) string.

In this example, the ampersand is used to double the matched string. When nothing matches, the input is left unchanged.

echo Monday | sed "s/Monday/&&/"
MondayMonday
echo Monday | sed "s/nickname/&&/"
Monday

Parentheses are used to group parts of a regular expression that can later be referenced.

Consider the following example:

paul@debian7:~$ echo Sunday | sed "s_\(Sun\)_\1ny_"
Sunnyday
paul@debian7:~$ echo Sunday | sed "s_\(Sun\)_\1ny \1_"
Sunny Sunday

Dot to indicate any character

In a regular expression, a simple dot character can represent any character.

paul@debian7:~$ echo 2014-04-01 | sed "s/....-..-../YYYY-MM-DD/"
YYYY-MM-DD
paul@debian7:~$ echo abcd-ef-gh | sed "s/....-..-../YYYY-MM-DD/"
YYYY-MM-DD

If more than one pair of parentheses is used, each of them can be referenced by using consecutive numeric values.

paul@debian7:~$ echo 2014-04-01 | sed "s/\(....\)-\(..\)-\(..\)/\1+\2+\3/"
2014+04+01
paul@debian7:~$ echo 2014-04-01 | sed "s/\(....\)-\(..\)-\(..\)/\3:\2:\1/"
01:04:2014

This feature is called back referencing.

Space

The character sequence \s matches a whitespace character such as a space or a tab.

This example searches globally for whitespace characters (\s) and replaces each of them with one space character.

paul@debian7:~$ echo -e "today\twarm\tday"
today   warm    day
paul@debian7:~$ echo -e "today\twarm\tday" | sed "s_\s_ _g"
today warm day
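
When the input contains runs of several whitespace characters, replacing each character separately keeps the runs as long as they were. A minimal sketch of collapsing each run instead follows; the input line is hypothetical, and the POSIX class [[:space:]] is used here in place of \s on the assumption that the script may run on a sed without the \s extension.

```shell
# Hypothetical input: two tabs between the first words, two spaces later.
line=$(printf 'today\t\tis  warm')

# One substitution per whitespace character: each run keeps its length.
each=$(printf '%s\n' "$line" | sed 's/[[:space:]]/ /g')

# One substitution per run of whitespace: each run collapses to one space.
squeezed=$(printf '%s\n' "$line" | sed 's/[[:space:]][[:space:]]*/ /g')

printf '[%s]\n[%s]\n' "$each" "$squeezed"
```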

This example searches for strings with exactly three o characters. The number of repetitions is given in curly braces, which have to be escaped in basic regular expression syntax.

paul@debian7:~$ cat list2
ll
lol
lool
loool
paul@debian7:~$ grep -E "o{3}" list2
loool
paul@debian7:~$ cat list2 | sed "s/o\{3\}/A/"
ll
lol
lool
lAl

From n to m repetitions

In this example, we explicitly state that the character must be repeated at least 2 and at most 3 times.

paul@debian7:~$ cat list2
ll
lol
lool
loool
paul@debian7:~$ grep -E "o{2,3}" list2
lool
loool
paul@debian7:~$ grep "o\{2,3\}" list2
lool
loool
paul@debian7:~$ cat list2 | sed "s/o\{2,3\}/A/"
ll
lol
lAl
lAl

History of the bash shell

The bash shell can also interpret some regular expressions.

This example shows how the exclamation mark can be combined with a substitution when recalling a command from the history of the bash shell.

paul@debian7:~$ mkdir hist
paul@debian7:~$ cd hist/
paul@debian7:~/hist$ touch file1 file2 file3
paul@debian7:~/hist$ ls -l file1
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file1
paul@debian7:~/hist$ !l
ls -l file1
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file1
paul@debian7:~/hist$ !l:s/1/3
ls -l file3
-rw-r--r-- 1 paul paul 0 Apr 15 22:07 file3

This technique also works if you use numbers when reading the command history of the bash shell.

Many people, when they first see regular expressions, immediately think that they are looking at a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex, for example:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
In our opinion, even an absolute beginner will immediately understand how it works and why it is needed :) If you don’t quite understand it, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Templates use regular ASCII characters that represent themselves, and so-called metacharacters that play a special role, for example, allowing reference to certain groups of characters.

Types of Regular Expressions

Implementations of regular expressions in different environments, for example, in programming languages ​​like Java, Perl and Python, and in Linux tools like sed, awk and grep, have certain features. These features depend on so-called regular expression engines, which interpret patterns.
Linux has two regular expression engines:
  • An engine that supports the POSIX Basic Regular Expression (BRE) standard.
  • An engine that supports the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform to at least the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible in text processing.

The POSIX ERE standard is often implemented in programming languages. It allows you to use a large number of tools when developing regular expressions. For example, these could be special sequences of characters for frequently used patterns, such as searching for individual words or sets of numbers in text. Awk supports the ERE standard.

There are many ways to develop regular expressions, depending both on the opinion of the programmer and on the features of the engine for which they are created. It's not easy to write universal regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at the features of their implementation for sed and awk.

POSIX BRE regular expressions

Perhaps the simplest BRE pattern is a regular expression for searching for the exact occurrence of a sequence of characters in text. This is what searching for a string looks like in sed and awk:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Finding text by pattern in sed


Finding text by pattern in awk

You may notice that the search for a given pattern is performed without taking into account the exact location of the text in the line. In addition, the number of occurrences does not matter. After the regular expression finds the specified text anywhere in the string, the string is considered suitable and is passed on for further processing.

When working with regular expressions, you need to take into account that they are case sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case sensitive

The first regular expression found no matches because the word "Test", starting with a capital letter, does not appear in the text. The second, which searches for the word written in lowercase letters, found a matching line in the stream.

In regular expressions, you can use not only letters, but also spaces and numbers:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Finding a piece of text containing spaces and numbers

Spaces are treated as regular characters by the regular expression engine.

Special symbols

When using various characters in regular expressions, there are some things to consider. Thus, there are some special characters, or metacharacters, the use of which in a template requires a special approach. Here they are:

.*[]^${}\+?|()
If one of them is needed in the pattern as a literal character, it must be escaped with a backslash (\).

For example, if you need to find a dollar sign in the text, you need to include it in the template, preceded by an escape character. Let's say there is a file myfile with the following text:

There is 10$ on my pocket
The dollar sign can be detected using this pattern:

$ awk '/\$/{print $0}' myfile

Using a special character in a pattern

In addition, the backslash is itself a special character, so if you need to use it in a pattern, it must also be escaped. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Escaping a backslash

Although the forward slash is not included in the list of special characters above, attempting to use it in a regular expression written for sed or awk will result in an error:

$ echo "3 / 2" | awk '///{print $0}'

Incorrect use of forward slash in a pattern

If it is needed, it must also be escaped:

$ echo "3 / 2" | awk '/\//{print $0}'

Escaping a forward slash

Anchor symbols

There are two special characters for anchoring a pattern to the beginning or end of a text line. The caret (^) lets you describe sequences of characters that occur at the beginning of a line. If the pattern you are looking for is somewhere else in the line, the regular expression will not match it. The use of this character looks like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'
$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

Finding a pattern at the beginning of a string

The ^ character is designed to search for a pattern at the beginning of a line, while the case of characters is also taken into account. Let's see how this affects the processing of a text file:

$ awk '/^this/{print $0}' myfile


Finding a pattern at the beginning of a line in text from a file

When using sed, if you place a caret somewhere inside the pattern rather than at its start, it will be treated like any other regular character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

Caret not at the beginning of the pattern in sed

In awk, when using the same pattern, this character must be escaped:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

Caret not at the beginning of the pattern in awk

We have figured out the search for text fragments located at the beginning of a line. What if you need to find something located at the end of a line?

The dollar sign - $, which is the anchor character for the end of the line, will help us with this:

$ echo "This is a test" | awk '/test$/{print $0}'

Finding text at the end of a line

You can use both anchor symbols in the same template. Let's process the file myfile, the contents of which are shown in the figure below, using the following regular expression:

$ awk '/^this is a test$/{print $0}' myfile


A pattern that uses special characters to start and end a line

As you can see, the template responded only to a line that fully corresponded to the given sequence of characters and their location.

Here's how to filter out empty lines using anchor characters:

$ awk '!/^$/{print $0}' myfile
This pattern uses the negation symbol, an exclamation point (!). The regular expression itself matches lines that contain nothing between the beginning and the end of the line, and thanks to the exclamation point only the lines that do not match it are printed.
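
The same filter can be tried on inline text instead of a file. This is a minimal sketch with hypothetical sample text; the second pattern, which also drops lines holding only spaces or tabs, is an addition for illustration and is not part of the original example.

```shell
# Hypothetical sample text with one empty line and one line of spaces.
text=$(printf 'first line\n\nsecond line\n   \nthird line')

# Drop lines that are completely empty.
no_empty=$(printf '%s\n' "$text" | awk '!/^$/{print $0}')

# Also drop lines that contain nothing but spaces or tabs.
no_blank=$(printf '%s\n' "$text" | awk '!/^[ \t]*$/{print $0}')

printf '%s\n---\n%s\n' "$no_empty" "$no_blank"
```

The first filter keeps the line of spaces because it is not strictly empty; the second one removes it as well.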

Dot symbol

The period matches any single character except the newline character. Let's apply the following regular expression to the file myfile, the contents of which are given below:

$ awk '/.st/{print $0}' myfile


Using a dot in regular expressions

As can be seen from the output, only the first two lines of the file match the pattern, since they contain the sequence of characters "st" preceded by another character. The third line does not contain a suitable sequence, and the fourth does contain it, but at the very beginning of the line, where no character precedes it.

Character classes

A dot matches any single character, but what if you want to be more flexible in limiting the set of characters you're looking for? In this situation, you can use character classes.

Thanks to this approach, you can organize a search for any character from a given set. To describe a character class, square brackets are used:

$ awk '/[oi]th/{print $0}' myfile


Description of a character class in a regular expression

Here we are looking for a sequence of "th" characters preceded by an "o" character or an "i" character.

Classes come in handy when searching for words that can begin with either an uppercase or lowercase letter:

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Search for words that may begin with a lowercase or uppercase letter

Character classes are not limited to letters. Other symbols can be used here. It is impossible to say in advance in what situation classes will be needed - it all depends on the problem being solved.

Negation of character classes

Character classes can also be used to solve the inverse problem described above. Namely, instead of searching for symbols included in a class, you can organize a search for everything that is not included in the class. In order to achieve this regular expression behavior, you need to place a ^ sign in front of the list of class characters. It looks like this:

$ awk '/[^oi]th/{print $0}' myfile


Finding characters not in a class

In this case, sequences of “th” characters will be found that are preceded by neither “o” nor “i”.

Character ranges

In character classes, you can describe ranges of characters using dashes:

$ awk '/[e-p]st/{print $0}' myfile


Description of a range of characters in a character class

In this example, the regular expression responds to the sequence of characters "st" preceded by any character located, in alphabetical order, between the characters "e" and "p".

Ranges can also be created from numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'
$ echo "12a" | awk '/[0-9][0-9][0-9]/'

Regular expression to find any three digits

A character class can include several ranges:

$ awk '/[a-fm-z]st/{print $0}' myfile


A character class consisting of several ranges

This regular expression will find all sequences of “st” preceded by characters from the ranges a-f and m-z .

Special character classes

BRE has special character classes that you can use when writing regular expressions:
  • [[:alpha:]] - matches any alphabetic character, written in upper or lower case.
  • [[:alnum:]] - matches any alphanumeric character, namely characters in the ranges 0-9 , A-Z , a-z .
  • [[:blank:]] - matches a space and a tab character.
  • [[:digit:]] - any digit character from 0 to 9.
  • [[:upper:]] - uppercase alphabetic characters - A-Z .
  • [[:lower:]] - lowercase alphabetic characters - a-z .
  • [[:print:]] - matches any printable character.
  • [[:punct:]] - matches punctuation marks.
  • [[:space:]] - whitespace characters, in particular - space, tab, characters NL, FF, VT, CR.
You can use special classes in templates like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
$ echo "abc" | awk '/[[:digit:]]/{print $0}'
$ echo "abc123" | awk '/[[:digit:]]/{print $0}'


Special character classes in regular expressions

Star symbol

If you place an asterisk after a character in a pattern, this will mean that the regular expression will work if the character appears in the string any number of times - including the situation when the character is absent in the string.

$ echo "test" | awk '/tes*t/{print $0}'
$ echo "tessst" | awk '/tes*t/{print $0}'


Using the * character in regular expressions

This metacharacter is usually used for words that are often misspelled, or for words that have several accepted spellings:

$ echo "I like green color" | awk '/colou*r/{print $0}'
$ echo "I like green colour" | awk '/colou*r/{print $0}'

Finding a word with different spellings

In this example, the same regular expression responds to both the word "color" and the word "colour". This is so due to the fact that the character “u”, followed by an asterisk, can either be absent or appear several times in a row.

Another useful feature that comes from the asterisk symbol is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

$ awk '/this.*test/{print $0}' myfile


A template that responds to any number of any characters

In this case, it doesn’t matter how many and what characters are between the words “this” and “test”.

The asterisk can also be used with character classes:

$ echo "st" | awk '/s[ae]*t/{print $0}'
$ echo "sat" | awk '/s[ae]*t/{print $0}'
$ echo "set" | awk '/s[ae]*t/{print $0}'


Using an asterisk with character classes

In all three examples, the regular expression works because the asterisk after the character class means that if any number of "a" or "e" characters are found, or if none are found, the string will match the given pattern.

POSIX ERE regular expressions

The POSIX ERE patterns that some Linux utilities support may contain additional characters. As already mentioned, awk supports this standard, while sed uses BRE by default (GNU sed switches to ERE when given the -E option).

Here we will look at the most commonly used symbols in ERE patterns, which will be useful to you when creating your own regular expressions.

▍Question mark

A question mark indicates that the preceding character may appear once or not at all in the text. This character is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk '/tes?t/{print $0}'
$ echo "test" | awk '/tes?t/{print $0}'
$ echo "tesst" | awk '/tes?t/{print $0}'


Question mark in regular expressions

As you can see, in the third case the letter "s" appears twice, so the regular expression does not match the word "tesst".

The question mark can also be used with character classes:

$ echo "tst" | awk '/t[ae]?st/{print $0}'
$ echo "test" | awk '/t[ae]?st/{print $0}'
$ echo "tast" | awk '/t[ae]?st/{print $0}'
$ echo "taest" | awk '/t[ae]?st/{print $0}'
$ echo "teest" | awk '/t[ae]?st/{print $0}'


Question mark and character classes

If there are no characters from the class in the line, or one of them occurs once, the regular expression works, but as soon as two characters appear in the word, the system no longer finds a match for the pattern in the text.

▍Plus symbol

The plus character in the pattern indicates that the regular expression will match what it is looking for if the preceding character occurs one or more times in the text. However, this construction will not react to the absence of a symbol:

$ echo "test" | awk '/te+st/{print $0}'
$ echo "teest" | awk '/te+st/{print $0}'
$ echo "tst" | awk '/te+st/{print $0}'


The plus symbol in regular expressions

In this example, if there is no “e” character in the word, the regular expression engine will not find matches to the pattern in the text. The plus symbol also works with character classes - in this way it is similar to the asterisk and question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'
$ echo "test" | awk '/t[ae]+st/{print $0}'
$ echo "teast" | awk '/t[ae]+st/{print $0}'
$ echo "teeast" | awk '/t[ae]+st/{print $0}'


Plus sign and character classes

In this case, if the line contains any character from the class, the text will be considered to match the pattern.

▍Curly braces

Curly braces, which can be used in ERE patterns, are similar to the symbols discussed above, but they allow you to more precisely specify the required number of occurrences of the symbol preceding them. You can specify a restriction in two formats:
  • n - a number specifying the exact number of searched occurrences
  • n, m are two numbers that are interpreted as follows: “at least n times, but no more than m.”
Here are examples of the first option:

$ echo "tst" | awk '/te{1}st/{print $0}'
$ echo "test" | awk '/te{1}st/{print $0}'

Curly braces in patterns, searching for the exact number of occurrences

In older versions of awk you had to use the --re-interval command line option to make the program recognize intervals in regular expressions, but in newer versions this is not necessary.

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'


Interval specified in curly braces

In this example, the character “e” must appear 1 or 2 times in the line, then the regular expression will respond to the text.

Curly braces can also be used with character classes. The principles you already know apply here:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'


Curly braces and character classes

The template will react to the text if it contains the character “a” or the character “e” once or twice.

▍Logical “or” symbol

Symbol | - a vertical bar means a logical “or” in regular expressions. When processing a regular expression containing several fragments separated by such a sign, the engine will consider the analyzed text suitable if it matches any of the fragments. Here's an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'
$ echo "This is an exam" | awk '/test|exam/{print $0}'
$ echo "This is something else" | awk '/test|exam/{print $0}'


Logical "or" in regular expressions

In this example, the regular expression is configured to search the text for the words “test” or “exam”. Please note that between the template fragments and the symbol separating them | there should be no spaces.

Regular expression fragments can be grouped using parentheses. If you group a certain sequence of characters, it will be perceived by the system as an ordinary character. That is, for example, repetition metacharacters can be applied to it. This is what it looks like:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'


Grouping regular expression fragments

In these examples, the word “Geeks” is enclosed in parentheses, followed by a question mark. Recall that a question mark means “0 or 1 repetition,” so the regular expression will respond to both the string “Like” and the string “LikeGeeks.”

Practical examples

Now that we've covered the basics of regular expressions, it's time to do something useful with them.

▍Counting the number of files

Let's write a bash script that counts files located in directories that are written to the PATH environment variable. In order to do this, you will first need to generate a list of directory paths. Let's do this using sed, replacing the colons with spaces:

$ echo $PATH | sed "s/:/ /g"
The substitute command accepts regular expressions as search patterns. In this case everything is extremely simple, since we are only looking for the colon character, but nothing prevents us from using a more complex pattern here. It all depends on the specific task.
Now you need to go through the resulting list in a loop and perform the actions necessary to count the number of files. The general outline of the script will be like this:

mypath=$(echo $PATH | sed "s/:/ /g")
for directory in $mypath
do
    :  # the counting logic goes here
done
Now let’s write the full text of the script, using the ls command to obtain information about the number of files in each directory:

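#!/bin/bash
# Turn the colon-separated PATH into a space-separated list of directories.
mypath=$(echo $PATH | sed "s/:/ /g")
count=0
for directory in $mypath
do
    # List the directory and count one item per entry.
    check=$(ls $directory)
    for item in $check
    do
        count=$((count + 1))
    done
    echo "$directory - $count"
    # Reset the counter before moving on to the next directory.
    count=0
done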
When running the script, it may turn out that some directories from PATH do not exist, however, this will not prevent it from counting files in existing directories.
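
The script above splits file names on whitespace and prints an error for every missing directory. A hedged variant that avoids both is sketched below. It is not the script from this article: it swaps the sed substitution for the shell's own IFS splitting, and the count_files function name is made up for illustration.

```shell
#!/bin/sh
# count_files DIR: print the number of non-hidden entries directly inside DIR.
count_files() {
    dir=$1
    count=0
    for item in "$dir"/*; do
        # An unmatched glob stays literal, so make sure the entry exists.
        [ -e "$item" ] || [ -L "$item" ] || continue
        count=$((count + 1))
    done
    printf '%s\n' "$count"
}

# Split PATH on colons by temporarily changing IFS instead of calling sed.
old_ifs=$IFS
IFS=:
for directory in $PATH; do
    IFS=$old_ifs
    [ -d "$directory" ] || continue   # skip PATH entries that do not exist
    printf '%s - %s\n' "$directory" "$(count_files "$directory")"
done
IFS=$old_ifs
```

Globbing instead of parsing the output of ls keeps names with spaces intact, at the cost of no longer demonstrating the sed substitution.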


File counting

The main value of this example is that using the same approach, you can solve much more complex problems. Which ones exactly depends on your needs.

▍Verifying email addresses

There are websites with huge collections of regular expressions that let you check email addresses, phone numbers, and so on. However, it’s one thing to take something ready-made, and quite another to create something yourself. So let's write a regular expression to check email addresses. Let's start with analyzing the source data. Here, for example, is a certain address:

[email protected]
The first part of the address, the username, can consist of alphanumeric characters and a few others: a dot, a dash, an underscore, and a plus sign. The username is followed by an @ sign.

Armed with this knowledge, let's start assembling the regular expression from its left side, which is used to check the username. Here's what we got:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read as follows: “At the beginning of the line there must be at least one character from those that are in the group specified in square brackets, and after that there should be an @ sign.”

Next comes the hostname. The same rules apply here as for the username, except that the plus sign is left out, so the template for it looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name is subject to special rules. It can contain only alphabetic characters, at least two of them (country-code domains, for example, have exactly two) and no more than five. This means that the template for checking the last part of the address looks like this:

\.([a-zA-Z]{2,5})$
You can read it like this: “First there must be a period, then 2 to 5 alphabetic characters, and after that the line ends.”

Having prepared templates for individual parts of the regular expression, let's put them together:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
Now all that remains is to test what happened:

$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'


Validating an email address using regular expressions

The fact that the text passed to awk is displayed on the screen means that the system recognized it as an email address.
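
For use inside a script, the same check can be wrapped in a function. This is a sketch rather than a complete validator: it assumes the rules described above (letters, digits, dot, dash, underscore and plus in the username; a top-level domain of 2 to 5 letters), the name is_valid_email is hypothetical, and the addresses are made up for illustration.

```shell
#!/bin/bash
# Pattern assembled from the three parts: username, hostname, top-level domain.
email_re='^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}$'

# Hypothetical helper: succeed if the argument matches the pattern.
is_valid_email() {
    printf '%s\n' "$1" | grep -Eq "$email_re"
}

# Illustrative addresses only.
for address in "jane.doe+news@example.org" "jane.doe@example" "@example.org"; do
    if is_valid_email "$address"; then
        echo "$address: looks valid"
    else
        echo "$address: rejected"
    fi
done
```

Only the first address passes: the second has no top-level domain and the third has no username.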

Results

If the regular expression for checking email addresses that you came across at the very beginning of the article seemed completely incomprehensible then, we hope that it no longer looks like a meaningless set of characters. If so, this material has served its purpose. Regular expressions are a topic you can study for a lifetime, but even the little we have covered here is enough to help you write fairly advanced text-processing scripts.

In this series of materials, we usually showed very simple examples of bash scripts that consisted of literally a few lines. Next time we'll look at something bigger.

Dear readers! Do you use regular expressions when processing text in command line scripts?

The grep utility is a very powerful tool for searching and filtering text information. This article shows several examples of its use that will allow you to appreciate its capabilities.
The main use of grep is to search for words or phrases in files and output streams. To run a search, give the query and the search area (a file) on the command line.
For example, to find the string “needle” in the haystack.txt file, use the following command:

$ grep needle haystack.txt

As a result, grep will display all occurrences of needle that it encounters in the contents of the haystack.txt file. It's important to note that in this case, grep is looking for a set of characters, not a word. For example, strings that include the word “needless” and other words that contain the sequence “needle” will be displayed.


To tell grep that you are looking for a specific word, use the -w switch. It limits the search to whole-word matches: the match must be delimited on both sides by whitespace, punctuation, or the start or end of a line.

$ grep -w needle haystack.txt
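
The difference between the two searches is easy to see on a small sample. The sketch below builds its own temporary file, so the file contents are made up and haystack.txt itself is not needed.

```shell
#!/bin/bash
# Build a small sample file; its contents are made up for illustration.
sample=$(mktemp)
printf '%s\n' "a needle in the hay" "a needless worry" "needle, thread" > "$sample"

# Substring search: all three lines contain the characters "needle".
substring_hits=$(grep -c needle "$sample")
# Whole-word search: "needless" no longer counts.
word_hits=$(grep -cw needle "$sample")

echo "substring: $substring_hits, whole word: $word_hits"
rm -f "$sample"
```

This prints "substring: 3, whole word: 2".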

It is not necessary to limit the search to just one file; grep can search across a group of files, and the search results will indicate the file in which each match was found. The -n switch adds the number of the line in which the match was found, and the -r switch performs a recursive search. This is very convenient when searching through program source code.

$ grep -rnw function_name /home/www/dev/myprogram/

The file name will be listed before each match. If you need to hide file names, use the -h switch; if, on the contrary, you need only the file names, use the -l switch.
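
The following sketch shows the three output styles side by side. It assumes a grep that supports -r (GNU and BSD grep do); the directory and the two source files are created on the fly and are purely illustrative.

```shell
#!/bin/bash
# Made-up source tree for the demonstration.
dir=$(mktemp -d)
printf '%s\n' 'int main() {' '  function_name();' '}' > "$dir/a.c"
printf '%s\n' 'void function_name_old() {}' > "$dir/b.c"

# File name, line number and matching line (whole-word matches only).
matches=$(grep -rnw function_name "$dir")
# Matching lines without the file name.
lines_only=$(grep -rhw function_name "$dir")
# File names only.
files_only=$(grep -rlw function_name "$dir")

echo "$matches"
echo "$lines_only"
echo "$files_only"
rm -rf "$dir"
```

Because of -w, function_name_old in b.c is not reported: the underscore counts as a word character.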
In the following example, we will search for URLs in the IRC log file and show the last 10 matches.

$ grep -wo "http://.*" channel.log | tail

The -o option tells grep to print only the part of the line that matches the pattern rather than the entire line. Using a pipe, we send the output of grep to the tail command, which by default prints the last 10 lines.
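
Note that ".*" is greedy, so the pattern "http://.*" captures everything up to the end of the line. A tighter variant stops at the first whitespace character. This is a sketch under the assumption that URLs in the log contain no spaces; the helper name extract_urls and the chat lines are made up.

```shell
#!/bin/bash
# Hypothetical helper: print every URL found on standard input, one per line.
extract_urls() {
    grep -oE 'https?://[^[:space:]]+'
}

# Made-up chat lines standing in for channel.log.
printf '%s\n' \
    "<alice> see http://example.com/a and https://example.org/b" \
    "<bob> no links here" | extract_urls
```

This prints the two URLs from the first line, each on its own line.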
Now we will count the number of messages sent to the irc channel by certain users. For example, all the messages I sent from home and work. They differ in nickname, at home I use the nickname user_at_home, and at work user_at_work.

$ grep -c "^user_at_\(home\|work\)" channel.log

With the -c option, grep prints only the number of matching lines, not the matches themselves. The search pattern is enclosed in quotes because it contains special characters that the shell would otherwise interpret itself. Note that the quotation marks are not part of the search pattern. The backslash "\" is used to escape special characters.
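
The backslashes are needed because grep uses basic regular expressions by default. With the -E option, grouping and alternation are written without them. The sketch below compares the two forms on made-up log lines; it assumes GNU grep, where "\|" is available in basic syntax.

```shell
#!/bin/bash
# Made-up log lines standing in for channel.log.
log=$(mktemp)
printf '%s\n' \
    "user_at_home: hi" \
    "someone_else: hello" \
    "user_at_work: back at the office" > "$log"

# Basic syntax: grouping and alternation need backslashes.
basic_count=$(grep -c "^user_at_\(home\|work\)" "$log")
# Extended syntax (-E): the same pattern without backslashes.
extended_count=$(grep -Ec "^user_at_(home|work)" "$log")

echo "basic: $basic_count, extended: $extended_count"
rm -f "$log"
```

Both commands report 2.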
Let's search for messages from people who like to “scream” in the channel. By “scream” we mean messages written entirely in CAPITAL letters. To keep abbreviations from showing up as accidental hits, we will search for words of five or more characters:

$ grep -w "[A-Z]\{5,\}" channel.log
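
The same search can be tried on a few made-up lines. The sketch uses the extended syntax (-E), so the braces need no backslashes, and sets LC_ALL=C so that the range A-Z covers uppercase ASCII letters only; the helper name find_shouting is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print input lines that contain an all-caps word
# of five or more letters.
find_shouting() {
    LC_ALL=C grep -wE '[A-Z]{5,}'
}

# Made-up chat lines standing in for channel.log.
printf '%s\n' "hello there" "STOP SHOUTING please" "see the FAQ" | find_shouting
```

Only "STOP SHOUTING please" is printed; "FAQ" is too short to count.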

For a more detailed description, you can refer to the grep man page.
A few more examples:

# grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin

Displays lines from the /etc/passwd file that contain the string root.

# grep -n root /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
12:operator:x:11:0:operator:/root:/sbin/nologin

Here, in addition, the number of each matching line is displayed.

# grep -v bash /etc/passwd | grep -v nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
news:x:9:13:news:/var/spool/news:
mailnull:x:47:47::/var/spool/mqueue:/dev/null
xfs:x:43:43:X Font Server:/etc/X11/fs:/bin/false
rpc:x:32:32:Portmapper RPC user:/:/bin/false
nscd:x:28:28:NSCD Daemon:/:/bin/false
named:x:25:25:Named:/var/named:/bin/false
squid:x:23:23::/var/spool/squid:/dev/null
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
apache:x:48:48:Apache:/var/www:/bin/false

Checks which users do not use bash, excluding those user accounts that have nologin specified as their shell.

# grep -c false /etc/passwd
7

Counts the number of accounts that have /bin/false as their shell.

# grep -i games ~/.bash* | grep -v history

This command displays lines containing the string “games”, in upper or lower case, from all files in the current user's home directory whose names begin with .bash. The second grep drops every output line that contains the string “history”, which excludes matches found in ~/.bash_history, where the same string may well appear. The word “games” is only an example; you can substitute any other word.

grep command and regular expressions

Unlike the previous example, we will now display only those lines that begin with the string “root”:

# grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash

If we want to see which accounts have no shell assigned at all, we look for lines ending with a ":" character:

# grep :$ /etc/passwd
news:x:9:13:news:/var/spool/news:

To check whether the PATH variable in your ~/.bashrc file is exported, first select the lines with "export" and then look for words that begin with the string "PATH"; this way, MANPATH and other similar variables will not be displayed:

# grep export ~/.bashrc | grep '\<PATH'
export PATH="/bin:/usr/lib/mh:/lib:/usr/bin:/usr/local/bin:/usr/ucb:/usr/dbin:$PATH"
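
The effect of anchoring PATH to the start of a word can be checked on two made-up lines. This sketch assumes GNU grep, where "\<" matches the empty string at the beginning of a word; the sample lines stand in for a real ~/.bashrc.

```shell
#!/bin/bash
# Two made-up lines standing in for ~/.bashrc.
bashrc_lines='export PATH="/bin:/usr/bin:$PATH"
export MANPATH="/usr/man:/usr/share/man"'

# \< anchors PATH to the start of a word, so MANPATH is skipped.
exported=$(printf '%s\n' "$bashrc_lines" | grep export | grep '\<PATH')
echo "$exported"
```

Only the first line is printed. With a plain "PATH" pattern, both lines would match.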

Character classes

A bracket expression is a list of characters enclosed between "[" and "]". It matches any single character in that list; if the first character of the list is "^", it matches any character that is NOT in the list. For example, the regular expression "[0123456789]" matches any single digit.

Within a bracket expression, you can specify a range consisting of two characters separated by a hyphen. The expression then matches any single character that sorts between these two characters, inclusive; this takes into account the collating sequence and character set of the locale. For example, in the default C locale, the expression "[a-d]" is equivalent to "[abcd]". Many locales sort characters in dictionary order, and in these locales "[a-d]" is generally not equivalent to "[abcd]"; it may be equivalent to "[aBbCcDd]", for example. To get the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to "C".
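
Setting LC_ALL for a single command is enough to get the traditional behavior. The sketch below counts, under the C locale, how many of five made-up one-letter lines fall into the range a-d; the helper name count_a_to_d is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count input lines containing a character in a-d,
# with the range interpreted according to the C locale.
count_a_to_d() {
    LC_ALL=C grep -c '[a-d]'
}

printf '%s\n' a B c D e | count_a_to_d
```

Under the C locale this prints 2: only the lowercase "a" and "c" fall inside the range.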

Finally, there are specially named character classes, which are specified inside expressions in square brackets. For more information about these predefined expressions, see the man pages or grep command documentation.
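
As a brief illustration, two of the standard POSIX classes, [:digit:] and [:upper:], are used below on made-up input. Note the double brackets: the class name goes inside a bracket expression.

```shell
#!/bin/bash
# Made-up input lines for the demonstration.
sample_lines='room 101
no numbers
UPPER'

# Lines that contain at least one digit.
with_digit=$(printf '%s\n' "$sample_lines" | grep '[[:digit:]]')
# Lines made up entirely of uppercase letters.
all_upper=$(printf '%s\n' "$sample_lines" | grep -E '^[[:upper:]]+$')

echo "with a digit: $with_digit"
echo "all uppercase: $all_upper"
```

The first search finds "room 101" and the second finds "UPPER".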

# grep "[yf]" /etc/group
sys:x:3:root,bin,adm
tty:x:5:
mail:x:12:mail,postfix
ftp:x:50:
nobody:x:99:
floppy:x:19:
xfs:x:43:
nfsnobody:x:65534:
postfix:x:89:

The example displays all lines that contain either the character "y" or the character "f".

Universal characters (metacharacters)

Use "." to match any single character. For example, to list all five-character English words in the dictionary that start with "c" and end with "h" (handy for solving crossword puzzles):

# grep '\<c...h\>' /usr/share/dict/words
catch
clash
cloth
coach
couch
cough
crash
crush

If you want to display lines that contain a literal period, use the -F option of grep. The symbols "\<" and "\>" match the empty string at the beginning and at the end of a word, respectively, so the pattern matches only whole words. If you omit "\<" and "\>", the pattern also matches inside longer words; to search for whole words only, you can use the -w switch instead.

To similarly find words that can have any number of characters between the “c” and “h,” use an asterisk (*). The example below selects all words starting with "c" and ending with "h" from the system dictionary:

# grep '\<c.*h\>' /usr/share/dict/words
caliph
cash
catch
cheesecloth
cheetah
--output omitted--
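
Both patterns can be tried without a system dictionary. The sketch below uses a made-up six-word list and assumes GNU grep, where "\<" and "\>" mark the beginning and the end of a word.

```shell
#!/bin/bash
# Made-up word list standing in for /usr/share/dict/words.
words=$(mktemp)
printf '%s\n' catch cash church crunchy march coach > "$words"

# Exactly five letters, starting with "c" and ending with "h".
five_letters=$(grep '\<c...h\>' "$words")
# Any length, starting with "c" and ending with "h".
any_length=$(grep '\<c.*h\>' "$words")

echo "five letters:" $five_letters
echo "any length:" $any_length
rm -f "$words"
```

The first search finds catch and coach; the second also finds cash and church. Neither finds crunchy (it does not end with "h") or march (it does not start with "c").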

If you want to find the literal asterisk character in a file or output stream, enclose it in single quotes. The user in the example below first tries to look for an asterisk in the /etc/profile file without quotes, and nothing is found, because the shell expands the bare * into the file names in the current directory before grep ever sees it. When quotes are used, the result is output:

# grep * /etc/profile
# grep '*' /etc/profile
for i in /etc/profile.d/*.sh ; do
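
A quoted asterisk can also be combined with the -F option, which makes grep treat the whole pattern as plain text. The sketch below runs against a made-up two-line file rather than the real /etc/profile.

```shell
#!/bin/bash
# Made-up file contents standing in for /etc/profile.
profile=$(mktemp)
printf '%s\n' 'for i in /etc/profile.d/*.sh ; do' 'export PATH' > "$profile"

# Quoted, the asterisk reaches grep unchanged; -F treats it as plain text.
found=$(grep -F '*' "$profile")
echo "$found"
rm -f "$profile"
```

Only the line containing the asterisk is printed.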
