bash-programming-from-scratch/file-system.md at master

lfs/bash-programming-from-scratch

Fork 0

mirror of https://github.com/ellysh/bash-programming-from-scratch.git synced 2026-01-26 07:43:42 +00:00

Files

Ilya Shpigor 685883fbe1 Minor fixes

2021-10-02 11:16:48 +02:00

52 KiB

Raw Permalink Blame History

Navigating the File System

Let's start introducing the Unix environment and Bash with a file system. A file system is a software that dictates how to store and read data from disks. It covers the following topics:

API to access data on the disk that programs can use.
Universal way for accessing different storage devices.
Physical operations on the disk storage.

First, we will look at the differences between the directory structure in Unix and Windows. Then we will learn the Bash commands for navigating the file system.

Directory Structure

There is an address bar at the top of the Windows Explorer window. It displays the absolute path to the current directory. An absolute path shows the place of the file system object regardless of the current directory.

Another way to specify the file system object place is using the relative path. It shows you how to reach the object from the current directory.

A directory is a file system cataloging structure. It can contain files and other directories. Windows terminology calls it folder. Both names mean the same kind of file system object.

Figure 2-6 shows an Explorer window. The address bar equals This PC > Local Disk (C:) > msys64 there. It matches the C:\msys64 absolute path. Thus, we see the contents of the msys64 directory on the C drive in the Explorer window.

The letter C in the path denotes the local system disk drive. The local drive means the device that is connected to your computer physically. You can also have a network drive. You access such a device via the network. The system disk means one that has the Windows installation on it.

{caption: "Figure 2-6. Windows Explorer"}

If you run the MSYS2 terminal emulator, it shows you the current absolute path at the end of the first line. This line behaves like the address bar of Windows Explorer. When you change the current directory, the current path changes too. However, you have to consider that the terminal and Explorer show you different paths for the same current directory. It happens because directory structures of the Unix environment and Windows do not match.

Windows marks each disk drive with a Latin letter. You can open the drive using Explorer as a regular folder. Then you access its content.

For example, let's open the C system drive. It has a standard set of directories. Windows has created them during the installation process. If you open the C drive in Explorer, you see the following directories there:

Windows
Program Files
Program Files (x86)
Users
PerfLogs

These directories store OS components, user applications and temporary files.

You can connect extra disk drives to your computer. Another option is to split a single disk into several logical partitions. Windows will assign the Latin letters (D, E, F, etc) to these extra disks and partitions. You are allowed to create any directory structure there. Windows does not restrict you in any way.

The File Allocation Table (FAT) file system dictates how Windows manages disks and provides you access to them. Microsoft developed this file system for the MS-DOS OS. The principles of FAT became the basis of the ECMA-107 standard. The next-generation file system from Microsoft is called NTFS. It replaced the obsolete FAT in modern versions of Windows. However, the basic principles of disks and directory structure are the same in NAT and FAT. The reason for that is the backward compatibility requirement.

The Unix directory structure follows the POSIX standard. This structure gives you less freedom than the Windows one. It has several predefined directories that you cannot move or rename. You are allowed to put your data in the specific paths only.

The POSIX standard says that the file system should have a top-level directory. It is called the root directory. The slash sign / denotes it. All directories and files of all connected disk drives are inside the root directory.

If you want to access the contents of a disk drive, you should mount it. Mounting means embedding the contents of a disk into the root directory. When mounting is done, you can access the disk contents through some path. This path is called a mount point. If you go to the mount point, you enter the file system of the disk.

Let's compare the Windows and Unix directory structures by example. Suppose that your Windows computer has two local disks C and D. Listing 2-1 shows their directory structure.

{caption: "Listing 2-1. The directory structure in Windows", line-numbers: false}

C:\
    PerfLogs\
    Windows\
    Program Files\
    Program Files (x86)\
    Users\

D:\
    Documents\
    Install\

Suppose that you have installed the Unix environment on your Windows. Then you run the terminal emulator and get the directory structure from Listing 2-2.

{caption: "Listing 2-2. The directory structure in Unix", line-numbers: false}

/
    c/
        PerfLogs/
        Windows/
        Program Files/
        Program Files (x86)/
        Users/

    d/
        Documents/
        Install/

Since you launch the MSYS2 terminal, you enter the Unix environment. Windows paths don't work there. You should use Unix paths instead. For example, you can access the C:\Windows directory via the /c/Windows path only.

There is another crucial difference between Unix and Windows file systems besides the directory structure. The character case makes strings differ in the Unix environment. It means that two words with the same letters are not the same if their character case differs. For example, the Documents and documents words are not equal. Windows has no case sensitivity. If you type the c:\windows path in the Explorer address bar, it opens the C:\Windows directory. This approach does not work in the Unix environment. You should type all characters in the proper case.

Here is the last point to mention regarding Unix and Windows file systems. Use the slash sign / to separate directories and files in Unix paths. When you work with Windows paths, you use backslash \ for that.

We are ready to learn our first Bash commands. Here are the steps to execute a shell command:

Open the terminal window.
Type the command there.
Press Enter.

The shell will execute your command.

When the shell is busy, it cannot process your input. You can distinguish the shell's state by the command prompt. It is a sequence of one or more characters. The default prompt is the dollar sign $. You can see it in Figure 2-4. If the shell prints the prompt, it is ready for executing your command.

Windows Explorer allows you the following actions to navigate the file system:

Display the current directory.
Go to a specified disk drive or directory.
Find a directory or file on the disk.

You can do the same actions with the shell. It provides you a corresponding command for each action. Table 2-1 shows these commands.

{caption: "Table 2-1. Commands and utilities for navigating the file system", width: "100%"}

Command	Description	Examples
`ls`	Display the contents of the directory.	`ls`
	If you call the command without parameters, it shows you the contents of the current directory.	`ls /c/Windows`

`pwd`	Display the path to the current directory.	`pwd`
	When you add the `-W` parameter, the command displays the path in the Windows directory structure.

`cd`	Go to the directory at the specified	`cd tmp`
	relative or absolute path.	`cd /c/Windows`
		`cd ..`

`mount`	Mount the disk to the root file system. If you call the command without parameters, it shows a list of all mounted disks.	`mount`

`find`	Find a file or directory. The first parameter	`find . -name vim`
	specifies the directory to start searching.	`find /c/Windows -name vim`

`grep`	Find a file by its contents.	`grep "PATH" *`
		`grep -Rn "PATH" .`
		`grep "PATH" * .*`

Bash can perform pwd and cd commands of Table 2-1 only. They are called built-ins. Special utilities perform all other commands of the table. Bash calls an appropriate utility if it cannot execute your command on its own.

The MSYS2 environment provides a set of GNU utilities. These are auxiliary highly specialized programs. They give you access to the OS features in Linux and macOS. However, their capabilities are limited in Windows. Bash calls GNU utilities to execute the following commands of Table 2-1:

ls
mount
find
grep

When you read an article about Bash on the Internet, its author can confuse the "command" and "utility" terms. He names both things "commands". This is not a big issue. However, I recommend you to distinguish them. Calling a utility takes more time than calling Bash built-in. It causes performance overhead in some cases.

pwd

Let's consider the commands in Table 2-1. You have just started the terminal. The first thing you do is to find out the current directory. You can get it from the command prompt, but it depends on your Bash configuration. You do not have this feature enabled by default in Linux and macOS.

When you start the terminal, it opens the home directory of the current user. Bash abbreviates this path by the tilde symbol ~. You see this symbol before the command prompt. Use tilde instead of the home directory absolute path. It makes your commands shorter.

Call the pwd command to get the current directory. Figure 2-7 shows this call and its output. The command prints the absolute path to the user's home directory. It equals /home/ilya.shpigor in my case.

If you add the -W option to the call, the command prints the path in the Windows directory structure. It is useful when you create a file in the MSYS2 environment and open it in a Windows application afterward. Figure 2-7 shows you the result of applying the -W option.

{caption: "Figure 2-7. The output of the pwd command"}

What is a command option? When the program has a CLI only, you have very limited ways to interact with it. The program needs some data on input to do its job. The shell provides you a simple way to pass these data. Just type them after the command name. These data are called arguments of the program. Bash terminology distinguishes two kinds of arguments: parameter and option. A parameter is the regular word or character you pass to the program. An option or key is an argument that switches the mode of a program. The standard dictates the option format. It is a word or character that starts with a dash - or a double dash --.

You pass data to the CLI programs and Bash built-ins in the same way. Use parameters and options for doing that.

Typing long commands is inconvenient. Bash provides the autocomplete feature to save your time. Here are the steps for using it:

Type the first few letters of the command.
Press the Tab key.
If Bash finds the command you mean, it completes it.
If several commands start with the typed letters, autocomplete does not happen. Press Tab again to see the list of these commands.

Figure 2-8 demonstrates how the autocomplete feature works. Suppose that you type the "pw" word. Then you press the Tab key twice. Bash shows you the commands that start with "pwd" as Figure 2-8 shows.

{caption: "Figure 2-8. Autocomplete for the "pw" command"}

ls

We got the current directory using the pwd command. The next step is checking the directory content. The ls utility does this task.

Suppose that you have just installed the MSYS2 environment. Then you launched the terminal first time. You are in the user's home directory. Call the "ls" command there. Figure 2-9 shows its result. The command output has nothing. It means that the directory is empty or has hidden files and directories only.

{caption: "Figure 2-9. The output of the ls utility"}

Windows has a concept of hidden files and directories. The Unix environment also has it. Applications and OS create hidden files for their own needs. These files store configuration and temporary data.

I> Windows Explorer does not display hidden files and directories by default. Change the Explorer settings to see them.

You can make the file hidden in Windows by changing its attribute. If you want to do the same in Unix, you should add a dot at the beginning of the filename.

When you launch the ls utility without parameters, it does not show you hidden objects. You can add the -a option to see them. Figure 2-9 shows a result of such a call.

The ls utility can show the contents of the specified directory. Pass a directory's absolute or relative path to the utility. For example, the following command shows the contents of the root directory: {line-numbers: false, format: Bash}

ls /

Figure 2-10 shows the output of this command.

{caption: "Figure 2-10. The output of the ls utility"}

There are no directories /c and /d in Figure 2-10. These are the mount points of C and D disk drives according to Listing 2-2. The mounting points are in the root directory. Why does not the ls utility print them? It happens because the Windows file system does not have a concept of mount points. Therefore, it does not have directories /c and /d. They are present in the Unix environment only. These are not real directories but paths where you can access the disk file systems. The ls utility reads the directory contents in the Windows file system. Thus, it does not show the mount points. The ls utility behaves differently in Linux and macOS. It shows mount points properly there.

mount

If your computer has several disk drives, you can read their mount points. Call the mount utility without parameters for doing that. Figure 2-11 shows its output.

{caption: "Figure 2-11. The output of the mount utility"}

Consider this output as a table with four columns. The columns display the following values:

The disk drive, its partition or directory. It is the object that the OS has mounted to the root directory.
Mount point. It is the path where you can access the mounted disk drive.
The file system type of the disk drive.
Mounting parameters. An example is access permissions to the disk contents.

If we split the mount utility output into these columns, we get Table 2-2.

{caption: "Table 2-2. The output of the mount utility", width: "100%"}

Mounted partition	Mount point	FS type	Mounting parameters
`C:/msys64`	`/`	`ntfs`	`binary,noacl,auto`
`C:/msys64/usr/bin`	`/bin`	`ntfs`	`binary,noacl,auto`
`C:`	`/c`	`ntfs`	`binary,noacl,posix=0,user,noumount,auto`
`Z:`	`/z`	`hgfs`	`binary,noacl,posix=0,user,noumount,auto`

Table 2-2 confuses most Windows users. MSYS2 mounts C:/msys64 as the root directory. Then it mounts the C and Z disks into the root. Their mount points are /c and /z. It means that you can access the C drive via the C:/msys64/c Windows path in the Unix environment. However, C:/msys64 is the subdirectory of disk C in the Windows file system. We got a contradiction.

Actually, there is no contradiction. The /c path is the mount point that exists only in the Unix environment. It does not exist in the Windows file system. Therefore, Windows knows nothing about the C:/msys64/c path. It is just invalid if you try to open it via Explorer. You can imagine the mount point /c as the shortcut to drive C that exists in the MSYS2 environment only.

The output of the mount utility took up a lot of screen space. You can clear the terminal window by the Ctrl+L keystroke.

Another useful keystroke is Ctrl+C. It interrupts the currently running command. Use it if the command hangs or you want to stop it.

cd

You have got everything about the current directory. Now you can change it. Suppose that you are looking for the Bash documentation. You can find it in the /usr system directory. Installed applications stores their non-executable files there. Call the cd command to go to the /usr path. Do it this way: {line-numbers: false, format: Bash}

cd /usr

Do not forget about autocompletion. It works for both command and its parameters. Just type "cd /u" and press the Tab key. Bash adds the directory name usr automatically. Figure 2-12 shows the result of the command.

{caption: "Figure 2-12. The result of the cd command"}

The cd command does not output anything if it succeeds. It changes the current directory and that is it. You can read the new path in the line above the command prompt. This line shows the /usr path after our cd call.

The cd command accepts both absolute and relative paths. Relative paths are shorter. Therefore, you type them faster. Prefer them when navigating the file system using a command shell.

There is a simple rule to distinguish the type of path. An absolute path starts with a slash /. An example is /c/Windows/system32. A relative path starts with a directory name. An example is Windows/system32.

Now you are in the /usr directory. You can get a list of its subdirectories and go to one of them. Suppose that you want to go one level higher and reach the root directory. There are two ways for doing that: go to the absolute path / or the special relative path ... The .. path always points to the parent directory of the current one. Use it in the cd call this way: {line-numbers: false, format: Bash}

cd ..

I> In addition to .., there is another special path . (dot). It points to the current directory. If you execute the command "cd .", nothing happens. You stay in the same place. You need the . path to run programs from the current directory.

Come back to the /usr directory. Then run the ls utility there. It will show you the share subdirectory. Come to this directory and call ls again. You will find the doc directory there. It contains Bash documentation. Call the cd command this way to reach the documentation: {line-numbers: false, format: Bash}

cd doc/bash

You are in the /usr/share/doc/bash directory now. Call the ls utility there. It will show you several files. One of them is README. It contains a brief description of the Bash interpreter.

You found the documentation file. The next step is to print its contents. The cat utility does that. Here is an example of how to run it: {line-numbers: false, format: Bash}

cat README

Figure 2-13 shows the terminal window after the cat call.

{caption: "Figure 2-13. The result of the cat utility", height: "30%"}

I> Some Bash manuals do not recommend printing file contents using the cat utility. They said that the utility's purpose is to merge several files and print the result. The manuals suggest combining the echo command with a stream redirection. Here is an example of this approach: {line-numbers: false, format: Bash}

echo "$(< README.txt)"

The README file contents do not fit in the terminal window. Therefore, you see the tail of the file in Figure 2-13. Use the scroll bar on the window's right side to check the head of the file. Also, use the Shift+PageUp and Shift+PageDown hotkeys to scroll pages up and down. The Shift+↑ and Shift+↓ keystrokes scroll the lines.

Command History

Whenever you call a command, Bash saves it in the command history. You can navigate the history by up and down arrow keys. Bash automatically types the corresponding command. You just need to press Enter for launching it. For example, you have called the "cat README" command. Press the up arrow and Enter to repeat it.

The Ctrl+R shortcut brings up a search over all command history. Press Ctrl+R and start typing. Bash will show you the last called command that begins with these characters. Press Enter to execute it.

The history command shows you the whole history. Run it without parameters this way: {line-numbers: false, format: Bash}

history

The history stores the command that you have executed. It does not keep the command that you typed and then erased.

There is a trick to save the command to the history without executing it. Add the hash symbol # before the command and press Enter. Bash stores the typed line, but it does not execute it. This happens because the hash symbol means comment. When the interpreter meets a comment, it ignores this line. However, Bash adds the commented lines in the history because they are legal constructions of the language.

Here is an example of the trick with comment for our cat utility call: {line-numbers: false, format: Bash}

#cat README

You have saved the commented command in the history. Now you can find it there by pressing the up arrow key. Remove the hash symbol at the beginning of the line and press Enter. Bash will execute your command.

You can do the comment trick by the Alt+Shift+3 shortcut. It works in most modern terminal emulators. Here are the steps for using the shortcut:

Type a command, but do not press Enter.
Press Alt+Shift+3.
Bash saves the command in the history without executing it.

Sometimes you need to copy text from the terminal window. It can be a command or its output. Here is an example. Suppose that some document needs a part of the Bash README file. Use the clipboard to copy it. The clipboard is temporary storage for text data. When you select something in the terminal window with a mouse, the clipboard saves it automatically. Then you can paste this data to any other window.

These are the steps to copy text from the terminal window:

Select the text with the mouse. Hold down the left mouse button and drag the cursor over the required text.
Press the middle mouse button to paste the text from the clipboard into the same or another terminal window. You insert the text at the current cursor position.
Right-click and select the "Paste" item to paste the text to the application other than the terminal.

find

It is inconvenient to search for a file or directory with cd and ls commands. The special find utility does it better.

If you run the find utility without parameters, it traverses the contents of the current directory and prints it. The output includes hidden objects. Figure 2-14 shows the result of running find in the home directory.

{caption: "Figure 2-14. The output of the find utility", height: "30%"}

The first parameter of find is the directory to search in. The utility accepts relative and absolute paths. For example, the following command shows the contents of the root directory: {line-numbers: false, format: Bash}

find /

You can specify search conditions starting from the second parameter. If the found object does not meet these conditions, find does not print it. The conditions form a single expression. The utility has an embedded interpreter that processes this expression.

An example of the find condition is the specific filename. When you call the utility with such a condition, it prints the found files with this name only.

Table 2-3 shows the format of commonly used conditions for the find utility.

{caption: "Table 2-3. Commonly used conditions for the find utility", width: "100%"}

Condition	Meaning	Example
`-type f`	Search files only.	`find -type f`

`-type d`	Search directories only.	`find -type d`

`-name <pattern>`	Search for a file or directory with the	`find -name README`
	name that matches a glob pattern. The	`find -name READ*`
	pattern is case-sensitive.	`find -name READ??`

`-iname <pattern>`	Search for a file or directory with the name that matches a glob pattern. The pattern is case-insensitive.	`find -iname readme`

`-path <pattern>`	Search for a file or directory with the path that matches a glob pattern. The pattern is case-sensitive.	`find -path /doc/bash/`

`-ipath <pattern>`	Search for a file or directory with the path that matches a glob pattern. The pattern is case-insensitive.	`find . -ipath /DOC/BASH/`

`-a` or `-and`	Combine several conditions using the logical AND. If the found object fits all conditions, the utility prints it.	`find -name README -a -path /doc/bash/`

`-o` or `-or`	Combine several conditions using the logical OR. If the found object fits at least one condition, the utility prints it.	`find -name README -o -path /doc/bash/`

`!` or `-not`	The logical negation (NOT) of the	`find -not -name README`
	condition. If the found object does not fit the condition, the utility prints it.	`find ! -name README`

A glob pattern is a search query that contains wildcard characters. Bash allows three wildcard characters: *, ? and [. The asterisk stands for any number of any characters. A question mark means a single character of any kind.

Here is an example of glob patterns. The string README matches all following patterns:

*ME
READM?
*M?
R*M?

Square brackets indicate a set of characters at a specific position. For example, the pattern "[cb]at.txt" matches the cat.txt and bat.txt files. You can apply this pattern to the find call this way: {line-numbers: false, format: Bash}

find . -name "[cb]at.txt"

{caption: "Exercise 2-1. Glob patterns", line-numbers: false}

What of the following lines corresponds to the pattern "*ME.??" ?

* 00_README.txt
* README
* README.md

{caption: "Exercise 2-2. Glob patterns", line-numbers: false}

What of the following lines corresponds to the pattern "*/doc?openssl*" ?

* /usr/share/doc/openssl/IPAddressChoice_new.html
* /usr/share/doc_openssl/IPAddressChoice_new.html
* doc/openssl
* /doc/openssl

Let's apply glob patterns into practice. Suppose that you do not know the Bash README file location and looking for it. Then you should use the find utility.

Start searching with the utility from the root directory. Now you need a search condition. It is a common practice to store documentation in directories called doc in Unix. Therefore, you can search files in these directories only. This way, you get the following find call: {line-numbers: false, format: Bash}

find / -path */doc/*

The command shows you all documentation files on all mounted disks. This is a huge list. You can shorten it with an extra search condition. It should be a separate directory for the Bash documentation. The directory is called bash. Add this path as the second search condition. Then you get the following command: {line-numbers: false, format: Bash}

find / -path */doc/* -path */bash/*

Figure 2-15 shows the result of this command.

The following find call provides the same result: {line-numbers: false, format: Bash}

find / -path */doc/* -a -path */bash/*

Our find calls differ by the -a option between conditions. The option means logical AND. If you do not specify any logical operator between conditions, find inserts AND by default. This is a reason why both calls provide the same result.

{caption: "Figure 2-15. The output of the find utility", height: "30%"}

You can see that the find utility reports an error in Figure 2-15. The mount points of Windows disk drives cause it. The utility cannot access them when you start searching from the root directory. You can avoid the problem if you start searching from the /c mount point. Do it this way: {line-numbers: false, format: Bash}

find /c -path */doc/* -a -path */bash/*

There is an alternative solution. You should exclude mount points from the search. The -mount option does this. Apply the option this way: {line-numbers: false, format: Bash}

find / -mount -path */doc/* -a -path */bash/*

When you add the second search condition, the find utility shows a short list of documents. You can find the right README file easily there.

There are other ways to search for the documentation file. Suppose that you know its name. Then you can specify it together with an assumed path. You will get the find call like this: {line-numbers: false, format: Bash}

find / -path */doc/* -name README

Figure 2-16 shows the result of this command.

{caption: "Figure 2-16. The output of the find utility", height: "30%"}

Again you got a short list of files. It is easy to locate the right file there.

You can group the conditions of the find utility. Do it using the escaped parentheses. Here is an example of using them. Let's write the find call that searches README files with path */doc/* or LICENSE files with an arbitrary path. This call looks like this: {line-numbers: false, format: Bash}

find / \( -path */doc/* -name README \) -o -name LICENSE

Why should you apply backslashes to escape brackets here? The parentheses are part of the Bash syntax. Therefore, Bash treats them like language constructs. When Bash meets parentheses in a command, it performs an expansion. The expansion is the replacement of a part of the command with something else. When you escape parentheses, you force Bash to ignore them. Thus, Bash does not perform the expansion and passes all search conditions to the find utility as it is.

The find utility can process the found objects. You can specify an action to apply as an extra option. The utility will apply this action to each found object.

Table 2-4 shows the find options that specify actions.

{caption: "Table 2-4. Options for specifying actions on found objects", width: "100%"}

Option	Meaning	Example
`-exec command {} \;`	Execute the specified command on each found object.	`find -name README -type f -exec cp {} ~ \;`

`-exec command {} +`	Execute the specified command once over all found objects. The command receives all these objects on the input.	`find -type d -exec cp -t ~ {} +`

`-delete`	Delete each of the found files. The utility deletes empty directories only.	`find -name README -type f -delete`

Table 2-4 shows that there are two variants of the -exec action. They differ by the last symbol. It can be an escaped semicolon \; or a plus sign +. Use the plus sign only if the called command handles several input parameters. You will make a mistake if the command accepts one parameter only. It will process the first found object and skip the rest.

Let's apply the -exec action in practice. Suppose that you want to copy files with the Bash documentation into the home directory. You are interested in the HTML files only.

The first step is preparing the correct find call for searching the files. You should apply two conditions here. The first one checks the directory of the Bash documentation. The second condition checks the file extensions. If you combine these conditions, you get the following find call: {line-numbers: false, format: Bash}

find / -path "*/doc/bash/*" -name "*.html"

When you pass the glob pattern to the find utility, always enclose it in double quotes. The quotes do the same as the backslash before parentheses. They prevent Bash from expanding the patterns. Instead, Bash passes them to the find utility.

Figure 2-17 shows the result of our find call. You can see that it found HTML files correctly.

{caption: "Figure 2-17. The output of the find utility", height: "30%"}

The second step for solving your task is adding the -exec action. The action should call the cp utility. This utility copies files and directories to the specified path. It takes two parameters. The first one is the source object to copy. The second parameter is the target path. When you apply the -exec action, you get the following find call: {line-numbers: false, format: Bash}

find / -path "*/doc/bash/*" -name "*.html" -exec cp {} ~ \;

Run this command. It prints an error about the mount point. Despite the error, the command did its job. It copied the HTML files into the home directory.

How does the command work in detail? It calls the cp utility for each HTML file it found. When calling the utility, find inserts each found object instead of curly braces {}. Therefore, two cp calls happen here. They look like this: {line-numbers: true, format: Bash}

cp ./usr/share/doc/bash/bash.html ~
cp ./usr/share/doc/bash/bashref.html ~

Each cp call copies one HTML file to the home directory.

Good job! You just wrote your first program in the language of the find utility. The program works according to the following algorithm:

Find HTML files starting from the root directory. Their paths match the */doc/bash/* pattern.
Copy each found file into the home directory.

The program is quite simple and consists of two steps only. However, it is a scalable solution for finding and copying files. The program processes two or dozens of HTML files with the same speed.

You can combine the -exec actions in the same way as the search conditions. For example, let's print the contents of each found HTML file and count the number of its lines. You should call the cat utility to print the file contents. The wc utility counts the lines. It takes the filename as an input parameter. If you combine cat and wc calls, you get the following find command: {line-numbers: false, format: Bash}

find / -path "*/doc/bash/*" -name "*.html" -exec cat {} \; -exec wc -l {} \;

There is no logical operation between the -exec actions. The find utility inserts logical AND by default. This has a consequence in our case. If the cat utility fails, find does not call the wc utility. It means that find executes the second action only if the first one succeeds. You can apply the logical OR explicitly. Then find always calls wc. Here is the command with logical OR: {line-numbers: false, format: Bash}

find / -path "*/doc/bash/*" -name "*.html" -exec cat {} \; -o -exec wc -l {} \;

You can group the -exec actions with escaped parentheses $ and $. It works the same way as grouping search conditions.

{caption: "Exercise 2-3. Searching for files with the find utility", line-numbers: false}

Write a find call to search for text files in a Unix environment.
Extend the command to print the total number of lines in these files.

Boolean Expressions

The search conditions of the find utility are Boolean expressions. A Boolean expression is a programming language statement. It produces a Boolean value when evaluated. This value equals either "true" or "false".

The find condition is a statement of the utility's language. It produces the "true" value if the found object meets its requirement. Otherwise, the condition produces "false". If there are several conditions in the find call, they make a single compound Boolean expression.

When we have considered the binary numeral system, we already met Boolean algebra. This section of mathematics studies logical operators. They differ from the arithmetic operations: addition, subtraction, multiplication, and division.

You can apply a logical operator to Boolean values or expressions. Using an arithmetic operation does not make sense in this case. Addition or subtraction is trivial for Boolean values. It yields nothing. When you apply a logical operator, you get a condition with strict evaluation rules. This way, you wrote search conditions for the find utility. When you combine several conditions, you get a program with complex behavior.

An operand is an object of a logical operator. Boolean values and expressions can be operands.

Let's consider Boolean expressions using an example. The example is not related to the find utility or Bash for simplicity. Imagine that you are programming a robot for a warehouse. Its job is to move boxes from point A to point B. You can write the following straightforward algorithm for the robot:

Move to point A.
Pick up the box at point A.
Move to point B.
Put the box at point B.

This algorithm does not have any conditions. It means that the robot performs each step independently of external events.

Now imagine that an obstacle happens in the robot's way. For example, another robot stuck there. Executing your algorithm leads to the collision of the robots in this case. You should add a condition in the algorithm to prevent the collision. For example, it can look like this:

Move to point A.
Pick up the box at point A.
If there is no obstacle, move to point B. Otherwise, stop.
Put the box at point B.

The third step of the algorithm is called conditional statement. All modern programming languages have such a statement.

The conditional statement works according to the following algorithm:

Evaluate the Boolean expression in the condition.
If the expression produces "true", perform the first action.
If the expression produces "false", perform the second action.

The robot evaluates the value of the Boolean expression "there is no obstacle" in our example. If there is an obstacle, the expression produces "false" and the robot stops. Otherwise, the robot moves to point B.

When writing the conditional statement, you can combine several Boolean expressions using logical operators. Here is an example. Suppose that the robot tries to pick up a box at point A, but there is no box. Then there is no reason for him to move to point B. You can check this situation in the conditional statement. Add the new Boolean expression there using logical AND (conjunction). Then the robot's algorithm becomes like this:

Move to point A.
Pick up the box at point A.
If there is a box AND no obstacle, move to point B. Otherwise, stop.
Put the box at point B.

Logical operators produce Boolean values when evaluated. The result of a logical AND equals "true" when both operands are "true". In our example, it happens when the robot has a box and there is no obstacle on its way. Otherwise, the result of logical AND equals "false". It forces the robot to stop.

You have used two more logical operators when learning the find utility. These operators are OR (disjunction) and NOT (negation).

Actually, you have already applied logical NOT in the robot's algorithm. It stays implicitly in the expression "there is no obstacle". It equals the following negation: "there is NOT an obstacle". You can specify the logical NOT in the algorithm explicitly this way:

Move to point A.
Pick up the box at point A.
If there is a box AND there is NOT an obstacle, move to point B. Otherwise, stop.
Put the box at point B.

You can always replace logical AND by OR with some extra changes. Let's do it for our example but keep the robot's behavior the same. You should add the negation to the first Boolean expression and remove it from the second one. Also, you have to change the order of actions in the conditional statement. If the condition produces "true", the robot stops. If it produces "false", the robot moves to point B. The new algorithm looks this way:

Move to point A.
Pick up the box at point A.
If there is NOT a box OR there is an obstacle, stop. Otherwise, move to point B.
Put the box at point B.

Read the new conditional statement carefully. The robot follows the same decisions as before. It stops if it has no box or if there is an obstacle on its way. However, you have exchanged the logical AND to OR. This trick helps you to keep your conditional statements clear. Choose between logical AND and OR depending on your Boolean expressions. Pick one that fits your case better.

You wrote the Boolean expressions as sentences in English in our example. Such a sentence sounds unnatural. You have to read it several times to understand it. This happens because the natural humans' language is not suitable for writing Boolean expressions. This language is not accurate enough. Boolean algebra uses mathematical notation for that reason.

We have considered logical AND, OR and NOT. You will deal with three more operators in programming often:

Equivalence
Non-equivalence
Exclusive OR

Table 2-5 explains them.

{caption: "Table 2-5. Logical operators", width: "100%"}

Operator	Evaluation Rule
AND	It produces "true" when both operands are "true".

OR	It produces "true" when any of the operands is "true". It produces "false" when all operands are "false".

NOT	It produces "true" when the operand is "false" and vice versa.

Exclusive OR (XOR)	It produces "true" when the operands have different values (true-false or false-true). It produces "false" when the operands are the same (true-true, false-false).

Equivalence	It produces "true" when the operands have the same values.

Non-equivalence	It produces "true" when the values of the operands differ.

Try to memorize this table. It is simple to reach when you use logical operators often.

grep

The GNU utilities have one more searching tool besides find. It is called grep. This utility checks file contents when searching.

How to choose the proper utility for searching? Use find for searching a file or directory by its name, path or metadata. Metadata is extra information about an object. Examples of the file metadata are size, time of creation and last modification, permissions. Use the grep utility to find a file when you know nothing about it except its contents.

Here is an example. It shows you how to choose the right utility for searching. Suppose that you are looking for a documentation file. You know that it contains the phrase "free software". If you apply the find utility, the searching algorithm looks like this:

Call find to list all the files with the README name.
Open each file in a text editor and check if it has the phrase "free software".

Using a text editor for checking dozens of files takes too much effort and time. You should perform several operations with each file manually: open it, activate the editor's searching mode, type the "free software" phrase. The grep utility automates this task. For example, the following command finds all lines with the "free software" phrase in the specified README file: {line-numbers: false, format: Bash}

grep "free software" /usr/share/doc/bash/README

The first parameter of the utility is a string for searching. Always enclose it in double quotes. This way, you prevent Bash expansions and guarantee that the utility receives the string unchanged. Without the quotes, Bash splits the phrase into two separate parameters. This mechanism of splitting strings into words is called word splitting.

The second parameter of grep is a relative or absolute path to the file. If you specify a list of files separated by spaces, the utility processes them all. In the example, we passed the README file path only.

Figure 2-18 shows the result of the grep call.

{caption: "Figure 2-18. The output of the grep utility", width: "100%"}

You see all lines of the file where the utility found the specified phrase. The -n option adds the line numbers to the grep output. It can help you to check big text files. Add the option before the first parameter when calling the utility. Figure 2-18 shows the output in this case.

We have learned how to use grep to find a string in the specified files. Now let's apply the utility to solve our task. You are looking for the documentation files with the phrase "free software". There are two ways to find them with the grep utility:

Use Bash glob patterns.
Use the file search mechanism of the grep utility.

The first method works well when you have all files for checking in the same directory. Suppose that you found two README files: one for Bash and one for the xz utility. You have copied them to the home directory with the names bash.txt and xz.txt. The following two commands find the file that contains the phrase "free software": {line-numbers: true, format: Bash}

cd ~
grep "free software" *

The first command changes the current directory to the user's home. The second command calls the grep utility.

When calling grep, we have specified the asterisk for the target file path. This wildcard means any string. Bash expands all wildcards in the command before launching it. In our example, Bash replaces the asterisk with all files of the home directory. The resulting grep call looks like this: {line-numbers: false, format: Bash}

grep "free software" bash.txt xz.txt

Launch both versions of the grep call: with the * pattern and with a list of two files. The utility prints the same result for both cases.

You can search for the phrase in a single command. Just exclude the cd call. Then add the home directory to the search pattern. You will get the following grep call: {line-numbers: false, format: Bash}

grep "free software" ~/*

This command does not handle subdirectories. It means that the grep call does not check the files in the ~/tmp directory, for example.

There is an option to check how the Bash expands a glob pattern. Use the echo command for that. Here are echo calls for checking our patterns: {line-numbers: true, format: Bash}

echo *
echo ~/*

Run these commands. The first one lists files and subdirectories of the current directory. The second command does the same for the home directory.

Do not enclose search patterns in double quotes. Here is an example of the wrong command: {line-numbers: false, format: Bash}

grep "free software" "*"

Quotes prevent the Bash expansion. Therefore, Bash does not insert the filenames to the command but passes the asterisk to the grep utility. The utility cannot handle the glob pattern properly as find does. Thus, you will get an error like Figure 2-19 shows.

{caption: "Figure 2-19. The result of processing a search pattern by grep", width: "100%"}

When expanding the * pattern, Bash ignores hidden files and directories. Therefore, the grep utility ignores them too in our example. Add the dot before the asterisk to get the glob pattern for hidden objects. It looks like .*. If you want to check all files at once, specify two patterns separated by the space. Here is an example grep call: {line-numbers: false, format: Bash}

grep "free software" * .*

The second approach to search files with grep is using its built-in mechanism. It traverses the directories recursively and checks all files there. The -r option enables this mechanism. When using this option, specify the search directory in the second utility's parameter.

Here is an example of using the -r option: {line-numbers: false, format: Bash}

grep -r "free software" .

This command finds the "free software" phrase in the files of the current directory. It processes the hidden objects too.

If you work on Linux or macOS, prefer the -R option instead of -r. It forces grep to follow symbol links when searching. Here is an example: {line-numbers: false, format: Bash}

grep -R "free software" .

I> A symbolic link is a file of a special type. Instead of data, it contains a pointer to another file or directory.

You can specify the starting directory for searching by a relative or absolute path. Here are the examples for both cases: {line-numbers: true, format: Bash}

grep -R "free software" ilya.shpigor/tmp
grep -R "free software" /home/ilya.shpigor/tmp

Suppose that you are interested in a list of files that contain a phrase. You do not need all occurrences of the phrase in each file. The -l option switches the grep utility in the mode you need. Here is an example of using it: {line-numbers: false, format: Bash}

grep -Rl "free software" .

Figure 2-20 shows the result of this command.

{caption: "Figure 2-20. The grep outputs filenames only"}

You see a list of files where the phrase "free software" occurs at least once. Suppose that you need the opposite result: a list of files without the phrase. Use the -L option for finding them. Here is an example: {line-numbers: false, format: Bash}

grep -RL "free software" .

The grep utility processes the text files only. Therefore, it deals well with the source code files. You can use the utility as an add-on to your code editor or IDE.

You may have liked the grep utility. You want to process PDF and MS Office documents with it. Unfortunately, this approach does not work. The contents of these files are not text. It is encoded. You need another utility to process such files. Table 2-6 shows grep alternatives for non-text files.

{caption: "Table 2-6. Utilities for text searching in PDF and MS Office files", width: "70%", column-widths: "20% *"}

Utility	Features
pdftotext	It converts a PDF file into text format.

pdfgrep	It searches PDF files by their contents.

antiword	It converts an MS Office document into text format.

catdoc	It converts an MS Office document into text format.

xdoc2txt	It converts PDF and MS Office files into text format.

Some of these utilities are available in the MSYS2 environment. Use the pacman package manager for installing them. The last chapter of the book describes how to use it.

{caption: "Exercise 2-4. Searching for files with the grep utility", line-numbers: false}

Write a grep call to find system utilities with a free license.
Here are widespread licenses for open-source software:

1. GNU General Public License
2. MIT license
3. Apache license
4. BSD license

52 KiB Raw Permalink Blame History