3.6 Utilities
1. Generating test data
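Most of the examples in this section need some test data to work with.
On Ubuntu, wordlist is not a command but a virtual package. Each package that provides it installs a plain-text word list under /usr/share/dict/, and the commands below read that list through /usr/share/dict/words.
To install a word list on Ubuntu, pick one of the providing packages: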
# if you try to install wordlist, you will get the following error
sudo apt install wordlist
# Result will be
# Reading package lists... Done
# Building dependency tree... Done
# Reading state information... Done
# Package wordlist is a virtual package provided by:
# wspanish 1.0.30
# witalian 1.10
# wfrench 1.2.7-2
# wswedish 1.4.5-3
# wcatalan 0.20111230b-14
# wcanadian-small 2020.12.07-2
# wcanadian-large 2020.12.07-2
# wcanadian-insane 2020.12.07-2
# wcanadian-huge 2020.12.07-2
# wcanadian 2020.12.07-2
# wbritish-small 2020.12.07-2
# wbritish-large 2020.12.07-2
# wbritish-insane 2020.12.07-2
# wbritish-huge 2020.12.07-2
# wbritish 2020.12.07-2
# wamerican-small 2020.12.07-2
# wamerican-large 2020.12.07-2
# wamerican-insane 2020.12.07-2
# wamerican-huge 2020.12.07-2
# wamerican 2020.12.07-2
# wnorwegian 2.2-4
# miscfiles 1.5+dfsg-4
# wgerman-medical 20220425-1
# wportuguese 20220621-1
# wukrainian 1.8.0+dfsg-1
# wgalician-minimos 0.5-48
# wfaroese 0.4.2+repack1-4
# wpolish 20220301-1
# wswiss 20161207-11
# wngerman 20161207-11
# wogerman 1:2-38
# wesperanto 2.1.2000.02.25-61
# wdutch 1:2.20.19-2
# wdanish 1.6.36-14
# wbrazilian 3.0~beta4-24
# wbulgarian 4.1-7
# You should explicitly select one to install.
# E: Package 'wordlist' has no installation candidate
# Install one of the packages listed above to get /usr/share/dict/words
sudo apt-get install wamerican-small
To generate a file with 1000 words, run the following command:
# Pick 1000 random words and write them to random_words.txt
shuf -n 1000 /usr/share/dict/words > random_words.txt
If you want more control over which words are picked, filter the list with awk before sampling.
# Keep words of 6 to 9 characters, then pick 1000 of them at random and write them to random_words.txt
awk 'length($0) > 5 && length($0) < 10' /usr/share/dict/words | shuf -n 1000 > random_words.txt
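When /usr/share/dict/words is not installed, the same filter-then-sample pipeline can be tried against a small hand-made list. The sketch below is only illustrative: the word list, the temporary directory, and the variable names are assumptions made for the example, and shuf is assumed to be available (it ships with GNU coreutils).

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# A tiny stand-in for /usr/share/dict/words (hypothetical contents).
printf '%s\n' cat banana cherry apple watermelon orange kiwi strawberry grapes papaya \
  > "$workdir/words"

# Keep words of 6 to 9 characters, then pick 3 of them at random.
awk 'length($0) > 5 && length($0) < 10' "$workdir/words" | shuf -n 3 \
  > "$workdir/random_words.txt"

# Count the words written and the words that fall outside the length range.
count=$(wc -l < "$workdir/random_words.txt" | tr -d ' ')
outside=$(awk 'length($0) < 6 || length($0) > 9' "$workdir/random_words.txt" | wc -l | tr -d ' ')

echo "wrote $count words, $outside outside the 6-9 character range"
cat "$workdir/random_words.txt"

rm -rf "$workdir"
```

The selected words differ from run to run because shuf samples at random; only the count and the length range are fixed.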
2. sed
sed, short for “stream editor,” is a command-line utility for parsing and transforming text. It is commonly used for substitutions, deletions, insertions, and more. Available on UNIX and UNIX-like operating systems such as Linux and macOS, sed reads text input (from a file or from a stream), edits it according to a set of expressions or commands, and then outputs the edited text.
Basic Syntax
The basic syntax of a sed command is:
sed [options] 'command' [input-file]
Common Operations
Substitute: Replace occurrences of a string (or regular expression) with another string.
# Replace 'apple' with 'orange' in the file random_words.txt
sed 's/apple/orange/' random_words.txt
Note: this replaces only the first occurrence of ‘apple’ on each line. To replace all occurrences, use the g (global) flag:
sed 's/apple/orange/g' random_words.txt
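The difference between the two forms is easiest to see on a line that contains the pattern twice. The following is a minimal sketch; the sample sentence and the variable names are made up for the illustration.

```shell
#!/bin/sh
# One input line that contains 'apple' twice (hypothetical sample text).
line='apple pie and apple juice'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/apple/orange/')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/apple/orange/g')

echo "$first_only"    # orange pie and apple juice
echo "$all_matches"   # orange pie and orange juice
```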
Delete Lines: Remove lines that match a particular pattern.
# Print random_words.txt without the lines containing 'apple' (the file itself is not changed)
sed '/apple/d' random_words.txt
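Two details of the d command are worth checking on a small input: the pattern matches anywhere in the line, and the input file is left untouched. The sketch below uses a hypothetical fruits.txt in a temporary directory.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample file: two of the four lines contain 'apple'.
printf '%s\n' pear apple plum pineapple > "$workdir/fruits.txt"

# d removes matching lines from the output only; 'pineapple' matches too.
filtered=$(sed '/apple/d' "$workdir/fruits.txt")
echo "$filtered"

# The file on disk still has all four lines.
remaining=$(wc -l < "$workdir/fruits.txt" | tr -d ' ')
echo "lines still in file: $remaining"

rm -rf "$workdir"
```

To match only whole lines equal to apple, the pattern could be anchored as /^apple$/.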
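Insert and Append: Insert or append lines around a match. The one-line form shown here is GNU sed syntax; POSIX and BSD/macOS sed expect a newline after the backslash.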
# Insert a line "Fruits:" before every line containing 'apple'
sed '/apple/i\Fruits:' random_words.txt
# Append a line "Yummy!" after every line containing 'apple'
sed '/apple/a\Yummy!' random_words.txt
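To see where the new lines land, the two commands can be run on a three-line input. This is a small sketch that assumes GNU sed (for the one-line i\ and a\ forms); the sample lines and variable names are illustrative.

```shell
#!/bin/sh
# Three sample lines; only the middle one matches /apple/.
input=$(printf '%s\n' pear apple plum)

# i\ adds the new line before each matching line (GNU one-line form).
inserted=$(printf '%s\n' "$input" | sed '/apple/i\Fruits:')

# a\ adds the new line after each matching line (GNU one-line form).
appended=$(printf '%s\n' "$input" | sed '/apple/a\Yummy!')

echo "$inserted"
echo "---"
echo "$appended"
```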
Options
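-i: Edit the file in place (use with caution, because the original content is overwritten). The form below is GNU sed; BSD/macOS sed expects a backup suffix after -i, which may be empty (-i '').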
sed -i 's/apple/orange/g' random_words.txt
# This replaces all instances of 'apple' with 'orange' in random_words.txt, modifying the file directly.
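A safer variant of in-place editing keeps a copy of the original by attaching a suffix directly to -i. The sketch below assumes a hypothetical menu.txt in a temporary directory; the .bak suffix is an arbitrary choice.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical file to edit.
printf '%s\n' 'apple tart' 'plum cake' > "$workdir/menu.txt"

# -i.bak rewrites menu.txt and saves the original as menu.txt.bak.
sed -i.bak 's/apple/orange/g' "$workdir/menu.txt"

edited=$(cat "$workdir/menu.txt")
backup=$(cat "$workdir/menu.txt.bak")

echo "edited:"; echo "$edited"
echo "backup:"; echo "$backup"

rm -rf "$workdir"
```

Writing the suffix without a space after -i is accepted by both GNU and BSD sed, which makes this form a reasonable default in scripts that must run on Linux and macOS.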
-n: Suppress automatic printing (useful with the p command to print specific lines).
# Print only lines containing 'apple'
sed -n '/apple/p' random_words.txt
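The reason -n and p are usually paired becomes clear when p is used without -n: matching lines are printed twice, once by p and once by the automatic print. A short sketch with made-up input:

```shell
#!/bin/sh
# Four sample lines (hypothetical input).
input=$(printf '%s\n' one two three four)

# With -n, only what p prints is shown: here lines 2 through 3.
range=$(printf '%s\n' "$input" | sed -n '2,3p')

# Without -n, every line is printed automatically and the match is printed again.
doubled=$(printf '%s\n' "$input" | sed '/two/p')

echo "$range"
echo "---"
echo "$doubled"
```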
-e: Allows for multiple editing commands; each -e adds one command, and they are applied in order.
sed -e 's/apple/orange/' -e 's/banana/peach/' random_words.txt
# This replaces the first instance of 'apple' with 'orange' and the first instance of 'banana' with 'peach' on each line.
-f: Specifies a file that contains sed commands.
sed -f commands.sed random_words.txt
# Here, commands.sed is a file containing sed commands that are to be applied to random_words.txt.
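As an illustration of what such a command file might look like, the sketch below writes a two-command commands.sed and applies it to a small input. The file contents, the desserts.txt input, and the temporary directory are assumptions made for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# One sed command per line; lines starting with '# ' are comments.
cat > "$workdir/commands.sed" <<'EOF'
# Replace every apple with orange
s/apple/orange/g
# Drop lines that contain banana
/banana/d
EOF

# Hypothetical input file.
printf '%s\n' 'apple pie' 'banana split' 'apple and apple' > "$workdir/desserts.txt"

# Apply all commands from the script file to each input line.
result=$(sed -f "$workdir/commands.sed" "$workdir/desserts.txt")
echo "$result"

rm -rf "$workdir"
```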
3. awk
awk is a text-processing utility in Unix-like operating systems that is often used for data extraction and reporting. Named after its original authors (Alfred Aho, Peter Weinberger, and Brian Kernighan), awk interprets a special-purpose programming language designed to handle a wide range of text manipulation tasks. It’s particularly good for working with tabular data or text files separated by delimiters like commas, spaces, or tabs.
Basic Syntax
The basic syntax of an awk command is:
awk 'pattern { action }' file
pattern: Specifies a condition, or a set of conditions. If the pattern matches, the corresponding action is executed.
action: Describes what to do when a match for the pattern is found in the text or stream.
file: The file to read the data from.
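Either part of the pattern-action pair may be omitted: a pattern alone prints the matching lines, and an action alone runs on every line. The sketch below shows all three forms on a hypothetical stock.txt; the file name and its contents are invented for the illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical two-column file: item name and quantity.
printf '%s\n' 'pens 3' 'paper 25' 'ink 40' > "$workdir/stock.txt"

# Pattern only: the default action prints the whole matching line.
matching_lines=$(awk '$2 > 10' "$workdir/stock.txt")

# Pattern and action: print just the first field of matching lines.
matching_names=$(awk '$2 > 10 { print $1 }' "$workdir/stock.txt")

# Action only: runs for every line; END runs once after the last line.
line_count=$(awk '{ n++ } END { print n }' "$workdir/stock.txt")

echo "$matching_lines"
echo "$matching_names"
echo "$line_count"

rm -rf "$workdir"
```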
Simple Example
Let’s say you have a text file named party.txt containing:
Tequila 4
Vodka 6
Wine 12
Beer 48
You could use awk to print just the second column:
awk '{ print $2 }' party.txt
This would output:
4
6
12
48
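By default awk splits fields on spaces and tabs. For other delimiters, such as commas, the -F option sets the field separator. The sketch below assumes a comma-separated variant of the file, here called party.csv, which is a hypothetical name and not part of the example above.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Comma-separated version of the party data (hypothetical file).
printf '%s\n' 'Tequila,4' 'Vodka,6' 'Wine,12' 'Beer,48' > "$workdir/party.csv"

# -F',' tells awk to split each line on commas instead of whitespace.
quantities=$(awk -F',' '{ print $2 }' "$workdir/party.csv")
echo "$quantities"

rm -rf "$workdir"
```

This simple split does not understand quoted fields that contain commas, so it suits plain delimited data rather than full CSV.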
Features and Capabilities
Column Manipulation: awk is commonly used to read and manipulate columnar data.
# Prints the first column of the file "party.txt"
awk '{ print $1 }' party.txt
Text Transformation: You can use awk to perform a variety of text transformations.
# Prints the first column of the file "data.txt" in lowercase
awk '{ print tolower($1) }' data.txt
Mathematical Operations: awk can perform various mathematical operations.
# Sum the second column of the file "data.txt"
awk '{ sum += $2 } END { print sum }' data.txt
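The same accumulate-then-report pattern extends to other aggregates. As a sketch, the block below computes both the total and the average of the second column, using the party.txt values from the earlier example recreated in a temporary directory (the directory and variable names are assumptions).

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Same contents as the party.txt example.
printf '%s\n' 'Tequila 4' 'Vodka 6' 'Wine 12' 'Beer 48' > "$workdir/party.txt"

# Add up column 2 on every line, then print the total once at the end.
total=$(awk '{ sum += $2 } END { print sum }' "$workdir/party.txt")

# NR holds the number of lines read; guard against an empty file.
average=$(awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' "$workdir/party.txt")

echo "total: $total"       # total: 70
echo "average: $average"   # average: 17.5

rm -rf "$workdir"
```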
Conditional Statements: awk supports if-else statements, along with while and for loops.
awk '{ if ($2 > 10) print $1 " has value greater than 10"; else print $1 " has value less than or equal to 10" }' data.txt
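Loops follow the same C-like syntax. As a small sketch, the for loop below walks over every field of a line using the built-in NF (number of fields) and prints a per-row total; the numeric input is made up for the example.

```shell
#!/bin/sh
# Two rows of numbers (hypothetical input).
rows=$(printf '%s\n' '1 2 3' '10 20 30')

# For each line, reset the total, add every field from $1 to $NF, and print it.
row_totals=$(printf '%s\n' "$rows" |
  awk '{ total = 0; for (i = 1; i <= NF; i++) total += $i; print total }')

echo "$row_totals"
```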
String Manipulation: awk offers functions for substring, length, and other string manipulations.
awk '{ print length($1) }' data.txt # prints the length of the first column
Built-in Functions: awk has a number of built-in functions for string manipulation, mathematical operations, and more.
awk '{ print sqrt($2) }' data.txt # prints the square root
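Several built-in functions can be combined in a single print statement. The sketch below uses substr, toupper, length, and sqrt on two made-up input lines; the values were chosen as perfect squares so the square roots print as whole numbers.

```shell
#!/bin/sh
# Two sample lines: a name and a number (hypothetical input).
input=$(printf '%s\n' 'Tequila 4' 'Rum 9')

# For each line: first three letters in uppercase, name length, square root of column 2.
summary=$(printf '%s\n' "$input" |
  awk '{ print toupper(substr($1, 1, 3)), length($1), sqrt($2) }')

echo "$summary"
```

The commas in the print statement separate the values with a single space, which is awk's default output field separator.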