Reference from https://en.wikipedia.org/wiki/AWK
Introduction to AWK
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems.
(DSL: A computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There is a wide variety of DSLs, ranging from widely used languages for common domains, such as HTML for web pages, down to languages used by only one or a few pieces of software, such as MUSH soft code. DSLs can be further subdivided by the kind of language, and include domain-specific markup languages, domain-specific modeling languages, and domain-specific programming languages. Special-purpose computer languages have always existed in the computer age, but the term "domain-specific language" has become more popular due to the rise of domain-specific modeling. Simpler DSLs, particularly ones used by a single application, are sometimes informally called mini-languages.)
The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data - either run directly on files or used as part of a pipeline - for purposes of extracting or transforming text, such as producing formatted reports. The language makes extensive use of the string datatype, associative arrays (arrays indexed by key strings), and regular expressions. While AWK has a limited intended application domain and was specially designed to support one-liner programs, the language is Turing-complete, and even the early Bell Labs users of AWK often wrote well-structured large AWK programs.
Structure of AWK programs.
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
-- Alfred V. Aho
An AWK program is a series of pattern action pairs, written as:
BEGIN { action } # The BEGIN block can be omitted
condition { action }
condition { action }
...
END { action } # The END block can be omitted
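The layout above can be tried with a small, made-up input. This is only a sketch: the names, scores, and the pass mark of 60 are invented for illustration.

```shell
# Hypothetical sample input: three lines of "name score".
out=$(printf 'alice 90\nbob 55\ncarol 72\n' | awk '
BEGIN    { print "passing scores:" }        # runs once, before any input is read
$2 >= 60 { print $1; n++ }                  # runs for each record whose 2nd field is >= 60
END      { print n " of " NR " passed" }    # runs once, after the last record
')
printf '%s\n' "$out"
```

The BEGIN and END actions each run once, while the middle action runs only for the records that satisfy its condition.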
Commands
The print command is used to output text. The output text is always terminated with a predefined string called the output record separator whose default value is a newline. The simplest form of this command is:
print # Same as print $0: displays the contents of the current record
print $1 # Displays the first field of the current record
print $1, $3 # Displays the first and third fields of the current record
If a pattern is given without an action, the default action is to print the current line. For example:
length($0) > 80 # Prints all lines longer than 80 characters.
/regex_pattern/ {
# Actions to perform when a line matches regex_pattern, such as writing output to a file:
print "expression" > "file_name"
# Or sending output through a pipe to a command:
print "expression" | "command"
}
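As a rough illustration of these print forms, the sketch below pipes print output to the sort command and contrasts a comma-separated print with plain concatenation. The input lines are made up for the example.

```shell
# print with a comma joins the fields with the output field separator (a space
# by default); each record is written to the sort command through a pipe.
out=$(printf 'b 2 x\na 1 y\n' | awk '{ print $1, $3 | "sort" }')

# Without the comma, the two fields are concatenated with nothing in between.
joined=$(printf 'b 2 x\n' | awk '{ print $1 $3 }')

printf '%s\n' "$out" "$joined"
```

The sorted lines only appear once awk finishes and closes the pipe, which is why "a y" comes out before "b x" even though it was read second.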
Built-in variables
NR: Number of Records. Keeps a running count of the input records read so far from all data files. It starts at zero and is never automatically reset to zero.
FNR: File Number of Records. Keeps a running count of the input records read so far in the current file. It is automatically reset to zero each time a new file is started.
NF: Number of Fields. Contains the number of fields in the current input record. The last field in the input record can be designated by $NF, the 2nd-to-last field by $(NF-1), the 3rd-to-last field by $(NF-2), etc.
FILENAME: Contains the name of the current input file.
FS: Field Separator. Contains the field separator used to divide the input record into fields. The default, "white space", includes any space and tab characters. FS can be reassigned to another character to change the field separator.
RS: Record Separator. The default record separator is a "newline" character.
OFS: Output Field Separator. Separates the fields when awk prints them. The default is a "space" character.
ORS: Output Record Separator. Separates the records when awk prints them. The default is a "newline" character.
OFMT: Output Format. Stores the format for numeric output. The default format is "%.6g".
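A short sketch of several of these variables in use. The colon-separated input and the temporary file names are assumptions made up for this example.

```shell
# FS splits the input on ":" and OFS joins the printed fields with "-".
out=$(printf 'root:x:0\ndaemon:x:1\n' | awk '
BEGIN { FS = ":"; OFS = "-" }
{ print NR, NF, $1, $NF }      # record number, field count, first and last field
')

# NR keeps counting across files, while FNR restarts for each file.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n'    > "$tmp/two.txt"
counts=$(awk '{ print NR, FNR }' "$tmp/one.txt" "$tmp/two.txt")
rm -r "$tmp"

printf '%s\n' "$out" "$counts"
```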
User-defined function
Similar to C, a function definition consists of the keyword function, the function name, the argument names, and the function body. For example:
function add_3(number){
return number + 3
}
This function can be invoked as follows:
(pattern){
print add_3(100) #outputs 103
}
Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is a convention to add some whitespace in the argument list before the local variables, to indicate where the parameters end and the local variables begin.
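The local-variable convention can be sketched as follows. The function name sum_to and its variables are hypothetical, chosen only for illustration.

```shell
out=$(awk '
# sum_to(n) adds the integers 1..n. The names after the extra spaces
# (i and total) are local variables; callers pass only n.
function sum_to(n,    i, total) {
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 99                 # a global i, which the function does not touch
    print sum_to(4), i
}')
printf '%s\n' "$out"
```

Because i is declared in the argument list of sum_to, the loop inside the function works on its own copy and the global i keeps its value.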
Examples
BEGIN { print "Hello, world!" } # Prints: Hello, world!
# Count the lines, words, and characters in the input (like the wc command):
{
words += NF
chars += length + 1 # Add one to account for the newline at the end of each line
}
END {print NR, words, chars}
#As there is no pattern for the first line of the program, every line of input matches by default, so the increment actions are executed for every line.
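A quick way to check the counting program is to feed it a tiny, made-up input. This sketch writes length($0) explicitly, which is equivalent to the bare length used above.

```shell
# Two lines: "one two" (7 characters) and "three" (5 characters).
out=$(printf 'one two\nthree\n' | awk '
{ words += NF; chars += length($0) + 1 }   # +1 counts the newline of each line
END { print NR, words, chars }
')
printf '%s\n' "$out"
```

The totals are 2 lines, 3 words, and 14 characters (8 for the first line and 6 for the second, newlines included).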
Sum the last field of each line
{ s += $NF }
END { print s + 0 } # If the input is empty, s is never set; s + 0 makes it print as 0 instead of an empty string.
Match a range of input lines
NR % 4 == 1, NR % 4 == 3 { printf "%6d %s\n", NR, $0 }
# Prints the line number and the line contents for lines 1-3, 5-7, 9-11, and so on. NR is a number, so %6d prints it as a six-character-wide field.
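To see which lines the range selects, the same program can be run on eight made-up lines:

```shell
# Eight input lines, a through h. The range opens on lines 1 and 5
# (NR % 4 == 1) and closes on lines 3 and 7 (NR % 4 == 3).
out=$(printf 'a\nb\nc\nd\ne\nf\ng\nh\n' |
    awk 'NR % 4 == 1, NR % 4 == 3 { printf "%6d %s\n", NR, $0 }')
printf '%s\n' "$out"
```

Lines 4 and 8 fall outside every range, so they are skipped.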
Printing the initial or the final part of a file
As a special case, when the first part of a range pattern is constantly true, e.g. 1, the range starts at the beginning of the input. Similarly, if the second part is constantly false, e.g. 0, the range continues until the end of the input. For example:
/^--\s*comment on 202008\s*--$/, 0
# Prints the lines of input from the first line matching the regular expression ^--\s*comment on 202008\s*--$ to the end. Note that \s is a GNU awk extension; [[:space:]] is the portable POSIX equivalent.
Calculate word frequencies using associative arrays
BEGIN {
    FS = "[^a-zA-Z]+"
}
# No pattern, so this action runs for every line.
{
    for (i = 1; i <= NF; i++) {
        words[tolower($i)]++
    }
}
END {
    for (i in words) {
        print i, words[i] # Prints each word (i) followed by its frequency (words[i]).
    }
}
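A small run of the word-frequency idea on a made-up sentence. Two details here are additions of this sketch rather than part of the program above: the output is piped through sort, because the order of a for (... in ...) loop is unspecified, and empty fields are skipped, because a line that starts or ends with a non-letter produces an empty first or last field.

```shell
out=$(printf 'The cat. The dog!\n' | awk '
BEGIN { FS = "[^a-zA-Z]+" }
{
    for (i = 1; i <= NF; i++)
        if ($i != "")               # skip empty fields at the edges of the line
            words[tolower($i)]++
}
END {
    for (w in words)
        print w, words[w]
}' | sort)
printf '%s\n' "$out"
```

Without the empty-field check, the trailing "!" would add an entry for the empty string to the words array.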
Match pattern from command line
#!/bin/sh
pattern="$1"
shift
awk '/'"$pattern"'/ { print FILENAME ":" $0 }' "$@"
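An alternative sketch passes the pattern as an awk variable with -v instead of splicing it into the program text, so characters such as / in the pattern cannot break the program's syntax. The function name grep_like and the sample file are hypothetical. One caveat is that awk processes escape sequences in -v values, so a pattern containing backslashes would need them doubled.

```shell
# grep_like PATTERN FILE... : print FILENAME:line for each matching line.
# (grep_like is a made-up helper name, not part of awk.)
grep_like() {
    pattern=$1
    shift
    awk -v pattern="$pattern" '$0 ~ pattern { print FILENAME ":" $0 }' "$@"
}

# Usage with a temporary sample file.
tmp=$(mktemp -d)
printf 'apple pie\nbanana split\n' > "$tmp/menu.txt"
out=$(cd "$tmp" && grep_like '^ban' menu.txt)
rm -r "$tmp"
printf '%s\n' "$out"
```

Here $0 ~ pattern treats the variable's string value as a dynamic regular expression, which plays the role of the /.../ literal in the script above.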