bash - each word on a separate line

Question

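I have a sentence like

This is for example

I want to write it to a file so that each word of the sentence is on its own line.

How can I do this in a shell script?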

score 25 · Accepted Answer

A few ways to do it; pick your favorite!

echo "This is for example" | tr ' ' '\n' > example.txt

Or simply do this to avoid the unnecessary use of echo:

tr ' ' '\n' <<< "This is for example" > example.txt

The <<< notation is a here-string: it feeds the string to the command on standard input.

Or use sed instead of tr:

sed "s/ /\n/g" <<< "This is for example" > example.txt
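The commands above replace each single space, so a sentence with double spaces or tabs would produce empty lines or leave tabs in place. A minimal sketch of one way around that, assuming any run of whitespace should count as one separator (the function name `words_per_line` is made up for illustration):

```shell
# Hypothetical helper: print each whitespace-separated word of $1 on its own line.
words_per_line() {
    # tr -s turns every whitespace character into a newline and squeezes
    # runs of newlines into one, so repeated spaces give no empty lines.
    printf '%s\n' "$1" | tr -s '[:space:]' '\n'
}

words_per_line "This   is  for example"
```

Redirect the call with `> example.txt` to write the result to a file. Note that leading whitespace in the input would still produce one empty first line.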

For more options, check the other answers =)

score 19 · Accepted Answer

$ echo "This is for example" | xargs -n1
This
is
for
example
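One thing worth knowing about this approach: xargs does not only split on blanks, it also interprets quotes and backslashes in its input. For plain words that makes no difference, but quoted text stays together as a single argument. A small sketch of that behavior (the wrapper name `words_via_xargs` is made up for illustration):

```shell
# Hypothetical wrapper: xargs with no command runs echo once per argument.
words_via_xargs() {
    xargs -n1
}

# The double-quoted part is treated as one argument, so "b c" stays on one line.
echo 'a "b c" d' | words_via_xargs
```

If the input may contain apostrophes or quotes, one of the tr-based answers is probably the safer choice.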

answered 2015-10-25T17:32:02.427

score 9 · Accepted Answer

Try using:

string="This is for example"

printf '%s\n' $string > filename.txt

Or take advantage of bash word splitting:

string="This is for example"

for word in $string; do
    echo "$word"
done > filename.txt
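Both variants rely on the unquoted `$string`, which is split into words but also undergoes pathname expansion, so a word such as `*` would be replaced by the file names in the current directory. A possible sketch that keeps the splitting but avoids the globbing, assuming bash and a single-line input (the function name `split_words` is made up for illustration):

```shell
# Hypothetical helper: split $1 on whitespace without pathname expansion.
split_words() {
    local -a words
    # read -a splits the line on IFS into an array; -r keeps backslashes literal.
    read -r -a words <<< "$1"
    # Print nothing at all for an empty input instead of a blank line.
    if (( ${#words[@]} > 0 )); then
        printf '%s\n' "${words[@]}"
    fi
}

split_words "This is for *"
```

As with the loop above, append `> filename.txt` to the call to write the words to a file.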

score 6 · Accepted Answer

example="This is for example"
printf "%s\n" $example

answered 2012-11-20T22:02:52.883

score 2 · Accepted Answer

Try using:

str="This is for example"
echo -e ${str// /\\n} > file.out

Output

> cat file.out 
This
is
for
example
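Because `echo -e` interprets backslash escapes, the command above would also alter any backslashes that happen to be in the text itself. A hedged variation on the same substitution that inserts real newlines and prints with printf instead, assuming bash (the function name `spaces_to_newlines` is made up for illustration):

```shell
# Hypothetical helper: replace every space in $1 with a newline, without echo -e.
spaces_to_newlines() {
    local nl=$'\n'
    # ${1// /$nl} substitutes each single space; printf '%s\n' prints it verbatim.
    printf '%s\n' "${1// /$nl}"
}

# The backslash in the input is kept as it is.
spaces_to_newlines 'C:\new folder'
```

Like the original, this replaces each space individually, so two consecutive spaces give an empty line.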

score 2 · Accepted Answer

Use the fmt command

>> echo "This is for example" | fmt -w1 > textfile.txt ; cat textfile.txt
This
is
for
example

For a full description of fmt and its options, see the relevant man page.

score 1 · Accepted Answer

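Note: I wrote this over several drafts, simplifying the regexp along the way, so if there are any inconsistencies, that is probably why.

Do you care about punctuation? For example, in some invocations you would see a "word" like (etc) exactly like that, parentheses included. Or the word would be "parentheses." rather than "parentheses". If you are parsing a file with proper sentences this could be a problem, especially if you want to sort by word or even get a count for each word.

There are ways to work around this, with some caveats and certainly room for improvement. The caveats happen to involve digits, dashes (in numbers) and decimal points/dots (in numbers). Perhaps an exact set of rules would help resolve this, but the examples below should give you something to work with. I made some contrived input examples to demonstrate these flaws (or whatever you want to call them).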

$ echo "This is an example sentence with punctuation marks and digits i.e. , . ; \! 7 8 9" | grep -o -E '\<[A-Za-z0-9.]*\>'
This
is
an
example
sentence
with
punctuation
marks
and
digits
i.e
7
8
9

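As you can see, i.e. comes out as just i.e, and otherwise the punctuation is not shown. Okay, but this leaves out things like version numbers of the form major.minor.revision-release, for example 0.0.1-1; can those be shown too? Yes: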

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[-A-Za-z0-9.]*\>'
The
current
version
is
0.0.1-1
The
previous
version
was
current
from
2017-2018

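Notice that the full stops ending the sentences are not included. What happens if you add a space between the years and the dash? You won't get the dash, but each year will be on its own line: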

$ echo "2017 - 2018" | grep -o -E '\<[-A-Za-z0-9.]*\>'
2017
2018

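The question then becomes whether you want a - by itself to be counted; because of the way the words are separated, you will not get the years as a single string if there are spaces. And since a lone dash is not a word in itself, I would think not.

I am sure these could be simplified further. Also, if you do not want any punctuation or digits at all, you can change it to: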

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is
The
previous
version
was
current
from

If you want the digits as well:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
The
previous
version
was
current
from
2017
2018

As for "words" that mix letters and digits, that is another thing you may or may not care about, but as a demonstration, this:

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
test1

outputs them. But the following does not (because it does not consider digits at all):

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is

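Ignoring punctuation is easy enough, but in some cases it might be needed or wanted. In the case of e.g. I suppose you could use, say, sed to change e.g back to e.g., but that would be a matter of personal preference, I guess.

I can summarise how it works, but only just; I am too tired to even think about it much:

How does it work?

I will only explain the invocation grep -o -E '\<[-A-Za-z0-9.]*\>', but the others are mostly the same (the pipe symbol | in extended grep allows for multiple patterns):

The -o option prints only the matches rather than the entire line. -E is for extended grep (you could also use egrep). As for the regexp itself:

The \< and \> are word boundaries (start and end respectively; you can specify only one if you want). I believe the -w option is the same as specifying both, but the invocation might be a bit different (I do not actually know).

The '\<[-A-Za-z0-9.]*\>' means dashes, upper and lower case letters, digits and dots, zero or more times. As for why it turns e.g. into e.g, at this time I can only say it is the pattern; I do not have the capacity to think about it any more.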
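The dropped trailing dot comes from the end-of-word anchor: in GNU grep, \> only matches directly after a word character (a letter, digit or underscore), and a dot is not one. The bracket expression first grabs "e.g." but then has to give back the final dot so that \> can match after the "g". A small sketch to see this in isolation, assuming GNU grep (the function name `extract_words` is made up for illustration):

```shell
# Hypothetical wrapper around the same pattern as in the answer.
extract_words() {
    grep -o -E '\<[-A-Za-z0-9.]*\>'
}

# Dots and dashes inside a token survive; a dot at the end of a token does not.
printf '%s\n' 'See e.g. version 0.0.1-1.' | extract_words
```

This prints See, e.g, version and 0.0.1-1, each on its own line.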

Bonus script for word frequency counts

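#!/bin/bash

# Print usage to stderr when no files are given.
if [ $# -eq 0 ]; then
    echo >&2 "Usage: $(basename "${0}") <FILE> [FILE...]"
    exit 1
fi

# "for file" without a list iterates over the positional parameters.
for file do
    if [ -e "${file}" ]
    then
        echo "** ${file}: "
        # One word per line, then count duplicates, most frequent first.
        grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|sort | uniq -c | sort -rn
    else
        # Report the file that is actually missing, not always the first argument.
        echo >&2 "${file}: file not found"
        continue
    fi
done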

Example:

$ cat example 
The current version is 0.0.1-1 but the previous version was non-existent.

This sentence contains an abbreviation i.e. e.g. (so actually two abbreviations).

This sentence has no numbers and no punctuation  
$ ./wordfreq example 
** example: 
   2 version
   2 sentence
   2 no
   2 This
   1 was
   1 two
   1 the
   1 so
   1 punctuation
   1 previous
   1 numbers
   1 non-existent
   1 is
   1 i.e
   1 has
   1 e.g
   1 current
   1 contains
   1 but
   1 and
   1 an
   1 actually
   1 abbreviations
   1 abbreviation
   1 The
   1 0.0.1-1

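Note that I did not convert uppercase to lowercase, so "The" and "the" show up as different words. If you want everything in lowercase, you can change the grep invocation in the script to pipe through tr before sorting: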

    grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}"|tr '[:upper:]' '[:lower:]'|sort | uniq -c | sort -rn

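Oh, and since you asked about writing it to a file, you can just append this to the command line (this is for the original invocations):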

> output_file

For the script you can use it like so:

$ ./wordfreq file1 file2 file3 > output_file
