bash - How to produce array results using hxselect?

Question

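I'm using hxselect to process an HTML file in bash.

This file contains multiple divs defined with the '.row' class.

In bash, I want to extract these "rows" into an array. (The divs span multiple lines, so simply reading the file line by line is not suitable.)

Is it possible to achieve this? (With basic tools: awk, grep, etc.)

Once the rows are assigned to an array, I want to process them further: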

for row in "${ROWS_EXTRACTED[@]}"; do
    PROCESS1 "$row"
    PROCESS2 "$row"
done

Thanks!

score 0 · Accepted Answer

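The following tells hxselect to separate matches with a tab, deletes all newlines, and then converts the tab separators into newlines. This lets you iterate over the divs as lines with read: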

#!/bin/bash

divs=$(hxselect -s '\t' .row < "$1" | tr -d '\n' | tr '\t' '\n')

while read -r div; do
    echo "$div"
done <<< "$divs"

Given the following test input:

<div class="container">
  <div class="row">
    herp
    derp
  </div>
  <div class="row">
    derp
    herp
  </div>
</div>

Result:

$ ./test.sh test.html
<div class="row">    herp    derp  </div>
<div class="row">    derp    herp  </div>
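
Since the question asks for an array rather than a loop over lines, the same tab-separated stream can be loaded into a bash array with mapfile. This is a minimal sketch, not part of the answer above: the function name read_rows and the array name ROWS are made up for illustration, it assumes bash 4 or later (for mapfile), and it assumes the div contents contain no tab characters.

```shell
#!/bin/bash
# Hypothetical helper: read tab-separated hxselect output from stdin and
# store one flattened div per element in the global array ROWS.
read_rows() {
    local flat
    # Drop the newlines inside each div, then turn each tab into a newline.
    flat=$(tr -d '\n' | tr '\t' '\n')
    ROWS=()
    if [[ -n $flat ]]; then
        mapfile -t ROWS <<< "$flat"
    fi
}

# Example call (assumes hxselect is installed and test.html exists):
#   read_rows < <(hxselect -s '\t' .row < test.html)
#   for row in "${ROWS[@]}"; do echo "$row"; done
```

The function is fed through a redirection rather than a pipe, because a pipe would run it in a subshell and the array would be lost when that subshell exits.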

score 0 · Accepted Answer

One possibility is to put the contents of the tags into an array, with each item enclosed in quotes. For example:

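# Collect the matches with " " as separator: item1" "item2" "
items=$(hxselect -i -c -s '" "' 'div.row' < file.html)
# Add " to the beginning of the string and remove the last
items='"'${items%'"'}
# Let bash parse the quoted items into a real array.
# Trusted input only: eval also expands $ and backticks found in the HTML.
eval "array=($items)"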

Then process it in a for loop:

for index in "${!array[@]}"; do printf "  %s\n\n" "${array[$index]}"; done

If the tags contain quote characters, another solution is to put a separator between the tag contents (§ in my example):

array=$(hxselect -i -c -s '§' 'div.row' < file.html)

Then process it with awk:

# Keep only the separators to count them with ${#res}
res="${array//[^§]}"
for (( i=1; i<=${#res}; i++ ))
do
    # Flatten newlines so awk sees a single record, then print field i
    printf '%s' "$array" | tr '\n' ' ' | awk -v i="$i" -F '§' '{print $i}'
    echo "----------------------------------------"
done
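
The separator-delimited string can also be split into a real bash array with parameter expansion alone, which keeps the newlines inside each item and avoids one awk call per item. This is a minimal sketch, not part of the answer above: the function name split_on_separator and the array name rows are made up for illustration, and it assumes the separator never appears inside the tag contents.

```shell
#!/bin/bash
# Hypothetical helper: split "$1" on every occurrence of the separator "$2"
# and store the pieces in the global array "rows".
split_on_separator() {
    local data=$1 sep=$2
    # An empty separator would never shrink the input, so refuse it.
    [[ -n $sep ]] || return 1
    rows=()
    # Each pass peels off the text before the first remaining separator.
    while [[ $data == *"$sep"* ]]; do
        rows+=("${data%%"$sep"*}")
        data=${data#*"$sep"}
    done
    # Text after the last separator is ignored; hxselect -s prints the
    # separator after every match, so nothing should be left over.
}

# Example call (assumes hxselect is installed and file.html exists):
#   split_on_separator "$(hxselect -i -c -s '§' 'div.row' < file.html)" '§'
#   for row in "${rows[@]}"; do printf '%s\n---\n' "$row"; done
```

Because the separator is matched as a literal string, a multi-byte character such as § or a longer marker both work.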
