5

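I have a small problem: I need to count words from the console for doc, docx, pptx, ppt, xls, xlsx, odt, pdf ... files. Please don't suggest | wc -w or grep: they only work on plain text or console output and only split on whitespace, while Japanese, Chinese, Arabic, Hindi, and Hebrew use different delimiters, so the word count comes out wrong. I tried counting with these: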

pdftotext file.pdf -| wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc -| wc -w 
antiword file.word -| wc -w

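In some cases Microsoft Word and OpenOffice report 1000 words while these counters return 10 or 300 words when the language is Japanese, Chinese, Hindi, etc. With ordinary (Latin) characters I have no problem: the biggest discrepancy is a count that is 3 lower in some cases, which is "OK".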

I also tried converting with soffice / openoffice and then running wc -w, but I could not even get the conversion to work:

soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx 

or

 openoffice.org  --headless  --convert-to  ........

or

openoffice.org3 --invisible 

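So if anyone knows a way to count correctly, using openoffice or anything else, or any way to display document statistics from the console, please share.

Thanks.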

4

5 Answers

2

If you have Microsoft Word (and Windows, of course), you can write a VBA macro, or, if you want to run straight from the command line, a VBScript script along these lines:

Set wordApp = CreateObject("Word.Application")
Set doc = ... ' open up a Word document using wordApp
docWordCount = doc.Words.Count
' Rinse and repeat...

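If you have OpenOffice.org/LibreOffice, you have similar (but more) options. If you want to stay inside the office application and run a macro, you can probably do that. I don't know the StarBasic API well enough to tell you how, but I can offer another option: create a Python script that gets the word count from the command line. Roughly, you do the following: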
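
As a rough sketch of that idea (not the answer author's original steps), a Python script can ask a headless LibreOffice to export the document to plain text and then do the counting itself. Everything named here is an assumption: that a `soffice` binary is on the PATH, that the `txt:Text (encoded):UTF8` export filter is available for the input format (it targets text documents; spreadsheets and presentations may need a different filter), and the helper names `is_cjk`, `count_words` and `document_word_count`. The counting rule is also an assumption: each Chinese/Japanese character counts as one word, and any other whitespace-delimited run counts as one word.

```python
import subprocess
import tempfile
import unicodedata
from pathlib import Path


def is_cjk(ch):
    """Rough test for characters from scripts written without spaces."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA"))


def count_words(text):
    """Count each CJK character as one word and every other
    whitespace-delimited run as one word."""
    count = 0
    in_word = False
    for ch in text:
        if is_cjk(ch):
            count += 1
            in_word = False
        elif ch.isspace():
            in_word = False
        elif unicodedata.category(ch).startswith("P"):
            continue  # punctuation neither starts nor ends a word
        elif not in_word:
            count += 1
            in_word = True
    return count


def document_word_count(path, soffice="soffice"):
    """Export `path` to UTF-8 text with headless LibreOffice, then count.

    Assumes `soffice` is installed and the text export filter applies.
    """
    with tempfile.TemporaryDirectory() as outdir:
        subprocess.run(
            [soffice, "--headless", "--convert-to",
             "txt:Text (encoded):UTF8", "--outdir", outdir, str(path)],
            check=True,
        )
        txt = Path(outdir) / (Path(path).stem + ".txt")
        return count_words(txt.read_text(encoding="utf-8", errors="replace"))
```

Calling `document_word_count("some.docx")` (a hypothetical file name) would return an integer. The result will not necessarily match the figure Word or Writer shows, since each application applies its own word-boundary rules.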

answered 2013-04-14T07:40:25.660
1

I found the answer: create a service

#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#

(while sleep 1; do
  ls pathwithfiles/in | while read file; do
    libreoffice --headless --convert-to pdf "pathwithfiles/in/$file" --outdir pathwithfiles/out
    rm "pathwithfiles/in/$file"
  done
done) &

Then the PHP script I needed, which does all the counting:

 $ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
        if ($ext !== 'txt' && $ext !== 'pdf') {
            // Convert to pdf
            // time() replaces the deprecated argument-less mktime()
            $tb = time() . mt_rand();
            $tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
            copy($absolute_file_path, $tempfile);
            $absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
            $ext = 'pdf';
            while (!is_file($absolute_file_path)) sleep(1);
        }
        if ($ext !== 'txt') {
            // Convert to txt
            $tempfile = tempnam(sys_get_temp_dir(), '');
            // Escape both paths so spaces and quotes cannot break the command
            shell_exec('pdftotext ' . escapeshellarg($absolute_file_path) . ' ' . escapeshellarg($tempfile));
            $absolute_file_path = $tempfile;
            $ext = 'txt';
        }
        if ($ext === 'txt') {
            $seq = '/[\s\.,;:!\? ]+/mu';
            $plain = file_get_contents($absolute_file_path);
            $plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
            $str = preg_replace($seq, '', $plain);
            $chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
            $words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
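            // No words found at all: fall back to the character count
            if ($words === 0) return $chars;
            // More than 10 characters per "word" suggests a script without spaces (e.g. CJK): count characters instead
            if ($chars / $words > 10) $words = $chars;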
            return $words;
        }
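
For readers who want to try the counting heuristic on its own, here is a minimal Python port of the `txt` branch above. It is a sketch, not a drop-in replacement: the names `SEPARATORS`, `TEMPLATE` and `estimate_words` are made up for illustration, and it only reproduces the logic of splitting on whitespace and basic punctuation, then switching to a character count when the average "word" is longer than 10 characters.

```python
import re

# Whitespace and basic punctuation act as word separators
SEPARATORS = re.compile(r"[\s.,;:!?]+")
# Blocks wrapped in {{{ ... }}} are dropped before counting
TEMPLATE = re.compile(r"\{\{\{.*?\}\}\}", re.DOTALL)


def estimate_words(plain):
    """Return a word count, or a character count for text without spaces."""
    plain = TEMPLATE.sub("", plain)
    # Characters that remain once all separators are removed
    chars = len(SEPARATORS.sub("", plain))
    # Non-empty pieces between separators
    words = len([w for w in SEPARATORS.split(plain) if w])
    if words == 0:
        return chars
    if chars / words > 10:
        # Unusually long "words": treat every character as a word
        return chars
    return words
```

One consequence of the threshold is worth knowing: a text consisting of a single run of ten or fewer CJK characters is still reported as one word, because 10 characters per word does not exceed the limit.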
answered 2013-04-15T09:47:48.147
0

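wc understands Unicode and uses the system's iswspace function to decide whether a Unicode character is whitespace. "The iswspace() function tests whether wc is a wide-character code representing a character of class space in the program's current locale." So if your locale (LC_CTYPE) is configured correctly, wc -w should be able to count words correctly.

Source code of the wc program

Manual page for the iswspace function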
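
To make concrete what Unicode-aware whitespace detection does and does not buy you, here is a small Python analogy. It is not wc's implementation: it relies on `str.split()`, which splits on any Unicode whitespace, and the function name `whitespace_word_count` is made up for this example.

```python
def whitespace_word_count(text):
    """Count runs of non-whitespace characters, as a whitespace-based
    counter would. str.split() with no arguments recognises Unicode
    whitespace, including U+3000 IDEOGRAPHIC SPACE."""
    return len(text.split())


# Space-separated text is counted as expected
english = whitespace_word_count("the quick brown fox")

# An ideographic space is recognised as a separator
spaced = whitespace_word_count("東京\u3000大阪")

# A sentence written without any spaces is one single "word"
unspaced = whitespace_word_count("東京は日本の首都です")

print(english, spaced, unspaced)
```

Recognising more kinds of whitespace helps only when the text contains separators at all. Chinese or Japanese running text usually has none, so a purely whitespace-based count stays far below what a word processor reports.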

answered 2013-03-10T01:06:39.127
0

I think this may do what you are after:

# Continuously updating word count
import unohelper, uno, os, time
from com.sun.star.i18n.WordType import WORD_COUNT
from com.sun.star.i18n import Boundary
from com.sun.star.lang import Locale
from com.sun.star.awt import XTopWindowListener

#socket = True
socket = False
localContext = uno.getComponentContext()

if socket:
    resolver = localContext.ServiceManager.createInstanceWithContext('com.sun.star.bridge.UnoUrlResolver', localContext)
    ctx = resolver.resolve('uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
else: ctx = localContext

smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)

waittime = 1 # seconds

def getWordCountGoal():
    doc = XSCRIPTCONTEXT.getDocument()
    retval = 0

    # Only if the field exists
    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        # Get the field
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        retval = wordcountgoal.Content

    return retval

goal = getWordCountGoal()

def setWordCountGoal(goal):
    doc = XSCRIPTCONTEXT.getDocument()

    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        wordcountgoal.Content = goal

    # Refresh the field if inserted in the document from Insert > Fields >
    # Other... > Variables > Userdefined fields
    doc.TextFields.refresh()

def printOut(txt):
    if socket: print(txt)
    else:
        model = desktop.getCurrentComponent()
        text = model.Text
        cursor = text.createTextCursorByRange(text.getEnd())
        text.insertString(cursor, txt + '\r', 0)

def hotCount(st):
    '''Counts the number of words in a string.

    ARGUMENTS:

    str st: count the number of words in this string

    RETURNS:

    int: the number of words in st'''
    startpos = 0
    nextwd = Boundary()
    lc = Locale()
    lc.Language = 'en'
    numwords = 1
    mystartpos = 1
    brk = smgr.createInstanceWithContext('com.sun.star.i18n.BreakIterator', ctx)
    nextwd = brk.nextWord(st, startpos, lc, WORD_COUNT)
    while nextwd.startPos != nextwd.endPos:
        numwords += 1
        nw = nextwd.startPos
        nextwd = brk.nextWord(st, nw, lc, WORD_COUNT)

    return numwords

def updateCount(wordCountModel, percentModel):
    '''Updates the GUI.
    Updates the word count and the percentage completed in the GUI. If some
    text of more than one word is selected (including in multiple selections by
    holding down the Ctrl/Cmd key), it updates the GUI based on the selection;
    if not, on the whole document.'''

    model = desktop.getCurrentComponent()
    try:
        if not model.supportsService('com.sun.star.text.TextDocument'):
            return
    except AttributeError: return

    sel = model.getCurrentSelection()
    try: selcount = sel.getCount()
    except AttributeError: return

    if selcount == 1 and sel.getByIndex(0).getString() == '':
        selcount = 0

    selwords = 0
    for nsel in range(selcount):
        thisrange = sel.getByIndex(nsel)
        atext = thisrange.getString()
        selwords += hotCount(atext)

    if selwords > 1: wc = selwords
    else:
        try: wc = model.WordCount
        except AttributeError: return
    wordCountModel.Label = str(wc)

    if goal != 0:
        pc_text =  100 * (wc / float(goal))
        #pc_text = '(%.2f percent)' % (100 * (wc / float(goal)))
        percentModel.ProgressValue = pc_text
    else:
        percentModel.ProgressValue = 0

# This is the user interface bit. It looks more or less like this:

###############################
# Word Count            _ o x #
###############################
#            _____            #
#     451 /  |500|            #
#            -----            #
# ___________________________ #
# |##############           | #
# --------------------------- #
###############################

# The boxed `500' is the text entry box.

class WindowClosingListener(unohelper.Base, XTopWindowListener):
    def __init__(self):
        global keepGoing

        keepGoing = True
    def windowClosing(self, e):
        global keepGoing

        keepGoing = False
        setWordCountGoal(goal)
        e.Source.setVisible(False)

def addControl(controlType, dlgModel, x, y, width, height, label, name = None):
    control = dlgModel.createInstance(controlType)
    control.PositionX = x
    control.PositionY = y
    control.Width = width
    control.Height = height
    if controlType == 'com.sun.star.awt.UnoControlFixedTextModel':
        control.Label = label
    elif controlType == 'com.sun.star.awt.UnoControlEditModel':
        control.Text = label
    elif controlType == 'com.sun.star.awt.UnoControlProgressBarModel':
        control.ProgressValue = label

    if name:
        control.Name = name
        dlgModel.insertByName(name, control)
    else:
        control.Name = 'unnamed'
        dlgModel.insertByName('unnamed', control)

    return control

def loopTheLoop(goalModel, wordCountModel, percentModel):
    global goal

    while keepGoing:
        try: goal = int(goalModel.Text)
        except ValueError: goal = 0
        updateCount(wordCountModel, percentModel)
        time.sleep(waittime)

if not socket:
    import threading
    class UpdaterThread(threading.Thread):
        def __init__(self, goalModel, wordCountModel, percentModel):
            threading.Thread.__init__(self)

            self.goalModel = goalModel
            self.wordCountModel = wordCountModel
            self.percentModel = percentModel

        def run(self):
            loopTheLoop(self.goalModel, self.wordCountModel, self.percentModel)

def wordCount(arg = None):
    '''Displays a continuously updating word count.'''
    dialogModel = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialogModel', ctx)

    dialogModel.PositionX = XSCRIPTCONTEXT.getDocument().CurrentController.Frame.ContainerWindow.PosSize.Width / 2.2 - 105
    dialogModel.Width = 100
    dialogModel.Height = 30
    dialogModel.Title = 'Word Count'

    lblWc = addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 2, 25, 14, '', 'lblWc')
    lblWc.Align = 2 # Align right
    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 33, 2, 10, 14, ' / ')
    txtGoal = addControl('com.sun.star.awt.UnoControlEditModel', dialogModel, 45, 1, 25, 12, '', 'txtGoal')
    txtGoal.Text = goal

    #addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 25, 50, 14, '(percent)', 'lblPercent')

    ProgressBar = addControl('com.sun.star.awt.UnoControlProgressBarModel', dialogModel, 6, 15, 88, 10,'' , 'lblPercent')
    ProgressBar.ProgressValueMin = 0
    ProgressBar.ProgressValueMax =100
    #ProgressBar.Border = 2
    #ProgressBar.BorderColor = 255
    #ProgressBar.FillColor = 255
    #ProgressBar.BackgroundColor = 255

    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 124, 2, 12, 14, '', 'lblMinus')

    controlContainer = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialog', ctx)
    controlContainer.setModel(dialogModel)

    controlContainer.addTopWindowListener(WindowClosingListener())
    controlContainer.setVisible(True)
    goalModel = controlContainer.getControl('txtGoal').getModel()
    wordCountModel = controlContainer.getControl('lblWc').getModel()
    percentModel = controlContainer.getControl('lblPercent').getModel()
    ProgressBar.ProgressValue = percentModel.ProgressValue

    if socket:
        loopTheLoop(goalModel, wordCountModel, percentModel)
    else:
        uthread = UpdaterThread(goalModel, wordCountModel, percentModel)
        uthread.start()

keepGoing = True
if socket:
    wordCount()
else:
    g_exportedScripts = wordCount,

Link for more information:

https://superuser.com/questions/529446/running-word-count-in-openoffice-writer

Hope this helps, Tom

Edit: then I found this:

http://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=22555

answered 2013-03-08T13:36:24.383
0

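Just building on what @Yawar wrote. Here are more explicit steps for getting a word count from the console using MS Word.

I also use Range.ComputeStatistics(wdStatisticWords), which is more accurate than the Words property. For more information, see: https://support.microsoft.com/en-za/help/291447/word-count-appears-inaccurate-when-you-use-the-vba-words-property

  1. Create a script named wc.vbs and put this in it: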

    ' Word's named constants are not defined in VBScript; wdStatisticWords is 0
    Const wdStatisticWords = 0
    Set word = CreateObject("Word.Application")
    word.Visible = False
    Set doc = word.Documents.Open("<replace with absolute path to your .docx/.pdf>")
    docWordCount = doc.Range.ComputeStatistics(wdStatisticWords)
    ' Close without saving, then shut Word down so no hidden instance lingers
    doc.Close False
    word.Quit
    WScript.Echo docWordCount & " words"
    
  2. Open PowerShell in the directory where you saved wc.vbs and run cscript .\wc.vbs; you will get the word count :)

answered 2019-10-01T20:03:01.550