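I have a folder containing 25,000 text files, and I want to read those files and put the words into a table. The files are named 1.txt, 2.txt, ... and so on up to 25000.txt. Each text file contains words in the following form.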

sample contents of my file
apple
cat
rat
shoe

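These words may also be repeated in other text files. I would like C# code that reads the text files, identifies which words are repeated and which are not, and then inserts them into a SQL Server database in the following form.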

keyword    document name
cat        1.txt,2.txt,3.txt
rat        4.txt,1.txt
fish       5.txt

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;
using System.Data.SqlClient;

namespace RAMESH
 {
public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void textBox1_TextChanged(object sender, EventArgs e)
    {

    }

    private void button2_Click(object sender, EventArgs e)
    {

        string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
        int i;
        string sqlstmt,str;
        SqlConnection con = new SqlConnection("data source=dell-pc\\sql1; initial catalog=db; user id=sa; password=a;");
        SqlCommand cmd;
        sqlstmt = "delete from Items";
        cmd = new SqlCommand(sqlstmt, con);
        con.Open();
        cmd.ExecuteNonQuery();
        for (i = 0; i < files.Length; i++)
        {
            StreamReader sr = new StreamReader(files[i]);
            FileInfo f = new FileInfo(files[i]);
            string fname;
            fname = f.Name;
            fname = fname.Substring(0, fname.LastIndexOf('.'));
            //MessageBox.Show(fname);
            while ((str = sr.ReadLine()) != null)
            {
                int nstr=1;
                //int x,y;
                //for (x = 0; x < str.Length; x++)
                //{ 
                //    y = Convert.ToInt32(str.Substring(x,1));
                //    if ((y < 48 && y > 75) || (y < 65 && y > 97) || (y < 97 && y > 122)) ;
                //}
                sqlstmt = "insert into Items values('" + str + "','" + fname + "')";
                cmd = new SqlCommand(sqlstmt, con);                    
                try
                {
                    cmd.ExecuteNonQuery();
                }
                catch (Exception ex)
                {
                    sqlstmt = "update Items set docname=docname + '," + fname + "'   where itemname='" + str + "'";
                    cmd = new SqlCommand(sqlstmt, con);
                    cmd.ExecuteNonQuery();
                }
            }
            sr.Close();
        }
        MessageBox.Show("keywords added successfully");
        con.Close();
    }
}

}

2 Answers

First, I would add a stored procedure to your database to isolate the update-or-insert logic:

CREATE PROCEDURE UpsertWords
@word nvarchar(MAX), @file nvarchar(256)
as

    Declare @cnt integer
    Select @cnt = Count(*) from Items where ItemName = @word
    if @cnt = 0 
        INSERT INTO Items (ItemName, docname) VALUES (@word, @file)
    else
        UPDATE Items SET docname = docname + ',' + @file where ItemName = @word

Now we can simplify your code considerably:

.....

// Build the command just one time, outside the loop,
// make it point to the stored procedure above
cmd = new SqlCommand("UpsertWords", con);
cmd.CommandType = CommandType.StoredProcedure;                    

// Create dummy parameters, the actual value is supplied inside the loop
cmd.Parameters.AddWithValue("@word", string.Empty);
cmd.Parameters.AddWithValue("@file", string.Empty);

// Now loop on every file
for (i = 0; i < files.Length; i++)
{
    // Open and read all the lines in the current file
    string[] lines = File.ReadAllLines(files[i]);

    // Get only the filename part without the extension
    string fname = Path.GetFileNameWithoutExtension(files[i]);

    // In case of just one line per file, this loop will execute just one time
    // however we also could handle more than one line per file
    foreach(string line in lines)
    {
        // Set the actual value of the parameters created outside the loop
        cmd.Parameters["@word"].Value = line;
        cmd.Parameters["@file"].Value = fname;
        // Run the insert or update (the logic is inside the storedprocedure)
        cmd.ExecuteNonQuery();
    }
}

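At this point it isn't clear whether each of your lines consists of a single word or of several words separated by some character (tab, comma, semicolon). If it is the latter, you need to split the string and add another loop.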
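If a line can hold several words, the splitting step might look roughly like the sketch below. The separator set and the `SplitWords` helper name are assumptions made for illustration, not something taken from your files; adjust them to the real format.

```csharp
using System;
using System.Collections.Generic;

static class LineSplitter
{
    // Assumed separators; change these to whatever your files really use.
    private static readonly char[] Separators = { ' ', '\t', ',', ';' };

    // Hypothetical helper: returns the non-empty words found on one line.
    public static IEnumerable<string> SplitWords(string line)
    {
        return line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Each word would be passed to the stored procedure in an inner loop;
        // here it is only printed.
        foreach (string word in SplitWords("apple, cat;rat\tshoe"))
        {
            Console.WriteLine(word);
        }
    }
}
```

Inside the file loop you would then call `cmd.ExecuteNonQuery()` once per word returned by the helper instead of once per line.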

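However, I think your database schema is wrong. It would be better to add a new row for each word and the file in which it appears. That way a simple query like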

SELECT  docname from Items where itemname = @word 

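will return all the files without any major performance problem, and you get a database that is much easier to search.
Or, if you need to count the occurrences of a word: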

SELECT ItemName, COUNT(ItemName) as WordCount 
FROM Items 
GROUP BY ItemName 
ORDER BY Count(ItemName) ASC
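
To see what the one-row-per-word-and-file layout gives you before touching the database, the same idea can be tried in memory. This is only a sketch under a few assumptions: one word per line, a word counted at most once per file, case-insensitive comparison, and a hypothetical `BuildPairs` helper. None of those choices come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class WordIndex
{
    // Hypothetical helper: one (word, file) pair per distinct word in each file,
    // mirroring one table row per word and file.
    public static List<KeyValuePair<string, string>> BuildPairs(string folder)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (string path in Directory.GetFiles(folder, "*.txt"))
        {
            string fname = Path.GetFileNameWithoutExtension(path);
            var words = File.ReadAllLines(path)
                            .Select(line => line.Trim())
                            .Where(line => line.Length > 0)
                            .Distinct(StringComparer.OrdinalIgnoreCase);
            foreach (string word in words)
            {
                pairs.Add(new KeyValuePair<string, string>(word, fname));
            }
        }
        return pairs;
    }

    static void Main(string[] args)
    {
        // Folder comes from the command line; defaults to the current directory.
        string folder = args.Length > 0 ? args[0] : ".";
        var groups = BuildPairs(folder)
            .GroupBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
            .OrderBy(grp => grp.Count());

        // Same shape as the GROUP BY query: word, number of files, file list.
        foreach (var entry in groups)
        {
            Console.WriteLine("{0}\t{1}\t{2}", entry.Key, entry.Count(),
                string.Join(",", entry.Select(pair => pair.Value)));
        }
    }
}
```

A word with a count greater than 1 is one that is repeated across files; the comma-separated list is built only for display and never has to be stored.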
answered 2013-03-30T12:19:26.067

Try this approach:

Start with your files: loop over them and build a simple XML document for each one.

        var fname = "File12.txt";
        var keywords = new List<string>(new[]{ "dog", "cat", "moose" });           

        var miXML = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("root"));

        foreach (var el in keywords.Select(i => new XElement("item", new XAttribute("key", i))))
        {
            miXML.Root.Add(el);
        }

        using (var con = new SqlConnection("Server=localhost;Database=HT;Trusted_Connection=True;"))
        {
            con.Open();
            using (var cmd = new SqlCommand("uspUpsert", con) {CommandType = CommandType.StoredProcedure})
            {
                cmd.Parameters.AddWithValue("@X", miXML.ToString());
                cmd.Parameters.AddWithValue("@fileName", fname);
                cmd.ExecuteNonQuery();
            }
        }

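Then, for the stored procedure, you can call this proc, which turns that XML into a table and inserts the keywords and the file name into the database.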

CREATE PROCEDURE uspUpsert
    @X xml,
    @Filename varchar(100)
AS
BEGIN
SET NOCOUNT ON;

    WITH KV as (
        select 
            x.v.value('@key', 'varchar(20)') as Keyword
            ,@FileName as FileName
        FROM   @x.nodes('/root/item') x(v) 
    )
    insert into Items 
    select KV.keyWord, KV.FileName
    from KV
    left outer join Items I on I.Keyword=KV.Keyword and I.FileName=KV.FileName
    where I.id is null
END

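Since you probably don't want to search a string like 'file1.txt file2.txt file3.txt' to find duplicates, you would use this query to find the files in which a word appears: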

select * from items where keyword='dog'

Or you can now count and run any other aggregate on this table.

answered 2013-03-30T18:48:41.723