c# - Split a large txt file into smaller files based on specific content

Question

I have a large genome sequence that I need to split into small .txt files.

The sequence looks like this:

>supercont1.1 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.2 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.3 of Geomyces destructans 20631-21
AGATTTT (...)

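It should be split into small files with names like "1.1-Geomyces-destructans--20631-21", "1.2-Geomyces...", each containing the genome data for that record.

With help from @JimMischel, the code now looks like this: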

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace genom1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        string filter = "Textové soubory|*.txt|Soubory FASTA|*.fasta|Všechny soubory|*.*";

        private void doit_Click(object sender, EventArgs e)
        {
            bar.Value = 0;

            OpenFileDialog opf = new OpenFileDialog();

            // filter for choosing file types
            opf.Filter = filter;

            string lineo = "error"; // test

            if (opf.ShowDialog() == DialogResult.OK)
            {
                var lineCount = 0;
                using (var reader = File.OpenText(opf.FileName))
                {
                    while (reader.ReadLine() != null)
                    {
                        lineCount++;
                    }
                }

                bar.Maximum = lineCount;
                bar.Step = 1;

                FolderBrowserDialog fbd = new FolderBrowserDialog();

                fbd.Description = "Vyber složku, do které chceš rozdělit načtený soubor: \n\n" + opf.FileName; // dialog desc
                if (fbd.ShowDialog() == DialogResult.OK)
                {
                    List<string> lines = new List<string>();
                    foreach (var line in File.ReadLines(opf.FileName))
                    {
                        bar.PerformStep();

                        if (line[0] == '>')
                        {
                            if (lines.Count >= 0)
                            {
                                // write contents of lines list to file

                                //quicker replace for better file name
                                StringBuilder prep = new StringBuilder(line);
                                prep.Replace(">supercont", "");
                                prep.Replace("of", "");
                                prep.Replace(" ", "-");
                                lineo = prep.ToString();

                                // append or writeall? how to writeall lines without append?
                                //System.IO.File.WriteAllText(fbd.SelectedPath + "\\" + lineo + ".txt", lineo);
                                StreamWriter SW;
                                SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");

                                foreach (string s in lines)
                                {
                                    SW.WriteLine(s);
                                }

                                SW.Close();

                                // and clear the list.
                                lines.Clear();
                            }
                        }
                        lines.Add(line);
                    }
                    // here, do the last part
                    if (lines.Count >= 0)
                    {
                        // write contents of lines list to file.

                        /* starts being little buggy here...

                        StreamWriter SW;
                        SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                        foreach (string s in lines)
                        {
                            SW.WriteLine(s);
                        }
                        SW.Close();

                        */
                    }
                }

            }
        }

    }
}

score 2 · Accepted Answer

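If the file is small enough to fit in memory, you can call File.ReadAllText to read it into a string. Then you walk through the string and extract the text between the > characters. Something like: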

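string s = File.ReadAllText("filename");
int pos = s.IndexOf('>');
while (pos != -1)
{
    // find the start of the next record; -1 means this is the last one
    int newpos = s.IndexOf('>', pos + 1);
    int end = (newpos == -1) ? s.Length : newpos;
    // text of the current record, without the leading '>'
    string text = s.Substring(pos + 1, end - pos - 1);
    // now write text to a file

    // update current position
    pos = newpos;
}
// the last record runs to the end of the string, so the loop handles it too.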

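I assume you can figure out how to name the files properly.

If you can't fit the whole file in memory, you can read the file character by character or do some kind of buffering. If you know that > always appears at the beginning of a line, the problem is easier. Then you can write: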
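As a rough sketch of the naming step, assuming every header looks like the sample in the question (">supercont1.1 of Geomyces destructans 20631-21"): the class FastaNaming and the method ToFileName are hypothetical names made up for this example. It splits the header into words rather than calling Replace("of", ""), which would also strip "of" from the middle of a word, and it joins the words with a single dash, so the result differs slightly from the double-dash name shown in the question.

```csharp
using System;
using System.IO;
using System.Linq;

public static class FastaNaming
{
    // Hypothetical helper: turns a FASTA header line into a file name.
    public static string ToFileName(string header)
    {
        // Drop the leading '>' and the "supercont" prefix, if present.
        string name = header.TrimStart('>').Trim();
        const string prefix = "supercont";
        if (name.StartsWith(prefix, StringComparison.Ordinal))
            name = name.Substring(prefix.Length);

        // Split on whitespace, drop the standalone word "of", join with '-'.
        var parts = name
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(p => p != "of");
        name = string.Join("-", parts);

        // Replace any character the file system rejects in file names.
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');

        return name + ".txt";
    }
}
```

Combine the result with the chosen folder using Path.Combine(folder, FastaNaming.ToFileName(line)) rather than concatenating "\\" by hand.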

List<string> lines = new List<string>();
foreach (var line in File.ReadLines("filename"))
{
    // check the length first: line[0] throws on an empty line
    if (line.Length > 0 && line[0] == '>')
    {
        if (lines.Count > 0)
        {
            // write contents of lines list to file.
            // and clear the list.
            lines.Clear();
        }
    }
    lines.Add(line);
}
// here, do the last part
if (lines.Count > 0)
{
    // write contents of lines list to file.
}
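A variant of the same line-based idea, sketched under the assumption that > only ever appears at the start of a header line: instead of collecting lines in a list, it keeps one StreamWriter open and switches to a new file whenever a header arrives, so the final record needs no special handling. FastaSplitter, Split and the nameFor callback are hypothetical names for this example; nameFor stands in for whatever header-to-file-name rule you choose.

```csharp
using System;
using System.IO;

public static class FastaSplitter
{
    // Splits inputPath into one file per '>' record inside outputDir.
    // nameFor is a hypothetical callback that maps a header line to a file name.
    // Returns the number of files written.
    public static int Split(string inputPath, string outputDir, Func<string, string> nameFor)
    {
        int count = 0;
        StreamWriter writer = null;
        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                if (line.StartsWith(">", StringComparison.Ordinal))
                {
                    // A new record starts: close the previous file, open the next.
                    if (writer != null) writer.Dispose();
                    // append: false overwrites a file left over from an earlier run.
                    writer = new StreamWriter(Path.Combine(outputDir, nameFor(line)), false);
                    count++;
                }
                // Lines before the first header have no file to go to and are skipped.
                if (writer != null) writer.WriteLine(line);
            }
        }
        finally
        {
            if (writer != null) writer.Dispose();
        }
        return count;
    }
}
```

A call might look like FastaSplitter.Split(opf.FileName, fbd.SelectedPath, h => h.Substring(1).Replace(' ', '-') + ".txt"). Opening the writer with append set to false also sidesteps the AppendText problem of records piling up in files from a previous run.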

score 1 · Accepted Answer

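I'd say the simplest way is to first read the whole file with File.ReadAllText(). Then just call String.Split('>'), which returns an array that I believe holds the contents of your new files.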
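A minimal sketch of that approach, assuming the whole file fits in memory, the text starts with >, and > never appears inside the sequence data. FastaRecords and FromText are hypothetical names for this example. Split removes the separator, so the sketch puts the > back at the front of each record.

```csharp
using System;
using System.Collections.Generic;

public static class FastaRecords
{
    // Splits the whole text on '>' and returns one string per record,
    // with the '>' restored at the front of each.
    public static List<string> FromText(string text)
    {
        var records = new List<string>();
        // RemoveEmptyEntries drops the empty piece before the first '>'.
        foreach (string piece in text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            records.Add(">" + piece);
        }
        return records;
    }
}
```

Each returned string can then be written with File.WriteAllText, using the first line of the record to build the file name.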
