java - Java MySQL: large data set exceeds heap space

Question

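I have an application that reads about 2 million tweets from a MySQL database. Specifically, one of the fields holds the tweet text (maximum length 140 characters). I split each tweet into word n-grams, where 1 <= n <= 3. For example, consider the following sentence: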

I am a boring sentence.

The corresponding n-grams are:

I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
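
The list above can be reproduced with a few lines of plain Java. This is only a minimal sketch, not the code used in the question (which relies on Lucene's ShingleFilter): the class name WordNGrams and the method wordNGrams are made up for illustration, and it assumes words are separated by whitespace and that punctuation has already been stripped.

```java
import java.util.ArrayList;
import java.util.List;

class WordNGrams {
    // Returns every word n-gram with minN <= n <= maxN, grouped by start word.
    // Assumption: words are whitespace-separated and punctuation is already removed.
    static List<String> wordNGrams(String text, int minN, int maxN) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // no words, no n-grams
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one word at a time until maxN or the end of the text.
            for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                if (n > 1) {
                    gram.append(' ');
                }
                gram.append(words[start + n - 1]);
                if (n >= minN) {
                    out.add(gram.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String gram : wordNGrams("I am a boring sentence", 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

A tweet of w words yields at most 3w - 3 n-grams for w >= 2, so 2 million tweets can easily turn into tens of millions of rows.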

With roughly 2 million tweets, I am generating a lot of data. Even so, I was surprised to get a heap error from Java:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
    at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
    at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
    at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
    at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
    at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
    at twittertest.Global.main(Global.java:40)

Here is the offending statement (line 49), as identified by the NetBeans output above:

results = stmt.executeQuery("select * from tweets");

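So if I am running out of memory, it must be trying to return all the results at once and hold them in memory. What is the best way to solve this? Specifically, I have the following questions:

How can I process part of the results at a time instead of the whole set?
How would I increase the heap size (if that is possible)?

Feel free to suggest anything, and let me know if you need more information.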

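EDIT: Instead of select * from tweets, I partitioned the table into equally sized subsets, each about 10% of the total size. Then I tried running the program. It looked like it was working fine, but it eventually gave me the same heap error. This is strange to me because I have run the same program successfully in the past with 610,000 tweets. Now I have about 2,000,000 tweets, or roughly 3 times as much data. So splitting the data into thirds should have been enough, but I went further and made each subset 10% of the total.

Is some memory not being freed? Here is the rest of the code: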

          results = stmt.executeQuery("select COUNT(*) from tweets");
          int num_tweets = 0;
          if(results.next())
          {
              num_tweets = results.getInt(1);
          }
          int num_intervals = 10;                  //split into equally sized subsets
          int interval_size = num_tweets/num_intervals;

          for(int i = 0; i < num_intervals-1; i++)        //process 10% of the data at a time
          {
            results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
            while(results.next())  //for each row in the tweets database
            {
                tweetID = results.getLong("tweet_id");
                curTweet = results.getString("tweet");
                int colPos = curTweet.indexOf(":");
                curTweet = curTweet.substring(colPos + 1);                           //trim off the RT and retweeted 
                if(curTweet != null)
                {
                    curTweet = removeStopWords(curTweet);
                }

                if(curTweet == null)
                {
                    continue;
                }
                reader = new StringReader(curTweet);
                tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
                //tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
                //Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
                //tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
                tokenizer = new ShingleFilter(tokenizer, 2, 3);

                charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

                while(tokenizer.incrementToken())                  //insert each nGram from each tweet into the DB
                {
                    insertNGram.setInt(1, nGramID++);
                    insertNGram.setString(2, charTermAttribute.toString().toString());
                    insertNGram.setLong(3, tweetID);
                    insertNGram.executeUpdate();
                }
            }
          }

score 1 · Accepted Answer

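Don't fetch all the rows from the table. Put a limit on the query and select the data in portions that suit your needs. Since you are using MySQL, your query would be select * from tweets limit 0,10. Here 0 is the offset of the first row to return, and 10 is the number of rows to return starting from that offset.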
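
One detail worth spelling out: in MySQL's limit offset, count form the second number is a row count, not an end position. If an end offset such as (i+1)*interval_size is passed there, each successive query asks for more rows than the previous one. The sketch below builds one query per page with a fixed page size. The class and method names are hypothetical, and the order by tweet_id clause assumes tweet_id is unique, so that pages neither overlap nor skip rows.

```java
import java.util.ArrayList;
import java.util.List;

class TweetPages {
    // Builds one "limit offset, count" query per page.
    // The count stays constant; only the offset advances.
    // Assumption: tweet_id is unique, giving the pages a stable order.
    static List<String> pageQueries(long totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            queries.add("select tweet_id, tweet from tweets order by tweet_id limit "
                    + offset + ", " + pageSize);
        }
        return queries;
    }

    public static void main(String[] args) {
        // 25 rows in pages of 10: offsets 0, 10 and 20.
        for (String query : pageQueries(25, 10)) {
            System.out.println(query);
        }
    }
}
```

Because the loop runs while offset < totalRows, the final partial page is included even when the row count is not a multiple of the page size.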

score 1 · Accepted Answer

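You can always increase the heap size available to the JVM with the -Xmx argument. You should also read up on the other available knobs (e.g. perm gen size). Google the other options or read this SO answer.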

A 32-bit machine may not be able to handle a problem of this size. You will need a 64-bit machine and lots of RAM.
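
As a rough illustration, the program could be started with something like java -Xmx2g -cp <classpath> twittertest.Global, where 2g is an arbitrary example value and <classpath> is a placeholder; twittertest.Global is the main class shown in the stack trace. To check which limit the JVM actually picked up, a small sketch such as the following prints the maximum heap size (the class name HeapInfo is made up for this example):

```java
class HeapInfo {
    // Maximum heap the JVM will attempt to use, in MiB.
    // This reflects -Xmx when the option is given on the command line.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once with and once without -Xmx shows whether the option is being passed through, which is easy to get wrong when the program is launched from an IDE.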

Another option is to treat it as a map-reduce problem and solve it on a cluster using Hadoop and Mahout.

score 0 · Accepted Answer

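Have you considered streaming the result set? About halfway down the page there is a section on ResultSet that addresses your problem (I think?). Write the n-grams to a file, then process the next row? Or am I misunderstanding your question? http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html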
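
For reference, the Connector/J notes describe streaming as a forward-only, read-only statement whose fetch size is set to Integer.MIN_VALUE; the driver then hands over rows one at a time instead of buffering the whole result. The following is a minimal sketch of that pattern, not a drop-in replacement for the question's code. The class name, the TweetHandler interface and the column list are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class StreamingTweets {
    // Hypothetical callback, invoked once per row.
    interface TweetHandler {
        void handle(long tweetId, String tweet) throws SQLException;
    }

    // Reads the tweets table row by row and returns the number of rows seen.
    static long forEachTweet(Connection conn, TweetHandler handler) throws SQLException {
        long rows = 0;
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Connector/J treats this special value as a request to stream rows.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("select tweet_id, tweet from tweets")) {
                while (rs.next()) {
                    handler.handle(rs.getLong("tweet_id"), rs.getString("tweet"));
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

While a streaming result set is open, Connector/J does not allow other statements on the same connection, so inserting the n-grams from inside the handler would need a second connection (or a file, as suggested above).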
