I have an application that accesses roughly 2 million tweets stored in a MySQL database. Specifically, one of the fields holds the text of a tweet (maximum length 140 characters). I split every tweet into word n-grams, where 1 <= n <= 3. For example, consider the following sentence:
I am a boring sentence.
The corresponding n-grams are (a sketch of how I generate them follows the list):
I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
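For reference, here is a minimal sketch of how this kind of n-gram list can be produced in plain Java. This is illustrative only; my real code uses Lucene, as shown further below, and the class and method names here are made up:

import java.util.ArrayList;
import java.util.List;

public class NGramSketch
{
    // Builds all word n-grams with 1 <= n <= maxN, grouped by starting word.
    static List<String> ngrams(String sentence, int maxN)
    {
        String[] words = sentence.replaceAll("[.,!?]", "").split("\\s+");
        List<String> grams = new ArrayList<String>();
        for (int start = 0; start < words.length; start++)
        {
            StringBuilder gram = new StringBuilder();
            for (int n = 1; n <= maxN && start + n <= words.length; n++)
            {
                if (n > 1) gram.append(' ');
                gram.append(words[start + n - 1]);
                grams.add(gram.toString()); // e.g. "I", then "I am", then "I am a"
            }
        }
        return grams;
    }

    public static void main(String[] args)
    {
        // Prints exactly the list above, in the same order.
        for (String g : ngrams("I am a boring sentence.", 3))
        {
            System.out.println(g);
        }
    }
}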
With about 2 million tweets, I am generating a lot of data. In any case, I was surprised to get a heap error from Java:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
at twittertest.Global.main(Global.java:40)
Here is the offending code statement (line 49) from the Netbeans output above:
results = stmt.executeQuery("select * from tweets");
So if I am running out of memory, it must be that the query tries to return all the results at once and then store them in memory. What is the best way to solve this? Specifically, I have the following questions:
- How can I process the results piece by piece instead of the whole set at once? (See the sketch below.)
- How can I increase the heap size? (if that is possible)
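To illustrate the first question: I have read that MySQL Connector/J can stream rows instead of buffering the whole result set if the Statement is created forward-only and read-only with a fetch size of Integer.MIN_VALUE. Is something like this sketch the right idea (conn is my existing Connection)?

Statement streamStmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                            ResultSet.CONCUR_READ_ONLY);
streamStmt.setFetchSize(Integer.MIN_VALUE); // ask the driver to stream row by row
ResultSet streaming = streamStmt.executeQuery("select * from tweets");
while (streaming.next())
{
    // handle one row at a time; only the current row is held in memory
}
streaming.close();
streamStmt.close();

For the second question, I know the heap can be raised with the -Xmx JVM option (for example java -Xmx2g ...), but I would rather understand why the memory is being exhausted in the first place.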
Feel free to make any suggestions, and please let me know if you need more information.
Edit
Instead of select * from tweets, I partitioned the table into equally sized subsets, each about 10% of the total size, and then tried running the program. It appeared to work, but eventually gave me the same heap error. This is strange to me, because I have run the same program in the past on 610,000 tweets successfully. Now I have about 2,000,000 tweets, roughly 3 times as much data, so splitting the data into thirds should have worked, and I went even further and used subsets of 10% of the size.
Is some memory not being freed? Here is the rest of the code:
results = stmt.executeQuery("select COUNT(*) from tweets");
int num_tweets = 0;
if(results.next())
{
num_tweets = results.getInt(1);
}
int num_intervals = 10; //split into equally sized subets
int interval_size = num_tweets/num_intervals;
for(int i = 0; i < num_intervals-1; i++) //process 10% of the data at a time
{
results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
while(results.next()) //for each row in the tweets database
{
tweetID = results.getLong("tweet_id");
curTweet = results.getString("tweet");
int colPos = curTweet.indexOf(":");
curTweet = curTweet.substring(colPos + 1); //trim off the RT and retweeted
if(curTweet != null)
{
curTweet = removeStopWords(curTweet);
}
if(curTweet == null)
{
continue;
}
reader = new StringReader(curTweet);
tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
//tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
//Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
//tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
tokenizer = new ShingleFilter(tokenizer, 2, 3);
charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while(tokenizer.incrementToken()) //insert each nGram from each tweet into the DB
{
insertNGram.setInt(1, nGramID++);
insertNGram.setString(2, charTermAttribute.toString().toString());
insertNGram.setLong(3, tweetID);
insertNGram.executeUpdate();
}
}
}
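One more idea I am considering, in case the per-row executeUpdate is part of the overhead: since insertNGram is a PreparedStatement, the inner loop could queue rows with addBatch and flush them periodically with executeBatch. A sketch of what I mean (the batch size of 1000 is an arbitrary choice):

int batched = 0;
while (tokenizer.incrementToken())
{
    insertNGram.setInt(1, nGramID++);
    insertNGram.setString(2, charTermAttribute.toString());
    insertNGram.setLong(3, tweetID);
    insertNGram.addBatch(); // queue the row locally instead of a round trip per n-gram
    if (++batched % 1000 == 0)
    {
        insertNGram.executeBatch(); // send 1000 queued inserts in one go
    }
}
insertNGram.executeBatch(); // flush whatever remains

This mainly cuts database round trips rather than heap usage, but the batch stays bounded, so it should not add memory pressure.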