Loading large text files to find duplicates
Posted By:   Paulo_Nunes
Posted On:   Thursday, February 8, 2007 07:09 AM


Hi there,

I have several files in which every line is 500 characters long. For each line I need to extract a number (at index 213-222) and make sure it occurs only once.

I also need to take the last 250 characters of each line and manipulate them according to that number.

The problem is that I have tried storing the results either in a Vector or in a table (using HSQLDB), but in both cases I get an out-of-memory error once I process more than 74,000 results.

So what should I do?



Here's the code:


public void Concatenate() {
    try {
        vClients = new Vector();
        GroupIdentical = Main.getProp("javabop.group.identical", "N");
        if (GroupIdentical.equalsIgnoreCase("s")) vNifs = new Vector();
        for (int i = 0; i < BopFiles.length; i++) {
            BoPPanel.WriteLogPane("Reading file " + BopFiles[i] + "...");
            BufferedReader in = new BufferedReader(new FileReader(BopFiles[i]));
            //BoPPanel.SearchPane.append("\nFile " + BopFiles[i] + "\n\n");
            String line;
            while ((line = in.readLine()) != null) {
                if (line.toLowerCase().startsWith("10694")) {
                    GetEntry(BopFiles[i], line);
                    //BoPPanel.SearchPane.append(line + "\n");
                } else if (line.toLowerCase().startsWith("00694")) {
                    // GetHeader already adds the header to vHeaders, so adding it
                    // again here would store every header twice
                    GetHeader(BopFiles[i], line);
                }
            }
            in.close();
        }
        BoPPanel.WriteLogPane("Number of elements obtained from the files: " + vClients.size());
        BoPPanel.WriteLogPane("Concatenation finished!");
        //if (GroupIdentical.equalsIgnoreCase("s")) FindDuplicated();
    } catch (Exception e) {
        e.printStackTrace();
        Main.WriteLogFile(e.getMessage());
    }
}

public Header GetHeader(String file, String line) {
    Header hd = new Header();
    hd.headerFile = file;
    hd.headerLine = line;
    vHeaders.add(hd); // stores the header itself, so callers must not add it again
    return hd;
}

public void Saveintable(int num, int nif, String file, int index, String series, String line) {
    try {
        Database db = new Database();
        Connection conn = db.open();
        //db.update("DROP TABLE Save");
        //db.update("CREATE TABLE Save ( num INTEGER, nif INTEGER, file VARCHAR(100), index INTEGER, series VARCHAR(150), line VARCHAR(500))");
        //db.update("DELETE FROM Save;");

        // note: "index" is a reserved word in some SQL dialects and may need renaming or quoting
        String sqlInsert = "INSERT INTO Save (num, nif, file, index, series, line)"
                + " VALUES (?,?,?,?,?,?)";
        PreparedStatement prep = conn.prepareStatement(sqlInsert);
        prep.setInt(1, num);
        prep.setInt(2, nif);
        prep.setString(3, file);
        prep.setInt(4, index);
        prep.setString(5, series);
        prep.setString(6, line);
        prep.executeUpdate();
        // these closes were commented out, so every call leaked a statement and a
        // connection; over 74,000 inserts that alone can exhaust the heap
        prep.close();
        conn.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}


public void GetEntry(String file, String line) {
    String series = line.substring(252).trim();
    String numberstr = line.substring(30, 45).trim();
    String nifstr = line.substring(213, 222).trim();
    int num = 0;
    if (!numberstr.equals("")) num = Integer.parseInt(numberstr);
    int nif = 0;
    if (!nifstr.equals("")) nif = Integer.parseInt(nifstr);
    if (GroupIdentical.equalsIgnoreCase("s") && !nifstr.equals("")) vNifs.add(nifstr);
    Saveintable(num, nif, file, BopIndex, series, line);
    BoPPanel.SetCount(BopIndex);
    BopIndex++;
}
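Since duplicate detection only depends on the 9-character field at index 213-222, one memory-light alternative is to keep a HashSet of just those keys instead of Vectors of full 500-character lines. This is an editor's sketch, not the poster's code; the class and method names below are hypothetical:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class NifDuplicateFinder {

    // Streams each file line by line and returns the NIF keys seen more than once.
    // Memory use is bounded by the number of distinct keys, not by line count.
    public static Set<String> findDuplicates(String[] files) throws IOException {
        Set<String> seen = new HashSet<String>();
        Set<String> duplicated = new HashSet<String>();
        for (String file : files) {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // only detail records (prefix "10694") long enough to carry a NIF
                    if (!line.toLowerCase().startsWith("10694") || line.length() < 222) {
                        continue;
                    }
                    String nif = line.substring(213, 222).trim();
                    // Set.add returns false when the key was already present
                    if (!nif.isEmpty() && !seen.add(nif)) {
                        duplicated.add(nif);
                    }
                }
            } finally {
                reader.close();
            }
        }
        return duplicated;
    }
}
```

With the duplicate keys known up front, the full lines can then be processed in a second streaming pass without ever holding them all in memory.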


Re: Loading large text files to find duplicates

Posted By:   Robert_Lybarger  
Posted On:   Thursday, February 8, 2007 01:42 PM

Sounds like you might be leaking resources somewhere (that is, creating objects that can never be GC'ed), though I suppose it would also be possible to simply fill up the default heap after a while -- that happens when you're crunching huge volumes of file-based material. Look up the "java -X" options for increasing the heap size and run the code again. For example, "java -Xmx256M" runs with a 256 MB heap. (IIRC, the default is, or used to be, only 64 MB.)
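To confirm that the -Xmx flag actually took effect, the configured ceiling can be read back at runtime via the standard Runtime API (a small sketch; the class name is made up):

```java
public class HeapCheck {

    // Returns the JVM's configured maximum heap size in megabytes,
    // as reported by the standard Runtime.maxMemory() call.
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // When started as "java -Xmx256M HeapCheck", this should report roughly 256 MB.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```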