I gotta preface this post by saying that we are very skeptical of the idea that Java is the right place to do processing that works with data in bulk. By extension, ORM is probably not an especially appropriate way to do batch processing. We think that most databases offer excellent solutions in this area: stored procedure support, and various tools for import and export. Because of this, we've neglected to properly explain to people how to use Hibernate for batch processing if they really feel they /have/ to do it in Java. At some point, we have to swallow our pride, and accept that lots of people are actually doing this, and make sure they are doing it the Right Way.
A naive approach to inserting 100 000 rows in the database using Hibernate might look like this:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
}
tx.commit();
session.close();
This would fall over with an OutOfMemoryError somewhere after the 50 000th row. That's because Hibernate caches all the newly inserted Customers in the session-level cache. Certain people have expressed the view that Hibernate should manage memory better, and not simply fill up all available memory with the cache. One very noisy guy who used Hibernate for a day and noticed this is even going around posting on all kinds of forums and blog comments, shouting about how this demonstrates what shitty code Hibernate is. For his benefit, let's remember why the first-level cache is not bounded in size:
- persistent instances are /managed/ - at the end of the transaction, Hibernate synchronizes any change to the managed objects to the database (this is sometimes called /automatic dirty checking/)
- in the scope of a single persistence context, persistent identity is equivalent to Java identity (this helps eliminate data /aliasing/ effects)
- the session implements /asynchronous write-behind/, which allows Hibernate to transparently batch together write operations
For typical OLTP work, these are all very, very useful features. Since ORM is really intended as a solution for OLTP problems, I usually ignore criticisms of ORM which focus upon OLAP or batch stuff as simply missing the point.
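To make the identity point concrete, here is a minimal sketch of an identity map in plain Java. It is not Hibernate's implementation; ToyPersistenceContext and its loader function are names invented for illustration. It only shows why a persistence context has to hold on to every instance it hands out: that is how the same id keeps yielding the same Java object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Toy identity map, for illustration only; NOT the Hibernate Session API.
class ToyPersistenceContext<T> {
    private final Map<Long, T> byId = new HashMap<>();
    private final LongFunction<T> loader; // hypothetical stand-in for "load the row from the database"

    ToyPersistenceContext(LongFunction<T> loader) {
        this.loader = loader;
    }

    // The first request for an id loads it; later requests return the very same instance.
    T get(long id) {
        return byId.computeIfAbsent(id, key -> loader.apply(key));
    }

    // Number of instances the context still references (and so cannot be garbage collected).
    int size() {
        return byId.size();
    }

    public static void main(String[] args) {
        ToyPersistenceContext<StringBuilder> ctx =
                new ToyPersistenceContext<>(id -> new StringBuilder("customer-" + id));
        System.out.println(ctx.get(7L) == ctx.get(7L));
        System.out.println(ctx.size());
    }
}
```

The map grows with every distinct id it sees, which is the price of the identity guarantee.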
However, it turns out that this problem is incredibly easy to work around. For the record, here is how you do batch inserts in Hibernate.
First, set the JDBC batch size to a reasonable number (say, 10-20):
hibernate.jdbc.batch_size 20
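If you build your configuration in code rather than in a properties file, the same setting is just a key/value pair. A small sketch using only java.util.Properties follows; BatchSettings is a made-up helper name, and how the Properties object is then handed to Hibernate is deliberately left out.

```java
import java.util.Properties;

// Hypothetical helper that keeps the batch-related setting in one place.
class BatchSettings {
    static Properties batchProperties(int batchSize) {
        Properties props = new Properties();
        // Property name as given in the text above; the value is the JDBC batch size.
        props.setProperty("hibernate.jdbc.batch_size", Integer.toString(batchSize));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(batchProperties(20));
    }
}
```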
Then, flush() and clear() the session every so often:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
    if ( (i + 1) % 20 == 0 ) { //every 20th save, same as the JDBC batch size
        //flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
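If you want to see the effect without a database, here is a rough simulation in plain Java. ToySession is an invented stand-in, not the Hibernate Session: save() just keeps a reference, flush() only counts, and clear() drops the references. The assumption is that peak memory use tracks the number of instances the session still references.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a session-level cache; NOT the Hibernate Session API.
class ToySession {
    private final List<Object> managed = new ArrayList<>();
    private int peak = 0;
    private int flushes = 0;

    void save(Object entity) {
        managed.add(entity); // every saved instance stays referenced
        peak = Math.max(peak, managed.size());
    }

    void flush() { flushes++; }       // pretend the pending inserts were sent to the database
    void clear() { managed.clear(); } // drop the references so they can be garbage collected

    int peak() { return peak; }
    int flushes() { return flushes; }
}

class FlushClearDemo {
    // Saves `rows` objects; flushEvery <= 0 means never flush or clear (the naive loop).
    static ToySession run(int rows, int flushEvery) {
        ToySession session = new ToySession();
        for (int i = 0; i < rows; i++) {
            session.save(new Object());
            if (flushEvery > 0 && (i + 1) % flushEvery == 0) {
                session.flush();
                session.clear();
            }
        }
        return session;
    }

    public static void main(String[] args) {
        System.out.println("naive peak:   " + run(100_000, 0).peak());
        System.out.println("batched peak: " + run(100_000, 20).peak());
    }
}
```

In the simulation the naive loop ends up referencing every row at once, while the batched loop never holds more than one batch.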
What about retrieving and updating data? Well, in Hibernate 2.1.6 or later, the scroll() method is the best approach:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
int count=0;
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    if ( ++count % 20 == 0 ) {
        //flush a batch of updates and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
Not so difficult, or even shitty, I guess. Actually, I think you'll agree that this was much easier to write than the equivalent JDBC code messing with scrollable result sets and the JDBC batch API.
One caveat: if Customer has second-level caching enabled, you can still get some memory management problems. The reason for this is that Hibernate has to notify the second-level cache /after the end of the transaction/, about each inserted or updated customer. So you should disable caching of customers for the batch process.
No problem, we get it.
Please don't mischaracterize our position on this, which is that Java is not a good place to do batch processing. We have never said that Hibernate is any better or worse than any other product for this use case. Indeed, almost all ORM products behave exactly the same as Hibernate on this.
If you actually run the code I just displayed, you will find it to be just as fast as whatever other ORM product you are using. But not as fast as a stored procedure, of course.
Actually, you don't need to do batch processing to have thousands of insertions/updates in a single transaction. It may be a design issue, but the fact is that there is such a need in some applications.
I don't see why you speak about difficulties with JDBC and scrollable resultsets when the exact same "counter" solution has been available right here with begin/endTransaction() since the very beginning.
I'm sorry, but it is not a Hibernate strength per se to be able to rely on the transaction API of the JDBC driver :-)
BTW, I am absolutely fond of this lovely framework!
For information, it seems that scroll(ScrollMode) is only available in Hibernate 3, not yet in 2.1.6.
Philippe
http://www.gnu.org/software/gawk/manual/html_mono/gawk.html
"Programs in awk are different from programs in most other languages, because awk programs are data-driven; that is, you describe the data you want to work with and then what to do when you find it. Most other languages are procedural; you have to describe, in great detail, every step the program is to take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to read and write."
It is funny you say that because lots of people are doing it very successfully. It's just Hibernate that has issues because you don't feel people should do it. Other ORM tools do Java batch right out of the box, no configuration needed.
I just hope that your attitude against doing batch work in Java doesn't carry over to your EJB work.
Hibernate has no issues that any other full ORM tool does not have. Any JDO implementation, TopLink, etc, will behave exactly the same. See PersistenceManager.flush() and PersistenceManager.evictAll() for the equivalent JDO pattern.
Dude, give it a break man, you are wrong. It happens to everyone once in a while ... you'll get over it :-)
"Java is indeed the worst tool to do data batching." Why is that?
Some people can't use stored procedures to do batch jobs because they use data outside the database or need to remain somewhat neutral to the database.
You want to take back this statement, then right? Or do you disagree with Gavin?
It would be very nice if you would stop making any assumptions about our motivations. If your goal is to make us angry by acting like a bonehead, this "discussion" is over.
Here's what I'm taking away from it. JBoss is against any type of batch job that doesn't happen completely inside a database.
Since that isn't the way most real world apps work, we will look for solutions outside of JBoss products. Also, calling potential clients boneheads is great for business.
What has been said:
Loading *massive* amount of data into another process (VM) to do manipulation and then push it to another process (DB) is not as efficient as doing it with something close to the database.
This is a valid statement no matter what kind of ORM you use. Hibernate was not built for this, but as Gavin shows it is quite possible to do, and if you do it as described it will most likely outperform many ORMs.
We have probably been too uptight about it, and we think people have misunderstood that message, so Gavin posted a blog on how to actually do this stuff efficiently if you really want to do it!
It is always very easy to take cheap shots at people who are actually *doing*. In the long run, what the doers do has relevance to the world, what the hataz do is forgotten tomorrow.
I forgive you, actually, your jealousy makes me sad.
One day, you will do something worthwhile.
peace
We are all just "muddling through". We are trying to find the truth, on the basis of our experiences. Stuff the Hibernate team tells people is the best advice we know, after thinking very hard about the problem.
It's very annoying when people try to parse our words every which way, trying to trap us in a contradiction, or trying to make us admit we are wrong as some kind of point-scoring exercise, or looking for some kind of evil motivation. Sometimes we are wrong, often at least slightly wrong. It's never malicious. Hopefully we eventually learn about our mistakes.
There is a principle in argument known as "the principle of charity", which is that you should always interpret people's comments in the *best possible* light, and assume the most idealistic motivation. In this case, our motivation is to help people build systems that interact efficiently with a SQL database. That's all.
If people aren't interested in what we have to say, or are only interested in insulting us, they shouldn't read it. If you read it, and comment on it, follow the principle of charity. If you disagree, talk about your use case, and how it differs, and show why it's important. That way we *all* might learn something. Don't try to point-score, it's just not useful.
regards,
Gavin
P.S. We've noticed a new argument technique recently. Apparently, it is possible to win any argument by declaring "oh I was a potential JBoss customer, and now I'm not". Apart from the incredible non-provability (indeed, unlikelihood) of the claim, it is quite irrelevant and has zero bearing on the topic being debated. Just a hint....
Very entertaining. I'm enjoying these discussions very much.
Thanks a lot for giving me (and everyone who likes it) Hibernate and that amusing entertainment. You could simply ignore those "hirni's" (I don't know an adequate term in English), but you don't. You are not just giving us Hibernate but putting a smile on our faces too.
Thanks
Roman Sykora
The rest of us who actually have to *DO* something only have time to be interested in finding solutions. And we're frankly just appreciative of any advice, hints and honesty about the tools we have to work with. The Hibernate team does this very well.
Thanks
Lee
2. On batching generally, is it possible in JDBC to batch queries as well as updates? E.g., if I know up front that I need to query three different records, can JDBC (and hence Hibernate) batch them?
Thanks,
Anthony
(2) No, it's not possible, though Hibernate can batch-read entities by primary key. (See the doco.)
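For readers wondering what a batch read by primary key looks like at the SQL level, the general idea is one select with an IN list instead of one select per id. The sketch below only builds such a statement as a string; BatchReadSql, the table name and the column name are hypothetical, and the SQL that Hibernate itself generates will differ in its details.

```java
import java.util.Collections;

// Hypothetical helper illustrating the shape of a batch read by primary key.
class BatchReadSql {
    // Builds "select * from <table> where <idColumn> in (?, ?, ...)" with one placeholder per id.
    static String selectByIds(String table, String idColumn, int idCount) {
        if (idCount < 1) {
            throw new IllegalArgumentException("need at least one id");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "select * from " + table + " where " + idColumn + " in (" + placeholders + ")";
    }

    public static void main(String[] args) {
        System.out.println(selectByIds("customer", "id", 3));
    }
}
```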
What does the setting "jdbc.batch_size" do? If I set it to 20, for example, then I would expect Hibernate to persist the data transparently at this frequency (every 20 records or so).
But it doesn't happen that way.
Maybe I am missing the point.
Regards,
Nkuma
But I think there's another point: maintenance.
If I have both batching and a GUI-driven application working on the same data, I'd prefer to maintain only one domain model.
This means I can either try to do GUI coding in PL/SQL or do batches in the JVM.
IMHO: if you can pay the price (not perfect performance), do the batching in Java (preferably with Hibernate ;-) ) and take advantage of a nice OO domain model and good support for GUI programming.
If performance _really_ hurts, pay the other price (double the maintenance) and do the batch stuff as close to your data as you can.
HTH
Ernst
So the question can be: why are server-based approaches better than client-based ones for batch processing? That's pretty obvious: the work of translating data sets out of the DB alone adds significant overhead. Java (and O-O in general) makes it even more expensive by adding a layer of memory allocation (even if you're just doing JDBC). Finally, ORMs add even more. Sure, creating new objects isn't all that expensive, but it starts to add up. As a result, PL/SQL programmers have very steady jobs in pretty much all sizeable banks and other analytics processing shops around the world.
PS If I were you, I would delete these pointlessly obnoxious comments. I really can't believe these people. What is it, no one listens to them in real life, so they complain on blogs instead?
Do you still think Hibernate is not the best thing for batch processing? Is it better to use PL/SQL and JDBC for batch processing?
I've seen a number of customers use Hibernate for batch processing, and the results have been mixed. First, recognize that in 2004, when this thread first started, no container-managed batch processing technology really existed in the marketplace. Today there exists a batch container (WebSphere Compute Grid) that runs inside of an application server; this batch container is a peer to the EJB container, web container, etc. Note that the batch container is portable and can run outside of WebSphere Application Server, in places like JBoss, WAS CE, WebLogic, etc. In addition to having a first-class batch container, there are two popular ways to describe (read: program) your batch applications: Batch Datastream Framework, which integrates with the Compute Grid batch container, and Spring Batch. Container-managed services for batch are critical. Let the container manage the transaction for the batch application, prepared-statement management (JDBC batching, etc.), checkpoint/restart, and so on.
Before answering your question about the viability of Hibernate for batch (or data-intensive) applications, let's first make clear the role of Java for this workload. Java, especially Java 1.5 and beyond with generational garbage collection, a better JIT, etc., is very well suited for batch applications. Surprisingly, Java has even outperformed COBOL batch for certain types of workloads on the mainframe, where COBOL batch dominates. This is a testament to the JIT optimizations made over the last several years, and we should expect the performance of Java to converge with other languages.
Using stored procedures for batch work works well; however, there are a couple of important considerations to keep in mind. First, you risk duplicating business logic across your OLTP application and your batch application, since both run in fundamentally different environments. Second, life-cycle management is a major problem. A previous forum post dismissed this as a trivial problem to solve, but it's quite difficult. The life-cycle of the database is independent of the life-cycle of the application server as well as the life-cycle of the application itself. Frequent changes to the application could require changes to the stored-procedure definition. At large IT shops, this typically requires coordination across multiple organizations: operations, database management, application development. It's better to recognize this early, bundle the batch application logic with the OLTP application, and manage this as a single application.
Now to your question about the viability of Hibernate for batch. The closer the data access technology is to the underlying data serving technology, the more optimizations can be made. Read this as: writing SQL and using JDBC directly will give you more performance tuning opportunities than leveraging an ORM layer. SQL queries for batch, which typically must retrieve hundreds of thousands or millions of records, will be highly optimized by database experts who understand the database optimizer, plans, etc. I recently saw an SQL query for a batch application that was over 300 pages (MS Word, Times New Roman, 12pt font) long! The Hibernate query was killed after taking 4 weeks to execute. The 300-page SQL query completed in under 4 minutes. While there are numerous optimizations that can be made to an SQL statement for batch, two in particular are important: first, holding cursors open across transactions, where a single select can be made to the database, and multiple syncpoints/checkpoints/transactions are used to process the records; second, JDBC batching, where multiple prepared statements are accumulated in the app server tier and sent across the wire in one RPC call. By using an ORM layer for batch, you limit your ability to tune. ORM for batch is especially a problem for selects, since for batch you only want to select the columns you really need, and with an ORM technology you may get the entire row/object.
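As a rough illustration of the JDBC batching point, here is a sketch using the standard java.sql API (PreparedStatement.addBatch and executeBatch). The customer table, its name column and the JdbcBatchInsert class are assumptions made up for the example, and roundTrips is only back-of-the-envelope arithmetic that treats each executeBatch call as one trip to the database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class JdbcBatchInsert {
    // Number of executeBatch calls needed for `rows` inserts: ceil(rows / batchSize).
    static int roundTrips(int rows, int batchSize) {
        if (rows < 0 || batchSize < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and batchSize >= 1");
        }
        return (rows + batchSize - 1) / batchSize;
    }

    // Accumulates inserts on one prepared statement and sends them batchSize at a time.
    // Hypothetical schema: customer(name). The caller owns the connection and the transaction.
    static void insertNames(Connection conn, List<String> names, int batchSize) throws SQLException {
        String sql = "insert into customer (name) values (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch(); // one call for the whole accumulated batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the final partial batch
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(100_000, 1));
        System.out.println(roundTrips(100_000, 20));
    }
}
```

Whether a given driver really sends a batch in a single network call depends on the driver, so treat the arithmetic as an upper bound on the saving.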
A major problem I see is where developers haphazardly use the ORM APIs throughout the code, versus creating a proper data-access layer that is technology independent. A technology-independent layer keeps options open from an application standpoint, so that we can switch the data-access object from Hibernate to SQL in the future. We instead end up in a situation where the ORM technology is ingrained in the app.
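A minimal sketch of what such a data-access layer could look like follows. All names here (Customer, CustomerDao, InMemoryCustomerDao, BatchJob) are invented for the example, and the in-memory class merely stands in for a Hibernate-backed or plain-JDBC implementation of the same interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical domain type, for illustration only.
class Customer {
    final long id;
    final String name;

    Customer(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Technology-independent contract: callers never see Hibernate or JDBC types.
interface CustomerDao {
    void saveAll(List<Customer> customers);
    List<Customer> findAll();
}

// In-memory stand-in; a Hibernate or SQL implementation would implement the same interface.
class InMemoryCustomerDao implements CustomerDao {
    private final Map<Long, Customer> rows = new LinkedHashMap<>();

    public void saveAll(List<Customer> customers) {
        for (Customer c : customers) {
            rows.put(c.id, c);
        }
    }

    public List<Customer> findAll() {
        return new ArrayList<>(rows.values());
    }
}

// The batch job depends only on the interface, so the implementation can be swapped later.
class BatchJob {
    private final CustomerDao dao;

    BatchJob(CustomerDao dao) {
        this.dao = dao;
    }

    int run(List<Customer> input) {
        dao.saveAll(input);
        return dao.findAll().size();
    }
}
```

Swapping the persistence technology then means writing one new class that implements CustomerDao, while BatchJob stays untouched.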
There are a number of papers on the topic of designing batch applications available now. These weren't available in 2004 when this thread started.
- designing batch applications (pdf w/ examples): http://snehalantani.googlepages.com/designingBatchApps.zip
- data-intensive processing with websphere: http://snehalantani.googlepages.com/WebSphereDataIntensiveApps.pdf
- SwissRe and their use of WebSphere Compute Grid on z/OS for batch: http://www-01.ibm.com/software/tivoli/features/ccr2/ccr2-2008-12/swissre-websphere-compute-grid-zos.html
- J2EE batch processing: http://www.slideshare.net/chris1adkin/j2ee-batch-processing-presentation
- Hibernate chapter on batch: http://docs.jboss.org/hibernate/stable/core/reference/en/html/batch.html
- Hibernate chapter on performance tuning: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html
Here is a presentation that describes the technology: http://snehalantani.googlepages.com/latestpresentationmaterial