JPA Select latest instance for each item
Let's say I have a Meeting entity. Each meeting has a single attendee and a meeting date. Within my meeting table I may have multiple meetings for each attendee, with different dates for each. I need a JPA query that will select only the latest meeting for all attendees. For instance, if my table looks like this
Meeting ID | Attendee ID | Meeting Date 1 | 1 | 6/1/2011 2 | 2 | 6/1/2011 3 | 1 | 6/6/2011 4 | 3 | 6/6/2011
My result should be
Meeting ID | Attendee ID | Meeting Date 2 | 2 | 6/1/2011 3 | 1 | 6/6/2011 4 | 3 | 6/6/2011
Using JPA 2 against postgres. Meeting has 1-1 to attendee and a simple timestamp date. I suspect I'm going to need to do a group by and max(blah) and maybe a join to myself, but I'm not sure of the best way to approach this.
Update: After spending the evening playing with this, I still do not have an acceptable JPQL solution to this. Here is what I have so far:
select m from Meeting m where m.meetingDate in ( select max(meet.meetingDate) from Meeting meet group by meet.attendee )
I have various other conditions that are not relevant to this question, like filtering by attendee department and whatnot. The only reason this works is because we are tracking meeting date to the second (or finer) and the chance that there will be two meetings at exactly the same time is minimal. We are putting some java stuff around it to keep only hte last meeting for each attendee just in case we do get two at the same time, but that's a pretty crappy solution. It really shouldn't be too difficult to get it all in a query, but I have yet to figure it out.
Update2: Adding sql tag because if I need to use sql to create a view and create a JPA object to map to the view I'm ok with that.
In SQL the solution is very simple - join the table with a subquery, which gives you the most recent meeting for each attendee:
select * from Meeting ALL join ( select max(meetingDate) as newest, attendee from Meeting group by attendee ) LATEST on ALL.meetingDate = LATEST.newest AND ALL.attendee = LATEST.attendee
This works, and works fast!
The problem with JPA is that it (or most implementations) won't allow a subquery for a join. After spending several hours trying what will compile in the first place, and then, how slow it is, I decided that I hate JPA. Solutions like the ones above - like EXISTS (SELECT .. ) or IN ( SELECT .. ) - take ages to execute, orders of magnitude slower than they should.
Having a solution that works meant that I just needed to access that solution from JPA. There are two magic words in SQL that help you do just that:
and the life becomes so much simpler... Just define such entity and use it. Caution: it's read-only.
Of course, any JPA purists will look down on you when you do that, so if anyone has a pure JPA solution, please let us both know!
I think I've got it with this query.
select m from Meeting m where m.meetingDate = (select max(m1.meetingDate) from Meeting m1 where m1.attendee = m.attendee ) and not exists (select m2 from Meeting m2 where m2.attendee = m.attendee and m2.meetingDate > m.meetingDate)
Well in SQL that would be quite simple I think, so I assume that can be mapped to JPA:
SELECT m.AttendeeId, MAX(m.MeetingDate) from Meeting m GROUP BY m.AttendeeId
Edit: If you need the messageId itself as well you can do that with a simple subquery that returns the messageId for a message where the other two values are equal. Just make sure you handle the case where there are several messageIds for the same Attendee and Date (eg pick the first result since they should all be equally good - although I'd doubt that such data even makes sense for meetings)
As Bulba has said appropriate way is to join a subquery with group by.
The problem is that you can't join a subquery.
Here is a workaround.
Lets see what you get in the subquery with group by. You get a list of pairs (attendee_id, max(meeting_date)). This pair is like a new unique id for row with max date you want to join on. Then note that each row in the table forms a pair (attendee_id, meeting_date). So every row has an id as a pair (attendee_id, meeting_date). Lets take a row if only it forms an id that belongs to list received in the subquery.
For simplicity lets represent this id-pair as a concatenation of attendee_id and meeting_date: concat(attendee_id, meeting_date).
Then the query in SQL(similarly for JPQL and JPA CriteriaBuilder) would be as follows:
SELECT * FROM meetings WHERE concat(attendee_id, meeting_date) IN (SELECT concat(attendee_id, max(meeting_date)) FROM meetings GROUP BY attendee_id)
Note that there is only one subquery per query, not one subquery for each row like in some answers.
Afraid of comparing strings?
We have a special offer for you!
Lets encode that id-pair to number. It will be a sum of attendee_id and meeting_date but with modifications to ensure uniqueness of code. We can take number representation of date as Unix time. We will fix the value of max date that our code can capture because final code has max value limit (e.g. bigint(int8)<263). Lets take for convenience max date as 2149-06-07 03:00:00. It equals 5662310400 in seconds and 65536 in days. I will assume here that we need precision for date in days(so we ignore hours and below). To construct unique code we can interpret it as a number in a numerical system with base of 65536. The last symbol(number from 0 to 216-1) in or code in such numerical system is number of days. Other symbols will capture attendee_id. In such interpretation code looks like XXXX, where each X is in range [0,216-1] (to be more accurate, first X is in range [0,215-1] because of 1 bit for sign), first three X represents attendee_id and last X represents meeting_date. So the max value of attendee_id our code can capture is 247-1. The code can be computed as attendee_id*65536 + "date in days".
In postgresql it will be:
attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
Where date_part returns date in seconds which we convert to days by dividing on constant.
And final query to get the latest meetings for all attendees:
SELECT * FROM meetings WHERE attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24) IN (SELECT attendee_id*65536 + date_part('epoch', max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
I have created a table with stucture as in the question and populated it with 100000 rows randomly selecting attendee_id from [1, 10000] and random date from range [1970-01-01, 2017-09-16]. I have benchmarked (with EXPLAIN ANALYZE) queries with the following techniques:
SELECT * FROM meetings m1 WHERE m1.meeting_date= (SELECT max(m2.meeting_date) FROM meetings m2 WHERE m2.attendee_id=m1.attendee_id);
Execution time: 873260.878 ms
Join subquery with group by
SELECT * FROM meetings m JOIN (SELECT attendee_id, max(meeting_date) from meetings GROUP BY attendee_id) attendee_max_date ON attendee_max_date.attendee_id = m.attendee_id;</code>
Execution time: 103.427 ms
Use pair (attendee_id, date) as a key
Concat attendee_id and meeting_date as strings
SELECT * FROM meetings WHERE concat(attendee_id, meeting_date) IN (SELECT concat(attendee_id, max(meeting_date)) from meetings GROUP BY attendee_id);
Execution time: 207.720 ms
Encode attendee_id and meeting_date to a single number(code)
SELECT * FROM meetings WHERE attendee_id*65536 + date_part('epoch',meeting_date)/(60*60*24) IN (SELECT attendee_id*65536 + date_part('epoch',max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
Execution time: 127.595 ms
Here is a git with table scheme, table data (as csv), code for populating table, and queries.
SELECT MAX(m.MeetingDate) FROM Meeting m