FS#6470 - Link graph job thread-join schedule changes

Attached to Project: OpenTTD
Opened by J G Rennison (JGR) - Wednesday, 25 May 2016, 19:17 GMT
Type Patch
Category Core
Status New
Assigned To No-one
Operating System All
Severity Low
Priority Normal
Reported Version trunk
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No


Pause the game instead of blocking when link graph jobs lag.

Check if the job is still running one date fract tick before it is due to join and if so pause the game until its done.
This avoids the main thread being blocked on a thread join, which appears to the user as if the game is unresponsive, as the UI does not repaint and cannot be interacted with.
Show if pause is due to link graph job in status bar, update network messages.
This does not apply for network clients.

2. (applies on top of 1)
Join more than one link graph job at once where possible.

This can occur when the link graph recalculation times are changed.
In particular, if the duration or interval are reduced, the new jobs
will be delayed until the completion of the previous long job,
and then remain backlogged thereafter as jobs are dequeued at the same
rate as they are queued.
This task depends upon

Comment by J G Rennison (JGR) - Wednesday, 25 May 2016, 19:42 GMT
The uploaded diffs somehow ended up as 0 bytes, uploading again...
Comment by fonsinchen (fonsinchen) - Sunday, 19 June 2016, 10:28 GMT
I like the lag/pause patch, but there are some stylistic issues with it:

1. The IsFinished() method should be renamed as the name is quite ambiguous, now. Make it NeedsToBeJoined() or similar.
2. Line wrapping could be improved, in particular the comment in LinkGraphSchedule::Run() is hard to read.
3. The "if" cascade in StateGameLoop_LinkGraphPauseControl() can be simplified.
4. The "extern" declaration in StateGameLoop() is unnecessary. Instead, StateGameLoop_LinkGraphPauseControl() should be declared in linkgraphschedule.h, which is already included in openttd.cpp.

The multi-join patch is somewhat problematic. I did this on purpose because the merging back of link graph state takes place on the main thread and joining multiple components in a row causes more lag than only joining one. I understand that this can hold up processing of the job queue and increase the delay between game changes and link graph reactions. Maybe some more intelligent mechanism could be found that joins at most N link graph edges at a time or triggers the joining more often when the queue is long.
Comment by fonsinchen (fonsinchen) - Sunday, 19 June 2016, 14:57 GMT
Unfortunately the flag transfer between threads is undefined behavior in C++11. Also, even before C++11, the hardware is allowed to take an arbitrarily long time to actually make the write in the link graph thread visible to the read in the main thread. The compiler might even realize that the write is never done in the main thread and optimize the whole thing away. We should use a mutex there for older compilers, and a std::atomic with a stricter memory model for C++11. That is somewhat annoying, of course. Maybe I can come up with something better.
Comment by fonsinchen (fonsinchen) - Sunday, 19 June 2016, 19:35 GMT
It's terribly ugly, but "volatile" would probably do the trick here. As we cannot generally use C++11 atomics, we cannot establish any proper synchronization without a using a mutex. volatile does exactly the right thing on MSVC, and at least prevents the compiler from optimizing the whole thing away on other platforms. Even though not specifically guaranteed, it's probably safe to assume that the change to job_completed becomes available to the main thread fairly soon after the worker thread has terminated. The CPU's cache coherency protocol should take care of that, provided that we actually reread the value from memory (or cache) every time we need it.

The #ifdef'd atomic operations are fairly useless in their current form as the relaxed memory model doesn't actually give us the guarantees we are looking for here.
Comment by J G Rennison (JGR) - Monday, 20 June 2016, 22:50 GMT
On your first comment:
1. Perhaps IsScheduledToBeJoined() ?
2. Noted, I'll see about reflowing that comment, though it probably needs re-writing anyway given your second and third comments.
3. Agreed. Looking at it now, the else if clause and its nested ifs can be collapsed.
4. Agreed.

As for merging multiple link graph jobs, merging two instead of one where available could be a way forward.
It prevents the queue from being backlogged indefinitely, but shouldn't cause an excessive latency spike.

For comments 2 and 3:
On MSVC volatile reads/writes have acquire/release semantics, which is definitely correct, but stricter than necessary.
Relaxed ordering semantics are sufficient as there are no other writes by the link graph worker thread which need to be ordered with respect to the visibility in the main thread of the change to the job_completed flag, until the worker thread exits which acts as a full memory barrier. There is only one shared write, so any ordering will do for both the reader and writer.
On non MSVC platforms volatile can be assumed to be at least relaxed ordering, and to prevent the compiler removing the load/store.
An ordinary instead of atomic write should also be OK as there is no write contention, so volatile ought to be good enough.
Comment by J G Rennison (JGR) - Friday, 15 July 2016, 01:00 GMT
I've attached an updated version of the lag/pause patch (diff 1), with changes based on the comments above.

For the multiple join patch (diff 2), instead of joining as many jobs as possible in the same tick, the patch now adds a second join check at a later date_fract in the same day as the original join check.
As joins are now checked more often than jobs are spawned, any backlog will be gradually reduced, without joining multiple jobs in a tick or excessively increasing the rate at which jobs are joined.