hi,
in another thread Darren told me that
> You get and return your data to and from one system, but another system keeps up with what you are given and what you have returned.
> When you get assigned a work unit by the assignment server, you then go to the data server to get it. When you return it, you send it directly back to the data server, but the assignment server doesn't know this until the next time you contact it. This will happen automatically when you need more work, or you can do it manually by updating the project.
This means that if your machine dies, or gets disconnected, or if you detach or reset your project between upload and the next WU being downloaded, you lose the credit and the project apparently loses the work even tho it has been uploaded.
Or, if your client uploads a WU an hour before the deadline, but the next WU is not downloaded till three hours later, then the WU is counted as late when in fact it was uploaded in time.
None of this is what a naive user would expect.
None of this is very friendly even to the less naive user, IMHO
My first suggestion is that after each upload the client should try to contact the assignment server asap to tell it that the WU is returned safely.
My second suggestion is that the servers should check when a WU times out, and before it is re-assigned, whether in fact it has been returned but for some reason the assignment server did not get told. That way work would not get lost to the project and credit would not get lost to the user.
Ideally both of these would be good, but either one would solve the problem of WU that turn into ghosts after they've been crunched and returned
~~gravywavy
Copyright © 2024 Einstein@Home. All rights reserved.
'update' assignment server after every WU upload
> My first suggestion is that after each upload the client should try to contact
> the assignment server asap to tell it that the WU is returned safely.
>
> My second suggestion is that the servers should check when a WU times out,
> and before it is re-assigned, whether in fact it has been returned but for
> some reason the assignment server did not get told. That way work would not
> get lost to the project and credit would not get lost to the user.
Sorry, looks like I created more questions by not being specific enough.
The client doesn't contact the server after each upload because of the increased burden that would put on the server. The amount of data that moves during those contacts is actually quite small, but the burden on the server of establishing that contact is relevant. By reporting work back in groups, it keeps the number of necessary contacts (and thus the burden on the server) to a minimum. Just as the assignment server assigns you several work units at once, because of the sheer number of users it works best when it gets them reported back the same way - several at once.
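A rough way to see the effect of batching, as a sketch only: the function name and the numbers below are made up for illustration and are not taken from the BOINC code. It just counts how many times a host has to open a connection to the assignment server if it reports finished work in groups of a given size.

```python
import math

def scheduler_contacts(results_done: int, batch_size: int) -> int:
    """Contacts needed to report results_done results, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(results_done / batch_size)

# Hypothetical host that finishes 30 work units in a month.
one_at_a_time = scheduler_contacts(30, 1)   # report after every upload
in_groups = scheduler_contacts(30, 4)       # report four at a time

print(one_at_a_time, in_groups)
```

The saving only shows up when the group size is larger than 1, which is the case for fast hosts that turn round several work units between contacts.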
As far as work units expiring, it does your second suggestion already. If the assignment server hasn't heard back from you, as expiration nears it will check with the data server just to make sure it really hasn't been returned. If you return a work unit and it never gets reported, the assignment server will find it just before expiration. I don't know exactly how many seconds before expiration it makes the check, so I suppose there is a tiny little window in which you could return it and it get missed, but the client also has some safeguards for that. As long as your client is running and connected (which it had to be if you just returned a unit right at the last moment) it will notify the assignment server of uploads that are about to expire even if it doesn't need more work - again triggered by the fact that the work unit is about to expire.
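The check described above could look roughly like this. It is only a sketch of the idea, not the real server code: the WorkUnit class, the function name and the check_window parameter are all assumptions for illustration, and the real window before expiration is not known here.

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    name: str
    deadline: float          # seconds since some reference time
    reported: bool = False   # has the assignment server been told?

def reconcile_before_expiry(assigned, uploaded_names, now, check_window):
    """Look for results that reached the data server but were never reported.

    assigned       -- work units the assignment server handed out
    uploaded_names -- names of results the data server actually holds
    check_window   -- how long before the deadline the check starts (assumed)
    Returns (recovered, to_reassign) as lists of work unit names.
    """
    recovered, to_reassign = [], []
    for wu in assigned:
        if wu.reported:
            continue                      # already known to be returned
        if wu.deadline - now > check_window:
            continue                      # not close to expiration yet
        if wu.name in uploaded_names:
            wu.reported = True            # found on the data server: no ghost
            recovered.append(wu.name)
        elif now >= wu.deadline:
            to_reassign.append(wu.name)   # really missing and past deadline
    return recovered, to_reassign
```

For example, with three outstanding work units where only "a" was uploaded and "c" is still far from its deadline, a check at the deadline of "a" and "b" recovers "a", marks "b" for re-assignment and leaves "c" alone.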
Depending on your OS there is an option to notify the assignment server every time a work unit is uploaded. It appears you're using the windows client, and I don't believe this can be done (at least not easily) with the windows client. It is still an option for linux, but I believe it is being removed in upcoming versions. In general, it really doesn't create any benefit but creates a lot of burden on the servers.
Hi Darren,
hey! one out of two wishes granted already, that's not bad! Glad it already does the second one, as it will (eventually) pick up any ghosts that have actually been returned.
On the first one, I accept the reason you give, and I half thought that might be why for boxes that turn round many WU/day.
I'd ask you to consider this, however, or ask the policymakers to consider it.
To be clear, my suggestion is that if a user takes that option the box only contacts a project when either a WU was just returned, or at the polling interval if there is no work at all held locally for that project. On a box where the time to crunch a WU is more than the "connect every" interval, you would tend to save network connects by going over to a once/WU returned option. Certainly if the group size is usually 1 WU on a particular box there would be no *extra* overhead compared to delayed reporting of single groups.
On the other hand, a little extra complexity in the client (both the code and the user-interface to select the option) also counts as overhead from the viewpoint of the project programmers, and that might well be enuff in itself to stop it ever happening, on the KISS principle.
~~gravywavy
> To be clear, my suggestion is that if a user takes that option the box only
> contacts a project when either a WU was just returned, or at the polling
> interval if there is no work at all held locally for that project. On a box
> where the time to crunch a WU is more than the "connect every" interval, you
> would tend to save network connects by going over to a once/WU returned
> option. Certainly if the group size is usually 1 WU on a particular box
> there would be no *extra* overhead compared to delayed reporting of single
> groups.
That could reduce network connections a bit more, but I think it would then discourage people from keeping small queues. If you were running on a 1 wu cache, you would get no new work until after you upload the work unit in progress. That would mean a few seconds of down time between work units, and if you want to see what that can do to a lot of people, just take a look around the seti boards when people have a delay in getting work.
As it is, by linking the trigger for connecting to needing more work rather than having turned in work, I think the most official guess I've seen is that it averages to about 1.1 connects for every work unit done - and it keeps people from having those short little inactive periods.
I just wish they would give us more options to define for ourselves how we connect, rather than a better functioning pre-defined option. For instance, why is there no ability to override everything and tell it to connect, for example, at 9:45 every morning and do whatever needs to be done - upload anything ready to go, get anything more I need - then be done with it till tomorrow?
> I just wish they would give us more options to define for ourselves how
> we connect, rather than a better functioning pre-defined option. For
> instance, why is there no ability to override everything and tell it to
> connect, for example, at 9:45 every morning and do whatever needs to be done -
> upload anything ready to go, get anything more I need - then be done with it
> till tomorrow?
>
Actually something like this is on the task list. Connect only between the hours of x and y. If there was anything that needed doing, it would connect at x, and allow connections up to y at which point no new connections would be allowed. I am not certain if it would drop open connections at this time.
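The "only between the hours of x and y" rule can be sketched in a few lines. This is just an illustration of the idea, not how the client does or will do it; the function name is made up, and treating y as the moment new connections stop is an assumption based on the description above.

```python
from datetime import time

def connection_allowed(now: time, start: time, end: time) -> bool:
    """True if a new connection may be opened at clock time `now`.

    The window runs from `start` (x) up to but not including `end` (y).
    A window such as 22:00 to 02:00 is taken to wrap past midnight.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end

# Window of 09:00 to 10:00, checked at 09:45 and at 10:00.
print(connection_allowed(time(9, 45), time(9, 0), time(10, 0)))
print(connection_allowed(time(10, 0), time(9, 0), time(10, 0)))
```

Whether connections already open at y get dropped is left out, since that is not decided in the description above.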
BOINC WIKI
Hi Darren,
sorry for the delay, it took me some time to figure out how to explain the two facts that you are totally right in what you say but nonetheless it does not contradict what I said. Instant reporting is still better than deferred reporting.
> [...] If you were running on a 1 wu
> cache, you would get no new work until after you upload the work unit
> in progress. That would mean a few seconds of down time between work units,
> and if you want to see what that can do to a lot of people, just take a look
> around the seti boards when people have a delay in getting work.
>
> As it is, by linking the trigger for connecting to needing more work rather
> than having turned in work, I think the most official guess I've seen is that
> it averages to about 1.1 connects for every work unit done - and it keeps
> people from having those short little inactive periods.
Please consider three different scenarios, all on a hypothetical machine which takes exactly 1 day to complete a WU. In each case WU A starts to run at midnight on day 1, followed by WU B and WU C. I will also assume it takes
Your single-WU scenario, cache of max 1 WU, WU A completes at 0000 on day 2, uploads, WU B is assigned at 0005, downloaded and running by 0010. WU B completes at 0010 on day 3, uploads and is reported at 0015. Total time WU B is checked out is 1 day plus 10min overhead. This comes at the cost of 10 min downtime between each WU, which you correctly say will annoy some people (including me, btw).
Secondly, the current scenario where the new WU is downloaded when there is less than or about 0.1 days (=2h 24min) work left. WU B is assigned at 2136 on day 1 and the WU before A is reported (having been uploaded in the meantime). At 0000 on day 2 WU A completes and WU B starts instantly. At 2136 on day 2 WU A is reported, and WU C is assigned. At 0000 on day 3 WU B is complete and WU C starts. At 2136 on day 3 WU B is reported as a side effect of asking for WU D. So WU B was checked out for two full days.
Third scenario, go for 2 WU cache but report soon after completion. Again let's assume this takes an arbitrary 10 min to happen, though I'd hope usually it would be faster.
WU A started at 0000 on day 1, and almost immediately the previous WU was uploaded & reported and the next WU, WU B is assigned at 0010. WU B starts at 0000 on day 2, completes at 0000 on day 3, and is uploaded and reported by 0010 on day 3.
Compare the second and third scenarios. The cost to the project is exactly the same, 1 connect per WU and the database has to carry the WU for two full days in each case.
From the client point of view they are very different. In scenario 2 the client is carrying a completed WU for most of a day, and the uncompleted work oscillates between 0.1 and 1.1 WU; which is how long the client can run if it is cut off from the project (net downtime, project downtime, etc).
In scenario 3 the uncompleted work oscillates from 0.99 WU to 1.99 WU, giving correspondingly longer that the client can work in isolation. For no cost to the project the client gains 0.89 WU extra resilience.
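The arithmetic for the three scenarios can be checked with a short script. It is only a sketch under the assumptions already stated (one day per WU, an arbitrary 10 min for upload, report and assignment, refill at 0.1 days of work left); the variable names are made up for illustration.

```python
DAY = 24 * 60        # minutes in one day; each WU takes exactly one day
OVERHEAD = 10        # assumed minutes for upload, report and assignment
REFILL = DAY // 10   # 0.1 days of work left = 144 min, i.e. refill at 2136

# (assigned, reported) for WU B, in minutes after 0000 on day 1
wu_b = {
    "1: 1 WU cache": (DAY + 5, 2 * DAY + 15),
    "2: current behaviour": (DAY - REFILL, 3 * DAY - REFILL),
    "3: 2 WU cache, report on completion": (OVERHEAD, 2 * DAY + OVERHEAD),
}

# How long the project database has to carry WU B in each scenario
checked_out = {name: reported - assigned
               for name, (assigned, reported) in wu_b.items()}

# Lowest amount of uncompleted work held locally, in WU
low_2 = REFILL / DAY              # just before the refill in scenario 2
low_3 = (DAY - OVERHEAD) / DAY    # just before the refill in scenario 3
gain = low_3 - low_2

for name, minutes in checked_out.items():
    print(f"scenario {name}: WU B checked out for {minutes / DAY:.3f} days")
print(f"extra resilience of scenario 3 over scenario 2: {gain:.2f} WU")
```

Scenarios 2 and 3 both come out at exactly two days checked out, while scenario 1 comes out at one day plus 10 min.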
Connect on completion, with a 2 WU cache, is therefore the optimal configuration for all computers that are too slow to make it worth reporting and assigning WU in bunches.
To put it another way, while you do the assignment and reporting in a single connection (for very good reasons as you pointed out) you always hold a whole number of WU. If you go for more than one WU to avoid the inter-WU gap, then you may as well go all the way to 2 WU, because that is what it costs in the database and in your stats for average turn-round.
The valid choice is between a 1 WU cache with just over 1 day turn round and a small gap between WU, or a 2 WU cache with just over a 2 day turn round and upload/report/assignment just after completion. Which you choose depends on how you rate local inactivity against doubling the database load and doubling your reported average turnround.
The current design is less optimal than either of those alternatives.
~~gravywavy
Having a 2 WU cache is less than optimal from the client's standpoint if multiple projects are attached. It is already the case for some computers that are attached to many projects that even 1 WU per project oversubscribes the CPU and many deadlines are missed. In these cases, we will have to go to a client side scheduler that does not attempt to keep 1 WU per project, and the projects will have to accept a higher number of connections from these hosts.
BOINC WIKI
> Having a 2 WU cache is less than optimal from the clients standpoint if
> multiple projects are attached. [...]
yes, but if you have multiple projects you will not mind having a small gap between E@H WU, as the cpu will be running the other project's WU in the meantime. I would strongly advise anyone with a slow machine and more than one project to go for the 1 WU cache. As I explained, the current default does not give you that, it gives you the problems of a 2 WU cache but without some of the advantages.
Whatever default values are chosen, the essence of my wish (this is a wish list not a complaint board) would be that the user could configure a different behaviour where appropriate for their machine & for their choice of other projects (or for no other projects).
~~gravywavy