Validate error - What this really means!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117194552437
RAC: 35998310

RE: My current suspicion is

Quote:
My current suspicion is that these validate errors happen 'preferably' on 64-bit machines, either Linux or recent Mac OS versions.


I have a lot of Linux hosts doing FGRP1 tasks. They all run a 32-bit OS. Every one I've looked at gets validate errors. I haven't done a proper investigation, but whenever I happen to be perusing a task list, I routinely see them.

Cheers,
Gary.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

RE: RE: Is it possible

Quote:
Quote:
Is it possible for me to see the error messages from the validator?

Only through an Admin or a Mod.

I've taken on the job of running the queries and trying to keep participants informed. I'll try to keep a close watch on this.


Thanks Gary!

My biggest issue of course is whether this is the result of something I'm doing.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117194552437
RAC: 35998310

RE: My biggest issue of

Quote:
My biggest issue of course is whether this is the result of something I'm doing.


If it's something you're doing then it's also something I'm doing on lots of hosts. I'm not quite ready to accept that yet :-).

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7210824931
RAC: 942831

As I posted in a Number Crunching thread about an hour ago, I've recently seen a case where a single WU has been sent to fifteen hosts and twelve of them have reported Validate Errors so far. That suggests some common cause rather than an unlucky conjunction of twelve random host failures.

I have no idea how rare or common this particular sort may be. The rather conservative 20,20,20 setting means this repeat-offender WU may yet go to five more unlucky hosts before central dispatch gives up on it.

Here are TaskIDs from hosts reporting Validate Error on this single WU:

257934442
258537109
257516015
259260242
259002951
259260243
257391991
258910534
259002952
257934441
258910533
258537110

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

That looks like a good way to help the people debugging.

Here's one with 10 validate errors, one completed, one in progress, and one error while computing:

http://einsteinathome.org/workunit/109589019

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4310
Credit: 250087727
RAC: 34894

Occasionally there are workunits that error out that way. These are rare and are being watched, and I have only seen this with BRP4 WUs. Usually we cancel such WUs, but that requires manual intervention, which we haven't had time for recently. This is completely independent of the validate errors of FGRP tasks from certain App versions / platforms.

BM

The Xorcist
Joined: 16 Aug 11
Posts: 16
Credit: 464281554
RAC: 0

Another one with lots of validate probs

http://einsteinathome.org/workunit/109582646

chiphead
Joined: 6 Dec 11
Posts: 1
Credit: 613757
RAC: 0

http://einstein.phys.uwm.edu/

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117194552437
RAC: 35998310

RE: Another one with lots

Quote:

Another one with lots of validate probs

http://einsteinathome.org/workunit/109582646


Yes, this is obviously one of those where the workunit itself is the problem. Without manual intervention by an admin, it will eventually reach the limit of 20 error results.

I checked several of the latest ones, and the error message for all of them is:

Validate error [6] (00100000)
- result file has too few or too many rows

If you get a resend with lots of validate errors on previous results like this, feel free to abort it.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117194552437
RAC: 35998310

RE: http://einstein.phys.uw


Actually, there were only two validate errors at the time of checking. One of them has these messages associated with it:

Validate error [6] (00111010)
- result file has entries that aren't numbers
- a number is out of valid range for this result
- result file has (lines with) wrong number of columns
- result file has too few or too many rows

I would think that this one is due to an overstressed GPU. The other one has just this single message:

Validate error [6] (00001000)
- a number is out of valid range for this result

So at this point there's no indication that there must be a problem with the workunit. You would need to see quite a few with the same message to blame the WU. Resend tasks should NOT be deleted at this point.
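
For anyone who wants to read those eight-digit flag strings themselves, here is a minimal Python sketch of how they appear to map to the messages. The mapping is an assumption inferred only from the three examples quoted in this thread, not taken from the validator source: the two single-flag cases pin down 0x20 (rows) and 0x08 (range), while assigning 0x02 and 0x10 is a guess based on the order in which the messages were listed. The names FLAG_MEANINGS and decode_validate_flags are made up for illustration.

```python
from typing import List

# Assumed bit-to-message mapping, inferred from the examples in this thread.
# 0x20 and 0x08 each appear alone in a quoted error; 0x02 and 0x10 are guesses
# based on the order of the messages in the combined (00111010) example.
FLAG_MEANINGS = {
    0x02: "result file has entries that aren't numbers",
    0x08: "a number is out of valid range for this result",
    0x10: "result file has (lines with) wrong number of columns",
    0x20: "result file has too few or too many rows",
}


def decode_validate_flags(bits: str) -> List[str]:
    """Translate a flag string such as '00111010' into a list of messages."""
    value = int(bits, 2)
    messages = []
    # Walk the known flags from the lowest bit to the highest.
    for mask in sorted(FLAG_MEANINGS):
        if value & mask:
            messages.append(FLAG_MEANINGS[mask])
            value &= ~mask
    # Anything left over is a bit this thread gives no meaning for.
    if value:
        messages.append(f"unknown flag bits: {value:08b}")
    return messages


if __name__ == "__main__":
    for flags in ("00100000", "00111010", "00001000"):
        print(flags)
        for message in decode_validate_flags(flags):
            print("  -", message)
```

Bits with no known meaning are reported rather than silently dropped, since the thread only shows four of the eight possible flags.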

Cheers,
Gary.
