Workunit 266757 Checked, but no consensus yet
Just curious.
I have had three of these. Could you please explain why only one is marked as a validate error?
Stderr output
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
argument [0]: [projects/dnahome.cs.rpi.edu_dna/Gibbs_0.18_windows_intelx86.exe]
argument [1]: [--max_sites]
argument [3]: [--blocks]
argument [4]: [0.9]
argument [5]: [0.05]
argument [6]: [0.05]
argument [7]: [--motifs]
type_string [forward]
width_string [16]
type_string [reverse]
width_string [16]
argument [11]: [--enable_shifting]
argument [14]: [--print_best_sites]
argument [16]: [--print_current_sites]
argument [17]: [--sequence_file]
argument [19]: [--seed]
argument [21]: [--burn_in_period]
argument [23]: [--current_sites]
argument [24]: [sites.txt]
blocks: 0.900000 0.050000 0.050000
seeding: 1010282157
sites were from arguments
incremented counts from checkpoint for [2066] sequences.
argument [0]: [projects/dnahome.cs.rpi.edu_dna/Gibbs_0.18_windows_intelx86.exe]
argument [1]: [--max_sites]
argument [3]: [--blocks]
argument [4]: [0.9]
argument [5]: [0.05]
argument [6]: [0.05]
argument [7]: [--motifs]
type_string [forward]
width_string [16]
type_string [reverse]
width_string [16]
argument [11]: [--enable_shifting]
argument [14]: [--print_best_sites]
argument [16]: [--print_current_sites]
argument [17]: [--sequence_file]
argument [19]: [--seed]
argument [21]: [--burn_in_period]
argument [23]: [--current_sites]
argument [24]: [sites.txt]
blocks: 0.900000 0.050000 0.050000
seeding: 1010282558
incremented counts from checkpoint for [2066] sequences.
</stderr_txt>
Michael |
|
|
|
|
|
Is this not an invalid state, or are the units checked against a valid statement?
name: test_newmotifs_3_6328_540000
application: Gibbs sampler
created: 12 May 2011 | 18:26:11 UTC
minimum quorum: 2
initial replication: 2
max # of error/total/success tasks: 2, 4, 2
Task    Computer  Sent                        Time reported or deadline   Status                              Run time (sec)  CPU time (sec)  Credit   Application
571994  2535      12 May 2011 | 18:26:21 UTC  12 May 2011 | 19:17:47 UTC  Completed, validation inconclusive  1,603.41        1,516.53        pending  Gibbs sampler v0.18
571995  1648      12 May 2011 | 18:26:21 UTC  12 May 2011 | 19:13:53 UTC  Validate error                      1,350.68        1,326.02        ---      Gibbs sampler v0.18
573657  1892      12 May 2011 | 19:18:01 UTC  14 May 2011 | 7:18:01 UTC   In progress                         ---             ---             ---      Gibbs sampler v0.18
michael |
|
|
|
|
|
Hi Michael,
This is the only pending task you have at the minute, http://dnahome.cs.rpi.edu/dna/workunit.php?wuid=280583
test_newmotifs_4_7505_1070000
581955 1900 13 May 2011 | 2:23:10 UTC 14 May 2011 | 14:23:10 UTC In progress --- --- --- Gibbs sampler v0.18
581956 2535 13 May 2011 | 2:23:11 UTC 13 May 2011 | 5:53:14 UTC Completed, waiting for validation 2,631.58 1,609.11 pending Gibbs sampler v0.18
As you can see, you returned your task but it hasn't been returned by another cruncher yet. Each task is crunched by two crunchers, unless one fails, in which case it is sent out again. |
|
|
|
|
|
Thanks for your reply.
The question I was asking in both posts was this, expanded:
A WU is sent out to two hosts, A and B.
A returns and shows as completed, waiting for validation (pending).
B returns and is compared with A's result; both get a status of "Checked, but no consensus yet".
A keeps "Checked, but no consensus yet".
B is marked as validate error.
At this point in time, who is correct, A or B?
(I would expect B to keep "Checked, but no consensus yet" until C is returned.)
The WU is then sent to C.
C returns and is compared with A; validation occurs.
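As a rough illustration of the behaviour expected in the steps above, here is a minimal Python sketch. It is not the project's or BOINC's actual validator: the function name `check_consensus`, the status strings, and the use of plain strings as stand-ins for result files are all assumptions made for the example. Only the minimum quorum of 2 comes from the workunit page.

```python
from collections import defaultdict

MIN_QUORUM = 2  # "minimum quorum" as shown on the workunit page


def check_consensus(results, min_quorum=MIN_QUORUM):
    """Return a status for every host that has reported a result.

    results: dict mapping host name -> output, where the output is any
    hashable stand-in for the returned result file (an assumption; a real
    validator would compare the files themselves).
    """
    # Group hosts by the output they returned.
    groups = defaultdict(list)
    for host, output in results.items():
        groups[output].append(host)

    # The largest group of agreeing hosts is the only candidate for consensus.
    winners = max(groups.values(), key=len, default=[])

    # Not enough agreeing results yet: nobody is marked valid or invalid.
    if len(winners) < min_quorum:
        return {host: "no consensus yet" for host in results}

    # Quorum reached: agreeing hosts are valid, everyone else is invalid.
    return {host: ("valid" if host in winners else "invalid") for host in results}
```

With only A and B reported and disagreeing, `check_consensus({"A": "x", "B": "y"})` leaves both at "no consensus yet". Once C reports the same output as A, A and C become "valid" and B becomes "invalid".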
Michael
|
|
|
|
|
|
I get what you are saying now; A and B should remain at no consensus rather than B going to validate error, until C is returned, so C can be compared to both A and B.
While I suppose there might be some other problem that could cause a validation error (for example, a task that failed after 3 sec), that would be unlikely to happen repeatedly on different systems.
|
|
|
|
|
|
Just found another of these.
wuid=291964
Michael |
|
|
|
|
|
Like I did on the first look, you missed the obvious ;)
This example (like the first one too) has only the start arguments but no result data in it.
For a valid WU you should see something like:
incremented counts from checkpoint for [2066] sequences.
<current_sites>
1,392:1,589.
1,219:0,389.
...
...
...
0,30:0,106.
0,41:1,71.
</current_sites>
17:50:25 (5060): called boinc_finish
</stderr_txt>
Missing result = invalid
Question is, what caused that. |
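A quick way to spot such results is to look for the pieces named above in the stderr text. The sketch below is a minimal Python check, assuming that a usable result always prints a non-empty <current_sites> block and a "called boinc_finish" line, as in the example. The helper names are hypothetical, and this is not how the project's validator works.

```python
import re

# Matches everything between the opening and closing current_sites tags.
SITES_RE = re.compile(r"<current_sites>(.*?)</current_sites>", re.DOTALL)


def has_result(stderr_text):
    """True if the stderr text contains a non-empty <current_sites> block."""
    match = SITES_RE.search(stderr_text)
    return bool(match and match.group(1).strip())


def finished_cleanly(stderr_text):
    """True if the application logged that it called boinc_finish."""
    return "called boinc_finish" in stderr_text
```

A stderr dump that stops after the "incremented counts from checkpoint" line, like the ones posted in this thread, fails both checks.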
|
|
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 8 Feb 10 Posts: 361 Credit: 262,678 RAC: 23
|
|
It looks like he's having some kind of computation error, maybe some kind of segmentation fault... I wonder why nothing showed up in the error log...
Does this happen to all your workunits, or just some of them? What system is it happening on? |
|
|
|
|
Good evening.
No, the problem is not just on my host; I have seen it on quite a few hosts. Most WUs from both of my hosts, 2535 and 2585, are successful (possibly 99+%).
It has been logged on two other hosts, 1648 and 280, and on both of my hosts, 2535 and 2585.
As indicated in post 763, I did miss the stderr_txt.
They would only show up in the pending state, as I only noticed them while checking outstanding WUs.
If you look at WU 291964, the result for host 280 still shows validate error, while mine (2586) shows completed and validated.
There will no doubt be a few more. These are the ones I have reported. Perhaps we should start a thread just for errors, to see how many there are or have been.
edit: The OS on both hosts is Vista. I do not know what the other hosts are running.
Michael |
|
|
Travis Desell
|
|
It looks like some crash happened on the gas giants computer. If you look at that task, it's not outputting the <current_sites>...</current_sites> that the other workunits are:
http://dnahome.cs.rpi.edu/dna/result.php?resultid=605015
vs
http://dnahome.cs.rpi.edu/dna/result.php?resultid=605016
http://dnahome.cs.rpi.edu/dna/result.php?resultid=609620
Chances are that's because this workunit crashed sometime during its execution. I'm not quite sure why it would be crashing while the other ones are running fine...
There may be some kind of memory error when I'm setting up the application; either that or it could be some kind of problem with that machine. |
|
|
|
|
|
Travis Desell.
Thank you. I wonder if it may be happening more often than reported,
i.e. validate error rather than no consensus.
edit: just found
wuid=310410
michael |
|
|
Travis Desell
|
That's very interesting, because it looks like it's running to completion, but not outputting any results. |
|
|
|
|
|
Sequence of events.
Until today, a WU at 100% complete went: status running, status ready to report [almost instantly, up to 1 second], status clear.
Today it appears as: WU 100% complete, status running, status ready to report [for between 5 and 10 seconds], status clear.
Could this have been the cause of the loss of the stderr_txt data files?
edit: I cannot ever recall seeing an uploading status.
edit 2: I spoke too soon; this one was returned almost instantly: 100%, running, ready to report, clear.
wuid=326887
Michael
I do not know if you have adjusted anything, but so far there are no errors.
Michael |
|
|
|
|
|
I see the validate errors still exist. Has anybody else been getting them?
wuid=277270
wuid=277580
Michael |
|
|