Checked, but no consensus yet

Message boards : Number crunching : Checked, but no consensus yet

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 736 - Posted: 12 May 2011, 13:05:37 UTC
Last modified: 12 May 2011, 13:20:04 UTC

Workunit 266757 Checked, but no consensus yet

Just curious.

I have had three of these. Could you please explain why only one is marked as a validate error?

Stderr output
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
argument [0]: [projects/dnahome.cs.rpi.edu_dna/Gibbs_0.18_windows_intelx86.exe]
argument [1]: [--max_sites]
argument [3]: [--blocks]
argument [4]: [0.9]
argument [5]: [0.05]
argument [6]: [0.05]
argument [7]: [--motifs]
type_string [forward]
width_string [16]
type_string [reverse]
width_string [16]
argument [11]: [--enable_shifting]
argument [14]: [--print_best_sites]
argument [16]: [--print_current_sites]
argument [17]: [--sequence_file]
argument [19]: [--seed]
argument [21]: [--burn_in_period]
argument [23]: [--current_sites]
argument [24]: [sites.txt]
blocks: 0.900000 0.050000 0.050000
seeding: 1010282157
sites were from arguments
incremented counts from checkpoint for [2066] sequences.
argument [0]: [projects/dnahome.cs.rpi.edu_dna/Gibbs_0.18_windows_intelx86.exe]
argument [1]: [--max_sites]
argument [3]: [--blocks]
argument [4]: [0.9]
argument [5]: [0.05]
argument [6]: [0.05]
argument [7]: [--motifs]
type_string [forward]
width_string [16]
type_string [reverse]
width_string [16]
argument [11]: [--enable_shifting]
argument [14]: [--print_best_sites]
argument [16]: [--print_current_sites]
argument [17]: [--sequence_file]
argument [19]: [--seed]
argument [21]: [--burn_in_period]
argument [23]: [--current_sites]
argument [24]: [sites.txt]
blocks: 0.900000 0.050000 0.050000
seeding: 1010282558
incremented counts from checkpoint for [2066] sequences.

</stderr_txt>

Michael

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 740 - Posted: 12 May 2011, 19:22:56 UTC


Is this not an invalid state, or are the units checked against a valid statement?

name: test_newmotifs_3_6328_540000
application: Gibbs sampler
created: 12 May 2011 | 18:26:11 UTC
minimum quorum: 2
initial replication: 2
max # of error/total/success tasks: 2, 4, 2

Task  Computer  Sent  Time reported or deadline  Status  Run time (sec)  CPU time (sec)  Credit  Application
571994  2535  12 May 2011 | 18:26:21 UTC  12 May 2011 | 19:17:47 UTC  Completed, validation inconclusive  1,603.41  1,516.53  pending  Gibbs sampler v0.18
571995  1648  12 May 2011 | 18:26:21 UTC  12 May 2011 | 19:13:53 UTC  Validate error  1,350.68  1,326.02  ---  Gibbs sampler v0.18
573657  1892  12 May 2011 | 19:18:01 UTC  14 May 2011 | 7:18:01 UTC  In progress  ---  ---  ---  Gibbs sampler v0.18

michael

skgiven
Joined: 3 May 11
Posts: 31
Credit: 827,977
RAC: 0
Message 745 - Posted: 13 May 2011, 9:35:35 UTC - in response to Message 740.

Hi Michael,
This is the only pending task you have at the minute, http://dnahome.cs.rpi.edu/dna/workunit.php?wuid=280583

test_newmotifs_4_7505_1070000

581955 1900 13 May 2011 | 2:23:10 UTC 14 May 2011 | 14:23:10 UTC In progress --- --- --- Gibbs sampler v0.18
581956 2535 13 May 2011 | 2:23:11 UTC 13 May 2011 | 5:53:14 UTC Completed, waiting for validation 2,631.58 1,609.11 pending Gibbs sampler v0.18

As you can see, you returned your task but it hasn't been returned by another cruncher yet; each task is crunched by two crunchers, unless one fails, in which case it is sent out again.

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 747 - Posted: 13 May 2011, 10:22:58 UTC

Thanks for your reply.

The question I was asking in both posts was this, expanded:

A WU is sent out to two hosts, A and B.

A returns as complete, awaiting validation (pending).

B returns and is compared with A's result; both get a status of Checked, but no consensus yet.

A keeps the Checked, but no consensus yet.

B is marked as validate error.

At this point in time, who is correct, A or B?

(I would expect B to keep [Checked, but no consensus yet] until C is returned.)

The WU is then sent to C.

C returns and is compared with A. Validation occurs.

Michael

skgiven
Joined: 3 May 11
Posts: 31
Credit: 827,977
RAC: 0
Message 748 - Posted: 13 May 2011, 13:16:13 UTC - in response to Message 747.

I get what you are saying now; A and B should remain at no consensus rather than B going to validate error, until C is returned, so C can be compared to both A and B.

While I suppose there might be some other problem that could cause a validation error (task failed after 3sec), that would be unlikely to happen repeatedly on different systems.
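
The sequence discussed here (two results that disagree, then a third that breaks the tie) can be sketched in a few lines of Python. This is only a simplified model, not the project's or BOINC's actual validator: the function name `check_quorum`, the host labels, and the rule that an unparsable result is flagged on its own without waiting for a quorum are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def check_quorum(outputs: Dict[str, Optional[str]], min_quorum: int = 2) -> Dict[str, str]:
    """Assign a status to every returned result of one workunit.

    `outputs` maps a host label to its parsed result, or to None when
    the result could not be parsed at all (assumed here to be what
    produces a "validate error").
    """
    statuses: Dict[str, str] = {}

    # Results that cannot be parsed are flagged immediately; they never
    # take part in the comparison.
    usable = {host: out for host, out in outputs.items() if out is not None}
    for host in outputs:
        if host not in usable:
            statuses[host] = "validate error"

    # A consensus exists once at least `min_quorum` usable results agree.
    winner = None
    counts = Counter(usable.values())
    if counts:
        value, n = counts.most_common(1)[0]
        if n >= min_quorum:
            winner = value

    for host, out in usable.items():
        if winner is None:
            statuses[host] = "inconclusive"  # "Checked, but no consensus yet"
        elif out == winner:
            statuses[host] = "valid"
        else:
            statuses[host] = "invalid"
    return statuses


# A and B disagree: both stay inconclusive until C is returned.
print(check_quorum({"A": "x", "B": "y"}))
# B returned nothing usable: it is flagged at once, A keeps waiting.
print(check_quorum({"A": "x", "B": None}))
# C agrees with A: quorum of 2 reached.
print(check_quorum({"A": "x", "B": None, "C": "x"}))
```

Under these assumptions, B would only jump straight to validate error if its result were unusable by itself; two readable but different results would both stay at no consensus.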

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 762 - Posted: 15 May 2011, 9:45:30 UTC

Just found another of these.

wuid=291964

Michael

Len LE/GE
Joined: 28 Jun 10
Posts: 17
Credit: 327,686
RAC: 0
Message 763 - Posted: 15 May 2011, 14:14:34 UTC - in response to Message 762.
Last modified: 15 May 2011, 14:16:02 UTC

Like I did on the first look, you missed the obvious ;)

This example (like the first one too) has only the start arguments but no result data in it.

For a valid WU you should see something like:


incremented counts from checkpoint for [2066] sequences.
<current_sites>
1,392:1,589.
1,219:0,389.
...
...
...
0,30:0,106.
0,41:1,71.
</current_sites>
17:50:25 (5060): called boinc_finish

</stderr_txt>


Missing result = invalid

Question is, what caused that?
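
A quick way to spot such tasks is to check the stderr text for a non-empty `<current_sites>` block. The sketch below assumes the stderr output has been saved as a plain string; the function name `has_result_block` is hypothetical, and the project's real validator may well test something different.

```python
import re


def has_result_block(stderr_text: str) -> bool:
    """Return True if the text contains a non-empty <current_sites> block."""
    match = re.search(r"<current_sites>(.*?)</current_sites>", stderr_text, re.DOTALL)
    return match is not None and match.group(1).strip() != ""


# Shortened samples modelled on the outputs quoted in this thread.
good = """incremented counts from checkpoint for [2066] sequences.
<current_sites>
1,392:1,589.
1,219:0,389.
</current_sites>
"""
bad = "incremented counts from checkpoint for [2066] sequences.\n"

print(has_result_block(good))  # True
print(has_result_block(bad))   # False
```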

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Feb 10
Posts: 364
Credit: 262,678
RAC: 0
Message 766 - Posted: 15 May 2011, 16:49:37 UTC - in response to Message 763.
Last modified: 15 May 2011, 16:49:50 UTC

It looks like he's having some kind of computation error, maybe some kind of segmentation fault... I wonder why nothing showed up in the error log...

Does this happen to all your workunits, or just some of them? What system is it happening on?

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 769 - Posted: 15 May 2011, 17:56:03 UTC - in response to Message 766.
Last modified: 15 May 2011, 18:00:29 UTC

Good Evening
No, the problem is not just my host; I have seen it on quite a few hosts. Most WUs from both of my hosts, 2535 and 2585, are successful (possibly 99+%).

The ones logged so far are on host 1648 and host 280, and on both my hosts, 2535 and 2585.

As indicated in post 763, I did miss the stderr_txt.

They would only show up in the pending state, as I only noticed them while checking outstanding WUs.

If you look at WU 291964, host 280's result still shows validate error while mine (2586) shows completed and validated.

There will be a few more, no doubt. These are the ones I have reported. Perhaps we should start a thread just for errors to see how many there are or have been.

edit: OS on both hosts is Vista; I do not know what the other hosts are running.


Michael

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Feb 10
Posts: 364
Credit: 262,678
RAC: 0
Message 770 - Posted: 15 May 2011, 18:15:29 UTC - in response to Message 769.

It looks like some crash happened on the gas giants computer. If you look at that task, it's not outputting the <current_sites>...</current_sites> that the other workunits are:

http://dnahome.cs.rpi.edu/dna/result.php?resultid=605015

vs

http://dnahome.cs.rpi.edu/dna/result.php?resultid=605016
http://dnahome.cs.rpi.edu/dna/result.php?resultid=609620

Chances are that's because this workunit crashed sometime during its execution. Not quite sure why it would be crashing while the other ones are running fine...

There may be some kind of memory error when I'm setting up the application; either that or it could be some kind of problem with that machine.

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 771 - Posted: 15 May 2011, 18:25:56 UTC - in response to Message 770.
Last modified: 15 May 2011, 18:33:09 UTC

Travis Desell.

Thank you. I wonder if it may be happening more often than reported.

i.e. validate error rather than no consensus.

edit: just found

wuid=310410

michael

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 8 Feb 10
Posts: 364
Credit: 262,678
RAC: 0
Message 772 - Posted: 15 May 2011, 20:50:43 UTC - in response to Message 771.

That's very interesting, because it looks like it's running to completion, but not outputting any results.

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 775 - Posted: 16 May 2011, 14:12:35 UTC
Last modified: 16 May 2011, 14:55:20 UTC

Sequence of events.

Until today WU 100% complete: status running, status ready to report, [almost instant to 1 second.] status clear.

Today it appears as WU 100% complete: status running, status ready to report, [between 5 to 10 seconds.] status clear.

Could this have been the cause of the loss of the stderr_txt data?

edit: I cannot ever recall seeing an uploading status.

edit 2: I spoke too soon; this one was returned almost instantly: 100%, running, ready to report, clear.

wuid=326887

Michael

I do not know if you have adjusted anything, but so far no errors.

Michael

father ambrose
Joined: 6 May 11
Posts: 10
Credit: 153,106
RAC: 0
Message 945 - Posted: 24 Oct 2011, 8:36:43 UTC
Last modified: 24 Oct 2011, 8:37:00 UTC

I see the validate errors still exist. Has anybody else been getting them?

wuid=277270

wuid=277580

Michael



Copyright © 2014 Travis Desell, Lee Newberg, Malik Magdon-Ismail and Boleslaw K. Szymanski