Attached is a .txt file, which you can also find in the Evolutionary_Loop directory as Loop_Error_List.txt. It's a list of the errors we currently sometimes experience in the loop, along with how to fix them if you encounter them while running. If you encounter an error that isn't in the list, let everyone know in the #genstudents chat on Slack and update the file with the error message, when it was encountered (what state the loop was in), and the possible cause and solution if you know them.
Below is a list of errors we may encounter in the loop as of 11/25/20:
Error: "Pre-while, pre-for"
Description: This is an error you'll encounter after AraSim has "completed." The loop will hang after outputting "Pre-while" and then "Pre-for." This comes from the fitness function--the loop is indicating that it is inside the fitness function right before it enters the two loops it runs over the AraSim data. Hanging here indicates that there was an issue running AraSim. Specifically, it indicates that at least one of the jobs for the *first* individual in AraSim failed.
Potential causes:
This may be caused by an issue in generating the gain files used for AraSim. These gain files are placed in the AraSim directory under the names a_{num}.txt, where {num} represents the individual number. You can check a_1.txt in the AraSim directory and see if it's complete (if it isn't, you can usually tell just by opening the file and seeing that only two lines have been printed to it).
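If you'd rather not open each gain file by hand, the check above can be roughly scripted. This is just a sketch: the function name find_short_gain_files is made up for illustration, and the two-line cutoff comes from the observation above rather than from any guaranteed file format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every a_*.txt gain file in a directory that has
# at most max_lines lines (default 2, per the "only two lines" symptom).
# Usage: find_short_gain_files <AraSim directory> [max_lines]
find_short_gain_files() {
    local dir="$1"
    local max_lines="${2:-2}"
    local f n
    for f in "$dir"/a_*.txt; do
        [ -e "$f" ] || continue        # glob matched nothing
        n=$(( $(wc -l < "$f") ))       # arithmetic strips any padding from wc
        if [ "$n" -le "$max_lines" ]; then
            echo "$f"
        fi
    done
}
```

Calling it with the path to your AraSim directory prints the suspicious files, one per line; no output means every gain file has more than two lines.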
One way of looking for the cause of this error is to look at the job error files. Inside the {runname} directory are directories containing the error and output files from the AraSim jobs. These are .../{gen}_AraSim_Errors and .../{gen}_AraSim_Output, where {gen} is the generation number. One example of an error message I've seen in the error files was
"
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr
/var/spool/slurmd/job2345909/slurm_script: line 37: 171196 Aborted (core dumped) ./AraSim setup.txt $runNum outputs/ a_${num}.txt > $TMPDIR/AraOut_${gen}_${num}_${Seeds}.txt
"
This appeared in all of the error files for the first individual's jobs.
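To see quickly which error files contain a message like this one, you can search the whole error directory at once. A minimal sketch, assuming you pass in the path to the {gen}_AraSim_Errors directory yourself; the function name find_errors_with_phrase is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of files in an error directory that
# contain a given phrase. -F treats the phrase as a literal string and -l
# prints only the matching file names.
# Usage: find_errors_with_phrase <error directory> <phrase>
find_errors_with_phrase() {
    local err_dir="$1"
    local phrase="$2"
    grep -l -F -- "$phrase" "$err_dir"/* 2>/dev/null
}
```

For example, passing the error directory and the phrase std::out_of_range lists the jobs that died with the message shown above.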
Resolution:
The best way to resolve this is to start by checking the error files. In the case of the error message above, go to the AraSim directory and check for a_{num}.txt files. If you see just one (ex: a_1.txt), then that's likely to be the culprit, especially if the file is obviously not completely filled. These files should not remain in that directory, so a leftover one may not have been moved correctly, likely due to permissions errors. Remove the a_{num}.txt file and restart from the AraSim job submissions (potential speed up: add part of the error message to the self-correcting phrases in Part_D2_AraSeed.sh to only rerun those individuals).
It's also possible that the issue was caused in XF. Make sure you follow the instructions above and start back at stage 2 to restart from the beginning of XF if starting back from AraSim doesn't work. Try to take notes on the differences you see to add to this.
It's also notable that this may be caused by permissions issues. Every time someone hands the loop off to someone else, the OpenPermissions.sh script should be run (passing the {runName} as an argument). Look in that script to determine which files need to have open permissions. If the person who owns the closed files isn't available to open them, you can remove them and start back from where they would have been created. This usually occurs in AraSim; in that case, the AraOut files in Antenna_Performance_Metric should have their permissions fixed or be removed (for that generation only).
*****************************************************************************
Error: <Loop hangs while outputting dimensions and fitness scores>
Description:
This is similar to the error above, except that instead of hanging on the first individual, it hangs on some later individual.
Resolution:
The instructions for resolving this should be the same as the ones above. This seems to be less common and is usually resolved by the self-correcting code in Part_D2. Regardless, if you encounter this error, the first step should be to follow the instructions below in case only one or a handful of AraSim jobs failed. If that doesn't work, step back to stage 5 to resubmit the AraSim jobs after clearing out the possible offending files. If that still doesn't work, step back to XF. (If you're unsure that stepping back to AraSim will resolve this, you can always step back to XF right away to potentially save time.)
It's also possible that there is an error in just a handful (or even just one) of the AraSim jobs. This might be caused by opening permissions after someone has already taken over running the loop. In this case, you might be able to start the loop back up without resubmitting all of the AraSim jobs or stepping all the way back to XF. To do this, you'll need to figure out which AraSim job failed. Check the AraSim error files and output files for that generation (specifically, check whether one is *missing*). You should be able to figure out which individual the loop is stuck on by counting how many sets of dimensions and fitness scores were printed to the screen before the loop started hanging. Go to /Antenna_Performance_Metric (inside .../Evolutionary_Loop), list all of the AraOut files corresponding to that individual, and check whether any of them appear incomplete (i.e., don't have an effective volume at the bottom).
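Checking each AraOut file by eye gets tedious, so here is a rough way to script it. Treat everything here as an assumption: the function name find_incomplete_araout is made up, the AraOut_{gen}_{num}_{seed}.txt naming is guessed from the redirect target in the error message earlier in this list, and the marker text is passed in by you because the exact wording of the effective volume line isn't recorded here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print AraOut files for one individual whose last 20
# lines do NOT contain the given marker text (case-insensitive, literal).
# Assumed file naming: AraOut_<gen>_<individual>_<seed>.txt
# Usage: find_incomplete_araout <directory> <gen> <individual> <marker>
find_incomplete_araout() {
    local dir="$1" gen="$2" indiv="$3" marker="$4"
    local f
    for f in "$dir"/AraOut_"${gen}"_"${indiv}"_*.txt; do
        [ -e "$f" ] || continue        # glob matched nothing
        if ! tail -n 20 "$f" | grep -q -i -F -- "$marker"; then
            echo "$f"
        fi
    done
}
```

Open one AraOut file that you know finished correctly first, copy a distinctive piece of its final effective volume line, and use that as the marker. Any file the function prints is a candidate failed job.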
Once you find the individual jobs that failed, you can set up the loop to only rerun those jobs. First, go to the AraSim flags directory inside the RunName directory and populate the flags like so:
>for i in `seq 1 <NPOP>`
>do
>for j in `seq 1 <jobs per individual>`
>do
>echo <gen> > ${i}_${j}.txt
>echo $i >> ${i}_${j}.txt
>echo $j >> ${i}_${j}.txt
>done
>done
This will populate all of the flag files needed for AraSim to move on. Remove the flag files corresponding to the identified failed AraSim jobs. Next, go into the AraSim error file directory in the RunName directory (/<gen>_AraSim_Errors) and replace any text inside the error files corresponding to the failed AraSim jobs with the phrase "segmentation violation" (spelling and capitalization matter!). This is one of the phrases used by the self-correcting part of the loop in Part_D2, and it tells the loop to resubmit the AraSim job for that individual.
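The two manual steps above (removing a flag file and overwriting an error file) can be wrapped up for one failed job at a time. This is only a sketch: the function name mark_job_for_rerun is made up, the flag file name follows the ${i}_${j}.txt pattern from the loop above, and the error file path is passed in by hand because the error file naming isn't listed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mark one AraSim job for resubmission by
#   1) removing its flag file <individual>_<job>.txt, and
#   2) replacing the contents of its error file with the exact phrase the
#      self-correcting code looks for.
# Usage: mark_job_for_rerun <flags dir> <individual> <job> <error file>
mark_job_for_rerun() {
    local flags_dir="$1" indiv="$2" job="$3" err_file="$4"
    rm -f -- "$flags_dir/${indiv}_${job}.txt"
    echo "segmentation violation" > "$err_file"   # overwrites, not appends
}
```

Run it once per failed job, and double check the error file path before you do, since the file's old contents are overwritten.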
After doing this, you should be able to return to .../Evolutionary_Loop and change the savestate in /savestates from 7 to 6. Now you can start the loop back up, and it will tell you that it's waiting for the AraSim jobs to finish. After 1/2 minutes it will notice that the error files have "segmentation violation" in them and will resubmit only the AraSim jobs you specified as having failed.
*****************************************************************************
Error: "cannot connect to X11 forwarding" (or something to that effect)
Description:
This usually occurs during XF, but it may occur during the display of plots. In the case of plots being unable to display, the loop should still be able to operate, though plots might not update. However, if this message occurs during XF the data for the gain patterns for the antennas won't be properly created. This can occur at the first opening of the XF GUI or on the second part (after the xfsolver jobs have run).
Potential causes:
First, you should make sure that you are logged in to OSC using <ssh -XY userID>. The -XY enables X11 forwarding, which is needed for the XF GUI to appear. Also remember to indicate that you need X11 forwarding when requesting your interactive job (using --x11). It's also possible for your X11 forwarding connection to be interrupted after a long time (I've seen the loop work for several generations over multiple interactive job submissions and then suddenly get this error).
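Before starting XF, you can do a quick sanity check on X11 forwarding from your shell. This is a sketch, not part of the loop: the function name x11_forwarding_ready is made up, and it relies on ssh setting the DISPLAY variable when forwarding is active. A set DISPLAY doesn't guarantee the XF GUI will open, so the function also tries xdpyinfo if that tool happens to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if DISPLAY is set (and, when xdpyinfo is
# installed, if the X server behind it responds); returns 1 otherwise.
x11_forwarding_ready() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; X11 forwarding is not active." >&2
        return 1
    fi
    # xdpyinfo is optional; skip the deeper check if it isn't available.
    if command -v xdpyinfo >/dev/null 2>&1; then
        if ! xdpyinfo >/dev/null 2>&1; then
            echo "DISPLAY=$DISPLAY is set, but the X server did not respond." >&2
            return 1
        fi
    fi
    return 0
}
```

If the function fails, log out, log back in with -XY, and request a new interactive job with --x11 before restarting the loop.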
Resolution:
My advice is to log out and back in to OSC each time your interactive job ends. This is an uncommon error, but it's easy to miss. Once you've logged back in, you'll need to reset the save state to the part of XF where you got this error (either 2 or 3, depending on which part the error appeared in).
*****************************************************************************
Error: