Troubleshooting

This article describes common problems that can occur when upgrading RapidMiner Server.

Timeout during RapidMiner Server start

You might see the following log lines in the server.log file within the RapidMiner Server home directory:

ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "native-interface")
]'
ERROR [org.jboss.as.controller.client] (Controller Boot Thread) JBAS014781: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@2821a6c1 for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException

Explanation: On the first start after an upgrade, JBoss can take much longer than usual to deploy and then time out. This can happen, for example, when the new version adds a column and an existing table has to be migrated. If such a table is large, the migration can exceed the JBoss deployment timeout, which is 300 seconds by default.

Solution: The property jboss.as.management.blocking.timeout determines how long a deployment may take before JBoss aborts it. To work around the problem, temporarily increase the timeout, here to 3600 seconds, by starting RapidMiner Server with the following command (use standalone.bat on Windows):

./bin/standalone.sh -Djboss.as.management.blocking.timeout=3600

Once the upgrade has completed successfully, the increased timeout is no longer required.
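As a minimal sketch of the workaround, the following shell helpers check a log file for the JBAS013412 timeout error and start the server once with the longer timeout. The variable SERVER_HOME and both function names are hypothetical and only serve as illustration; the log path is passed in explicitly because its exact location depends on your installation.

```shell
#!/bin/sh
# Sketch only. SERVER_HOME is a hypothetical variable pointing at the
# RapidMiner Server home directory; adjust it to your installation.
SERVER_HOME="${SERVER_HOME:-.}"

# Succeeds if the given log file contains the JBoss boot timeout error.
has_boot_timeout() {
    grep -q 'JBAS013412' "$1"
}

# Starts the server with the temporary one-hour deployment timeout
# described above. Use this only for the first start after the upgrade.
start_with_long_timeout() {
    "$SERVER_HOME/bin/standalone.sh" -Djboss.as.management.blocking.timeout=3600
}
```

A possible use would be to call has_boot_timeout with the path to server.log and, if it succeeds, call start_with_long_timeout for the next start. Later starts can go back to the regular start script.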

Overlapping Job Container ports on a single host

With RapidMiner Server 9.5, the Job Container architecture changed fundamentally. Communication between a Job Agent and its Job Containers now requires system ports, which are used only locally on the machine where the Job Agent is deployed.

If multiple Job Agents are hosted on a single machine and define the same ports, you might see the following log lines:

Job container '1' cannot be spawned, because port '10000' is not available
Job container '1' started successfully with PID 'null'.

This occurs when multiple Job Agents hosted on the same machine define the same value for the jobagent.container.listenPortRangeStart property. To resolve it, define a distinct port range start for each Job Agent deployed on that machine so that the ports of their Job Containers do not overlap.
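To spot such a conflict before starting the Job Agents, a check along the following lines might help. It is a sketch under assumptions: the function name is hypothetical, the properties file paths are whatever your Job Agents actually use, and only lines of the form key=value are recognized. It reports identical start values only and does not verify that the ranges are spaced far enough apart.

```shell
#!/bin/sh
# Print every jobagent.container.listenPortRangeStart value that is
# defined by more than one of the given Job Agent properties files.
find_duplicate_port_starts() {
    for file in "$@"; do
        # Extract the numeric value, tolerating spaces around '='.
        # If a file defines the property several times, keep the last one.
        sed -n 's/^[[:space:]]*jobagent\.container\.listenPortRangeStart[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$file" | tail -n 1
    done | sort -n | uniq -d
}
```

Calling find_duplicate_port_starts with the properties file of each Job Agent on the machine prints nothing when all start values differ, and prints each shared value once otherwise.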

Job Archive contains pending or running jobs

If you upgraded to 9.10.4 before all executions had finished (see the instructions on the changelog page), executions in a non-final state appear in the Only archived executions view. This is expected because the underlying database migration only renamed the tables, adding the a_ prefix.

Those jobs are also not picked up by the Job Cleanup because their state is not final.

To resolve this, manually delete the archived jobs that are not in a final state from the Job Archive tables.

Remember to always back up the database before executing any destructive database operation.

Here's an example for PostgreSQL of how to list all jobs in the archive table that are still in a non-final state:

SELECT * FROM a_jobservice_job WHERE state IN ('PENDING', 'STARTING', 'RUNNING');

If you then execute a DELETE statement, make sure that it also removes the corresponding rows in related tables such as a_jobservice_job_error and a_jobservice_job_log.
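Before deleting anything, it may be worth backing up the archive tables and counting the affected rows. The sketch below uses the standard PostgreSQL tools pg_dump and psql. The database name, the backup file name, and both function names are hypothetical placeholders, and connection options such as host and user are left out. It deliberately contains no DELETE statement, because the column names linking the tables are not given in this article.

```shell
#!/bin/sh
# Sketch only. DB_NAME and BACKUP_FILE are hypothetical placeholders;
# connection options (host, user, password) are omitted on purpose.
DB_NAME="${DB_NAME:-rapidminer_server}"
BACKUP_FILE="${BACKUP_FILE:-job_archive_backup.sql}"

# Dump the three archive tables named in this article to a plain SQL file.
backup_job_archive() {
    pg_dump --dbname="$DB_NAME" \
        --table=a_jobservice_job \
        --table=a_jobservice_job_error \
        --table=a_jobservice_job_log \
        --file="$BACKUP_FILE"
}

# Read-only check: count archived jobs per non-final state.
count_non_final_archived_jobs() {
    psql --dbname="$DB_NAME" --command \
        "SELECT state, COUNT(*) FROM a_jobservice_job WHERE state IN ('PENDING', 'STARTING', 'RUNNING') GROUP BY state;"
}
```

Running backup_job_archive first and count_non_final_archived_jobs second gives you a restore point and a row count to compare against the result of the later DELETE.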