AlwaysON

AlwaysON database not synchronizing suspect mode

Advertisements

 

I got a call the database, not online/available mode, we suspect there was a corruption. One of my AlwaysON secondary database went suspect mode, it’s because of the log file and drive was full. I tried to resume the database. ALTER DATABASE dbname SET HADR RESUME;.

It went “in recovery“ phase and fails. Since, it does not have even 1 MB space in the log file and drive to do a recovery phase.

The drive has a good capacity for log, even though there was a huge transaction from application and the important one transaction log backup failed for five hours. I have a transaction log backup every five minutes. Since it is a high load critical OLTP database. The log full caused by a transaction log backup failure and the active massive open transactions. We had issues with NetBackup storage.

Error 3414 is a generic error, it’s triggered by SQL, when the database is going to suspect mode. It does not mean that we had a corruption and need to run and follow the corruption procedure. All I want to say read the error log and understand the issue, work on accordingly. This was fixed after adding a storage space on the drive with HADR resume and “7407 transactions rolled forward in database”.

In the primary:

Error: 9002, Severity: 17, State: 2.

The transaction log for database ‘DB’is full due to ‘LOG_BACKUP’.

 

From MSDN: https://msdn.microsoft.com/en-us/library/ff877972.aspx

SUSPEND_FROM_REDO = An error occurred during the redo phase

SUSPEND_FROM_APPLY = An error occurred when writing the log to file (see error log)

In the secondary:

Error: 3414, Severity: 21, State: 1.

An error occurred during recovery, preventing the database ‘DBName’ (7:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

For more on 3414: https://support.microsoft.com/en-in/kb/2015741

In our case, it is a different issue and error: 3414 is not applicable.

Errors from the error log

Error: 18210, Severity: 16, State: 1.

BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VNBU0-94692-3252-1473459870’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

Error: 17053, Severity: 16, State: 1.

L:\SQL_Log\DBNAME_LOG.LDF: Operating system error 112(There is not enough space on the disk.) encountered.

Error: 5149, Severity: 16, State: 3.

MODIFY FILE encountered operating system error 112(There is not enough space on the disk.) while attempting to expand the physical file ‘L:\SQL_Log\DBNAME_LOG.LDF’.

AlwaysOn Availability Groups data movement for database ‘DBName’ has been suspended for the following reason: “system” (Source ID 4; Source string: ‘SUSPEND_FROM_APPLY‘). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

Error: 3041, Severity: 16, State: 1.

ALTER DB param option: RESUME

AlwaysOn Availability Groups data movement for database ‘DBName’ has been resumed. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

 

Error: 3313, Severity: 21, State: 1.

During redoing of a logged operation in database ‘DBName’, an error occurred at log record ID (256072:1023498:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.

AlwaysOn Availability Groups data movement for database ‘DBName’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

Error: 3414, Severity: 21, State: 1.

An error occurred during recovery, preventing the database ‘DBName’ (7:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Error: 926, Severity: 14, State: 1.

Database ‘DBName’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

 

ALTER DB param option: RESUME

AlwaysOn Availability Groups data movement for database ‘DBName’ has been resumed. This is an informational message only. No user action is required.

Nonqualified transactions are being rolled back in database DBName for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

State information for database ‘DBName’ – Hardended Lsn: ‘(256072:1024000:1)’    Commit LSN: ‘(0:0:0)’    Commit Time: ‘Jan  1 1900 12:00AM’

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

State information for database ‘DBName’ – Hardended Lsn: ‘(256072:1024000:1)’    Commit LSN: ‘(0:0:0)’    Commit Time: ‘Jan  1 1900 12:00AM’

Starting up database ‘DBName’.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 903 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 902 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.

Recovery of database ‘DBName’ (7) is 0% complete (approximately 902 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

 

After adding a storage space on the partition/drive. The database went throw the recovery phase and back in online ALTER DATABASE dbname SET HADR RESUME;.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

7407 transactions rolled forward in database ‘DBName’ (7:0). This is an informational message only. No user action is required.

Recovery completed for database DBName (database ID 7) in 24 second(s) (analysis 1561 ms, redo 13390 ms, undo 0 ms.) This is an informational message only.

No user action is required.

AlwaysOn Availability Groups connection with primary database established for secondary database ‘DBName’ on the availability replica ‘SERVER’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Error: 35285, Severity: 16, State: 1.

The recovery LSN (256072:1024000:1) was identified for the database with ID 7. This is an informational message only. No user action is required.

CHECKDB for database ‘DBName’ finished without errors on 2016-09-03 04:56:19.650 (local time). This is an informational message only; no user action is required.

 

At the same time, I had WSFC issue, my secondary node went offline in the fail over cluster manager. But, the node is online and ping. I tried to bring online, but no luck. I had some automatic fail-overs and node was up and down. The fail over cluster manager was not working and even does not show the cluster name. I manually tried to connect the cluster by clicking –> connect to clusters –> Typed the name –> Timeout and cannot able to connect.

I validated the cluster status and vote by PowerShell. The vote was zero, which means node cannot able to talk to the cluster. The cluster works fine, since it is a windows server 2012 with dynamic quorum.

I tried to validate the cluster by clicking –> validate configuration –> next –> Enter name

–> Click browse button –> the node of secondary has not added and other nodes are added.

See the image, the cluster name does not come in in the cluster manager and validation does not add the secondary node. It only added a primary and witness/quorum node.

I have no clue and the cluster errors are generic and reboot the secondary server, which shows the cluster name and all the nodes online in the WSFC. There was some glitch in the secondary node, which made the disconnection in the WSFC.

Errors from the alwaysON health check:

Message: A connection timeout has occurred while attempting to establish a connection to availability replica ‘server’ with id [3284E6A0-CA68-41ED-92CA-759757477E54]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

Statement: ALTER AVAILABILITY GROUP [Groupname] FAILOVER;

From Cluster log

The state of the local availability replica in availability group ‘GroupName’has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’.

The state changed because of a user initiated failover.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DB’on the availability replica ‘server’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

AlwaysOn: The local replica of availability group ‘GroupName’is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

The state of the local availability replica in availability group ‘GroupName’has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’.

The state changed because the availability group is coming online.  For more information, see the SQL Server error log,

Windows Server Failover Clustering (WSFC) management console, or WSFC log.

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DB’ on the availability replica ‘server’ with Replica ID: {3284e6a0-ca68-41ed-92ca-759757477e54}. This is an informational message only. No user action is required.

Conclusion: The suspect mode of alwaysON is because of log file full in the primary server, due to transaction log backup failure, it does not have any database level corruption.

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *

54 + = 63