Previous Topic: BFC Upgrade TipsNext Topic: Grid Controller and Grid Node Troubleshooting


Brownbag Session - 2012-02-09 - Backbone Fabric Controller 3.1: Logging

Notes from the meeting.

Overview: BFC Log Naming

How do logs get named? It seems to make no sense.

Roll based on file size:

ContainerX_python.log

Purpose

BFCLog.x

Purpose

Erlang.log.x

Purpose

ContainerX_erlang.log.x

Purpose

Stream of Consciousness (SOC)

Purpose

Boot sequence for a server going into a XEN grid:

  1. Raw hardware is power cycled
  2. Server boots onto utility image
  3. Utility image copies over deployer code to memory and executes
  4. Server returns deployer results to BFC
  5. Server boots gridOS image
  6. BFC runs aldo new which does the rest to get the gridOS running on the server, copy over applications, catalogs etc.

Boot sequence for a server going into a VMware grid:

  1. BFC sets up kickstart
  2. Raw hardware is power cycled
  3. Server boots linux installer for vmware and installs
  4. Server returns deployer results to BFC
  5. Server boots girdOS image
  6. BFC runs aldo new which does the rest to get the gridOS running on the server, copy over applications, catalogs, etc.

Inventory and Deployer Logs

Miscellaneous Logs

Mnesia and Erlang Processes

Mnesia

Erlang Processes

Heartbeat Process

Common Issues

  1. Server does not discover
  2. Server quarantined after discovery
  3. Grid create fails because servers never report deployer results
  4. No versions available when doing a version import
  5. Server "stuck" on an add or remove due to ssh issues (and takes 2 hours to time out)
1. Server Does Not Discover
2. Server quarantined after discovery

SOC

Server is quarantined due to no external IP reserved for this server:

2012-02-08 14:17:12 : Raw Hardware (1667) - State information is now: Compute Host Power Status: Compute host is currently under active power control..
2012-02-08 14:17:12 : Raw Hardware (1667) - State information is now: IP Assignment: Unable to find a Subnet with IP capacity for: 'bbc/applogic_external_network' usage..
2012-02-08 14:17:07 : Raw Hardware (1667) - State information is now : Compute Host Inventory Status: Configured IPMI power user: 'PowerAdmin__BFC' in slot: '3' with a system generated password..
2012-02-08 14:17:07 : Raw Hardware (1667) - State information is now: Compute Host Inventory Status: Received inventory data from server..
2012-02-08 14:16:50 : Raw Hardware (1667) - State information is now: Power IP Assignment: Reserved Power IP: '10.10.175.22' for: '00:19:B9:DD:C4:CD'..
2012-02-08 14:16:50 : Raw Hardware (1667) - State information is now: Compute Host Inventory Status: Awaiting inventory data from server..
2012-02-08 14:16:50 : Raw Hardware (1667) - State information is now: IP Assignment: Allocated:'192.168.0.225' to: '00:19:B9:DD:C4:CD' for: 'bbc/applogic_backbone_network' usage..

BFCLog.x

Search for the component ID of the server (1667 in this case)

Grep -A 12 :1667 BFCLog.*

2012-02-08 14:17:12.426 INFO ['core/resources.compute_host':1667]

compute_host_resource_driver:598

Setting State Info Badge for: 1667, to: [{<<"args">>,
[{1,
<<"bbc/applogic_external_network">>}]},
{<<"category_code">>,
<<"core/services.compute_host.category_compute_host_ips">>},
{<<"category_fmt>>,
<<"IP Assignment">>},
{<<"fmt">>,
<<"Unable to find a Subnet with IP capacity for: '{0}' usage.">>}]
3. Grid create fails: servers never report deployer results

SOC

2012-02-07 13:56:33 : Grid xen (1469) is now failed but remains partially running.
2012-02-07 13:56:33 : Grid xen (1469) - State information is now: Failed in starting the resource: Could not acquire enough nodes to meet min.
2012-02-07 13:56:26 : Xen Server (1542) is now failed but remains partially running.
2012-02-07 13:56:26 : Xen Server (1542) - State information is now: Failed in starting the resource: Utility Image Deployer Results Status never recieved, file: 
"/var/bfcadmin/deployer/results.192.168.0.33".
2012-02-07 13:56:18 : Xen Server (1533) is now failed but remains partially running.
2012-02-07 13:56:18 : Xen Server (1533) - State information is now: Failed in starting the resource: Utility Image Deployer Results Status never received, file: 
"/var/bfcadmin/deployer/results.192.168.0.34".
2012-02-07 13:34:42 : Xen Server (1533) is now being booted.
2012-02-07 13:34:42 : Xen Server (1542) is now being booted.
2012-02-07 13:34:42 : Xen Server (1533) - State information is now: Deploying image to host.
2012-02-07 13:34:42 : Linux Control Utility (1532) is now running.
2012-02-07 13:34:42 : Xen Server (1542) - State information is now: Deploying image to host.
2012-02-07 13:34:42 : Linux Control Utility (1541) is now running.
2012-02-07 13:34:42 : Linux Control Utility (1532) is now being booted.
2012-02-07 13:34:42 : Raw Hardware (1361) is now running.
2012-02-07 13:34:42 : Linux Control Utility (1541) is now being booted.
2012-02-07 13:34:42 : Raw Hardware (1423) is now running.
2012-02-07 13:34:42 : Raw Hardware (1361) - State information is now: Compute Host Power Status: Compute host has no power controller and will never be shut down. If a power failure occurs and the host does not return to online status, user intervention will be required.
2012-02-07 13:34:42 : Raw Hardware (1423) - State information is now: Compute Host Power Status: Compute host's power controller is currently unreponsive.  It will be reconfigured at the next reboot. If a power failure occurs prior to the next reboot and the host does not return to online status, user intervention will be required.
2012-02-07 13:34:42 : Raw Hardware (1361) is now being booted.
2012-02-07 13:34:42 : Raw Hardware (1361) - State information has been cleared.
2012-02-07 13:34:42 : Raw Hardware (1423) is now being booted.
2012-02-07 13:34:42 : Raw Hardware (1423) - State information has been cleared.
2012-02-07 13:34:42 : Linux Control Utiltiy (1541) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Raw Hardware (1423) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Xen Server (1533) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Raw Hardware (1361) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Linux Control Utiltiy (1532) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Xen Server (1533) is now waiting for other resources to boot before booting.
2012-02-07 13:34:42 : Grid xen (1469) - State information is now: Creating Grid: Installing the grid servers.
2012-02-07 13:34:42 : Grid xen (1469) - State information is now: Creating Grid: There were no unneeded servers..
2012-02-07 13:34:42 : Grid xen (1469) - State information is now: Creating Grid: Checking for and removing any unneeded servers.
2012-02-07 13:34:42 : Grid xen (1469) is now being booted.
/var/bfc/admin/deployer
4. No version savailable when doing a version import

/opt/bfc/logs/ContainerX_python.log

Download directory does not contain any AppLogic versions, and/or contains extraneous directories:

2012-02-08 12:47:57,895|grid_service_driver|ERROR|:1257 - Invalid AppLogic version directory:/tmp/3.5-522
2012-02-08 12:47:57,896|grid_service_driver|ERROR|:1257 - Missing packing.slip
2012-02-08 12:47:57,900|grid_service_driver|ERROR|:1257 - Invalid AppLogic version directory:/tmp/wmtrace
2012-02-08 12:47:57,902|grid_service_driver|ERROR|:1257 - Missing packing.slip
2012-02-08 12:47:57,906|grid_service_driver|ERROR|:1257 - Invalid AppLogic version directory:/tmp/orbit-root
2012-02-08 12:47:57,907|grid_service_driver|ERROR|:1257 - Missing packing.slip
2012-02-08 12:47:57,911|grid_service_driver|ERROR|:1257 - Invalid AppLogic version directory:/tmp/3.1-ga 
2012-02-08 12:47:57,913|grid_service_driver|ERROR|:1257 - Missing packing.slip
5. Server "stuck" on an add or remove due to ssh issues

SOC

No activity for a long time (in this case, these entries were taken at 1:55) (could also indicate the deployer hanging)

2012-02-08 12:55:38 : Grid xen (1469) - State information is now: Creating Grid: Installing the grid servers.
2012-02-08 12:55:38 : Grid xen (1469) - State information is now: Creating Grid: There were no unneeded servers..
2012-02-08 12:55:38 : Grid xen (1469) - State information is now: Creating Grid: Checking for and removing any uneeded servers.
2012-02-08 12:55:38 : Grid xen (1469) is now being booted.
BFCLog.x
2012-02-07 13:34:26,172|grid_resource_driver|DEBUG|:1469 - attempting to connect to version 1294
2012-02-07 13:34:26,175|grid_resource_driver|DEBUG|:1469 - versions before []
2012-02-07 13:34:26,179|grid_resource_driver|DEBUG|:1469 - versions after []
2012-02-07 13:34:26,190|grid_version_service_driver|DEBUG|:1227 - consumeCapacity()
2012-02-07 13:34:26,193|fs_storage_service_driver|DEBUG|:1203 - fs_storage_service_driver.getInstances()
2012-02-07 13:34:26,211|fs_storage_service_driver|DEBUG|:1203 - fs_storage_service_driver.getconsumeCapacity()
2012-02-07 13:34:26,249|grid_version_service_driver|DEBUG|:1227 - releaseCapacity()
2012-02-07 13:34:26,250|fs_storage_service_driver|DEBUG|:1203 - fs_storage_service_drive.releaseCapacity()
2012-02-07 13:34:27,256|grid_resource_driver|DEBUG|:1469 - connected resource to 1294 - new conn 1498 - conn list [1498]
2012-02-07 13:34:27,264|grid_resource_driver|DEBUG|:1469 - setPrivateHotfixes(), hotfixes: []
2012-02-07 13:34:27,265|grid_resource_driver|DEBUG|:1469 - create_hotfix_connections(), hotfixes: []
2012-02-07 13:34:27,267|grid_resource_driver|DEBUG|:1469 - getHotfixes
2012-02-07 13:34:27,275|grid_service_driver|DEBUG|:1257 - reserveIPs(), newGrid: 1469
2012-02-07 13:34:27,276|grid_service_driver|DEBUG|:1257 - reserveIPs(), AppIPs:[(Atom('grid_ip_range'), 1422,'public','10.10.75.143', '10.10.75.143')]
2012-02-07 13:34:27,277|grid_service_driver|DEBUG|:1257 - extractIpRange(), ipRange:(Atom('grid_ip_range'), 1422,'public','10.10.75.143', '10.10.75.143')
2012-02-07 13:34:27,278|grid_service_driver|DEBUG|:1257 - ipRangeList:[(1422,'10.10.75.143','10.10.75.143')]
2012-02-07 13:34:27,304|grid_service_driver|DEBUG|:1257 - reserveIPs,appIpSubPoolId: 1463
2012-02-07 13:34:27,305|grid_util|DEBUG|:1257 - grid_ip_pool_name() Name: GridIpPool_1469_1422
2012-02-07 13:34:28,697|grid_service_driver|DEBUG|:1257 - gridSubPoolId:1499
2012-02-07 13:34:28,700|grid_resource_driver|DEBUG|:1469 - addIpResPoolConn(), resPoolId: 1499
2012-02-07 13:34:31,286|grid_service_driver|DEBUG|:1257 - createIpPoolConnections() gridSubpoolId: 1499
2012-02-07 13:34:31,287|grid_service_driver|DEBUG|:1257 - grid res pool conn: 1539
2012-02-07 13:34:31,288|grid_service_driver|DEBUG|:1257 - validateControllerIp(),gridId: 1469, controllerIP: 10.10.75.142
2012-02-07 13:34:31,324|grid_util|DEBUG|:1257 - grid_ip_pool_name() Name: GridIpPool_1469_1422
2012-02-07 13:34:31,356|grid_util|DEBUG|:1257 - grid_ip_pool_name() Name: GridIpPool_1469_1422
2012-02-07 13:34:33,524|grid_util|DEBUG|:1257 - controller subpool ID: 1510
2012-02-07 13:34:33,528|grid_service_driver|DEBUG|:1257 - set_private_resource_fields()
2012-02-07 13:34:33,535|grid_service_driver|DEBUG|:1257 - set_private_resource_fields(),acctId: undefined, defaultAccountId:dev_optimize
2012-02-07 13:34:33,543|grid_resource_driver|DEBUG|:1469 - setPrivateFields(),args: cUser: cadev@ca.com, accountKeyfile: undefined
2012-02-07 13:34:33,563|grid_service_driver|DEBUG|:1257 - releaseCapacity()
2012-02-07 13:34:33,727|grid_resource_driver|DEBUG|:1469 - ready()
2012-02-07 13:34:33,738|grid_resource_driver|DEBUG|:1469 - adding 2 xen and 0 esx nodes to connection
2012-02-07 13:34:33,743|grid_resource_driver|DEBUG|:1469 - adding #1
2012-02-07 13:34:35,226|grid_resource_driver|DEBUG|:1469 - No cached version found, trying new
2012-02-07 13:34:37,561|grid_resource_driver|DEBUG|:1469 - Configure GridOS:
2012-02-07 13:34:37,562|grid_resource_driver|DEBUG|:1469 -  Version      => 1294
2012-02-07 13:34:37,564|grid_resource_driver|DEBUG|:1469 -  Striping enabled => False
2012-02-07 13:34:37,565|grid_resource_driver|DEBUG|:1469 -  Stripe size    =>64
2012-02-07 13:34:37,566|grid_resource_driver|DEBUG|:1469 -   add result for bbc/properties.gridos_node_xen: 1543
2012-02-07 13:34:37,568|grid_resource_driver|DEBUG|:1469 - adding #2
2012-02-07 13:34:38,619|grid_resource_driver|DEBUG|:1469 - No cached version found, trying new
2012-02-07 13:34:41,870|grid_resource_driver|DEBUG|:1469 - Configure GridOS:
2012-02-07 13:34:41,872|grid_resource_driver|DEBUG|:1469 -  Version      => 1294
2012-02-07 13:34:41,873|grid_resource_driver|DEBUG|:1469 -  Striping enabled => False
2012-02-07 13:34:41,874|grid_resource_driver|DEBUG|:1469 -  Stripe size   => 64
2012-02-07 13:34:41,876|grid_resource_driver|DEBUG|:1469 -   add result for bbc/properties.gridos_node_xen: 1534
2012-02-07 13:34:41,877|grid_resource_driver|DEBUG|:1469 - added 2 xen and 0 esx nodes to connection
2012-02-07 13:34:41,928|grid_version_resource_driver|DEBUG|:1294 - getGridOSHost()
2012-02-07 13:34:41,946|grid_version_resource_driver|DEBUG|:1294 - getGridOSHost(),osHost: 192.168.0.32
2012-02-07 13:34:41,950|grid_version_resource_driver|DEBUG|:1294 - getGridOSHost()
2012-02-07 13:34:41,956|grid_version_resource_driver|DEBUG|:1294 - getGridOSHost(),osHost: 192.168.0.32
2012-02-07 13:34:41,959|grid_version_resource_driver|DEBUG|:1294 - getGridOSHost()
2012-02-07 13:34:41,963|fs_storage_resource_driver|DEBUG|:1293 - getStorageBaseDir(), rootdir: /opt/bfc/applogic_versions
2012-02-07 13:34:41,965|grid_version_resource_driver|DEBUG|:1294 - imageBaseDir: /opt/bfc/applogic_versions
2012-02-07 13:34:41,980|grid_version_resource_driver|DEBUG|:1294 - getGridOSPath()
2012-02-07 13:34:41,985|fs_storage_resource_driver|DEBUG|1293 - getStorageBaseDir(), rootdir: /opt/bfc/applogic_viersions
2012-02-07 13:34:41,990|grid_version_resource_driver|DEBUG|:1294 - imageBaseDir: /opt/bfc/applogic_versions
2012-02-07 13:34:42,188|grid_resource_driver|DEBUG|:1469 - removing resources, removal field: []

And then nothing else.