Friday, May 6, 2011

Nagios




In our new VMware View and vCenter environment, we're humming along, although in a matter of three weeks we're nearly at capacity on our current hardware (mostly in memory and disk space usage). Our heavily loaded infrastructure has driven home how critical it is to set up a monitoring system for our servers and infrastructure components.

My previous experience has been limited to Microsoft SCOM, and I was not very impressed; granted, I never had the opportunity to really learn its ins and outs. I just know that we constantly had rogue alerts, false positives, and SCOM server problems (although these issues could very well be attributed to how it was set up, not the product itself).

Furthermore, our infrastructure is primarily Linux-based (RHEL Oracle servers and ESX hosts), so why pay for a Microsoft tool designed primarily to monitor Windows servers when the world of open source is available to us? Enter Nagios, a free, open-source IT infrastructure monitor. Like many open-source projects (think Red Hat, SLES, etc.), the core is available for free and there is a pay-for commercial version. With our limited budget and no need for elaborate performance metrics and reports, we opted for Nagios Core. While it requires a decent amount of time to set up the service and host definitions, install the correct plugins, and configure the client monitors, we've been very pleased with it as a monitoring solution. It's straightforward to set up and tweak, and we can customize and adjust alert levels for individual hosts. Thanks to the engineers over at op5 (a monitoring product based heavily on Nagios Core), we can even monitor our VMware virtual datacenter with Nagios. We're still waiting for our SMTP relay to be configured (the client wants ownership), so I can't yet opine on the email alert functionality, but thus far we've been very pleased.
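For those unfamiliar with Nagios Core, host and service definitions are plain-text objects; a minimal sketch follows. The host name, address, and NRPE-based disk check here are hypothetical, while the linux-server and generic-service templates come from the sample configuration that ships with Nagios Core:

```
define host {
    use          linux-server       ; inherit defaults from the sample template
    host_name    oracle-db01        ; hypothetical host
    alias        Oracle Database Server
    address      10.10.5.20
}

define service {
    use                  generic-service
    host_name            oracle-db01
    service_description  Disk Space
    check_command        check_nrpe!check_disk   ; assumes the NRPE addon is installed
}
```

Each host or service inherits sane defaults (check intervals, notification settings) from its template, so per-host tweaks stay short.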

Tuesday, March 29, 2011

RHEL iSCSI part deux

So as it turns out, the earlier post was not always the best approach to using iSCSI with RHEL. An important omission from the earlier post is that we are using Dell EqualLogic storage, and thus installed the EqualLogic Host Integration Tools (HIT). These tools help integrate iSCSI and multipathing into the server, changing the approach in the process. Below are the steps that we took to get our EqualLogic iSCSI storage online and working:

  1. Create a new volume using the administration console of the SAN and ensure that a CHAP user has been created that has access to the volume

  2. Install equallogic-host-tools on the server.

  3. As root (or using sudo), ensure that ehcmd is started:
    # service ehcmd status

  4. Edit /etc/iscsi/iscsid.conf, setting node startup to manual (otherwise, iSCSI will try to establish too many redundant connections):

    #*****************
    # Startup settings
    #*****************

    # To request that the iscsi initd scripts startup a session set to "automatic".
    # node.startup = automatic
    #
    # To manually startup the session set to "manual". The default is automatic.
    node.startup = manual


    Then enable CHAP and set the CHAP user and password:


    # *************
    # CHAP Settings
    # *************

    # To enable CHAP authentication set node.session.auth.authmethod
    # to CHAP. The default is None.
    node.session.auth.authmethod = CHAP

    # To set a CHAP username and password for initiator
    # authentication by the target(s), uncomment the following lines:
    node.session.auth.username = user
    node.session.auth.password = password


    ...


    # To set a discovery session CHAP username and password for the initiator
    # authentication by the target(s), uncomment the following lines:
    discovery.sendtargets.auth.username = user
    discovery.sendtargets.auth.password = password


  5. Run eqltune -v and follow any recommendations made by the tool

  6. Run ehcmcli -d to view the ehcmcli diagnostics information

  7. Check which subnets the EqualLogic host connection manager will manage:
    # rswcli -L

    Exclude any subnets that you do not wish to use with multipath. In our case, we have two interfaces for data on one subnet and two interfaces for iSCSI on two other subnets:
    # rswcli -E -network 10.52.195.128 -mask 255.255.255.128

    Run rswcli -L and ehcmcli -d again to confirm that the correct adapters/interfaces are being used.

  8. Run Discovery:
    # iscsiadm -m discovery -t st -p 10.10.5.50



    10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
    10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
    10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
    10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data

    Multiple identical records are returned because of the extra interfaces that the EqualLogic daemon creates.

  9. Log in to the target, specifying the interface:
    # iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data -I eql.eth1_0 -l

  10. Allow automatic logins using the node.startup property. Recall that this was set to manual in /etc/iscsi/iscsid.conf in the steps above; we don't want to open sessions on all interfaces, because the EqualLogic daemon creates multiple interfaces on the same adapters (eql.eth1_0, eql.eth1_1, etc). Instead, we allowed automatic startup on two interfaces, one for each adapter (eql.eth1_0 and eql.eth3_0):
    # iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data -I eql.eth1_0 -o update -n node.startup -v automatic

    # iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data -I eql.eth3_0 -o update -n node.startup -v automatic

  11. Edit /etc/equallogic/eql.conf [MPIO Autologins] section to specify which adapters to use:
    # Autologin configuration
    [MPIO Autologins]
    # ehcmd will update any targets configured for autologin so the boot time logins
    # are only made through two iscsiadm ifaces using independent ethernet ports.
    # The ehcmd will create additional iSCSI sessions if necessary at a later point.
    #
    # If you wish to control which ethernet interfaces are used for autologin,
    # list the desired interfaces below, using a separate line for each.
    # Example:
    # Port = eth0
    # Port = eth1
    Port = eth1
    Port = eth3


  12. Create a physical volume using the /dev/mapper/ name, not the /dev/sd* device name or one of the derivatives created by the EqualLogic daemon (eql-0-8a0906-d11b93e09-93720a256664d91d-a, eql-0-8a0906-d11b93e09-93720a256664d91d-oracle-datap1, etc):
    # pvcreate /dev/mapper/eql-0-8a0906-d11b93e09-93720a256664d91d-oracle-data

  13. Create the volume group on the newly created physical volume:
    # vgcreate SANVolGroup00 /dev/mapper/eql-0-8a0906-d11b93e09-93720a256664d91d-oracle-data

  14. Create the logical volume. In our case, we created the volume using the entire size of the SAN volume:
    # lvcreate --extents 100%VG --name ora_data2 SANVolGroup00

  15. Format the file system as ext3:
    # mkfs.ext3 /dev/SANVolGroup00/ora_data2

  16. Mount the new logical volume to a mount point:
    # mount /dev/SANVolGroup00/ora_data2 /oracle/ora_data2

  17. Finally, edit /etc/fstab to mount the logical volume on startup:
    /dev/SANVolGroup00/ora_data2 /oracle/ora_data2 ext3 _netdev,defaults 0 0

    Ensure that the _netdev mount option is used, so the volume is not mounted until the network interfaces are up.
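As a sanity check on the discovery step, the duplicated records can be reduced to the set of unique target IQNs with a few lines of script. This is a hypothetical helper, not part of the EqualLogic tools; it just picks the iqn token out of each discovery line:

```python
# Collapse iscsiadm discovery output to unique target IQNs. ehcmd's
# per-path ifaces cause each target to be reported once per interface,
# so identical records show up several times.
def unique_targets(discovery_output):
    iqns = set()
    for line in discovery_output.splitlines():
        for token in line.split():
            if token.startswith("iqn."):
                iqns.add(token)
    return sorted(iqns)

sample = """
10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
10.10.5.50:3260,1 iqn.2001-05.com.equallogic:0-8a0906-a884a4502-fdefc658d2e4d064-oracle-data
"""

for iqn in unique_targets(sample):
    print(iqn)
```

Feeding it the four identical records from step 8 prints the one real target, which is the IQN you then pass to the login command in step 9.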

Wednesday, March 23, 2011

RHEL Mapper stealing Filesystem

Today, we set up our RHEL server to use iSCSI with CHAP authentication to connect to our iSCSI array volumes. We found this tutorial very helpful, although instead of open-iscsi, we simply used Red Hat's iscsi-initiator-utils package (probably much the same thing).

One thing to note: after successfully authenticating to the array via CHAP and partitioning the block device (which was /dev/sdd) for LVM2, we could not successfully run pvcreate /dev/sdd1 to create the physical volume on which to build our logical volumes. When running pvcreate, we'd get the following:

#pvcreate /dev/sdd1
Can't open /dev/sdd1 exclusively. Mounted filesystem?

Running 'mount', however, wouldn't show the device mounted anywhere. As it turns out, after writing the LVM2 partition to the partition table, /dev/mapper grabs onto the device, which effectively "mounts" it within the mapper. In our case, it was listed in the mapper as three devices:

eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01p1
eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01
eql-0-8a0906-8a0b93e09-01120a256614d8a6_a


Note the long names referencing our EqualLogic iSCSI SAN. We found two workarounds:

  1. Run pvcreate using the /dev/mapper/ name, so:
    #pvcreate /dev/mapper/eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01p1
    This was less desirable, however, because of the long device name
  2. Remove the device from the /dev/mapper and then run pvcreate:

    #dmsetup ls
    #dmsetup remove eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01p1
    #dmsetup remove eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01
    #dmsetup remove eql-0-8a0906-8a0b93e09-01120a256614d8a6_a
    #pvcreate /dev/sdd1
    Physical volume "/dev/sdd1" successfully created

That solved our problem, allowing us to create the physical volume and then create our LVMs on top of it. Thanks to this forum article for the tip.
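The cleanup in workaround 2 can be sketched in a few lines of script. This is a hypothetical helper (not a real tool): given the names from dmsetup ls, it collects every entry derived from one EqualLogic volume (the base device, its p1 partition, and the _a alias) so they can be passed to dmsetup remove:

```python
# Group device-mapper entries by EqualLogic volume ID. The "_a" alias
# drops the volume label from the name, so we match on the ID portion
# ("eql-" plus the volume ID) rather than the full base name.
def related_dm_entries(dm_names, volume_id):
    prefix = "eql-" + volume_id
    return sorted(n for n in dm_names if n.startswith(prefix))

# Example output of `dmsetup ls` (names only), including an unrelated
# local logical volume that must be left alone.
dm_ls = [
    "eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01p1",
    "eql-0-8a0906-8a0b93e09-01120a256614d8a6-ora-data01",
    "eql-0-8a0906-8a0b93e09-01120a256614d8a6_a",
    "VolGroup00-LogVol00",
]

for name in related_dm_entries(dm_ls, "0-8a0906-8a0b93e09-01120a256614d8a6"):
    print("dmsetup remove", name)
```

Matching on the volume ID rather than the friendly name catches all three mapper entries in one pass while skipping local volume groups.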

Tuesday, March 15, 2011

Active Directory Port Requirements

Here on the client site, we're in the process of getting our VMware View 4.6 environment up and working. As one of the required steps, we need to get our vCenter and View Connection Servers on the domain. It so happens that our client runs their own domain (as much as I'd like to run our own), so we've had to coordinate with them for this effort. When trying to join the domain, our servers (which are Windows 2003) were able to query DNS, resolve the FQDN, and identify the Domain Controllers; however, we could never get to the point where we were prompted to join our servers. Instead, we received the following error:


DNS was successfully queried for the service location (SRV) resource record
used to locate an Active Directory Domain Controller for domain local:

The query was for the SRV record for _ldap._tcp.dc._msdcs.local

The following AD DCs were identified by the query:

server1.fqdn
server2.fqdn

Common causes of this error include:

- Host (A) records that map the name of the AD DCs to its IP addresses are
missing or contain incorrect addresses.

- Active Directory Domain Controllers registered in DNS are not connected to
the network or are not running.


We initially thought the error to be related to WINS as TCP port 137 was closed from our VLAN to the client's VLAN that hosts the Domain Controller (yes, even in a 2008 AD Environment, the client still uses WINS). We ensured that TCP 389 was open by using telnet and confirmed that it was. As it turns out, that wasn't the issue. We found that although TCP 389 was allowed, UDP 389 was being blocked. After enabling UDP 389, we were able to query AD fine and join the domain.


This was a bit perplexing, as I've never known AD to require UDP 389, but perhaps that's because it's always been open for me. After a bit of Googling, I learned that LDAP only uses UDP 389 when querying the domain to find the list of domain controllers. After querying DNS and getting a list of possible DCs, the client's netlogon service sends a datagram to each domain controller in the form of an LDAP packet over UDP 389. All available domain controllers respond with the information for DsGetDcName. The client's netlogon service then caches this information to avoid constantly querying the DCs. Thanks to Marc Nivens at Experts-Exchange for the tip [link].

So long story short, open TCP AND UDP 389 for LDAP/Active Directory.
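A rough scripted equivalent of the telnet test we used for TCP 389 is sketched below, demonstrated against a local listener rather than a real domain controller. Note that a successful TCP connect proves nothing about UDP 389, which has no handshake and must be checked separately (e.g. with nmap -sU or portqry):

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (telnet-style check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Stand-in for a domain controller: a listener on an ephemeral loopback port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

print(tcp_port_open("127.0.0.1", port))   # the "telnet succeeds" case
listener.close()
```

In practice you'd call tcp_port_open("dc.example.local", 389) from the server you're trying to join, then verify UDP 389 with a UDP-capable scanner, since the firewall rules for the two protocols can differ (as we found out).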

Addendum


As noted above, after we opened UDP 389 we were able to get a credential prompt; however, after entering correct credentials, we would still receive some odd errors and could not join the server to the domain. After a bit more Googling, we found a Microsoft TechNet article that outlines the ports required for 2003 and 2008 domains, most notably the high-numbered RPC ports.

Friday, March 11, 2011

Lesson Learned

After not eating lunch today and doing some quick configuration within vCenter, I accidentally deleted a vDisk off of the wrong VM. Not good. Sure enough, when logging into the ESX host, the vDisk was gone, unrecoverable. I recall that VMware ESX 3.5 had a utility named vmfs-undelete for recovering deleted vmdk files. No longer:

vmfs-undelete utility is not available for ESX/ESXi 4.0
ESX/ESXi 3.5 Update 3 included a utility called vmfs-undelete, which could be used to recover deleted .vmdk files. This utility is not available with ESX/ESXi 4.0.

Workaround: None. Deleted .vmdk files cannot be recovered.

[VMware ESX 4.0 Update 1 Release Notes]

Fortunately, the data on the vDisk can be recreated by the developer in about an hour, so all told we're not set back too far. A few important lessons learned, though:
  1. Always double check before permanently deleting
  2. Don't have multiple configuration windows open at the same time
  3. Eat lunch

Thursday, February 24, 2011

The perils of using Wifi for Mobile Geographic Positioning

It is pretty well known that the Google Street View cars that capture the images used for Street View also collect information about Wifi networks in the area to improve Google's location-based services. In the past, Google has been known to collect too much information on private Wifi networks. That aside, it is a pretty cool concept: if your mobile device, such as an iPod touch, lacks GPS capabilities, it can still find where you are based on the Wifi networks around you.


The ingenuity of their system is also its weakness. Take, for example, what I learned today. I recently made the jump to a smartphone, an HTC Evo. I LOVE the phone and 4G, and one of the primary reasons I opted for it was its location-based capabilities. This week, I am down at the family beach house enjoying time away from work and getting in some riding/hiking/sleeping time. However, I've noticed that when in the house, the phone's weather report maps Athens, GA as my current location, although I am 250+ miles south. I wrote this inaccuracy off as a glitch in Sprint's towers or something along those lines; that is, until I took a closer look at Google Maps on my phone today.


I found it curious that the phone was putting me in Athens, which is where I attended the University of Georgia for four years. Out of curiosity, I wanted to see where in Athens it was putting me, as I'm pretty familiar with the city and its streets. After zooming in a bit, I noticed that it was mapping me near the intersection of Baxter Street and Milledge Avenue. After zooming in all the way (which is not as far as I've been able to zoom when GPS is turned on), I realized where my phone was pinpointing me: off of Springdale Street, where I lived for two years while at the University. Sure enough, my phone was using Google's massive database of Wifi data (MAC, SSID, etc.) to locate me at the same place where it had captured my router's MAC address 3-4 years ago. Although the SSID of the router has since changed (as well as the encryption key, etc.), it is obvious that Google Maps uses the router's MAC (media access control) address as its primary identifier. It makes sense, as the primary purpose of a MAC address is to be a unique identifier. So although I am not really near Athens at all, I am near the router that was in Athens at the time Google captured its MAC.


This presents somewhat of a quandary in using Google Maps. I guess Google just didn't count on people moving very often, or figured they'd update their database frequently enough to mitigate the problem (note that it has been three years since the router resided at that location; it has since changed geographies twice). It begs the question: is there somewhere I can go to update their database to reflect the router's new location? Also, would this occur no matter where in the world the router was located, so long as there were no indexed networks nearby? Regardless, it is an illustrative example of how Google Maps works using Wifi data, and of how this clever system also has its limitations.

Sunday, February 13, 2011

Gnome Wallpaper Changer

One feature I really liked about Windows 7 that is lacking in F13 Gnome is the ability to automatically rotate desktop wallpapers. This minor aesthetic functionality is easily implemented using crontab and a simple python script. I set the script to run every 15 minutes via crontab and to read from a specific folder of wallpapers (.backgrounds/dual). I used Davyd Madeley's python script:


#!/usr/bin/python
#
# change-background.py
#
# A script to change to a random background image
#
# (c) 2004, Davyd Madeley
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#

backgrounds = ".backgrounds/dual"

import gconf
import os
import random
import mimetypes

class GConfClient:
    def __init__ (self):
        self.__client__ = gconf.client_get_default ()
    def get_background (self):
        return self.__client__.get_string ("/desktop/gnome/background/picture_filename")
    def set_background (self, background):
        self.__client__.set_string ("/desktop/gnome/background/picture_filename", background)

client = GConfClient ()

dir_items = os.listdir (os.path.join (os.environ["HOME"], backgrounds))
items = []

for item in dir_items:
    mimetype = mimetypes.guess_type (item)[0]
    if mimetype and mimetype.split ('/')[0] == "image":
        items.append (item)

item = random.randint (0, len (items) - 1)
current_bg = client.get_background ()

# compare full paths, since gconf stores the absolute file name
while (os.path.join (os.environ["HOME"], backgrounds, items[item]) == current_bg):
    item = random.randint (0, len (items) - 1)

client.set_background (os.path.join (os.environ["HOME"], backgrounds, items[item]))



The crontab file is set as:


*/15 * * * * DISPLAY=:0.0 /usr/local/bin/change-background.py