Author: admin

How to add a new SCSI LUN while the server is live

REDHAT/CENTOS:

In order to get the WWN IDs from a server:

 cat /sys/class/scsi_host/host0/device/fc_host\:host0/port_name
 cat /sys/class/scsi_host/host1/device/fc_host\:host1/port_name

Or:

 systool -av -c fc_host | grep port_name | awk '{ print $3 }' | cut -d\" -f 2 | cut -dx -f 2
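
If you have more than a couple of HBAs, a little loop over /sys/class/fc_host saves some typing (a minimal sketch, assuming the FC driver is loaded and the fc_host class is populated):

 # print the WWPN of every FC host adapter found
 for h in /sys/class/fc_host/host*; do
     echo "$h: $(cat $h/port_name)"
 done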


1. To add a new SAN LUN while live:

Run this to find the new disks after you have added them to your VM:

 rescan-scsi-bus.sh

Note: rescan-scsi-bus.sh is part of the sg3_utils package
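
If rescan-scsi-bus.sh isn't available and you can't install packages, you can trigger the same scan manually through sysfs (a sketch; "- - -" means scan all channels, targets and LUNs, and the loop hits every host, so narrow it down if you only want one HBA):

 for h in /sys/class/scsi_host/host*; do
     echo "- - -" > $h/scan
 done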

2. Check that it has been found; it will show up as mpath(something):

 multipath -l

# That’s it, unless you want to fix the name from mpath(something) to something else

1. Change the shortcut name:

 vi /etc/multipath/bindings

 

2. Remove the autogenerated default mpath device:

 multipath -f mpath(something)

# Go into the multipathd console and re-add the multipath device with your new shortcut name (nickdsk2 in this case)

 multipathd -k

 add map nickdsk2

Note: Not going to lie, sometimes you could do all this and still need a reboot; the majority of the time this should work. But what do I know…haha

How to figure out switch and port via tcpdump

Okay, if you have ever worked in a place where the network was complete chaos, with no documentation or network maps to help you figure out where something resides, you can sometimes use tcpdump to figure out which switch and port the server is sitting on.

Syntax

tcpdump -nn -v -i <NIC_INTERFACE> -s 1500 -c 1 'ether[20:2] == 0x2000'


Example:

root@ansible:~ # tcpdump -nn -v -i eth0 -s 1500 -c 1 'ether[20:2] == 0x2000'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 1500 bytes
03:25:22.146564 CDPv2, ttl: 180s, checksum: 692 (unverified), length 370
   Device-ID (0x01), length: 11 bytes: 'switch-sw02'
   Address (0x02), length: 13 bytes: IPv4 (1) 192.168.1.15
   Port-ID (0x03), length: 15 bytes: 'Ethernet0/1'
   Capability (0x04), length: 4 bytes: (0x00000028): L2 Switch, IGMP snooping
   Version String (0x05), length: 220 bytes:
   Cisco Internetwork Operating System Software
   IOS (tm) C2950 Software (C2950-I6Q4L2-M), Version 12.1(14)EA1a, RELEASE SOFTWARE (fc1)
   Copyright (c) 1986-2003 by cisco Systems, Inc.
   Compiled Tue 02-Sep-03 03:33 by Nicola tesla
   Platform (0x06), length: 18 bytes: 'cisco WS-C2950T-24'
   Protocol-Hello option (0x08), length: 32 bytes:
   VTP Management Domain (0x09), length: 6 bytes: 'ecomrd'
   Duplex (0x0b), length: 1 byte: full
   AVVID trust bitmap (0x12), length: 1 byte: 0x00
   AVVID untrusted ports CoS (0x13), length: 1 byte: 0x00
1 packets captured
2 packets received by filter
0 packets dropped by kernel

root@ansible:~ #
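
If you do this often, a tiny wrapper script saves you from remembering the filter (a sketch; the default interface name is an assumption, pass your own as the first argument). CDP announces roughly every 60 seconds, so the capture can take up to a minute:

 #!/bin/bash
 # cdp-where.sh - grab one CDP frame and print the switch name, address and port
 IFACE=${1:-eth0}
 tcpdump -nn -v -i $IFACE -s 1500 -c 1 'ether[20:2] == 0x2000' 2>/dev/null | grep -E 'Device-ID|Port-ID|Address'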

Written by Nick Tailor

How to increase disk size on a virtual scsi drive using gdisk

1. Login to the VMware vSphere Client and shut down the server VM guest.
2. Select the VM guest server and click "Edit virtual machine settings". The Virtual Machine Properties window will appear. Under "Hardware", click Hard Disk 2 (which is the /data partition) and edit the provisioned size to 200 GB as shown in the screenshot below.
Power on the VM guest after editing the disk size.

vsphere

3. Take a VM snapshot of the VM guest.
4. Log on to the VM guest using an SSH client, like PuTTY, with the "root" user.
5. List the SCSI devices using the command – cat /proc/scsi/scsi
disk7
6. Run the following command to see the name of the partition:
ls -d /sys/block/sd*/device/scsi_device/* | awk -F '[/]' '{print $4,"- SCSI",$7}'
disk8
7. Run the following commands to confirm that the size of the SCSI disk you increased in step 2 has been updated:
1. echo 1 > /sys/class/scsi_device/2\:0\:1\:0/device/rescan
2. fdisk -l | grep Disk
3. df -h
disk9_1
disk9_2
8. Stop cron and services using the command:
service crond stop
9. Unmount the /data partition using the command — umount /data
Note: If you observe a "Device is busy" error, make sure your current session is not inside the /data partition.
10. Perform the following steps to grow the added disk space of the /data partition, based on the partition type.

For GPT partition type

In this case the parted -l command will show the following for the "sdb" disk partition:
*****************************************************

Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 215GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name       Flags
 1      1049kB  215GB  215GB  ext4         Linux LVM  lvm
*****************************************************

4. Execute the command — gdisk /dev/sdb
5. Type "p" to print the partition
gpt_p
6. Type "d" to delete the partition
gpt_d
7. Type "p" to check that the partition is deleted
gpt_p1
8. Type "n" to create a new partition
9. Type "1" as the partition number
10. Press "Enter" twice
11. Type "8E00" as the GUID
gpt_n
12. Type "p" to check the newly added partition
gpt_p2
13. Type "w" to write the altered partition table
14. Type "Y" to continue
gpt_w
15. To mount the /data partition run the command – mount /data
16. To resize the file system run the command – resize2fs /dev/sdb1
17. To check the increased disk space run the command – df -h
gpt_df
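
As an aside, if the cloud-utils-growpart package is available to you, the whole delete/recreate dance in gdisk can usually be collapsed into two commands (a sketch, assuming the same ext4-on-/dev/sdb1 layout as this example; growpart extends the partition in place):

 growpart /dev/sdb 1    # grow partition 1 to fill the disk
 resize2fs /dev/sdb1    # then grow the ext4 filesystem into it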


How to verify your ansible config isn't missing any routes from your route table

REDHAT/CENTOS

Okay, so this is for those of you who use ansible like me and deal with complicated networks, where servers have a route list a mile long that you might need to migrate or copy into ansible. You want to save yourself some time and be accurate by ensuring the routes are correct and the file isn't missing any, since missing routes can be problematic and time consuming to troubleshoot after the fact.

Here is something cool you can do.

  1. On the client server
  • Use the "ip" command with the "r" flag to list the routes

Example:

It will look something like this.

[root@ansibleserver]# ip r
default via 192.168.1.1 dev enp0s8
default via 10.0.2.2 dev enp0s3 proto dhcp metric 100
default via 192.168.1.1 dev enp0s8 proto dhcp metric 101
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 metric 100
192.168.1.0/24 dev enp0s8 proto kernel scope link src 192.168.1.12 metric 101
10.132.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 1011
10.132.10.0/24  dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.136.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 1011
10.136.10.0/24  dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.134.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 1011
10.133.10.0/24  dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.127.10.0/24  dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.122.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.134.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.181.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.181.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.247.200.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.172.300.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.162.100.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.161.111.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.161.0.0/16   dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.233.130.0/24 dev mgt proto kernel scope link src 10.16.110.1 metric 101
10.60.140.0/24   dev mgt proto kernel scope link src 10.16.110.1 metric 101

Now what you want to do is take all the routes that show up on the "mgt" interface and put them in a text file:

  • vi ips1
  • Copy the routes in, one per line in a single column, and save the file.

10.132.100.0/24
10.132.10.0/24

10.136.100.0/24
10.136.10.0/24
10.134.100.0/24
10.133.10.0/24
10.127.10.0/24
10.122.100.0/24

  2. Now your ansible route section will probably look something like this…
Example of an ansible yaml file, "ansiblefile":
routes:
    - device: mgt
      gw: 10.16.110.1
      route:
        - 10.132.100.0/24
        - 10.132.10.0/24
        - 10.136.100.0/24
        - 10.136.10.0/24
        - 10.134.100.0/24
        - 10.133.10.0/24
        - 10.127.10.0/24
        - 10.122.100.0/24
        - 10.134.100.0/24
        - 10.181.100.0/24
        - 10.181.100.0/24
        - 10.247.200.0/24
        - 10.172.300.0/24
        - 10.162.100.0/24
        - 10.161.111.0/24
        - 10.161.0.0/16
        - 10.233.130.0/24
  3. So what you want to do now is copy and paste the routes from the file so they line up perfectly, with the correct spacing, in your yaml file. Note:
    If they aren't lined up correctly your playbook will fail.
  4. You can either copy them into a text editor like TextPad or Notepad++ and use the replace function to add the "- " prefix (8 spaces, then "- ", then the ip), or you can use a perl or sed script to do it right from the command line.
# If you want to edit the file in-place
sed -i -e 's/^/prefix/' file

Example:

sed -e 's/^/        - /' ips1 > ips2
  5. Okay, now you should have a new file called ips2 that looks like below, with 8 spaces from the left margin.
        - 10.136.100.0/24
        - 10.136.10.0/24
        - 10.134.100.0/24
        - 10.133.10.0/24
        - 10.127.10.0/24
        - 10.122.100.0/24
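
Alternatively, you can skip the intermediate step and generate the indented list straight from the route table (a sketch; it assumes the interface is called mgt and that your yaml wants 8 spaces of indentation):

 # print every mgt route as "        - <network>"
 ip r | awk '/dev mgt/ { print "        - " $1 }' > ips2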
  6. Now if you cat ips2:
  • cat ips2
  • Then highlight everything inside the file
[highlighted]
- 10.136.100.0/24
- 10.136.10.0/24
- 10.134.100.0/24
- 10.133.10.0/24
- 10.127.10.0/24
- 10.122.100.0/24
[highlighted]

7. Open the ansible yaml that contains the route section, and just below "route:" paste what you highlighted, right against the margin. Everything should line up perfectly; save the ansible file.

routes:
    - device: mgt
      gw: 10.16.110.1
      route:
[paste highlight]
- 10.132.100.0/24
- 10.132.10.0/24
- 10.136.100.0/24
- 10.136.10.0/24
- 10.134.100.0/24
- 10.133.10.0/24

[paste highlight]

Okay, now we need to check to ensure that you didn't accidentally miss any routes between the route table and your ansible yaml.

  1. Using the original ips1 file, with just the route table entries and no "-":
    • Make sure the ansible yaml file and the ips1 file are inside the same directory, to make life easier.
    • We can run a little compare script like so:
     while read a b c d e; do if [[ $(grep -w $a ansiblefile) ]]; then :; else echo $a $b $c $d $e; fi; done < <(cat ips1)

Note:
If any routes are missing from the ansible file, it will spit them out. You can keep running this until the list shows no results, minus any gateway ips of course.

Example:

[root@ansibleserver]# while read a b c d e; do if [[ $(grep -w $a ansiblefile) ]]; then :; else echo $a $b $c $d $e; fi; done < <(cat ips1)
10.168.142.0/24
10.222.100.0/24
10.222.110.0/24
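
For what it's worth, comm can do the same comparison in one line once both lists are sorted (a sketch; it pulls every CIDR-looking string out of the yaml, so gateway ips are ignored automatically):

 # routes present in ips1 but missing from the yaml
 comm -23 <(sort ips1) <(grep -oE '[0-9.]+/[0-9]+' ansiblefile | sort)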

By Nick Tailor

How to change the currently active slave of a bonded interface

RedHat / CentOS :

Interface bonding, as we all know, is very useful for providing fault tolerance and increased bandwidth. We can change the active slave interface of a bond without interrupting production work. In the example below we have the bonded interface bond0 with 2 slaves, em0 and em1 (em0 being the currently active slave). We will be replacing slave em0 with a new slave, em2.

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: em0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 5000

Slave Interface: em0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:21:28:b2:65:26
Slave queue ID: 0

Slave Interface: em1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:21:28:b2:65:27
Slave queue ID: 0

1. Change the active slave to em1

The ifenslave command can be used to attach, detach, or change the currently active slave interface of a bond. Now, change the active slave interface to em1:

# ifenslave -c bond0 em1

Check the bonding status again to ensure that em1 is the new active slave :

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: em1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 5000

Slave Interface: em0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:3b:26:b2:68:26
Slave queue ID: 0

Slave Interface: em1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:3b:26:b2:68:27
Slave queue ID: 0
The switch of the active slave should take effect immediately, but on critical production systems, please schedule a maintenance window or test in an identical test environment first.
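
If your ifenslave build doesn't support the -c flag, the same switch can be done through sysfs on RHEL/CentOS (a sketch under that assumption):

 # show the currently active slave
 cat /sys/class/net/bond0/bonding/active_slave
 # make em1 the active slave
 echo em1 > /sys/class/net/bond0/bonding/active_slave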

2. Attach the new slave interface

We can now attach the new slave interface em2 to the bonding.

# ifenslave bond0 em2

3. Detach the old slave interface

Once we have attached the new slave interface, we can detach the old slave and remove it from the bonding.

# ifenslave -d bond0 em0

4. Verify

Confirm that the new slave is now the standby interface in the bonding.

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: em1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 5000

Slave Interface: em1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:21:29:bf:55:30
Slave queue ID: 0

Slave Interface: em2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:19:1a:d1:43:61
Slave queue ID: 0

If you want to make the changes more permanent

The changes we just made are temporary and will be cleared after a reboot of the server. To make them permanent we have to make a few changes.

Make sure you delete the file /etc/sysconfig/network-scripts/ifcfg-em0, as we are no longer using this interface in the bonding. Then create a new file for the new slave interface in the bonding:

# rm /etc/sysconfig/network-scripts/ifcfg-em0
# vi /etc/sysconfig/network-scripts/ifcfg-em2
DEVICE=em2
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
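
For reference, the matching bond master file, which should already exist on the box, looks something like this (a sketch; the IP details are made up, and the BONDING_OPTS mirror the mode, miimon and delays shown in /proc/net/bonding/bond0 above):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100 updelay=5000 downdelay=5000"
IPADDR=192.168.1.10
NETMASK=255.255.255.0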

How to fix the InfiniBand issue when migrating multiple bonded NICs to Red Hat 7

Okay, so some of you may be using Mellanox cards, which basically bypass the bus to give lower latency on your network response times.

Now, say you have used an OS like SUSE with a butt load of bonded NICs, and you want to migrate the OS and all the bonded NIC configurations in an automated fashion using ansible or some other configuration management tool.

What some of you might run into is that when the OS comes up for the first time, some of the Mellanox NICs will boot up in InfiniBand mode, which will result in the bonded NICs showing up as down. I will show you how to determine this and fix it.


So the first thing you want to do is determine which bonds are showing down.

How to check which bonds are down:

1. grep -c down /proc/net/bonding/*
 This will list out all the bonds that have an interface showing down.

Example

root@ansibleclient:~> grep -c down /proc/net/bonding/*
/proc/net/bonding/bond1:0
/proc/net/bonding/bond2:0
/proc/net/bonding/bond3:1 (this indicates that one interface is down)

2. Once you determine that a bond has an interface down, you want to figure out whether it's the Mellanox NIC:
 cat /proc/net/bonding/bond3
i. This will give you the MAC addresses of the NICs inside the bond.

Example

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth4
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth4
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:02:c9:e9:e9:11
Slave queue ID: 0

Slave Interface: eth5
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:02:c9:e9:e9:12
Slave queue ID: 0

3. Now what you want to do next is run 'ip a' and see if those interfaces are listed.


Example – it should look something like this. If you don't see the down NIC here (for our example let's say it's eth5), that could mean it's in InfiniBand mode and not ethernet mode. It also shows whether each interface is up or down, which is very important when troubleshooting.


[root@nickansible]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:26:9a:33:59 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic enp0s3
       valid_lft 82770sec preferred_lft 82770sec
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:26:88:5a:fd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.11/24 brd 192.168.1.255 scope global noprefixroute dynamic enp0s8
       valid_lft 82773sec preferred_lft 82773sec

4. Okay, now we need to determine whether eth5 is in fact the Mellanox card. So now we need the NIC information:
 ethtool -i eth5


Example. It will look something like this.

[root@nickansible]# ethtool -i eth5
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
expansion-rom-version:
bus-info: 0000:00:18.0 (this is the important info you need)
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

 Now you want to take the bus info and determine whether it is in fact the Mellanox card:
 lspci -s 0000:00:18.0

Example

[root@nickansible]# lspci -s 0000:00:18.0
00:18.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s]

5. Okay, now we know for sure this NIC is in fact the Mellanox NIC that is down. So now we want to manually force it into ethernet mode, but first check what the ports currently say (see the loop sketch below):
 cat /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port1
 cat /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port2
i. If these don't return "ETH" then the ports are in InfiniBand mode.
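
The loop sketch mentioned above, to eyeball every Mellanox port on the box at once (it assumes mlx4 devices; the glob simply picks up every mlx4_port file under the PCI tree):

 for p in /sys/bus/pci/devices/*/mlx4_port*; do
     echo "$p: $(cat $p)"
 done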


6. Now what we want to do is manually change the NIC to ethernet mode:
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port1
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port2
ii. If you cat them now they should say "ETH".


Okay, so now when you do 'ip a' you should see the NICs up, and if you check the status of the bonds there should be 0 bonds down. You might have to bring the bond down and up.


7. You can do this simply by:
 ifdown eth5 && ifup eth5
iii. If there are no errors, the cursor will simply move to the next line after a brief delay.


Now, the issue here is that if you aren't able to get RPMs from Mellanox that are supported by patching in your organisation, you're going to need a way to ensure that if the server reboots the NIC will start up in ethernet mode. Otherwise you could be in a very bad situation if the server boots and the NIC comes up in InfiniBand mode.


So there are a couple of ideas I came up with to solve this.

Options:

1. You can simply add the echo lines to /etc/rc.local:
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port1
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port2
i. This should bring the interfaces back to "ETH"; however, you might need to add some more lines to bring the interfaces up properly, as in the sketch below.
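
For example, the tail of /etc/rc.local might end up looking like this (a sketch; the PCI address and interface names are the ones from this walkthrough, substitute your own):

 # force the Mellanox ports back into ethernet mode at boot
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port1
 echo eth > /sys/bus/pci/devices/0000\:18\:00.0/mlx4_port2
 # then bring the slave and the bond back up
 ifup eth5
 ifup bond3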


2. This is the approach I chose, and the cooler way to go about it. In Red Hat 7 you can define an ifup-pre-local script which will run any time "ifup" is run.

Here is how you set that up.

1. Create a file called /etc/sysconfig/network-scripts/ifup-pre-local
a. vi /etc/sysconfig/network-scripts/ifup-pre-local


2. Now you can add whatever script you want. My colleague and I came up with a script that works off the MAC and bus info: if certain buses and MACs show up in InfiniBand mode, it runs the echo to move those ports into eth mode.


Add this inside and save the file:

#!/bin/bash

LID="00:00:00:00"

for i in `ls /etc/sysconfig/network-scripts/ifcfg-* 2> /dev/null`
do
  for j in `grep HWADDR $i | awk -F\" '{print $2}'`
  do
    ID1=$(echo $j | awk -F: '{print $2":"$3}')
    ID2=$(echo $j | awk -F: '{print $4":"$5}')
    ID="$ID1:$ID2"
    PORT=$(echo $j | cut -c 16-17)
    for k in `ls /sys/bus/pci/devices/0000\:*\:00.0/net/ib[0-9]/address 2> /dev/null`
    do
      grep "$ID1.*$ID2" $k 1> /dev/null
      if [ $? -eq 0 ]; then
        if [ "x$ID" != "x$LID" ]; then
          mlxport=1
        else
          let "mlxport++"
        fi
        LID=$ID
        p=$(echo $k | awk -F/ '{print "/sys/bus/pci/devices/"$6"/"}')
        echo "Running: echo eth > ${p}mlx4_port${mlxport}"
        echo eth > ${p}mlx4_port${mlxport}
      fi
    done
  done
done


3. Next you want to create a symlink inside /sbin:
a. Move into /sbin
 cd /sbin
b. Now create a symlink for ifup-pre-local
 ln -s /etc/sysconfig/network-scripts/ifup-pre-local ifup-pre-local


Now when you run ifup it will run that script, which checks whether any of those buses and MACs are in InfiniBand mode and brings them into eth. It's safer to do it this way because if you restart the network, and for some reason the NIC goes back into InfiniBand mode, someone new would have no idea and would spend a while trying to figure it out.


How to deploy this fix via an ansible role: coming soon……


How to build a server using kickstart with Satellite 6.x

Note: This document assumes that your capsule servers are already configured, your dhcpd service is running, and your subnets have already been added to the config.

Manual process:

HOST TAB

1.On the top menu bar click on the HOSTS
a.Create hosts

Under Create hosts there are a bunch of tabs that need to be filled out.

Name * (This is the name of your vm) – “nick.test1.com”
This value is used also as the host’s primary interface name.

Organisation *: Whichever ORG you want the host to live in (LCH)

Location * london

Host Group – We will do this later; for now just choose an existing non-prod group.

Deploy on – Bare Metal

Lifecycle Environment Non-Prod

Content View – Select a content view that exists, check under content view

Content Source – leave blank

Interfaces TAB

Type : Interface

MAC address: Grab the MAC address from vCenter, or log in to the existing OS and get the interface MAC address

Device identifier :  en016780032

DNS name “nick.test1.com

Domain : nicktailor.com

IPv4 Subnet: the subnet the vlan lives on (this is set up on the capsule server), e.g.
nick-10.61.120.0-26 (10.61.120.0/26)

IPv6 Subnet

IPv4 address : 10.61.120.45

Managed (checked)

Primary (checked)

Provision(checked)

Remote execution(checked)


Operating System TAB

Architecture * :x86_64

Operating system *: RHEL Server 7.4

Media Selection: Synced Content / All Media

Select the installation media that will be used to provision this host. Choose 'Synced Content' for synced kickstart repositories or 'All Media' for other media.

Media *: RHE7-cap01 (this is where the repositories live)

Partition table *: RHEL7-TESTING (make sure this is attached to a host group and operating system, under HOSTS & CONFIGURE)

PXE loader : PXELinux BIOS (this is for the PXE Boot)

Custom partition table: (leave blank unless you want to override)

Whatever text (or ERB template) you use in here will be used as your OS disk layout options. If you want to use the partition table option, delete all of the text from this field.

Root password : password

Password must be 8 characters or more

Parameters TAB

Puppet class parameters

Puppet class  Name  Value  Omit


Global parameters:

Capsule : nick-cap01.com

Activation_keys: RHEL7-2017-12-PROD

                      nick-cap01.com
                       kt_activation_keys: RHEL7-2017-12-Prod

                      (if you override the default key it shows up below)

                      puppet_server  : nick-pup02.com


Host parameters:

Name  Value  Actions

kt_activation_keys

RHEL7-2017-12-Non-Prod (nonprod)

Additional Information TAB

Owned by: Nick Tailor

Enabled: Include this host within satellite reporting (check this)

Hardware Model

Comment: Blank


Next Step – Create a hostgroup

Under Configure select Host Groups. (You need a host group for your deployment to work properly; without one it will not work.)

Note: Generally it's easier to clone an existing host group, change the name, and edit the settings to save time. However, for the purposes of this document we are going to go through the process.

1.Click on Create Host Group (Top right)

Host Group Tab

Parent

Name *:  Nick-hax0r-servers (Project name – servers)

Lifecycle Environment: NON-PROD (make sure you have lifecycle environment configured)

Content View : RHEL7-2019-03 (Make sure to select a content view that exists, you can go to content views and look at which it exists and the copy and paste the name exactly)

Content Source: nick-cap01.com(This is the capsule server where the content for the repositories exist for the dev environment, in addition where the subnets are defined that these project servers can dhcp from pxeboot)

Puppet Environment: Non_Production_RHEL7_2019_03_127
Note: (Define this if you have a puppet environment configured with Satellite. Your puppet environment will need to match this content view if you do.)

Compute profile : Blank

Puppet Master: Blank

Puppet CA: Blank

OpenSCAP Capsule : Blank

Note: (This is good for pulling server information and vulnerabilities)

Network TAB

Domain: nicktailor.com

IPv4 Subnet: NTC-10.61.120.0-26 (10.61.120.0/26)
Note: (These subnets are defined in Satellite under Infrastructure and then Subnets)

IPv6 : No Subnet

Realm: Blank


Operating System TAB

Architecture: x86_64

Operating system * : RHEL Server 7.4
(Note: This section is very important. You will need to attach the partition table to the operating system under Hosts and Operating System. If you do not, then when you create your provision template this host group will not be able to see the partition table you created when you choose the OS you want to deploy.)

Media Selection: Synced Content / All Media

Select the installation media that will be used to provision this host. Choose ‘Synced Content’ for Synced Kickstart Repositories or ‘All Media’ for other media.

Media *: RHEL7-nick-cap01

Partition table *: RHEL7-Testing
(Note: This is created under HOSTS and Partition Tables)

PXE loader: Blank

Root password: Password (set this to the desired password for your server)


Parameters TAB

Global Parameters

Host group parameters:

Name:         Value:

Capsule       nick-cap01.com

puppet_server      nick-pup02.com
Note: (You only need to define this if you have a puppet server environment configured)


Locations TAB

Under Selected Items:

  Add London

Organizations TAB

Under Selected Items:
Add organizations you want to have access to the host group
ADD: LCH


Activation Keys TAB

Activation keys: RHEL7-2017-12-Non-Prod (this key defines which organization, host group, repositories, and lifecycle environment the host initially gets registered with. You can manually change these settings afterwards; however, it's probably good to make a proper key to save yourself lots of time.)


Next Step – Create Partition Table
HOSTS and Partition Tables


1. Click on Create Partition Table
(Note: It's generally better to clone an existing table and edit as needed; however, for the purposes of this doc we will go through the settings. You will also need to add this table to your operating system under Hosts and Operating System for the provision template to work properly.)

Template TAB

Name * : GTP-RHEL7-Testing (Name your partition table scheme)

Default

Default templates are automatically added to new organisations and locations

Snippet

Operating system family: RED HAT

Input:

Note: This is a standard lvm setup using ext4 for the OS. If you are going to use dual boot, then you want to change the first 3 lines

zerombr
clearpart --drives=sda --all --initlabel
part /boot --fstype ext4 --size=1024 --asprimary --ondisk=sda
part pv.00 --size=1 --grow --asprimary --ondisk=sda
volgroup vgroot pv.00
logvol / --name=lv_root --vgname=vgroot --size=15360 --fstype ext4
logvol swap --name=lv_swap --vgname=vgroot --size 6144 --fstype swap
logvol /var --name=lv_var --vgname=vgroot --size 10240 --fstype ext4
logvol /opt --name=lv_opt --vgname=vgroot --size 10240 --fstype ext4
logvol /var/tmp --name=lv_var_tmp --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev,nosuid,noexec
logvol /var/log --name=lv_var_log --vgname=vgroot --size 5120 --fstype ext4
logvol /var/log/audit --name=lv_var_log_audit --vgname=vgroot --size 2048 --fstype ext4
logvol /var/coredumps --name=lv_crash --vgname=vgroot --size 16384 --fstype ext4
logvol /tmp --name=lv_tmp --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev,nosuid,noexec
logvol /home --name=lv_home --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev


Dual Boot template:


Note: Change the drive designation from sda to sdx (x being whatever the new drive designation is). In the example below it's /dev/sdc.


clearpart --drives=sdc --all --initlabel
part /boot --fstype ext4 --size=1024 --asprimary --ondisk=sdc
part pv.00 --size=1 --grow --asprimary --ondisk=sdc
volgroup vgroot pv.00
logvol / --name=lv_root --vgname=vgroot --size=15360 --fstype ext4
logvol swap --name=lv_swap --vgname=vgroot --size 6144 --fstype swap
logvol /var --name=lv_var --vgname=vgroot --size 10240 --fstype ext4
logvol /opt --name=lv_opt --vgname=vgroot --size 10240 --fstype ext4
logvol /var/tmp --name=lv_var_tmp --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev,nosuid,noexec
logvol /var/log --name=lv_var_log --vgname=vgroot --size 5120 --fstype ext4
logvol /var/log/audit --name=lv_var_log_audit --vgname=vgroot --size 2048 --fstype ext4
logvol /var/coredumps --name=lv_crash --vgname=vgroot --size 16384 --fstype ext4
logvol /tmp --name=lv_tmp --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev,nosuid,noexec
logvol /home --name=lv_home --vgname=vgroot --size 5120 --fstype ext4 --fsoptions=nodev


Locations TAB


Under Selected Items:

  ADD: London


Organization TAB


Under Selected Items:

  ADD: NTC


Next Step – ADD New Partition Table to Operating System

1. Click on HOSTS and Operating Systems
2. Select the OS
a. RHEL 7.2
Note – (This part is important. The way to figure out which OS to choose is to check which repositories are available on the defined capsule server. Say you chose RHEL 7.4, but the repository doesn't exist there: the provision template will then fall back to the default template, your partition template and everything else will no longer be there, and you could accidentally deploy on the wrong disk, potentially wiping out data.)


For example, if we chose the content view RHEL7-2019-03 and the OS RHEL 7.4 in the provision template, but on the capsule server the path shows only 7.5 under that content view, the URL would fail during the deployment, revert to the 7.2 default settings, and use a different partition table if the one you created wasn't available under the default OS setting.


root@nick-cap01:/var/lib/pulp/published/yum/http/repos/NTC/Non-Production/RHEL7-2019-03/content/dist/rhel/server/7/7.5


I found it's best to use the default OS and then just ensure that a yum update is in the kickstart file that is going to be used.


It will look like this in the kickstart file:

# update all the base packages from the updates repository

yum -t -y -e 0 update

b. Next, under the Partition Table tab:
i. Under Selected Items:
ii. Add the new partition table (GTP-RHEL7-TESTING)


Now go back to your new provision template.


Under Hosts and Provision Template.

1. Select your new template (GTP-kickstartprofile-testing)
a. Under the Association TAB
i. Ensure the OS RHEL 7.2 is under Selected Items
b. Under Host Group
ii. Ensure your new host group is selected (GTP-servers)


Now set your server to build status so that PXE boot is able to pick it up on network boot.


1. Under Hosts and All Hosts
2. Search for your host
a. Click on the host nick.test1.com
b. Click on BUILD on the far right. (Note: this will create the pxeboot file on the capsule server, so when you network boot this host it will know which provision templates to deploy.)


Now we can test the deployment from VCENTER


1. Under VMs
a. Find your VM (nick.test2.com)
b. Open the console on the VM
c. Reboot
i. During the reboot hit F12 for the network boot option.
d. If all goes well the kickstart server should deploy without any intervention and reboot into your OS.


SMTP auth relay with postfix

In case your SMTP server requires authentication (like Gmail), you can relay mail through it. For this purpose we will use Postfix.

Install the needed packages:

Ubuntu

apt-get install postfix mailutils libsasl2-2 ca-certificates libsasl2-modules

CentOS

yum update && yum install postfix mailx cyrus-sasl cyrus-sasl-plain
  1. Set the Postfix config file /etc/postfix/main.cf. Add these lines to the end of the file:

    Ubuntu

    relayhost = [smtp.gmail.com]:587
    smtp_sasl_auth_enable = yes
    smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
    smtp_sasl_security_options = noanonymous
    smtp_tls_CAfile = /etc/ssl/certs/thawte_Primary_Root_CA.pem
    smtp_use_tls = yes
    

    CentOS

    relayhost = [smtp.gmail.com]:587
    smtp_sasl_auth_enable = yes
    smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
    smtp_sasl_security_options = noanonymous
    smtp_tls_CAfile = /etc/ssl/certs/ca-bundle.crt
    smtp_use_tls = yes
    
  2. Configure email address and password:

    echo [smtp.gmail.com]:587 USERNAME@gmail.com:PASSWORD > /etc/postfix/sasl_passwd
    postmap /etc/postfix/sasl_passwd
    chmod 400 /etc/postfix/sasl_passwd
    
  3. Secure DB password

    chown root:root /etc/postfix/sasl_passwd /etc/postfix/sasl_passwd.db
    chmod 0600 /etc/postfix/sasl_passwd /etc/postfix/sasl_passwd.db
    
  4. Reload Postfix

    systemctl reload postfix
  • Test your configuration with:

    echo "Test mail from postfix" | mail -s "Test Postfix" you@example.com
    echo "Test mail from postfix" | mail -s "Test Postfix" -r someone@whatever.com you@example.com (this is if your reply address has to come from a specific sender)
    

    You should receive an email on you@example.com
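
If the test mail never arrives, two quick checks usually tell you why (both are standard Postfix tools; postconf -n prints only the settings you changed):

    postconf -n | grep -E 'relayhost|smtp_sasl|smtp_tls'
    postqueue -p    # shows anything stuck in the mail queue, with the deferral reason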


Note: If you are using CentOS 7, which uses systemd, you will find that postfix doesn't send mail logs to /var/log/maillog. Someone decided it would be a great idea to send them to the journal instead.

In order to find your mail logs and or tail the logs for postfix.

  • journalctl -u postfix (view all logs)
  • journalctl -u postfix -f (to tail the mail logs)

How to properly upgrade wazuh with a major update (standalone setup)

Upgrade from the same major version (3.x)

The following steps show how to upgrade to the latest available version of Wazuh 3.x (which implies upgrading to the latest version of Elastic Stack 6.x).

Starting the upgrade

If you followed our manager or agents installation guides, you probably disabled the repository in order to avoid undesired upgrades. It's necessary to enable it again to get the latest packages.

  • For CentOS/RHEL/Fedora:

# sed -i "s/^enabled=0/enabled=1/" /etc/yum.repos.d/wazuh.repo

Upgrade the Wazuh manager

Note

Since Wazuh v3.7.0 the File Integrity Monitoring database is not used anymore. In order to add the file and registry entries stored from previous versions to the Wazuh DB, it's necessary to run the FIM migration tool.

  • Upgrade the wazuh-manager package:
  • For CentOS/RHEL/Fedora:

# yum upgrade wazuh-manager

 

  • Upgrade the wazuh-api package:
  • For CentOS/RHEL/Fedora:

# yum upgrade wazuh-api

Note

The installation of the updated packages will automatically restart the services for the Wazuh manager, API and agents. Your Wazuh config file will remain unmodified, so you'll need to manually add the settings for the new capabilities. Check the User Manual for more information.

 

Finishing the Wazuh upgrade

You've finished upgrading your Wazuh installation to the latest version. Now you can disable the Wazuh repositories again in order to avoid undesired upgrades and compatibility issues.

  • For CentOS/RHEL/Fedora:

# sed -i "s/^enabled=1/enabled=0/" /etc/yum.repos.d/wazuh.repo

Upgrade to the latest Elastic Stack version

Since the release of Wazuh 3.0.0, there have been several updates to the 6.x version of the Elastic Stack, introducing several bugfixes and important changes. In order to use the latest version of Wazuh, it's necessary to install the latest compatible Elastic Stack packages.

  • Stop the services:

# systemctl stop filebeat (You do not need to stop this in a standalone setup, because it should not be installed. Filebeat is only used when you have a clustered setup; it sends logs back to the manager.)

# systemctl stop logstash
# systemctl stop kibana
# systemctl stop elasticsearch

  • Enable the Elastic repository:

If you followed our Elastic Stack Installation Guide, you probably disabled the repository in order to avoid undesired upgrades of the Elastic Stack. It's necessary to enable it again to get the latest packages.

  • For CentOS/RHEL/Fedora:

# sed -i "s/^enabled=0/enabled=1/" /etc/yum.repos.d/elastic.repo

Upgrade Elasticsearch

  • Upgrade the elasticsearch package:
  • For CentOS/RHEL/Fedora:

# yum install elasticsearch-6.5.1

 

  • Start the Elasticsearch service:

# systemctl daemon-reload
# systemctl enable elasticsearch.service
# systemctl start elasticsearch.service

It's important to wait until the Elasticsearch server finishes starting. Check the current status with the following command, which should give you a response like the one shown below:

# curl "http://localhost:9200/?pretty"

{
  "name" : "Zr2Shu_",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "M-W_RznZRA-CXykh_oJsCQ",
  "version" : {
    "number" : "6.5.1",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "053779d",
    "build_date" : "2018-07-20T05:20:23.451332Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Updating the Elasticsearch template to the latest version is mandatory in order to avoid compatibility issues with the latest versions of Wazuh and the Elastic Stack.

# curl https://raw.githubusercontent.com/wazuh/wazuh/3.7/extensions/elasticsearch/wazuh-elastic6-template-alerts.json | curl -X PUT "http://localhost:9200/_template/wazuh" -H 'Content-Type: application/json' -d @-

Upgrade Logstash

  • Upgrade the logstash package:
  • For CentOS/RHEL/Fedora:

# yum install logstash-6.5.1

  • Download and set the Wazuh configuration for Logstash:
  1. Local configuration:
# cp /etc/logstash/conf.d/01-wazuh.conf /backup_directory/01-wazuh.conf.bak
# curl -so /etc/logstash/conf.d/01-wazuh.conf https://raw.githubusercontent.com/wazuh/wazuh/3.7/extensions/logstash/01-wazuh-local.conf
# usermod -a -G ossec logstash
  2. Remote configuration: (We are not using this in our standalone setup and therefore do not need to run this)
# cp /etc/logstash/conf.d/01-wazuh.conf /backup_directory/01-wazuh.conf.bak
# curl -so /etc/logstash/conf.d/01-wazuh.conf https://raw.githubusercontent.com/wazuh/wazuh/3.7/extensions/logstash/01-wazuh-remote.conf
  • Start the Logstash service:

# systemctl daemon-reload
# systemctl enable logstash.service
# systemctl start logstash.service

Note

The Logstash configuration file has been replaced with an updated one. If you already configured encryption between Filebeat and Logstash, don't forget to review Setting up SSL for Filebeat and Logstash again if you're using a distributed architecture.

Upgrade Kibana

  • Upgrade the kibana package:
  • For CentOS/RHEL/Fedora:

# yum install kibana-6.5.1

  • Uninstall the Wazuh app from Kibana:
  • Update file permissions. This will avoid several errors prior to updating the app:

# chown -R kibana:kibana /usr/share/kibana/optimize
# chown -R kibana:kibana /usr/share/kibana/plugins

  • Remove the Wazuh app:

# sudo -u kibana /usr/share/kibana/bin/kibana-plugin remove wazuh

  • Upgrade the Wazuh app:

# rm -rf /usr/share/kibana/optimize/bundles

# sudo -u kibana NODE_OPTIONS="--max-old-space-size=3072" /usr/share/kibana/bin/kibana-plugin install https://packages.wazuh.com/wazuhapp/wazuhapp-3.7.1_6.5.1.zip

Warning

The Wazuh app installation process may take several minutes. Please wait patiently.

  • Start the Kibana service:

# systemctl daemon-reload
# systemctl enable kibana.service
# systemctl start kibana.service

This section only applies if you have a clustered/distributed setup.

Upgrade Filebeat

  • Upgrade the filebeat package:
  • For CentOS/RHEL/Fedora:

# yum install filebeat-6.5.1

  • Start the Filebeat service:

# systemctl daemon-reload
# systemctl enable filebeat.service
# systemctl start filebeat.service

Finishing the Elastic Stack upgrade

You've finished upgrading your Wazuh installation to the latest version. Now you can disable the Elastic Stack repositories again in order to avoid undesired upgrades and compatibility issues with the Wazuh app.

  • For CentOS/RHEL/Fedora:

# sed -i "s/^enabled=1/enabled=0/" /etc/yum.repos.d/elastic.repo

Things you will need to fix after the upgrade

1. Running the migration tool, for versions before 3.7 that have upgraded recently:

  • If you upgraded from Wazuh 3.6 or older, you will need to run the following migration tool, which migrates the database into the new format for Wazuh 3.7. When they first introduced the tool it would fail without a clean exit if it couldn't decode a line, halting the migration. They have since fixed that, but it looked something like this:

    2018-11-12 15:45:38 [INFO]   [32/239] Added 3339 file entries in agent '033' database.
    2018-11-12 15:45:38 [INFO]   Setting FIM database for agent '033' as completed...
    2018-11-12 15:45:38 [INFO]   [33/239] Upgrading FIM database for agent '034'...
    2018-11-12 15:45:38 [INFO]   [33/239] Added 61 file entries in agent '034' database.
    2018-11-12 15:45:38 [INFO]   [33/239] Upgrading FIM database (syscheck-registry) for agent '034'...
    2018-11-12 15:45:38 [ERROR]  Couldn't decode line at syscheck database.
    Traceback (most recent call last):
      File "./fim_migrate", line 320, in <module>
        if not check_file_entry(agt[0], decoded[2], s):
      File "./fim_migrate", line 91, in check_file_entry
        msg = msg + cfile + b"';"
    TypeError: cannot concatenate 'str' and 'NoneType' objects


    The working migration tool is below:
    https://raw.githubusercontent.com/wazuh/wazuh/3.7/tools/migration/fim_migrate.py
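
    Presumably you would fetch and run it on the manager along these lines (a sketch; check the Wazuh documentation for the exact invocation your version expects):

    curl -sO https://raw.githubusercontent.com/wazuh/wazuh/3.7/tools/migration/fim_migrate.py
    python fim_migrate.py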

2. Error "api version type mismatch 3.6.1":

 

  • Next, run this on the server side to confirm the versions match:
  • cat /usr/share/kibana/plugins/wazuh/package.json | grep -i -E "version|revision"

"version": "3.7.0",
"revision": "0413",
"version": "6.4.3"

 

If all those match then you simply need to do the following to fix it.

  • Delete the .wazuh-version index:

curl -XDELETE http://elastic_ip:9200/.wazuh-version

  • Restart Kibana:

systemctl restart kibana

Wait for about 30 s to 1 min, then open a new window in your browser; you should be able to navigate without any more version mismatch errors.

Notes: The Wazuh app creates that index when you restart Kibana if it's not present. If your standalone setup is using localhost, then the curl command should use localhost and not the elastic IP.

3. Items listed per screen when listing agents will default back to 17 items per screen, which is extremely annoying. You can fix this in the following manner:

# systemctl stop kibana

Let's open the file /usr/share/kibana/plugins/wazuh/public/templates/agents-prev/agents-prev.html and look for lines 103-109:

 

<wz-table flex path="'/agents'" keys="['id',{value:'name',size:2},'ip','status','group','os.name','os.version','version']" allow-click="true" row-sizes="[17,15,13]"></wz-table>

The wz-table tag is related to a Wazuh custom directive which has parameters to easily change that limit.

Replace [17,15,13] with your desired sizes, e.g. [50,50,50], where each value refers to a different screen size. Use 50 for all three and you'll see 50 agents per page regardless of your screen size. Use whatever value you like; it can be 100 or 150…

My suggestion is not to increase it beyond 50 (for Angular performance reasons).

Once you are done save and close the file. Now remove old bundles and check the permissions again:

rm -rf /usr/share/kibana/optimize/bundles
chown -R kibana:kibana /usr/share/kibana/optimize
chown -R kibana:kibana /usr/share/kibana/plugins

Restart Kibana:

# systemctl restart kibana

It takes a few minutes to complete; you can check the status using the following command:

# systemctl status kibana -l

 

You'll see "Optimizing…"; once you see "App ready to be used" you can remove the cache/cookies from your browser and type in your app address to access it.

 4. Errors in the wazuh log after upgrade, [FORBIDDEN/12/index read-only / allow delete (api)]:

 

  • If you see the following in your wazuhapp.log:
  • tail -n500 /usr/share/kibana/optimize/wazuh-logs/wazuhapp.log

{"date":"2018-11-22T14:24:15.613Z","level":"info","location":"[monitoring][init]","message":"Checking if wazuh-monitoring pattern exists..."}
{"date":"2018-11-22T14:24:15.625Z","level":"error","location":"[initialize][checkKnownFields]","message":"[cluster_block_exception] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
{"date":"2018-11-22T14:24:15.632Z","level":"info","location":"[monitoring][init]","message":"Updating known fields for wazuh-monitoring pattern..."}
{"date":"2018-11-22T14:24:15.646Z","level":"info","location":"[monitoring][init]","message":"Didn't find wazuh-monitoring pattern for Kibana v6.x. Proceeding to create it..."}
{"date":"2018-11-22T14:24:15.650Z","level":"info","location":"[monitoring][createWazuhMonitoring]","message":"No need to delete old wazuh-monitoring pattern."}
{"date":"2018-11-22T14:24:15.650Z","level":"info","location":"[monitoring][configureKibana]","message":"Creating index pattern: wazuh-monitoring-3.x-*"}
{"date":"2018-11-22T14:24:15.658Z","level":"info","location":"[initialize][checkAPIEntriesExtensions]","message":"Successfully updated API entry extensions with ID: 1535484412304"}
{"date":"2018-11-22T14:24:15.660Z","level":"error","location":"[monitoring][configureKibana]","message":"[cluster_block_exception] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}

    • This usually means that one of your partitions is nearly full and wazuh has gone into read-only mode because of it, super annoying…..

To fix this you must :

    • First add disk space to your LVM; if you don't know how to do this, look it up…haha 😛
    • Then you must go into the Kibana interface, and under Dev Tools run the following:

PUT wazuh-monitoring-*/_settings
{
  "index": {
    "blocks": {
      "read_only_allow_delete": "false"
    }
  }
}

    • Make sure to restart kibana:
      • systemctl restart kibana
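
If you would rather stay on the command line, the same setting can be pushed with curl from the Elasticsearch host (a sketch; swap localhost for your elastic IP on a distributed setup):

curl -XPUT "http://localhost:9200/wazuh-monitoring-*/_settings" -H 'Content-Type: application/json' -d '{"index": {"blocks": {"read_only_allow_delete": "false"}}}'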

Once Kibana is restarted, the log should show something like this.

  • tail -n500 /usr/share/kibana/optimize/wazuh-logs/wazuhapp.log

   {"date":"2018-11-23T00:00:02.464Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}

Note: Initially you might only see one entry for that day; however, after a few days the logs will look like this.

   {"date":"2018-11-22T14:25:09.166Z","level":"error","location":"[monitoring][configureKibana]","message":"[cluster_block_exception] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}
{"date":"2018-11-23T00:00:02.464Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-24T00:00:01.894Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-25T00:00:02.055Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-26T00:00:01.983Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-27T00:00:02.785Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-28T00:00:02.458Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-29T00:00:02.163Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-11-29T14:41:46.871Z","level":"info","location":"[initialize]","message":"Kibana index: .kibana"}
{"date":"2018-11-29T14:41:46.874Z","level":"info","location":"[initialize]","message":"App revision: 0413"}
{"date":"2018-11-29T14:41:46.874Z","level":"info","location":"[monitoring][configuration]","message":"wazuh.monitoring.enabled: true"}
{"date":"2018-11-29T14:41:46.874Z","level":"info","location":"[monitoring][configuration]","message":"wazuh.monitoring.frequency: 3600 (0 */60 * * * *) "}
{"date":"2018-11-29T14:41:46.874Z","level":"info","location":"[monitoring][checkKibanaStatus]","message":"Waiting for Kibana and Elasticsearch servers to be ready..."}
{"date":"2018-11-29T14:41:48.241Z","level":"info","location":"[initialize][checkWazuhIndex]","message":"Checking .wazuh index."}
{"date":"2018-11-29T14:41:48.241Z","level":"info","location":"[initialize][checkWazuhVersionIndex]","message":"Checking .wazuh-version index."}
{"date":"2018-11-29T14:41:48.246Z","level":"info","location":"[monitoring][init]","message":"Creating/Updating wazuh-agent template..."}
{"date":"2018-11-29T14:41:48.246Z","level":"info","location":"[monitoring][checkTemplate]","message":"Updating wazuh-monitoring template..."}
{"date":"2018-11-29T14:41:48.945Z","level":"info","location":"[initialize][checkKnownFields]","message":"x-pack enabled: no"}
{"date":"2018-11-29T14:41:48.999Z","level":"info","location":"[initialize][checkKnownFields]","message":"Found 2 index patterns"}
{"date":"2018-11-29T14:41:48.999Z","level":"info","location":"[initialize][checkKnownFields]","message":"Found 1 valid index patterns for Wazuh alerts"}
{"date":"2018-11-29T14:41:48.999Z","level":"info","location":"[initialize][checkKnownFields]","message":"Default index pattern found"}
{"date":"2018-11-29T14:41:48.999Z","level":"info","location":"[initialize][checkKnownFields]","message":"Refreshing known fields for \"index-pattern:wazuh-alerts-3.x-*\""}
{"date":"2018-11-29T14:41:49.092Z","level":"info","location":"[initialize][checkKnownFields]","message":"App ready to be used."}
{"date":"2018-11-29T14:41:49.181Z","level":"info","location":"[initialize][checkAPIEntriesExtensions]","message":"Checking extensions consistency for all API entries"}
{"date":"2018-11-29T14:41:49.188Z","level":"info","location":"[initialize][checkAPIEntriesExtensions]","message":"Successfully updated API entry extensions with ID: 1535484412304"}
{"date":"2018-11-29T14:41:49.266Z","level":"info","location":"[monitoring][init]","message":"Creating today index..."}
{"date":"2018-11-29T14:41:49.295Z","level":"info","location":"[monitoring][init]","message":"Checking if wazuh-monitoring pattern exists..."}
{"date":"2018-11-29T14:41:49.314Z","level":"info","location":"[monitoring][init]","message":"Updating known fields for wazuh-monitoring pattern..."}
{"date":"2018-11-29T14:41:49.320Z","level":"info","location":"[monitoring][init]","message":"Skipping wazuh-monitoring pattern creation. Already exists."}
{"date":"2018-11-30T00:00:01.567Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-01T00:00:02.368Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-02T00:00:01.297Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-03T00:00:02.052Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-04T00:00:01.602Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-05T00:00:01.886Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}
{"date":"2018-12-06T00:00:02.870Z","level":"info","location":"[monitoring][createIndex]","message":"Successfully created today index."}

 5. Setup DiskSpaceWatch Cron:

    • I was getting annoyed with having to deal with the disk space issues, which lead to loss of logs, so I set up a little bash script called /usr/bin/diskspacewatch. The script runs as a root cron job every 30 mins; to get to the cron, type 'crontab -e'. (The crontab entry is shown after the script.)

#!/bin/sh
df -h | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | while read output;
do
  echo $output
  usep=$(echo $output | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo $output | awk '{ print $2 }')
  if [ $usep -ge 75 ]; then
    echo "Running out of space!! on wazuh production server. Add space or wazuh will go into read only mode. \"$partition ($usep%)\" on $(hostname) as on $(date)" |
    mail -s "Alert: Almost out of disk space, add diskspace to wazuh prod server. $usep%" nick@nicktailor.com
  fi
done
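
And the crontab entry for it (assuming the script lives at /usr/bin/diskspacewatch and is executable):

# run the disk space check every 30 minutes
*/30 * * * * /usr/bin/diskspacewatch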

 

    • If any of the partitions reaches 75 percent, it will send out an email alert to nick@nicktailor.com.
    • This is to help avoid log loss from wazuh going into read-only mode because of disk space.

 

 6. Upgrading wazuh agents to 3.7:

Note: Lastly, they say in the wazuh documentation that the agent is backwards compatible; however, this is not true in my opinion. Reason being, features stop working and require you to update all the agents, and this is not as simple as just updating the agent….

 

    • If you attempt to update the agent simply via yum or apt, it will result in the agent losing the manager IP and key created.
    • This particular piece of the upgrade is something that you should test in a test environment, by cloning your entire system to a dev one and running simulations. I learned this the hard way and had to be inventive to get it working.
    • There is an agent_upgrade tool they provide which is supposed to download the new agent, install it, and recopy the manager IP and key to the agent all in one go.

 

    • List out the agents that need to be upgraded
    • /var/ossec/bin/agent_upgrade -l

Example.

waz01 ~]# /var/ossec/bin/agent_upgrade -l

ID    Name                                Version

003   centosnewtemp                       Wazuh v3.6.0

165   test1                     Wazuh v3.6.1

192   test2                       Wazuh v3.6.1

271   test3                        Wazuh v3.3.1

277   test4                           Wazuh v3.3.1

280   test5                           Wazuh v3.3.1

306   test6                           Wazuh v3.3.1

313   test6                           Wazuh v3.3.1

    • Manual update of an agent (successful):

# /var/ossec/bin/agent_upgrade -d -a 003

 

Manager version: v3.7.0

Agent version: v3.3.1

Agent new version: v3.7.0

WPK file already downloaded: /var/ossec/var/upgrade/wazuh_agent_v3.7.0_windows.wpk – SHA1SUM: 79678fd4ab800879aacd4451a64e799c62688b64

Upgrade PKG: wazuh_agent_v3.7.0_windows.wpk (2108 KB)

MSG SENT: 271 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: ok

MSG SENT: 271 com lock_restart -1

RESPONSE: ok

Chunk size: 512 bytes

Sending: /var/ossec/var/upgrade/wazuh_agent_v3.7.0_windows.wpk

MSG SENT: 271 com close wazuh_agent_v3.7.0_windows.wpk

RESPONSE: ok

MSG SENT: 271 com sha1 wazuh_agent_v3.7.0_windows.wpk

RESPONSE: ok 79678fd4ab800879aacd4451a64e799c62688b64

WPK file sent

MSG SENT: 271 com upgrade wazuh_agent_v3.7.0_windows.wpk upgrade.bat

RESPONSE: ok 0

Upgrade procedure started

MSG SENT: 271 com upgrade_result

RESPONSE: err Maximum attempts exceeded

MSG SENT: 271 com upgrade_result

RESPONSE: err Cannot read upgrade_result file.

MSG SENT: 271 com upgrade_result

RESPONSE: ok 0

 

Agent upgraded successfully

 

    • Using the list provided by agent_upgrade, you can copy the agent IDs to a txt file:
    • vi agentupgrade.txt

003

165

192

271

etc.

    • You can then use a for loop like so to cycle through the list
    • for name in `cat agentupgrade.txt`; do /var/ossec/bin/agent_upgrade -a $name; echo $name; done

 Notes: Avoid exiting the loop once it is running, as that may cause issues; I obviously didn’t put in any error handling to fail and exit (a slightly more defensive version is sketched below). The other issue I noticed is that Windows 2016 and Windows 7 machines had problems updating the agent; I saw the errors indicated below. The tool would update the agent and then time out without re-inputting the manager IP and key. I had to manually update the failed machines, as Wazuh was unable to provide me with an answer as to why it was failing, and I was able to replicate the issue on 50 machines. In short, if you are going to upgrade and have 1000 machines, I highly recommend doing lots of simulations before you upgrade, as this is one of the most important parts of the upgrade, and one they fail to mention in their documentation.
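
For what it’s worth, a slightly more defensive version of that loop might look like the sketch below. This is just an untested sketch on my part: failed.txt is a hypothetical file of my choosing, used so that failed agents get recorded instead of silently skipped:

#!/bin/bash
# Sketch: upgrade each agent listed in agentupgrade.txt, recording failures.
while read -r id; do
    echo "Upgrading agent $id..."
    if ! /var/ossec/bin/agent_upgrade -a "$id"; then
        echo "$id" >> failed.txt    # hypothetical log of agents to revisit manually
    fi
done < agentupgrade.txt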

 

Errors:

                  # /var/ossec/bin/agent_upgrade -d -a 298 

Manager version: v3.7.0

Agent version: v3.3.1

Agent new version: v3.7.0

WPK file already downloaded: /var/ossec/var/upgrade/wazuh_agent_v3.7.0_windows.wpk – SHA1SUM: 79678fd4ab800879aacd4451a64e799c62688b64

Upgrade PKG: wazuh_agent_v3.7.0_windows.wpk (2108 KB)

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

MSG SENT: 298 com open wb wazuh_agent_v3.7.0_windows.wpk

RESPONSE: err Maximum attempts exceeded

Error 1715: Error sending WPK file: Maximum attempts exceeded

Traceback (most recent call last):
  File "/var/ossec/bin/agent_upgrade", line 165, in <module>
    main()
  File "/var/ossec/bin/agent_upgrade", line 119, in main
    rl_timeout=-1 if args.timeout == None else args.timeout, use_http=use_http)
  File "/var/ossec/bin/../framework/wazuh/agent.py", line 2206, in upgrade
    show_progress=show_progress, chunk_size=chunk_size, rl_timeout=rl_timeout, use_http=use_http)
  File "/var/ossec/bin/../framework/wazuh/agent.py", line 2102, in _send_wpk_file
    raise WazuhException(1715, data.replace("err ",""))
wazuh.exception.WazuhException: Error 1715 - Error sending WPK file: Maximum attempts exceeded

How to survive a log flood – wazuh

This is taken directly from the wazuh documentation, but I thought it would be good to have here for people browsing through. The main section to take notice of is how to adjust the agent buffer via ossec.conf on the client side for troubleshooting purposes.

Survive a log flood

A centralized logging system needs to be able to process many events per second (eps) from many different log sources at the same time, but sometimes things just get completely out of hand. A variety of problems like infinite loop conditions, poorly written software, and misconfigured applications can cause one or a few individual devices to suddenly start producing a huge and unstopping stream of log messages rushing at your logging system at a rate of hundreds or even thousands of events per second. When such a device or devices suddenly take up vastly more than their fair share of network and log processing resources, it can become widely disruptive. Log flooding can saturate your network bandwidth and/or overtax your Wazuh and Elastic system components while one gigabyte after another of likely the same repeated log messages are being reanalyzed and churned to disk.

Thankfully the Wazuh agent has a flood protection mechanism to prevent out of control log production on one system from creating disruptions to your network or to your Wazuh/Elastic services. In this lab we will create a small log flood and observe how it is gracefully contained by the Wazuh agent before it departs the system where the logs are produced. We will also take a look at the leaky bucket queue that Wazuh uses to accomplish this. Lastly we will note the alerts that are produced to keep us informed about the onset of, escalation of, and recovery from log flooding events.

Configure the Wazuh agent client buffer on linux-agent

  • In this lab, we will limit agent log production to 20 events per second (eps). By default, this limit is prevented from being set to lower than 50, so we will override that by changing the relevant internal options setting. This does not actually set an eps limit. Rather, it is a strictly agent-side setting that protects the agent from being inadvertently subjected to overly restrictive eps limits pushed to it via Wazuh manager centralized configuration. Here we make it possible to enforce an eps as low as 10.
  • # echo "agent.min_eps=10" >> /var/ossec/etc/local_internal_options.conf
  • Open /var/ossec/etc/ossec.conf and find the <client_buffer> section, which looks like this:
  • <client_buffer>
  • <!-- Agent buffer options -->
  • <disabled>no</disabled>
  • <queue_size>5000</queue_size>
  • <events_per_second>500</events_per_second>
  • </client_buffer>
  • Restart the Wazuh agent
  • # ossec-control restart

Note

The client buffer is explained in detail in the Wazuh User manual. Search for “Anti-flooding mechanism”. In brief, it allows a Wazuh agent to limit the rate at which it sends log events to the Wazuh Manager. If events are produced at a rate in excess of the configured eps limit, then they are stored in a leaky bucket queue until the eps rate slows down enough that the queue contents can be sent along to the Wazuh Manager. If the queue gets full, then any new events are dropped, i.e. the bucket leaks. Various alerts are sent to the Wazuh Manager about all of this.
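
To make the leaky bucket behaviour concrete, here is a toy model of it in bash. This is purely illustrative on my part and is not Wazuh code; the numbers simply mirror the queue_size and events_per_second values used in this lab:

#!/bin/bash
# Toy leaky bucket: events arrive in bursts, drain at a fixed eps,
# and anything that overflows the queue is dropped.
queue_size=500; eps=20; queued=0; dropped=0
for produced in 100 400 300 0 0 0; do        # events generated each second
    queued=$(( queued + produced ))
    if (( queued > queue_size )); then       # bucket full: the excess leaks away
        dropped=$(( dropped + queued - queue_size ))
        queued=$queue_size
    fi
    sent=$(( queued < eps ? queued : eps )) # forward at most eps per second
    queued=$(( queued - sent ))
    echo "produced=$produced sent=$sent queued=$queued dropped=$dropped"
done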

  • To ensure our flood simulation causes queueing and ultimately overflows the queue, change <queue_size> to 500 and <events_per_second> to 20. Save and close ossec.conf. The new section should look like this:
  • <client_buffer>
  • <!-- Agent buffer options -->
  • <disabled>no</disabled>
  • <queue_size>500</queue_size>
  • <events_per_second>20</events_per_second>
  • </client_buffer>

Warning

These settings are small for simulation purposes. You would not want to make them this low in production.

Make Wazuh manager record alerts for each flooded event record

Because we will intentionally include the word “fatal” in the flooding log records we generate, they each will trigger generic Wazuh rule 1002 which has a low severity level of 2. By default, Wazuh Manager does not record alerts on rules of severity levels less than 3, so for this lab we will lower the threshold.

  • Edit /var/ossec/etc/ossec.conf and change <log_alert_level> from 3 to 1 so that the <alerts> section looks like below. Now alerts of all severity levels will show up in Kibana.
  • <alerts>
  • <log_alert_level>1</log_alert_level>
  • <email_alert_level>12</email_alert_level>
  • </alerts>
  • Restart Wazuh Manager.
  • # ossec-control restart
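
If you want to confirm the edit took effect, a quick sanity check along these lines works:

 # grep -A 2 '<alerts>' /var/ossec/etc/ossec.conf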

Generate a log flood on linux-agent

  • Create a script called /usr/local/bin/makeflood, with this content:
  • #!/bin/bash
  • for i in {1..10000}
  • do
  • echo -n "1:floodtest:Feb 3 03:08:47 linux-agent centos: fatal firehose $i" | ncat -Uu /var/ossec/queue/ossec/queue
  • echo -n "."
  • done

Note

While we could write records to a log file monitored by Wazuh agent, this script takes an even faster approach of writing records directly to the Wazuh agent’s internal socket where, for example, ossec-logcollector streams new log lines from log files. The script uses netcat to do this, but any tool that can write datagrams to a Unix socket will do the job. Sometimes it is desirable to have a script on a Wazuh agent send results directly back to the Wazuh manager while completely bypassing the agent’s filesystem. The quoted log line that is piped to netcat consists of three colon-separated parts. First, the “1” corresponds to the syslog log type. The second field causes the location metadata value to be set to “floodtest”. After that is a log line just like you might see in /var/log/messages.
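
As a quick usage example of that record format, you could hand-craft a single event and push it through the socket with the same ncat invocation the script uses, then look for it in Kibana:

 # echo -n "1:floodtest:Feb 3 03:08:47 linux-agent centos: fatal single test" | ncat -Uu /var/ossec/queue/ossec/queue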

  • Make the script executable and then run it to generate a rapid flood of 10,000 log entries.
  • # chmod 700 /usr/local/bin/makeflood
  • # makeflood
  • Notice that the periods representing log messages are scrolling across the screen at a rate well above our 20 eps limit.

See what happened according to Kibana

  • Query Kibana for “firehose”. Click [Add] next to “full_log” for readability. Change the scale from “Auto” to “Second”.

Wazuh Agent Client Buffer

  • Notice that the flooding events only arrived at the Wazuh Manager at a rate of 20 eps, our intended limit. The client buffer eps limit worked!
  • Notice that only 1,269 hits are reported for a flood of 10,000 events. It appears many of the flooded events were lost; they overflowed the queue and were dropped.
  • Expand one of the “firehose” records and compare the field values to the script you used to produce these records.

  • Query Kibana for “agent_flooding”. Click [Add] additionally next to “rule.description” and “data.level” for readability.

  • Observe how Wazuh alerts us at various stages of a flooding event so that we can know when we need to intervene with an over-logging system that is not recovering to a normal state on its own.

Return linux-agent to normal client buffer settings

  • In the <client_buffer> section of /var/ossec/etc/ossec.conf file, change it back to this:
  • <client_buffer>
  • <!-- Agent buffer options -->
  • <disabled>no</disabled>
  • <queue_size>5000</queue_size>
  • <events_per_second>500</events_per_second>
  • </client_buffer>
  • Restart the Wazuh agent
  • # ossec-control restart

Congratulations on completing this lab. You survived the log flood!
