Disk Full in SPS 6.12

Hi Safeguard experts,

We hit 100% disk usage on SPS and it now rejects all incoming connections. Is there a way to quickly check which files are taking up the space and free up the disk from the admin page? We have already defined a cleanup policy (delete data after 4 days, scheduled daily) and enabled cleanup when usage reaches 90% of the disk capacity. However, usage still grows beyond that limit. Any suggestions are appreciated. Thank you.

Ronald

  • We followed the guide "Recover from full disk situation" (261510) on oneidentity.com, but we are not able to restart lighttpd.service. Could anyone help? Thank you.

    (core/master/test)root@localhost:~# df -h /mnt/drbd
    Filesystem Size Used Avail Use% Mounted on
    none 84G 56G 24G 71% /
    (core/master/test)root@localhost:~# systemctl restart lighttpd.service
    Failed to restart lighttpd.service: Unit lighttpd.service not found.

  • Hi Ronald,

    Please refer to this KB regarding the web service. In newer SPS versions it should be nginx rather than lighttpd:

    Try: systemctl restart nginx.service

    https://support.oneidentity.com/one-identity-safeguard-for-privileged-sessions/kb/333085/what-web-server-service-is-running-on-the-sps-appliance

    Thanks!

  • Hi Tawfiq,

    Thanks for the information. I tried that command and it returned the following:

    (boot/master/test)root@localhost:~# systemctl restart nginx.service
    Job for nginx.service failed because the control process exited with error code.
    See "systemctl status nginx.service" and "journalctl -xe" for details.

    (boot/master/test)root@localhost:~# systemctl status nginx.service
    ● nginx.service - A high performance web server and a reverse proxy server
    Loaded: loaded (/lib/systemd/system/nginx.service; disabled; vendor preset>
    Active: failed (Result: exit-code) since Tue 2022-06-07 08:51:58 AEST; 2mi>
    Docs: man:nginx(8)
    Process: 2042614 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_pr>
    Process: 2042618 ExecStart=/usr/sbin/nginx -g daemon on; master_process on;>

    Jun 07 08:51:55 localhost systemd[1]: Starting A high performance web server an>
    Jun 07 08:51:55 localhost nginx[2042618]: nginx: [emerg] bind() to 0.0.0.0:443 >
    Jun 07 08:51:56 localhost nginx[2042618]: nginx: [emerg] bind() to 0.0.0.0:443 >
    Jun 07 08:51:56 localhost nginx[2042618]: nginx: [emerg] bind() to 0.0.0.0:443 >
    Jun 07 08:51:57 localhost nginx[2042618]: nginx: [emerg] bind() to 0.0.0.0:443 >
    Jun 07 08:51:57 localhost nginx[2042618]: nginx: [emerg] bind() to 0.0.0.0:443 >
    Jun 07 08:51:58 localhost nginx[2042618]: nginx: [emerg] still could not bind()
    Jun 07 08:51:58 localhost systemd[1]: nginx.service: Control process exited, co>
    Jun 07 08:51:58 localhost systemd[1]: nginx.service: Failed with result 'exit-c>
    Jun 07 08:51:58 localhost systemd[1]: Failed to start A high performance web se>

    (boot/master/test)root@localhost:~# journalctl -xe
    -- The unit nginx.service has entered the 'failed' state with result 'exit-code>
    Jun 07 08:51:58 localhost systemd[1]: Failed to start A high performance web se>
    -- Subject: A start job for unit nginx.service has failed
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A start job for unit nginx.service has finished with a failure.
    --
    -- The job identifier is 647 and the job result is failed.
    Jun 07 08:51:58 localhost systemd[1]: bootfw-httpd.service: Succeeded.
    -- Subject: Unit succeeded
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- The unit bootfw-httpd.service has successfully entered the 'dead' state.
    Jun 07 08:51:58 localhost systemd[1]: Stopped HTTPd on boot firmware to serve u>
    -- Subject: A stop job for unit bootfw-httpd.service has finished
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A stop job for unit bootfw-httpd.service has finished.
    --
    -- The job identifier is 700 and the job result is done.
    lines 1129-1151/1151 (END)

  • Hi Ronald,

    To check the disk usage, please run the commands below:

    - Size of all audit trail files
    du -sch /mnt/firmware/var/lib/zorp/audit

    - Size of all system logs
    du -sch /var/log

    - Size of the metadb
    du -sch /var/lib/postgresql

    If most of the data turns out to be in the audit trail path, run the following command:

    du /mnt/firmware/var/lib/zorp/audit -hxa -t 1G | sort -rh | head -20

    This should show up to 20 of the largest entries (files and directories) of at least 1 GiB under the audit trail path.

    If the issue is in the logs, run the same command against the logs path:

    du /var/log -hxa -t 1G | sort -rh | head -20

    From here you can decide whether to delete any of the large files that are no longer needed.

    Once space is recovered, you can verify it using the command below:

    df -h

    Then reboot the appliance.
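
As a complement to the du commands above, here is a minimal sketch of a helper that lists only regular files, largest first, which avoids directory totals mixing into the ranking. The function name `largest_files` is hypothetical and not part of SPS; the sketch assumes a POSIX shell with `find`, `du`, `sort` and `head` available.

```shell
#!/bin/sh
# largest_files DIR [COUNT]
# Hypothetical helper (not an SPS tool): print the COUNT largest regular
# files under DIR as "<size in KiB><TAB><path>", largest first.
largest_files() {
    dir=$1
    count=${2:-20}
    # -xdev keeps find on the filesystem that DIR lives on.
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

For example, `largest_files /var/log 20` or `largest_files /mnt/firmware/var/lib/zorp/audit 20` would cover the same paths as the commands above.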

  • Hi Tawfiq,

    Thanks for the reply. We removed some of the logs and usage is now at 68%. We rebooted the appliance from both the admin console and the SPS web interface. However, it is still not connecting.

    (boot/master/test)root@localhost:~# df -h
    Filesystem Size Used Avail Use% Mounted on
    udev 7.8G 0 7.8G 0% /dev
    tmpfs 1.6G 740K 1.6G 1% /run
    none 9.8G 885M 8.5G 10% /
    /dev/mapper/vg--root-boot 9.8G 885M 8.5G 10% /initrd/mnt
    /dev/loop0 222M 222M 0 100% /initrd/mnt/root-ro
    tmpfs 7.9G 0 7.9G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
    tmpfs 7.9G 16K 7.9G 1% /tmp
    /dev/sdb1 32G 49M 30G 1% /mnt/azure-resource
    /dev/mapper/vg--root-core 84G 54G 26G 68% /mnt/drbd
    /dev/loop1 1.4G 1.4G 0 100% /mnt/firmware-ro
    none 84G 54G 26G 68% /mnt/firmware

    One thing we noticed after rebooting: the following message appears when I log in to the admin console. Is there anything we missed?

    +-------------------Error---------------------+
    |                                             |
    | Failed systemd units on core firmware:      |
    | close-active-sessions-in-elastic.service    |
    |                                             |
    +---------------------------------------------+
    |                   < OK >                    |
    +---------------------------------------------+
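
The Use% column in the df output above can also be checked in one step against the 90% cleanup threshold mentioned in the original question. The following is only a sketch: `usage_percent` and `over_threshold` are hypothetical helper names, and the 90 default simply mirrors the policy described in the question.

```shell
#!/bin/sh
# usage_percent MOUNT
# Print the Use% of the filesystem holding MOUNT as a bare integer.
# df -P forces the single-line POSIX output format.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold MOUNT [LIMIT]
# Succeed (exit 0) when usage is at or above LIMIT percent.
# LIMIT defaults to 90, an assumption taken from the cleanup policy.
over_threshold() {
    [ "$(usage_percent "$1")" -ge "${2:-90}" ]
}
```

A possible use would be `over_threshold /mnt/firmware && echo "cleanup needed"`.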

  • Hi Ronald,

    If web access is now working OK but you still have an issue connecting via SPS, I would recommend opening a service request with One Identity support to troubleshoot further.

    Thanks!

  • Hi Ronald.

    I have a very useful Linux command for tracking the size of session recordings. It produces CSV data with each recording's file size and full path, plus the username and server IP address looked up from the database. Only recordings larger than 1 GiB are listed.

    echo "session_size,session_record_file,username,server_ip"; find /var/lib/zorp/audit/ -type f -size +1G -name '*.zat' | while IFS= read -r line; do ls -lh "$line" | awk -v OFS=',' '{print $5,$9}' | tr '\n' ','; psql -U scb scb -t --csv -c "select remote_username,server_ip from channels where audit LIKE '%$line%' LIMIT 1"; done

    When the elastic service error appears, look at the elastic service logs and then manually clear any broken data.
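
The reply above does not say which logs to look at. One possible starting point, assuming the failed unit writes to the systemd journal, is to print the recent journal entries of every failed unit. The function names below are hypothetical; `systemctl list-units --state=failed` and `journalctl -u` are standard systemd commands.

```shell
#!/bin/sh
# failed_unit_names
# Read "systemctl list-units --plain --no-legend" style lines on stdin
# and print only the first column (the unit name), skipping empty lines.
failed_unit_names() {
    awk 'NF { print $1 }'
}

# show_failed_unit_logs
# Hypothetical helper: for every failed unit, print the last 50 journal
# lines from the current boot.
show_failed_unit_logs() {
    systemctl list-units --state=failed --plain --no-legend \
        | failed_unit_names \
        | while IFS= read -r unit; do
              echo "== $unit =="
              journalctl -u "$unit" -b --no-pager | tail -n 50
          done
}
```

For the single unit named in the error dialog, `journalctl -u close-active-sessions-in-elastic.service -b --no-pager` gives the same information directly.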

  • Thank you. We tried clearing up the disk space, but the connection could not be recovered. We eventually spun up another VM and reconfigured it.
