
Filestore directory becomes inaccessible/undeletable when containing long link - "Remote I/O error"

Summary:

I recently created a Compute Engine instance (with an Ubuntu 24.04 boot image) and successfully created and mounted a Filestore share (NFSv4.1) via NFS/mount.

When restoring my data from a 500 GB+ .tar file, I found that some (just a few) directories could be neither listed nor deleted.

Details:

For the problematic directory "foo" in the restored ./bar/foo/, I can

$ cd bar 

and

ls -1 .

, which produces "foo". In addition

cd ./bar/foo

works but, then

ls 

results in  

reading directory '.': Remote I/O error

while trying to remove foo via

cd ./bar; sudo rm -rf foo

(or a variety of similar actions) results in

rm: cannot remove 'foo': Remote I/O error

I CAN mv/rename "foo" to something else (e.g. "baz"), but then I cannot delete the "something else".

A bit of exploration with creating directories, files, and links shows that after moving foo to a new name, I can recreate a new subdirectory called foo, and then create regular files and links in the recreated foo. However, when I create a symbolic link in foo AND that link's target file name length exceeds 127 characters (!), the problem manifests: any preexisting contents of foo become unreadable and foo itself becomes undeletable (as above).

This is not intermittent; I have over a dozen examples of this behavior.

Request (Help!):

  1. How can I manage links with targets longer than 127 characters on Filestore with NFS4.1, so I can usefully restore my tar files? (The extraction command (tar -xvf) seems to complete without error, and the archive contains a few links with long targets, generally long absolute paths.)
  2. How can I delete these old directories and release their content (short of a backup and restore to a new file share), given that any attempt to remove them (or rename then remove) results in "rm: cannot remove 'foo': Remote I/O error"? [The considered alternative seems to be backup and restore to a new Filestore share, but the backup may simply not save the problematic links.]
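As context for request 1, a minimal sketch of how one might scan a restored tree for symlinks whose targets exceed the 127-character threshold before they ever reach the Filestore share ("." below is a placeholder for the restore root, not a path from this post):

```shell
#!/bin/sh
# List symlinks whose target length exceeds 127 characters, printing
# the length, the link path, and the target for review.
find . -type l | while IFS= read -r link; do
  target=$(readlink "$link")
  if [ "${#target}" -gt 127 ]; then
    printf '%d\t%s -> %s\n' "${#target}" "$link" "$target"
  fi
done
```

Running this against the extracted archive on a local disk would identify the handful of links that need special handling.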

Note this problem does not arise on AWS EC2/EFS/NFS4, nor on local directories of my Ubuntu 20.04 VM.

Many thanks in advance,

1 ACCEPTED SOLUTION

Resolution!

Indeed, it looks like this was a server-side defect. Based on reports to me, it was an NFS software issue that affected regional shares in us-central1. The problematic software version was 3.26.0.1, and the problem was fixed by an update to 3.27.0.4 that was rolled out a few days after my report.

The files generated by the script (and my migration) were indeed there (albeit inaccessible) and became visible (and deletable) once the server software was updated. My thanks to Google support and the FileStore team.



Hi @bgulko,

Welcome to Google Cloud Community!

I understand your concern about the "Remote I/O" errors and problems with symbolic links on Google Cloud Filestore. It seems likely that this behavior is caused by path-length limitations or the NFS4.1 setup. Here are some recommendations to work around the issue:

  1. Adjust NFS Mount Options: Try tweaking options like vers=4.1, nolock, and adjust buffer sizes (rsize, wsize). This could improve how Filestore handles these requests.
  2. Shorten Symbolic Link Targets: If possible, reduce the length of symbolic link paths or exclude longer ones when restoring from the archive.
  3. Use Filesystem Tools: Running fsck or mounting the Filestore share on another machine may help in clearing out problematic directories.
  4. Recreate the Volume: If deletions still fail, you could back up and restore the data to a new Filestore share, cleaning up any long links during the process.
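For recommendation 1, an illustrative /etc/fstab entry pinning the NFS version and buffer sizes might look like the following; the server address, share name, and mount point are placeholders, and the buffer sizes are examples rather than tuned values:

```
# Placeholder server/share; pins NFS 4.1 and explicit rsize/wsize
10.0.0.2:/share  /mnt/filestore  nfs4  rw,hard,vers=4.1,rsize=1048576,wsize=1048576,timeo=600  0 0
```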

If you have a support package, you may reach out to Google Cloud Support; they might offer additional insights or fixes for this limitation.

I hope the above information is helpful.

First and foremost, thank you for your response!

I am still encountering this problem, and have already performed a number of the steps you mention, but not all. While I would be happy to purchase Google Cloud Support, a purchase seems to require an Organization associated with this Project (e.g. Workspace), and this project has no Organization. Acquiring one has proven problematic in the past. Perhaps you know how to access/pay for Google Cloud Support directly (without an Organization)? If so, please let me know; I would appreciate it.

I have attempted some further exploration of this issue. It appears that this problem occurs when the long link is provided via a streaming source (e.g., SFTP) or extracted from a streamed tar file, for example:

gsutil cat foo.tar | tar -xvf -

or

[long-target link created via ssh / sftp]

The problem does NOT manifest when the link is created locally via

ln -s <long target> foo

nor

tar -xvf foo.tar

This leads me to believe that some intermediate layer is assuming that the source is a regular file that may be scanned backwards, rather than a stream, which cannot be (without appropriate buffering). Perhaps this is a problematic interaction with the "Long Link Trick" used to store long link targets in tar files.
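The streamed-versus-file distinction can be checked on any healthy filesystem: GNU tar stores over-long symlink targets in a separate "@LongLink" extension entry, and extraction from a pipe should behave identically to extraction from a seekable file. A minimal sketch (all paths are scratch placeholders, not files from this thread):

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Create a symlink whose target is well over the 127-character threshold;
# GNU tar records such targets via a "@LongLink" extension entry.
long_target=/$(printf 'a%.0s' $(seq 1 200))
mkdir src
ln -s "$long_target" src/longlink
tar -cf archive.tar src

# Extract once from a pipe (as with `gsutil cat ... | tar -x`) and once
# from the file; on a healthy filesystem the two results are identical.
mkdir out_stream out_file
cat archive.tar | tar -xf - -C out_stream
tar -xf archive.tar -C out_file
readlink out_stream/src/longlink
readlink out_file/src/longlink
```

On the affected Filestore share, the expectation based on the observations above is that only the streamed extraction would trigger the error.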

I appreciate your suggestions, and here are the results from what I have tried:

Adjust NFS Mount Options:

I have repeatedly remounted after adjusting mount settings, varying buffer sizes and versions as well as using default settings. I have also created a separate VM using Ubuntu 22.04 (rather than 24.04) and mounted the share from that other machine instance; all seem to exhibit the same problem.

Shorten Symbolic Link Targets

It seems that would work. Unfortunately, I cannot shorten the symbolic link target, as it is created by an intermediate application that runs in a container. The application has control of the link target length.

Use Filesystem Tools

I’m afraid fsck does not seem to do anything from the client side of an NFS/Filestore share, and I do not have access to the server (from GCS/Filestore). I have searched for tools that might operate from a client and found none (only server-side tools); perhaps you could point some out?

Recreate the Volume

This will be my last option, and with a TB-scale archive it is quite cumbersome. The entire directory that contains the problematic link becomes inaccessible once the link is created, so a backup of the affected share would lose valuable information in that directory. However, I can restore all files except the links from the original archive, then manually recreate the few problematic links. Alternatively, I could create a third Filestore share to hold the TB tar archive and then restore without streaming the archive from the GCS bucket. There are some workarounds, but as a solo practitioner my resources are limited, and this has been extremely distracting.
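The "restore everything except the problematic links, then recreate them by hand" workaround could be sketched as follows; "archive.tar", "skip.txt", and "links.tsv" are placeholder names I am introducing for illustration, not files from this thread:

```shell
#!/bin/sh
# skip.txt lists one archive member name per line to leave out of the
# extraction (GNU tar's -X reads exclude patterns from a file).
tar -xf archive.tar -X skip.txt

# links.tsv holds "link-path<TAB>target" pairs for the skipped symlinks,
# recreated locally so they avoid the streamed-extraction code path.
while IFS="$(printf '\t')" read -r path target; do
  ln -s "$target" "$path"
done < links.tsv
```

The skip list could be produced from the archive listing itself (e.g. `tar -tvf archive.tar | grep ' -> '` to find symlink members), keeping the manual step down to the handful of long-target links.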

I’d be delighted to pay for a Google Cloud Support package, but I haven’t been able to figure out how to do this without an Organization associated with my Project (it has none), and I’d need a great deal of technical support before opting for an Organization/Workspace infrastructure, as this carries a great deal of problematic technical overhead. Any further help or suggestions would be greatly appreciated!

One more bit of information: this seems related to the length of the file names in the directory, not the contents of those files. When the directory is composed of the following files:

 

.
..
.command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.command.trace
.exitcode
GRCh38_tEFO1060_ALL_vcf.pgen
GRCh38_tEFO1060_ALL_vcf.psam
GRCh38_tEFO1060_ALL_vcf.pvar.zst
NO_FILE
tEFO10601_abcdefghiklmnopqrstuvw
tEFO1060_ALL_additive_1.log
tEFO1060_ALL_additive_1.sscore.vars
tEFO1060_ALL_additive_1.sscore.zst
versions.yml

The bug is NOT triggered. Also, when the file named

tEFO10601_abcdefghiklmnopqrstuvw 

is shorter (e.g. tEFO10601_abcdefgh), the bug is not triggered. However, as soon as I add a single character to the filename, for example "x", via

mv tEFO10601_abcdefghiklmnopqrstuvw tEFO10601_abcdefghiklmnopqrstuvwx

the bug is triggered and the entire directory is no longer readable nor deletable.

Perhaps this is enough to identify the problem.

Below is a script that demonstrates the problem as well as the output and my mounting entries

Script

#!/bin/bash
#
l_files=".command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.command.trace
.exitcode
GRCh38_tEFO1060_ALL_vcf.pgen
GRCh38_tEFO1060_ALL_vcf.psam
GRCh38_tEFO1060_ALL_vcf.pvar.zst
NO_FILE
tEFO10601_abcdefghiklmnopqrstuvw
tEFO1060_ALL_additive_1.log
tEFO1060_ALL_additive_1.sscore.vars
tEFO1060_ALL_additive_1.sscore.zst
versions.yml"

echo -e "\n`date` Demonstrating NFS4 Error."
d_targ="y"

echo -e "\n`date` Operating in directory."
pwd
realpath .

echo -e "\n`date` making files in ${d_targ}"
mkdir -p "${d_targ}"

for i in $( echo "${l_files}" ); do
  f_out="${d_targ}/${i}"
  echo "touch ${f_out}"
  touch "${f_out}"
done

echo -e "\n`date` listing files before triggering error"
ls -aslF ${d_targ}/.*
ls -aslF ${d_targ}/*

echo -e "\n`date` triggering error"
echo "mv -v ${d_targ}/tEFO10601_abcdefghiklmnopqrstuvw ${d_targ}//tEFO10601_abcdefghiklmnopqrstuvwx"
mv -v "${d_targ}/tEFO10601_abcdefghiklmnopqrstuvw" "${d_targ}//tEFO10601_abcdefghiklmnopqrstuvwx"

echo -e "\n`date` demonstrating error"
set -x
ls -asl "${d_targ}"
set +x

echo -e "\n`date` Script complete - exiting."

Output

$ ./foo.sh
Sat Sep 28 06:13:25 UTC 2024 Demonstrating NFS4 Error.

Sat Sep 28 06:13:25 UTC 2024 Operating in directory.
/home/proto/wrk/00_src/experiments/005_pgs_firstset_revised/traits/EFO_0001060/v4_child_worked/work/4f
/mnt/efs/fs1/base/inf/experiments/005_pgs_firstset_revised/traits/EFO_0001060/v4_child_worked/work/4f

Sat Sep 28 06:13:25 UTC 2024 making files in y
touch y/.command.begin
touch y/.command.err
touch y/.command.log
touch y/.command.out
touch y/.command.run
touch y/.command.sh
touch y/.command.trace
touch y/.exitcode
touch y/GRCh38_tEFO1060_ALL_vcf.pgen
touch y/GRCh38_tEFO1060_ALL_vcf.psam
touch y/GRCh38_tEFO1060_ALL_vcf.pvar.zst
touch y/NO_FILE
touch y/tEFO10601_abcdefghiklmnopqrstuvw
touch y/tEFO1060_ALL_additive_1.log
touch y/tEFO1060_ALL_additive_1.sscore.vars
touch y/tEFO1060_ALL_additive_1.sscore.zst
touch y/versions.yml

Sat Sep 28 06:13:25 UTC 2024 listing files before triggering error
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.begin
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.err
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.log
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.out
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.run
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.sh
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.trace
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.exitcode
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.pgen
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.psam
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.pvar.zst
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/NO_FILE
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO10601_abcdefghiklmnopqrstuvw
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.log
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.sscore.vars
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.sscore.zst
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/versions.yml

Sat Sep 28 06:13:25 UTC 2024 triggering error
mv -v y/tEFO10601_abcdefghiklmnopqrstuvw y//tEFO10601_abcdefghiklmnopqrstuvwx
renamed 'y/tEFO10601_abcdefghiklmnopqrstuvw' -> 'y//tEFO10601_abcdefghiklmnopqrstuvwx'

Sat Sep 28 06:13:25 UTC 2024 demonstrating error
+ ls -asl y
ls: reading directory 'y': Remote I/O error
total 0
+ set +x

Sat Sep 28 06:13:25 UTC 2024 Script complete - exiting.

 

/etc/fstab

XX.XX.XX.XX:/progenic_XXX       /mnt/efs/fs1    nfs4    rw,async        0 0

/etc/mtab

XX.XX.XX.XX:/progenic_XXX /mnt/efs/fs1 nfs4 rw,relatime,vers=4.1,rsize=360448,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.128.0.6,local_lock=none,addr=XX.XX.XX.XX 0 0

 

This has now been verified by my support agent and submitted via a paid support program ticket as a P2 issue to the FileStore team. It can be referenced as:

Case 53763055: Adding file or changing name in FileStore directory renders entire directory inaccessible and undeletable.

I'll follow up further if there is a resolution (or, of course, if I am mistaken!).
