Summary:
I recently created a Compute Engine instance (boot image: Ubuntu 24.04), and successfully created and mounted a FileStore share (NFS4) via NFS/mount.
When restoring my data from a 500GB+ .tar file, I found that a few directories could be neither listed nor deleted.
Details:
for problematic directory "foo" in restored ./bar/foo/ I can
$ cd bar
and
ls -1 .
, which produces "foo". In addition
cd ./bar/foo
works but, then
ls
results in
reading directory '.': Remote I/O error
while trying to remove foo by
cd ./bar; sudo rm -rf foo
( or a variety of similar actions) results in
rm: cannot remove 'foo': Remote I/O error
I CAN mv/rename "foo" to something else (e.g. "baz"), but then I cannot delete the "something else".
A bit of exploration in creating directories, files, and links shows that after moving foo to a new name, I can create a new subdirectory called foo, and then create regular files and links in it. However, when I create a symbolic link in foo AND that link's target file name length exceeds 127 (!), the problem manifests: any preexisting contents of foo become unreadable and foo itself becomes undeletable (as above).
This is not intermittent, I have over a dozen examples of this behavior.
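To locate candidate links before they cause trouble, something like the following could be used. It is only a sketch: find_long_links is a hypothetical helper name, the default limit of 127 is simply the threshold observed above, and it assumes path names without embedded newlines.

```shell
#!/bin/sh
# Hypothetical helper: list symbolic links under a directory whose target
# string is longer than a limit (default 127, the threshold observed above).
# Lengths are counted in characters, which equals bytes for ASCII targets.
find_long_links() {
    root=${1:-.}
    limit=${2:-127}
    find "$root" -type l | while IFS= read -r link; do
        # Skip entries whose target cannot be read.
        target=$(readlink "$link") || continue
        len=${#target}
        if [ "$len" -gt "$limit" ]; then
            # Print the target length, a tab, then the link path.
            printf '%s\t%s\n' "$len" "$link"
        fi
    done
}
```

Calling find_long_links ./bar would print one line per link whose target exceeds 127 characters. On a share that is already affected, find itself may report errors for the unreadable directories.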
Request (Help!):
Note that this problem does not arise on AWS EC2/EFS/NFS4, nor in local directories on my local Ubuntu 20.04 VM.
Many thanks in advance,
Resolution!
Indeed, it looks like this was a server-side defect. Based on reports to me, it was an NFS software issue that affected regional shares in us-central1. The problematic software version was 3.26.0.1, and the problem was fixed by an update to 3.27.0.4 that was rolled out a few days after my report.
The files generated by the script (and my migration) were indeed there (albeit inaccessible) and became visible (and deletable) once the server software was updated. My thanks to Google support and the FileStore team.
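A minimal sketch of how one might confirm, after such an update, that no directory under a given root still fails to list. find_unlistable_dirs is a hypothetical helper name, and the approach (try to list each directory and report failures) is an assumption about what is worth checking, not something prescribed by Google support.

```shell
#!/bin/sh
# Hypothetical helper: print every directory under a root that cannot be
# listed (for example because ls fails with "Remote I/O error").
# Assumes directory names without embedded newlines.
find_unlistable_dirs() {
    root=${1:-.}
    # Errors from find are discarded; the ls check below does the reporting.
    find "$root" -type d 2>/dev/null | while IFS= read -r dir; do
        if ! ls -A "$dir" >/dev/null 2>&1; then
            printf '%s\n' "$dir"
        fi
    done
}
```

An empty result from find_unlistable_dirs /mnt/efs/fs1 would suggest that every directory find could reach is listable again.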
Hi @bgulko,
Welcome to Google Cloud Community!
I acknowledge your worry about the "Remote I/O" errors and problems with symbolic links on Google Cloud Filestore. It appears probable that this behavior is being caused by limitations in the NFS4.1 setup or path length. To work around the issue, here are some recommendations:
If you have a support package, you may reach out to Google Cloud Support, they might offer additional insights or fixes for this limitation.
I hope the above information is helpful.
First and foremost, thank you for your response!
I am still encountering this problem, and have already performed a number of the steps you mention, but not all. While I would be happy to purchase Google Cloud Support, a purchase seems to require an Organization associated with this Project (e.g. Workspace), and this project has no Organization. Acquiring one has proven problematic in the past. Perhaps you know how to access/pay for Google Cloud Support directly (without an Organization)? If so, please let me know; I would appreciate it.
I have attempted some further exploration of this issue. It appears that this problem occurs when the long link is provided via a streaming source (e.g., SFTP) or extracted from a streamed tar file, for example:
gsutil cat foo.tar | tar -xvf -
or
[long-target link created via ssh / sftp]
The problem does NOT manifest when the link is created locally via
ln -s <long target> foo
nor
tar -xvf foo.tar
This leads me to believe that some intermediate layer is assuming that the source is a regular file which may be scanned backwards, rather than a stream which cannot (without appropriate buffering). Perhaps this is a problematic interaction with the "Long Link Trick" used to store long links in tar files.
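The streamed-versus-local comparison can be sketched as a small self-contained test. compare_tar_extraction is a hypothetical helper name; it assumes an empty working directory on the share under test, and the 200-character target is an arbitrary value chosen to exceed both the 127 observed above and the 100-character field that makes tar fall back to its long-link handling.

```shell
#!/bin/sh
# Hypothetical helper: extract the same archive from a file and from a pipe,
# then check that the long symlink survived and the directory still lists.
compare_tar_extraction() {
    work=${1:?usage: compare_tar_extraction EMPTY_WORK_DIR}
    # 200-character link target (arbitrary, longer than 127).
    target=$(head -c 200 /dev/zero | tr '\0' 'a')
    mkdir -p "$work/src" "$work/direct" "$work/streamed" || return 1
    ln -s "$target" "$work/src/link" || return 1
    tar -cf "$work/test.tar" -C "$work/src" link || return 1
    # Direct extraction: tar reads a seekable file.
    tar -xf "$work/test.tar" -C "$work/direct" || return 1
    # Streamed extraction: cat is used on purpose so that tar reads a pipe.
    cat "$work/test.tar" | tar -xf - -C "$work/streamed" || return 1
    for d in direct streamed; do
        if [ "$(readlink "$work/$d/link")" = "$target" ] &&
            ls "$work/$d" >/dev/null 2>&1; then
            echo "$d: ok"
        else
            echo "$d: FAILED"
        fi
    done
}
```

On a healthy filesystem both lines should report ok. A FAILED line for only the streamed case would support the hypothesis above, while failures in both cases would point elsewhere.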
I appreciate your suggestions, and here are the results from what I have tried:
Adjust NFS Mount Options:
I have repeatedly remounted while adjusting the mount settings (buffer sizes and NFS versions), as well as using the default settings. I have also created a separate VM using Ubuntu 22.04 (rather than 24.04) and mounted the share from that other machine instance; all show the same problem.
Shorten Symbolic Link Targets
It seems that would work. Unfortunately, I cannot shorten the symbolic link target as it is created by an intermediate application that is run in a container. The application has control of the link target length.
Use Filesystem Tools
I’m afraid fsck does not seem to do anything from the client side of an NFS/Filestore share, and I do not have access to the server (from GCS/Filestore). I have searched for tools that might operate from a client and found none (only server-side tools), perhaps you could point some out?
Recreate the Volume
This will be my last option, and with a TB archive it is quite cumbersome. The entire directory that contains the problematic link becomes inaccessible once the link is created, so a backup of the affected share will lose valuable information in the same directory, but I can restore all files except the link, then manually recreate the few problematic links. Alternatively, I could create a third Filestore to hold the TB tar archive, then restore without streaming the archive from the GCS bucket. There are some workarounds, but as a solo practitioner my resources are limited and this has been extremely distracting.
I’d be delighted to pay for a Google Cloud Support package, but haven’t been able to figure out how to do this without an Organization associated with my Project (it has none), and I’d need a great deal of technical support before opting for an Organization/Workspace infrastructure, as this carries a great deal of problematic technical overhead. Any further help or suggestions would be greatly appreciated!
One more bit of information. This seems related to the length of the file names in the directory, not the contents of those files. When the directory is composed of the following files:
.
..
.command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.command.trace
.exitcode
GRCh38_tEFO1060_ALL_vcf.pgen
GRCh38_tEFO1060_ALL_vcf.psam
GRCh38_tEFO1060_ALL_vcf.pvar.zst
NO_FILE
tEFO10601_abcdefghiklmnopqrstuvw
tEFO1060_ALL_additive_1.log
tEFO1060_ALL_additive_1.sscore.vars
tEFO1060_ALL_additive_1.sscore.zst
versions.yml
The bug is NOT triggered. Also, when the file named
tEFO10601_abcdefghiklmnopqrstuvw
is shorter (e.g. tEFO10601_abcdefgh), the bug is not triggered. However, as soon as I add a single character to the filename, for example "x", via
mv tEFO10601_abcdefghiklmnopqrstuvw tEFO10601_abcdefghiklmnopqrstuvwx
the bug is triggered and the entire directory is no longer readable nor deletable.
Perhaps this is enough to identify the problem.
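To compare directories that do and do not trigger the bug, it may help to tabulate the name lengths involved. This is only a sketch: name_lengths is a hypothetical helper name, and whether the trigger depends on a single name's length or on the total across the directory is an open question here, so it prints both.

```shell
#!/bin/sh
# Hypothetical helper: print the length of every entry name in a directory
# (including dotfiles, excluding . and ..) followed by the sum of lengths.
# Names beginning with ".." are not matched by the glob below.
name_lengths() {
    dir=${1:-.}
    total=0
    for path in "$dir"/* "$dir"/.[!.]*; do
        # Skip glob patterns that matched nothing.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name=${path##*/}
        len=${#name}
        total=$((total + len))
        printf '%4d %s\n' "$len" "$name"
    done
    printf 'total %d\n' "$total"
}
```

Running name_lengths y before and after the rename shows the longest name going from 32 to 33 characters, with the total rising by one.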
Below is a script that demonstrates the problem, followed by its output and my mount entries.
Script
#!/bin/bash
#
l_files=".command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.command.trace
.exitcode
GRCh38_tEFO1060_ALL_vcf.pgen
GRCh38_tEFO1060_ALL_vcf.psam
GRCh38_tEFO1060_ALL_vcf.pvar.zst
NO_FILE
tEFO10601_abcdefghiklmnopqrstuvw
tEFO1060_ALL_additive_1.log
tEFO1060_ALL_additive_1.sscore.vars
tEFO1060_ALL_additive_1.sscore.zst
versions.yml"
echo -e "\n`date` Demonstrating NSF4 Error."
d_targ="y"
echo -e "\n`date` Operating in directory."
pwd
realpath .
echo -e "\n`date` making files in ${d_targ}"
mkdir -p "${d_targ}"
for i in $( echo "${l_files}" ); do
f_out="${d_targ}/${i}"
echo "touch ${f_out}"
touch "${f_out}"
done
echo -e "\n`date` listing files before triggering error"
ls -aslF ${d_targ}/.*
ls -aslF ${d_targ}/*
echo -e "\n`date` triggering error"
echo "mv -v ${d_targ}/tEFO10601_abcdefghiklmnopqrstuvw ${d_targ}//tEFO10601_abcdefghiklmnopqrstuvwx"
mv -v "${d_targ}/tEFO10601_abcdefghiklmnopqrstuvw" "${d_targ}//tEFO10601_abcdefghiklmnopqrstuvwx"
echo -e "\n`date` demonstrating error"
set -x
ls -asl "${d_targ}"
set +x
echo -e "\n`date` Script complete - exiting."
Output
$ ./foo.sh
Sat Sep 28 06:13:25 UTC 2024 Demonstrating NSF4 Error.
Sat Sep 28 06:13:25 UTC 2024 Operating in directory.
/home/proto/wrk/00_src/experiments/005_pgs_firstset_revised/traits/EFO_0001060/v4_child_worked/work/4f
/mnt/efs/fs1/base/inf/experiments/005_pgs_firstset_revised/traits/EFO_0001060/v4_child_worked/work/4f
Sat Sep 28 06:13:25 UTC 2024 making files in y
touch y/.command.begin
touch y/.command.err
touch y/.command.log
touch y/.command.out
touch y/.command.run
touch y/.command.sh
touch y/.command.trace
touch y/.exitcode
touch y/GRCh38_tEFO1060_ALL_vcf.pgen
touch y/GRCh38_tEFO1060_ALL_vcf.psam
touch y/GRCh38_tEFO1060_ALL_vcf.pvar.zst
touch y/NO_FILE
touch y/tEFO10601_abcdefghiklmnopqrstuvw
touch y/tEFO1060_ALL_additive_1.log
touch y/tEFO1060_ALL_additive_1.sscore.vars
touch y/tEFO1060_ALL_additive_1.sscore.zst
touch y/versions.yml
Sat Sep 28 06:13:25 UTC 2024 listing files before triggering error
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.begin
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.err
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.log
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.out
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.run
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.sh
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.command.trace
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/.exitcode
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.pgen
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.psam
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/GRCh38_tEFO1060_ALL_vcf.pvar.zst
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/NO_FILE
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO10601_abcdefghiklmnopqrstuvw
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.log
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.sscore.vars
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/tEFO1060_ALL_additive_1.sscore.zst
0 -rw-r--r--+ 1 bgulko prototypers 0 Sep 28 06:13 y/versions.yml
Sat Sep 28 06:13:25 UTC 2024 triggering error
mv -v y/tEFO10601_abcdefghiklmnopqrstuvw y//tEFO10601_abcdefghiklmnopqrstuvwx
renamed 'y/tEFO10601_abcdefghiklmnopqrstuvw' -> 'y//tEFO10601_abcdefghiklmnopqrstuvwx'
Sat Sep 28 06:13:25 UTC 2024 demonstrating error
+ ls -asl y
ls: reading directory 'y': Remote I/O error
total 0
+ set +x
Sat Sep 28 06:13:25 UTC 2024 Script complete - exiting.
/etc/fstab
XX.XX.XX.XX:/progenic_XXX /mnt/efs/fs1 nfs4 rw,async 0 0
/etc/mtab
XX.XX.XX.XX:/progenic_XXX /mnt/efs/fs1 nfs4 rw,relatime,vers=4.1,rsize=360448,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.128.0.6,local_lock=none,addr=XX.XX.XX.XX 0 0
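When comparing mounts across machines, the negotiated options in the mtab entry (vers, rsize, wsize, namlen, and so on) are the part worth recording. A small sketch for pulling a single option out of such a line follows; mount_opt is a hypothetical helper name, and it assumes the usual layout in which the fourth whitespace-separated field holds the comma-separated options.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one mount option from a line in
# /etc/mtab or /proc/mounts format. Flag-style options (e.g. "hard") print
# their own name; nothing is printed when the option is absent.
mount_opt() {
    # $1: one mount line, $2: option name
    printf '%s\n' "$1" | awk -v key="$2" '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == key) { print key; exit }
            if (index(opts[i], key "=") == 1) {
                print substr(opts[i], length(key) + 2); exit
            }
        }
    }'
}
```

For example, mount_opt "$(grep ' /mnt/efs/fs1 ' /proc/mounts)" namlen would print the maximum name length the client negotiated for that mount.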
This has now been verified by my support agent and submitted via a paid support program ticket as a P2 issue to the FileStore team. It can be referenced as:
Case 53763055: Adding file or changing name in FileStore directory renders entire directory inaccessable and undeletable.
I'll follow up further if there is a resolution (or, of course, if I am mistaken!).