Re: Running scrapy+flask on Cloud Run

Cristiano-GetIt · 05-18-2022 07:29 AM

I set up a Flask endpoint method that runs a spider, following the official documentation and some other examples I've found online. The endpoint basically waits while the spider is run in a new process and returns when this process is finished. When I run this locally, everything works fine, but when I call the endpoint in Cloud Run, it's not waiting for the process to finish before returning the response.

Is it possible to achieve this in Cloud Run? Does anyone have any suggestions?

ErnestoC

Can you share the relevant Cloud Run logs produced when your application returns before the spider completes? Can you also share a minimal example that reproduces the issue in Cloud Run? That would help in finding the cause of the problem.

Cristiano-GetIt

The Cloud Run logs don't show any errors or anything that could indicate a problem. This is the code I'm calling from the Flask endpoint:

run_spider is called by the Flask endpoint. It should wait for the spider to finish before returning the result, but it doesn't. In other words, it should block on the result = queue.get(...) call for a few seconds before returning. Instead, the request returns in less than 1 second, which means the spider process didn't run.

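import asyncio
from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider():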
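    def script(queue):
        # Runs in the child process: crawl, then report the outcome to the parent.
        try:
            settings = get_project_settings()

            process = CrawlerProcess(settings)
            process.crawl("SpiderName")
            process.start()  # blocks until the crawl has finished
            queue.put(None)
        except Exception as e:
            # Send the failure to the parent instead of losing it in the child.
            queue.put(e)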

    get_or_create_eventloop()

    queue = Queue()

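    try:
        main_process = Process(target=script, args=(queue,))
        main_process.start()

        # Read the result before join(): joining a process that still has
        # unflushed items in a multiprocessing queue can deadlock.
        result = queue.get(block=True, timeout=None)
        main_process.join()

        if result is not None:
            raise result

        return result
    except Exception:
        # Nothing reads the queue after a failure, so just re-raise.
        raise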


def get_or_create_eventloop():
    # Return the current event loop, creating one if this thread has none.
    try:
        return asyncio.get_event_loop()
    except RuntimeError as ex:
        if "There is no current event loop in thread" in str(ex):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            return loop
        raise  # an unrelated RuntimeError should not be swallowed


if __name__ == '__main__':
    app.run()

ErnestoC

Which execution environment is your service running in? Have you tried the second generation execution environment? Something else that can affect asynchronous execution is whether CPU is always allocated to your service. If it is not, CPU is only allocated while a request is being processed. Have you tried setting CPU to be always allocated for your instance? If the logs do not show any relevant error, you could try the configuration changes above.
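
Since the logs show nothing, it may also help to make the child process report how it ended. The following is only a minimal sketch, not a confirmed fix: run_in_child, the logger name and the timeout values are hypothetical names picked for illustration, and it assumes the "fork" start method is available (Linux, not Windows). It logs the child's exit code, what the child reported, and the elapsed time, so a child that exits early or never reports back becomes visible in the logs.

```python
import logging
import multiprocessing as mp
import queue as queue_lib
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("spider-runner")  # hypothetical logger name

# Assumption: the "fork" start method exists on this platform (Linux/macOS).
ctx = mp.get_context("fork")


def run_in_child(target, timeout=60.0):
    """Run target() in a child process and report how the child ended.

    Returns (outcome, exitcode, elapsed_seconds). outcome is None on success,
    the exception raised in the child, or "no-result" if the child did not
    report anything within the timeout.
    """
    results = ctx.Queue()

    def wrapper(q):
        # Runs in the child: forward success (None) or the exception.
        try:
            target()
            q.put(None)
        except Exception as exc:
            q.put(exc)

    started = time.monotonic()
    child = ctx.Process(target=wrapper, args=(results,))
    child.start()
    try:
        # Read before join() so a queued item can never block the child's exit.
        outcome = results.get(timeout=timeout)
    except queue_lib.Empty:
        outcome = "no-result"
    child.join(timeout=5)
    if child.is_alive():
        # The child is stuck: stop it so it does not linger.
        child.terminate()
        child.join()
    elapsed = time.monotonic() - started
    log.info("child exitcode=%s outcome=%r elapsed=%.2fs",
             child.exitcode, outcome, elapsed)
    return outcome, child.exitcode, elapsed
```

In the endpoint, the crawl could be wrapped in a zero-argument function and passed to run_in_child. A log line with a very short elapsed time, a non-zero exit code or a "no-result" outcome would then indicate where the child process stopped, which the current code cannot show.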