AI models often struggle with unstructured data such as images, audio, and raw text because of its heterogeneity and lack of standardized formats. Given a dataset containing several of these unstructured data types, how would you design a multi-modal deep learning architecture that extracts meaningful patterns while generalizing across different domains?
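One way to ground an answer is with a minimal sketch of a common pattern: late fusion, where each modality gets its own encoder, encoder outputs are projected into a shared embedding space, and a small head combines them for the downstream task. Everything below is an illustrative assumption rather than a prescribed solution: the choice of PyTorch, the three-modality setup, the input shapes, and the `embed_dim` / `num_classes` values are all made up for the example.

```python
# Minimal late-fusion sketch: per-modality encoders -> shared embedding
# space -> concatenation -> MLP task head. All shapes and hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Small CNN mapping (B, 3, 64, 64) images to a fixed-size embedding."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (B, 64, 1, 1)
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class TextEncoder(nn.Module):
    """Embeds token ids (B, T) and mean-pools into a fixed-size embedding."""
    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, tokens):
        return self.proj(self.embed(tokens).mean(dim=1))


class AudioEncoder(nn.Module):
    """Strided 1-D convolutions over a raw waveform (B, 1, L)."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class MultiModalNet(nn.Module):
    """Late fusion: encode each modality, concatenate, classify."""
    def __init__(self, embed_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.image_enc = ImageEncoder(embed_dim)
        self.text_enc = TextEncoder(embed_dim=embed_dim)
        self.audio_enc = AudioEncoder(embed_dim)
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Dropout(0.1),  # light regularization for cross-domain robustness
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, image, tokens, audio):
        fused = torch.cat(
            [self.image_enc(image), self.text_enc(tokens), self.audio_enc(audio)],
            dim=-1,
        )
        return self.head(fused)


# Smoke test with random inputs.
model = MultiModalNet()
image = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 10_000, (2, 16))
audio = torch.randn(2, 1, 16_000)
print(model(image, tokens, audio).shape)  # torch.Size([2, 10])
```

The design choice worth defending in an answer is the separation of concerns: because each encoder is independent and all of them project into the same embedding dimension, individual encoders can be swapped for stronger pretrained ones (e.g., a ViT or a text transformer) without touching the fusion head, and generalization can be addressed on top of this skeleton with regularization, shared-space alignment objectives, or domain-adversarial training.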