Convert Word doc and docx format to PDF in .NET Core without Microsoft.Office.Interop

I need to display Word .doc and .docx files in a browser. There's no real client-side way to do this and these documents can't be shared with Google docs or Microsoft Office 365 for legal reasons.

Browsers can't display Word, but can display PDF, so I want to convert these docs to PDF on the server and then display that.

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.

There appear to be lots of similar questions to this, however most are asking about full- framework .NET or assuming that the server is a Windows OS and any answer is no use to me.

How do I convert .doc and .docx files to .pdf without access to Microsoft.Office.Interop.Word?

Answers


This was such a PITA, no wonder all the 3rd party solutions are charging $500 per developer.

Good news is the Open XML SDK recently added support for .Net Standard so it looks like you're in luck with the .docx format.

Bad news at the moment there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one and you cant legally use a 3rd party service we have little choice except to roll our own.

The main problem is getting the Word Document Content transformed to PDF. One of the popular ways is reading the Docx into HTML and exporting that to PDF. It was hard to find, but there is .Net Core version of the OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The Pull Request is "about to be accepted", you can get it from here:

https://github.com/OfficeDev/Open-Xml-PowerTools/tree/abfbaac510d0d60e2f492503c60ef897247716cf

Now that we can extract document content to HTML we need to convert it to PDF. There are a few libraries to convert HTML to PDF, for example DinkToPdf is a cross-platform wrapper around the Webkit HTML to PDF library libwkhtmltox.

I thought DinkToPdf was better than https://code.msdn.microsoft.com/How-to-export-HTML-to-PDF-c5afd0ce


Docx to HTML

Lets put this altogether, download the OpenXMLSDK-PowerTools .Net Core project and build it (just the OpenXMLPowerTools.Core and the OpenXMLPowerTools.Core.Example - ignore the other project). Set the OpenXMLPowerTools.Core.Example as StartUp project. Run the console project:

static void Main(string[] args)
{
    var source = Package.Open(@"test.docx");
    var document = WordprocessingDocument.Open(source);
    HtmlConverterSettings settings = new HtmlConverterSettings();
    XElement html = HtmlConverter.ConvertToHtml(document, settings);

    Console.WriteLine(html.ToString());
    var writer = File.CreateText("test.html");
    writer.WriteLine(html.ToString());
    writer.Dispose();
    Console.ReadLine();

Make sure the test.docx is a valid word document with some text otherwise you might get an error:

the specified package is invalid. the main part is missing

If you run the project you will see the HTML looks almost exactly like the content in the Word document:

However if you try a Word Document with pictures or links you will notice they're missing or broken.

This CodeProject article addresses these issues: https://www.codeproject.com/Articles/1162184/Csharp-Docx-to-HTML-to-Docx

I had to change the static Uri FixUri(string brokenUri) method to return a Uri and I added user friendly error messages.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    string fullFilePath = fileInfo.FullName;
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (FileStream fs = new FileStream(fullFilePath,FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo);
        }
    }

    var writer = File.CreateText("test1.html");
    writer.WriteLine(htmlText.ToString());
    writer.Dispose();
}

public static Uri FixUri(string brokenUri)
{
    string newURI = string.Empty;
    if (brokenUri.Contains("mailto:"))
    {
        int mailToCount = "mailto:".Length;
        brokenUri = brokenUri.Remove(0, mailToCount);
        newURI = brokenUri;
    }
    else
    {
        newURI = " ";
    }
    return new Uri(newURI);
}

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                                        WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch
    {
        return "The file is either open, please close it or contains corrupt data";
    }
}

You may need System.Drawing.Common NuGet package to use ImageFormat

Now we can get images:

If you only want to show Word .docx files in a web browser its better not to convert the HTML to PDF as that will significantly increase bandwidth. You could store the HTML in a file system, cloud, or in a dB using a VPP Technology.


HTML to PDF

Next thing we need to do is pass the HTML to DinkToPdf. Download the DinkToPdf (90 MB) solution. Build the solution - it will take a while for all the packages to be restored and for the solution to Compile.

IMPORTANT:

The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll file in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.

These dlls are in the v0.12.4 folder. Depending on your PC, 32 or 64 bit, copy the 3 files to the DinkToPdf-master\DinkToPfd.TestConsoleApp\bin\Debug\netcoreapp1.1 folder.

IMPORTANT 2:

Make sure that you have libgdiplus installed in your Docker image or on your Linux machine. The libwkhtmltox.so library depends on it.

Set the DinkToPfd.TestConsoleApp as StartUp project and change the Program.cs file to read the htmlContent from the HTML file saved with Open-Xml-PowerTools instead of the Lorium Ipsom text.

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\TFS\Sandbox\Open-Xml-PowerTools-abfbaac510d0d60e2f492503c60ef897247716cf\ToolsTest\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};

The result of the Docx vs the PDF is quite impressive and I doubt many people would pick out many differences (especially if they never see the original):

Ps. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to docx using a specific non-server Windows/Microsoft technology. The doc format is binary and is not intended for server side automation of office.


Using the LibreOffice binary

The LibreOffice project is a Open Source cross-platform alternative for MS Office. We can use its capabilities to export doc and docx files to PDF. Currently, LibreOffice has no official API for .NET, therefore, we will talk directly to the soffice binary.

It is a kind of a "hacky" solution, but I think it is the solution with less amount of bugs and maintaining costs possible. Another advantage of this method is that you are not restricted to converting from doc and docx: you can convert it from every format LibreOffice support (e.g. odt, html, spreadsheet, and more).

The implementation

I wrote a simple c# program that uses the soffice binary. This is just a proof-of-concept (and my first program in c#). It supports Windows out of the box and Linux only if the LibreOffice package has been installed.

This is main.cs:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Reflection;

namespace DocToPdf
{
    public class LibreOfficeFailedException : Exception
    {
        public LibreOfficeFailedException(int exitCode)
            : base(string.Format("LibreOffice has failed with {}", exitCode))
            {}
    }

    class Program
    {
        static string getLibreOfficePath() {
            switch (Environment.OSVersion.Platform) {
                case PlatformID.Unix:
                    return "/usr/bin/soffice";
                case PlatformID.Win32NT:
                    string binaryDirectory = System.IO.Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
                    return binaryDirectory + "\\Windows\\program\\soffice.exe";
                default:
                    throw new PlatformNotSupportedException ("Your OS is not supported");
            }
        }

        static void Main(string[] args) {
            string libreOfficePath = getLibreOfficePath();

            // FIXME: file name escaping: I have not idea how to do it in .NET.
            ProcessStartInfo procStartInfo = new ProcessStartInfo(libreOfficePath, string.Format("--convert-to pdf --nologo {0}", args[0]));
            procStartInfo.RedirectStandardOutput = true;
            procStartInfo.UseShellExecute = false;
            procStartInfo.CreateNoWindow = true;
            procStartInfo.WorkingDirectory = Environment.CurrentDirectory;

            Process process = new Process() { StartInfo =      procStartInfo, };
            process.Start();
            process.WaitForExit();

            // Check for failed exit code.
            if (process.ExitCode != 0) {
                throw new LibreOfficeFailedException(process.ExitCode);
            }
        }
    }
}
Resources
Results

I had tested it on Arch Linux, compiled with mono. I run it using mon and the Linux binary, and with wine: using the Windows binary.

You can find the results in the Tests directory:

Input files: testdoc.doc, testdocx.docx

Outputs:

I've recently done this with FreeSpire.Doc. It has a limit of 3 pages for the free version but it can easily convert a docx file into PDF using something like this

    private void ConvertToPdf()
    {
        try
        {
            for (int i = 0; i < listOfDocx.Count; i++)
            {
                CurrentModalText = "Converting To PDF";
                CurrentLoadingNum += 1;

                string savePath = PdfTempStorage + i + ".pdf";
                listOfPDF.Add(savePath);

                Spire.Doc.Document document = new Spire.Doc.Document(listOfDocx[i], FileFormat.Auto);
                document.SaveToFile(savePath, FileFormat.PDF);
            }
        }
        catch (Exception e)
        {
            throw e;
        }
    }

I then sew these individual pdfs together later using itextsharp.pdf

 public static byte[] concatAndAddContent(List<byte[]> pdfByteContent, List<MailComm> localList)
    {

        using (var ms = new MemoryStream())
        {
            using (var doc = new Document())
            {
                using (var copy = new PdfSmartCopy(doc, ms))
                {
                    doc.Open();
                    //add checklist at the start
                    using (var db = new StudyContext())
                    {
                        var contentId = localList[0].ContentID;
                        var temp = db.MailContentTypes.Where(x => x.ContentId == contentId).ToList();
                        if (!temp[0].Code.Equals("LAB"))
                        {
                            pdfByteContent.Insert(0, CheckListCreation.createCheckBox(localList));
                        }
                    }

                    //Loop through each byte array
                    foreach (var p in pdfByteContent)
                    {

                        //Create a PdfReader bound to that byte array
                        using (var reader = new PdfReader(p))
                        {

                            //Add the entire document instead of page-by-page
                            copy.AddDocument(reader);
                        }
                    }

                    doc.Close();
                }
            }

            //Return just before disposing
            return ms.ToArray();
        }
    }

I dont know if this suits your use case as you havent specified the size of the documents you're trying to write but if they're >3 pages or you can manipulate them to be less than 3 pages it will allow you to convert them into pdfs

As mentioned in the comments below it is also unable to help with RTL languages, thank you @Aria for pointing that out.


Sorry I don't have enough reputation to comment but would like to put my two cents on Jeremy Thompson's answer. And hope this help someone.

When I was going through Jeremy Thompson's answer, after downloading OpenXMLSDK-PowerTools and run OpenXMLPowerTools.Core.Example, I got error like

the specified package is invalid. the main part is missing

at the line

var document = WordprocessingDocument.Open(source);

After struggling for some hours, I found that the test.docx copied to bin file is only 1kb. To solve this, right click test.docx > Properties, set Copy to Output Directory to Copy always solves this problem.

Hope this help some novice like me :)


For converting DOCX to PDF even with placeholders, I have created a free "Report-From-DocX-HTML-To-PDF-Converter" library with .NET CORE under the MIT license, because I was so unnerved that no simple solution existed and all the commercial solutions were super expensive. You can find it here with an extensive description and an example project:

https://github.com/smartinmedia/Net-Core-DocX-HTML-To-PDF-Converter

You only need the free LibreOffice. I recommend using the LibreOffice portable edition, so it does not change anything in your server settings. Have a look, where the file "soffice.exe" (on Linux it is called differently) located, because you need it to fill the variable "locationOfLibreOfficeSoffice".

Here is how it works to convert from DOCX to HTML:

string locationOfLibreOfficeSoffice =   @"C:\PortableApps\LibreOfficePortable\App\libreoffice\program\soffice.exe";

var docxLocation = "MyWordDocument.docx";

var rep = new ReportGenerator(locationOfLibreOfficeSoffice);

//Convert from DOCX to PDF
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.pdf"));


//Convert from DOCX to HTML
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.html"));

As you see, you can also convert from DOCX to HTML. Also, you can put placeholders into the Word document, which you can then "fill" with values. However, this is not in the scope of your question, but you can read about that on Github (README).


Need Your Help

What's the use of session.flush() in Hibernate

java hibernate orm

When we are updating a record, we can use session.flush() with Hibernate. What's the need for flush()?

Fancybox: iframe doesn't work

jquery fancybox

Very simple test.html page: