How to split pages of a Word document into separate files in C#

I have an OCR program that converts images to Word documents. The resulting Word document contains the text of all the images, and I want to split it into separate files.

Is there any way to do this in C#?

Thanks.

Answers


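This is the same approach as the other answer that uses the Word object model, but wrapped in an iterator that returns an IEnumerable<Range> and exposed as an extension method on Document.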

using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

static class PagesExtension {
    //Lazily yields one Range per page of the document
    public static IEnumerable<Range> Pages(this Document doc) {
        int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
        //Start position of the page currently being built
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(pageStart);
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex + 1
                ).Start - 1;
            } else {
                //There is no next page, so the last page runs to the end of the document
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
    }
}

The main code ends up like this:

static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}
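The loop above leaves every page in a new, unsaved document. Since the goal is separate files, here is a minimal sketch of one way to save each page, assuming Word 2010 or later (for SaveAs2). The PageSaver class, the SaveClipboardAsPage method, and the page-N.docx naming scheme are hypothetical names chosen for illustration.

```csharp
using System.IO;
using Microsoft.Office.Interop.Word;

static class PageSaver {
    //Hypothetical helper: pastes the clipboard contents into a new document,
    //saves it as <outputDir>\page-<pageNumber>.docx and closes it again.
    public static void SaveClipboardAsPage(Application app, string outputDir, int pageNumber) {
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();

        //Use an absolute path: a relative one is resolved by Word, not by this process
        string path = Path.Combine(outputDir, "page-" + pageNumber + ".docx");
        doc2.SaveAs2(path);

        //Cast to _Document to avoid the ambiguity between the Close method and the Close event
        ((_Document)doc2).Close(WdSaveOptions.wdDoNotSaveChanges);
    }
}

//Possible usage inside Main, replacing the body of the foreach loop:
//    int pageNumber = 1;
//    foreach (var page in doc.Pages()) {
//        page.Copy();
//        PageSaver.SaveClipboardAsPage(app, @"path\to\output\folder", pageNumber++);
//    }
```

Going through the clipboard keeps the formatting of the page, but it also overwrites whatever the user had copied, so this is probably best run as a batch job.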

You can manipulate the Word document from C# using the Word object model, if you have Word installed.

First, add a reference to the Word object model. Right-click on the project, then Add Reference... -> COM -> Microsoft Word 14.0 Object Library (or something similar, depending on your version of Word).

Then, you can use the following code:

using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page -- since there is no next page, the 
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}
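The comments in the code mention that there are ways to reduce screen updating if the loop runs too slowly. One of them is the Application.ScreenUpdating property. The following is a rough sketch of a wrapper around it; the WordBatch class and the WithoutScreenUpdating method are hypothetical names, not part of the Word object model.

```csharp
using System;
using Microsoft.Office.Interop.Word;

static class WordBatch {
    //Hypothetical helper: runs the given action with screen updating switched off,
    //then restores the previous setting even if the action throws.
    public static void WithoutScreenUpdating(Application app, Action action) {
        bool previous = app.ScreenUpdating;
        app.ScreenUpdating = false;
        try {
            action();
        } finally {
            app.ScreenUpdating = previous;
        }
    }
}

//Possible usage: wrap the page loop from Main
//    WordBatch.WithoutScreenUpdating(app, () => {
//        /* for loop that copies and pastes each page */
//    });
```

As the original comments suggest, it is worth getting the split working with a visible window first and only then adding this kind of optimization.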

Reference: Word Object Model Overview on MSDN


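This is not easy to do from the Word document alone, because pages are the result of layout rather than part of the stored document structure. Word does, however, write w:lastRenderedPageBreak markers into the documents it saves.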

Best to have your OCR program insert some marker into the document between each block of converted text.

Then, depending on what sort of Word document it is, process the file with an appropriate tool.
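For a .docx file, the w:lastRenderedPageBreak markers can be inspected without Word installed, because the file is a zip package containing XML. The sketch below only counts the markers; it assumes the main document part is stored at word/document.xml, which is where Word writes it, and the RenderedPageBreaks class name is hypothetical. The markers are layout hints cached by Word when it last rendered the document, so they may be missing or stale if another tool produced the file.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class RenderedPageBreaks {
    //WordprocessingML main namespace (the "w:" prefix in document.xml)
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    //Counts the w:lastRenderedPageBreak elements in an already parsed document part
    public static int Count(XDocument documentXml) {
        return documentXml.Descendants(W + "lastRenderedPageBreak").Count();
    }

    //Loads the main document part from a .docx package.
    //Assumption: the part lives at word/document.xml.
    public static XDocument LoadMainPart(string docxPath) {
        using (var zip = ZipFile.OpenRead(docxPath)) {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in package");
            using (var stream = entry.Open()) {
                return XDocument.Load(stream);
            }
        }
    }
}

//Possible usage:
//    var xml = RenderedPageBreaks.LoadMainPart(@"path\to\file.docx");
//    int markers = RenderedPageBreaks.Count(xml);
```

Actually splitting the XML at those markers is more involved, since a marker sits inside a run within a paragraph, which is one reason a marker inserted by the OCR program between blocks is the simpler route.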

