Getting color information from iTextSharp’s TextRenderInfo and ITextExtractionStrategy

To get color information when using an ITextExtractionStrategy in iTextSharp (5.1.1.0), you need to make the following changes to the main iTextSharp source code. Once you have made these changes, you can follow my SO post here for getting font information as well.

iTextSharp.text.pdf.parser.GraphicsState.cs

//New Fields:
internal BaseColor colorStroke;
internal BaseColor colorNonStroke;

//New methods:
public BaseColor GetColorStroke() {
    return colorStroke;
}
public BaseColor GetColorNonStroke() {
    return colorNonStroke;
}

//changed constructors:
public GraphicsState(){
    ctm = new Matrix();
    characterSpacing = 0;
    wordSpacing = 0;
    horizontalScaling = 1.0f;
    leading = 0;
    font = null;
    fontSize = 0;
    renderMode = 0;
    rise = 0;
    knockout = true;
    colorStroke = null;
    colorNonStroke = null;
}

/**
* Copy constructor.
* @param source    another GraphicsState object
*/
public GraphicsState(GraphicsState source){
    // note: all of the following are immutable, with the possible exception of font
    // so it is safe to copy them as-is
    ctm = source.ctm;
    characterSpacing = source.characterSpacing;
    wordSpacing = source.wordSpacing;
    horizontalScaling = source.horizontalScaling;
    leading = source.leading;
    font = source.font;
    fontSize = source.fontSize;
    renderMode = source.renderMode;
    rise = source.rise;
    knockout = source.knockout;
    colorStroke = source.colorStroke;
    colorNonStroke = source.colorNonStroke;
}

iTextSharp.text.pdf.parser.PdfContentStreamProcessor.cs

//append to end of method PopulateOperators()
    RegisterContentOperator("G", new SetStrokingGray());
    RegisterContentOperator("g", new SetNonStrokingGray());
    RegisterContentOperator("RG", new SetStrokingRGB());
    RegisterContentOperator("rg", new SetNonStrokingRGB());
    RegisterContentOperator("K", new SetStrokingCMYK());
    RegisterContentOperator("k", new SetNonStrokingCMYK());
    RegisterContentOperator("CS", new SetStrokingGeneral());
    RegisterContentOperator("cs", new SetNonStrokingGeneral());
    RegisterContentOperator("SC", new SetStrokingGeneral());
    RegisterContentOperator("sc", new SetNonStrokingGeneral());
    RegisterContentOperator("SCN", new SetStrokingGeneral());
    RegisterContentOperator("scn", new SetNonStrokingGeneral());

//add new classes (nested inside PdfContentStreamProcessor, so they can access gsStack):
public abstract class SetColorBase : IContentOperator {
    public enum ColorStyle { Stroke = 1, NonStroke = 2 };
    public enum ColorSpace { RGB = 1, CMYK = 2, Gray = 3, Other = 4 };
    public abstract BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands);
    private ColorStyle style;
    private ColorSpace space;
    public SetColorBase(ColorStyle colorStyle, ColorSpace colorSpace) {
        this.style = colorStyle;
        this.space = colorSpace;
    }
    public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands) {
        BaseColor c = GetColor(oper, operands);
        GraphicsState gs = processor.gsStack.Peek();
        if (this.style == ColorStyle.Stroke) {
            gs.colorStroke = c;
        }
        else if (this.style == ColorStyle.NonStroke) {
            gs.colorNonStroke = c;
        }
    }
}
private class SetStrokingGray : SetColorBase {
    public SetStrokingGray() : base(ColorStyle.Stroke, ColorSpace.Gray) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber g = (PdfNumber)operands[0];
        return new GrayColor(g.FloatValue);
    }
}
private class SetNonStrokingGray : SetColorBase {
    public SetNonStrokingGray() : base(ColorStyle.NonStroke, ColorSpace.Gray) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber g = (PdfNumber)operands[0];
        return new GrayColor(g.FloatValue);
    }
}
private class SetStrokingRGB : SetColorBase {
    public SetStrokingRGB() : base(ColorStyle.Stroke, ColorSpace.RGB) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber r = (PdfNumber)operands[0];
        PdfNumber g = (PdfNumber)operands[1];
        PdfNumber b = (PdfNumber)operands[2];
        return new BaseColor(r.FloatValue, g.FloatValue, b.FloatValue);
    }
}
private class SetNonStrokingRGB : SetColorBase {
    public SetNonStrokingRGB() : base(ColorStyle.NonStroke, ColorSpace.RGB) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber r = (PdfNumber)operands[0];
        PdfNumber g = (PdfNumber)operands[1];
        PdfNumber b = (PdfNumber)operands[2];
        return new BaseColor(r.FloatValue, g.FloatValue, b.FloatValue);
    }
}
private class SetStrokingCMYK : SetColorBase {
    public SetStrokingCMYK() : base(ColorStyle.Stroke, ColorSpace.CMYK) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber c = (PdfNumber)operands[0];
        PdfNumber m = (PdfNumber)operands[1];
        PdfNumber y = (PdfNumber)operands[2];
        PdfNumber k = (PdfNumber)operands[3];
        return new CMYKColor(c.FloatValue, m.FloatValue, y.FloatValue, k.FloatValue);
    }
}
private class SetNonStrokingCMYK : SetColorBase {
    public SetNonStrokingCMYK() : base(ColorStyle.NonStroke, ColorSpace.CMYK) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        PdfNumber c = (PdfNumber)operands[0];
        PdfNumber m = (PdfNumber)operands[1];
        PdfNumber y = (PdfNumber)operands[2];
        PdfNumber k = (PdfNumber)operands[3];
        return new CMYKColor(c.FloatValue, m.FloatValue, y.FloatValue, k.FloatValue);
    }
}
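private class SetNonStrokingGeneral : SetColorBase {
    public SetNonStrokingGeneral() : base(ColorStyle.NonStroke, ColorSpace.Other) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        // The operands list ends with the operator itself, so Count == 2 means one real operand.
        // One numeric operand whose integer value is 0: fall back to BaseColor(0).
        if (operands.Count == 2 && operands[0].IsNumber() && ((PdfNumber)operands[0]).IntValue == 0) {
            return new BaseColor(0);
        }
        // One name operand (e.g. the color space name passed to cs): no components to read.
        if (operands.Count == 2 && operands[0].IsName()) {
            return new BaseColor(0);
        }
        // Three numeric operands: assume they are RGB components.
        if (operands.Count == 4) {
            PdfNumber r = (PdfNumber)operands[0];
            PdfNumber g = (PdfNumber)operands[1];
            PdfNumber b = (PdfNumber)operands[2];
            return new BaseColor(r.FloatValue, g.FloatValue, b.FloatValue);
        }
        // Anything else (e.g. four CMYK components) is not handled yet.
        return null;
    }
}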
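private class SetStrokingGeneral : SetColorBase {
    public SetStrokingGeneral() : base(ColorStyle.Stroke, ColorSpace.Other) { }
    public override BaseColor GetColor(PdfLiteral oper, List<PdfObject> operands) {
        // The operands list ends with the operator itself, so Count == 2 means one real operand.
        // One numeric operand whose integer value is 0: fall back to BaseColor(0).
        if (operands.Count == 2 && operands[0].IsNumber() && ((PdfNumber)operands[0]).IntValue == 0) {
            return new BaseColor(0);
        }
        // One name operand (e.g. the color space name passed to CS): no components to read.
        if (operands.Count == 2 && operands[0].IsName()) {
            return new BaseColor(0);
        }
        // Three numeric operands: assume they are RGB components.
        if (operands.Count == 4) {
            PdfNumber r = (PdfNumber)operands[0];
            PdfNumber g = (PdfNumber)operands[1];
            PdfNumber b = (PdfNumber)operands[2];
            return new BaseColor(r.FloatValue, g.FloatValue, b.FloatValue);
        }
        // Anything else (e.g. four CMYK components) is not handled yet.
        return null;
    }
}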

iTextSharp.text.pdf.parser.TextRenderInfo.cs

//new methods
public BaseColor GetColorStroke() {
    return gs.GetColorStroke();
}
public BaseColor GetColorNonStroke() {
    return gs.GetColorNonStroke();
}
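
With the three files patched as above, a strategy can read the color of each text chunk as it is rendered. The following is a minimal sketch rather than a definitive implementation: the class name ColorTextExtractionStrategy and the file name input.pdf are made up for illustration, and it assumes you have rebuilt iTextSharp 5.1.1.0 with the changes from this post so that TextRenderInfo.GetColorNonStroke() exists. It reads the non-stroking (fill) color, which is the one used for text in the default render mode.

```csharp
using System;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: emits one line per text chunk, prefixed with its fill color.
public class ColorTextExtractionStrategy : ITextExtractionStrategy {
    private readonly StringBuilder result = new StringBuilder();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo) {
        // GetColorNonStroke() is the method added to TextRenderInfo in this post.
        // It stays null until a fill color operator has been seen in the stream.
        BaseColor fill = renderInfo.GetColorNonStroke();
        string color = fill == null
            ? "unset"
            : String.Format("#{0:X2}{1:X2}{2:X2}", fill.R, fill.G, fill.B);
        result.AppendLine(color + "\t" + renderInfo.GetText());
    }

    public string GetResultantText() {
        return result.ToString();
    }
}

public static class Program {
    public static void Main() {
        // "input.pdf" is a placeholder path.
        PdfReader reader = new PdfReader("input.pdf");
        try {
            string text = PdfTextExtractor.GetTextFromPage(
                reader, 1, new ColorTextExtractionStrategy());
            Console.WriteLine(text);
        }
        finally {
            reader.Close();
        }
    }
}
```

Swap in GetColorStroke() if you care about outlined text instead of filled text.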

This code is very experimental, but so far it works pretty well. What ends up in the content stream depends on which tool generated the PDF. Word’s built-in PDF generator seems to take the easier route and just emits simple RGB values. Adobe’s PDF plug-in appears to do the same thing in a more complicated way, creating “named” color spaces (I think), but I’m not completely sure how to use them yet.

9 thoughts on “Getting color information from iTextSharp’s TextRenderInfo and ITextExtractionStrategy”

  1. Hi, could you please advise me, or at least point me in the right direction? I wonder whether, using ITextExtractionStrategy, I will be able to parse a PDF document containing a table and also retrieve each cell’s text and its background color. Is that a good idea, or is it only possible in a tagged PDF? Where should I look for information about cell colors/backgrounds?
    Thank you,

    Best regards,
    Mark

  2. Sorry Mark, but PDFs don’t really have anything called a “table”, just something that looks like one. A table in a PDF is actually just a bunch of lines or boxes with their contents filled in with a color, then regular text is drawn on top of that. With that definition you can understand that text can actually be placed on top of anything, from a box with a solid color (a table cell) to an image with a million colors in it. When you’re parsing text, there is no way for the text extractor to reliably determine the color behind the text, because there could be millions.

  3. Oh, I see. But how can I retrieve this solid-color box, with its color and position, using some other algorithm outside of text extraction? Which methods should I try? Then I could merge the two sets of information.

    Thanks one more time,

    Mark

  4. Hi, I am developing a project in Java with iText, so I wonder how I can change the iTextSharp.text.pdf.parser.GraphicsState file. Where can I find it?

  5. Hi Chris,

    Your code helped me take a few steps in the right direction, but now I’m stuck again. After it detects an SCN operator and goes into the SetStrokingGeneral class, you’re apparently not handling the case where the operands may be color space names or numbers. For example, I get 0 or 1 as the first operand, which I assume is the index of the color space it is using, but you’re simply skipping this case in your code.

    I tried to go somewhat further and found that I can extract color space objects from PdfReader, but I end up getting a PdfArray with two objects, where the first is a literal named “ICCBased” and the second is a PRStream. Do you have any idea how to handle this case?

    • Hi Shujaat. As you saw, SCN and scn themselves are catchalls for everything else that’s not RGB, CMYK or Gray. Before hitting one of those two you should actually first find a CS (or cs) operator whose first and only operand is the actual color space to use. There’s a bunch of options for this including DeviceRGB, DeviceCMYK, Pattern, Lab, DeviceN, etc. You can find these in table 74 of the 2008 PDF spec, section 8.6.8 (page 171). My code is actually not completely correct: I shouldn’t be routing CS and cs to the SetStrokingGeneral and SetNonStrokingGeneral classes but should instead do some further processing. Unfortunately none of the sample PDFs that I had at the time had this set, so I couldn’t test for it. Hopefully this helps you out!
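
The idea in the reply above (remember the color space chosen by CS/cs, then interpret the SC/SCN operands against it) could be sketched roughly as follows. This is an untested illustration, not iTextSharp code: ColorSpaceTracker, SetColorSpace and Resolve are hypothetical names, the caller is assumed to have already resolved a resource name such as /Cs1 to its color space family, and ICCBased spaces are only approximated by their component count with the embedded profile ignored.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of iTextSharp): remembers the color space family
// selected by the last CS/cs operator and uses it to interpret SC/sc/SCN/scn operands.
public class ColorSpaceTracker {
    // DeviceGray is the initial color space of a PDF graphics state.
    private string family = "DeviceGray";

    // Call for CS/cs. The caller is assumed to have resolved resource names such as
    // /Cs1 to a family name (e.g. "ICCBased") through the page resources.
    public void SetColorSpace(string familyName) {
        family = familyName;
    }

    // Call for SC/sc/SCN/scn with the numeric operands only (operator removed).
    // Returns approximate {r, g, b} values in 0..1, or null if not interpretable.
    public float[] Resolve(IList<float> c) {
        int n = c.Count;
        // ICCBased is approximated by its component count (1, 3 or 4).
        bool icc = family == "ICCBased";
        if (n == 1 && (family == "DeviceGray" || icc)) {
            return new float[] { c[0], c[0], c[0] };
        }
        if (n == 3 && (family == "DeviceRGB" || icc)) {
            return new float[] { c[0], c[1], c[2] };
        }
        if (n == 4 && (family == "DeviceCMYK" || icc)) {
            // Naive CMYK to RGB conversion without any color management.
            float k = 1f - c[3];
            return new float[] { (1f - c[0]) * k, (1f - c[1]) * k, (1f - c[2]) * k };
        }
        // Pattern, Lab, DeviceN, Indexed, Separation and so on are not handled here.
        return null;
    }
}
```

For a sequence like `/Cs1 cs 1 0 0 scn`, where Cs1 resolves to a three-component ICCBased space, you would call SetColorSpace("ICCBased") and then Resolve(new List<float> { 1f, 0f, 0f }), which returns {1, 0, 0}. A real implementation would keep one tracker for stroking and one for non-stroking colors inside GraphicsState.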
