How to remove punctuation from a database in MarkLogic?

I want to remove punctuation from a database of XML documents in MarkLogic, as a preprocessing step for machine learning. I'm new to MarkLogic and I don't know how to do that. Is there an XQuery query that could remove punctuation?

Answers


To do a mass replacement of all text in the database, and take out punctuation, you could start with something that looks like this code (modified for your needs):

(: Visit every text node in every document and strip . , ; from it :)
for $doc in cts:search(fn:collection(), ())
    for $text in $doc//text()
        return xdmp:node-replace($text, text{fn:replace($text, "[\.,;]", "")})

To be honest, that task is much less expensive to do on the source text files themselves - or in MarkLogic by treating the XML as a string during the replacement process. Updating nodes one at a time will be expensive.

Outside of MarkLogic: use sed, awk, or a similar tool before ingestion.

Inside of MarkLogic (as a trigger, perhaps): use xdmp:quote to turn the XML into a string, do the replacement in a single call to fn:replace, and then turn the result back into XML with xdmp:unquote:

let $new-doc := xdmp:unquote(fn:replace(xdmp:quote($doc), "[\.,;]", ""))

Then either store it by replacing the root node with xdmp:node-replace, or store this version as a property. This all depends on whether the original (punctuated) version matters to you. Or perhaps you just want to keep the original and serve this cleansed version back to someone.

In all cases above, you have to make sure that your replacement does not murder your XML. For example, the quoted string contains markup as well as text, so stripping ";" turns an entity reference such as &amp; into invalid XML. Also, be aware of the options for the functions above (like how CDATA is handled).
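
One way to be sure the markup survives is to skip the string round trip and rebuild each document in memory, touching only text nodes. Note that this is a different technique from the xdmp:quote / xdmp:unquote approach above: it is a recursive typeswitch copy. A rough sketch, where local:strip is a made-up helper name and the pattern is the same one used above. It is untested against your data, leaves attribute values alone, and runs as a single transaction, so a large database would likely need batching:

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: copy a node, removing . , ; from text nodes only.
   Element names, attributes, comments, etc. are copied unchanged. :)
declare function local:strip($node as node()) as node()?
{
  typeswitch ($node)
    case element() return
      element { fn:node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:strip($child)
      }
    case text() return
      text { fn:replace($node, "[\.,;]", "") }
    default return $node
};

(: One update per document: replace each root element with its cleaned copy. :)
for $root in fn:doc()/*
return xdmp:node-replace($root, local:strip($root))
```

Because the root element is replaced in place, the document's URI stays the same, and each document gets one update instead of one per text node.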


Lastly, "This is for machine learning purposes". You do not elaborate. I think many of us here have a feeling that this solution (cleansing punctuation before insert) rubs against the very grain of MarkLogic, in which you store documents as-is and then rely on its indexing, tokenizing, stemming, collation, and search support to find and return your data as you need it. If you were to elaborate on your use case a bit, you may inspire others to give more MarkLogic-specific suggestions.


It will work if you use the 'punctuation-insensitive' option and, if required, the 'diacritic-insensitive' option in cts:element-word-query().
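
As a rough sketch of what that could look like, both options can be passed together as a sequence. The element name title and the search term are placeholders, not something taken from your data:

```xquery
xquery version "1.0-ml";

(: Search all documents for the term in <title>, ignoring
   punctuation and diacritics when matching. :)
cts:search(fn:doc(),
  cts:element-word-query(
    xs:QName("title"),
    "Moby-Dick",
    ("punctuation-insensitive", "diacritic-insensitive")))
```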


I'm not sure if this is what you're asking, but it's technically possible to update every document in the database to remove punctuation; however, it's very expensive and I wouldn't recommend it.

Using built-in search functions, you can probably achieve the same goal without updating your documents by querying with punctuation insensitivity. For example, if you want to select documents with a title matching a string regardless of punctuation:

cts:search(//mydoc,
  cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))

Or in an existing XQuery:

for $d in $documents
where cts:contains($d, 
  cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
return $d/summary
